What is an Institutional Repository for?

I have had a great time today at Colorado State University talking to the Library and Information Scientists. Lots of ideas, especially on the role and construction of Institutional Repositories. I am still revising my views about this and feel that the classic model (if anything 3 years old can be classic) where scholars deposit a finished digital gem may not be the only one.
In preparing my presentation I looked around for repository models and suddenly realised I had been using one for years – Sourceforge. This is an ideal model, as long as you accept the tenet of Open Source (not the same as Open Access, but philosophically aligned). SF is a repository for computer code that provides complete version control and a complete collaborative environment. I make a change to the code – it gets a new version number, but I can still retrieve all previous versions. Any of my collaborators can make changes and I update seamlessly to include all their enhancements. So why not use the same software – SVN – to manage our repositories?
Publishing a scholarly manuscript is a complex workflow. (OK, I’m a coward and usually find a co-author who does it for me). It goes like this:

  • Author A writes a draft and circulates it to B and C (at a different institution)
  • B makes some changes
  • C makes some changes
  • A updates with B’s changes
  • D (oh, yes, we can change the authorship) edits A’s original manuscript
  • there are now at least 3 versions of the manuscript circulating
  • A prints them all out and tries to reconcile changes
  • etc.
  • finally F makes the finished version and sends it to the publisher
  • publisher sends reviewers comments to F who forwards them to B and C
  • C makes changes and resends them to F
  • F sends revised draft to publisher
  • A complains that s/he didn’t see the comments
  • F sends further revised draft to publisher
  • weeks pass
  • and some more
  • X mails A saying why not put m/s in IR
  • Publisher only allows deposition of the author’s pre-submission m/s
  • A mails F saying that publication has appeared
  • mail bounces saying F has moved
  • A tries to recover m/s from B, C and E (yes E was in it).
  • A edits the mess into what might have been sent to publisher and mails to X

BUT using SVN it’s trivial – assuming there is a repository.
So we should speak not of an Institutional Repository, but of an authoring support environment (ASE, or any other meaningless acronym).
A starts a project in institutional SVN.
B joins, so do C, D, E, etc.
They all edit the m/s. Everyone sees the latest version. The version sent to the publisher is annotated as such (this is trivial). All subsequent stuff is tracked automatically.
When the paper is published, the institution simply liberates the authorised version – the authors don’t even need to be involved.
The attractive point of this – over simple deposition – is that the repository supports the whole authoring process.
If you want to start, set up SVN – it’s easy and there are zillions of geeks who know how to do it. It’s free, of course, and also very good. That’s it. It’s easiest if the authoring is done in LaTeX, as then the diffs are obvious, but Word will probably do fine (modern Word is saved as XML). Start with single authors – thesis candidates, humanities, etc.
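For the curious, here is a minimal sketch of the handful of commands involved, driven from Python for concreteness. The repository URL, project name and commit messages are hypothetical; it assumes an svn client is installed and that the institution hosts a Subversion server.

```python
import subprocess

# Hypothetical locations; the institution would host the Subversion server.
BASE = "https://svn.example.ac.uk/papers/ugi-paper"
TRUNK = BASE + "/trunk"
TAG = BASE + "/tags/as-submitted-2007-04"

def svn(*args):
    """Run one svn command, stopping if it fails."""
    subprocess.run(["svn", *args], check=True)

# Author A starts the project; B, C, D... simply check out the same URL.
svn("checkout", TRUNK, "ugi-paper")

# Each co-author edits the manuscript locally, then records the change;
# SVN assigns the new revision number automatically.
svn("commit", "ugi-paper", "-m", "B: rewrote the experimental section")

# Everyone else picks up the latest version before editing further.
svn("update", "ugi-paper")

# The version sent to the publisher is "annotated" by copying it to a tag.
svn("copy", "--parents", TRUNK, TAG, "-m", "manuscript as submitted to the journal")
```

That really is the whole workflow: commit, update, and a copy to mark the submitted version.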
====
I was also honoured to give a videocast interview which CSU will make available (under CC) soon. I have a few personal observations on Open Foo, the role of publishers, of libraries, etc.

Posted in Uncategorized | 9 Comments

Meeting under the Blue Obelisk

Here’s something that makes me feel pleased:
Jean-Claude Bradley posts:

NCI – UsefulChem Link

 

Earlier this week, I was contacted by Daniel Zaharevitz, Chief of the Information Technology Branch of the Developmental Therapeutics Program at the National Cancer Institute. He is also involved with the NIH Roadmap Molecular Libraries Initiative. We had a very interesting talk about Open Science and what kind of further impact it could have in drug development. Let’s just say that we are on the same page on this issue and I’m really impressed with what Dan is trying to achieve.
The first thing we are going to do is start shipping the compounds we make for an automatic screening of 60 cell lines for tumor inhibitory activity. No, these compounds were not designed for this purpose but the screening service Dan offers is free and we might just learn something.
Also, if someone has a model of tumor suppression and would like to make a suggestion (hint, hint), I would be happy to re-prioritize the order in which we make our molecules. The UsefulChem project is designed to be flexible that way.
The Ugi reaction we are using is very simple and amenable to combinatorial approaches from commercially available compounds. Many of these compounds have probably been made and tested by pharmaceutical companies but the results are sitting behind a firewall.
Dan will also be visiting me in Philly next week.
Thanks to Peter Murray-Rust for catalyzing this connection!
=====
For new readers, Jean-Claude is promoting Open Science, where the experiments are reported as they are done. This means the whole world can see them. Dan discovered me through the Open Source community lists – before we had formulated the Blue Obelisk, but fully in its spirit. Dan has an exciting vision: if we pool our information on how chemical compounds interact with biological targets and publish it openly, we could make a huge amount of progress. Many others in the NIH share this vision; they have created the Molecular Libraries Roadmap project and are funding it. The results will all be Open – in PubChem.
Open data of this sort can provide huge amounts of new science.

 

Posted in Uncategorized | 2 Comments

Let's reclaim our own work

From Peter Suber’s blog: Are OA repositories adequate for long-term preservation?

Peter B. Hirtle, Copyright Keeps Open Archives and Digital Preservation Separate, RLG DigiNews, April 15, 2007.  Sadly, this is the last installment of the FAQ column in the final issue of DigiNews.  Excerpt: 

I have read that if I publish with a “green” publisher or use one of the author’s addenda, my articles can be preserved in an open access digital repository. Is this true?
The short answer: probably not….
[…]
Open access archives can be a valuable tool in making information immediately available. With time, the license terms that permit self-archiving may mature to explicitly permit digital preservation of the files as well as third party use of the archived material (the other great lacuna in the current agreements).  For now, however, libraries will need to rely on the published journal literature for the long-term preservation of scholarly information. And, as library directors concluded in our recent report, E-Journal Archiving Metes and Bounds: A Survey of the Landscape, only journals that are part of formal third party journal archiving programs can be said to be effectively preserved. In sum, libraries cannot yet rely upon open archives for long-term access to the journal literature

This is so depressing. Most scholars would like their work to be read (OK, some are only interested in it being cited, but…). Here we are being told: “A scholar writes something, publishes it on their website, in their repository, etc. but they do not own their own work. The publisher has control over the long-term future”. Presumably the publisher could decide to close the back issues – destroy the archive – sell it to Disney – whatever.
I am a chemist/informatician. I am not a specialist in Open Access, copyright law, etc. I want a simple equation.

  • do some scholarly work/research
  • tell the world about it
  • get feedback – praise or criticism or apathy

I want to concentrate on science/technology – quantum mechanics, crystal structures, etc. When I started as a chemist that’s how it worked.
Now we are actively fighting our publishers. They hinder our ability to work, publish and get feedback.
PeterS has blogged Stevan Harnad:
More on green OA without paying for gold OA

 
Stevan Harnad, OA or More-Pay? Open Access Archivangelism, April 18, 2007.  

Summary:  Springer Open Choice offers authors the choice of paying for Optional Gold OA: While all publication costs are still being paid for by institutional subscriptions, authors can pay Springer $3000 extra to make their article (Gold) OA for them.
   But there is no need (nor sense) to pay anyone an extra penny while institutional subscriptions are still paying all publication costs. Researchers’ institutions and funders should instead mandate that their researchers self-archive their published articles in their own Institutional Repositories in order to make them (Green) OA.
   Mandating deposit in an Institutional Repository is a university and funder policy matter in which the publishing industry should have no say whatsoever. The way to remove the publishing industry lobby from this research-community decision loop is the pro-tem compromise — wherever there is any delay in adopting an OA self-archiving mandate — of weakening the mandate into an immediate-deposit/optional-access mandate (ID/OA), so that it can be adopted without any further delay.
   (Such ID/OA mandates can be accompanied by a cap on the maximum allowable length for any publisher embargo on the setting of access to the (immediate) deposit as OA: 3 months, 6 months, 12 months: whatever can be agreed on without delaying the adoption of the ID/OA mandate itself. The most important thing to note is that most of the current, sub-optimal Green OA mandates that have already been adopted or proposed — the ones that mandate deposit itself only after a capped embargo period [or worse: only if/when the publishers “allows it”] instead of immediately — are all really subsumed as special cases by the ID/OA mandate. The only difference is that the deposit itself must be immediate in all cases, with the allowable delay pertaining only to the date of the OA-setting.) …

This is pretty clear. The author wants to publish their work. The publisher tells them that only a small proportion of the scientific world will be able to read it, and that the publisher will make sure this is enforced. If they want more people to read it they have to pay more.
I was visiting Jimmy Stewart – a Blue Obelisk member – today in Colorado Springs. Jimmy has written an excellent program, MOPAC, which predicts molecular properties. He wants to write it up, and since he (and I) are on the board of the Journal of Molecular Modeling he looked to send it there. He wanted it to be open so that everyone could read it and the essential data files that define the program. These need to be preserved for all time. He looked at Springer Open Choice and was struck silent by the price…
It is quite clear that the academic community is supine. I compare them to the rabbits in Watership Down – the farmers fed them carrots and then culled them for the pot. I am increasingly appalled by the lack of concern about Openness and it makes me angry.
The only answer is to take back our own work. It is pointless negotiating these complex agreements. Every publisher has byzantine legal agreements: “You can let people see your work as long as you don’t put it in this place, don’t publicise it, take it down after x years…”
We own it.
And, quite simply, it’s a disgrace that scientists cannot make their work available to the whole of the world.
Let’s declare 2008 “world non-transfer-of-copyright year”. That gives us time to coordinate the mass action.
We could do it if we wanted, couldn’t we?

Posted in Uncategorized | Leave a comment

Why preserve data?

At the JISC/NSF meeting we had a very compelling example of why it is important to preserve data – both “scientific” and “humanities”. In 1994 Pang, You and Chu reported that ancient Chinese records could be used to calculate the exact time of an eclipse (the language in the abstract is fun: “in the 2nd (actually 12th) year of Sheng Ping reign period of King Shang (actually King Xi) the day began twice at Zheng”, and in 1302 BC “three flames ate the Sun, big stars were seen”). This is interpreted as a sunset eclipse, which is particularly easy to time.
The abstract is worth reading for the mixture of language. Essentially the recorded time of the eclipse differed from the calculated time by about 7 hours, so something had happened to change the rotational speed of the earth. Part of this was due to the tidal effect of the moon, but part was due to post-glacial rebound: as the ice melted from the glaciers, the crust rose because the weight had been removed. This rebound depends on the viscosity of the earth’s mantle, and from the time difference it was possible to calculate a value of about 10^21 Pa s.
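As a rough back-of-the-envelope sketch of why a few hours of clock error over three millennia is so informative, here are my own illustrative numbers (an assumed tidal lengthening of the day of about 2.3 ms per century), not the paper’s analysis. The recorded eclipse pins down the total accumulated clock error; comparing it with what tidal friction alone would predict isolates the non-tidal contribution attributed to rebound, from which the mantle viscosity follows.

```python
# Back-of-envelope sketch with assumed numbers, not the paper's analysis.
tidal_rate = 2.3e-3 / 36525        # extra seconds of day-length per day, from ~2.3 ms/century
days_elapsed = 3300 * 365.25       # roughly 1302 BC to the present

# If the day lengthens steadily, the accumulated clock error grows quadratically:
# delta_t ~ 1/2 * rate * t^2
delta_t_tidal = 0.5 * tidal_rate * days_elapsed**2
print(f"Error expected from tides alone: {delta_t_tidal / 3600:.1f} hours")   # ~13 hours

observed_hours = 7                 # roughly what the eclipse record implies
print(f"Difference to be explained by non-tidal effects such as rebound: "
      f"{delta_t_tidal / 3600 - observed_hours:.1f} hours")
```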
 There must be many other stories of scientific discoveries hidden in the record like this. I’d love to know of them.

Posted in Uncategorized | Leave a comment

My footprints in the digital sand

Egon Willighagen posts:

Clustering web search results

The Dutch Intermediair magazine of this week had a letter sent by a reader introducing Clusty, a web search engine that clusters the results. It does a pretty good job for ‘egon willighagen‘:

It seems to use other engines to do the searching and focuses on the clustering. The source engines exclude Google, and include Gigablast, MSN and Wikipedia.
For chemoinformatics it comes up with the following top 10 clusters: ‘Drug Discovery’, ‘Structure’, ‘Cheminformatics’, ‘Research’, ‘Books’, ‘Conference, German’, ‘Textbook, Gasteiger’, ‘Laboratory’, ‘Handbook of Chemoinformatics’, and ‘School’. Quite acceptable and useful clustering.
This might be the next step in googling. Rich, it also might solve your problem: searching for ‘ruby chemoinformatics’ does not give a ‘Depth First’ or ‘Rich Apodaca’ cluster 🙂

Wow! This is the web summarising our digital sandy feet. So (what else) I have to put myself in and see what the web thinks of me:

All Results (187) [the Clusty cluster listing itself is not reproduced here]
It’s not what I expected! But at least it’s good to know that Judith and I are part of a cluster.
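For anyone curious how clustering of search results might work in outline, here is a minimal sketch using TF-IDF vectors and k-means. This is my own toy illustration with invented snippets; Clusty’s actual algorithm is proprietary, as far as I know, and it assumes scikit-learn is installed.

```python
# Toy sketch of clustering search-result snippets; not Clusty's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented snippets of the kind a search for "chemoinformatics" might return.
snippets = [
    "Handbook of Chemoinformatics, edited by Gasteiger",
    "Drug discovery pipelines rely on cheminformatics tools",
    "A textbook introduction to chemical structure representation",
    "Conference on cheminformatics research, Germany",
    "Laboratory software for structure searching",
    "School course notes on drug discovery and QSAR",
]

# Represent each snippet as a TF-IDF vector, then group them with k-means.
vectors = TfidfVectorizer(stop_words="english").fit_transform(snippets)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for label, snippet in sorted(zip(labels, snippets)):
    print(label, snippet)
```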
Posted in Uncategorized | Leave a comment

Useful Chem

From Jean-Claude Bradley’s blog, UsefulChem and Skateboarding:
I just came across the blog of Karl Bailey, a chemistry teacher at Clark College who happens to teach virtually the same 3 organic chemistry classes that I do, in the same sequence following the Wade book. Clark has a quarter system like Drexel.
But what really caught my attention was his mention of UsefulChem and the image of skateboarders he used on the post. What a great representation of Open Source Science, at least the way that many of my friends and I conceive of it. I also get the same vibe from many of the young people that see me after I speak on the topic.
I suppose it represents a form of rebellion from the status quo, but not without standards for competence and dedication. Without that, rebellion is just cynicism.

Posted in Uncategorized | 1 Comment

desert island in space

There’s a splendid long-running program on the BBC – Desert Island Discs – which invites well known people (I hate the word celebrity and hope the concept disintegrates) to say what 8 records (discs) they would take to a desert island. They also can take one book (they get the Bible and Shakespeare anyway – it’s Britain) and one luxury.
Here’s a similar idea. Assume we know the world will be hit by an asteroid in a week’s time.  And human civilisation – and probably the race – will go. All digital preservation on the planet will slowly decay – it needs electricity. Physical artifacts buried in the earth may survive, but they might get destroyed by dinosaurs or whatever comes next.
So we transcribe our digits (statically) onto some medium which we blast into geostationary orbit. Let’s say a million silicon wafers with 10 megabytes on each – ten terabytes (I am sure this has all been thought of many times and there is a better solution). We assume that this medium remains in orbit for millennia and that cosmic rays don’t degrade it.
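A quick arithmetic check on those illustrative numbers:

```python
# Quick arithmetic check on the proposed orbital archive (illustrative numbers).
wafers = 1_000_000
bytes_per_wafer = 10 * 10**6            # 10 megabytes each
print(wafers * bytes_per_wafer / 10**12, "terabytes")   # -> 10.0 terabytes
```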
The human race has to decide what single digital object it would put into space. It can’t be the complete web, and it has to be bounded (i.e. only have internal hyperlinks, not external). We have to be able to go along with a 10-terabyte disc to someone or some organisation and say – please copy in Foo … (and I would not be keen on current compression formats as the aliens might not understand them).
 What should Foo be?

Posted in Uncategorized | 3 Comments

digital preservation of the scientific record

On Monday I shall be talking at Colorado State University on the theme of “Digital preservation of the scientific record” – probably not the precise title. Digital preservation (WP):
refers to the management of digital information over time. Unlike the preservation of paper or microfilm, the preservation of digital information demands ongoing attention. This constant input of effort, time, and money to handle rapid technological and organisational advance is considered the main stumbling block for preserving digital information beyond a couple of years. Indeed, while we are still able to read our written heritage from several thousand years ago, the digital information created merely a decade ago is in serious danger of being lost.
Digital preservation can therefore be seen as the set of processes and activities that ensure the continued access to information and all kinds of records, scientific and cultural heritage existing in digital formats.”
This is hard. Let’s assume we actually know what we wish to preserve (I’ll blog about that later). I’ve lost most of my digital past. Every time I moved employers (especially when they kicked me out) I was unable to transfer my record. They no longer maintain it. I have snippets (digital potsherds) cached in various web search engines – I came across one today in Clusty (another future post…). It was from 1996, and carried a Birkbeck address. It probably only exists in the fragmented digital strata – certainly the machine at Birkbeck that carried it is no more.
And, of course, there is context and process. If the bytes are in ASCII and use English, that’s a start. But many are binary. And there is much context (metadata) that is lost. There are useful prosthetics – thus I link to Wikipedia wherever possible and I can assume that other named entities in my discourse are trivially discoverable on today’s web. But will today’s web persist?
Many people talk of the sheer volume of data – that’s not the problem – it’s the complexity and interrelatedness. I got a mail yesterday asking:
“I have a C++ program and data

-complex genotypes (ca. 10 000 lines)
-microsatellite genotypes (ca. 6 000 lines)
which we would like to put in a permanent repository
which others reading our research articles might
want to access gratis.

What data repositories are available?”
and I realised I couldn’t answer the question! I know how to archive Open Source software (I use Sourceforge), and I know how to archive certain types of bio-data (protein sequences, etc.). I expect that there are repositories that hold genotypes, but I doubt that they accept data without it being coupled to a publication.
So I feel slightly awkward but have to say that we haven’t yet got good solutions. (I got very annoyed at the Glasgow meeting last year on digital scholarship when a smooth vendor of repositories told us how easy it was to put scientific digital objects into their system. And I let the meeting know that I didn’t think it was trivial.)
 

Posted in Uncategorized | Leave a comment

Memex

I now realise I forget everything. Partly I suppose as brain cells disappear but largely due to the increasing flood and diversity of information, coupled with trying to sort out all the new ideas to the exclusion of actually observing what is going on.
The idea of a perfect memory is not new. In a short story (“Funes the Memorious”) Borges described a boy who remembered everything (and therefore – though I think this was implicit – ran into the stack overflow of remembering his remembering…)
Vannevar Bush proposed the “memex” (a portmanteau of “memory extender”) – in WP’s words, “the theoretical proto-hypertext computer system he proposed in his 1945 The Atlantic Monthly article ‘As We May Think’”.
I am still digesting the implication of data-driven scholarship, and Marc Smith’s metaphor of our footprints in the digital sand – ca 10 terabytes per person. Clearly these are not at the level of Funes, though I think we can assume that virtually all our lives are non-private. For example when walking to our hotel in Phoenix we were videoed by three young skateboarders – presumably this will end up on YouTube – I have no idea why – are they budding film directors or identity thieves?
I think it’s inevitable that we shall soon have technology that records everything we do – not a new notion. A video implant in the forehead, an audio recorder in our earphone, a GPS for spacetime coordinates, intelligent software that adds metadata to people and other real-world objects. I expect this already exists in advanced computer science labs – I’m out of touch with nearly everything.
I hope it’s going to be fun.
 
(NB my spell checker is no longer working – another example of digital environment decay. Another >>15 minutes trying to put it right.)

Posted in Uncategorized | Leave a comment

ThermoML and TRC

I have spent a splendid day at NIST in Boulder invited by Michael Frenkel of TRC – a group which captures thermochemistry data from the literature and elsewhere on behalf of the US Department of Commerce. Here’s what they do – and I’ll explain why it’s exciting.
======

Tasks

Located in Boulder, CO, TRC Group performs several functions related to providing state-of-the-art thermodynamic data:

  • compiles and evaluates experimental data
  • develops tools and standards for archival and dissemination of thermodynamic data, especially critically evaluated data
  • develops electronic database products
  • maintains a web-repository of published data in ThermoML — an XML format developed by TRC for the representation of thermodynamic data

About TRC

TRC specializes in the collection, evaluation, and correlation of thermophysical, thermochemical, and transport property data. The goals of TRC are to establish a comprehensive archive of experimental data covering thermodynamic, thermochemical, and transport properties for pure compounds and mixtures of well-defined composition, and correspondingly, to provide a comprehensive source of critically evaluated data.

Critically Evaluated Data

An important and useful aspect of our work here at TRC, and of the Physical and Chemical Properties Division of NIST as a whole, is to provide critically evaluated data. Critical evaluation is a process of analyzing all available experimental data for a given property to arrive at recommended values together with estimates of uncertainty, providing a highly useful form of thermodynamic data for our customers. The analysis is based on intercomparisons, interpolation, extrapolation, and correlation of the original experimental data collected at TRC. Data are evaluated for thermodynamic consistency using fundamental thermodynamic principles, including consistency checks between data and correlations for related properties. While automated as much as possible, this process is overseen by experts with a great deal of experience in the field of thermodynamic data. Professional staff are responsible for the evaluation of each set of data that is committed to the archive.
=====
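To make “critical evaluation” slightly more concrete, here is a toy sketch of one small step in the process – combining several reported measurements of the same property, each with its own uncertainty, into a recommended value. This is my own illustration with made-up numbers, not TRC’s actual procedure, which is far more sophisticated.

```python
# Toy illustration of one step in critical evaluation: an inverse-variance
# weighted combination of measurements. Not TRC's actual procedure.
import math

# Hypothetical reported values of a boiling point (K) with stated uncertainties.
measurements = [(353.25, 0.10), (353.31, 0.05), (353.10, 0.30)]

weights = [1.0 / u**2 for _, u in measurements]          # inverse-variance weights
recommended = sum(w * x for (x, _), w in zip(measurements, weights)) / sum(weights)
uncertainty = math.sqrt(1.0 / sum(weights))

print(f"recommended value: {recommended:.2f} +/- {uncertainty:.2f} K")
```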
This is the sort of foundation that the data-driven world of the future will be built on. Thermochemistry tells us how the atmosphere works, how energy can be transported, why chemical plants explode and much more.
Michael and his colleagues are typical data scientists and scholars. This was a role emphasized by the JISC/NSF meeting – we critically need the data scholars of the future but we don’t reward them. It’s not easy to get tenure by collecting and publishing data. It’s difficult to build careers for those who can program and run software projects but don’t publish in “proper peer-reviewed journals”.
So JISC/NSF suggest peer-reviewed data journals – which should be regarded in the same light as text-based publications. The intellectual endeavour can be at least as challenging.
Anyway Michael has developed ThermoML – a markup language for thermochemistry. Isn’t this a competitor to CML? Not at all. We’ve kept in touch for several years – been on the same platforms and agreed we’d keep in regular touch. But this was the first time I’ve been able to visit.
In fact there’s a wonderful complementarity between what we are doing. We have some common problems – how to create declarative markup for physical science (we’re going to look at OMDOC), how CML can be embedded in ThermoML to manage compounds and mixtures, and how CML can use the property vocabulary of ThermoML for publications involving physical measurements. As always with markup languages we are driven by real examples, so we’ll be exchanging documents to see how easy this is, trying to create robust but flexible markup, and seeing if we can create rough consensus and running code.
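To give a flavour of what embedding might look like, here is a schematic sketch. The CML namespace is real, but the ThermoML-style namespace and element names here are simplified placeholders of my own, not the actual schema.

```python
# Schematic sketch of a CML molecule block inside a ThermoML-style record.
# The "TML" names are placeholders, not the real ThermoML schema.
import xml.etree.ElementTree as ET

CML = "http://www.xml-cml.org/schema"          # real CML namespace
TML = "http://example.org/thermoml-sketch"     # placeholder namespace

record = ET.Element(f"{{{TML}}}compoundRecord")

# The chemical identity carried as embedded CML...
molecule = ET.SubElement(record, f"{{{CML}}}molecule", id="m1", title="benzene")
atoms = ET.SubElement(molecule, f"{{{CML}}}atomArray")
ET.SubElement(atoms, f"{{{CML}}}atom", id="a1", elementType="C")

# ...and the measured property alongside it, using a ThermoML-like vocabulary.
prop = ET.SubElement(record, f"{{{TML}}}property",
                     name="normal boiling temperature", units="K")
prop.text = "353.25"

print(ET.tostring(record, encoding="unicode"))
```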
TRC has abstracted ca 20,000 articles over 6 years. That’s a lot of manual labour, although much of it is done by willing students. Some of the work is possibly useful for OSCAR…
… However Michael has persuaded 4 publishers to have their data converted into ThermoML and made available freely on the website. See the page with the titles. This is a great pointer to the future – if all journals did this I would have to find something else to blog about. Thanks to all at TRC.

Posted in Uncategorized | 2 Comments