I have been reviewing the availability of Open Data for cyberscience, concentrating recently on crystallography and chemical spectra as examples. I’ll propose a new business model here; it is still very ill-formed and I welcome comments. It applies particularly to disciplines where the data are collected in a fragmented manner rather than being coordinated as in, for example, surveys of the earth or sky. I call this fragmentation “hypopublication”.
However, the Internet has the power to overcome this fragmentation if the following conditions are met:
- the data are fully Open and exposed. There must be no cost, no impediment to access, no registration (even if free), no forms to fill in.
- the data must conform to a published standard and the software to manage that standard must be Openly available (almost necessarily Open Source). The metadata should be Open.
- the exposing sites must be robot-friendly (and in return the robots should be courteous).
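As a rough illustration of what "courteous" could mean in practice, here is a minimal Python sketch that reads a site's robots.txt and keeps only the URLs the site allows, together with the delay it asks for. The site, the file paths, the user-agent name and the fallback delay are all invented for the example; this is not how any particular harvester works.

```python
from urllib.robotparser import RobotFileParser

# Assumed fallback pause when the site states no Crawl-delay (hypothetical value).
DEFAULT_DELAY_SECONDS = 5.0


def plan_fetches(robots_txt, user_agent, urls):
    """Return (allowed_urls, delay_seconds) according to a robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Keep only the URLs the site permits for this user agent.
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    # Honour the site's requested delay, or fall back to our own default.
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


# Invented robots.txt and URLs, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

EXAMPLE_URLS = [
    "https://journal.example.org/data/article1.cif",
    "https://journal.example.org/private/draft.cif",
]

allowed_urls, delay_seconds = plan_fetches(
    EXAMPLE_ROBOTS_TXT, "hypothetical-cif-harvester", EXAMPLE_URLS
)
```

A real robot would then sleep for `delay_seconds` between requests to the same site and identify itself honestly in its user-agent string.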
Such a state nearly exists in modern crystallography. The situation for macromolecules is that authors are required to deposit data in a central repository (http://www.rcsb.org). For small molecules there is less Open Data but a significant amount is available because of the work put in by:
- the International Union of Crystallography (IUCr), which for at least 30 years has pioneered the development of data standards and ontologies, culminating in its current Crystallographic Information File (CIF) specification.
- a number of publishers who have Openly exposed CIF data files on their websites for every article which contains relevant crystallography. They include the IUCr itself, the Royal Society of Chemistry, the American Chemical Society, the Chemical Society of Japan, and the American Mineralogist. (There may be others; if so I apologize and ask them to come forward.) The licences are occasionally a bit fuzzy but the spirit and intention are clear. The data are there as a scientific record and to be re-used.
- the Crystallography Open Database, a volunteer activity which has aggregated approximately 50,000 CIFs from donations.
The Internet now means that the data can be reliably aggregated, as in our CrystalEye knowledgebase. This also acts as an immediate alerting system: as soon as a new piece of interesting crystallography is published, subscribers to our RSS feeds are notified.
The criticism is sometimes made that unless data are inspected by humans they cannot be certified as fit for purpose. This depends entirely on what the purpose is. It’s often better to have data of variable quality than no data at all. And it’s always better to have data of variable KNOWN quality rather than none, even if the quality is often known to be low. It’s a balance of precision and recall (see “Why 100% is never achievable”). Joe Townsend here has shown in his PhD that if we lower the recall of crystallographic data (i.e. throw out everything that is known to have errors) we can get very high precision indeed without having to inspect the data.
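To make the trade-off concrete, here is a small Python sketch with invented data. It is not Joe's method, just the arithmetic: each structure carries a `flagged` field (an automated check reported a known problem) and a `correct` field (ground truth, known here only because the sample is made up). Discarding everything flagged raises precision and lowers recall.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    identifier: str
    flagged: bool  # an automated check reported a known problem
    correct: bool  # ground truth, known only because this sample is invented


def precision_recall(kept, all_structures):
    """Precision and recall of the kept set against all correct structures."""
    good_kept = sum(s.correct for s in kept)
    good_total = sum(s.correct for s in all_structures)
    precision = good_kept / len(kept) if kept else 0.0
    recall = good_kept / good_total if good_total else 0.0
    return precision, recall


# Invented sample: four correct structures, one of them wrongly flagged.
SAMPLE = [
    Structure("a", flagged=False, correct=True),
    Structure("b", flagged=False, correct=True),
    Structure("c", flagged=True, correct=True),   # good data lost to the filter
    Structure("d", flagged=True, correct=False),
    Structure("e", flagged=True, correct=False),
    Structure("f", flagged=False, correct=True),
]

# Keep everything: full recall, but a third of the set is wrong.
unfiltered = precision_recall(SAMPLE, SAMPLE)

# Throw out everything flagged: no human inspection needed.
filtered = precision_recall([s for s in SAMPLE if not s.flagged], SAMPLE)
```

In this toy sample the unfiltered set has precision 2/3 at recall 1.0, while the filtered set has precision 1.0 at recall 0.75.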
Our remaining problem is that not all publishers expose the data Openly. The rest of this post explores why they should think of doing so.
Before the Internet it was necessary to have central repositories to put data in, but now with all publishers online the data can just as easily be posted on their sites. Even if there is no intrinsic search mechanism on the publisher sites, researchers like Nick Day (here) can create tools for managing the data and metadata in CrystalEye. So why don’t all publishers expose their crystallography? I think it’s just a matter of priorities, and I hope this post will advance the case.
Data costs money. True, but the amount is falling. I don’t know how much it costs the publishers above to manage the exposure of the crystallography files (and I’m not asking), but it’s obviously not prohibitive. They’ve done it, I assume, because they think it’s an important part of the publication process: allowing science to be verified, providing a record, and allowing new research to build on old. So they have presumably included the cost within the general cost of publication, which is covered mainly by subscriptions but, for some articles, also by author- or funder-paid Open Access.
The main cost of the process, the creation of communal metadata, has already been met. This is probably the largest barrier to any group trying to emulate the idea. But it’s also happening in thermochemistry (ThermoML) where a number of journals:
- Journal of Chemical & Engineering Data (American Chemical Society)
- The Journal of Chemical Thermodynamics (Elsevier)
- Fluid Phase Equilibria (Elsevier)
- Thermochimica Acta (Elsevier)
- International Journal of Thermophysics (Springer)
all require data to be published at source and made Openly available. Here’s a sample issue which lists the Open data:
==================================
ThermoML Data for The Journal of Chemical Thermodynamics, Vol. 39, No. 6 June 2007
Developed in cooperation between The Journal of Chemical Thermodynamics and the Thermodynamics Research Center (TRC)
The full Table of Contents for this issue is available from JCT. The numbers below correspond to the numbers in the full Table of Contents.
2. Low pressure solubility and thermodynamics of solvation of oxygen, carbon dioxide, and carbon monoxide in fluorinated liquids
Pages 847-854
J. Deschamps, D.-H. Menz, A.A.H. Padua and M.F. Costa Gomes
ThermoML Data (To download: right-click on link and select “Save Link Target As” )
3. High pressure phase behaviour of the binary mixture for the 2-hydroxyethyl methacrylate, 2-hydroxypropyl acrylate, and 2-hydroxypropyl methacrylate in supercritical carbon dioxide
Pages 855-861
Hun-Soo Byun and Min-Yong Choi
ThermoML Data (To download: right-click on link and select “Save Link Target As” )
===================================
You’ll see that the data are Open.
So couldn’t this be a model for all of science? As I have posted recently, I’m going to write to the editors of Elsevier’s Tetrahedron suggesting that they make all their crystallographic data available Openly. They agree it’s not their copyright, so it’s just a question of how to do it; files on a website shouldn’t be a major expense.
And funders should encourage this. If you are urging authors and journals to publish Open full-text, please extend this to data. Yes, there are technical difficulties in some cases, such as metadata, complexity and size, but they probably aren’t too scary. And in any case the community will help work out how to use the data.