The cost of decaying scientific data

My colleague John Davies, who provides a crystallographic service for the department, has estimated that the data for 80% of crystal structures (in any chemistry department) never leave the laboratory. They are locally archived, perhaps on CD-ROM, perhaps on a local or departmental machine. With the passage of time – changes in staff, organisation, machines – information decays, and it is likely that crystallographic data will be systematically lost.

Recently a number of UK groups have been funded by JISC – the Joint Information Systems Committee – to research the development of digital repositories. Three groups have been collaborating in chemistry, with a strong emphasis on crystallography and spectroscopy. This involves all aspects – building software, designing metadata specifications, and understanding the way chemists work and think. We have found that the social aspects are at least as important as the technical – I won't elaborate here yet, as these will be reported at:

An eBank / R4L / SPECTRa Joint Consultation Workshop.
Digital repositories supporting eResearch: exploring the eCrystals
Federation Model

Why is it important to archive the data? Isn’t normal academic publication (including theses) sufficient? Isn’t it very costly and a waste of money that could be spent on proper research?
Well, the crystallographic community has archived its data for many years, and research on this data alone has given rise to hundreds or even thousands of papers data-mining this resource. Without this, chemistry would be very much poorer, as we would have little in the way of molecular or crystal-structure systematics.
So what is the cost of the unpublished data? To carry out the structure determinations at commercial rates would cost about USD 1500–5000 per structure for the size of structures currently published. Let's assume a laboratory does 500 structures a year, and that full economic costs are half the commercial rate (this is just a guess) – we are looking at roughly half a million dollars per year to do crystal structures in a chemistry department. (I suspect the numbers are on the low side – I'd be interested in comments.)
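The arithmetic behind that estimate can be laid out explicitly (all figures are the post's own assumptions – the commercial rate, the half-commercial guess, and the 500-structure throughput):

```python
# Back-of-envelope estimate using the post's assumed figures.
commercial_low, commercial_high = 1500, 5000  # USD per structure at commercial rates
full_economic_factor = 0.5                    # guess: full economic cost = half commercial
structures_per_year = 500                     # assumed departmental throughput

low = structures_per_year * commercial_low * full_economic_factor
high = structures_per_year * commercial_high * full_economic_factor
print(f"Annual cost per department: ${low:,.0f} - ${high:,.0f}")
# → Annual cost per department: $375,000 - $1,250,000
```

The half-million-dollar figure quoted above sits at the lower end of this range.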
Allowing that some of the material has been published as comments in chemical papers, I suspect that the information from quite a high proportion of the structures is never published in any form. How easy is it to find information in current theses, especially if you don't know it's there?
I think I would be safe in saying that, worldwide, hundreds of millions of dollars' worth of crystallographic data is lost each year. For spectra and synthetic chemistry the loss will be at least ten times greater. Many synthetic chemists say they are interested in failed reactions – and these are almost never published!
If funders are aware of this they should be concerned about the loss. Funders are increasingly proactive in requiring funded research to be openly accessible. The Wellcome Trust is among the strongest proponents:

Robert Terry on Open Access

and a quote:

The Trust provides additional funding to cover the costs relating to article-processing charges levied by publishers who support this model.
• Approximately 1% of the research grant budget would cover costs of open access publishing


3 Responses to The cost of decaying scientific data

  1. Thanks, Peter, for this post. You make a strong argument for open data parallel to what has already been established about OA journal articles. This is the kind of information that is most helpful to non-chemists looking at the overall purpose and scope of OD.
    If you have a chance, can you blog about any of these topics? My approach is rather simplistic, since I am really looking at the language and not the actual chemistry involved in your work. This is a great perspective, however, because it allows me to examine the parts that make up the whole.
    1. As a community of chemists, most of the BO group agrees with the philosophy of sharing and collaboration. However, there seems to be discussion about how the information is shared with others. Is there one “reader” that can analyze any of the output formats, or is there a standard way to output information that everyone should use?
    2. In terms of what you have done so far and what you are working toward, what are some frustrations or challenges that are most impeding progress?
    3. What will be the best way to communicate these methods to the next generation of chemists?
    Thanks,
    Beth

  2. pm286 says:

    >1. As a community of chemists, most of the BO group agrees with the philosophy of sharing and collaboration. However, there seems to be discussion about how the information is shared with others. Is there one “reader” that can analyze any of the output formats, or is there a standard way to output information that everyone should use?
    There are zillions of formats in chemistry – almost all programs create their own. This is a horrible problem as they have syntactic, semantic and ontological disconnects. This was traditionally solved by writing Foo2Bar converters. These are highly lossy, difficult to maintain, and usually incomplete. The next approach is OpenBabel, which reads in Foo, converts it to a universal data structure internally, and outputs it in Bar format. This is lossy, but the best that can be done. For example if I have a 3-D structure (like a CIF) and output it in a format that does not hold coordinates (like SMILES) I lose the coordinates. OpenBabel is highly valuable and that is why Geoff got a BO award.
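To make the lossiness concrete, here is a toy sketch of that hub-and-spoke conversion. None of this is the real OpenBabel API – the classes, reader and writer are invented purely to show where information falls on the floor:

```python
# Toy illustration of hub-and-spoke format conversion (NOT the OpenBabel API).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Atom:
    element: str
    coords: Optional[tuple] = None   # 3-D coordinates, if the source format has them

@dataclass
class Molecule:
    atoms: list = field(default_factory=list)

def from_cif(records):
    """Pretend CIF reader: CIF carries 3-D coordinates."""
    return Molecule([Atom(el, xyz) for el, xyz in records])

def to_smiles(mol):
    """Pretend SMILES writer: SMILES has no coordinate slot, so they are dropped."""
    return "".join(a.element for a in mol.atoms)   # crude: just element symbols

mol = from_cif([("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0))])
print(to_smiles(mol))   # the coordinates are irretrievably lost on output
```

Anything the universal structure or the target format cannot represent (here, the coordinates) is silently discarded – which is why round-tripping data through converters degrades it.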
    The better approach is to use an extensible approach – and the obvious and only candidate is XML. That is why Henry and I have developed CML – Chemical Markup Language. This is capable of carrying 95% of chemical material in current data and discourse (e.g. articles). It does this by using dictionaries which are also extensible. There are areas (scientific publishing, comp chem, open source) where this is recognised and valued – we have to extend this into mainstream chemical practice.
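As a flavour of what an extensible XML representation looks like, here is a minimal CML-style fragment built with Python's standard library. The element and attribute names (molecule, atomArray, atom, elementType, x3/y3/z3) follow published CML conventions, but this sketch is illustrative and not validated against the CML schema:

```python
# A minimal CML-flavoured fragment, built with the standard library.
# Illustrative only: not a schema-validated CML document.
import xml.etree.ElementTree as ET

mol = ET.Element("molecule", id="m1")
atoms = ET.SubElement(mol, "atomArray")
ET.SubElement(atoms, "atom", id="a1", elementType="C", x3="0.0", y3="0.0", z3="0.0")
ET.SubElement(atoms, "atom", id="a2", elementType="O", x3="1.2", y3="0.0", z3="0.0")

print(ET.tostring(mol, encoding="unicode"))
```

Because the vocabulary is XML, new properties can be added via dictionaries without breaking existing readers – the extensibility the paragraph above describes.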
    This is where Bioclipse comes in. It is robust, carries much of the functionality that we require and is an excellent platform for packaging the BO software. When Bioclipse has all of this we will start advocating it as a chemical desktop. The early adopters will like it and spread the word. But it’s slow and it’s hard and we can’t make any mistakes – it has to be right first time.
    >2. In terms of what you have done so far and what you are working toward, what are some frustrations or challenges that are most impeding progress?
    The future of scientific information and therefore scientific research depends on open data. Beyond that is the relatively low appreciation of informatics and software in chemistry. That makes it hard to create the next generation.
    >3. What will be the best way to communicate these methods to the next generation of chemists?
    I think it will come on top of the social computing revolution. As everyone uses Flickr, MySpace, etc. the undergraduates will start to develop the next generation of informatics. Like Google. We need to have the vision to create the formal tools they need to link into these approaches. If they use RSS feeds we have CMLRss ready for use. If they use blogs, we can help them add InChI. Chemistry needs the formality of connection tables, coordinates, etc. and we have created these tools. The undergraduates who work on summer projects here are fantastic – they pick up the new technology by osmosis. So give them their opportunities and see what they can do.

  3. Peter is right about the problem with the variety of formats to represent chemicals (and chemical reactions ultimately). It is hard enough to convince chemists to label and discuss their compounds other than with a number in a scheme (for example ketone 11). Because most online databases take SMILES (e.g. emolecules.com) chemists are likely to start using that format before any other that may make more sense for searchability (like InChI). The approach we are taking in the short term is redundancy. Our system automatically converts the SMILES representation of our molecules of interest into InChI, CML, a pic and whatever else we need using OpenBabel. This is similar to what happened with the quest for a common language – Esperanto looked great on paper but more than 100 years later we are using the translate feature of search engines instead to understand each other.
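The redundancy pipeline the commenter describes can be sketched as follows. The converter functions here are placeholders standing in for real OpenBabel invocations (e.g. via the obabel command-line tool); only the shape of the pipeline – one primary SMILES, many generated representations stored side by side – is the point:

```python
# Sketch of the redundancy approach: store every representation alongside
# the primary SMILES. The converters below are stand-ins, not real chemistry.
def to_inchi(smiles: str) -> str:
    return f"InChI-placeholder({smiles})"    # would call OpenBabel in practice

def to_cml(smiles: str) -> str:
    return f"<molecule title='{smiles}'/>"   # placeholder CML fragment

def archive(smiles: str) -> dict:
    """Generate and store all representations, keyed by format name."""
    return {"smiles": smiles, "inchi": to_inchi(smiles), "cml": to_cml(smiles)}

record = archive("CCO")   # ethanol as SMILES
print(sorted(record))     # → ['cml', 'inchi', 'smiles']
```

Storing the representations redundantly means a search tool can use whichever format it understands, without requiring every chemist to agree on one.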
