Over the last 1-2 weeks Nick Day has been calculating NMR spectra and comparing the results with experiment. As there appears to be considerable interest we have agreed to make our conclusions Open on an almost daily basis. These have led to some compelling observations on the value of Open Data which I shall publish later. To summarise:
We needed an Open Data set of NMR spectra and associated molecular structures. The data had to be Open because we wished to publish all the results in detail without having to go back to the “owners”. The data also required the spectra to be annotated with the molecular structure (“assignment of peaks”). Ideally the data should be in XML, as this is the only way of avoiding potential semantic loss and corruption, but we would have managed with legacy formats.
There are well over 1 million spectra published each year in peer-reviewed journals. Almost NONE are published in semantic form – most appear as textual or graphical PDFs. It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers – at least Elsevier allow us to do this. In any case we would have to use OSCAR to extract the data, which probably involves some corruption and loss.
So we looked for Open collections of spectra. There are many, with an aggregated count of probably over a million spectra. However almost all are completely closed – they require licence fees and forbid re-use. I have criticized this practice before and shall do so again later, but here I note that the only Open collection of spectra is NMRShiftDB – an open NMR database on the web. This has been created by Christoph Steinbeck and colleagues and contains somewhat over 20,000 spectra. Because Christoph is a member of the Blue Obelisk the data can be exported in CMLSpect (XML), without which Nick Day’s project would not have been possible.
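For readers who want to poke at the export themselves, here is a minimal sketch of pulling peaks and their atom assignments out of a CMLSpect file. The element and attribute names (peak, xValue, atomRefs) and the CML namespace follow common CMLSpect usage but are assumptions here – check them against the actual NMRShiftDB download.

```python
# Minimal sketch: read peaks and assignments from a CMLSpect file.
# Element/attribute names are assumed from typical CMLSpect usage.
import xml.etree.ElementTree as ET

CML_NS = "{http://www.xml-cml.org/schema}"   # usual CML namespace; verify

def read_peaks(cml_file):
    """Return a list of (shift_ppm, assigned_atom_ids) tuples."""
    tree = ET.parse(cml_file)
    peaks = []
    for peak in tree.iter(CML_NS + "peak"):
        x = peak.get("xValue")
        if x is None:
            continue
        atom_refs = (peak.get("atomRefs") or "").split()
        peaks.append((float(x), atom_refs))
    return peaks

if __name__ == "__main__":
    for shift, atoms in read_peaks("nmrshiftdb_entry.cml"):
        print(f"{shift:8.2f} ppm  ->  {' '.join(atoms) or 'unassigned'}")
```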
We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. This was deliberate, and part of the eScience question: can a robot automatically determine the quality of data? Several correspondents have missed this point – it was more important to answer this question than whether the data were “good”. IF AND ONLY IF the data AND METADATA were of sufficient quality would it be possible to say something useful about the value of the theoretical calculations.
We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. Molecules with floppy groups cannot be easily analysed. So we selected small molecules with rigid frameworks. This gave a starting set of about 500 candidates, each of which takes on average 1 day to calculate.
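A sketch of the kind of pre-filter this implies, written with RDKit. Only the 20-heavy-atom limit comes from the criteria above; the zero-rotatable-bond cutoff is just one crude way of saying “rigid framework”, not the test Nick actually used.

```python
# Sketch of the pre-filter: keep small, rigid molecules.
# Requires RDKit; the rotatable-bond cutoff is illustrative only.
from rdkit import Chem
from rdkit.Chem import Descriptors

MAX_HEAVY_ATOMS = 20
MAX_ROTATABLE_BONDS = 0   # "rigid framework": no freely rotating bonds

def is_candidate(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if mol.GetNumHeavyAtoms() > MAX_HEAVY_ATOMS:
        return False
    if Descriptors.NumRotatableBonds(mol) > MAX_ROTATABLE_BONDS:
        return False
    return True

# camphor passes; decane is within the size limit but too floppy
print([s for s in ["CC1(C)C2CCC1(C)C(=O)C2", "CCCCCCCCCC"] if is_candidate(s)])
```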
About 200 of these had computational problems – mainly that they were too large, that the calculations didn’t converge, or that Condor had problems managing the jobs. So we have a final list of about 300 candidates.
We have listed the analysis of these results over the last few days. It is clear that some entries have “errors” and that there were also defects in the initial calculation models. Henry knew the latter, of course, but even if we hadn’t known this at the start the data would have allowed us to hypothesise that Br and Cl give rise to serious additive deviations. So this is at least confirmation of a known effect, for which we have made empirical corrections based on theory.
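As an illustration of what such an additive empirical correction looks like, here is a sketch that applies a fixed offset per Cl/Br bonded to a carbon. The numerical values are placeholders, not the corrections we actually fitted.

```python
# Sketch of an additive correction for carbons bonded to heavy halogens
# (the known heavy-atom effect).  The offsets below are placeholders.
HALOGEN_CORRECTION_PPM = {"Cl": -3.0, "Br": -12.0}   # illustrative only

def corrected_shift(calc_shift_ppm, neighbour_elements):
    """Apply one additive correction per Cl/Br bonded to this carbon."""
    correction = sum(HALOGEN_CORRECTION_PPM.get(el, 0.0)
                     for el in neighbour_elements)
    return calc_shift_ppm + correction

# e.g. a CHBr2 carbon predicted at 55 ppm would be shifted by 2 * -12 ppm
print(corrected_shift(55.0, ["H", "Br", "Br"]))   # -> 31.0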
We have shown examples of poor agreement, all of which we had anticipated in principle. The data set contains many problems, including:
- wrong structures. I am sure that at least one structure does not correspond to the spectrum
- misassignments. These are very common – probably 20% of entries have misassignments (a simple automatic check for these is sketched after this list)
- transcription errors. Difficult to say, but probably about 1-5%
- conformational problems. There are certain molecules which have conformational variability (i.e. floppy) but we have only calculated one conformer. The most common example is medium-sized rings
- human editing of data leading to corruption. At least 2 entries
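The misassignment check mentioned above can be as crude as asking whether swapping two experimental assignments makes the agreement with the calculation much better. A sketch, with an illustrative threshold and made-up shifts:

```python
# Sketch: flag a likely misassignment if swapping two experimental shifts
# lowers the RMSD against the calculated values by more than a threshold.
from itertools import combinations

def rmsd(calc, obs):
    return (sum((c - o) ** 2 for c, o in zip(calc, obs)) / len(calc)) ** 0.5

def suspect_swaps(calc, obs, improvement_ppm=2.0):
    """Return (i, j) index pairs whose swap improves the RMSD markedly."""
    base = rmsd(calc, obs)
    flags = []
    for i, j in combinations(range(len(obs)), 2):
        swapped = list(obs)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        if base - rmsd(calc, swapped) > improvement_ppm:
            flags.append((i, j))
    return flags

calc = [128.4, 31.2, 205.6]
obs  = [30.8, 128.9, 205.1]          # first two peaks apparently transposed
print(suspect_swaps(calc, obs))      # -> [(0, 1)]
```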
As a result Nick is going to produce a cleaned data set manually. He has already done most of this (has he slept?). He cannot do this automatically as the metadata are not present in the XML files. He will then be in a position to answer the following questions (the sketch after the list shows the basic comparison involved):
- how much of the variance is due to experimental problems?
- if this is lower than, say, 70%, is it possible to detect systematic “errors” in the computational methodology?
- if so, can the method be improved?
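The basic comparison behind these questions is a straight-line fit of observed against calculated shifts: R² says how much of the variance the calculation explains, and a slope far from 1 or an intercept far from 0 points to a systematic “error”. A sketch with made-up numbers:

```python
# Sketch: fit observed shifts against calculated ones and report
# slope, intercept and R^2 as simple diagnostics.  Data are illustrative.
import statistics

def fit_and_diagnose(calc, obs):
    mean_c, mean_o = statistics.fmean(calc), statistics.fmean(obs)
    sxx = sum((c - mean_c) ** 2 for c in calc)
    sxy = sum((c - mean_c) * (o - mean_o) for c, o in zip(calc, obs))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_c
    ss_res = sum((o - (slope * c + intercept)) ** 2 for c, o in zip(calc, obs))
    ss_tot = sum((o - mean_o) ** 2 for o in obs)
    return slope, intercept, 1.0 - ss_res / ss_tot

calc = [20.1, 55.3, 77.0, 128.4, 170.2]
obs  = [21.0, 56.1, 78.5, 129.0, 172.4]
slope, intercept, r2 = fit_and_diagnose(calc, obs)
print(f"slope={slope:.3f} intercept={intercept:.2f} ppm R^2={r2:.4f}")
```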
If we can believe in the methodology then we can start to use it as a tool for analysing future data sets. But until then we can’t.
We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.