Over the last 1-2 weeks Nick Day has been calculating NMR spectra and comparing the results with experiment. As there appears to be considerable interest, we have agreed to make our conclusions Open on an almost daily basis. These have led to some compelling observations on the value of Open Data, which I shall publish later. To summarise:
We needed an Open Data set of NMR spectra and associated molecular structures. The data had to be Open because we wished to publish all the results in detail without needing to go back to the “owners”. The data also required annotation of the spectra with the molecular structure (“assignment of peaks”). Ideally the data should be in XML, as this is the only way of avoiding potential semantic loss and corruption, but we could have managed with legacy formats.
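For concreteness, here is a minimal sketch of how a program might pull the shifts out of a CMLSpect-style file. The element and attribute names (peak, xValue) and the namespace are my reading of the CMLSpect conventions; treat the exact schema details as illustrative rather than normative.

```python
# Minimal sketch: extract peak positions from a CMLSpect-style XML file.
# Element/attribute names (peak, xValue) and the namespace are assumed
# from the CMLSpect conventions; check the real schema before relying on them.
import xml.etree.ElementTree as ET

CML_NS = "{http://www.xml-cml.org/schema}"

def read_shifts(path):
    """Return the list of peak positions (chemical shifts) in the file."""
    tree = ET.parse(path)
    return [float(peak.get("xValue"))
            for peak in tree.getroot().iter(CML_NS + "peak")]
```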
There are well over 1 million spectra published each year in peer-reviewed journals. Almost NONE are published in semantic form – most appear as textual or graphical PDFs. It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers – at least Elsevier allow us to do this. In any case we would have to use OSCAR to extract the data, probably introducing corruption and loss.
So we looked for Open collections of spectra. There are many, with an aggregated count of probably over a million spectra. However almost all are completely closed – they require licence fees and forbid re-use. I have criticized this practice before and shall do so again later, but here I note that the only Open collection of spectra is NMRShiftDB – an open NMR database on the web. This was created by Christoph Steinbeck and colleagues and contains somewhat over 20,000 spectra. Because Christoph is a member of the Blue Obelisk the data can be exported in CMLSpect (XML), without which Nick Day’s project would not have been possible.
We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. This was deliberate, and part of the eScience question: can a robot automatically determine the quality of data? Several correspondents have missed this point – it was more important to answer this question than to ask whether the data were “good”. IF AND ONLY IF the data AND METADATA were of sufficient quality would it be possible to say something useful about the value of the theoretical calculations.
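To make the robot idea concrete: the simplest possible such check is to flag entries whose calculated and observed shifts disagree by more than some threshold. A minimal sketch follows, with an arbitrary illustrative threshold – this is not Nick’s actual code.

```python
# Sketch of the simplest robotic quality check: flag entries whose
# observed and calculated shifts disagree badly. The 5 ppm threshold
# is an arbitrary illustration, not the value used in our analysis.
def flag_suspect(entries, threshold=5.0):
    """entries: iterable of (entry_id, observed_ppm, calculated_ppm)."""
    return [(eid, obs, calc) for eid, obs, calc in entries
            if abs(obs - calc) > threshold]
```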
We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. Molecules with floppy groups cannot easily be analysed. So we selected small molecules with rigid frameworks. This gave a starting set of about 500 candidates, each of which takes on average one day to calculate.
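The selection itself is mechanical. Here is a sketch of such a pre-filter, using RDKit purely for illustration – not necessarily the toolkit we used – with the rotatable-bond count as a crude stand-in for “rigid framework”.

```python
# Sketch of the candidate pre-filter: small (<= 20 heavy atoms) and rigid.
# RDKit is used purely for illustration; the rotatable-bond count is a
# crude proxy for "rigid framework".
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcNumRotatableBonds

def is_candidate(smiles, max_heavy=20, max_rotatable=0):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (mol.GetNumHeavyAtoms() <= max_heavy
            and CalcNumRotatableBonds(mol) <= max_rotatable)
```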
About 200 of these had computational problems – mainly that the jobs were too large, failed to converge, or that Condor had problems managing them. So we have a final list of about 300 candidates.
We have listed the analysis of these results over the last few days. It is clear that some entries have “errors” and that there were also defects in the initial calculation models. Henry knew the latter, of course, but even if we hadn’t known this at the start we would still have been able to hypothesise that Br and Cl give rise to a serious additive deviation. So this is at least confirmation of a known effect, for which we have made empirical corrections based on theory.
We have shown examples of poor agreement, all of which we anticipated in principle. The data set contains many problems, including:
- wrong structures. I am sure that at least one structure does not correspond to the spectrum
- misassignments. These are very common – probably 20% of entries have misassignments
- transcription errors. Difficult to say, but probably about 1-5%
- conformational problems. There are certain molecules which have conformational variability (i.e. are floppy) but for which we have only calculated one conformer. The most common example is medium-sized rings
- human editing of data leading to corruption. At least 2 entries
As a result Nick is going to produce a cleaned data set manually. He has already done most of this (has he slept?). He cannot do this automatically as the metadata are not present in the XML files. He will then be in a position to answer the following questions (a sketch of one such analysis follows the list):
- how much of the variance is due to experimental problems?
- if this is lower than, say, 70%, is it possible to detect systematic “errors” in the computational methodology?
- if so, can the method be improved?
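One way such a systematic “error” might show up is as a slope or intercept in a least-squares fit of calculated against experimental shifts. A minimal numpy sketch of the idea – illustrative only, not Nick’s actual analysis:

```python
# Sketch: look for a systematic offset/scaling in calculated vs
# experimental shifts. A slope far from 1 or an intercept far from 0
# suggests a correctable systematic error; the residual RMSD is the
# scatter that remains after removing it.
import numpy as np

def systematic_fit(experimental, calculated):
    exp = np.asarray(experimental, dtype=float)
    calc = np.asarray(calculated, dtype=float)
    slope, intercept = np.polyfit(exp, calc, 1)
    residuals = calc - (slope * exp + intercept)
    return slope, intercept, float(np.sqrt(np.mean(residuals ** 2)))
```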
If we can believe in the methodology then we can start to use it as a tool for analysing future data sets. But until then we can’t.
We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.
You wrote: ….only Open collection of spectra is NMRShiftDB – open nmr database on the web.
Also the SDBS system can be downloaded – as far as I remember it is limited to 50 entries per day (this should be no problem, because QM calculations are quite slow compared to HOSE/NN/increment methods).
If you need 500 entries with a certain specification (e.g. by elements, molecular weight, partial structure, etc.) and you want to undertake a joint project, please let me know …..
You wrote: ….We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. …
NO IDEA ??????? !!!!!!!!!!!!!!!!!!!!!
Please have a look at http://nmrpredict.orc.univie.ac.at where a detailed analysis has been presented – this analysis led to a controversial discussion on many blogs; YOU YOURSELF commented on it in YOUR BLOG (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346, posted Thursday, May 31st, 2007 at 2:00 pm). AFTERWARDS a lot of activity was started, documented on http://nmrshiftdb.org – a lot of errors were detected by me, by Tony Williams from ACD and by ‘internal cross-checks’ (BTW: Christoph, please let me know which error checks have been applied – hopefully this information is OPEN!). I have written a lot of emails to one of the editors of NMRShiftDB when I found further errors – I assume most of these errors have already been corrected. My last download is from MAY ’07 and I didn’t check afterwards in a systematic way. Please don’t rewrite history to suit your needs!
You wrote: …. We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. This was deliberate, and part of the eScience question: can a robot automatically determine the quality of data? Several correspondents have missed this point – it was more important to answer this question than whether the data were “good”……..
I think the question ‘Can a robot automatically determine the quality of data?’ has already been answered in a very impressive way by me and by ACD – both of us have run our automatic checks on this set of data and found many errors. The level of sophistication of the CSEARCH algorithms can be seen in the automatic detection of the drawing error in the Pachyclavulide series (I posted this error on this blog a few days ago).
Peter, you have some interesting conclusions in this post, and some are contrary to earlier observations made by others. First some comments:
1) Regarding “It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers – at least Elsevier allow us to do this.” It is excellent news that one of the biggest publishers around allows you to robotically download spectra from their papers. Very good indeed!
2) Regarding “the only Open collection of spectra is NMRShiftDB – an open NMR database on the web.” Just to clarify, these are NOT actually NMR spectra. Unless NMRShiftDB has a capability I am not aware of, NMRShiftDB is a database of molecular structures with associated assignments (and maybe in some cases just a list of shifts; perhaps not all have to be assigned). As an NMR spectroscopist, the spectrum itself is what comes off the instrument – the one that can be re-referenced, phased, baseline corrected etc. NMRShiftDB is limited (I think) to a peak listing. This should not detract from the value of the data collection, but it may cause confusion. Certainly one conversation I have had in the past 24 hours suggests that people think that NMRShiftDB contains NMR “spectra”. But Christoph named it appropriately as a SHIFT database.
3) Regarding “We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality.” I think you had an idea, and I point you to your own blog postings: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=278; http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346. I recall you followed the scientific discourse between Wolfgang Robien and ACD/Labs regarding the quality and supported our conclusions that the data were of good quality. I recommend following the NMRShiftDB homepage (http://nmrshiftdb.ice.mpg.de/) where such reports get posted by Christoph as they occur:
a) NMRShiftDB Critique 2007-04-05 02:01 – NMRShiftDB
Prof. Wolfgang Robien from Vienna, maker of the CSearch system, has evaluated NMRShiftDB’s data quality and found a number of partly severe errors. Robien’s critique is summarized on his own site here.
b) NMRShiftDB review 2007-05-03 04:12 – NMRShiftDB
Antony Williams published an NMRShiftDB quality review in his ChemSpider blog. See here
c) Quality Campaign 2007-07-02 08:03 – NMRShiftDB
Between 2007-3-10 and today, altogether 72 spectra and/or structures in NMRShiftDB have been edited by the community to correct errors identified in analyses by Wolfgang Robien and Antony Williams as well as internal cross-checks.
4) Regarding “We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long.” The 20 heavy atom limit is a real constraint. I judge that most pharmaceuticals in use today are over 20 heavy atoms (Xanax, sildenafil, ketoconazole and Singulair, for example). I would hope that members of the NMR community are watching your work, as it should be of value to them, but I believe 20 atoms is a severe constraint. That said, I know that with more time you could do larger molecules, but a day per molecule is likely enough time investment.
5) Regarding “Molecules with floppy groups cannot be easily analysed.” So, anything with a side chain then.
6) Regarding “So we have a final list of about 300 candidates.” Out of a total of over 20,000 individual structures, your analysis was performed on 1.5% of the dataset. How many data points was this, out of interest? A structure is clearly not a data point, since each structure has multiple nuclear centers and you are predicting individual shifts. I’ll estimate about 3,000 shifts. The earlier validation I reported on was 214,000 shifts (http://www.chemspider.com/blog/?p=37), but that was an old version of the database and it has grown since then.
7) Regarding “probably 20% of entries have misassignments” and “transcription errors. Difficult to say, but probably about 1-5%”: this suggests about 25% of my estimated 3,000 shifts are in error. That is about 750 data points, and this conclusion was made from the study of 300 molecules. For sure the 25% does not carry over to the entire database; it is of MUCH higher quality than that. My earlier posting suggested that there were about 250 BAD points. The subjective criteria are discussed here (http://www.chemspider.com/blog/?p=44). Wolfgang suggested about 300 bad points, but we were both being very conservative. You discussed the difference between 250 and 300 here on your blog, as you likely recall: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346
8) Regarding “We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.” I think your comments concern Wolfgang Robien and ACD/Labs. It is true that we have access to larger datasets, but we can limit the conversation to NMRShiftDB since we ALL have access to that. Robien’s and ACD/Labs’ algorithms can adequately deal with the NMRShiftDB dataset. For the neural-net and increment-based approaches, over 200,000 data points can be calculated in less than 5 minutes (http://www.chemspider.com/blog/?p=213). You have access to the same dataset and can handle 300 of the structures. Your statement is moot: it is NOT about database size but about algorithmic capabilities.