I posted our intermediate conclusions on Nick Day’s computational NMR project, and have received two lengthy comments. I try to answer all comments, though, as Peter Suber says in his interview, comments sometimes lead to discourses of indefinite length. I am taking the pragmatic view that I will mainly address comments (and subcomments) that:
- address our project as we defined it (not necessarily the project that others would like us to have done)
- add useful information (especially annotation of suspected problems)
- or show that our scientific method is flawed or could be strengthened
- relate to Open issues. In our present stage of robotic access to and re-use of data we can only realistically use databases that explicitly allow re-use of data and do not require special negotiation with the owners
There has been a great deal of discussion (far more than we had expected) on our project. Some of this has been directly relevant in responding to our direct requests for annotation of specific outliers, and we acknowledge posts from Egon Willighagen, Christoph Steinbeck, Jean-Claude Bradley, Wolfgang Robien, and the University of Mainz. A lot of the discussion has been of general interest but not directly relevant to the aim of the project, which was to show what fully automatic systems can do, not to create specific resources (“clean NMRShiftDB”). It is possible, though not necessary, that the work might be more generally valuable depending on what we found.
Wolfgang Robien: October 27th, 2007 at 12:03 pm
You wrote: ….only Open collection of spectra is NMRShiftDB – open nmr database on the web.
Also the SDBS system can be downloaded – as far as I remember it’s limited to 50 entries per day (should be no problem because QM calculations are quite slow compared to HOSE/NN/Incr)
PMR: thank you for reminding us. I have not used SDBS and it looks a useful resource for checking individual structures but it is inappropriate for the current work as:
- robotic download is forbidden
- there is no obvious way of downloading sets of structures – they need to be the result of a search
- there is no obvious machine-readable connection table (there is a semantically void image).
- there are no 3D coordinates (this is not essential but it meant Nick could work almost immediately)
It is possible that if we wrote to the maintainers they would let us have a dataset, but this would double the size of the project at least.
If you need 500 entries with a certain specification (e.g. by elements, molwt, partial structure, etc.) and you want to perform a common project, please let me know …..
PMR: Thank you. This is a generous offer and we may wish to take it up. Contrary to your comment that I am an NMR expert, I’m really not – I’m an eChemist and this exercise in NMR is because I wish to liberate NMR from the pages of journals. If we find that Henry’s program needs more data or that yours has fewer problems it could be extremely valuable. We would wish the actual data to be Open so that others can re-use it.
PMR is quoted as “We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. …”
That is quite true. There were a number of public comments on NMRShiftDB ranging from (mild) approval to (mild) disapproval, some scalar values for RMS against various prediction programs and some figures on misassignments, etc. These gave relatively little indication of the detailed data quality – e.g. the higher moments of variation.
If there is currently a full list of NMRShiftDB entries with your annotations this would be valuable. Currently I can find a number of comments on individual entries with gross problems at
http://nmrpredict.orc.univie.ac.at/csearchlite/hallofshame.html
but these seem anecdotal rather than a complete list.
PMR: and the second set of comments
4) Regarding “We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long.” The 20 heavy atom limit is a real constraint. I judge that most pharmaceuticals in use today are over 20 atoms (xanax, sildenafil, ketoconazole, singulair for example). I would hope that members of the NMR community are watching your work as it should be of value to them, but I believe 20 atoms is a severe constraint. That said, I know that with more time you could do larger molecules, but a day per molecule is likely enough time investment.
PMR: We have strategies for dealing with larger molecules but are not deploying them here.
6) Regarding “So we have a final list of about 300 candidates.” Out of a total of over 20000 individual structures your analysis was performed on 1.5% of the dataset. How many data points was this, out of interest?
PMR: I expect about 6-20 shifts per entry. Some overlap because of symmetry.
7) Regarding “probably 20% of entries have misassignments and transcription errors. Difficult to say, but probably about 1-5%”. This suggests about 25% of shifts associated with my estimated 3000 shifts are in error. This is about 750 data points and this conclusion was made by the study of 300 molecules. For sure the 25% does not carry over to the entire database. It is of MUCH higher quality than that. My earlier posting suggested that there were about 250 BAD points. The subjective criteria are discussed here (http://www.chemspider.com/blog/?p=44). Wolfgang suggested about 300 bad points but we were both being very conservative. You discussed the difference between 250 and 300 here on your blog, as you likely recall: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346
PMR: Nick will detail these later. We believe that the QM method is sufficiently powerful to show misassignments of a very few ppm – I will not give figures before we have done the work. With known variance it is possible to give a formal probability that peaks are misassigned. I have shown some examples of what we believe to be clear misassignments, but we have not gone back to the authors or literature (which often does not have enough information to decide). I do not believe you can compare your estimates with ours as you and we have not defined what a misassignment is.
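As a minimal sketch of the point about known variance: if the errors of the calculated shifts are assumed to be Gaussian with a known standard deviation, the two-sided tail probability of an observed deviation can be computed directly. The function name and the 2 ppm standard deviation used in the example are illustrative assumptions, not figures from this project, and a small probability only marks a peak as a candidate for inspection rather than proving a misassignment.

```python
import math


def tail_probability(observed_ppm, calculated_ppm, sigma_ppm):
    """Two-sided probability of a deviation at least this large,
    assuming zero-mean Gaussian errors with standard deviation sigma_ppm.

    sigma_ppm is a hypothetical, externally supplied value; it is not
    a figure taken from the project described in the text.
    """
    if sigma_ppm <= 0:
        raise ValueError("sigma_ppm must be positive")
    z = abs(observed_ppm - calculated_ppm) / sigma_ppm
    # For a standard normal variable Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # A 10 ppm deviation with an assumed sigma of 2 ppm is a 5-sigma event.
    print(tail_probability(110.0, 100.0, 2.0))
```

Under this model, a peak whose deviation has a very small tail probability is unlikely to be explained by calculation error alone, which is what makes it worth checking against the original literature.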
8) Regarding “We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.” I think your comments are regarding Wolfgang Robien and ACD/Labs. It is true that we have access to larger datasets, but we can limit the conversation to NMRShiftDB since we ALL have access to that. Robien’s and ACD/Labs’ algorithms can adequately deal with the NMRShiftDB dataset. For the neural net and increment-based approaches, over 200,000 data points can be calculated in less than 5 minutes (http://www.chemspider.com/blog/?p=213). You have access to the same dataset and can handle 300 of the structures. Your statement is moot: it is NOT about database size but about algorithmic capabilities.
PMR: My statement was about size and quality of datasets and is completely clear. It has nothing to do with algorithms. I am not interested in comparing the speed of algorithms but am concerned about metrics for the quality of data. I shan’t discuss speed of algorithms unrelated to the current project.
PMR: …… Contrary to your comment that I am an NMR expert, I’m really not – I’m an eChemist and this exercise in NMR is because I wish to liberate NMR from the pages of journals. If we find that Henry’s program needs more data or that yours has fewer problems it could be extremely valuable. We would wish the actual data to be Open so that others can re-use it.
PMR is quoted as “We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality. …”
That is quite true. There were a number of public comments on NMRShiftDB ranging from (mild) approval to (mild) disapproval, some scalar values for RMS against various prediction programs and some figures on misassignments, etc. These gave relatively little indication of the detailed data quality – e.g. the higher moments of variation.
If there is currently a full list of NMRShiftDB entries with your annotations this would be valuable. Currently I can find a number of comments on individual entries with gross problems at
http://nmrpredict.orc.univie.ac.at/csearchlite/hallofshame.html
but these seem anecdotal rather than a complete list.
WR:
OK, you are not a NMR-spectroscopist, but you want to liberate NMR data from the pages of the journals:
There are so many people around working in this field, who are doing excellent science – they are coming from companies AND academia – I am quite sure they have different ideas about the value of the data and the access to them, but you never talked to them. You have talked to Christoph about that, I am quite sure, but did you also analyze his contributions ?
What are the facts ?
NMRShiftDB has been available for a few years – it holds at the moment about 20,000 structures with ca. 23,700 spectra
The project was started ca. 6 years ago; at that time there were a few ‘players’ around (ACD, BIORAD, SPECINFO, CSEARCH) – there is severe structural overlap between NMRShiftDB and their collections. Conclusion: There was not even the slightest intention by Christoph (as project manager) to get information about these collections and to select journals which are NOT covered by others – it would be really great for the scientific community to find spectral information which can’t be found anywhere else in a comfortable way. Furthermore this protocol would be very clever in the case that the collections can be combined sometime in the future.
Data quality: It has been known since March 2007 that there is a certain number of very severe errors – this has been shown by me and also ACD. Neither of us performed an exhaustive analysis, but detecting the ‘first layer’ of strange entries needs about 5 minutes of CPU-time when recalculating ALL C-SPECTRA; when applying a simple statistical test, it takes ca. 1 SECOND to recognize that.
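The comment does not say which statistical test was applied. Purely as an illustration of how cheap such a check can be, the sketch below flags shifts whose residual (observed minus predicted, in ppm) lies far from the median, using the median absolute deviation as a robust scale. The function name, the threshold `k` and the sample residuals are all hypothetical.

```python
import statistics


def flag_outliers(residuals, k=5.0):
    """Return indices of residuals lying more than k robust standard
    deviations from the median.

    The scale is 1.4826 * MAD, the usual consistency factor for
    normally distributed data. This is one possible test, not
    necessarily the one referred to in the comment above.
    """
    values = list(residuals)
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # More than half of the values are identical; nothing can be scaled.
        return []
    return [i for i, v in enumerate(values) if abs(v - median) > k * scale]


if __name__ == "__main__":
    # Made-up residuals in ppm; the sixth value is a gross error.
    print(flag_outliers([0.5, -1.0, 1.2, -0.3, 0.8, 25.0, -0.6]))
```

A median-based scale is used here because a single gross error inflates an ordinary standard deviation and can hide itself.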
Software quality: Which algorithm is new ? Sorry to say so: NO
Is there a better implementation of e.g. spectrum prediction: NO
(ACD,BIORAD,CSEARCH have it also available, slightly different implementations,
each difference is justified and therefore good science including solvent
dependency and stereochemistry – missing in NMRShiftDB)
Is there an alternative available: NO
(ACD has NN and Incr, CSEARCH has NN and Incr – again different implementations,
but based on good science)
The way to liberate NMR-spectra is definitely NOT to provide a poor software-package.
The way to liberate NMR-spectra is definitely NOT to type NMR-Textbooks into the computer
The way to liberate NMR-spectra is definitely NOT to use the data from the old BRUKER-collection (early ’80, the blue book) with a lot of misassignments, which has been already typed in by the DKFZ-people and some other (e.g. Arizona State University)
The way to liberate NMR-spectra is definitely NOT to ignore state-of-the-art data checking protocols
The way to liberate NMR-spectra is definitely NOT to bother the community with a trivial problem like N-Methyl-piperidone (this misassignment comes from the collection mentioned above)
You still insist on having NO idea about the data quality: (PMR: “When we started we had NO idea of the quality. …” That is quite true. There were a number of public comments on NMRShiftDB ranging from (mild) approval to (mild) disapproval, some scalar values for RMS against various prediction programs and some figures on misassignments, etc. These gave relatively little indication of the detailed data quality )
You complain ‘gave relatively little indication of the detailed data quality’: Now lets analyze this:
The sentence ‘gave relatively little indication of the detailed data quality’ implies that after YOUR analysis the detailed indication will be there. Now the facts:
Your calculations based on QM-approaches are limited to MW < 500, less than 21 heavy atoms, elements up to chlorine, no conformational problems, etc., etc. – therefore 500 molecules have been selected, in ca. 200 cases the calculation failed, what remains: 300 out of 23,000 – Now you assume you have provided the detailed analysis to the community ?!
Neither ACD nor I have the intention to do a thesis on this topic – we both simply used this dataset and applied our standard procedures on it ….. that’s it – the main difference is simply that we could handle all structures.
In order not to be misunderstood: An exhaustive comparison of QM-calculations against database-oriented techniques will be appreciated – this gives a measure of performance of QM against HOSE/NN/Incr. The other topics like data-quality of NMRShiftDB have been already investigated – no doubt, you will find some/many other errors which haven’t been detected so far. No doubt some can’t be corrected without redoing some measurements.
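The selection criteria listed in this comment (molecular weight below 500, fewer than 21 heavy atoms, elements up to chlorine) can be sketched as a simple filter. This is a hedged illustration only: the `Entry` class, the `is_candidate` function and the small element table are hypothetical and are not the project's actual code, and the "no conformational problems" criterion is not modelled at all.

```python
from dataclasses import dataclass

# (atomic number, standard atomic mass) for a few common elements.
# The table is deliberately short; unknown symbols are rejected.
ELEMENTS = {
    "H": (1, 1.008), "B": (5, 10.81), "C": (6, 12.011), "N": (7, 14.007),
    "O": (8, 15.999), "F": (9, 18.998), "Si": (14, 28.085),
    "P": (15, 30.974), "S": (16, 32.06), "Cl": (17, 35.45),
    "Br": (35, 79.904), "I": (53, 126.904),
}


@dataclass
class Entry:
    """Hypothetical database entry: a name and element counts."""
    name: str
    formula: dict  # element symbol -> number of atoms


def is_candidate(entry, max_mw=500.0, max_heavy=20, max_z=17):
    """Apply the three criteria from the text: MW below max_mw, at most
    max_heavy non-hydrogen atoms, and no element heavier than chlorine."""
    mw = 0.0
    heavy = 0
    for symbol, count in entry.formula.items():
        if symbol not in ELEMENTS:
            return False
        z, mass = ELEMENTS[symbol]
        if z > max_z:
            return False
        mw += mass * count
        if z > 1:
            heavy += count
    return mw < max_mw and heavy <= max_heavy


if __name__ == "__main__":
    ethanol = Entry("ethanol", {"C": 2, "H": 6, "O": 1})
    sildenafil = Entry("sildenafil", {"C": 22, "H": 30, "N": 6, "O": 4, "S": 1})
    print(is_candidate(ethanol), is_candidate(sildenafil))
```

Sildenafil, one of the pharmaceuticals mentioned in the earlier comment, passes the molecular weight test but has 33 heavy atoms, so a filter of this kind would exclude it.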
A very detailed and excellent analysis by Antony Williams about what has been already done and what is still open can be found on his blog on http://www.chemspider.com
The community definitely DOES NOT need a ‘war’ about ‘closed systems’ versus ‘Open Systems’ – what you promote sounds like this. What the community really needs is a CONSENSUAL agreement, how to access (spectral) data for the benefit of all people involved. You are preaching fundamentalism in this topic – again have a look into history ….. Christoph Steinbeck calls himself an ‘Evangelist for Open Data’ (see his CV on his webpage http://www.steinbeck-molecular.de ). In NMR-spectroscopy we don’t need a Messiah, we need excellent scientists doing excellent work.
Peter – Wolfgang and I have been expressing ourselves on this blog in a very parallel manner. Both of us have made a number of pleas at this point to bring value to the community as a whole based on the enormous contribution in time made by you, Henry, Nick and even the readers and people posting to this blog. The flavor of this project continues to shift and if you read through the multitude of blogs you will see that it has morphed with time. I’m with Wolfgang though: “The community definitely DOES NOT need a ‘war’ about ‘closed systems’ versus ‘Open Systems’ – what you promote sounds like this. What the community really needs is a CONSENSUAL agreement.”
I see little value at this point in continuing this work unless you are trying to prove the SPECTRa project and source your next round of funding to expand into the Worldwide Molecular Matrix. Then I think we’ll just end up colliding over ChemSpider versus WWMM. Of course…this would be old news too, unfortunately. I’m finding it hard to build community in an environment of competition.
(1, 2) I have noted these comments and believe there is common ground (3)