I have been pleased by the interest in Open Notebook NMR but the current discussions have widened far too much to be useful, so I want to be absolutely clear what the project and its limits are.
This is a part of Nick Day’s PhD thesis at Cambridge.
The only motivation is for Nick to be able to do good science, with advice and help which is publishable in his thesis. I repeat:
This is a part of Nick Day’s PhD thesis at Cambridge.
I made that absolutely clear at the beginning. Any broadening of the project is a distraction and could be detrimental to his work. For example I follow what Alicia is doing with Jean-Claude but I would never dream of suggesting she does other than what is agreed between them.
It is extremely unusual for a PhD student to be exposing his work as Open Notebook Science. I only suggested it because he is a good student and I believe his technical competence and commitment is such that he will do careful and valuable work. Remember that if anything is wrong it is extremely public.
We also made clear what the limits of the project were and I will repeat them as our hypothetical report:
We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were "wrong" – i.e. the reported chemical shifts did not fit the reported spectra values.
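The arithmetic in this hypothetical report can be sketched in a few lines of Python. This is only an illustration of the two steps described (referencing to TMS, then an RMS deviation against the observed shifts); the function names and all the numbers are invented and are not Nick's code or data.

```python
import math

def predicted_shift(sigma_tms, sigma_atom):
    """Predicted shift in ppm: the calculated TMS value minus the atom's
    calculated value, both assumed to be computed in the same solvent."""
    return sigma_tms - sigma_atom

def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(predicted))

# Invented values, for illustration only.
sigma_tms = 190.0                    # hypothetical calculated TMS value
shieldings = [60.0, 160.0, 20.0]     # hypothetical calculated values for three carbons
observed = [128.0, 31.0, 172.0]      # hypothetical observed shifts (ppm)

predicted = [predicted_shift(sigma_tms, s) for s in shieldings]
rms = rms_deviation(predicted, observed)
print(predicted, round(rms, 2))
```

Each refinement step in the report (removing agreed misassignments, applying a correction, averaging conformers) would change the inputs and the RMS would be recomputed in the same way.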
This is precisely what we have been doing and we are sticking to it. It would be irresponsible for a supervisor and student to do otherwise unless unforeseen difficulties arose. It has gone according to plan:
- Initially the RMS deviation was ca. 3 ppm. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. We have had comments from Christoph, Henry, Egon and Jean-Claude which have allowed us to remove 3 entries.
- The largest deviations were then due to C-Hal systems, where a correction was applied (with theoretical backing from Henry). We are now applying this correction and will report as soon as the new code has been written (because we use XML-CML this has a rapid turnround).
- The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. A very serious error occurred in an entry from DKFZ spektren which was due to human editing of a spectrum (identified by Christoph). It is therefore not unreasonable to remove all entries from this source (note we have to remove all without inspecting them – we cannot just choose which we like).
- A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. It is clear from inspection of the data that some compounds have equipopulated conformers (e.g. C6H5-CH(=O)) and that it is legitimate to identify this framework symmetry group and average over equivalent atoms if they have equal shifts.
- This established a protocol for predicting NMR spectra to xxx% confidence. This is the primary aim of the work – to be able to show that machines can make decisions without humans within a given confidence level. It depends on being able to find a set of data which are accepted as accurately assigned. That is what we are asking the community for.
The immediate Open task is to help annotate outliers, which are being released as soon as they are identified and we repeat – any help with this is very much appreciated.
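Two of the steps above, averaging over symmetry-equivalent atoms and picking out the largest deviations for annotation, can be illustrated with a small Python sketch. It is a guess at the shape of the calculation, not the actual code: the atom labels, the grouping dictionary, the 3 ppm threshold and all the numbers are assumptions made up for the example.

```python
from collections import defaultdict

def average_equivalent(shifts, groups):
    """Replace the predicted shift of each atom in a symmetry group by the
    group mean. `groups` maps atom id -> group label (hypothetical format)."""
    members = defaultdict(list)
    for atom, label in groups.items():
        members[label].append(atom)
    averaged = dict(shifts)
    for atoms in members.values():
        mean = sum(shifts[a] for a in atoms) / len(atoms)
        for a in atoms:
            averaged[a] = mean
    return averaged

def worst_outliers(predicted, observed, threshold):
    """Return (atom, deviation) pairs with |deviation| above the threshold,
    largest deviation first."""
    devs = [(a, predicted[a] - observed[a]) for a in predicted if a in observed]
    flagged = [(a, d) for a, d in devs if abs(d) > threshold]
    return sorted(flagged, key=lambda pair: abs(pair[1]), reverse=True)

# Invented numbers for a benzaldehyde-like fragment.
predicted = {"C2": 131.0, "C6": 127.0, "C7": 200.0}
groups = {"C2": "ortho", "C6": "ortho"}        # assumed equivalent atoms
observed = {"C2": 129.5, "C6": 129.5, "C7": 192.0}

averaged = average_equivalent(predicted, groups)
outliers = worst_outliers(averaged, observed, threshold=3.0)
print(averaged, outliers)
```

In this toy case the two ortho carbons no longer appear as deviations once averaged, and only the remaining large deviation would be put forward for community annotation.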
The project that Chemspider has identified is completely distinct from Nick Day’s thesis:
“The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled by the algorithm. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach.”
This has no bearing in any way on Nick’s work – it does not help certify entries as “correct”. The last sentence actually suggests that Chemspider believes their work is superior to what Nick is doing. Without a properly annotated data set such claims are meaningless.
Finally I should clarify what is meant by Open since there is confusion:
Ryan Sasaki Says:
October 24th, 2007 at 12:16 pm
Hi Peter,
While I cannot speak for Brent Lefebvre, I have a couple of comments in relation to Open Notebook NMR and the potential involvement of commercial software companies in this study.
First of all, you mention that you will probably share your insights but not your data. If that is true, then I do not understand how this project can be referred as OPEN. If the dataset being used is not shared publicly, how can this be considered an open project?
We have so far shared every piece of data and metadata that we feel is fit to publish. Open does not mean “immediate”. The data is taken from NMRShiftDB which is Open. We have published our protocol as it is refined. We have said we are going to publish our outliers and we are doing so. When we get input from the community we shall publish more. The product will be a data set which will have community approval – along with the protocol this will be our major Open deliverables. Nick will have enough breathing space to make scientific discoveries – if any are to be made.
The biological community already operates in this way. When a lab does a protein structure they do not publish their photographs immediately. We do not expect them to, any more than we expect Rosalind Franklin to do so. But when the time comes for publication we shall publish all that is necessary to replicate the experiment – that is the key. And, along the way, we are asking the community for input. If they do so the result could be a data set that is relatively small but of very high quality and therefore useful for testing computational approaches.
Peter, I judge that we will not be able to validate the original hypothesis I suggested be looked at. This was whether or not GIAO predictions could outperform HOSE code, Neural Net or Increment based predictions. I understand your focus as a supervisor and Nick’s role as a student is to get the thesis issued and get Nick his PhD. It’s a valid focus…based on what I’ve seen of Nick’s work he deserves it.
At the end of the project I assume that NMRShiftDB will be “cleaned” with your input, and that you/Nick will have published performance statistics on that database and made the data Open? If so then we’ll wait until that stage and do the comparison at that time.
Nick…good luck with your thesis!
(1)
“At the end of the project I assume that NMRShiftDB will be “cleaned” with your input, and that you/Nick will have published performance statistics on that database and made the data Open? If so then we’ll wait until that stage and do the comparison at that time.”
NMRShiftDB is not our database it is Christoph’s, Stefan’s, and Cologne’s. We do not have rights to “clean” it.
We expect to get annotations from the following sources:
* humans (hopefully including yourself) about structures of interest, such as outliers. I think NMRShiftDB has mechanisms for adding annotations, but I think they are rather simple at present.
* application of our protocol. We expect the protocol to highlight certain entries as probably having certain errors. That would allow searches in the future to disregard these structures if they wished.
You mentioned in your paper that you had excluded certain structures from NMRShiftDB “for which obviously erroneous chemical shift assignments were detected”. Are there objective reasons for this exclusion? Have you communicated these annotations to NMRShiftDB curators? and have they added these annotations (this could be a useful filter for us to decide whether to exclude compounds).
Peter…excuse my confusion but I thought Christoph Steinbeck was onsite in Cambridge this week and would be discussing the project. A number of your posts have suggested that. You commented the project was 1-2 months in length and Nick’s intention was to identify outliers and potential errors in the data and you have made many commentaries about that date. NMRShiftDB is Open Data and my assumption was that you would be providing feedback, annotations, commentary to Christoph/Cologne/NMRShiftDB about these data points to allow them the opportunity to resolve/change/modify the data points in error since you would have likely determined reasons for the changes and providing the data back to Christoph/NMRShiftDB would be a good public service. My comments were that the NMRShiftDB would be cleaned with your input and the resulting data made Open. Based on your comment of “NMRShiftDB is not our database it is Christoph’s, Stefan’s, and Cologne’s. We do not have rights to “clean” it.” NMRShiftDB is now not Open Data. If you cannot download the data and remove bad data points etc and return the data to Christoph/NMRShiftDB with commentary then I remain confused about the rights around Open Data.
Regarding: “You mentioned in your paper that you had excluded certain structures from NMRShiftDB “for which obviously erroneous chemical shift assignments were detected”. Are there objective reasons for this exclusion? Have you communicated these annotations to NMRShiftDB curators? and have they added these annotations (this could be a useful filter for us to decide whether to exclude compounds).” Yes, of course the reasons were objective. I have already commented on your blog in my posts and on my own posts that these observations were returned directly to Christoph for him/his team to validate and check and make the appropriate decision/annotation in NMRShiftDB. According to Christoph’s comments a number of changes were made to the data already. See the comment at Sourceforge: http://sourceforge.net/forum/forum.php?forum_id=711772. Christoph was or is onsite..please ask him for details. Christoph was an author of the paper.
At the end of your work Nick/you/Henry will have generated a derivative work from the NMRShiftDB database. Errors will have been modified (I believe) and a final dataset will exist with which the team will have generated performance statistics regarding the GIAO predictions. This is a direct request to provide that file to address the hypothesis that HOSE code/Neural Network/Increment based predictions outperform GIAO. The work I want to do is to address this ongoing scientific question, and it is valuable work. Will you be willing to provide the data file? Thanks
(3) [Christoph Steinbeck was onsite in Cambridge this week and would be discussing the project]…. he was and did
[You commented the project was 1-2 months in length and Nick’s intention was to identify outliers and potential errors in the data and you have made many commentaries about that date]… he is doing exactly this and is more than half way through so the timescale is approximately correct
[NMRShiftDB is Open Data and my assumption was that you would be providing feedback, annotations, commentary to Christoph/Cologne/NMRShiftDB…] we are doing so. It is fairly slow as relatively few people are commenting on the outliers we have posted.
[My comments were that the NMRShiftDB would be cleaned with your input and the resulting data made Open…] It was never our intention to “clean NMRShiftDB”. Our initial studies have shown that at least 5% of entries need addressing in some way. That is > 1000 entries. We never intended to examine and annotate the whole of NMRShiftDB in this way and we never said so.
[Based on your comment of “NMRShiftDB is not our database it is Christoph’s, Stefan’s, and Cologne’s. We do not have rights to “clean” it.” …] I do not have write privileges at Cologne to replace their data with my own, nor should I have them. Christoph does not have write privilege on WWMM either. We take computer security seriously. We could, for example, take a copy of NMRShiftDB and “clean” it but we do not believe in forking Open projects without consent. So the most we can do is send Christoph our annotations. We can and will make the annotations Open. Whether he wishes to “clean” NMRShiftDB is his decision.
[NMRShiftDB is now not Open Data. …] I thought it was.
[If you cannot download the data and remove bad data points etc and return the data to Christoph/NMRShiftDB with commentary then I remain confused about the rights around Open Data…] As I have said I can return them. This is not technically trivial, and I would intend to consult with Christoph about the best mechanism for communicating annotations *and metadata*.
[According to Christoph’s comments a number of changes were made to the data already. See the comment at Sourceforge: http://sourceforge.net/forum/forum.php?forum_id=711772. Christoph was or is onsite …please ask him for details. Christoph was an author of the paper….] Do you have a list of these corrections? Can we assume that the data we have downloaded recently are free of these errors? Also, in particular, you could help us by making it clear whether the outliers we have identified were excluded from your dataset and if so whether they have been removed from NMRShiftDB or corrected.
[At the end of your work Nick/you/Henry will have generated a derivative work from the NMRShiftDB database….] We may or may not. Depends on how it works out. It is likely to be of the order of 300 compounds.
[Errors will have been modified (I believe) …] That may depend on whether we get help from the community in agreeing whether there are objective criteria for making modifications.
[ and a final dataset will exist with which the team will have generated performance statistics regarding the GIAO predictions….] We would hope so.
[… This is a direct request to provide that file …] If such a file is created we currently intend it will be Open.
[…to address the hypothesis that HOSE code/Neural Network/Increment based predictions outperform GIAO. the work I want to do is to address this ongoing scientific question and is valuable work…]
Open Data is available for almost any aspect of human endeavour. It can be used to test the hypothesis that NMR structure quality depends on the phase or quality of the electricity supply. If you wish to use this data to address a hypothesis – including that our work is outperformed by yours – you are welcome to do so. There are other restrictions on the use of people’s work which are not covered by Open Data.
(1) It is impressive to observe how a detailed analysis is applied to the NMRShiftDB data
(2) In order to be historically correct, I would like to mention that the NMRShiftDB debate, which obviously also initiated these investigations, was started on http://nmrpredict.orc.univie.ac.at/enjoy_its_free.html on March 12th, 2007.
(3) Copy from ‘pm286’ a few lines above:
* application of our protocol. ……..
It’s definitely your protocol, I never compared assignments with quantum-mechanical calculations in a systematic way. Within the data generation process of the CSEARCH-Project about 100 (!) different cross-checks are applied by my technicians responsible for data extraction and evaluation. This checking protocol is also applied at regular intervals to the whole dataset of more than 600,000 carbon-NMR’s I have online in my internal collection. At least 2 correction cycles are applied BEFORE data go into any external installation in order to ensure their quality.
(4) Within NMRShiftDB I never came across remarks what has been done to each dataset during its existence ( WHO has done WHAT, WHY and WHEN ) – this is simple GLP.
(5) Thanks.
I never came across remarks what has been done to each dataset during its existence ( WHO has done WHAT, WHY and WHEN ).
This is an important point for repositories – the creation of metadata.
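As a rough idea of what such metadata could look like, here is a minimal Python sketch of an annotation record carrying WHO, WHAT, WHY and WHEN for a dataset entry. The class, its field names and the example entries are all hypothetical; this is not NMRShiftDB's schema or any existing annotation mechanism.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Annotation:
    """One provenance record for a dataset entry (hypothetical format)."""
    entry_id: str   # identifier of the spectrum or structure
    who: str        # person or program making the change or remark
    what: str       # what was done
    why: str        # the reason given
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def history(annotations, entry_id):
    """Return the annotations for one entry, oldest first."""
    selected = [a for a in annotations if a.entry_id == entry_id]
    return sorted(selected, key=lambda a: a.when)

# Invented example records.
log = [
    Annotation("entry-0002", "curator B", "swapped two assignments",
               "flagged as outlier by a prediction protocol",
               datetime(2007, 10, 25, tzinfo=timezone.utc)),
    Annotation("entry-0001", "contributor C", "deposited spectrum",
               "initial submission",
               datetime(2007, 5, 2, tzinfo=timezone.utc)),
    Annotation("entry-0002", "contributor A", "deposited spectrum",
               "initial submission",
               datetime(2007, 4, 1, tzinfo=timezone.utc)),
]

trail = history(log, "entry-0002")
for a in trail:
    print(a.when.date(), a.who, a.what, a.why)
```

Keeping records like these append-only would let anyone downloading the data see whether an entry had been changed, by whom and for what reason, without the original values being lost.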
The biological community already operates in this way.
Right — and it’s not “Open Notebook”. That’s not my term to define or defend, but I’d like to hear from JCB on this one. To my mind, “Open” does not mean “immediate” but “Open Notebook” does. If there is any insider information, any but a purely logistical delay between the notebook and the public, then it’s not “Open Notebook”. Open Notebook means you publish what you have, not what you think is fit for publication; that’s kinda the point, I thought.
I have to agree with Bill. This is where things get fuzzy for me.
Peter..relative to your question “Do you have a list of these corrections? Can we assume that the data we have downloaded recently are free of these errors? ”
As I already commented in my original post above “these observations were returned directly to Christoph for him/his team to validate and check and make the appropriate decision/annotation in NMRShiftDB.”
Again…you need to ask Christoph whether the issues we identified were validated by him and changes were made in the DB. Some changes were made (see my comment above) as he stated in Sourceforge. What changes were made I cannot answer…we didn’t make them. Christoph and/or his team did.
Relative to your statement “you could help us making it clear whether the outliers we have identified were excluded from your dataset and if so whether they have been removed from NMRShiftDB or corrected.” Unless I am misinterpreting your statement, if the data points we suggested changing had been removed or corrected in the NMRShiftDB dataset then you would not now be seeing them, since you would have downloaded a changed dataset. Our identification of outliers was done with about an hour’s work only. Wolfgang Robien did a similar level of analysis. I judge that neither of us did a deep analysis per se.
I recommend you read this post for details: http://www.chemspider.com/blog/?p=44 . Specifically I was estimating “large errors” ..see the post for a basic definition. I commented “Glaring errors are less than 250 in number based on my subjective criteria. Again, this does not mean that there aren’t hundreds or thousands of errors buried in the data…they are not obvious errors and require more manual examination.”
In regards to making it clear whether the outliers you have identified were excluded, I cannot confirm at present. When I left ACD/Labs I returned all files associated with my work with them to the company as is usual. However, I have requested that one of my ACD/Labs colleagues source the file (or derivative file) so that it can be of value to you. I’ll likely comment on our earlier work on the validation of the NMRShiftDB on the ChemSpider blog in conjunction with the release of the JCIM paper on the recent work examining the NMRShiftDB. I’ll let you know when it’s posted.
Peter…to try and answer your questions with expediency I asked my colleague at ACD/Labs to check whether or not the information we provided to Christoph has been used to clean some of the errors on NMRShiftDB. My colleague, Kirill Blinov, has downloaded the October NMRShiftDB file and compared with the earlier file from April. Review of the data suggests that ALL of the feedback we gave has been used and the erroneous data have been cleaned up. Whether this meant deleting shifts or reassigning/swapping shifts I cannot comment as I have not had time to check. This is where annotation of the data is of value as Wolfgang Robien suggested.
Kirill has also run through the most recent dataset looking for new outliers and I will discuss this separately on the ChemSpider blog. I’ll put the entire report there for you/Nick/Henry to use in your Open Collaborative NMR studies and to help your project. This is our effort to build the bridge I commented about initially…”I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.”
Relative to your comment “The last sentence actually suggests that Chemspider believes their work is superior to what Nick is doing. Without a properly annotated data set such claims are meaningless.”
This entirely misrepresents …and is judgment and opinion. To clarify:
1) ChemSpider does not have any NMR prediction algorithms…at all. I have invited my old colleagues at ACD/Labs to participate in the project to compare GIAO with HOSE/NN/Inc. THEY have NMR prediction algorithms. I’m sure Wolfgang would be interested too. It is a question I have had personally for many years so I am investigating it for personal satisfaction
2) No offense is meant to Nick’s work and certainly no statement of superiority to Nick’s work is made. The question on the table is the superiority of algorithmic performance. As a scientist I choose to deal with the data and the science objectively. If the answer is that any of HOSE/NN/Inc outperform GIAO predictions then that is a statement for this point in time until there are changes in algorithms. Analysis of this type drives improvement. Nick’s work will deliver data for comparison..and that would likely have been executed superbly.
3) A dataset can be imperfect and still be analyzed to good effect. If three parties take the SAME dataset, warts and all, and analyze with their algorithms then the outcomes should be statistical results from GOOD data and correct identification of outliers. If the BAD data (outliers) are not observed (i.e. are modeled in and are no longer outliers and cannot be identified as bad data) then that too is valuable data.