Data validation and protocol validation

This post replies to an ongoing debate about the quality of data and Open vs Closed data and systems. It’s specifically about NMR (spectroscopy) but my points are general. Since I have been publicly critical of some systems I must be prepared to take criticism myself (as happens here).
In “Update: Robien on NMRShiftDB”, Ryan Sasaki (Technical Marketing Specialist for ACD/Labs) writes [PMR’s comments interspersed]:

If you have read my earlier post, you will be aware of Wolfgang Robien’s critique of the NMRShiftDB. Following this critique, Tony Williams from the ChemSpider Blog and Peter Murray-Rust from the Unilever Cambridge Centre for Molecular Informatics replied to Wolfgang’s comments.

Well now, it appears that Wolfgang has responded to Tony’s comments. You can find his response here.
It appears that Wolfgang remains firm in his stance that the NMRShiftDB is not a good resource for scientists as it contains too many errors. He continues with the comments, “But: Enjoy – it’s free!”

PMR: My case has been that science is impoverished by lack of access to data and information. Neither is free, but there are new methods which lower the costs dramatically and also redistribute them. “Free” may mean underfunded and therefore lower quality, or it may mean “open” and capable of dramatic improvement by the community. In the case of NMRShiftDB I am firmly of the opinion that it leads the way in opening access to scientific information. If the community wishes, it can use it as a growing point to develop more and better data. If they don’t, they will continue to use existing non-open systems (or in most cases not use anything at all).
I also state publicly that I support the activity of NMRShiftDB for several reasons. First, data (which for me are the central point of NMRShiftDB in this discussion):

  • It allows and promotes the public aggregation of data.
  • It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.
  • It allows the public to identify errors and report them. It also allows the creation of a developer/committer community that spreads the load of this process.
  • It allows mashups against other data resources (computation, crystal structures, etc.)
  • It acts as a model system that can be adapted for laboratories that wish to develop their own Open data aggregation systems. We had a DAAD-British_Council grant to collaborate with Koeln directly for this process and we see NMRShiftDB as a potential model for the extension of our SPECTRA program to capture data into institutional repositories.
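
As a rough illustration of the automatic quality check mentioned in the list above, the sketch below flags chemical shifts that deviate strongly from a predicted value. This is not NMRShiftDB's actual code: the function name, the tolerance and the sample values are all assumptions made up for the example.

```python
def flag_suspect_shifts(observed, predicted, tolerance_ppm=10.0):
    """Return the atom labels whose observed shift differs from the
    predicted shift by more than tolerance_ppm (a hypothetical threshold)."""
    suspects = []
    for atom, obs in observed.items():
        pred = predicted.get(atom)
        if pred is None:
            continue  # no prediction available, so nothing to compare
        if abs(obs - pred) > tolerance_ppm:
            suspects.append(atom)
    return suspects


# Made-up 13C shifts (ppm), purely for illustration; not real data.
observed = {"C1": 128.5, "C2": 21.3, "C3": 170.1}
predicted = {"C1": 127.9, "C2": 77.0, "C3": 169.4}

print(flag_suspect_shifts(observed, predicted))
```

A check of this kind only points at values worth a second look; deciding whether a flagged value is really wrong still needs a person or a better model.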

Now software. I have no comment as to the relative merits of NMRShiftDB software against commercial systems. However, the history of Open Source in chemistry has shown that within a few years communally developed software can become a leader in its field. Seven years ago relatively few people had heard of Jmol; now it is one of the leading display packages and is widely used by publishers, pharma companies, etc. Similarly, OpenBabel was a mess five years ago, and the community has since put in so much work that it has become the leading format conversion tool. It is therefore quite possible that NMRShiftDB software can do the same for the NMR community. Certainly if anyone is intending to build an NMR repository I would urge them to look closely at NMRShiftDB.

So I have a couple of responses in regard to Wolfgang’s comments in his follow-up:
When doing this job in a more systematic way not using specific examples as given here, the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.
Is 300 vs. 250 errors in a dataset of over 200,000 chemical shifts SIGNIFICANT? Is a difference of 50 errors in this dataset statistically significant? That’s 0.025%. I await Wolfgang’s final results and then we can judge whether it is significant. Meanwhile, he should also read the document we produced comparing the prediction accuracy between ACD/CNMR Predictor and Modgraph’s NMRPredict if he wants to challenge our findings. I think it is a good place to pick up our conversation.
“I definitely do not claim, that collections like CSEARCH, NMRPredict and SPECINFO are free of errors – the desired level of errors is always 0.0%; a value which can’t be reached – the acceptable limit is clearly below 0.1%, maybe 0.05% is good compromise between dream and reality.”
I agree, as I mentioned in my last post that while the desired level of error is 0.0%, this is a value that cannot be reached. I certainly would not claim that our prediction databases are free of error. Further, our work reveals about 8% errors in the form of mis-assignments, transcription errors, and incorrect structures within the peer-reviewed literature we comb. Error is human nature.
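
The percentages quoted in this exchange can be checked with a few lines of arithmetic. The sketch below assumes a round total of 200,000 shifts (the post says "over 200,000", so the true rates would be slightly lower); the function and constant names are invented for the example.

```python
TOTAL_SHIFTS = 200_000  # assumption: round figure taken from "over 200,000 chemical shifts"


def error_rate_percent(errors, total=TOTAL_SHIFTS):
    """Express a count of erroneous entries as a percentage of all entries."""
    return 100.0 * errors / total


# Error counts mentioned in the discussion above.
for label, count in [("earlier limit", 250), ("intermediate count", 300), ("difference", 50)]:
    print(f"{label}: {count} errors = {error_rate_percent(count):.3f}%")
```

The same function can be used to compare a dataset against the 0.1% and 0.05% limits quoted above once a final error count is available.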

PMR: One of the core skills for all 21st C humans is to make judgements about the usability of any information. Without NMRShiftDB my current access to spectra is minimal. If I have 20,000 entries with 1% error that is an enormous advance. Biologists work every day with the knowledge that their gene identifications, sequences, annotations, etc. have serious errors and they try to measure that error rate.

Let me say, I am very confused by the positioning of this question to Christoph Steinbeck:
“Why do you “reinvent” existing systems – there are a lot of systems (with much better performance !) already around  (a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)”
Why reinvent existing systems? To improve! To provide better resources for NMR spectroscopists and scientists around the world! While there are certainly better-performing systems to date, there is no reason to believe that these existing systems cannot be surpassed in terms of performance. Further, they offer an alternative to those institutions that do not have access to commercial products.
I think that Wolfgang is misunderstanding something here. From his writing, it seems that he feels threatened by the NMRShiftDB and is trying too hard to discredit the hard work and ideas behind this open source collection. What NMRShiftDB is providing is something very different from anything the commercial products he names are offering. It is a truly open access and open source offering where scientists and spectroscopists can freely share their data and build an NMR database that is freely available to the scientific community.
It’s FREE! It’s not a commercial product like the ones he compares it to!

PMR: I would re-iterate this and add: It’s OPEN.

Christoph’s group is handling this very well and he mentions himself,
validations like Robien’s and the ones performed by us help make a strong case for open access and open source policy.

PMR: Certainly. We are in the process of going live with CrystalEye, a near-zero-cost crystallographic knowledge base. We have made efforts to identify the error rate (which is lower than NMR but non-zero). Our value will be judged on the validity of the protocol, not the validity of individual entries (though we shall be adding automatic checks to them).

Finally, as I mentioned above, I can only make the assumption that Wolfgang has not seen my blog posting that compares the results of his algorithm vs. ACD/Labs. It should make for an interesting discussion.

PMR: In the future the means of publishing pre-validated data will continue to increase so large amounts of Open high-quality data will become available. The current method of human curation of data will only be useful where the values of the data are so important that life or law depends on them.
The role of the primary publisher is critical. If they want they can help speed up this process; if they want to possess and constrict (cf. Wiley copyrighting data) they will slow it down, but ultimately lose both the battle and their credibility.
I shall write more on our strategy in coming weeks.


3 Responses to Data validation and protocol validation

  1. Pingback: ChemSpider Blog » Blog Archive » Further Comments on the Quality of NMRShiftDB and NMR Prediction Algorithm Validation

  2. I have blogged on your comments on the ChemSpider blog with a track back and we are in general agreement re the intent and value of the NMRShiftDB.
    I wanted to comment separately on “The role of the primary publisher is critical.” I agree that they can make it a lot easier to extract information, and let’s discuss NMR data for now since this is the focus of this discussion. Validation engines will be required to confirm literature NMR data since year on year we have identified 8% errors in the peer-reviewed literature. Your comment re. 1% is one concern…8% is at a whole different level. Improved automated checking of data is possible. It is one of our primary missions to perform structure verification by NMR as well as auto-assignment and computer assisted structure elucidation. These technologies are not in their infancy…they are on the maturity curve now. The adoption of such tools by publishers, whether commercial or open source, will be essential if the generation of Open Access QUALITY databases is to proceed. I think I’m speaking to the converted of course….
    As an example of how computer algorithms for validation of NMR assignments can outperform even skilled spectroscopists I highlight the debacle around hexacyclinol. A search on this term tells an interesting story, described as turning “into the biggest stink-bomb in organic synthesis in many years” (http://pipeline.corante.com/archives/2006/07/23/hexacyclinol_rides_again.php). The Chemical Blog declares “La Clair to get ass handed to him on hexacyclinol” (http://www.thechemblog.com/?p=108). The story regarding NMR validation algorithms comes AFTER the material was synthesized and AFTER a crystal structure proved the structure and AFTER full H1 and C13 assignments were made of the material. The algorithm went on to show that the assignments were incorrect, allowing 7-bond couplings. We have worked with the authors to reassign the molecule and a publication is in preparation to report on the FINAL assignments, and potentially the end of this story.

  3. Pingback: Ryan's Blog on NMR Software
