Data validation and protocol validation
This post replies to an ongoing debate about the quality of data and Open vs Closed data and systems. It’s specifically about NMR (spectroscopy) but my points are general. Since I have been publicly critical of some systems I must be prepared to take criticism myself (as happens here).
In Update: Robien on NMRShiftDB, Ryan Sasaki (Technical Marketing Specialist for ACD/Labs) writes [PMR's comments interspersed]:
If you have read my earlier post, you will be aware of Wolfgang Robien’s critique of the NMRShiftDB. Following this critique, Tony Williams from the ChemSpider Blog and Peter Murray-Rust from the Unilever Cambridge Centre for Molecular Informatics replied to Wolfgang’s comments.
Well now, it appears that Wolfgang has responded to Tony’s comments. You can find his response here.
It appears that Wolfgang remains firm in his stance that the NMRShiftDB is not a good resource for scientists as it contains too many errors. He continues with the comment, “But: Enjoy – it’s free!”
PMR: My case has been that science is impoverished by lack of access to data and information. Neither is free, but there are new methods which lower the costs dramatically and also redistribute them. “Free” may mean underfunded and therefore lower quality, or it may mean “open” and capable of dramatic improvement by the community. In the case of NMRShiftDB I am firmly of the opinion that it leads the way in opening access to scientific information. If the community wishes, it can use it as a growing point to develop more and better data. If it doesn’t, it will continue to use existing non-open systems (or in most cases not use anything at all).
I also state publicly that I support the activity of NMRShiftDB for several reasons. Firstly, data (which for me are the central point of NMRShiftDB in this discussion):
- It allows and promotes the public aggregation of data.
- It contains mechanisms for assessing data quality automatically. For example, software can be run that flags values which are seriously in error.
- It allows the public to identify errors and report them. It also allows the creation of a developer/committer community that spreads the load of this process.
- It allows mashups against other data resources (computation, crystal structures, etc.)
- It acts as a model system that can be adapted by laboratories that wish to develop their own Open data aggregation systems. We had a DAAD-British Council grant to collaborate directly with Koeln on this process, and we see NMRShiftDB as a potential model for extending our SPECTRA program to capture data into institutional repositories.
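The automatic quality checks mentioned above can, at their simplest, be a range test: flag any reported shift that falls outside a plausible window for the nucleus. A minimal sketch in Python (the function name and the 13C range used here are my own illustration, not NMRShiftDB's actual validation code):

```python
# Minimal sketch of an automated range check for 13C chemical shifts.
# The plausible window (-10 to 250 ppm) is a rough illustration only,
# not the limits used by NMRShiftDB itself.

def flag_suspect_shifts(shifts_ppm, low=-10.0, high=250.0):
    """Return (index, value) pairs for shifts outside the plausible range."""
    return [(i, s) for i, s in enumerate(shifts_ppm)
            if not (low <= s <= high)]

spectrum = [14.1, 22.7, 128.5, 310.2, 77.0]  # one implausible value
print(flag_suspect_shifts(spectrum))          # -> [(3, 310.2)]
```

Real validators would go further (comparing each assignment against predicted values for the structure, for instance), but even a crude filter like this catches gross transcription errors automatically.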
Now software. I have no comment on the relative merits of the NMRShiftDB software against commercial systems. However, the history of Open Source in chemistry has shown that within a few years communally developed software can become a leader in the field. Seven years ago relatively few people had heard of Jmol – now it is one of the leading display packages and widely used by publishers, pharma companies, etc. Similarly, OpenBabel was a mess 5 years ago and the community has now put in so much work that they have made it the leading format-conversion tool. It is therefore quite possible that the NMRShiftDB software can do the same for the NMR community. Certainly, if anyone is intending to build an NMR repository I would urge them to look closely at NMRShiftDB.
So I have a couple of responses regarding Wolfgang’s comments in his follow-up:
“When doing this job in a more systematic way not using specific examples as given here, the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.”
Is 300 vs. 250 errors in a dataset of over 200,000 chemical shifts SIGNIFICANT? Is a difference of 50 errors in this dataset statistically significant? That’s 0.025%. I await Wolfgang’s final results and then we can judge whether it is significant. Meanwhile, he should also read the document we produced comparing the prediction accuracy between ACD/CNMR Predictor and Modgraph’s NMRPredict if he wants to challenge our findings. I think it is a good place to pick up our conversation.
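The percentages quoted here can be checked directly; a quick sanity calculation (the 200,000-shift total is the figure cited above):

```python
# Sanity check on the error-rate percentages in the discussion above.
total_shifts = 200_000  # approximate dataset size quoted in the post

for errors in (250, 300, 300 - 250):
    rate = errors / total_shifts
    print(f"{errors:>3} errors -> {rate:.3%}")
# 250 errors -> 0.125%
# 300 errors -> 0.150%
#  50 errors -> 0.025%
```

So even the larger figure of 300 errors amounts to well under a fifth of a percent of the dataset, which is the context in which "significant" needs to be judged.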
“I definitely do not claim, that collections like CSEARCH, NMRPredict and SPECINFO are free of errors – the desired level of errors is always 0.0%; a value which can’t be reached – the acceptable limit is clearly below 0.1%, maybe 0.05% is good compromise between dream and reality.”
I agree; as I mentioned in my last post, while the desired level of error is 0.0%, this is a value that cannot be reached. I certainly would not claim that our prediction databases are free of errors. Further, our work reveals about 8% errors (mis-assignments, transcription errors, and incorrect structures) within the peer-reviewed literature we comb. Error is human nature.
PMR: One of the core skills for all 21st-century humans is to make judgements about the usability of any information. Without NMRShiftDB my current access to spectra is minimal. If I have 20,000 entries with 1% error that is an enormous advance. Biologists work every day with the knowledge that their gene identifications, sequences, annotations, etc. have serious errors, and they try to measure that error rate.
Let me say, I am very confused by the positioning of this question to Christoph Steinbeck:
“Why do you “reinvent” existing systems – there are a lot of systems (with much better performance !) already around (a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)”
Why reinvent existing systems? To improve! To provide better resources for NMR spectroscopists and scientists around the world! While there are certainly better-performing systems to date, there is no reason to believe that these existing systems cannot be surpassed in terms of performance. Further, NMRShiftDB offers an alternative to those institutions that do not have access to commercial products.
I think that Wolfgang is misunderstanding something here. From his writing, it seems that he feels threatened by the NMRShiftDB and is trying too hard to discredit the hard work and ideas behind this open source collection. What NMRShiftDB is providing is something very different from anything the commercial products he names are offering. It is a truly open access and open source offering where scientists and spectroscopists can freely share their data and build an NMR database that is freely available to the scientific community.
It’s FREE! It’s not a commercial product like the ones he compares it to!
PMR: I would re-iterate this and add: It’s OPEN.
Christoph’s group is handling this very well, and as he mentions himself:
“validations like Robien’s and the ones performed by us help make a strong case for open access and open source policy.“
PMR: certainly. We are in the process of going live with CrystalEye, a near zero cost crystallographic knowledge base. We have made efforts to identify the error rate (which is lower than NMR but non-zero). Our value will be judged on the validity of the protocol, not the validity of individual entries (though we shall be adding automatic checks to them).
Finally, as I mentioned above, I can only assume that Wolfgang has not seen my blog posting that compares the results of his algorithm vs. ACD/Labs. It should make for an interesting discussion.
PMR: In the future the means of publishing pre-validated data will continue to increase, so that large amounts of Open, high-quality data will become available. The current method of human curation of data will only be useful where the values of the data are so important that life or law depends on them.
The role of the primary publisher is critical. If they want, they can help speed up this process; if they want to possess and constrict (cf. Wiley copyrighting data) they will slow it down, but ultimately lose both the battle and their credibility.
I shall write more on our strategy in coming weeks.