Data validation in publications

Tony Williams’ comment to my post (Data validation and protocol validation – May 31st, 2007) has several valuable themes which I expand on here and in later posts. Tony and I are in agreement here and are working towards something that we can separately and jointly promote. To summarise:

  • The published literature (even after peer review and technical editing) contains many factual errors.
  • Machines can help to eliminate many of these before and during the publication process; in severe cases they can prevent bad science.
  • Techniques include the direct computation of observed data and the comparison of data between datasets (a minimal sketch follows this list).
  • This is only reasonably affordable if the data are originally in machine-understandable form. PDF is not good enough.
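As a concrete (and purely illustrative) example of the comparison step, here is a minimal Python sketch that lines up reported 13C shifts against independently predicted values and flags large deviations. The atom labels, shift values and tolerance are invented placeholders, not data from NMRShiftDB or any real paper.

# Minimal sketch: flag reported 13C shifts that deviate sharply from
# independently predicted values. All numbers below are invented examples.

TOLERANCE_PPM = 5.0  # illustrative threshold only, not a recommended value

def flag_suspect_shifts(reported, predicted, tol=TOLERANCE_PPM):
    """Pair reported and predicted shifts by atom label and return
    the labels whose absolute deviation exceeds the tolerance."""
    suspects = []
    for atom, obs in reported.items():
        calc = predicted.get(atom)
        if calc is None:
            suspects.append((atom, obs, None))   # nothing to compare against
        elif abs(obs - calc) > tol:
            suspects.append((atom, obs, calc))   # deviation too large
    return suspects

# Hypothetical data for one structure: atom label -> shift in ppm
reported_shifts  = {"C1": 170.2, "C2": 128.4, "C3": 65.0}
predicted_shifts = {"C1": 169.8, "C2": 129.1, "C3": 35.2}

print(flag_suspect_shifts(reported_shifts, predicted_shifts))
# -> [('C3', 65.0, 35.2)] : C3 looks like a transcription or assignment slip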
  1. Antony Williams Says:
    May 31st, 2007 at 6:12 pm
    I have blogged on your comments on the ChemSpider blog with a trackback and we are in general agreement re the intent and value of the NMRShiftDB.
    I wanted to comment separately on “The role of the primary publisher is critical.” I agree that they can make it a lot easier to extract information and let’s discuss NMR data for now since this is the focus of this discussion. Validation engines will be required to confirm literature NMR data since year on year we have identified 8% errors in the peer-reviewed literature. Your comment re. 1% is one concern…8% is at a whole different level

PMR: It’s important to define what counts as an error (much of the debate about NMRShiftDB is precisely about what counts as an error). Because of the byzantine method of hand-publishing to PDF, many transcription errors creep in. Our rough estimate is that many, if not most, published papers in chemistry contain at least one transcription error (maybe only punctuation, but it still fouls machine-reading).
A second procedural error is associating the wrong molecule with the data. We don’t know how common this is, but we have certainly seen it. I suspect that these two errors run at the 1% level.
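One cheap machine check of this kind is to recompute a quantity from the reported structure and compare it with the number printed in the paper. The sketch below, for instance, recomputes an average molecular weight from a molecular formula; the formula, reported value and tolerance are made up for illustration.

import re

# Average atomic weights for a few common elements (rounded IUPAC values);
# the sketch only handles elements listed here.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}

def molecular_weight(formula):
    """Average molecular weight of a simple formula such as 'C9H8O4'."""
    weight = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        weight += ATOMIC_WEIGHTS[element] * (int(count) if count else 1)
    return weight

def check_reported_mass(formula, reported_mw, tol=0.5):
    """Flag a likely transcription error if the reported molecular weight
    disagrees with the weight recomputed from the formula."""
    calc = molecular_weight(formula)
    return abs(calc - reported_mw) <= tol, round(calc, 2)

# Hypothetical example: aspirin with a mistyped molecular weight in the paper
print(check_reported_mass("C9H8O4", 190.16))   # -> (False, 180.16)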
These are separate from “scientific errors”, where the “wrong” structure is proposed (see below) or where the assignment (i.e. the annotation) is “wrong”. I have no comment on these yet – maybe this is the 8%.

  1. Improved automated checking of data is possible. It is one of our primary missions to perform structure verification by NMR as well as auto-assignment and computer-assisted structure elucidation. These technologies are not in their infancy…they are on the maturity curve now. The adoption of such tools by publishers, whether commercial or open source, will be essential if the generation of Open Access QUALITY databases is to proceed. I think I’m speaking to the converted of course….

PMR: completely agreed. We have done the same thing with crystallography and discovered a number of experimental errors and artifacts. Routine calculation of molecular geometry and NMR spectra should now be a prerequisite for these types of studies.
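To make “routine calculation of molecular geometry” concrete, here is a minimal sketch that recomputes bond lengths from Cartesian coordinates and flags any outside a loose expected range. The coordinates, bond list and ranges are invented placeholders; a production check would use proper reference tables.

import math

# Loose illustrative ranges (angstroms) for two bond types; not reference data
EXPECTED_RANGES = {("C", "C"): (1.30, 1.60), ("C", "O"): (1.20, 1.50)}

def check_bond_lengths(coords, elements, bonds):
    """coords: atom index -> (x, y, z); elements: atom index -> symbol;
    bonds: list of (i, j) pairs. Returns bonds outside the expected range."""
    flagged = []
    for i, j in bonds:
        pair = tuple(sorted((elements[i], elements[j])))
        lo, hi = EXPECTED_RANGES.get(pair, (0.0, float("inf")))
        d = math.dist(coords[i], coords[j])
        if not lo <= d <= hi:
            flagged.append((i, j, round(d, 3)))
    return flagged

# Hypothetical three-atom fragment with one suspiciously long C-C bond
coords   = {0: (0.00, 0.0, 0.0), 1: (1.90, 0.0, 0.0), 2: (2.55, 1.2, 0.0)}
elements = {0: "C", 1: "C", 2: "O"}
bonds    = [(0, 1), (1, 2)]

print(check_bond_lengths(coords, elements, bonds))
# -> [(0, 1, 1.9)] : the 1.90 angstrom C-C bond is flagged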

  2. As an example of how computer algorithms for validation of NMR assignments can outperform even skilled spectroscopists, I highlight the debacle around hexacyclinol. A search on this term tells an interesting story, described as having turned “into the biggest stink-bomb in organic synthesis in many years” (http://pipeline.corante.com/archives/2006/07/23/hexacyclinol_rides_again.php). The Chemical Blog declares “La Clair to get ass handed to him on hexacyclinol” (http://www.thechemblog.com/?p=108). The story regarding NMR validation algorithms comes AFTER the material was synthesized, AFTER a crystal structure proved the structure and AFTER full H1 and C13 assignments were made of the material. The algorithm went on to show that the assignments were incorrect, allowing 7-bond couplings. We have worked with the authors to reassign the molecule and a publication is in preparation to report on the FINAL assignments…and potentially the end of this story.

PMR: fully agreed. The blogosphere had a field day with this and helped to raise the issue of quality in publishing.
There is a general feeling among many principals that scientific fraud or sloppiness is a serious enough problem that data must be deposited in repositories, so that questions such as this can be at least partially addressed by referring back to the raw data.
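The 7-bond coupling point above suggests one very simple automated sanity check: count the bonds between each pair of atoms claimed to couple and flag implausibly long paths. The sketch below does this with a breadth-first search over a hypothetical bond graph; the molecule, claimed couplings and cut-off are invented for illustration and have nothing to do with the actual hexacyclinol data.

from collections import deque

def bond_path_length(bonds, start, end):
    """Shortest number of bonds between two atoms, with bonds given as
    a dict mapping each atom to the set of atoms bonded to it."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        atom, depth = queue.popleft()
        if atom == end:
            return depth
        for nbr in bonds[atom]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, depth + 1))
    return None  # atoms not connected

def flag_long_range_couplings(bonds, couplings, max_bonds=4):
    """Return claimed couplings whose bond path exceeds max_bonds."""
    flagged = []
    for a, b in couplings:
        n = bond_path_length(bonds, a, b)
        if n is None or n > max_bonds:
            flagged.append((a, b, n))
    return flagged

# Hypothetical linear 8-atom chain 1-2-3-4-5-6-7-8
bonds = {i: set() for i in range(1, 9)}
for i in range(1, 8):
    bonds[i].add(i + 1)
    bonds[i + 1].add(i)

claimed = [(1, 3), (1, 8)]                 # the second implies a 7-bond coupling
print(flag_long_range_couplings(bonds, claimed))
# -> [(1, 8, 7)]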
So publishers can help by:

  • insisting that machine-readable raw data are available (a minimal sketch of such a check follows below)
  • using computer validation where possible.

Note, of course, that the crystallographers already do this – I shall blog on this again very shortly.
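To show how little machinery the first of these points needs, here is a minimal sketch of a submission gate that refuses (or at least warns about) depositions lacking the fields a validation engine would need. The field names and the example record are hypothetical, not any publisher’s or IUCr schema.

# Minimal sketch of a deposition gate; all field names are invented
REQUIRED_FIELDS = {
    "structure_connection_table",   # machine-readable structure (e.g. MOL file or InChI)
    "raw_spectrum_file",            # the underlying FID or peak list
    "assignment_table",             # atom label -> shift mapping
}

def missing_fields(deposition):
    """Return the required fields absent from a deposited record (a dict)."""
    return sorted(REQUIRED_FIELDS - set(deposition))

# Hypothetical deposition that ships only a PDF and a raw spectrum
deposition = {"pdf_manuscript": "paper.pdf", "raw_spectrum_file": "cmpd1.fid"}
print(missing_fields(deposition))
# -> ['assignment_table', 'structure_connection_table']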
