experiment and theory – the liberation of data and source

Antony Williams, on the ChemSpider blog, has paid tribute to NMRShiftDB. I have copied his post in full and comment below on how theory and experiment test each other:

Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community

Posted by: Antony Williams in Quality and Content

I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that commercial software vendors and the Open Source/Access communities can coexist and have mutual admiration? I believe so!

After over 18 months of work I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. After all that after-hours effort (much like blogging), a 360-page review article is finally submitted – “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work, as well as that of many contributors, in the domain of CASE (Computer-Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of Mw > 1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in the recent critique of the quality of the data published by Wolfgang Robien, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to confirm is PMR’s post regarding the quality of the data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data – over 200,000 individual chemical shifts – I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien, but our estimate after this analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, that these data are free to download and utilize even for the development of derivative algorithms, and that such systems can work. The addition or improvement of rigorous checking algorithms in NMRSHIFTDB is the next natural step; flagging suspect data to the submitter will have them check and validate the quality of their input, and this will catch many errors during the submission process.

So, my compliments to Christoph and the team. The quality is excellent; there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with that of Robien…likely we saw the same things, since they were very obvious. These errors should not detract from the effort…with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors…lots of them. But it’s very useful all the same!
======
So thank you Antony for this analysis. This represents best practice in the prediction of chemical properties:
  • find a set of molecules for which NMR spectra have been measured and which are available in machine-parsable digital form (i.e. molecules in CML or legacy formats, spectra in CMLSpect or JCAMP-DX, not hamburger PDF)
  • compute the NMR spectrum for each molecule. ACDLabs have one of the best programs for doing this – which I think is based on lookup, heuristics and machine learning techniques. It’s also possible to use fundamental simulations such as ab initio quantum mechanics.
  • compare the observed and predicted values.
  • attempt to rationalise any discrepancies in terms of experiment, theory, or both. If this is possible then either method may be refined. (A sketch of these last two steps follows the list.)
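
Here is a minimal sketch of those last two steps (comparison and rationalisation), assuming we already have observed, assigned 13C shifts and predicted shifts for the same atoms. The atom labels, example values and the 3 ppm outlier threshold are purely illustrative and not taken from any particular program:

    # Compare observed and predicted chemical shifts and flag large discrepancies.
    # 'observed' and 'predicted' are hypothetical dicts keyed by atom label (ppm);
    # the 3 ppm outlier threshold is an arbitrary example value.
    def compare_shifts(observed, predicted, outlier_ppm=3.0):
        """Return per-atom differences and the atoms worth re-examining."""
        diffs, outliers = {}, []
        for atom, obs in observed.items():
            if atom not in predicted:
                continue  # no prediction for this atom; skip rather than guess
            delta = obs - predicted[atom]
            diffs[atom] = delta
            if abs(delta) > outlier_ppm:
                outliers.append((atom, obs, predicted[atom], delta))
        return diffs, outliers

    observed = {"C1": 128.4, "C2": 77.1, "C3": 21.0}    # measured 13C shifts / ppm
    predicted = {"C1": 127.9, "C2": 70.2, "C3": 21.3}   # predicted 13C shifts / ppm
    for atom, obs, pred, delta in compare_shifts(observed, predicted)[1]:
        # A large discrepancy points at either a mis-assignment (experiment)
        # or a weakness in the prediction method (theory).
        print(f"{atom}: observed {obs} ppm vs predicted {pred} ppm (delta {delta:+.1f} ppm)")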

In favourable cases we find that experiment and theory agree (within known limits). Each side of the equation then validates the other. This gives us great confidence in creating general protocols. Thus, from the account above, ACDLabs can predict NMR spectra within published limits and get it right 99.9% of the time.
Joe Townsend has done the same for crystallography – we are in the process of writing this up, but basically we get similar results. We create a protocol to identify poor experimental data and filter them out, and then compare the rest with predictions from quantum mechanics. We also get >99.9% agreement. As a result we are confident that the data collected in Nick Day’s CrystalEye will be highly valuable.
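
To make the filter-then-compare approach concrete, here is a rough sketch, not Joe’s actual protocol: each hypothetical entry carries an experimental quality metric (an R-factor) and an observed bond length to be compared against a quantum-mechanical value, and both thresholds are invented for illustration:

    # Filter out poor experimental entries, then measure agreement with theory.
    # The R-factor cut-off (0.05) and the agreement tolerance (0.02 Å) are
    # illustrative values, not the thresholds used in the study described above.
    def agreement_rate(entries, max_r_factor=0.05, tolerance=0.02):
        """entries: dicts with 'r_factor', 'observed' and 'predicted' lengths in Å."""
        kept = [e for e in entries if e["r_factor"] <= max_r_factor]  # drop poor data
        if not kept:
            return 0.0, 0
        agree = sum(1 for e in kept
                    if abs(e["observed"] - e["predicted"]) <= tolerance)
        return agree / len(kept), len(kept)

    entries = [
        {"r_factor": 0.03, "observed": 1.397, "predicted": 1.395},
        {"r_factor": 0.12, "observed": 1.52,  "predicted": 1.33},   # filtered out
        {"r_factor": 0.04, "observed": 1.212, "predicted": 1.214},
    ]
    rate, n = agreement_rate(entries)
    print(f"{rate:.1%} agreement over {n} retained entries")
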
So the message is simple. If you wish to predict properties, you must follow the steps above AND

  • publish your methodology and protocols
  • publish your test data set before and after filtering
  • publish the agreement
  • use this to give confidence limits to your predictive method.

Only then can you reasonably announce to the world that you have a useful method.
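
For the last point in the list above – turning the published agreement into confidence limits – one simple approach (a sketch only; no particular method is prescribed here) is to summarise the residuals on the filtered test set and quote an approximate prediction interval:

    # Derive simple confidence limits for a predictive method from its residuals
    # on a published test set. The numbers below are invented examples, and the
    # factor of 2 assumes roughly normally distributed errors (~95% interval).
    import statistics

    def confidence_limits(observed, predicted, k=2.0):
        """Return (mean error, standard deviation, approximate limits)."""
        residuals = [o - p for o, p in zip(observed, predicted)]
        mu = statistics.mean(residuals)
        sigma = statistics.stdev(residuals)
        return mu, sigma, (mu - k * sigma, mu + k * sigma)

    observed  = [128.4, 77.1, 21.0, 165.2, 39.8]   # test-set values (e.g. shifts / ppm)
    predicted = [127.9, 76.5, 21.3, 166.0, 40.1]
    mu, sigma, (lo, hi) = confidence_limits(observed, predicted)
    print(f"bias {mu:+.2f}, spread {sigma:.2f}, ~95% of errors within [{lo:+.2f}, {hi:+.2f}]")
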
For example, I am listening to Bobby Glen giving a talk about how we predict the solubility of organic compounds in our Centre. The data in the literature can be awful – some measurements differ by a factor of 1000. Yet many groups have developed predictive methods based on them, and these methods are widely deployed. So Bobby’s group is going back to the bench to measure solubility by new and better methods – but I’ll let them tell the story.
Unfortunately this is all too rare in chemoinformatics. Many publications report predictive methods, but the data, protocols, algorithms, software and analysis methods are often not reported or made publicly available. The process is not repeatable outside the organisation that created the methodology. Unless the software is open source it is fundamentally impossible to verify algorithms – reports in traditional publications are far too brief to allow fully tested implementations. So, although for the NMR study the data were open, the program was closed source, which means that only the authors can investigate any discrepancies.
Oh, and have I mentioned in other blogs that > 99.9% of NMR spectra are not available in Open machine-parsable form? Because publishers copyright them and try to sell them back. Because chemists do not see the value of preserving their own data. Because manufacturers have binary formats.
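
As a reminder of what “machine-parsable” buys you: a JCAMP-DX spectrum is plain labelled text, so even a few lines of code can pull out the metadata and the raw data table. This is only a sketch that reads the ##LABEL= header records; a full reader would also decode the packed XYDATA forms, and the file name is hypothetical:

    # Minimal reader for the labelled-data records of a JCAMP-DX file.
    # Sketch only: collects '##LABEL= value' headers and the raw data lines;
    # it does not decode the compressed/difference-encoded XYDATA formats.
    def read_jcamp(path):
        headers, data_lines = {}, []
        with open(path, encoding="ascii", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("##"):
                    label, _, value = line[2:].partition("=")
                    headers[label.strip().upper()] = value.strip()
                elif line:
                    data_lines.append(line)  # numeric tables, e.g. under ##XYDATA
        return headers, data_lines

    # headers, data = read_jcamp("spectrum.jdx")   # hypothetical file name
    # print(headers.get("TITLE"), headers.get("DATA TYPE"))
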
Well, the chemical blogosphere is going to change these attitudes. We are going to liberate data. We’ll start with crystallography and then move on to spectra. We’re not going to reveal all the methods in case people try to block us. But we’re confident it will work.
