Open NoteBook Science, Chemspiderman, GIAO and HOSE. We write the final paper

We posted yesterday about our proposed Open NoteBook approach to 13C chemical shifts from experiment and calculation and Chemspiderman has posted very helpful contributions, including offers of help. This is much appreciated and accepted. The main problem we have is HOW to do the open stuff (wiki, blog, etc.) not WHETHER. So welcome aboard. But don’t expect things to happen to order 🙂

  1. ChemSpiderMan Says:
    October 11th, 2007 at 3:04 am Peter… I support and encourage the ONS. We’re working with Jean-Claude Bradley and helping where we can too. In regards to this work, you are aware, I believe, of the study done to apply computer-assisted structure elucidation (http://www.chemspider.com/blog/?p=77). A manuscript has just been submitted to J Nat Prod and is in review.
    We have just had a publication accepted to JCIM: “Towards More Reliable 13C and 1H Chemical Shift Prediction: A Systematic Comparison of Neural Network and Least Squares Regression Based Approaches”. Authors: Williams, Antony; Elyashberg, Mikhail; Blinov, Kirill; Smurnyy, Yegor; Churanova, Tatiana. This relates to my finishing up publications from my time at ACD/Labs.
    A lot of work has been done by ACD/Labs to examine the NMRShiftDB database and validate the quality. You’ve seen us (both ACD/Labs and ChemSpider) get very vocal about quality in SUPPORT of NMRShiftDB. A manuscript was submitted this week with Christoph Steinbeck (host of NMRShiftDB) as a co-author: “The Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source”. Authors: Williams, Antony; Blinov, Kirill; Smurnyy, Yegor; Elyashberg, Mikhail; Churanova, Tatiana; Kvasha, Mikhail; Steinbeck, Christoph; Lefebvre, Brent.
    My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study. Will it be possible to download the entire dataset with predicted list of shifts in a standard consumable format such as SDF?
    I think ACD/Labs has done a good job validating the quality of the NMRShiftDB database and this has been espoused at Ryan’s blog; the last post is http://acdlabs.typepad.com/my_weblog/2007/09/wolfgang-robien.html
    Either wiki or webpage with form for modifying works for me…no preference.
    Is there a timescale for the work? Good luck!
  2. ChemSpiderMan Says:
    October 11th, 2007 at 3:16 am Peter… one more comment. Thanks to some very valiant efforts by Bob Lancashire to address Java compatibility issues with JSpecView the applet is now working well (http://www.chemspider.com/news/?p=81). The applet already supports CML from NMRShiftDB so you will likely want to adopt this for spectral viewing? http://sourceforge.net/forum/forum.php?forum_id=707556 As an NMR jock I’d welcome the opportunity to help in the project. Of course, if Christoph’s engaged you’ll likely have all the bases covered :-)

PMR: Christoph is visiting the week after next (Christoph, I shall be here during that period so please visit).
Some comments. Everything here is very useful. The references are important and Nick needs to know them. They’d be better on a Wiki than in blog comments so we need to think about that. Wikis: They collate very well but are not spam-free. They don’t easily alert in the same way as blogs. Blogs: They alert very well but are not spam-free. They do not collate well at all. We need a synthesis of the two. Neither is good at managing chemistry.
We are both agreed that NMRShiftDB is a valuable resource and needs growing. (It has been attacked by Robien – see above – primarily on the grounds that he got there first, that he has more spectra and that NMRShiftDB is a waste of tax-payers’ money). We have developed SPECTRa as a toolkit for uploading spectra to a repository and are continuing to develop the technology.
We are using crystalEye as the inspiration for this work. The data are robotically harvested, checked and repurposed into CML. In principle we could do this with NMR. It is fairly easy to scrape NMR spectra off publications and we could have 1 million quite easily. However it is more difficult to get the structure – we have ways of doing this.
Unfortunately this is bitterly opposed by some publishers, who see the re-use of data as undermining their business. What a short-sighted view. I shall write later about how this could be a brand-new business, and one we expect to be in. Some publishers do not even expose their data (sic) to non-paying customers. (Open access papers are no good, as they are “junk science”).
If the publishers actually WANT to expose data they could solve the problem in a year. Get the authors to provide InChIs (yes, organometallics are hard) and spectra as CML. All the tools exist.

“My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study. Will it be possible to download the entire dataset with predicted list of shifts in a standard consumable format such as SDF?”

I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.
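A quick sanity check on the throughput figures, as a minimal sketch (the 30-day month and the function names are assumptions for illustration): 100 jobs a day comes to 3,000 a month, and a figure of 30,000 a month would correspond to about 1,000 jobs a day.

```python
# Back-of-envelope throughput check.
# Assumption: a month is taken as 30 days; function names are illustrative.
DAYS_PER_MONTH = 30


def jobs_per_month(jobs_per_day: int, days: int = DAYS_PER_MONTH) -> int:
    """Total jobs completed in a month at a steady daily rate."""
    return jobs_per_day * days


def jobs_per_day_needed(monthly_target: int, days: int = DAYS_PER_MONTH) -> float:
    """Daily rate required to reach a given monthly total."""
    return monthly_target / days


if __name__ == "__main__":
    print(jobs_per_month(100))          # monthly total at 100 jobs a day
    print(jobs_per_day_needed(30_000))  # daily rate needed for 30,000 a month
```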
We shall expose everything as CML. This is far more useful than SDF – e.g. atoms and peaks have clear structures. We’ll probably use APP (the Atom Publishing Protocol).
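To illustrate what "atoms and peaks have clear structures" buys us, here is a minimal sketch using only the Python standard library. The element and attribute names (molecule, atomArray, atom, spectrum, peakList, peak, xValue, atomRefs) are modelled loosely on CML and should be checked against the CML schema before real use; the helper names and the shift values are made up for illustration and are not measured data.

```python
import xml.etree.ElementTree as ET


def build_cml(atoms, peaks):
    """Build a small CML-like document.

    atoms: list of (atom_id, element_symbol)
    peaks: list of (peak_id, shift_ppm, atom_id) linking each peak to one atom
    """
    cml = ET.Element("cml")
    molecule = ET.SubElement(cml, "molecule", {"id": "m1"})
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in atoms:
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})

    spectrum = ET.SubElement(cml, "spectrum", {"id": "s1", "moleculeRef": "m1"})
    peak_list = ET.SubElement(spectrum, "peakList")
    for peak_id, shift_ppm, atom_id in peaks:
        ET.SubElement(
            peak_list,
            "peak",
            {"id": peak_id, "xValue": f"{shift_ppm:.1f}", "atomRefs": atom_id},
        )
    return cml


def unassigned_peaks(cml):
    """Return ids of peaks whose atomRefs does not point at a declared atom."""
    atom_ids = {atom.get("id") for atom in cml.iter("atom")}
    return [p.get("id") for p in cml.iter("peak") if p.get("atomRefs") not in atom_ids]


if __name__ == "__main__":
    # Illustrative values only.
    doc = build_cml(
        atoms=[("a1", "C"), ("a2", "C"), ("a3", "O")],
        peaks=[("p1", 18.0, "a1"), ("p2", 58.0, "a2")],
    )
    print(ET.tostring(doc, encoding="unicode"))
    print(unassigned_peaks(doc))
```

Because every peak names the atom it is assigned to, a machine can check assignments directly, which a flat peak list in an SDF data field does not make easy.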
The main effort is trying to get more data. Here we won’t outline our strategy or the publishers may try to prevent the work before it starts (Wiley sent lawyers to Shelley Batts for posting a single graph with 10 points, so what would they do for a whole spectrum?).

I have extended George Whitesides’ ideas on writing papers: in doing research one should write the final paper first (and then, of course, modify it as you go along). So here’s what we will have accomplished (please correct mistakes):

We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic shieldings calculated with a correction for the known solvent. Each calculated shielding was subtracted from the calculated TMS shielding (in the same solvent) to give the predicted shift, which was compared with the observed shift.
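The selection and referencing steps can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function names are hypothetical, "(<= Cl)" is read here as "no element heavier than chlorine", the rigidity test is left out, and any shielding values you pass in would come from your own GIAO output.

```python
# Atomic numbers for H..Cl; anything outside this table counts as heavier than Cl.
ATOMIC_NUMBER = {
    "H": 1, "He": 2, "Li": 3, "Be": 4, "B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
    "Ne": 10, "Na": 11, "Mg": 12, "Al": 13, "Si": 14, "P": 15, "S": 16, "Cl": 17,
}


def passes_size_filter(elements, max_heavy_atoms=21):
    """True if the molecule has <= max_heavy_atoms non-H atoms, none heavier than Cl.

    elements: list of element symbols, one per atom (hydrogens included).
    """
    if any(symbol not in ATOMIC_NUMBER for symbol in elements):
        return False  # element beyond Cl (or unknown symbol)
    heavy = sum(1 for symbol in elements if symbol != "H")
    return heavy <= max_heavy_atoms


def shift_from_shielding(sigma_atom, sigma_tms):
    """Chemical shift (ppm) relative to TMS: delta = sigma_TMS - sigma_atom."""
    return sigma_tms - sigma_atom


def predicted_shifts(shieldings, sigma_tms):
    """Map each atom id to its predicted shift.

    shieldings: dict of atom id -> calculated isotropic shielding (ppm)
    sigma_tms:  calculated 13C shielding of TMS at the same level and solvent
    """
    return {atom: shift_from_shielding(s, sigma_tms) for atom, s in shieldings.items()}
```

The reference shielding must come from a TMS calculation at the same level of theory and in the same solvent model, otherwise systematic errors no longer cancel.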
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community, who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
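The bookkeeping behind this loop (compute the RMS deviation, look at the worst offenders, average over conformers) can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and how the conformer weights are chosen (Boltzmann populations, for example) is not specified above, so they are simply passed in.

```python
import math


def rms_deviation(pairs):
    """Root-mean-square deviation over (predicted_ppm, observed_ppm) pairs."""
    if not pairs:
        raise ValueError("need at least one (predicted, observed) pair")
    return math.sqrt(sum((pred - obs) ** 2 for pred, obs in pairs) / len(pairs))


def largest_deviations(records, n=5):
    """Return the n records with the largest |predicted - observed|.

    records: list of (label, predicted_ppm, observed_ppm)
    """
    return sorted(records, key=lambda r: abs(r[1] - r[2]), reverse=True)[:n]


def average_over_conformers(shifts, weights):
    """Weighted mean of one atom's predicted shift across conformers.

    shifts and weights are parallel lists; weights need not be normalised.
    """
    if len(shifts) != len(weights) or not shifts:
        raise ValueError("shifts and weights must be non-empty and equal in length")
    return sum(s * w for s, w in zip(shifts, weights)) / sum(weights)
```

Running rms_deviation before and after dropping the records returned by largest_deviations gives the kind of before/after comparison described above.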
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were "wrong" – i.e. the reported chemical shifts did not fit the reported spectra values.
We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the "correctness" on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way they are. cc% said that if we extracted data they would sue us. dd% failed to reply.
We are submitting this paper to the Journal of Open Chemistry as it has been rejected by J. Wonderful Chem (Toll-access) because the work has been done on the Internet and is therefore junk.


5 Responses to Open NoteBook Science, Chemspiderman, GIAO and HOSE. We write the final paper

  1. Pingback: Science in the open » PMRs Open Notebook Project continued

  2. Pingback: Science in the open » How best to do the open notebook thing…a nice specific example

  3. Pingback: ChemSpider Blog » Blog Archive » An Invitation to Collaborate on Open Notebook Science for an NMR Study

  4. Here’s a suggested paragraph to insert in the paper
    “The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures while for the majority of the NMRShiftDB made up of general organic chemicals non-QM approaches were superior.”
    I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it! Can you consider a close collaboration between your team, myself and the scientists at ACD/Labs? We have already submitted a paper to JCIM (presently in proofing and will be sent separately to you) and have another paper already submitted re. NMRShiftDB and co-authored with Christoph.

  5. Pingback: Good Science Takes Time: 16 months to examine NMR Prediction Performance at The ChemConnector Blog by Antony Williams - Observations and Musings for the Chemistry Community By Antony Williams, Freelance Scientist
