We posted yesterday about our proposed Open NoteBook approach to 13C chemical shifts from experiment and calculation and Chemspiderman has posted very helpful contributions, including offers of help. This is much appreciated and accepted. The main problem we have is HOW to do the open stuff (wiki, blog, etc.) not WHETHER. So welcome aboard. But don’t expect things to happen to order 🙂
PMR: Christoph is visiting the week after next (Christoph, I shall be here during that period so please visit).
Some comments. Everything here is very useful. The references are important and Nick needs to know them. They’d be better on a Wiki than in blog comments so we need to think about that. Wikis: They collate very well but are not spam-free. They don’t easily alert in the same way as blogs. Blogs: They alert very well but are not spam-free. They do not collate well at all. We need a synthesis of the two. Neither is good at managing chemistry.
We are both agreed that NMRShiftDB is a valuable resource and needs growing. (It has been attacked by Robien – see above – primarily on the grounds that he got there first, that he has more spectra and that NMRShiftDB is a waste of tax-payers’ money). We have developed SPECTRa as a toolkit for uploading spectra to a repository and are continuing to develop the technology.
We are using crystalEye as the inspiration for this work. The data are robotically harvested, checked and repurposed into CML. In principle we could do this with NMR. It is fairly easy to scrape NMR spectra off publications and we could have 1 million quite easily. However it is more difficult to get the structure – we have ways of doing this.
Unfortunately this is bitterly opposed by some publishers, who see the re-use of data as undermining their business. What a short-sighted view. I shall write later about how this could be a brand-new business, and one we expect to be in. Some publishers do not even expose their data (sic) to non-paying customers. (Open access papers are no good, as they are “junk science”).
If the publishers actually WANT to expose data they could solve the problem in a year. Get the authors to provide InChIs (yes, organometallics are hard) and spectra as CML. All the tools exist.
“My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study. Will it be possible to download the entire dataset with predicted list of shifts in a standard consumable format such as SDF?”
I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so about 3,000 a month, or over 30,000 a year. That’s larger than NMRShiftDB.
We shall expose everything as CML. This is far more useful than SDF – e.g. atoms and peaks have clear structures. We’ll probably use APP (the Atom Publishing Protocol).
The main effort is trying to get more data. Here we won’t outline our strategy or the publishers may try to prevent the work before it starts (Wiley sent lawyers to Shelley Batts for posting a single graph with 10 points, so what would they do for a whole spectrum?).
I have extended George Whitesides’ ideas on writing papers: in doing research one should write the final paper first (and then of course modify it as you go along). So here’s what we will have accomplished (please correct mistakes):
We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
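The referencing step above can be sketched in a few lines of Python. This is only an illustration, not our actual pipeline: the function name `shieldings_to_shifts` is hypothetical and the shielding values are invented for the example.

```python
def shieldings_to_shifts(sigma_atoms, sigma_tms):
    """Convert calculated isotropic shieldings (ppm) to chemical shifts (ppm).

    Each shift is the calculated TMS shielding minus the atom's shielding,
    with both assumed to come from the same method and solvent model.
    """
    return [sigma_tms - sigma for sigma in sigma_atoms]


# Invented numbers for illustration only (not real calculations).
sigma_tms = 190.0
sigma_atoms = [60.0, 150.0, 170.5]

shifts = shieldings_to_shifts(sigma_atoms, sigma_tms)
print(shifts)  # [130.0, 40.0, 19.5]
```

The predicted shifts can then be compared atom by atom with the observed values.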
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
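For readers who want to see what "the RMS dropped" means in practice, here is a minimal Python sketch of the bookkeeping: compute the RMS deviation over all predicted/observed pairs, then again after excluding structures flagged as misassigned. The function names, the data layout (a dict of structure id to shift pairs) and the numbers are all assumptions made up for the example.

```python
import math


def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts (ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def rms_over_structures(data, excluded=()):
    """RMS over every (predicted, observed) pair, skipping excluded structure ids."""
    pairs = [pair for sid, plist in data.items() if sid not in excluded
             for pair in plist]
    predicted = [p for p, _ in pairs]
    observed = [o for _, o in pairs]
    return rms_deviation(predicted, observed)


# Invented data: structure id -> list of (predicted, observed) shifts in ppm.
data = {
    "mol1": [(10.0, 13.0), (20.0, 16.0)],
    "mol2": [(50.0, 50.0)],
    "suspect": [(100.0, 70.0)],  # stands in for a grossly misassigned entry
}

print(rms_over_structures(data))                        # all structures
print(rms_over_structures(data, excluded={"suspect"}))  # after removal
```

A single bad assignment dominates the first figure, which is why exposing the outliers to the community comes before any correction to the method.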
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were "wrong" – i.e. the reported chemical shifts did not fit the reported spectra values.
We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the "correctness" on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way they are. cc% said that if we extracted data they would sue us. dd% failed to reply.
We are submitting this paper to the Journal of Open Chemistry as it has been rejected by J. Wonderful Chem (Toll-access) because the work has been done on the Internet and is therefore junk.