The NMR project that Nick Day has been working on for the last month has run its course. We said that it would finish at the end of October so as not to prolong Nick’s writing up. Like all research it has not gone completely smoothly, but it has actually been well on track. Nick will be posting all the material in the next day or so and everyone will have access. This will approximate an Open Notebook and we’ll invite comments later as to whether this is satisfactory. I shan’t fill in numbers in this post.
We set out our expected goals at the start of the project and this has proved extremely valuable. During an exciting period it has helped us stay focussed both in direction and extent. We had not anticipated the interest it would generate and it’s a credit to Nick that he has stayed clear-headed during the process. Here’s what we said we would do and whether we managed it (actual numbers and details will be posted soon):
We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). [DONE]
We extracted ca. 400 spectra shifts with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl).
[DONE – we also used Br]
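A minimal sketch of how such a selection filter might look, not the code used in the project. It assumes one reading of the rigidity rule: a molecule is rejected if any bond joins two heavy atoms that are both outside every ring. The function names and the connection-table input (element symbols plus index pairs for bonds) are illustrative.

```python
from collections import defaultdict

# Elements up to Cl (Z <= 17), plus Br as noted in the status comment.
ALLOWED_ELEMENTS = {"H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                    "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Br"}
MAX_HEAVY_ATOMS = 21


def ring_atoms(bonds):
    """Return the indices of atoms that lie on at least one ring."""
    adjacency = defaultdict(set)
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)

    def connected_without(a, b):
        # Is b still reachable from a when the direct bond a-b is ignored?
        stack, seen = [a], {a}
        while stack:
            current = stack.pop()
            for neighbour in adjacency[current]:
                if {current, neighbour} == {a, b}:
                    continue
                if neighbour == b:
                    return True
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return False

    in_ring = set()
    for a, b in bonds:
        if connected_without(a, b):  # the bond is part of a cycle
            in_ring.update((a, b))
    return in_ring


def is_rigid_candidate(elements, bonds):
    """Apply the element, size and (assumed) rigidity criteria."""
    if any(symbol not in ALLOWED_ELEMENTS for symbol in elements):
        return False
    heavy = {i for i, symbol in enumerate(elements) if symbol != "H"}
    if len(heavy) > MAX_HEAVY_ATOMS:
        return False
    heavy_bonds = [(a, b) for a, b in bonds if a in heavy and b in heavy]
    cyclic = ring_atoms(heavy_bonds)
    # Reject any bond whose two heavy atoms are both acyclic.
    return not any(a not in cyclic and b not in cyclic for a, b in heavy_bonds)


# Heavy-atom skeletons only: toluene passes, ethylbenzene does not.
benzene_ring = [(i, (i + 1) % 6) for i in range(6)]
print(is_rigid_candidate(["C"] * 7, benzene_ring + [(0, 6)]))          # True
print(is_rigid_candidate(["C"] * 8, benzene_ring + [(0, 6), (6, 7)]))  # False
```

Under this reading a methyl group on a ring is acceptable because only one end of the bond is acyclic, whereas an ethyl group introduces a rotatable acyclic-acyclic bond.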
These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent.
The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was [large]. This was due to a small number of structures where there appeared to be gross errors of assignment.
[CORRECT. there were "wrong structures" and also misassignments of peaks were common]
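The shift calculation and the RMS comparison described above can be sketched as follows. This is an illustration only: the function names are hypothetical and the numbers in the usage lines are made up, not project results.

```python
import math


def predicted_shift(sigma_tms, sigma_atom):
    """Chemical shift (ppm) as the TMS shielding minus the atom's shielding.

    Both isotropic shieldings are assumed to come from the same level of
    theory and the same solvent model.
    """
    return sigma_tms - sigma_atom


def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only (ppm); these are not data from the project.
sigma_tms = 190.0
shieldings = [60.0, 162.5]
observed = [128.5, 29.0]
predicted = [predicted_shift(sigma_tms, s) for s in shieldings]
print(predicted, rms_deviation(predicted, observed))
```

Because the deviations are squared, a single grossly misassigned peak can dominate the RMS, which is consistent with a small number of bad entries inflating the initial figure.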
These were exposed to the community who agreed that these should be removed.
[DONE in part. We are extremely grateful to the community for commenting on general methodology and individual entries. A small number of entries were clearly grossly wrong or mistreated.]
The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz.
[CORRECT. This is true for C-Br and C-Cl systems. We shall invite comments for some other groups.]
The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz.
[NOT DONE. The metadata are not included in the CMLSpect file, so this would have to be done manually. It is probably not a major contributor to variance. We would also like to have included dates but these are not easily extracted.]
The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
[STARTED, and the community can help. We have identified clear conformational effects in some cases, suspected tautomers, and unmodelled solvent effects.]
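One way the conformer averaging could be done is sketched below. The post only says the contributions were to be "averaged"; the Boltzmann weighting by relative energy used here is an assumption, and the function names, temperature default and example values are illustrative.

```python
import math

R_KJ_PER_MOL_K = 0.008314462618  # gas constant in kJ/(mol*K)


def boltzmann_weights(relative_energies_kj_mol, temperature_k=298.15):
    """Normalised Boltzmann populations for a set of conformer energies."""
    lowest = min(relative_energies_kj_mol)
    factors = [
        math.exp(-(energy - lowest) / (R_KJ_PER_MOL_K * temperature_k))
        for energy in relative_energies_kj_mol
    ]
    total = sum(factors)
    return [factor / total for factor in factors]


def averaged_shift(shifts_ppm, relative_energies_kj_mol, temperature_k=298.15):
    """Population-weighted shift for one atom across its conformers."""
    weights = boltzmann_weights(relative_energies_kj_mol, temperature_k)
    return sum(w * s for w, s in zip(weights, shifts_ppm))


# Two hypothetical conformers of an XC(=Z)Y group, 2 kJ/mol apart.
print(averaged_shift([170.0, 174.0], [0.0, 2.0]))
```

The averaged value always lies between the individual conformer shifts and moves towards the lower-energy conformer as the energy gap grows.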
This established a protocol for predicting NMR spectra to 99.3% confidence.
[TO BE POSTED. We are confident that this method is applicable to a subset of chemistry and does not rely on fitted parameters. We are working in variance space but may be able to transform to confidence. The treatment of misassignment looks promising.]
We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were "wrong" – i.e. the reported chemical shifts did not fit the reported spectra values.
[NOT YET STARTED. We are hoping to build a submission system and invite the community to contribute. This is not part of Nick's thesis, though obviously if useful work is done before he finishes the writing he can include it in discussion. We may do some exemplars when we write the paper. We'd be very grateful for any examples of recent publications where the spectral peaks look reliable and the structure does not.]
[… ideas related to publishers snipped …]
We started this 3 weeks ago and have effectively finished all computation, tools for display, and much of the analysis. We feel confident in stating that initially most of the variance was due to problems in the "experimental" – i.e. the actual data and its metadata. We identified ca. 11 possible error types (post) and have actually found four of them:
- wrong compound assigned to spectrum (i.e. error in bookkeeping or drawing error)
- transcription errors in spectrum or peaks.
- misassignment of peaks to inappropriate atoms
- human editing of spectra including fraud
We also found a number of limitations in our model (so far we haven’t found any “bugs”).
- Theoretical model has limitations. YES. So far the main one appears to be lack of treatment of solvent (e.g. for C=O groups in CHCl3). We anticipated this and we think it shows up.
- Oversimplified chemical model. There are several common problems:
  - only one conformer is calculated. YES. Identifiable.
  - symmetry is not well treated. No clear exemplar other than conformers.
  - tautomerism is ignored. PROBABLY. We invite your comments.
  - isomerism (e.g. ring-chain) is ignored. No examples.