Computational NMR: treatment of outliers, we need your help

We have posted a number of cases where the calculated NMR shifts do not agree with the observed ones, and have also indicated over 25 possible reasons for this: some due to errors or features in the experiment, some in the calculation. A priori we do not know which causes are more frequent. Indeed a number of correspondents have suggested that our methods and the data both contain significant problems. It is also clear that if good agreement can be obtained between calculations and a data set, this would support both and would be of considerable value to the community.
So we now intend to pursue these investigations of quality in public, resulting in Open Data that, although not strictly conformant to Open Notebook, is released within a very short period of creation. We are exposing all our results and thinking so that the final data set and protocol are as transparent as possible, which should avoid much of the debate that occurs over closed data and methods.
The first step is to analyse the causes of variance in the data. We have two measures, precision (the variance) and accuracy (agreement of absolute value). Initially they are not independent, in that a few very large errors in single shifts will degrade both precision and accuracy.
For each data set (which contains between 1 and 20 observed and calculated shifts) we fit the data to
y = x + c + eps
This contains 1 adjustable parameter, c – the offset. For each point in a data set we can calculate the signed deviation (delta) between observed and calculated. For each data set we can calculate the root mean square of delta (RMSD). Here is our current plot, with NO data points omitted.
[Figure: scatter.PNG]
Most of the data are clustered in a roughly normal fashion with a mean value of c close to zero. Do not try to read any more information into any apparent systematic variation as the outliers are not uniformly distributed and have high leverage.
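The one-parameter fit described above can be sketched in a few lines of Python. This is only an illustration, not our production code. It assumes x is the calculated shift and y the observed shift, and it reports the RMSD of the residuals after subtracting c; the post does not say which way round these are, or whether the RMSD is taken before or after removing the offset. The function name fit_offset is hypothetical.

```python
import math

def fit_offset(observed, calculated):
    """Fit y = x + c + eps with c as the only adjustable parameter.

    Assumption of this sketch: x = calculated shift, y = observed shift.
    Returns (c, residuals, rmsd), where each residual is (y - x - c).
    """
    if len(observed) != len(calculated) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    # With the slope fixed at 1, the least-squares offset is the mean of (y - x).
    diffs = [y - x for x, y in zip(calculated, observed)]
    c = sum(diffs) / len(diffs)
    # Signed deviation of each point once the offset has been removed.
    residuals = [d - c for d in diffs]
    rmsd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return c, residuals, rmsd
```

Each data set then contributes one (c, RMSD) pair, which is the kind of summary a scatter plot can show. A data set with a single shift always gives an RMSD of zero, so such points carry no information about precision.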
At this stage we do not know whether any outlier is caused by failings in the data, failings in the method or both. It would be totally irresponsible to omit points simply because they didn’t agree. We must identify the causes of error for major outliers, and – if they make scientific sense – we may then argue they should be omitted.
If each error is independent of every other error then there is no option but to examine every case in detail. For example, if the sole cause were serious transcription errors there would be no way of automating this checking (other than by running OSCAR over publishers' PDFs). However there may be some causes of error which occur frequently. Egon has already shown us that misassignment of peaks can happen, and in fact we believe this is a frequent occurrence. If we can detect this in a manner that convinces the community then we can legitimately remove these entries from the set.
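One possible way to flag a candidate misassignment automatically is to ask whether exchanging two observed shifts sharply improves the agreement with the calculation. The sketch below is an assumption for illustration, not a method we have adopted; the function names are hypothetical, and a flagged pair would still need to be checked against the literature and chemical sense before any entry is removed.

```python
import math
from itertools import combinations

def rmsd_after_offset(observed, calculated):
    """RMSD of (observed - calculated) after removing the mean offset."""
    diffs = [o - c for o, c in zip(observed, calculated)]
    offset = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - offset) ** 2 for d in diffs) / len(diffs))

def best_pairwise_swap(observed, calculated):
    """Find the single exchange of two observed shifts that lowers RMSD most.

    Returns (i, j, rmsd_before, rmsd_after), or None if no swap improves
    the agreement.
    """
    base = rmsd_after_offset(observed, calculated)
    best = None
    for i, j in combinations(range(len(observed)), 2):
        swapped = list(observed)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        r = rmsd_after_offset(swapped, calculated)
        if r < base and (best is None or r < best[3]):
            best = (i, j, base, r)
    return best
```

A large drop in RMSD from one swap is only a hint. Two atoms with genuinely similar shifts will also be flagged with a small improvement, so some threshold on the size of the improvement would be needed in practice.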
It is also possible that certain types of experiment may show large variance. Egon has suggested that some solvents (e.g. acetone) may affect the shifts. If, for example, the outliers contained a high proportion of spectra run in acetone relative to, say, chloroform, then we could hypothesize that acetone caused variance. We might try to correct for this in some scientifically acceptable way (e.g. by modelling acetone in the QM calculations); alternatively we might reduce the scope of our calculations to those experiments not done in acetone.
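Whether a solvent is over-represented among the outliers can be checked with a simple counting argument. The sketch below computes a one-sided hypergeometric tail probability: the chance of seeing at least the observed number of acetone entries among the outliers if solvent and outlier status were unrelated. The choice of test and the function name are assumptions for illustration, and the numbers in the example are invented.

```python
from math import comb

def enrichment_p_value(n_total, n_solvent, n_outliers, n_outliers_in_solvent):
    """One-sided hypergeometric tail probability P(X >= observed count).

    n_total:               all data sets
    n_solvent:             data sets measured in the solvent of interest
    n_outliers:            data sets classed as outliers
    n_outliers_in_solvent: outliers measured in that solvent
    """
    upper = min(n_solvent, n_outliers)
    total_ways = comb(n_total, n_outliers)
    tail = sum(
        comb(n_solvent, k) * comb(n_total - n_solvent, n_outliers - k)
        for k in range(n_outliers_in_solvent, upper + 1)
    )
    return tail / total_ways

# Invented example: 10 data sets, 3 in acetone, 2 outliers, both in acetone.
p = enrichment_p_value(10, 3, 2, 2)
```

A small probability would support the hypothesis that the solvent is associated with outliers, but it says nothing about the mechanism, and testing many solvents at once would need a correction for multiple comparisons.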
Similarly we may find that certain types of chemical groups are associated with outliers. This is indeed true for the heavier halogens (and heavier elements). Henry had already predicted this from his own work, but if he had not it would have been legitimate to hypothesize that halogens caused systematic error, and normal regression techniques could lead to values for correction (and the variance of the correction). Indeed we may find it useful to compute regression-based values for those elements showing spin-orbit coupling.
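As a minimal sketch of such a regression-based correction, suppose each shift carries a flag saying whether the atom is attached to a heavy halogen. Regressing the signed deviation on a 0/1 flag gives a slope equal to the difference between the two group means, so the correction can be computed directly. The flag, the function name and the use of an unpooled variance estimate are all assumptions of this illustration.

```python
from statistics import mean, variance

def group_correction(deltas, has_group):
    """Estimate a systematic correction for atoms carrying a given group.

    deltas:    signed deviations (observed - calculated), one per atom
    has_group: booleans, True where the atom carries the group
    Returns (correction, variance_of_correction).
    """
    with_group = [d for d, h in zip(deltas, has_group) if h]
    without = [d for d, h in zip(deltas, has_group) if not h]
    if len(with_group) < 2 or len(without) < 2:
        raise ValueError("need at least two atoms in each class")
    # Slope of a regression on a 0/1 indicator = difference of group means.
    correction = mean(with_group) - mean(without)
    # Unpooled variance of that difference (a choice made for this sketch).
    var = variance(with_group) / len(with_group) + variance(without) / len(without)
    return correction, var
```

If the correction is large compared with the square root of its variance, the group is a plausible source of systematic error; a more careful treatment would fit one term per element.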
However we still expect outliers with their own, isolated, causes. In this case the first action is to return to the literature source. I have done this for one of the outliers, and can find no transcription errors so I have appealed to the community for their collective wisdom. Jean-Claude has suggested it may be due to tautomerism, but I would welcome other ideas.
What we intend to do, therefore, is to publish the interactive data for outliers (i.e. clickable plots, highlighting atoms in Jmol and with links to NMRShiftDB) and ask for community input. All results will be Open and will be immediately available.
Our intention – as we set out earlier – is:

  • create a small subset of NMRShiftDB which has been freed from the main errors we, and hopefully the community, can identify.
  • use this to estimate the precision and variance of our QM-based protocol for calculating shifts.
  • refine the protocol in the light of variance which can be scientifically explained.
This entry was posted in nmr, open notebook science.

4 Responses to Computational NMR: treatment of outliers, we need your help

  1. A few frequently occurring errors in NMR assignments:
    (a) Within the C=C-C=O fragments (e.g. cinnamic acid derivatives, coumarins, etc.) the shift values at the alpha/beta-positions with respect to the CO are mixed up in about 5% of the cases
    (b) C-C-C=O: The beta-position is shielded – very frequent source of errors
    (c) Alkyl chain: 14-22-31-29-29-29-……… is the usual sequence
    The N-methyl-piperidone problem:
    CNMR data can be found in:
    J.ORG.CHEM.,41,455(1976)
    J.CHEM.SOC.PERKIN-2,1927(1974)
    J.ORG.CHEM.,37,2332(1972)
    ORG.MAGN.RES.,10,31(1977)
    ORG.MAGN.RES.,12,339(1979)
    BULL.POL.AC.SCI.,28,263(1980)
    The data are available in:
    CAS-SCifinder ( I assume University of Cambridge has a campus license to access CAS )
    NMRPredict ONLINE FULL ( http://www.mestrec.com ) – Euro 155.- per year
    CHEMGATE ( http://chemgate.emolecules.com ) – price depends on licensing model, academic discount
    NMRPREDICT ( http://www.modgraph.co.uk ) – price published on their website, cheap academic version is mentioned above as “NMRPredict ONLINE Full”
    SPECINFO ( http://specinfo.wiley-vch.de )
    KnowItAll ( http://www.biorad.com )
    ACD-Prediction Software ( http://www.acdlabs.com )
    and
    WEB-Based version of CSEARCH ( http://nmrpredict.orc.univie.ac.at/csearchlite ) – a total number of 4 CNMR-spectra and 2 O17-NMR datasets are available – online since more than 1 year ! BTW: Free of charge !
    and maybe some more I forgot to mention here …….
    What would be great to the scientific community: Do calculations on compounds where sophisticated NMR-techniques either fail or are very difficult to perform – e.g. proton-poor compounds or simply ask for a list of compounds which are really suspicious (either the structure is wrong or the assignment is strange, but the puzzle can’t be solved, because the compound is not available for additional measurements).

  2. Egon @ Wii says:

    Please forgive me any typo’s. The Wiimote requires some practice… mmm, need to try jmol.org…
    Anyway, ended up here via Scintilla which does not work on anything but the Wii here… do you plan to use tagging as social curation system? One prediction == one URL == identifier for social annotation…
    BTW, the CrystalEye RSS feeds are empty for me.

  3. pm286 says:

    (1) Thanks Wolfgang – extremely useful. It’s probably easiest if we expose all our problem data.
    I shall write more later…

  4. pm286 says:

    (2) Egon – please explain further…
    For the RSS feeds suggest you contact Jim Downing
