Open Notebook NMR – another outlier

Here is another unexplained outlier in the first 100 entries. We’d be very grateful if anyone could confirm that it is in error (probably requires reading the original paper).
nmrshiftdb2562-1 (solvent: chloroform)
2562.PNG
Most of the outliers can be explained by the errors we have already proposed.

Posted in data, nmr, open notebook science | 3 Comments

Open Notebook NMR – motivations and confusions

I have been pleased by the interest in Open Notebook NMR but the current discussions have widened far too much to be useful, so I want to be absolutely clear about what the project and its limits are.
This is a part of Nick Day’s PhD thesis at Cambridge.
The only motivation is for Nick to be able to do good science, with advice and help which is publishable in his thesis. I repeat:
This is a part of Nick Day’s PhD thesis at Cambridge.
I made that absolutely clear at the beginning. Any broadening of the project is a distraction and could be detrimental to his work. For example I follow what Alicia is doing with Jean-Claude but I would never dream of suggesting she does other than what is agreed between them.
It is extremely unusual for a PhD student to expose his work as Open Notebook Science. I only suggested it because he is a good student and I believe his technical competence and commitment are such that he will do careful and valuable work. Remember that if anything is wrong it is extremely public.
We also made clear what the limits of the project were and I will repeat them as our hypothetical report:

We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were "wrong" – i.e. the reported chemical shifts did not fit the reported spectra values.

This is precisely what we have been doing and we are sticking to it. It would be irresponsible for a supervisor and student to do otherwise unless unforeseen difficulties arose. It has gone according to plan:

  • Initially the RMS deviation was ca. 3 ppm. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. We have had comments from Christoph, Henry, Egon and Jean-Claude which have allowed us to remove 3 entries.
  • The largest deviations were then due to C-Hal systems, where a correction was applied (with theoretical backing from Henry). We are now applying this correction and will report as soon as the new code has been written (because we use XML-CML this has a rapid turnround).
  • The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. A very serious error occurred in an entry from DKFZ spektren which was due to human editing of a spectrum (identified by Christoph). It is therefore not unreasonable to remove all entries from this source (note we have to remove all without inspecting them – we cannot just choose which we like).
  • A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. It is clear from inspection of the data that some compounds have equipopulated conformers (e.g. C6H5-CH(=O)) and that it is legitimate to identify this framework symmetry group and average over equivalent atoms if they have equal shifts.
  • This established a protocol for predicting NMR spectra to xxx% confidence. This is the primary aim of the work – to be able to show that machines can make decisions without humans within a given confidence level. It depends on being able to find a set of data which are accepted as accurately assigned. That is what we are asking the community for.
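The molecule-selection step described in the protocol above (rigid molecules, at most 21 heavy atoms, no element heavier than Cl) can be sketched in a few lines. This is an illustrative reading only: the data model (a list of atomic numbers plus bonds tagged with ring membership), the function name `is_candidate`, and the interpretation of “no acyclic-acyclic bonds” as “no non-ring bond between two non-terminal heavy atoms” are all assumptions, not the project’s actual code.

```python
# Hypothetical selection filter; NOT the project's code. A molecule is a
# list of atomic numbers plus bonds given as (i, j, in_ring) tuples.

MAX_HEAVY_ATOMS = 21
MAX_ATOMIC_NUMBER = 17  # chlorine

def is_candidate(atomic_numbers, bonds):
    """Apply the protocol's size, element and rigidity criteria."""
    heavy = [z for z in atomic_numbers if z > 1]   # ignore hydrogens
    if len(heavy) > MAX_HEAVY_ATOMS:
        return False
    if any(z > MAX_ATOMIC_NUMBER for z in heavy):
        return False
    # Count bonds at each atom so terminal bonds are not treated as
    # rotatable linkages.
    degree = {}
    for i, j, _ in bonds:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    # Reject any non-ring bond joining two non-terminal heavy atoms:
    # one reading of the "no acyclic-acyclic bonds" rule.
    for i, j, in_ring in bonds:
        if (not in_ring
                and atomic_numbers[i] > 1 and atomic_numbers[j] > 1
                and degree[i] > 1 and degree[j] > 1):
            return False
    return True

ring = [(0, 1, True), (1, 2, True), (2, 3, True),
        (3, 4, True), (4, 5, True), (5, 0, True)]
print(is_candidate([6] * 6, ring))  # benzene skeleton: True
print(is_candidate([6] * 4, [(0, 1, False), (1, 2, False), (2, 3, False)]))  # butane chain: False
```

A ring is rigid so it passes; a flexible chain has a rotatable central bond so it is rejected, which is the behaviour the protocol wants.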

The immediate Open task is to help annotate outliers, which are being released as soon as they are identified and we repeat – any help with this is very much appreciated.
The project that Chemspider has identified is completely distinct from Nick Day’s thesis:

“The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach.”

This has no bearing in any way on Nick’s work – it does not help certify entries as “correct”. The last sentence actually suggests that Chemspider believes their work is superior to what Nick is doing. Without a properly annotated data set such claims are meaningless.
Finally I should clarify what is meant by Open since there is confusion:

Ryan Sasaki Says:
October 24th, 2007 at 12:16 pm
Hi Peter,
While I cannot speak for Brent Lefebvre, I have a couple of comments in relation to Open Notebook NMR and the potential involvement of commercial software companies in this study.
First of all, you mention that you will probably share your insights but not your data. If that is true, then I do not understand how this project can be referred to as OPEN. If the dataset being used is not shared publicly, how can this be considered an open project?

We have so far shared every piece of data and metadata that we feel is fit to publish. Open does not mean “immediate”. The data is taken from NMRShiftDB which is Open. We have published our protocol as it is refined. We have said we are going to publish our outliers and we are doing so. When we get input from the community we shall publish more. The product will be a data set which will have community approval – along with the protocol this will be our major Open deliverables. Nick will have enough breathing space to make scientific discoveries – if any are to be made.
The biological community already operates in this way. When a lab does a protein structure they do not publish their photographs immediately. We do not expect them to, any more than we would have expected Rosalind Franklin to. But when the time comes for publication we shall publish all that is necessary to replicate the experiment – that is the key. And, along the way, we are asking the community for input. If they do so the result could be a data set that is relatively small but of very high quality and therefore useful for testing computational approaches.

Posted in nmr, open notebook science | 11 Comments

Open Notebook NMR – technical update

Two useful contributions:
Henry flagged up the importance of spin-orbit coupling before we started the calculations. He writes:

the effects can be calculated, and are somewhat basis set dependent.  For our
basis, Br should be corrected by  -12 ppm (and approx  -24 for two)  and
Cl by  -3 ppm.  S is probably  -2ppm,  and Iodine  -28 ppm.  That should
probably suffice for the halogens.

He tells me that it’s possible to calculate the effect exactly but the methodology is currently very hairy – running different programs sequentially. So the approximate averages will suffice for now. If they don’t work then we’ll adjust them or go back to calculations.
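As a concrete illustration of how such additive corrections might be applied – the per-element values are Henry’s figures quoted above, but the function and data layout are a hypothetical sketch, not our actual code:

```python
# Hypothetical sketch: apply Henry's approximate spin-orbit corrections
# (ppm) additively, once per attached Br/Cl/S/I neighbour.
SPIN_ORBIT_CORRECTION = {"Br": -12.0, "Cl": -3.0, "S": -2.0, "I": -28.0}

def corrected_shift(calc_shift, neighbour_symbols):
    """Correct a computed 13C shift for its bonded Br/Cl/S/I neighbours."""
    correction = sum(SPIN_ORBIT_CORRECTION.get(sym, 0.0)
                     for sym in neighbour_symbols)
    return calc_shift + correction

# A CBr2 carbon: two Br neighbours give -24 ppm, as in Henry's note.
print(corrected_shift(55.0, ["Br", "Br", "C", "H"]))  # → 31.0
```

The 55 ppm starting value is invented; the point is only that the two-bromine case reproduces the roughly -24 ppm that Henry quotes.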
It’s clear that it’s important from the following:

br2.PNG where the second largest deviation is due to the CBr2 fragment (the largest is the mistranscribed ClCN=O fragment). We shall introduce Henry’s offsets into the calculations and re-present the data.

  1. baoilleach Says:
    October 24th, 2007 at 1:08 pm
    The mean and variance assume a normal distribution and are sensitive to outliers. You should use the median and the inter-quartile range. This isn’t a fudge-factor – the values of the mean and variance are misleading when looking at non-normal distributions.
    Also, why are you plotting the absolute value rather than the actual value? You are throwing away interesting information by folding +ve and -ve values on top of each other.
    This diagram is per structure. Unless you suspect that particular structures have systematic errors, you should also do one per predicted shift. Presumably, particular environments of C atom are more difficult to calculate than others…?

These are perceptive and useful points. At present these are being used as a rough guide to identify structures (NOT shifts) with problems. I agree that the skewness for a structure may be important but not yet. (All structures have been fitted to their own means at present as we do not yet know how to treat the offsets – indeed Henry thinks that the calculations for TMS need careful inspection).
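Baoilleach’s point about robustness is easy to demonstrate with invented numbers: one gross misassignment swamps the mean and standard deviation but hardly moves the median and inter-quartile range. (The residuals below are made up for illustration; they are not Nick’s data.)

```python
import statistics

# Invented residuals (obs - calc, ppm) for one structure; the last value
# plays the role of a gross misassignment.
residuals = [-1.2, -0.8, -0.5, -0.2, 0.0, 0.3, 0.5, 0.9, 1.1, 75.0]

mean = statistics.mean(residuals)
stdev = statistics.stdev(residuals)
median = statistics.median(residuals)
q1, _, q3 = statistics.quantiles(residuals, n=4)

print(f"mean {mean:.2f}, stdev {stdev:.2f}")       # dominated by the outlier
print(f"median {median:.2f}, IQR {q3 - q1:.2f}")   # barely affected
```

Running this shows the mean dragged up near 7 ppm and the standard deviation above 20, while the median stays near zero and the IQR under 2 ppm – exactly the distortion baoilleach warns about.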
Recall that we are fitting the data for each structure separately to:
y = x + c + eps
Here is the spread of the c values (x-axis is 13C ppm):
c1.PNG
This shows that we are in the right ball park – if we had miscalculated TMS by 10 ppm the data would have been centred around 10. But some of the spread is due to the systematic errors of the halogens, and some to assignment errors in the structures.
So the RMS is a simple measure to identify the potential problems and ask the world to comment.
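A sketch of that per-structure fit, assuming paired lists of calculated and observed shifts – the helper and the example values are invented for illustration, not the code Nick actually runs:

```python
import math

def fit_structure(calc, obs):
    """Fit obs = calc + c for one structure; return (c, rms about the fit)."""
    residuals = [o - x for x, o in zip(calc, obs)]
    c = sum(residuals) / len(residuals)            # the per-structure offset
    rms = math.sqrt(sum((r - c) ** 2 for r in residuals) / len(residuals))
    return c, rms

# Invented shifts (ppm) for a well-behaved structure: a consistent
# ~0.7 ppm offset and a small spread about it.
calc = [128.4, 135.9, 21.2, 170.1]
obs = [129.0, 136.6, 21.9, 170.8]
c, rms = fit_structure(calc, obs)
print(f"c = {c:.2f} ppm, rms = {rms:.2f} ppm")
```

A large c would flag an accuracy problem (e.g. a miscalibrated TMS reference), while a large rms flags precision problems such as misassigned peaks.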

Posted in nmr, open notebook science | Leave a comment

Open Knowledge Foundation

Three years ago Rufus Pollock met up with me in Cambridge – I think in conjunction with concern over European legislation on copyright. He told me that he was starting “knowledge forge” – a similar approach to sourceforge, but for knowledge, not code. Rufus has enormous enthusiasm and great knowledge about many areas, and when he asked me to be on the advisory board of the OKFN I accepted, mainly because I felt he was someone who was going to help change the world for the better, rather than because I understood in detail what this was about. I’m still not sure that I do, but he has recently posted about the OKF definition:

This chemspider blog post expresses considerable uncertainty as to the respective roles and relationship of the Open (Knowledge/Data) Definition and Creative Commons. This kind of uncertainty, particularly as to whether the OD and CC are in some way competing ’standards’, is something I’ve increasingly encountered over the last year or so. I therefore really think this is something that it is important to clarify. Below is my effort to do so.

and continues:

1. The Open Knowledge/Data definition is (like it says) a definition. It is not a license. In this respect it resembles the open source definition (on which it is modelled).
2. Its aim is to lay out a set of simple principles that make it clear what we mean when we say a ‘work’ (be it a dataset or a sonnet) is ‘open’. Informally this involves providing freedom of access, reuse and redistribution to the work (or rather providing freedom of access under a license that permits these things). The full set of principles can be found in the definition.
3. Like the open source definition it has a list of ‘conformant/compatible’ licenses. These may be found at: http://opendefinition.org/licenses/.

PMR: This is EXTREMELY helpful. As Rufus says, if I want an Open Source Licence I can go to the Open Source Initiative, which lists many licences that conform, and I know that by picking one (I generally use Artistic) I will conform to the OSI definition. Here is the list of licences which conform to the Open Knowledge definition:

  1. Conformant Licenses
    1. ‘MIT’ Database License
      1. Full Text
      2. Comments
      3. How to Apply
    2. Creative Commons Attribution License (cc-by)
    3. Creative Commons Attribution Share-Alike (cc-by-sa)
    4. Design Science License
    5. Free Art License
    6. GNU Free Documentation License (GFDL)
    7. Talis Community License (TCL)
    8. UK PSI (Public Sector Information) Click-Use Licence

The point is that by choosing ANY one of these I know that I will conform to the principles of the definition. I therefore regard the definition as a meta-licence (I don’t know whether this is a neologism) in the same way as XML is a meta-markup-language that allows the construction of markup languages and gives rules for their conformance.
and he adds more comment:

4. This is unlike Creative Commons whose explicit aim is to provide licenses. While all of the CC licenses are more ‘liberal’ (or ‘open’ even) than traditional copyright not all of the licenses are ‘open’ in the sense of the Definition.
5. This is not surprising — CC is about providing license choice and flexibility, not about providing a consistent set of licenses embodying a particular approach. In particular it is not the case that a particular CC license is ‘compatible’ with a given other CC license in the sense that one can intermix material made available under the different licenses. For example, any CC non-commercial license is incompatible with the CC Attribution-ShareAlike (by-sa) license.
6. By contrast one would hope and expect that any license which is conformant with the Open Knowledge/Data Definition would be compatible with any other such license — in the sense that one could freely combine two separate works made available under (different) open licenses together. This is important as one of the major benefits of an openness is to permit freedom of sharing and reuse in the open knowledge ‘commons’. Again this is very similar to the situation with the Open Source Definition.
7. Thus, in my opinion, the Definition is not a rival to Creative Commons but a complement which seeks to do something different. In particular the Definition does not develop licenses but CC does (many of which are conformant with the Definition). CC does not attempt to define a ’standard’ but the Definition clearly does. By linking to a CC license you are saying: my stuff is available under this specific license. When you link to the Open Definition you are saying: my stuff meets this general standard.

PMR: here are some well-known licences that do not conform:

  1. Non-Conformant Licenses
    1. Creative Commons No-Derivatives Licenses
    2. Creative Commons NonCommercial
    3. Project Gutenberg License

The reason why CC-NC does not conform is that it restricts the full use of the content or object. That is why I don’t generally feel it is useful for science. (This blog is CC-BY, I just have to get the button changed).

RP: As an aside: I think this is where some people may get misled by the Creative Commons name, since the set of CC licenses do not (necessarily) result in the creation of a “commons” – works made available under different CC licenses cannot necessarily be mixed together. (This is not a criticism of CC, by the way. At least in terms of licenses, CC is about a wide choice. However it is noteworthy that recent CC projects such as ccLearn have, I believe, explicitly focused on a particular (open) license – in ccLearn’s case CC Attribution).

This is a separate but important point – I suspect it is difficult, though not impossible, to create a commons solely using licences, even under CC-BY. It needs the political, economic, sociological and philosophical dimensions as well.
So do I eat my own dog food? There is nothing that says I have to share everything with everyone all the time and I hope this is generally realised. However there should not be obvious conflicts where these are avoidable. I publish in non-Open journals, but I publish in Open ones where possible. I do not require my colleagues to adopt my views. However being an advocate of Open Data I have to practice it where possible.
Open Data has turned out to be much harder than I thought. So I am extremely grateful for the OKFN collection of Open Knowledge licences. It’s certainly premature to say “one is the best for science”, so we labelled CrystalEye – WWMM with the OKFN “Open Data” rather than a licence. In a sense that says:
“at this stage intention is more important than details”
Yes, the data can be abused and yes, we couldn’t defend it in court very easily. But we are primarily interested in doing science and we’ll take that risk.

Posted in data, open issues | Leave a comment

Open Notebook NMR – variance is both experimental and theoretical

When making claims in foo-metrics and foo-informatics it is essential to have access to the data and the algorithms used. That’s why, for example, Peter Corbett and Sciborg colleagues are so careful in constructing their corpus. In developing our Open Notebook NMR we have to do the same for our data. As I have explained earlier, variance can come from many sources and all must be examined. So here is the first pass at our raw data – a histogram of the (absolute) internal RMS within a structure, fitted to
y = x + c + eps
where x has units of 13C ppm.
rmsd1.PNG
It’s obvious that we should throw away the outlier, isn’t it?
But we cannot – absolutely must not – unless we have good reason to do so. Until then it contributes to the variance.

Posted in Uncategorized | 1 Comment

Well Done OA

Peter Suber’s latest post needs no comment:

Tonight the Senate passed the Labor-HHS appropriations bill containing the provision to mandate OA at the NIH.  More, the vote was a veto-proof 75-19.
Comments
  • Neither of the harmful Inhofe amendments was part of the final bill.
  • Yes, this is big, even if we cleared this hurdle only to face a Bush veto.
  • When the same language was adopted by the House (July 19, 2007), it only received 276 votes, when it needed 290 to be veto-proof.  Hence, it’s not at all clear that the full Congress will be able to override a Bush veto, something both sides know very well.  However, as we go into post-veto strategies, we’re much better off with this language having passed both houses than having passed only one.  More later.
But it does deserve a (CAUTIOUS) celebration. We’ll be at the Panton Arms as usual on Friday lunchtime – 1230.
Posted in open issues | Leave a comment

Open Notebook NMR – the good and the ugly

We’ve started going through the structures in serial order. Here are two in the first 4 I looked at. One shows near perfect agreement, the other is frankly awful.
Here’s the link to NMRShiftDB: nmrshiftdb2470-1 (solvent: chloroform)
2470.PNG
You can see that the RMS is of the order of 1 ppm (we haven’t yet calibrated the offset so it’s improper to give a clearer figure).
Now here is one that is all over the shop: nmrshiftdb2275-1 (solvent: acetone)
2275.PNG
Here is the Additional data on NMRShiftDB:

Spectral Data Additional Data
Molecule (2275)
http://www.nmrshiftdb.org:8080/portal/_sdathe_Fri Oct 04 10:25:31 CEST 2002.002.ms
Chemical name(s) N-Mercapto-4-formylcarbostyril
Molecular weight 205.233
Number of all rings, size of smallest set of smallest rings 3, 2
CAS-Number  
Molecule keywords bacteria, antibiotics, Pseudomonas fluorescens G308,
Type 13C
Temperature [K] 298
Solvent Acetone-D6 ((CD3)2CO)
Additional comments antifungal effects in vitro against several plant pathogenic fungi; Multiplicities generated automatically from H count

I have no idea what is going on. Nor has Christoph. So here is your chance to solve the problem. Here are some ideas I suggested:

  • spectrum and compound got muddled.
  • compound is “wrong”. I am not clear whether N-SH groups are stable but I don’t think they are common.
  • there is some serious tautomerism going on.
  • this compound disobeys the laws of physics

Anyway you know as much as we do now and you have a link to the entry and the literature. Please add your comments below and if you do discover something fantastic remember where you found the data.

Posted in nmr, open issues, open notebook science | 6 Comments

Open Notebook NMR: Commercial re-use of data?

Antony Williams of Chemspider has offered to participate in our Open Notebook NMR experiment. Now this offer has been joined by ACDLabs – I am not sure of the formal relationship between the companies but they have clear common interests. I had originally thought this was one individual making a personal offer – now there is a company that is requesting our data for them to work on. I have some genuine concerns about how we should proceed, so I am clearing my thoughts online. Readers will recall that I have been strongly critical of companies or nonprofits which use closed source and protect closed data. I have roughly equal numbers of correspondents who think I have been too hard on these organizations and those who think I need to be tougher. So I am taking a measured tone here.

Peter – FYI ACD/Labs are ready to participate in the work as discussed: http://www.chemspider.com/blog/?p=213#comment-3735
Peter and Tony, I think this is a fantastic project and am very keen to see how accurate the QM techniques prove to be for the subset of structures that you choose from the NMRShiftDB, and then how helpful they can be in improving the accuracy of experimental shifts in this wonderful resource. For the purposes of this work, we would be willing to provide the chemical shift predictions from the ACD/Labs software if you would like to use them in your comparison. If, for instance, they prove to be accurate enough to find many of these problems without the need for time consuming QM calculations, it may be preferable to use the faster calculation algorithms that are available in our software. It may turn out that the ACD/Labs predictions could serve as a pre-filter to define which structures need the QM calculations and which don’t. Many variations on this theme come to mind, but we won’t know which are useful until we do the work.
Sincerely,
Brent Lefebvre
NMR Product Manager
Advanced Chemistry Development, Inc.

and Antony also requests the data:

Peter, all that is needed to perform the calculations for comparison using the ACD/Labs NMR predictors is a download of the exact dataset Christoph provided to you (we have already had issues with comparing algorithm to algorithm but using different versions of the NMRShiftDB database…not good). Also, if Nick can send us the ID of the structure inside the NMRShiftDB this should be enough. Thanks

and he points to how he/ACDLabs can predict chemical shifts:

Also, plots of the nature shown at the following page http://www.acdlabs.com/products/spec_lab/predict_nmr/chemnmr/ are generally of value when comparing improvements in algorithm performance. It’s a good way to compare your own improvements in algorithm (or Henry’s) as well as between algorithms.

The link here points to a report of NMR calculations done with ACDLabs software and ChemNMR (CambridgeSoft). Note that both companies have developed their own software for chemical shift predictions, but the software is closed in that it can only be used if purchased, and anyway the algorithms are not visible. Note that CambridgeSoft patents algorithms – I don’t know whether ACDLabs does. The data are also closed in that they are not generally available and I have no idea of their general chemical makeup. From the page about calculating chemical shifts:

During the history of the development of the ACD/Labs NMR prediction tools we have compiled and checked data from thousands of literature articles to build databases of over 165,000 assigned structures for 1H and 13C, 8000 for 15N, 13,800 for 19F, and 22,600 for 31P. With these assignments and structures as a basis and using our correlation algorithms developed over the past ten years, it should be obvious that our fragment-based predictions will offer superior performance to any rules-based systems. Fragment-based prediction offers the opportunity to cover wide ranges of structural diversity not available to rules-based systems. It is this type of performance that has enabled Advanced Chemistry Development to become the industry leader in NMR prediction today. Rigorous testing of our capabilities relative to other software packages has resulted in our software becoming the package of choice on a worldwide level for companies such as Pfizer, Astra Zeneca, GlaxoSmithKline, 3M, Eastman Kodak Co., Millennium Pharmaceuticals and many others.

and the results:

13C NMR Prediction Comparison

                             r      r2     Standard Error (ppm)   Predicted 13C Chemical Shifts
ACD/CNMR 8.0                 0.999  0.998  2.33                   68,129
CambridgeSoft ChemDraw 8.0   0.995  0.990  5.30                   67,841

Table 2: Statistics comparing the accuracy of ACD/CNMR predictions to CambridgeSoft

Once again the table illustrates the remarkable accuracy advantage of ACD/CNMR as its error is less than half that of ChemNMR. Fully 64% of the predictions from CNMR are within 1 ppm of the experimental shift as opposed to only 32% in ChemNMR (Figure 2).

I do not know if the work is formally published in the scientific manner but it is available from an unusual source:

Please request the complete comparison document from your account manager or email sales@acdlabs.com.

This clearly raises issues that need agreeing, so I’m asking the blogosphere for comment.

  • If ACDLabs use their software to make calculations on our data what is the value? We cannot repeat the calculation and have no independent assessment of whether it is valid. I’m not accusing them of fudging the results but it is extremely easy to get good results by (even unconscious) selection of data. We have heard that formal analysis of chemoinformatics papers shows that many are suspect because of over- or under-fitting. Data sets and algorithms are never available, whereas we wish to make all our approach public.
  • If we do make our approach and data public then they can be assimilated into ACDLabs software and operations. In principle I’m not too worried about that as long as the Open Data is honoured. But I am more concerned that ideas will be assimilated without credit – this is a general problem of Open Notebooks and one we all have to address. This includes Henry’s ideas as well as Nick’s. Note that many people deliberately use CC-NC licences (non-commercial) on their data. I have resisted this (the CC-NC on the blog is simply because I haven’t found out how to take it down and replace it with CC-BY).
  • If they take our data they are in a position to scoop us if, as they say, they have the best software in the world and much more data.
  • I am worried by the high level of marketing in their scientific report. While it is perfectly reasonable for a company to promote its products and also reasonable to show that they outperform others there is a danger that the primary motivation is to use the current exercise for marketing rather than science and we would not wish to be associated with this.

These are clear issues but they cannot be solved easily. I am actually not clear what ACDLabs wishes to do – their data is a superset of ours as they have had access to NMRShiftDB like everyone else. I suspect our data will be cleaner as it is not easy to validate 150,000+ entries (although we have managed this with CrystalEye). And I’m not quite clear what we get out of it – we can calculate empirical (fragment-based) shifts with Christoph’s software – at least enough for a reasonable filter.
But I welcome comments.

Posted in nmr, open issues, open notebook science | 8 Comments

Open Notebook NMR – interesting outlier(s)

Here is an interesting outlier which caught me out, until Christoph explained it. Here are two spectra. They are both from the same source:
10009121.PNG
(There are 3 peaks – note the disagreement)
10006328.PNG
Also 3 peaks.
Here is the metadata for the second one (the first is similar and also from DKFZ spektren).

Spectral Data Additional Data
Molecule (10006328)
www.nmrshiftdb.org_dkfz_2003/02/28_04:43:20_0096
Chemical name(s) 1,2-CYCLOHEXANEDIONE,ENOL FORM
Molecular weight 112.127
Number of all rings, size of smallest set of smallest rings 1, 1
CAS-Number  
Molecule keywords  
Type 13C
Temperature [K] 298
Field Strength [MHz] Unreported
Additional comments DKFZ ‘spektren’ database: TMS/; No temperature information available, assumed 298 K.; Multiplicities generated automatically from H count
Spectrum categories dkfz spektren database

What’s going on? Can you reconcile the apparent discrepancies? The actual “error” was one of the many possibilities we listed before we started.

Posted in nmr, open notebook science | Leave a comment

Open Notebook NMR – Outliers

Egon asks:

  1. Egon Willighagen Says:
    October 23rd, 2007 at 4:00 pm
    Where, in which wiki, can I find the outliers? That would allow people to indicate problems, and possibly annotate existing publications with ratings (”this article has an incorrectly assigned NMR spectrum”).

Egon and others… We met with Christoph, Nick and Jim today and talked with Henry on the phone. What we plan is roughly:

  • Nick has worked out the variance of each structure (precision), and also its offset from the origin (accuracy). Serious errors usually affect both of these. We hope we can find a set of outliers that primarily show variance, because these will be interpretable (accuracy may be due to effects we cannot see, such as machine settings).
  • There are some outliers due to known systematic errors that Henry has analysed and will correct, so we won’t be publishing these until we have made the corrections.
  • We shall then start to publish the outliers as we extract them. Some will have known problems and we shall indicate these with our error categories. These will be available for anyone to comment on and we believe we can make good educational use of this.

So here is the first and worst outlier:
10006060.PNG
The difference is enormous – 135 calc vs. 60 obs. So Henry went back to the original paper and found:
nmr1.jpg
The compound is 2a. (I am offline and it would cost 30 USD to read the paper, so I will take Henry’s word.) It is clear that the observed peak is 122.5, not 60.
We’ll be releasing further outliers as we go. Ideally these should be on a wiki and we should provide identifiers. Initially we had thought about making the whole data set available, but since companies have requested all the data to compute on, we have had to think about the data release strategy more carefully. We’ll have to split the data between public and private and this will take time.

Posted in data, nmr, open notebook science | Leave a comment