We are delighted by the practical and helpful contributions from members of the community towards understanding or correcting outliers in the data set we are using. This is exactly what we hoped would happen at the start of the project, and it has now started to gain momentum. I list some of them below to acknowledge the help. It also highlights the need for better tools for such collaborative projects: a blog is a poor mechanism, but wikis also have their failings.
To reiterate:
- Nick has been through the dataset by hand and identified all data sets with potential misassignments or other anomalies. This was done by comparing the agreement within each set. A data set is likely to have been flagged if (a) it has a single widely outlying shift, (b) two peaks (a, b) have coordinates yb, xa (as we have shown), giving an “X”-like pattern, or (c) it has a general scatter considerably greater than the average.
- Nick will post the major outliers based on RMSD. I don’t know how many there will be but I expect about 50 (hence the “20%”). These will be clickable, i.e. anyone with an SVG browser can immediately find out which peak is linked to which atom.
- After, and only after, these have been cleaned or accepted we will try to see if there are systematic effects in the data, in either the variance or the precision. We might expect much of the variation to come from the source of the data, or the date, the field strength, the temperature, or the solvent. Unfortunately we do not have all the metadata as it isn’t present in the CMLSpect files.
- Finally we may be able to comment on Henry’s method. It is possible that certain functional groups have problems (Nick has some suspicions) but at present these are overwhelmed by variance from other sources in the experiment or its capture.
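The three flagging criteria and the RMSD ranking in the list above can be sketched roughly in Python. This is only an illustration, not the procedure Nick actually used (he went through the data by hand): the function names, the data layout (parallel lists of observed and calculated shifts in ppm) and every threshold are assumptions made for the example.

```python
import math


def rmsd(observed, calculated):
    """Root-mean-square deviation between paired shifts (ppm)."""
    diffs = [o - c for o, c in zip(observed, calculated)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def single_outlier(observed, calculated, big=10.0, small=3.0):
    """Criterion (a): index of the only peak off by more than `big` ppm
    while every other peak agrees within `small` ppm, else None.
    Both thresholds are assumed values."""
    errors = [abs(o - c) for o, c in zip(observed, calculated)]
    large = [i for i, e in enumerate(errors) if e > big]
    rest_ok = all(e <= small for i, e in enumerate(errors) if i not in large)
    return large[0] if len(large) == 1 and rest_ok else None


def swapped_pairs(observed, calculated, tol=3.0):
    """Criterion (b): pairs (a, b) that disagree as assigned but agree
    once the two assignments are exchanged (the "X"-like pattern).
    `tol` is an assumed tolerance in ppm."""
    pairs = []
    n = len(observed)
    for a in range(n):
        for b in range(a + 1, n):
            bad_as_assigned = (abs(observed[a] - calculated[a]) > tol
                               and abs(observed[b] - calculated[b]) > tol)
            good_exchanged = (abs(observed[a] - calculated[b]) <= tol
                              and abs(observed[b] - calculated[a]) <= tol)
            if bad_as_assigned and good_exchanged:
                pairs.append((a, b))
    return pairs


def rank_by_rmsd(datasets):
    """Criterion (c) and the posting order: data set names sorted by
    RMSD, largest first. `datasets` maps a name to a pair
    (observed, calculated); this layout is hypothetical."""
    return sorted(datasets, key=lambda name: rmsd(*datasets[name]),
                  reverse=True)
```

With made-up shifts such as `observed = [20.1, 52.0, 38.5, 128.3]` and `calculated = [21.0, 39.2, 51.1, 127.5]`, `swapped_pairs` reports the second and third peaks as a candidate exchange, while `single_outlier` returns nothing because two peaks, not one, are far out. A real check would need thresholds tuned to the data and a human look at each flagged set.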
So here are examples of useful comments. (I am not sure why Pachyclavulide-A is relevant – I can’t find it by name search in NMRShiftDB – but the effort is appreciated. However we are primarily looking for comments on the outliers we have identified.)
October 26th, 2007 at 1:15 am
The first one is another misassignment. Look up the structure in the NMRShiftDB and you will see one correctly assigned and one misassigned spectrum. This kind of issue should be filed as a ‘data’ bug report at:
http://sourceforge.net/tracker/?atid=560728&group_id=20485&func=browse
I will do this one.
October 26th, 2007 at 1:17 am
Filed as:
http://sourceforge.net/tracker/index.php?func=detail&aid=1820353&group_id=20485&atid=560728
October 26th, 2007 at 8:48 am
Another error: Pachyclavulide-A (should be C26 instead of C27), MW=510
Found automatically by the following procedure within CSEARCH:
Search all unassigned methylgroups located at a ring junction. The methylgroup must be connected either with an up or down bond. As an additional condition, it can be specified if only “Q’s” are missing or if the multiplicity of missing lines can be ignored. I think a quite sophisticated check which goes into deep details of possible error sources. […]
October 27th, 2007 at 12:02 pm
Misassignments NMRShiftDB (10008656-2) removed.
October 27th, 2007 at 5:28 pm
Misassignments NMRShiftDB (10006416-2) removed. 45.0 and 34.4 reversed.