Wolfgang Robien has posted some valuable comments and I think this gives us a positive way forward. I won’t comment line by line but refer you to the links. For background, Wolfgang suggests that I have a religious take on this and am trying to impose it on the NMR community, which already has adequate and self-sufficient processes. [In all this we differentiate macromolecular/bioscience from smallMolecule/chemistry, which have completely different ethics and practice. Here we refer only to small molecules.] I am not religious about NMR.
I’ll start by saying that I think Wolfgang and I may have very significant common ground and this is an attempt to address it. I also think that our differences are confined to different fields of endeavour. In summary I believe that:
- NMR data are published in non-semantic ways (PDF, etc.) and that this destroys much useful machine-interpretable information. By contrast crystallography is semantic and the quality at time of publication is very much higher.
- A significant number of papers contain NMR data which do not correspond exactly to the structures – often referred to as “wrong”. By contrast this hardly ever happens in crystallography.
- Crystallographic data is subject to intense validation before publication and the algorithms and code are freely available. This has raised the quality of crystallography over the last 15 years and the data in crystalEye show this clearly. With the advent of computational methods in NMR (whether HOSE or GIAO) it should be possible to carry out similar validation before publication.
- The crystallographic data as published constitute a global knowledgebase which can be re-used in many ways in a semantically valid framework. This is currently not possible for NMR but it could be if the community wished it.
Wolfgang mentions religiosity – I try not to be but the publishing community is rapidly fracturing over the Open-Closed line and I personally see this as having little middle ground. Others disagree. I am insistent that the words “Open Access” be used in a manner which is consistent with the Open Access definitions, just as for Open Source. There is a tendency for people to describe resources as Open when they do not conform to the definition. I hold the same view for Open Data.
Where I think we have common ground is that we both agree that:
- there are too many publications where the NMR-structure is simply wrong
- it would be possible to validate many of these using software
- that it would be useful to publish the spectra in semantic form rather than text and PDFs. (Wolfgang may disagree here and see value in having the data retyped by humans, and if so I’d like to see the case. In practice we have shown that the data can go straight from the instrument to the repository without semantic loss, but that the business processes are not yet clear).
In principle I would be very happy to collaborate on developing an NMR protocol which would validate data in publications. I think we would need a variety of methods and data resources. We can’t do this in Nick Day’s project and I can’t speak for Henry, but it sounds promising. Methods like this exist for crystallography and thermochemistry (ThermoML). Spectroscopy and computational chemistry are the most tractable and valuable next steps.
One reason we used NMRShiftDB was that we knew that the data were heterogeneous and possibly contained errors. This simulated what we might find in publications. We can use our OSCAR and other software to extract spectra and structures from the literature, though the assignments are harder without explicit numbering schemes in connection tables. Clearly the requirements on analysing questionable data and creating a validation procedure are more difficult in this case but we are prepared to defend it.
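To make the extraction step concrete, here is a minimal sketch of pulling a peak list out of one common text reporting style. It is not how OSCAR works; the regular expression, the `PeakList` record and the `parse_peak_list` function are hypothetical names invented for this illustration, and the pattern covers only a single statement of the form "13C NMR (solvent): δ a, b, c". Note that it recovers shifts but no assignments, which is the harder part mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeakList:
    """Hypothetical structured record for one reported spectrum."""
    nucleus: str
    solvent: str
    shifts_ppm: list


# One number such as 170.1 or -3.5
_NUM = r"-?\d+(?:\.\d+)?"

# Assumed reporting style: "13C NMR (CDCl3): δ 170.1, 128.5, 77.2"
PATTERN = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)\s+NMR\s+\((?P<solvent>[^)]+)\)\s*:?\s*(?:δ|delta)?\s*"
    rf"(?P<shifts>{_NUM}(?:\s*,\s*{_NUM})*)"
)


def parse_peak_list(text: str) -> Optional[PeakList]:
    """Return the first peak list found in text, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    shifts = [float(tok) for tok in re.findall(_NUM, match.group("shifts"))]
    return PeakList(match.group("nucleus"), match.group("solvent").strip(), shifts)
```

Real experimental sections vary far more than this (multiplicities, coupling constants, ranges, inline assignments), so a pattern like this should be read as an illustration of why text is a lossy carrier, not as a working extractor.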
Ultimately my vision is that all NMR in journals would be validated and in semantic form (e.g. CMLSpect) before being published. Other disciplines have already achieved it, so it’s a matter of communal will rather than absence of technology. I think we have a mutual way forward, though not in the timescale of Nick Day’s thesis.
Wolfgang Robien Says:
[links to comments broken in WordPress]
WR:
OK, you are not a NMR-spectroscopist, but you want to liberate NMR data from the pages of the journals:
PMR: This is exactly right. It is virtually the sole motivation for this work. Anything else (NMRShiftDB/WR, GIAO/HOSE-NN) is secondary. It is also coupled to the capture of data from eTheses (the SPECTRa and SPECTRa-T projects) where we have shown that most data rapidly gets lost. It is about validation, semantic quality, dissemination, preservation, and closely tied to the capture of academic output in institutional and other repositories.
WR: There are so many people around working in this field, who are doing excellent science
PMR: I am unaware of major scientific laboratories that are making major efforts to change the way that NMR spectra are published in journals or theses or captured in repositories. I do claim to be aware of semantic scientific publication and repositories and am regularly invited by both the Open and Closed publishers to talk about this. If there is major work ongoing in pre-publication validation and semantic output of NMR I haven’t heard of it.
Peter: The links in this article point to other posts – something wrong
Please feel free to delete this message after reading – it ‘dilutes’ the on-topic discussion …
PMR: NMR data are published in non-semantic ways (PDF, etc.) and that this destroys much useful machine-interpretable information.
WR: YES – agreed
PMR: By contrast crystallography is semantic and the quality at time of publication is very much higher.
WR: NO surprise – the XRAY-machine gives you a structure, the NMR-machine a spectrum, which must be translated into a structure ! There are a few programs around, which can deal with fairly large ‘small molecules’ (e.g. http://www.acdlabs.com / SESAMI – Morton Munk, Arizona State / MOLGEN – Kerber, University of Bayreuth, etc. ) – there was a lot of progress during the past few years ….
In XRAY there is a well-defined mathematical formalism for how to go from data to structures, in NMR there is not.
PMR: A significant number of papers contain NMR data which do not correspond exactly to the structures – often referred to as “wrong”. By contrast this hardly ever happens in crystallography.
WR: Same argument / either you have n triples (x,y,z) = structure, or you have nothing / must be (nearly) error-free
PMR: Crystallographic data is subject to intense validation before publication
WR: Validation in NMR is a ‘hot’ topic – see e.g. Ryan Sasaki blog ( http://www.acdlabs.typepad.com ) – validation routines are already available e.g. with ACD software and Bruker software presented a few weeks ago, others will come soon. The right place is directly at the spectrometer (or the place where you do the Fourier transformation, etc.) – I have already offered a service during the 90’s where you could send a structure to a dedicated email-address and got the predicted spectrum back ….
PMR: there are too many publications where the NMR-structure is simply wrong
WR: YES – agreed, see also the statistical evaluation from Ryan Sasaki’s blog, what ACD has found with respect to this question
PMR: it would be possible to validate many of these using software
WR: This statement is partly wrong – the correct version according to my opinion is:
It has already been possible for nearly 30 years to apply existing prediction software to
validate structural proposals (you can substitute ‘prediction software’ by ‘protocols’
or ‘HOSE-technology’ ….. )
PMR: that it would be useful to publish the spectra in semantic form rather than text and PDFs. (Wolfgang may disagree here and see value in having the data retyped by humans, and if so I’d like to see the case.)
WR: Would be great to have millions of spectra in machine-readable form ….. My ‘wage slaves’ are well-educated and have a lot of experience, which I definitely honor ….. the ‘typing’ itself is less than 30% of the whole production process, the rest is evaluation/correction by excellent software support ( the robots, you are talking about … )
PMR: In practice we have shown that the data can go straight from the instrument to the repository without semantic loss, but that the business processes are not yet clear.
WR: When you want an archive it’s OK, if you want a ‘state-of-the-art’ database of well-assigned spectra, you need highly trained ‘editors’/’evaluators’ AND excellent software support !
PMR: One reason we used NMRShiftDB was that we knew that the data were heterogeneous and possibly contained errors.
WR: I thought, you had NO idea 😉
PMR: … used NMRShiftDB …. and possibly contained errors ….
WR: For sure ! Has been shown by Antony Williams and me ……
NMRShiftDB:
Please don’t think that I feel ‘attacked’ by NMRShiftDB – I definitely do not ! I like competition – after my analysis and all the subsequent stuff, improvements have been built into ACDLABS prediction software and also into NMRPredict. There is now a visible improvement in 2 prediction packages – we had a different opinion about the effect of the errors in NMRShiftDB, but we both were able to find them, because of our excellent software-support developed over years.
The NMRShiftDB-people have corrected what has been found by Tony Williams and me.
I agree that NMRShiftDB is a resource for NMR-spectra with a certain value for the community. It has been created by many volunteers, that’s great too. I have problems in understanding the philosophy behind it and the project management – that was the main-point of my original post, the data errors found were simply the last motivation for going public with my analysis:
Facts:
Project registered since 2001-02-13 06:10 on http://sourceforge.net/projects/nmrshiftdb
Project online over the period 2002-2006 (see homepage, footer)
No criteria for data checking (I asked at least twice for this information about data checking protocols – DATA CHECKING PROTOCOLS ARE THE MAIN POINT OF OUR DISCUSSION HERE – no answer till now)
What was available 2001 or afterwards:
HOSE-code prediction including solvent dependency and stereochemistry (CSEARCH,ACD,CAST-code)
Implemented into NMRShiftDB: NO (you might use here as argument ‘not open source’)
Now let’s go to ‘open source’: There is CDK available (this is Christoph’s ‘baby’, isn’t it ?), one of the ‘pearls’ is ‘SDG’ (structure diagram generator), which is mentioned very frequently and seems to be of high quality according to the feedback I have seen. Why, the hell, are structures in NMRShiftDB so ugly ? (e.g. steroids, linalool, etc. – I am NOT talking about the overlap between the structure and the numbering, this is a different topic – I am just talking about the display coordinates themselves). NMRShiftDB is also Christoph’s ‘baby’ – that means, that the PROJECT MANAGER himself is not able to run his OWN PROGRAM over his OWN DATA in order to generate a nice layout.
my message is simple:
I would never use voluntariness as an excuse for nonprofessional methodology – either do a job or keep your fingers off. When you decide to do it, do it 100%.
There is a certain ‘state-of-the-art’ in the field of structure verification, prediction technologies and data collections – I agree this is an active field with a lot of changes within a short time, but implementing protocols which are not based on this ‘state-of-the-art’ is a waste of time and resources. The people at the forefront of this field are known …..
Sorry for being so harsh during the last few paragraphs, but I want to do serious science, I like competition – Captain Kirk is another topic ……
(2) I am not worried about harshness – because what you are criticizing is not what I am proposing. I am suggesting that authors are responsible for checking their data – as far as possible – before publication, and that technical editors should require it. So, for example,
# hko Says:
October 29th, 2007 at 10:04 pm
Comment to: NMRShiftdB Molecule (2275)
http://www.nmrshiftdb.org:8080/portal/_sdathe_Fri Oct 04 10:25:31 CEST 2002.002.ms
Chemical name(s) N-Mercapto-4-formylcarbostyril
Probably wrong structure. N-OH and C=S group would better correspond to shifts
than of N-SH and C=O group.
Assignments of some shifts are wrong.
PMR: I agree with this assessment. The structure is probably wrong. I checked the paper and the structure and assignments are correctly transcribed. My point is that the *author* should and could have checked this – using your software, Bruker’s software, CDK-HOSE – I don’t care. All of them would have suggested something was wrong. As it is we have a wrong structure in the literature, which could have been avoided.
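A minimal sketch of what such an author-side check could look like, assuming the predicted shifts have already been obtained from whichever predictor is to hand (HOSE-based, GIAO, or a commercial package). The code does not predict anything itself. The `ShiftCheck` record, the function names and the 10 ppm tolerance are all assumptions made for illustration, not an established protocol, and the numbers in the usage example are invented; they are not taken from entry 2275.

```python
from dataclasses import dataclass


@dataclass
class ShiftCheck:
    """One assigned atom: the reported shift and a predicted shift (both in ppm)."""
    atom: str
    observed_ppm: float
    predicted_ppm: float

    @property
    def deviation(self) -> float:
        # Absolute difference between reported and predicted shift
        return abs(self.observed_ppm - self.predicted_ppm)


def flag_suspect_assignments(checks, tolerance_ppm=10.0):
    """Return the checks whose deviation exceeds the (arbitrary) tolerance."""
    return [c for c in checks if c.deviation > tolerance_ppm]


def mean_absolute_deviation(checks):
    """Average deviation over all assigned atoms; a crude whole-structure score."""
    return sum(c.deviation for c in checks) / len(checks)


# Usage with invented numbers, for illustration only
checks = [
    ShiftCheck("C2", observed_ppm=175.0, predicted_ppm=161.0),
    ShiftCheck("C4", observed_ppm=140.0, predicted_ppm=138.5),
]
suspects = flag_suspect_assignments(checks)
```

A flagged atom does not prove the structure wrong; it only tells the author or technical editor where to look before the paper goes out. A sensible tolerance would depend on the nucleus and on the predictor used.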