Wolfgang Robien has posted some valuable comments and I think this gives us a positive way forward. I won’t comment line by line but refer you to the links. For background, Wolfgang suggests that I have a religious take on this and am trying to impose it on the NMR community, which already has adequate and self-sufficient processes. [In all this we differentiate macromolecular/bioscience from smallMolecule/chemistry, which have completely different ethics and practice. Here we refer only to small molecules.] I am not religious about NMR.
I’ll start by saying that I think Wolfgang and I may have very significant common ground and this is an attempt to explore it. I also think that our differences are confined to different fields of endeavour. In summary I believe that:
- NMR data are published in non-semantic ways (PDF, etc.) and that this destroys much useful machine-interpretable information. By contrast crystallography is semantic and the quality at time of publication is very much higher.
- A significant number of papers contain NMR data which do not correspond exactly to the structures – often referred to as “wrong”. By contrast this hardly ever happens in crystallography.
- Crystallographic data are subject to intense validation before publication and the algorithms and code are freely available. This has raised the quality of crystallography over the last 15 years and the data in CrystalEye show this clearly. With the advent of computational methods in NMR (whether HOSE or GIAO) it should be possible to carry out similar validation before publication.
- The crystallographic data as published constitute a global knowledgebase which can be re-used in many ways in a semantically valid framework. This is currently not possible for NMR but it could be if the community wished it.
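To make the idea of pre-publication validation concrete, here is a minimal sketch of the kind of check I have in mind: compare each observed shift with a predicted one and flag large disagreements. It assumes the predictions already exist (from a HOSE-code lookup or a GIAO calculation, neither of which is implemented here). The function names, the atom numbering and the 5 ppm tolerance are illustrative assumptions, not an established protocol.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShiftCheck:
    """One atom's observed and predicted chemical shift, in ppm."""
    atom: int
    observed: float
    predicted: float

    @property
    def deviation(self) -> float:
        return abs(self.observed - self.predicted)


def flag_outliers(observed: Dict[int, float],
                  predicted: Dict[int, float],
                  tolerance_ppm: float = 5.0) -> List[ShiftCheck]:
    """Return atoms whose observed shift differs from the prediction
    by more than tolerance_ppm (an illustrative threshold)."""
    checks = [ShiftCheck(atom, observed[atom], predicted[atom])
              for atom in sorted(observed) if atom in predicted]
    return [c for c in checks if c.deviation > tolerance_ppm]


def rmsd(observed: Dict[int, float], predicted: Dict[int, float]) -> float:
    """Root-mean-square deviation over atoms present in both mappings."""
    common = [atom for atom in observed if atom in predicted]
    if not common:
        raise ValueError("no atoms in common between observed and predicted")
    total = sum((observed[a] - predicted[a]) ** 2 for a in common)
    return math.sqrt(total / len(common))


if __name__ == "__main__":
    # Hypothetical 13C shifts keyed by atom number.
    obs = {1: 128.5, 2: 77.0, 3: 20.1}
    pred = {1: 129.0, 2: 60.0, 3: 21.0}
    for check in flag_outliers(obs, pred):
        print(f"atom {check.atom}: deviation {check.deviation:.1f} ppm")
    print(f"RMSD {rmsd(obs, pred):.2f} ppm")
```

A flagged atom does not prove the structure is wrong. It could equally be a misassignment, a transcription error or a poor prediction, which is why a real protocol would need several methods and data resources.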
Wolfgang mentions religiosity – I try not to be but the publishing community is rapidly fracturing over the Open-Closed line and I personally see this as having little middle ground. Others disagree. I am insistent that the words “Open Access” be used in a manner which is consistent with the Open Access definitions, just as for Open Source. There is a tendency for people to describe resources as Open when they do not conform to the definition. I hold the same view for Open Data.
Where I think we have common ground is that we both agree that:
- there are too many publications where the NMR-structure is simply wrong
- it would be possible to validate many of these using software
- it would be useful to publish the spectra in semantic form rather than as text and PDFs. (Wolfgang may disagree here and see value in having the data retyped by humans, and if so I’d like to see the case. In practice we have shown that the data can go straight from the instrument to the repository without semantic loss, but the business processes are not yet clear.)
In principle I would be very happy to collaborate on developing an NMR protocol which would validate data in publications. I think we would need a variety of methods and data resources. We can’t do this in Nick Day’s project and I can’t speak for Henry, but it sounds promising. Methods like this exist for crystallography and thermochemistry (ThermoML). Spectroscopy and computational chemistry are the most tractable and valuable next steps.
One reason we used NMRShiftDB was that we knew that the data were heterogeneous and possibly contained errors. This simulated what we might find in publications. We can use our OSCAR and other software to extract spectra and structures from the literature, though the assignments are harder without explicit numbering schemes in connection tables. Clearly, analysing questionable data and creating a validation procedure are more difficult in this case, but we are prepared to defend it.
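As an illustration of what extracting spectra from the literature involves, the sketch below parses a typical 1H peak list as it appears in an experimental section into structured records. It is not how OSCAR works; the regular expression, the `Peak` type and the sample string are assumptions made for the example, and real papers use many more variants than this pattern covers.

```python
import re
from typing import List, NamedTuple, Optional


class Peak(NamedTuple):
    shift_ppm: float
    multiplicity: str
    coupling_hz: Optional[float]  # None when no J value is reported
    n_hydrogens: int


# Matches entries such as "3.71 (t, J = 6.5 Hz, 2H)" or "7.26 (s, 1H)".
PEAK_RE = re.compile(
    r"(\d+\.\d+)\s*\(([a-z]+),\s*"
    r"(?:J\s*=\s*(\d+(?:\.\d+)?)\s*Hz,\s*)?"
    r"(\d+)H\)"
)


def parse_peaks(text: str) -> List[Peak]:
    """Extract every peak matching PEAK_RE from a free-text peak list."""
    peaks = []
    for match in PEAK_RE.finditer(text):
        shift, mult, coupling, n_h = match.groups()
        peaks.append(Peak(
            shift_ppm=float(shift),
            multiplicity=mult,
            coupling_hz=float(coupling) if coupling else None,
            n_hydrogens=int(n_h),
        ))
    return peaks


if __name__ == "__main__":
    sample = "7.26 (s, 1H), 3.71 (t, J = 6.5 Hz, 2H), 1.25 (d, J = 7.0 Hz, 3H)"
    for peak in parse_peaks(sample):
        print(peak)
```

Note what is missing from the output: nothing in the text says which atom each peak belongs to. That is the assignment problem mentioned above, and it is exactly the information that a semantic format with explicit atom numbering would preserve.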
Ultimately my vision is that all NMR in journals would be validated and in semantic form (e.g. CMLSpect) before being published. Other disciplines have already achieved it, so it’s a matter of communal will rather than absence of technology. I think we have a mutual way forward, though not in the timescale of Nick Day’s thesis.
Wolfgang Robien Says:
[links to comments broken in WordPress]
OK, you are not a NMR-spectroscopist, but you want to liberate NMR data from the pages of the journals:
PMR: This is exactly right. It is virtually the sole motivation for this work. Anything else (NMRShiftDB/WR, GIAO/HOSE-NN) is secondary. It is also coupled to the capture of data from eTheses (the SPECTRa and SPECTRa-T projects) where we have shown that most data rapidly gets lost. It is about validation, semantic quality, dissemination, preservation, and closely tied to the capture of academic output in institutional and other repositories.
WR: There are so many people around working in this field, who are doing excellent science
PMR: I am unaware of major scientific laboratories that are making major efforts to change the way that NMR spectra are published in journals or theses or captured in repositories. I do claim to be aware of semantic scientific publication and repositories and am regularly invited by both the Open and Closed publishers to talk about this. If there is major work ongoing in pre-publication validation and semantic output of NMR, I haven’t heard of it.