I recently posted (Open NMR and OSCAR toolchains) about how OSCAR can extract data from chemical articles, and in particular chemical theses. Wolfgang Robien points out (November 24th, 2007 at 11:03 am):
I think, no, I am absolutely sure, this functionality can be achieved with a few basic UNIX-commands like ‘grep’, ‘cut’, ‘paste’, etc. What you need is the assignment of the signals to specific carbons in your structure, because this (and EXACTLY THIS) is the basis of spectrum prediction and structure verification – before this could be done, you need the structure itself.
Wolfgang is correct that this part of OSCAR is based on regular expressions (which are also used in grep). However, developing regular expressions that work across a variety of styles (journals, theses, etc.) is a lot of work; conservatively, this took many months. The current set of regexes runs to many pages. When I started this work about 7 years ago I thought that chemical papers could be solved by regexes alone, but this is quite infeasible. Even if the language is completely regular (as is possible, but not always observed, in spectral data) we rapidly get a combinatorial explosion. Joe Townsend, Chris Waudby, Vanessa de Sousa and Sam Adams did much of the pioneering work here and showed the limitations. In fact the current OSCAR, which we are refactoring at this moment, consists of several components:
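To make the regex point concrete, here is a minimal Python sketch of the kind of pattern involved. It is not taken from OSCAR: the sample sentence, the pattern names and the `parse_peak_list` helper are all invented for illustration, and the patterns cover exactly one reporting style.

```python
import re

# Hypothetical sample of ONE reporting style; real papers and theses vary widely.
TEXT = "13C NMR (100 MHz, CDCl3) δ 170.1, 136.2, 128.5 (2C), 77.2, 21.0."

# Header: nucleus, then "(frequency MHz, solvent)", then the delta symbol.
HEADER = re.compile(
    r"(?P<nucleus>13C|1H)\s*NMR\s*"
    r"\((?P<freq>\d+(?:\.\d+)?)\s*MHz,\s*(?P<solvent>[^)]+)\)\s*"
    r"(?:δ|delta)\s*"
)

# One peak: a shift in ppm, optionally followed by a parenthesised annotation.
PEAK = re.compile(r"(?P<shift>-?\d+(?:\.\d+)?)(?:\s*\((?P<note>[^)]*)\))?")


def parse_peak_list(text):
    """Return the header fields and (shift, note) pairs, or None if no header matches."""
    header = HEADER.search(text)
    if header is None:
        return None
    body = text[header.end():]
    peaks = [(float(m.group("shift")), m.group("note")) for m in PEAK.finditer(body)]
    return {
        "nucleus": header.group("nucleus"),
        "frequency_mhz": float(header.group("freq")),
        "solvent": header.group("solvent"),
        "peaks": peaks,
    }


if __name__ == "__main__":
    print(parse_peak_list(TEXT))
```

A string in a slightly different style, say one starting with "δC" and no "13C NMR" header, returns None from this sketch. Each such variant needs its own pattern or another optional group, which is where the combinatorial explosion comes from.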
- natural language parsing techniques (including part of speech tagging and, to come, more sophisticated grammars)
- identification of chemical names by Bayesian techniques
- chemical name deconstruction (OPSIN)
- heuristic chunking of the document
- lookup in ontologies
- regular expressions
These can interact in quite complex ways; for example, chemical names and formulae can be found in the middle of the data. For this reason OSCAR, like any parsing technique, can never be 100% perfect. (We should mention, and I will continue to do so, that parsing PDF, even single-column, is a nightmare.)
Wolfgang is right that we also need the assignment of the peaks in the peakList to the carbons, and also the chemical structure itself. Taking the structure first, we can try to determine it by the following methods:
- interpreting the chemical name. OPSIN does a good job on simple compounds. I don’t have metrics for the current literature but I think it’s running at ca. 20%. That may sound low, but name2structure requires the compilation of many sub-lexicons and sub-grammars (e.g. for multicyclic systems, saccharides, etc.). If there is a need for this, much can be done by community action.
- interpreting the chemical diagram. Open tools are starting to emerge here and my own dabbling with PDF suggests that perhaps 20-25% can be extracted. The main problems are (a) finding the diagrams and linking them to the serial number and (b) the publishers’ claim that images are copyright.
- using the crystallography. If a crystal structure is available then the conversion to connection table, including bond orders and hydrogens, is virtually 100%. Again there may be a problem in linking the structure to the formula.
- reconstruction from spectral data. For simple molecules this should be possible – after all we set this in exam questions so a robot should be able to do some. The combination of HNMR, CNMR and IR should constrain the possibilities. Whether this is a brute force approach (generate all structures and remove incompatible ones) or whether it is based on logic and rules may depend on the software available and the system.
(Of course if the publisher or student makes an InChI available all this is unnecessary).
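The brute-force option in the last bullet can be sketched in a few lines of Python. This is a toy under stated assumptions, not a structure generator: the three constitutional isomers of C3H8O are typed in by hand, and the `Candidate` class, its two fields and the `compatible` function are names I have made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    distinct_carbons: int  # number of 13C signals expected from symmetry
    has_oh: bool           # whether an O-H stretch is expected in the IR


# The constitutional isomers of C3H8O, entered by hand for this toy example.
# A real system would generate the candidates rather than list them.
CANDIDATES = [
    Candidate("propan-1-ol", 3, True),
    Candidate("propan-2-ol", 2, True),
    Candidate("methoxyethane", 3, False),
]


def compatible(candidates, n_c13_peaks, oh_stretch_seen):
    """Keep only the candidates consistent with both observations."""
    return [
        c for c in candidates
        if c.distinct_carbons == n_c13_peaks and c.has_oh == oh_stretch_seen
    ]


if __name__ == "__main__":
    # Two 13C peaks plus an O-H stretch leaves a single candidate.
    for c in compatible(CANDIDATES, n_c13_peaks=2, oh_stretch_seen=True):
        print(c.name)
```

The strict equality on the number of peaks is the weak point: it assumes that every signal was reported, which published peak lists do not guarantee.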
There are two ways of trying to add the assignment. One is simply by trusting the shifts from the calculation (whether GIAO or HOSE). A problem here is that the authors may – and do – omit peaks or mis-transcribe them. I think I have an approach to manage simple cases here. The other is by trying to interpret the author’s annotations. This is a nice exercise because there is no standard way of reporting it and there is almost certainly no numbering scheme. So we will need to build up databases of numbering schemes and also heuristics of how most authors annotate C13 spectra.
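The first route, trusting the calculated shifts, might look like the following sketch. It is not the approach I have in mind for OSCAR, just a simple greedy matcher to show the shape of the problem. The atom labels, the shift values, the 3 ppm tolerance and the `assign_peaks` function are all illustrative assumptions; in practice the predicted shifts would come from a GIAO or HOSE calculation.

```python
def assign_peaks(predicted, observed, tolerance=3.0):
    """Greedily pair predicted shifts with observed peaks.

    predicted: dict mapping an atom label to its predicted shift in ppm.
    observed:  list of reported shifts in ppm.
    Returns (assignment, atoms without a peak, peaks without an atom).
    """
    # Every (atom, peak) pair, closest first.
    pairs = sorted(
        (abs(shift - peak), atom, i)
        for atom, shift in predicted.items()
        for i, peak in enumerate(observed)
    )
    assignment = {}
    used = set()
    for diff, atom, i in pairs:
        if diff > tolerance:
            break  # all remaining pairs are further apart still
        if atom in assignment or i in used:
            continue
        assignment[atom] = observed[i]
        used.add(i)
    unassigned_atoms = [a for a in predicted if a not in assignment]
    unassigned_peaks = [p for i, p in enumerate(observed) if i not in used]
    return assignment, unassigned_atoms, unassigned_peaks


# Illustrative numbers only: four predicted carbons, but the author reported three peaks.
PREDICTED = {"C1": 171.0, "C2": 60.5, "C3": 21.0, "C4": 14.2}
OBSERVED = [170.8, 60.2, 14.0]

if __name__ == "__main__":
    assignment, missing, extra = assign_peaks(PREDICTED, OBSERVED)
    print(assignment, missing, extra)
```

Here C3 is left without a peak, which flags a probable omission rather than forcing a bad match. A greedy match is the simplest choice; it can go wrong when several carbons have close shifts, and it does nothing about mis-transcribed values that fall outside the tolerance.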