Category Archives: oscar

How OSCAR interprets text and data

I recently posted (Open NMR and OSCAR toolchains) about how OSCAR can extract data from chemical articles, and in particular chemical theses. Wolfgang Robien points out (November 24th, 2007):

I think, no, I am absolutely sure, this functionality can be achieved with a few basic UNIX-commands like ‘grep’, ‘cut’, ‘paste’, etc. What you need is the assignment of the signals to specific carbons in your structure, because this (and EXACTLY THIS) is the basis of spectrum prediction and structure verification - before this could be done, you need the structure itself.

Wolfgang is correct that this part of OSCAR is based on regular expressions (which are also used in grep). However, developing regular expressions that work across a variety of styles (journals, theses, etc.) is a lot of work - conservatively this took many months. The current set of regexes runs to many pages. When I started this work about 7 years ago I thought that chemical papers could be solved by regexes alone, but this is quite infeasible. Even if the language is completely regular (as is possible, but not always observed in spectral data) we rapidly get a combinatorial explosion. Joe Townsend, Chris Waudby, Vanessa de Sousa and Sam Adams did much of the pioneering work here and showed the limitations. In fact the current OSCAR, which we are refactoring at this moment, consists of several components:

  • natural language parsing techniques (including part of speech tagging and, to come, more sophisticated grammars)
  • identification of chemical names by Bayesian techniques
  • chemical name deconstruction (OPSIN)
  • heuristic chunking of the document
  • lookup in ontologies
  • regular expressions

These can interact in quite complex ways - for example chemical names and formulae can be found in the middle of the data. For this reason OSCAR - and any parsing technique - can never be 100% perfect. (We should mention, and I will continue to do so, that parsing PDF - even single column - is a nightmare.)

Wolfgang is right that we also need the assignment of the carbons to the peakList, as well as the chemical structure. Taking the structure first, we can try to determine it by the following methods:

  • interpreting the chemical name. OPSIN does a good job on simple compounds (a minimal usage sketch follows this list). I don't have metrics for the current literature but I think it's running at ca 20%. That may sound low, but name2structure requires the compilation of many sub-lexicons and sub-grammars (e.g. for multicyclic systems, saccharides, etc.). If there is a need for this, much can be done by community action.
  • interpreting the chemical diagram. Open tools are starting to emerge here and my own dabbling with PDF suggests that perhaps 20-25% can be extracted. The main problems are (a) finding the diagrams and linking them to the serial number and (b) the publishers' claim that images are copyright.
  • using the crystallography. If a crystal structure is available then the conversion to connection table, including bond orders and hydrogens, is virtually 100%. Again there may be a problem in linking the structure to the formula.
  • reconstruction from spectral data. For simple molecules this should be possible - after all we set this in exam questions, so a robot should be able to do some of them. The combination of HNMR, CNMR and IR should constrain the possibilities. Whether this is a brute force approach (generate all structures and remove incompatible ones) or whether it is based on logic and rules may depend on the software available and the system.
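
As an illustration of the first route (name interpretation), here is a minimal sketch of calling OPSIN from Java. The class and method names are those of present-day OPSIN releases and may differ from the version discussed here, so treat the API details as an assumption rather than a recipe:

    import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
    import uk.ac.cam.ch.wwmm.opsin.OpsinResult;
    import uk.ac.cam.ch.wwmm.opsin.OpsinResult.OPSIN_RESULT_STATUS;

    public class NameToStructureDemo {
        public static void main(String[] args) {
            // getInstance() loads OPSIN's grammars and lexicons once
            NameToStructure nts = NameToStructure.getInstance();
            OpsinResult result = nts.parseChemicalName("2,4,6-trimethylpyridine");
            if (result.getStatus() == OPSIN_RESULT_STATUS.SUCCESS) {
                // the connection table can be exported as SMILES (or CML)
                System.out.println(result.getSmiles());
            } else {
                System.out.println("OPSIN could not interpret the name: " + result.getMessage());
            }
        }
    }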

(Of course if the publisher or student makes an InChI available all this is unnecessary).

There are two ways of trying to add the assignment. One is simply by trusting the shifts from the calculation (whether GIAO or HOSE). A problem here is that the authors may - and do - omit peaks or mis-transcribe them. I think I have an approach to manage simple cases here. The other is by trying to interpret the author's annotations. This is a nice exercise because there is no standard way of reporting it and almost certainly no standard numbering scheme. So we will need to build up databases of numbering schemes and also heuristics of how most authors annotate C13 spectra.
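
A minimal sketch of what the first route might involve in the simplest case: pair each calculated shift with the nearest reported peak within a tolerance, so that omitted or mis-transcribed peaks show up as unmatched values. This is my own illustrative guess, not Nick's actual code, and real assignment is considerably subtler:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ShiftMatcher {
        /** Greedily pair each calculated 13C shift with the nearest unused
         *  observed peak within tol ppm; calculated shifts left unpaired
         *  suggest omitted or mis-transcribed peaks. Illustrative only. */
        public static Map<Double, Double> match(double[] calc, double[] obs, double tol) {
            Map<Double, Double> pairs = new LinkedHashMap<>();
            boolean[] used = new boolean[obs.length];
            for (double c : calc) {
                int best = -1;
                double bestDiff = tol;
                for (int i = 0; i < obs.length; i++) {
                    double diff = Math.abs(obs[i] - c);
                    if (!used[i] && diff <= bestDiff) {
                        best = i;
                        bestDiff = diff;
                    }
                }
                if (best >= 0) {
                    used[best] = true;
                    pairs.put(c, obs[best]);
                }
            }
            return pairs;
        }
    }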

Open NMR and OSCAR toolchains

I am currently refactoring Nick Day's code that has supported "NMREye" - the collection of Open experiments and Data that he has generated as part of his thesis and that have been trailed on this blog (view post). One intention of this - which got lost in some of the other discussion - is to be able to see whether published results are "correct". This is, of course, not new to us - students here developed the OSCAR toolkit for checking experimental data (view post). The NMREye work suggests that it should be possible to validate the actual 13C NMR values reported in a scientific experiment.

Nick will take it as a compliment that I am refactoring his code. It was written on a very strict timescale - he had to write the code, collect and analyse the results in little more than a month. And his work has a wider applicability within our group. So I am trying to design a library system that supports his ideas while being generally re-usable. And this has very useful consequences for CML - the main question as always is "does CML support enough chemistry in a simple fashion and can it be coded?". Here's an example of data from a thesis we are analysing in the SPECTRaT project:

13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3).

(the "d" is a delta but I think everything has been faithfully copied from the Word document. Note that OSCAR can :

  • understand that this is a 13C spectrum
  • extract the frequency
  • identify the peak values (shifts) and identify the comments

Try to think how you would explain this to a robot and what additional information you would need. Indeed try to explain this to a non-chemist - it's a useful exercise.
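
As a rough illustration of how one might begin to explain the three bullets above to a robot using regular expressions alone, here is a deliberately naive Java sketch (the string is a shortened version of the data above). It is far simpler than OSCAR's real machinery, which combines regexes with the other components listed earlier:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PeakSketch {
        public static void main(String[] args) {
            String text = "13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), "
                    + "128.3, 127.6, 127.5 (Ar-ortho-C, Ar-meta-C, Ar-para-C), 0.5 (Si(CH3)3).";

            // Recognise the nucleus and the observation frequency
            Matcher header = Pattern.compile("13C\\s*\\((\\d+)\\s*MHz\\)").matcher(text);
            if (header.find()) {
                System.out.println("13C spectrum at " + header.group(1) + " MHz");
            }

            // Each peak is a decimal shift, optionally followed by a bracketed comment.
            // Nested brackets such as (Si(CH3)3) already defeat this pattern -
            // one small illustration of why regexes alone are not enough.
            Matcher peak = Pattern.compile("(\\d+\\.\\d+)(?:\\s*\\(([^)]*)\\))?").matcher(text);
            while (peak.find()) {
                // group(2) is null when a shift carries no comment
                System.out.println("shift " + peak.group(1) + "  comment: " + peak.group(2));
            }
        }
    }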

What OSCAR and the other tools cannot do yet is:

  • extract the solvent (this is mentioned elsewhere in the thesis)
  • understand the comments
  • manage the framework symmetry group of the phenyl ring
  • understand peakGroup (the aromatic ring)

So the toolchain has to cover this and much more. However the open source chemistry community (in this case all Blue Obelisk) has provided most of the components. More on this later.

OPSIN/OSCAR: you + us = we; please help

I'm exploring how you and we may be able to work to improve OSCAR and OPSIN. Even if you aren't interested in chemical names, you may find the general principles useful.

One of the drawbacks of full Open Source and Open Access is that you don't always get feedback on whether what you are doing is appreciated. There is a small measure of downloads, accesses, and so on, but you can only guess at the motivation and follow-up. When we go to meetings, people (often in industry) say things like: "Oh we use OSCAR for identifying compounds and it's very useful", "We started using InChI and it's saved us lots of money", etc. But unless someone actually tells us what they are doing and what they want, we don't know and can only guess.

Sometimes we get letters of support, and some explore whether they can offer help. I may be able to expand later, but here are a number of positive things you can do:

(a) Simply tell us that you are using OSCAR or OPSIN, and say a few words about why. If possible we can publish this anonymously: "we regularly use OSCAR and have integrated it into our Y process". That gives us motivation to continue in the dark hours of the night, to refactor, to document, etc.

(b) Contribute in-kind data. I'll expand on "how" later, but the sorts of things we would like are:

  • names with connection tables (especially if not in Pubchem)
  • ontologies (e.g. similar to ChEBI)
  • tutorials
  • regular expressions for specific tasks (though Peter often has better approaches)
  • insight into document structuring (e.g. the variability and commonality of organization in theses)
  • corpora (especially annotated). These are very valuable but should not be approached casually
  • acronyms
  • numbering systems, both arbitrary and algorithmic
  • journal- and thesis-specific regular expressions

(c) Contribute in-kind code. You can, of course, do this anyway through Sourceforge, but it is best to coordinate. Major areas that OPSIN requires are:

  • stereochemistry (even R- and S- cannot be parsed at present)
  • bridged ring systems (e.g. bicyclo[3.2.1]octane). (I wrote an algorithm for this at one stage but it's not in OPSIN)
  • fused rings. (This can be fairly hairy, but partial lookup can help a lot)
  • specialised vocabularies (e.g. saccharides, nucleic acids).

(d) Support the community.

(e) Invite one of us to talk with your organization.

(f) Provide financial support. This can range from a summer student (a few thousand USD) to larger structured projects, perhaps involving other partners and initiatives (JISC, FP7, NSF, etc.). One of the enormous attractions of Openness is that you immediately get all the benefits of everyone else involved. (If, in our work with RSC, Nature and IUCr, we had insisted on all IP being held in silos we would never have had OSCAR - it's that simple.)

SciBorg and OPSIN
It's valuable to understand where OSCAR fits into the larger picture. In 2004 Ann Copestake and Simone Teufel (from the Computer Laboratory), together with Andy Parker (Cambridge eScience) and myself, bid for and won an EPSRC grant which we called "SciBorg". It includes the very real and positive collaboration of Nature, the International Union of Crystallography, and the Royal Society of Chemistry. The objectives are:

  1. To develop a natural-language oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.

  2. To use this language as a basis for robust and extensible extraction of information from scientific texts.

  3. To model scientific argumentation and citation purpose in order to support novel modes of information access.

  4. To demonstrate the applicability of this infrastructure in a real-world eScience environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.

The project is larger and more visionary than simply extracting chemical names from text. I like to use the phrase "machine-understanding of scientific literature" - though I may be pulled up for over-stressing "understanding". The idea is to use a variety of techniques to understand the deep structure of the language - at sentence, paragraph and document level. And to go beyond the linguistic form to infer motivation: "why is this citation important?", "is this paper challenging conventional views?".

To do this we need to understand chemical language, and that is where OSCAR3 comes in. It works out the role of chemical words and phrases so that more powerful tools can interpret the larger context. So "methane" is a noun (CM), while "methyl" is an adjective (CJ) and "methylated" is a reaction/verb. (You might mention "methylated spirits", but exceptions and ambiguity are all part of the fun of parsing human-generated language.) And OSCAR gives these decisions a probability ("demethylates" in "P450 demethylates caffeine" is more likely to be a verb than "methylated" in "methylated spirits"). But in the grander scheme of things SciBorg does not usually need to know what the actual compounds are in order to work out the deep structure of the language. (Obviously as we advance there is a chance to add validity and inference - thus "C-14 demethylation of lanosterol" is meaningful whereas "C-15..." is not - but we can't do it all yet.)
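
Purely as an illustration of the idea - the class and tag names below are invented for this sketch and are not OSCAR's API - a chemically tagged token might be represented like this:

    public class TaggedToken {
        // CM = chemical noun, CJ = chemical adjective, in the spirit of the
        // tags mentioned above; the enum itself is invented for this sketch
        public enum ChemTag { CM, CJ, REACTION_VERB, OTHER }

        final String surface;
        final ChemTag tag;
        final double probability;   // confidence attached to the tagging decision

        TaggedToken(String surface, ChemTag tag, double probability) {
            this.surface = surface;
            this.tag = tag;
            this.probability = probability;
        }

        public static void main(String[] args) {
            // illustrative, made-up probabilities
            TaggedToken t1 = new TaggedToken("demethylates", ChemTag.REACTION_VERB, 0.95);
            TaggedToken t2 = new TaggedToken("methylated", ChemTag.CJ, 0.80);
            System.out.println(t1.surface + " -> " + t1.tag + " (p=" + t1.probability + ")");
            System.out.println(t2.surface + " -> " + t2.tag + " (p=" + t2.probability + ")");
        }
    }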

The point is that Peter is only partly working on OSCAR. Along the way he has developed a lot of clever tricks and this account does not do him justice. But we now need to take OSCAR/OPSIN forward on a broader front, and this post addresses some of this.

OSCAR and OPSIN have become complex. Peter Corbett inherited bleeding-edge code and data which at that time had bits from all sorts of authors (Joe Townsend, Fraser Norton, Sam Adams, Chris Waudby, Richard Marsh, James Bell, Vanessa de Sousa (who wrote "Nessie"), Justin Davies, and PeterMR). It covered document structure, import of legacy data, regular expressions, data checking, name2structure, lexicons and other bits and pieces.

The original OSCAR (sometimes OSCAR-1 or even OSCAR-2) was primarily based on regular expressions and data checking. OSCAR3 continues to support this part but does not support the OSCAR-2 GUI, nor do the regexes fit all journals and theses - they were aimed at RSC articles. So Justin and Richard created OSCAR-DATA, which consumes the data-section output of OSCAR3 and then applies rules and presents it. We have a sustainability path for that. More later.

Meanwhile OSCAR3 is being used for a wider range of applications than simply journal articles, in particular patents, internal documents and theses. So we have been able to get funding from Unilever for David and Lezan (who joins us on Monday), and from JISC for Alan and Diana. That gives us a critical mass in the more direct support for OSCAR3 functionality.

The first thing is refactoring. No-one likes refactoring, but it feels good when you've done it. Jim is designing the overall approach, and he is able to do this in a way I can only marvel at. It uses design patterns and testing (of course), combined with lightweight services (REST, Web 2.0) and JFDI pragmatism. We'll expose more of this later but some of the themes are likely to be:

  • use all the Open strategies (SVN, Eclipse, JUnit, Maven, etc.)
  • use open source components where possible (PDFBox, Lucene, JUMBO, CDK, etc.)
  • explore how to modularise them where necessary (e.g. CDK)
  • separate components and communicate through XML or RDF (e.g. OSCAR3 -> OSCAR-DATA can be almost completely decoupled; a small sketch follows this list)
  • devise a framework based on standard approaches (e.g. Eclipse)
  • actively design, highlight and explain extension points. This is essential if others are to add code and resources in parallel
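
As a small sketch of the "communicate through XML" theme: an upstream component emits a self-describing peak list that a downstream component such as OSCAR-DATA can consume without knowing anything about the code that produced it. The element names here are invented for illustration; real output would be CML or similar, probably written with an XML library rather than string building:

    public class PeakListWriter {
        public static void main(String[] args) {
            double[] shifts = {138.4, 136.7, 136.1};

            // Build a minimal, self-describing XML document that a separate
            // process can parse; the schema here is purely illustrative
            StringBuilder xml = new StringBuilder("<peakList units=\"ppm\">\n");
            for (double shift : shifts) {
                xml.append("  <peak value=\"").append(shift).append("\"/>\n");
            }
            xml.append("</peakList>");
            System.out.println(xml);
        }
    }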

So we have to do this for ourselves.

If the community can give us clear indications of what they can contribute then this will help us to identify the extensions.