OPSIN and OSCAR – Chemical language processing

This blog is about new developments in our chemical language processors OSCAR and OPSIN and about how OMII (eScience) and we are taking them forward.  WE also have a JISC project with NacTEM – CheTA and I’ll write more later about that.

Many of you will know that we have been interested for several years in the Natural Language Processing (NLP) of chemistry texts. “Text-mining” – the extraction of information from texts – is now commonplace (and will remain so until we move away from PDF as the only means of communication). Our interest has been wider – with Ann Copestake and Simone Teufel in the Computer Laboratory we’ve been trying to get machines to understand that language of chemical discourse – “why was this paper written?”, “what is the authors relation to others?”, etc.

But to do this we needed language processing tools which were chemistry-specific, and since 2002 we’ve developed the OSCAR and OPSIN tools (see http://sourceforge.net/projects/oscar3-chem) . OSCAR was the first, developed initially by Joe Townsend and Chris Waudby through summer studentships from the Royal Society of Chemistry. The first version of OSCAR was developed to check the validity of data in chemical syntheses and has been mounted on the RSC’s website for 5-6 years.

I know from hearsay that this is widely used though I don’t have any download figures.This software is variously referred to as OSCAR and internally as OSCAR-DATA or OSCAR1. It is a measure of its quality that it has been mounted for > 5 years and has run with no reported problems and required no maintenance. I continue to emphasize the value of making undergraduates full members of the research and development process and why in our group we continue to highlight their importance.

You will need some terms now:

  • chemical natural language processing – applying the full power of NLP to chemically oriented text.  This includes approaches such as tree banking where we try to interpret all the possible meanings of a sentence or phrase: “time flies like an arrow” (Marx) or “pretty little girls school”. There are relatively few systems which do this, at least in public.
  • chemical entity recognition. A subset of chemical NLP where the parsers identity words and phrases representing chemical concepts. To do this properly it’s necessary to recognize the precise phrase. Thus “benzene sulfonic acid” represents a single phrase and to interpret is as “benzene” and “sulfonic acid” is wrong. We also recognize phrases to do with reactions, enzymes, apparatus, etc.  This is an area where we have put in a lot of work.
  • Chemical name recognition is an important subset of chemical entity recognition. Names can be recognised by at least (a) direct lookup – required for trivial or trade names (“cholesterol”, “panadol”) (b) machine-learning techniques on letter or n-gram frequencies and (c) interpretation (below).
  • Chemical name interpretation, e.g. of (IUPAC) names (e.g. 1-chloro-2-methyl-benzene). The Int. Union of Pure and App. Chemistry (IUPAC) oversees the rules for naming chemicals which runs to hundreds of pages. It looks algorithmic to code or decode chemical names. It is NOT. Some computer scientists have taken this as a toy language system and been defeated, because it is actually a natural language with rules, exceptions, irregular formations and a great deal of non-semantic vocabulary. It includes combinations (semi-systematic) such as  7-methyl-guanosine where if you don’t know what guanosine is you can make little progress (but not none, you know there is a methyl group).
  • Information extraction. The (often large-scale) extraction of information from documents. This is never 100% “correct”, partly through lack of vocabulary, partly through variations in language including “errors”, and partly because of ambiguity. We use the terms recall (how many of the known chemical phrases were actually found) and precision (how many of the retrieved phrases were correctly identified as chemical). Note that this requires agreement as to which phrases are chemical and this must be done by humans. This annotated corpus requires much tedious work, and to be useful must be redistributable in the community. Without it any reported metrics on the performance of tools are essentially worthless. There is commercial value in extracting chemical information and so, unfortunately, most metrics in this area are published as marketing figures. Note that the performance of a tool is not absolute but depends critically on the selection of documents on which it is run.

During this process Joe and Chris enhanced OSCAR by adding chemical name recognition using n-grams and bayesian methods. This gave a tool which was able to recgnize and interpret large amounts of the wrold’s published chemical syntheses. It’s at that stage that we run into the non-technical problems such as publisher firewalls, contracts, copyright and all the defences mounted against the free digital era (but that’s a different post).

The next phase was a collaborative grant between Ann Copestake and Simone teufel of the Cambridge Computer Laboratory and myself, funded by EPSRC (SciBorg). I reemphasize that Sciborg is about many aspects of language processing besides information extraction. We were delighted to include publishers as partners, RSC, Int. Union of Crystallography and Nature Publishing Group. All these have contributed corpora, although these are not wholly Open.

In NLP an important aspect is interpreting sentence structure through Part-of-speech-tagging. Thus “dihydroxymanxane reacts with acetyl chloride” has the structure NounPhrase Verb Preposition NounPhrase. There’s a splendid tool, Wordnet, that will interpret natural language components – here is what it does for “acetyl chloride” (identifying it as a Noun). But it fails on “dihydroxymanxane” – not suprising as my colleague Willie Parker coined the name manxane in 1972 and the dihydroxy derivative is generated semi-systematically. There are an infinite number of chemical names and we need tools to identify and interpret them.

OSCAR was therefore developed futher by Peter Corbett to recognise chemical names in text and our indications are that its methods are not surpassed by any other tool. Remember that results are absolutely depedent on an annotated corpus and on the actual corpora analysed. It’s easy for any tool to get good results on the corpus it’s been trainied on and lousy ones for different material. But, on a typical corpus from RSC publications OSCAR3 scores over 80% combined precision and recall. (Before you brag that your tool can do better, the study also showed that expert chemists only agreed 90%, so that is the upper limit. If chemists cannot agree on something, then machines cannot either).

OSCAR3 is now widely used. There have been over 2600 downloads from SourceForge (yes, of course OSCAR3 is Open Source). We get little feedback because chemistry is a secretive science but this at least means that there are relatively few bugs. Of course there may also be people who find they can’t install OSCAR3 but don’t contact us. The European Patent Office has used OSCAR3 on over 70,000 patents.

So OSCAR can justify some effort to make it even more usable and that’s why we have approached OMII. See below…

When we first started OSCAR we realised that we needed a name2structure parser if we were going to understand the chemistry. It’s valuable to know that dihydroxymanxane is a chemical, but even better if we know it is 1,5-dihydroxybicyclo[3.3.3]undecane because chemists can interpret that. So I started  by writing a separate tool to interpret chemical names (there weren’t and there aren’t now any other Open Source programs to do this). Joe Townsend took over and researched the literature for parsing methods, and handed this over to PeterC at the start of Sciborg. Peter made useful enhancements to this and included it as a subcomponent OPSIN. Peter deliberately did enough work to interpret common chemical names and included it in the OSCAR processing chain.

I want to be very clear. OPSIN has never been promoted as a tool to compete with commercial name2structure tools (there are 3-4) . It was an adjunct in the Sciborg program. If PeterC or I had spent more time increasing its power it would have been at the expense of what the grant was for. It met its given purpose well – to highlight the value of automatic translation and markup of names, and led – in part – to the RSC’s development of Project Prospect where chemical concepts in publications are semantically marked. From time to time we see anecdotal reports that OPSIN is not up to the standard of commercial tools and that is used as an argument for poor quality in Open Source projects and – sometimes – the relative inability of academics to do things properly. That’s unfair, but we have to bite our lips.

That’s now massively changing and I believe that in a few months time OSCAR and OPSIN will be seen as a community standard in chemical language processing and chemical entity interpretation. Being Open Source that will lead to increased community effort which has the power to leapfrog some of the commercial offerings. More in the next blog post.

This entry was posted in "virtual communities", Uncategorized. Bookmark the permalink.

4 Responses to OPSIN and OSCAR – Chemical language processing

  1. Pingback: Unilever Centre for Molecular Informatics, Cambridge - funding models for software, OSCAR meets OMII « petermr’s blog

  2. Pingback: Unilever Centre for Molecular Informatics, Cambridge - OPSIN: why it can become the de facto name2structure « petermr’s blog

  3. Rich Apodaca says:

    Peter, OPSIN is an excellent piece of open source software that works very well within its scope. I’ve written a bit about it’s use. For example, this article describes how to convert an IUPAC name to SMILES or InChI using Opsin together with other open source software:
    and this one shows how to use OPSIN in other contexts:
    One limitation of Opsin when I last used it was in parsing complex heterocycle nomenclature. What’s the current status for complex heterocycle parsing in OPSIN?

    • pm286 says:

      @rich [sorry first reply got lost]
      Heterocycles are probably the most complicated remaining subset. Daniel has implemented all singly fused systems (i.e. A[…]B syntax). He believes that two fusions are relatively simple and will do these RSN. However the general rules for fusion are complex and not always unambiguous. The problem is not always in the actual fusion, but in the numbering, which may be recursive. Some nuclei require that a 2-D structure be created and oriented in a particular way (the longest lines of rings horizontal with rings hanging over the right hand edge, for example). This may actually depend on the geometric idiom used to draw the rings and I suspect there is some divergence between interpretations.
      We have to balance the cost of doing this against the number of structures that it adds. The most important missing area is macromolecular monomers.
      Are you offering to contribute?

Leave a Reply

Your email address will not be published. Required fields are marked *