In a previous post I reviewed our chemical language processing tools – OSCAR and OPSIN. This post updates progress on OPSIN, the IUPACName2Structure converter.
Why do we need a name2structure converter? It’s because chemists use language to communicate the identities of objects. It’s possible to talk simple chemistry over the phone whereas it wouldn’t be easy to describe star maps, isotherms, engineering drawings, etc. And, because of this, chemists often abbreviate names – it’s easier to say “mesitylene” than “1,3,5-trimethylbenzene”, or “DDT” instead of “paradichlorodiphenyltrichloroethane” (experts will cringe at the horror of this name, which is seriously non-systematic and which could not be worked out by man or machine. There is, however, a lovely limerick based on it).
The rules for naming compounds are set out by the International Union of Pure and Applied Chemistry (IUPAC). Even if you are not a chemist, have a look at the IUPAC Nomenclature Home Page, which represents years of devoted work by chemists, much of the organization done by Gerry Moss. There are many reasons why the field is complicated:
- Almost all compounds can be named in many ways. Thus CH3-O-CH3 could be called methyl ether, dimethyl ether, 2-oxa-propane and so on. IUPAC has recommendations for which of these should be used, but they are often ignored, and sometimes are honoured in the breach. Most practising chemists, unless they routinely patent a lot of compounds, neither know these recommendations nor care.
- Errors are common. Letters can be elided, brackets missed, etc., and plain mistakes made. How many readers could say accurately what the structure (if any) is of capric chloride, caproic chloride, caproyl chloride, caprilyl chloride, and capriloyl chloride? Don’t be a goat; it matters.
So nomenclature is a black art. It’s semi-finite in that there is currently a finite number of known compounds (some tens of millions) and a finite set of rules that can be used to generate an infinite set of names. In a similar way there is a finite set of English words that can be used to generate an infinite set of articles. So, in principle, we could encode a finite set of rules, updated every year when IUPAC generates more rules, that would completely interpret chemical name space.
In practice, however, the labour of doing this has been too great for anyone. Even the market leaders in name2structure would not correctly interpret all the examples in the IUPAC rulebook. There’s a very long tail – many rules which apply to only a few compounds, or none, in the 30 million. Covering it is not cost-effective at this stage. [There would be a cost-effective way if IUPAC rules were semantically encoded, but that’s many years away, if at all.]
Ideally there should be one name2structure converter, sanctioned by IUPAC. Just like there is one InChI, sanctioned by IUPAC. In bioscience this would have happened. But in chemistry we have a mess of competitive products, of very variable quality. They cost money (some are free to academics), have many errors, have no agreed standard of quality, have no believable metrics, and offer no way for the community to contribute.
A classic picture of anticommons.
So why are we developing OPSIN? In research terms it’s a “solved problem”. We are frequently told academia shouldn’t do things that the commercial sector does better.
In fact we are doing things better and we are doing language research. The motivations are:
- Generic use of language. Chemistry often uses phrases like “substituted pyridines”. There is no formal way of representing this concept and we are developing languages that provide a grammar. This is hard, it’s research and it’s valuable for the community, for example in interpreting patents.
- Disambiguation. This is a key problem in NLP and certainly worthy of research. What does “chloroethylbenzene” mean? It’s ambiguous and could be any of 5 structures (ClCCc1ccccc1, CC(Cl)c1ccccc1, Clc1ccccc1CC, Clc1cc(CC)ccc1, Clc1ccc(CC)cc1), of which one has further stereoisomers. Which did the author mean? Can this be deduced from context? OPSIN will indicate whether a name is ambiguous and in time may even attempt to reason what was meant.
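To make the idea of flagging ambiguity concrete, here is a minimal Python sketch. It is not how OPSIN works internally: the lookup table `CANDIDATES` and the function `interpret` are invented for illustration, and a real converter would derive the candidates by parsing the name rather than looking them up. The SMILES strings for chloroethylbenzene are the five listed above.

```python
# Hypothetical table: name -> every structure (as SMILES) the name could mean.
# A real name2structure tool would compute these by parsing, not by lookup.
CANDIDATES = {
    "chloroethylbenzene": [
        "ClCCc1ccccc1",
        "CC(Cl)c1ccccc1",
        "Clc1ccccc1CC",
        "Clc1cc(CC)ccc1",
        "Clc1ccc(CC)cc1",
    ],
    "chlorobenzene": ["Clc1ccccc1"],
}


def interpret(name):
    """Return (candidate SMILES, is_ambiguous) for a name.

    Unknown names give an empty list and are not reported as ambiguous.
    """
    candidates = CANDIDATES.get(name.lower(), [])
    return candidates, len(candidates) > 1


candidates, ambiguous = interpret("chloroethylbenzene")
if ambiguous:
    print(f"ambiguous: {len(candidates)} possible structures")
```

The point of returning all candidates, rather than silently picking one, is that a later step (or a human) can use context to decide which was meant.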
These are the research reasons. We’ve now been joined by Daniel Lowe, a first-year PhD student supported by Boehringer Ingelheim to do research into machine interpretation of patents containing chemistry. Daniel’s made an excellent start, primarily by extending OPSIN. When he took this over from PeterC it was not a competitive tool.
Now it is.
How do we measure its success? There are no agreed corpora or metrics for chemistry NLP so we have to be careful. The essentials are to be Open and systematic and to invite community buy-in.
In essence Daniel has taken a representative set of 10000 “formally correct” IUPAC names and analysed them with OPSIN and 2 commercial programs. (You will appreciate that it is not easy to get funding to buy programs simply to test them, so there are others we cannot use.) At present we find for one corpus progA ~ OPSIN ~ progB and in two others progA > OPSIN > progB (yes, you will be kept guessing).
Treat all metrics with great suspicion, but OPSIN’s recall (i.e. names it translates correctly) is around 80% and it has the lowest error rate (incorrectly translated names) of all the programs (ca. 1%). [You should ask “on what corpus?” – and shortly we’ll tell you and Open it.]
We believe that the main reason why OPSIN < progA is vocabulary. Adding vocabulary is tedious as there is a very long tail. It’s good to do while watching cricket (as I am doing) but it’s still slow.
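For readers who want to see how such figures might be computed, here is a small Python sketch. It assumes each name in a corpus ends up in one of three bins (translated correctly, translated incorrectly, or not parsed at all) and that both recall and error rate are taken as fractions of the whole corpus. That denominator is my assumption for the sketch; the function name `score` and the outcome labels are invented, and the counts in the usage line are made-up round numbers, not our results.

```python
from collections import Counter


def score(outcomes):
    """Compute simple rates from per-name outcomes.

    outcomes: iterable of 'correct', 'incorrect' or 'unparsed', one per name.
    All rates are fractions of the total number of names (an assumption).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty corpus")
    return {
        "recall": counts["correct"] / total,
        "error_rate": counts["incorrect"] / total,
        "unparsed_rate": counts["unparsed"] / total,
    }


# Illustrative counts only: 80 correct, 1 wrong, 19 not parsed.
rates = score(["correct"] * 80 + ["incorrect"] * 1 + ["unparsed"] * 19)
print(rates)
```

Keeping "incorrect" separate from "unparsed" matters: a program that declines to answer is much less harmful than one that confidently returns the wrong structure.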
So this is the time when we can invite crowdsourcing. Until recently that wasn’t an option, but now OPSIN has a good infrastructure and it’s possible to add vocabulary without having to modify code. Much of OPSIN’s vocabulary is in external files which are fairly easy to modify and which won’t break the system.
OPSIN has, of course, always been Open Source and so – in principle – anyone could modify it. But in practice many OS projects have an incubation period where the infrastructure is being built and it’s very difficult to have an uncontrolled community process. Now we can offer a controlled community process where large numbers of people can make small but useful contributions.
There are two methods of approach, and we’ll start with the first:
- Become a developer on Sourceforge and modify the template files to add vocabulary. Some examples of vocabulary we are missing are carbohydrates, nucleic acid components and amino acids.
- We should develop an interface that allows users of OPSIN to add vocabulary interactively. Thus if it fails to parse 1,5-dihydroxymanxane, OPSIN would tell the user it didn’t know what manxane was and ask for a structure+locants.
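The second approach can be sketched in a few lines of Python. This is a toy, not OPSIN's code or file format: the tab-separated vocabulary text, the token kinds and the functions `load_vocab` and `tokenize` are all invented for illustration, and locants are simply thrown away where a real parser would keep them. It shows the one behaviour that matters here: when a name cannot be fully tokenised, the unknown remainder is reported so that someone can supply it.

```python
import re

# Hypothetical external vocabulary: "token<TAB>kind" per line.
# This is NOT OPSIN's real file format.
VOCAB_TEXT = "di\tmultiplier\nhydroxy\tsubstituent\nmethyl\tsubstituent\nbenzene\tparent\n"


def load_vocab(text):
    """Parse the vocabulary text into a dict of token -> kind."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        token, kind = line.split("\t")
        vocab[token] = kind
    return vocab


def tokenize(name, vocab):
    """Greedy longest-match tokenisation.

    Returns (tokens, unknown_remainder); the remainder is "" on success.
    """
    # Drop locants, commas, hyphens and spaces; a real parser would keep locants.
    rest = re.sub(r"[\d,\-\s]", "", name.lower())
    tokens = []
    while rest:
        match = max((t for t in vocab if rest.startswith(t)), key=len, default=None)
        if match is None:
            return tokens, rest
        tokens.append(match)
        rest = rest[len(match):]
    return tokens, ""


vocab = load_vocab(VOCAB_TEXT)
tokens, unknown = tokenize("1,5-dihydroxymanxane", vocab)
if unknown:
    print(f"I don't know what '{unknown}' is; please supply a structure and locants.")
    vocab[unknown] = "parent"  # stand-in for what the user would contribute
```

Because the vocabulary lives in data rather than code, adding "manxane" needs no change to the parser, which is what makes small contributions from many people practical. Greedy matching is only a simplification for the sketch and can mis-split names in general.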
So if you are interested in helping with OPSIN please let us know. Half a dozen vocabulary contributors could make rapid progress.
And when this is done we’ll have a tool that interprets IUPAC names and which, as it is Open, can become a de facto standard.