Chemical names - the challenge

Antony Williams on Chemspider posts a serious question and I give a serious answer. I wonder if it's what he was expecting... I'll state it, and then comment on NLP before giving an answer:
06:05 17/05/2008, Antony Williams,
[...]

How many chemicals are mentioned in this paragraph?

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

The question is basically to identify chemical names in free human text. This is different from working out what those names mean. Peter Corbett, Colin Batchelor (Royal Soc Chemistry), Ann Copestake and others in our Sciborg project have been working hard on this for many months. PeterC- please forgive me if I get details wrong. PeterC is presenting this shortly at a Natural Language Processing conference and it was also highlighted in Ann's presentation to the British Computer Society last Tuesday (which I still have to blog). Peter gave a dry run to the Unilever Centre two weeks ago and I'll try to do this justice.
Firstly, there are several types of usage which involve chemical names. Peter uses "pyridine" as an example.
  • the reactants were dissolved in pyridine
  • nicotinic acid and other pyridine derivatives
  • the signal from the protons in the pyridine ring were shifted upfield
In only the first does "pyridine" refer to a single, identifiable, compound, but all three sentences contain one or more chemical names. But is "reactant" a chemical? "Nicotinic acid" is. Are "protons" chemicals?
Language is flexible and ambiguous. Ann argued strongly, and I support her, that language is ipso facto ambiguous and its power comes from its ambiguity. We humans are very good at resolving this. I went into a room in the chemistry department recently and there was a large cabinet with O2 on it. Was this a chemical [see below]?
Almost all sentences are ambiguous. The work in Sciborg involves exhaustive multiple parses for any sentence. Here's a simple example:
"There was nothing on Tuesday"
The meaning of this seems obvious, but a machine would probably come up with at least 10 readings. For longer sentences it is not impossible to have thousands of parses and even run out of stack space on older machines. Interpretations could be:
"Nothing significant happened on Tuesday"
"No mail arrived on Tuesday"
"There was nothing significant on the TV, Tuesday" (US usage often omits "on" before a date)
"The day on which something was expected to happen was not Tuesday"
and more far-fetched:
"The universe disappeared into a void on Tuesday"
"Tuesday [Weld] had no clothes on"
"The mob had no hold on [Ruby] Tuesday"
The machine would almost certainly come up with most of these. The only manageable way is to give them all probabilities. A typical parse could hold millions of possibilities for a document, and the challenge of NLP is to try to disambiguate them. For example,
"She waited for his letter. There was nothing on Tuesday"
is still very hard to parse, but some of the ambiguities can pragmatically be removed.
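To make the idea of "giving them all probabilities" concrete, here is a minimal sketch. It is not how Sciborg's parsers work: the raw scores, the `CONTEXT_CUES` table and the `rerank` function are all hypothetical, invented only to show how a preceding sentence can shift probability mass between readings.

```python
# Hypothetical raw scores for a few of the readings listed above.
READINGS = {
    "Nothing significant happened on Tuesday": 4.0,
    "No mail arrived on Tuesday": 2.0,
    "There was nothing significant on the TV, Tuesday": 2.0,
    "Tuesday [Weld] had no clothes on": 0.1,
}

# Hypothetical context cues: a word in the preceding text boosts a reading.
CONTEXT_CUES = {
    "letter": {"No mail arrived on Tuesday": 10.0},
}


def normalise(scores):
    """Turn non-negative scores into probabilities that sum to 1."""
    total = sum(scores.values())
    return {reading: score / total for reading, score in scores.items()}


def rerank(readings, context=""):
    """Multiply scores by any matching context boost, then normalise."""
    scores = dict(readings)
    for cue, boosts in CONTEXT_CUES.items():
        if cue in context.lower():
            for reading, factor in boosts.items():
                scores[reading] *= factor
    return normalise(scores)


if __name__ == "__main__":
    for context in ("", "She waited for his letter."):
        probs = rerank(READINGS, context)
        best = max(probs, key=probs.get)
        print(repr(context), "->", best, round(probs[best], 2))
```

None of the readings is eliminated; the far-fetched ones simply end up with very little probability.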
So when OSCAR3 parses text Peter adds a probability to each putative named entity. Here's a simple example:
ne id="o71" surface="imidazole" type="CM" confidence="0.9257968817491067" SMILES="c1c[nH]cn1" InChI="InChI=1/C3H4N2/c1-2-5-3-4-1/h1-3H,(H,4,5)/f/h4H" cmlRef="cml9" ontIDs="CHEBI:16069"
OSCAR has identified "imidazole" as a putative chemical compound with a confidence of 93%. For
ne id="o101" surface="2H" type="CM" confidence="0.3341514144473448" rightPunct=","
it gives only 33% for the string "2H", which could be deuterium, or two Hydrogen atoms, or - as in this case - an annotation of a hydrogen atom in a spectrum.
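A consumer of this output can decide for itself how much confidence it requires. The sketch below reads the two annotations quoted above as plain attribute strings and keeps those at or above a threshold. The helper names (`parse_ne`, `confident_entities`) are my own, not part of OSCAR3, and a real pipeline would presumably use an XML parser on the full output rather than a regular expression.

```python
import re

# The two annotations quoted above, copied as attribute strings.
ANNOTATIONS = [
    'ne id="o71" surface="imidazole" type="CM" '
    'confidence="0.9257968817491067" SMILES="c1c[nH]cn1" '
    'InChI="InChI=1/C3H4N2/c1-2-5-3-4-1/h1-3H,(H,4,5)/f/h4H" '
    'cmlRef="cml9" ontIDs="CHEBI:16069"',
    'ne id="o101" surface="2H" type="CM" '
    'confidence="0.3341514144473448" rightPunct=","',
]

# name="value" pairs; values contain no embedded double quotes.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_ne(text):
    """Return the attributes of one annotation as a dict of strings."""
    return dict(ATTRIBUTE.findall(text))


def confident_entities(annotations, threshold):
    """Surface strings whose confidence is at least `threshold`."""
    kept = []
    for annotation in annotations:
        attrs = parse_ne(annotation)
        if float(attrs["confidence"]) >= threshold:
            kept.append(attrs["surface"])
    return kept


if __name__ == "__main__":
    print(confident_entities(ANNOTATIONS, 0.9))
    print(confident_entities(ANNOTATIONS, 0.3))
```

With a threshold of 0.9 only "imidazole" survives; lowering it to 0.3 lets "2H" through as well.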
And surely we can aim for 100%?
No. The responsible way to do research in NLP is to mark up a corpus first. So three annotators (Peter, Colin and David Jessop) spent some weeks marking up papers kindly provided by the Royal Society of Chemistry. (One of the major problems in text-mining is that you can't usually get a decent corpus because many publishers won't let you.) The average agreement between annotators was about 90%.
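For readers unfamiliar with how an "average agreement" might be computed, here is one common approach, sketched under assumptions: each annotator's markup is reduced to a set of (start, end) character spans, each pair of annotators is compared with an F-score, and the pairwise scores are averaged. I have not checked that this is the exact measure Peter used, and the spans below are toy data, not the real annotations.

```python
from itertools import combinations


def pairwise_f1(spans_a, spans_b):
    """F1 between two annotators' sets of (start, end) spans."""
    if not spans_a and not spans_b:
        return 1.0  # both marked nothing: perfect agreement
    overlap = len(spans_a & spans_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(spans_b)
    recall = overlap / len(spans_a)
    return 2 * precision * recall / (precision + recall)


def mean_agreement(annotations):
    """Average pairwise F1 over every pair of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(pairwise_f1(a, b) for a, b in pairs) / len(pairs)


# Toy data: annotator_2 did not mark the third span.
TOY = {
    "annotator_1": {(0, 8), (20, 29), (40, 42)},
    "annotator_2": {(0, 8), (20, 29)},
    "annotator_3": {(0, 8), (20, 29), (40, 42)},
}

if __name__ == "__main__":
    print(round(mean_agreement(TOY), 3))
```

Exact span matching is strict: two annotators who disagree only about where a name ends count as a full disagreement, which is one reason agreed guidelines matter.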
So there cannot be a single answer to Antony's question. A meaningful question would be:
"Using a given corpus, previously annotated by experts, and with agreed guidelines for marking up chemicals, what compounds occur in the following paragraph with a probability of greater than x (e.g. 0.9)"
OSCAR3 has been developed in this manner. Currently it achieves about 80% rather than the 90% that expert humans do. Among the strategies that OSCAR uses or might use are:
  • comparison with English-language lexicon. If a word is also an English language word it is less likely to be a chemical.
  • comparison with chemical lexicon, e.g. ChEBI. If it's in there, its probability is increased
  • part of speech. If it's a noun it's increased, if a verb it's decreased
  • lexical form. footyloxybarate is not a known chemical, but its lexical form makes it highly probable it is a fictitious one and not, say, a film star or pop group.
  • Hearst patterns. "bioactive compounds such as aspirin or spat". Even if not in a lexicon "spat" is probably a chemical rather than the past tense of spit.
  • And usage (probabilistic). "take an aspirin" is a common phrase. "take a benzene" is of very low probability. So although "Dagger" (capitalised) is a trade name in Pubchem, I doubt there are any extant uses of "a dagger" as opposed to "some dagger".
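A few of the strategies above can be sketched as independent pieces of evidence that push a score up or down. This is emphatically not OSCAR3's implementation: the word lists are tiny stand-ins for a real English lexicon and for ChEBI, the suffix and Hearst patterns are crude, and every weight is a number I made up for illustration.

```python
import math
import re

ENGLISH_WORDS = {"spat", "dagger", "drive", "he", "took"}  # toy English lexicon
CHEMICAL_LEXICON = {"aspirin", "pyridine", "imidazole"}    # toy stand-in for ChEBI
CHEMICAL_SUFFIX = re.compile(r"(ate|ine|ole|ane|ene|one|ide)$")  # crude lexical form
HEARST = re.compile(r"compounds such as (\w+)(?: or (\w+))?")    # one Hearst pattern


def hearst_hits(sentence):
    """Words that appear as examples after 'compounds such as'."""
    hits = set()
    for match in HEARST.finditer(sentence.lower()):
        hits.update(group for group in match.groups() if group)
    return hits


def chemical_probability(word, sentence):
    """Add hypothetical log-odds weights, then squash to a probability."""
    w = word.lower()
    log_odds = -2.0                 # prior: most words are not chemicals
    if w in ENGLISH_WORDS:
        log_odds -= 2.0             # ordinary English word: less likely
    if w in CHEMICAL_LEXICON:
        log_odds += 5.0             # found in the chemical lexicon
    if CHEMICAL_SUFFIX.search(w):
        log_odds += 3.0             # looks like a chemical name
    if w in hearst_hits(sentence):
        log_odds += 3.0             # introduced as an example compound
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    cases = [
        ("aspirin", "He went home and took an aspirin after the beating."),
        ("spat", "she spat in his face"),
        ("spat", "bioactive compounds such as aspirin or spat"),
        ("footyloxybarate", "a dose of footyloxybarate"),
    ]
    for word, sentence in cases:
        print(word, round(chemical_probability(word, sentence), 3))
```

In this toy version "spat" scores higher inside the Hearst pattern than in "she spat in his face", but the English-lexicon penalty still holds it down; in a real system the relative weights would be learned from the annotated corpus rather than guessed.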

Peter has other clever tricks (and I suspect that there are some that are unique to our project).

So my answer to Antony's question is that although there are several lexical forms which occur in certain lexicons (primarily Pubchem) their local syntactic and semantic occurrence makes it extremely improbable that any of them would be meaningful compounds. OSCAR will find only two possible compounds:
  • He. Unfortunately short strings (He, As, In, Be, etc.) and many abbreviations are difficult. OSCAR weights these down and the probability is low.
  • aspirin - by lookup.
If you think it's easy to identify chemicals, try this phrase:
"She used her platinum card to buy a gold necklace, then crossed the iron bridge across the water as gold flecks decorated the sunset. Salt spray blew as she walked across the sand... "
I doubt that Peter and Colin would achieve 100% on that. But it's an unfair test as the chemical guidelines were based on 14 papers from the RSC, not bodice-ripper novellas.

PMR: [answer to O2: No, there is a telecom supplier in the UK with the trade name O2 and the cabinet was full of telecoms gear. No oxygen except what comes from the air.]

2 thoughts on “Chemical names - the challenge”

  1. pm286

    (1) I don't understand. What should I test? I tried to post a short comment on your blog just to notify you of the current post. It didn't seem to work - I don't know why
