Chemical names – the challenge

This is NOT a Trick Question But An Example of the Challenge (Part 1)

06:05 17/05/2008, Antony Williams,

[…]

How many chemicals are mentioned in this paragraph?
“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

The question is basically how to identify chemical names in free human text. This is different from working out what those names mean. Peter Corbett, Colin Batchelor (Royal Society of Chemistry), Ann Copestake and others in our Sciborg project have been working hard on this for many months. PeterC, please forgive me if I get details wrong. PeterC is presenting this shortly at a Natural Language Processing conference and it was also highlighted in Ann’s presentation to the British Computer Society last Tuesday (which I still have to blog). Peter gave a dry run to the Unilever Centre two weeks ago and I’ll try to do this justice.

“There was nothing on Tuesday”

“Nothing significant happened on Tuesday”

“No mail arrived on Tuesday”

“There was nothing significant on the TV, Tuesday” (US usage often omits “on” before a date)

“The day on which something was expected to happen was not Tuesday”

“She waited for his letter. There was nothing on Tuesday”

ne id="o71" surface="imidazole" type="CM" confidence="0.9257968817491067" SMILES="c1c[nH]cn1" InChI="InChI=1/C3H4N2/c1-2-5-3-4-1/h1-3H,(H,4,5)/f/h4H" cmlRef="cml9" ontIDs="CHEBI:16069"

“Using a given corpus, previously annotated by experts, and with agreed guidelines for marking up chemicals, what compounds occur in the following paragraph with a probability of greater than x (e.g. 0.9)?”
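To make the "probability greater than x" framing concrete, here is a minimal Python sketch that filters annotated entities by their confidence value. It assumes the annotation is a self-closing `ne` element inside a wrapper element; the wrapper name `annotations`, the function name `chemicals_above` and the second, low-confidence entry are invented for illustration, and only the attributes of the imidazole record come from the post.

```python
import xml.etree.ElementTree as ET

# Assumed layout: a wrapper element holding self-closing <ne> elements.
# The first <ne> reuses attributes quoted in the post; the second is a
# made-up low-confidence entry added for contrast.
SAMPLE = """
<annotations>
  <ne id="o71" surface="imidazole" type="CM"
      confidence="0.9257968817491067" SMILES="c1c[nH]cn1"
      ontIDs="CHEBI:16069"/>
  <ne id="x1" surface="spat" type="CM" confidence="0.12"/>
</annotations>
"""


def chemicals_above(xml_text, threshold=0.9):
    """Return surface strings of type-CM entities with confidence > threshold."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for ne in root.iter("ne"):
        if ne.get("type") != "CM":
            continue  # skip anything not marked as a chemical
        if float(ne.get("confidence", "0")) > threshold:
            hits.append(ne.get("surface"))
    return hits


print(chemicals_above(SAMPLE))       # threshold x = 0.9
print(chemicals_above(SAMPLE, 0.1))  # a much more permissive threshold
```

Raising or lowering the threshold trades precision against recall: at 0.9 only the confident imidazole entry survives, while at 0.1 the doubtful entry is returned as well.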

comparison with English-language lexicon. If a word is also an English language word it is less likely to be a chemical.
comparison with chemical lexicon, e.g. ChEBI. If it’s in there, its probability is increased
part of speech. If the word is a noun the probability is increased; if it is a verb it is decreased
lexical form. footyloxybarate is not a known chemical, but its lexical form makes it highly probable it is a fictitious one and not, say, a film star or pop group.
Hearst patterns. “bioactive compounds such as aspirin or spat”. Even if not in a lexicon “spat” is probably a chemical rather than the past tense of spit.
And usage (probabilistic). “take an aspirin” is a common phrase. “take a benzene” is of very low probability. So although “Dagger” (capitalised) is a trade name in PubChem, I doubt there are any extant uses of “a dagger” as opposed to “some dagger”.
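The way several of these clues could be combined into one probability can be sketched in a few lines of Python. This is a toy illustration only and not the method used in the project: the lexicons, the suffix list, the Hearst regular expression and every weight are invented assumptions, and the part-of-speech and usage clues are left out for brevity.

```python
import math
import re

# Tiny hypothetical lexicons; a real system would use a full English word
# list and a chemical dictionary such as ChEBI.
ENGLISH = {"spat", "dagger", "drive", "stance", "tartan", "recoil"}
CHEMICAL = {"aspirin", "benzene", "imidazole"}

# Hypothetical word endings that make a token "look" chemical.
CHEM_SUFFIX = re.compile(r"(ate|ene|ine|ole|ide|one|yl)$")

# One Hearst-style pattern: "compounds such as A, B or C".
HEARST = re.compile(
    r"compounds such as (\w+(?:(?:, | or | and )\w+)*)", re.IGNORECASE
)


def hearst_terms(sentence):
    """Return the lowercased terms listed after 'compounds such as'."""
    match = HEARST.search(sentence)
    if not match:
        return set()
    return {t.lower() for t in re.split(r", | or | and ", match.group(1))}


def chemical_probability(token, sentence=""):
    """Add invented log-odds weights per clue and squash to a probability."""
    word = token.lower()
    log_odds = -2.0            # assumed prior: most tokens are not chemicals
    if word in ENGLISH:
        log_odds -= 1.5        # ordinary English word: less likely
    if word in CHEMICAL:
        log_odds += 4.0        # found in the chemical lexicon: more likely
    if CHEM_SUFFIX.search(word):
        log_odds += 3.0        # chemical-looking lexical form
    if word in hearst_terms(sentence):
        log_odds += 4.5        # named as an example of "compounds"
    return 1 / (1 + math.exp(-log_odds))


context = "bioactive compounds such as aspirin or spat"
for tok in ("aspirin", "spat", "footyloxybarate"):
    print(tok, round(chemical_probability(tok), 2),
          round(chemical_probability(tok, context), 2))
```

With these made-up weights, “spat” on its own scores very low, but the same word inside the Hearst context scores well above one half, which mirrors the reasoning in the list above.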

Peter has other clever tricks (and I suspect that there are some that are unique to our project).

“She used her platinum card to buy a gold necklace, then crossed the iron bridge across the water as gold flecks decorated the sunset. Salt spray blew as she walked across the sand…”

PMR: [answer to O2: No, there is a telecom supplier in the UK with the trade name O₂ and it was full of telecoms gear. No oxygen except what comes from the air.]

2 Responses to Chemical names – the challenge

Antony Williams says:

May 19, 2008 at 4:24 am

Test before posting comments.

pm286 says:

May 19, 2008 at 6:39 am

(1) I don’t understand. What should I test? I tried to post a short comment on your blog just to notify you of the current post. It didn’t seem to work – I don’t know why.
