In a recent post (Text-mining at ERBI: Nothing is 100%) I set a little problem – asking you to estimate the number of named chemical entities in a piece of experimental text. At the ERBI meeting estimates ranged between 4 and 11. I got three blog answers ranging from 8 to 15 (though the latter was looking for lexical as well as semantic matches, so really 8 to 11).
The main point is that unless the problem and methodology are clearly defined there is no single answer, and since I didn’t define the problem I expected a spread. The message is that if you talk about text-mining without defining the methodology, your results are meaningless.
In this exercise we are addressing the need for interannotator agreement (IA), in which a series of domain experts are given precise rules on how to annotate the text (in this case by marking up the named chemical entities). Since I didn’t tell you what a named chemical entity was, I again expected a spread.
In the SciBorg project, which includes help from RSC, Nature and IUCr, Peter Corbett, Colin Batchelor (RSC) and Simone Teufel (Computer Lab) have produced a set of rules for identifying chemical entities – not interpreting them into connection tables or looking them up, but identifying them as chemical entities and giving the start and end of the character string. There are several classes (adjectives, enzymes, etc.), but the passages I gave you contained only nouns relating to “chemicals”, designated CM. (Language processing folks like abbreviations – “was”, the past tense of “be”, is tagged BEDZ.) Here’s the guide:
Peter Corbett, Colin Batchelor and Simone Teufel, “Annotation of Chemical Named Entities”, BioNLP 2007: Biological, Translational, and Clinical Language Processing, Prague, Czech Republic.
This is quite precise about what a CM is. Most of them were “obvious” but the guide makes clear that “reaction mixture” and “yellow oil” are not CMs while “molecular sieve” and “petroleum ether” are. “Petroleum ether:diethyl ether” contains 2 distinct CMs. So the answer is 9 unique CMs of which one (petroleum ether) occurs twice.
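To make the character-offset idea concrete, here is a toy sketch (my own illustration in Python, not the actual SciBorg/OSCAR3 markup) of what annotating the two CMs in “petroleum ether:diethyl ether” amounts to – each annotation is just a start offset, an end offset and a label:

```python
# Toy illustration only -- not the actual SciBorg/OSCAR3 markup format.
# A CM annotation is just a pair of character offsets plus a type label.

text = "purified with petroleum ether:diethyl ether (9:1)"

# Hand-marked CM spans: (start offset, end offset, label)
annotations = [
    (14, 29, "CM"),   # "petroleum ether"
    (30, 43, "CM"),   # "diethyl ether"
]

for start, end, label in annotations:
    print(f"{label}: {text[start:end]!r} at [{start}, {end})")
```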
The process necessarily involved drawing arbitrary boundaries. Some of you might feel they should be broader or narrower; the authors have had to set them somewhere, and that is what they have done. They have had to write them up very clearly (about 30 pages). We hope you feel this is a useful resource.
So if we are now given the guidelines we should all agree 100%, shouldn’t we? Well, Peter and Colin tried it on themselves. They took 14 papers from all branches of chemistry and annotated them – a task that takes ca. 2 weeks. They did not get 100% agreement – they got 93%, even though they had written the guidelines themselves. This disagreement is universal: there is no IA of 100% except in trivial tasks. They involved a third party and the tripartite agreement was about 90%.
So if humans can only agree 90% of the time we can’t expect machines to do better. And they didn’t. OSCAR3 has been trained on a similar corpus, with some papers used for training and some for metrics. Here we test OSCAR3 against the “Gold Standard” – papers marked up (jointly) by the experts as their best estimate of what the guidelines suggest. OSCAR then identifies the character strings which are CMs. The metrics are harsh: if OSCAR gets “ether” instead of “petroleum ether” it gets a negative mark (not even zero).
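To show how harsh the exact-span scoring is, here is a toy scorer (again my own sketch, not OSCAR3’s evaluation code). A prediction only counts as correct if both its start and end offsets match a gold annotation exactly, so finding “ether” inside “petroleum ether” costs both a false positive and a false negative:

```python
# Toy exact-span scorer -- an illustration of the scoring principle,
# not OSCAR3's actual evaluation code. Spans are (start, end) offsets.

def score_spans(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)       # start AND end both match exactly
    fp = len(predicted - gold)       # spurious or mis-bounded prediction
    fn = len(gold - predicted)       # gold CM missed (or only partly matched)
    return tp, fp, fn

gold      = [(14, 29), (30, 43)]     # "petroleum ether", "diethyl ether"
predicted = [(24, 29), (30, 43)]     # found bare "ether", not "petroleum ether"

print(score_spans(gold, predicted))  # (1, 1, 1): the partial match is penalised twice
```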
There are two measures – precision and recall. Each can be improved at the expense of the other. One combined metric is the F-score, the harmonic mean of the two. OSCAR gets about 80% (Peter has some novel ways of approaching this using probabilities).
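For completeness, here is how the three counts from the sketch above combine into precision, recall and F-score (the standard definitions):

```python
# Standard precision / recall / F-score from the exact-match counts above;
# the F-score is the harmonic mean of precision and recall.

def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0   # fraction of predictions that are right
    recall    = tp / (tp + fn) if tp + fn else 0.0   # fraction of gold CMs that were found
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(precision_recall_f(tp=1, fp=1, fn=1))   # (0.5, 0.5, 0.5)
```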
So if you hear someone saying “My technique identifies 90% of chemical compounds in text”, ask them “where is your corpus?”, “what is your interannotator agreement?”, “where are your guidelines?” and “how did you compute your metrics?”.
If they can’t answer all 4, don’t believe a word.