Teaching #ami2 to recognize biological names (binomial)

Erithacus rubecula (Wikimedia Commons) “the Robin”

#ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we’ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search it with XPath. XPath is one of the simplest and most powerful search tools for moderate chunks of information. Here she searches a page for italic phrases with at least one space, for example:

I heard an Erithacus rubecula today. (I first wrote “Erithacus Rubecula”; @rmounce points out the capitalization!)

AMI has extracted the HTML (<i>…</i> means italics):

I heard an <i>Erithacus rubecula</i> today.

Now she creates an XPath:

".//html:i[contains(.,' ')]"

This means:

.// anywhere in the document (we can increase the precision later)
html:i a chunk of italics
contains(.,' ') whose text (.) contains a space (' ')
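
For readers who want to try the same search without XOM, here is a minimal sketch in Python. It is not AMI's code: it uses the standard library's ElementTree, whose XPath subset has no contains(), so the "contains a space" test is done in plain Python instead. The function name italic_phrases and the sample sentence (including the extra single-word italic) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace that the "html:" prefix is assumed to be bound to.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def italic_phrases(xhtml):
    """Return the text of every <i> element whose text contains a space.

    Rough equivalent of .//html:i[contains(.,' ')]; the space test is
    done in Python because ElementTree does not implement contains().
    """
    root = ET.fromstring(xhtml)
    phrases = []
    for node in root.iter("{%s}i" % XHTML_NS):
        # itertext() joins all nested text, like the XPath string value of "."
        text = "".join(node.itertext())
        if " " in text:
            phrases.append(text)
    return phrases


# Hypothetical input: one two-word italic phrase and one single-word italic.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>I heard an <i>Erithacus rubecula</i> today, "
    "as reported in <i>Nature</i>.</p>"
    "</body></html>"
)

print(italic_phrases(SAMPLE))  # ['Erithacus rubecula']
```

The single-word italic is dropped because it has no space, which is exactly the filter the XPath predicate applies.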

It’s not flowing prose but it’s trivial for AMI. And the result (using Jaxen query() in XOM) is:

& Evolution
16S, COI
16S, COI, COII
16S, P
Achillea macrophylla, Adenostyles alliarae
Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
Advances in Chrysomelidae Biology 1.
Ae. triuncialis
Aegilops geniculata
Annals of the Entomological Society of
Annals of the Entomological Society of America
Annual Review of Ecology and
Applied Statistics
BMC Bioinformatics
BMC Evolutionary Biology
Bioinformatics 2005, 21(24):4423-4424. 69. Sikes DS, Lewis PO: PAUPRat: PAUP implementation of the parsimony ratchet.
Biological Journal
Biology and Evolution
Boston University, Boston,
COI (13 PPIc among 16 polymorphic sites) and
COII, P
Cladistics-the International Journal of the Willi Hennig Society
Current Biology
Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs
Diabrotica virgifera
Die Käfer Mitteleuropas.
Doronicum clusii
Doronicum grandiflorum

Clearly not all italics are organisms. Many are bibliographic indicators. There are two simple ways to improve the precision:

Remove false positives. We can probably remove most of the bibliography by context (they occur on title pages and in references)
Include only known species. This is probably the best way forward, and we have an excellent Open Source tool (Linnaeus) from Casey Bergman and colleagues at Manchester, with more than 10000 of the commonest species.

There are other ways:

Morphology and lexical analysis of digraphs (the letter frequency in organisms is very different from English prose – higher vowel frequency for example).
Local context (include Hearst patterns … but hey, I have to go…)
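
The letter-frequency idea can be explored with a few lines of Python. This is a rough sketch, not a classifier: the function names are invented here, only the ASCII vowels a, e, i, o, u are counted, and no threshold for "looks like an organism" is implied.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: ASCII vowels only, so "ä" is not counted


def vowel_fraction(phrase):
    """Fraction of the letters in phrase that are vowels (0.0 if no letters)."""
    letters = [ch for ch in phrase.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in VOWELS for ch in letters) / len(letters)


def digraph_counts(phrase):
    """Count adjacent letter pairs within each word of phrase."""
    counts = Counter()
    for word in phrase.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts


print(vowel_fraction("Aegilops geniculata"))  # 0.5
print(digraph_counts("Doronicum").most_common(3))
```

Comparing these numbers for species names against the same numbers for ordinary English phrases would be the next step; that comparison needs a proper corpus and is not attempted here.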

So we easily get:

Achillea macrophylla, Adenostyles alliarae
Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
Ae. triuncialis
Aegilops geniculata
Diabrotica virgifera
Doronicum clusii
Doronicum grandiflorum

So I hope you are now clear about how powerful content-mining is, how it will revolutionise science and how it is a crime against human knowledge to restrict its deployment.


2 Responses to Teaching #ami2 to recognize biological names (binomial)

Ruth Lewis says:

April 4, 2013 at 1:52 pm

Just wondering if there was a tool that could somehow detect words on image files – This image tagging project came to mind when I read your post http://blog.biodiversitylibrary.org/2013/03/bhl-image-collection-on-eol.html
Probably not but just thought I’d mention it.

- pm286 says:
  
  April 4, 2013 at 3:38 pm
  
  Depends.
  We find some images as PNG (etc) with overlaid characters. These are easily detectable (though maybe not easily interpretable). Examples are (a), titles, and arrows pointing to whatever. Some phylo trees work that way.
  If the annotation is part of the image (i.e. bits) then probably not although if there is a common style (e.g. annotating gels or histology) that will be possible in the future. And the more they are from a common authoring source, the more likely.
  So hard but not impossible
