Semantic markup and text-mining

Here's an interesting development which emphasizes text-mining but actually seems to be a form of semantic authoring (via Peter Suber)...

Structured Digital Abstracts - Easier Literature Searching But Not Democratic


This month FEBS Letters is carrying out an interesting experiment that could make literature searching easier for both humans and computers.

The experiment centres on Structured Digital Abstracts (SDA). SDA are extensions of the normal journal article abstracts that describe the relationship between two biological entities and mention the method used to study that relationship. Each sentence is preceded by one or more identifiers pointing to the corresponding database entries that contain the full details of the interaction, e.g. protein A interacts with protein B, by method X.

The aim of SDA is to assist data entry, text mining and literature searching by extracting the salient data from the article into simple sentences using a defined structure and controlled vocabularies.

Gianni Cesareni, Editor of FEBS Letters, explains:

Many articles in biological journals describe relationships between entities (genes, proteins, etc.) yet this information cannot be efficiently used because of difficulties in retrieving from text. Databases capture this valuable information and organize it in a structured format ready for automatic analysis. The experiment of using SDAs will facilitate database entry and improve disclosure, to the benefit of authors and readers.

This month’s edition of FEBS Letters contains a number of articles annotated with SDA, along with some articles on SDA itself.

This is a simple but very good idea and I would certainly appreciate anything that makes literature searching easier.

But I can’t help noting the delicious irony in the title of the first article in the issue that trumpets the arrival of SDA: “Finally: The digital, democratic age of scientific abstracts”.

The first irony is that reading this article on digital democracy requires a subscription to FEBS Letters.

The second irony is that while SDA make it easier to find articles of interest, reading the original article also requires a FEBS Letters subscription, effectively making the abstracts marketing tools for the journal.

So useful they may be, “digital” they may also be, but “democratic” they are certainly not.

Wouldn’t the flow of information be better served if everyone just published in open access journals?

PMR: Obviously I agree with the comments on Openness and "democracy". I'd comment in general that efforts by closed access publishers to "dumb-down" the exposure of information because of business processes are likely to be counterproductive. For example the Nature initiative on text-mining (OTMI) chops an article into words and snippets (mainly sentences) and wraps them in XML. [The "Open" does not mean that the information is open, nor that the governance or process is open, but that the DTD is published]. I have many good things to say about Nature but OTMI is not what we need. Text-miners need the full text, because the usage of a term at different points may be context-dependent. And it is so clear that the inadequacy is driven by commercial considerations rather than technical ones.

I haven't got into work, so I haven't yet looked at the FEBS examples, but the description seems to be of semantic markup provided by the journal. This, in general, is a highly desirable process. A paper on the "Indian hedgehog gene" suggests spiky animals roaming round the Taj Mahal - actually I believe it's a vertebrate gene, named after the Drosophila hedgehog gene (and signified by the gene symbol Ihh). So it's enormously valuable to have this markup. In collaboration with our laboratory, the Royal Society of Chemistry has pioneered this in its Project Prospect, where the molecules and some other concepts are marked up and hyperlinked to ontologies.

And as FEBS says you need structure and controlled vocabularies. That's a credit to the hard work done over many years by the gene annotation community and other bioscientists who have built Gene Ontology, ChEBI and other resources.
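
To make "structure and controlled vocabularies" a little more concrete, here is a minimal Python sketch of how one SDA-style statement (protein A interacts with protein B, by method X) might be held as a structured record. I have not seen the actual FEBS format, so everything here is an assumption for illustration: the field names, the identifiers and the tiny vocabulary are all invented.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary of allowed relations (invented for
# illustration; a real one would come from a curated ontology).
ALLOWED_RELATIONS = {"interacts with", "binds to"}


@dataclass(frozen=True)
class Interaction:
    """One SDA-style statement: entity A relates to entity B, by a method."""
    record_id: str  # hypothetical identifier of the database entry
    entity_a: str   # hypothetical identifier for the first entity
    relation: str   # should be a term from the controlled vocabulary
    entity_b: str   # hypothetical identifier for the second entity
    method: str     # hypothetical identifier for the experimental method

    def is_valid(self) -> bool:
        # A statement is only useful to machines if its relation is a known term.
        return self.relation in ALLOWED_RELATIONS

    def sentence(self) -> str:
        # Render the record as a simple sentence preceded by its identifier.
        return (f"{self.record_id}: {self.entity_a} {self.relation} "
                f"{self.entity_b} by {self.method}")


example = Interaction("DB:0001", "protein-A", "interacts with",
                      "protein-B", "method-X")
print(example.sentence())
```

The point of the sketch is only that once the relation is drawn from a fixed list and the entities are identifiers rather than free text, a machine can check and index the statement without having to parse prose.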

We don't have that in chemistry. Why not? Readers of this blog will know - it's almost completely due to restrictive business practices. Authority-based identifiers (CAS, Elsevier) for chemicals exist - they just aren't open. So how about it, authorities?

My vision is that this markup should not be done by the journal or the reviewers or by machines but by the authors. They are the ones who know what an Indian hedgehog is (at least we hope so). Of course the journals can check. What we need are semantic authoring tools.

This isn't as far away as it may seem. Modern authoring tools such as Word 2007, OpenOffice and LaTeX now allow much customisation. The authorship - that's you - is getting much smarter. We're used to plugins. So semantic plugins that - say - scan the text for Indian hedgehogs and add "Ihh" wouldn't be difficult. Given that most people use Word, this is a good place to start. (I'll be writing more about this over the next few weeks, particularly about chemistry). For braver adventurers we recommend OpenOffice and in particular the ICE plugin/framework from Peter Sefton, with whom we are working.
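
As a rough idea of what the core of such a plugin might do, here is a minimal Python sketch that scans text for phrases from a vocabulary and wraps each hit in markup carrying its identifier. It is not how Word, OpenOffice or ICE plugins are actually written; the `annotate` function, the `<term>` element and the one-entry vocabulary are all made up for illustration.

```python
import re

# Hypothetical vocabulary mapping a surface phrase to its identifier.
# A real tool would load this from an open, curated resource.
VOCABULARY = {
    "Indian hedgehog": "Ihh",
}


def annotate(text, vocabulary=VOCABULARY):
    """Wrap each known phrase in a (made-up) <term> element with its identifier."""
    if not vocabulary:
        return text
    # Try longer phrases first so they win over shorter ones that overlap.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): ident for phrase, ident in vocabulary.items()}

    def tag(match):
        found = match.group(1)
        return f'<term id="{lookup[found.lower()]}">{found}</term>'

    return pattern.sub(tag, text)


print(annotate("A paper on the Indian hedgehog gene."))
```

A plain string match like this cannot tell the gene from the animal, and it misses plurals and other variants, which is exactly why the author should be in the loop: the tool proposes the markup and the person who knows what was meant confirms it.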

So let's go straight for semantic authoring. A major benefit will be that the author controls the process. And the markup should then be free and Open.
