petermr's blog

A Scientist and the Web


Blogging and the chemical semantic web

This post will explain how chemically-aware blogs can be indexed and searched. If you’re not a chemist but are interested in the semantic web, it may still be worth reading.

I revealed in recent posts that molecules in blogs can be indexed on their chemical structure, thus making the web chemically semantic. (I use the lower-case version to show that we are not using the heavyweight Semantic Web (OWL, triples, etc.) but something much more akin to microformats.) Anyway, the idea is simple…

For any document containing chemistry, mark up each compound with its InChI tag, which can be guaranteed unique for that compound. I’m going to concentrate on blogs, but the idea extends to any web document. (I’ll exclude most chemical papers: they are generally closed, so we can only access them with subscriptions, and we are often legally prevented from doing the indexing described below.)

The main ways of adding InChI tags are:

  • persuade the author to do this when they create the post. Most current chemical software either creates InChIs or creates a file that can be converted into InChIs (e.g. with our WWMM services). With practice this would probably take 1-2 extra minutes per compound, especially if we can create a drag-and-drop InChIfication service at Cambridge or elsewhere. The InChI (which is simply a text string) can either be added to the text of the blog or hidden in the alt attributes of the imgs for the chemical structures. Again, this is fairly straightforward (though I have had to fight my editor). And I think we can expect blog tools to become semantic – at least for microformats – over the next few months.
  • extract the structure from the blog and turn it into InChI. This is harder (unless the authors use a robust format such as CML or possibly SMILES). One way is to interpret chemical names as structures – we’ll explain our work on this later. But semantic authoring is far better.
  • extract a known Open chemical ID from the site. PubChem is the only realistic source (it has ca. 6 million compounds); CAS numbers are closed and copyrighted, so they cannot be used. If we do this, then I would suggest the PubChem entry is indexed like this: “CID: 2519”. (This is very easily cut-n-pasted from the PubChem site.) I am normally hesitant to use IDs but I think we can make an exception for PubChem.
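
To make the tagging conventions above concrete, here is a minimal sketch of how an indexer might pull InChIs and PubChem CIDs out of a blog page. It is only an illustration, not the indexing code we actually run: the class and function names are hypothetical, and it assumes that an InChI is any whitespace-free run starting with "InChI=" (in the visible text or in an img alt attribute) and that a CID is written as "CID: <digits>".

```python
import re
from html.parser import HTMLParser

# Assumed conventions (illustrative, not a specification):
# an InChI is a whitespace-free run starting with "InChI=",
# and a PubChem ID is written as "CID: <digits>".
INCHI_RE = re.compile(r"InChI=[^\s\"'<>]+")
CID_RE = re.compile(r"\bCID:\s*(\d+)")


class ChemTagExtractor(HTMLParser):
    """Collect InChI strings and PubChem CIDs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.inchis = set()
        self.cids = set()

    def _scan(self, text):
        # Record every InChI and CID found in a piece of text.
        self.inchis.update(INCHI_RE.findall(text))
        self.cids.update(int(c) for c in CID_RE.findall(text))

    def handle_starttag(self, tag, attrs):
        # InChIs hidden in the alt text of structure images.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self._scan(value)

    def handle_data(self, data):
        # InChIs or CIDs written directly in the visible text.
        self._scan(data)


def extract_chem_tags(html):
    """Return (set of InChI strings, set of integer CIDs) for a page."""
    parser = ChemTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.inchis, parser.cids


# Hypothetical blog fragment: methane tagged in an alt attribute,
# plus a CID written in the text.
post = (
    '<p>Methane: <img src="methane.png" alt="InChI=1S/CH4/h1H4"></p>'
    "<p>See also CID: 2519 on PubChem.</p>"
)
inchis, cids = extract_chem_tags(post)
print(sorted(inchis), sorted(cids))
```

One limitation of this sketch: an InChI followed immediately by punctuation in running text (such as a full stop) would have that character included in the match, so a real indexer would need a stricter pattern or a validation step.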

A good example of an InChIfied site is the Carcinogenic Potency Database (CPDB) at Berkeley, which contains a list of chemicals; a typical entry shows the InChI (scroll to the bottom part of the page). This site consistently gets good hits on Google when searched by the InChI string (try it at our GoogleInchI server).
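
Searching by InChI works because the whole string acts as an exact key for the compound. As a rough sketch of that idea (not how Google or our GoogleInchI server is implemented), a toy index could map each InChI string to the pages that mention it. The class name and the example.org URLs below are made up for illustration.

```python
from collections import defaultdict


class InChIIndex:
    """Toy index: map each InChI string to the URLs that mention it."""

    def __init__(self):
        self._pages = defaultdict(set)

    def add_page(self, url, inchis):
        # Register one page under every InChI found on it.
        for inchi in inchis:
            self._pages[inchi].add(url)

    def search(self, inchi):
        # Exact string match: the InChI itself is the search key.
        return sorted(self._pages.get(inchi, ()))


# Hypothetical pages; the InChIs are methane and ethane.
index = InChIIndex()
index.add_page("https://example.org/blog/methane", ["InChI=1S/CH4/h1H4"])
index.add_page(
    "https://example.org/blog/alkanes",
    ["InChI=1S/CH4/h1H4", "InChI=1S/C2H6/c1-2/h1-2H3"],
)
hits = index.search("InChI=1S/CH4/h1H4")
print(hits)
```
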
So, this post is to suggest to chemical bloggers that they add InChIs to their blogs. There are about 15 blogs that seem to have enough chemistry to make this worthwhile (I’ve taken these from Post Doc Ergo Propter Doc) and I’d be grateful for comments on what I have misrepresented or what I’ve left out. The loose criteria for inclusion are (a) that there are frequent chemical structure diagrams or (b) that there are enough chemical names worth tagging.

I add:

but exclude RSS and CMLRSS feeds at this stage (though they will be the future of some chemical newsfeeds).

So this is to encourage you, as chemical bloggers, to add InChIs (or PubChem CIDs) to your blogs. If you do, we can index your blogs and we’ll be showing some more magic RSN.
