Adding semantic markup with InChI

If we could require all authors to provide machine-readable chemical structures in their chemistry articles, the quality of chemistry would increase dramatically and immediately. We could create Open databases straight away that were machine-searchable (just like crystalEye). No-one doubts that, but who is prepared to make it work?

  1. Richard [from RSC] Says:
    October 15th, 2007 at 10:17 am e[…]
    Sitting here as a publisher, we don’t have half the power you suggest – we have to satisfy authors (we want the best) as well as readers (to give them the best service), and making submission as easy as possible is an absolute requirement. Demanding InChIs from authors isn’t a realistic option yet – we can show the advantages of this information via the enhanced HTML and work towards it, but compulsion’s an attractive but ultimately futile option. The more you push, the worse data you’ll get.

    The publishers aren’t the problem – it’s because the possibilities of processing and reusing this information have only comparatively recently been apparent, and frankly because most people want to read the text and look at the pictures. As authors and readers are encouraged to look beyond print/PDF it’ll happen, but keeping the data within the publishing process is a community issue rather than a publisher one. We’d love it of course!

PMR: Well, who supports it? From Nascent at nature.com

Lunch with Egon Willighagen

We had lunch yesterday with Egon Willighagen, who in his spare time runs the Chemical Blog space, now situated at http://cb.openmolecules.net/ (running on postgenomic code).

The chat over lunch was pretty good; it turns out that Egon’s favorite molecule might be ascorbic acid. One of the topics that really animated Egon was how to link molecules to academic papers. By this I mean, for example, if you search Google or some dedicated search engine for a molecule, how does the search engine know which papers deal with that molecule? There are a couple of problems with solving this. One is that different fields use different terminology for molecules, especially as the molecules become large, so a plain text search for the name will not find all of the papers that you might be interested in. Another is that papers don’t have semantic markup of molecules.

One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray-Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.

Peter has a writeup on using InChI in blogs, and if every chemical that appeared anywhere were marked up with its InChI, or the article referring to it were tagged with it, then the findability problem would be solved by simple string searching.

OK great, well what’s the problem? For a start there is an alternative system, SMILES (Simplified Molecular Input Line Entry System), a markdown for molecules if you like. There is a very good description of the syntax here and the KinasePro blog has a short comment on how many people use SMILES vs InChI. The bottom line is that more people use SMILES, but it seems easier to search Google with InChI. I’m not a chemist, but from my naive standpoint the SMILES syntax seems closer to the text description of chemistry that we know from school, whereas the InChI system is more rigorous and requires one further step of abstraction. It reminds me of the difference between LaTeX for math and MathML. MathML is a hell of a lot easier to write a parser for than LaTeX, as LaTeX can be quite expressive; however, no one writes raw MathML. Scientists are lazy and that extra step of abstraction might be the reason why SMILES seems to be used more frequently at the moment.

Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button. Sounds reasonable, seems easy, but there are problems with this approach. I have heard people say a few times: you are Nature, you can make authors do anything in order to get a paper published, so why not get them to do x? Well, for a start, that’s an editorial decision, but even so, making more demands on scientists may not be the best decision when the process of publication is already pretty fraught and stressful. Even if we did this, what would we gain? A small selection of the literature would be marked up, but the vast majority of journals in the area would need to follow suit in order to gain full coverage. Of course an argument that we should not do x because other people are not doing x is not what I am getting at here, but rather that this cannot be seen to be a final solution to the problem. Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChI as supplementary information.

Even then one is still left with the vast existing corpora of papers that are already published. Egon points out that no one uses the literature in this area from 50 years ago, as modern techniques have advanced so far that this literature is functionally of little use. The implication here is that 50 years in the future we will only need to go back as far as today’s papers. Even so there has to be a value in seeing the evolution of an idea from its insertion into the literature right through to where it has led today, and Egon agreed with this.

So what can we do now to help make connections between papers and molecules? Peter Corbett, who works with Peter Murray-Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. Tools like this can begin to fill in the semantic blanks, both for papers from the past and for the current literature. Egon has now created RDF pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page. Egon’s pages check Connotea, and pull from Connotea co-tags of InChI tags (Here is a short description of this).

If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon’s pages. We got quite excited about this idea yesterday and are certainly going to discuss this further. It’s a small start, but a start nonetheless.
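
To make the "simple string searching" idea concrete, here is a minimal sketch of what tagging papers with InChIs could look like. It is not how Connotea or openmolecules.net actually work; the paper identifiers and the names build_index and papers_for are invented for illustration. The three strings are the standard InChIs for ethanol, benzene and water.

```python
from collections import defaultdict

# Standard InChI strings for three small molecules.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical papers, each tagged with the InChIs of the molecules it discusses.
PAPER_TAGS = {
    "paper-A": [ETHANOL, WATER],
    "paper-B": [BENZENE],
    "paper-C": [ETHANOL, BENZENE],
}


def build_index(paper_tags):
    """Invert paper -> InChIs into InChI -> set of papers."""
    index = defaultdict(set)
    for paper, inchis in paper_tags.items():
        for inchi in inchis:
            index[inchi].add(paper)
    return index


def papers_for(index, inchi):
    """Return the papers tagged with exactly this InChI string, sorted by id."""
    return sorted(index.get(inchi, ()))


index = build_index(PAPER_TAGS)
print(papers_for(index, ETHANOL))  # ['paper-A', 'paper-C']
```

The lookup is an exact string match, which is the whole attraction: no chemistry software is needed at search time, only at the moment the InChI is generated.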
PMR: and Richard replies:
Getting InChIs out from the chemical drawing is easily done now, but I don’t think there will be a realistic way to get them into the authoring process until the tools offer a robust way to keep the InChIs in the right place (and validated). Certainly it’s not a burden we could currently expect of the majority of authors, which is why RSC Project Prospect relies on a combination of text mining and input by skilled technical editors. It’s quite hard to do in practice, but it’s worth it when you see the results, which won us the ALPSP/Charlesworth Publishing Innovation award this year. The InChIKey should help to promote acceptance and use as Tony suggests, along with common treatment of these standards across publishers.
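
Part of the appeal of the InChIKey is that it is short and has a fixed shape, so it is easy to spot and sanity-check in text. The sketch below assumes the layout of the later standard InChIKey (14 capital letters, a hyphen, 10 capital letters, a hyphen, one capital letter, 27 characters in all); keys issued around the time of this post may have looked different. It checks the shape only and cannot tell whether a key belongs to a real structure. The function name looks_like_inchikey is made up for this example.

```python
import re

# Assumed layout of a standard InChIKey: 14 letters, 10 letters, 1 letter,
# separated by hyphens. This is a shape check, not a validity check.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text):
    """Return True if text has the shape of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


# Standard InChIKey for ethanol.
print(looks_like_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))  # True
# A full InChI string is not an InChIKey.
print(looks_like_inchikey("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))  # False
```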

PMR: so there seems to be a can-do in biology that is missing in chemistry. So let’s float a revolutionary idea for capturing biology in articles. Let’s start with protein sequences. Now these are complex molecules with lots of atoms, so we’ll make them simpler. We’ll call one group of atoms “A”, another “C” and so on to “Y”. We’ll just use 20 letters.
This will be very very difficult for biochemists. They haven’t had nearly as long as chemists to learn informatics and their molecules are much larger. Insulin has hundreds of atoms – ten times larger than most common molecules and it’s one of the smallest. But even so it is too long to fit on the page:
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
so we’ve had to break it in the middle.
Unfortunately that’s much too difficult for a biochemist to include in a paper and anyway it’s meaningless. You can’t understand it.
Well it was worth a try. It could actually revolutionise biology and perhaps create something we could call bio-informatics (similar to chemo-informatics).
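
For what it is worth, a machine copes with that string without complaint. A minimal sketch, assuming nothing beyond the Python standard library: it joins the two wrapped lines above, checks that every character is one of the 20 one-letter amino-acid codes, and counts the residues. The names AMINO_ACIDS, WRAPPED and read_sequence are invented for this example.

```python
# The 20 one-letter amino-acid codes, "A", "C" and so on to "Y".
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# The sequence exactly as printed above, broken in the middle.
WRAPPED = """
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
"""


def read_sequence(text):
    """Join wrapped lines and reject anything outside the 20-letter alphabet."""
    sequence = "".join(text.split())
    unknown = set(sequence) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return sequence


sequence = read_sequence(WRAPPED)
print(len(sequence))  # 110
```

The line break costs nothing: whitespace is discarded before the check, so the sequence can be wrapped to fit any page.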

This entry was posted in chemistry, open issues. Bookmark the permalink.
