Why bother with new technology?

Kinasepro has blogged about discussions of new chemoinformatics technology (specifically CML (Chemical Markup Language) and InChI (chemical identifier)). Here’s the post and some correspondence. It’s basically about the introduction of new technology. Obviously I’m not neutral but I will try to discuss it in a neutral manner. For that reason I have copied it more or less in full.

There’s been a fair amount of talk [ChemBark] over the last little while on the topic of chemoinformatics and chemblogs. Here’s my two cents.
smiles inchi Aldrich
smiles inchi ChemExper
smiles inchi The PDB
smiles inchi Chemdraw (until v10)
smiles inchi The entire pharmaceutical industry.
smiles inchi Peter Murray Rust
smiles inchi IUPAC
So somehow a couple of librarians have convinced Google that InChI > SMILES. Result? Google may well do InChI, but no one but the librarians is currently using it, and meanwhile Google doesn’t index SMILES very well. I’m reminded of a day when it was thought to be a good idea to put the CAS#s of new entities at the bottom of ACS journal articles. Don’t worry, we survived those librarians too.
PMR: I’m not sure who the librarians are. I’d label all of (us) as chemical informatics. The institutions include NIST, RSC, and University of Cambridge. I don’t think Google has been convinced of anything – chemistry is relatively too small for Google to worry about. But yes, we have visited and had very useful and forward-looking conversations. Watch out for Googlebase…
Lookit, we don’t need a string of XML code that you need an advanced degree to use. We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format. Near term: Get google to index picture descriptions, and code a firefox plugin that can insert smiles into said descriptions.
PMR: I am not quite sure what “index picture descriptions” means. Google indexes the fact that there is a picture but not the content. There are major efforts in image recognition, but I am not aware that any of this is being done in chemistry. I think that indexing chemistry in published GIFs is extremely difficult. I’ve looked at this over the years and conclude that it would be much easier if authors simply made their molecular files available.
Till google has a smiles substructure search, I’m not going to bother.
PMR: This is a perfectly valid response from an individual in the system. It’s rather less encouraging if it reflects the whole of chemistry (which currently it does). If the chemical informatics community says “at some stage Google will solve all our chemical problems, until then we’ll do nothing” that’s regrettable. All other major scientific disciplines (physics, astronomy, bioscience, geosciences, etc.) are making major efforts to develop informatics infrastructure. Some of us are, in fact, thinking about how to do this. The problem is that there has to be some software somewhere. It can be in the following places:
  • client (i.e. your browser)
  • Google (we have discussed this with Google and it’s not impossible)
  • third party (who may or may not charge for it).

Given that Openbabel can search millions of structures quite rapidly there are some encouraging opportunities.

  1. totallymedicinal, Dec 5th, 2006 at 3:09 pm
    Couldn’t agree more with the sentiment – not only does my ancient version of ChemDraw not support this exotic format, but I have enuff hassle in my life without learning some obscure new coding system.

PMR: Again this is a perfectly valid response. Any approach to chemoinformatics requires tools. And I suspect you or your institution would have to pay for an upgrade to ChemDraw. Obviously there is the opportunity of some Open Source free tools but they are not yet widely deployed and are effectively for early adopters.

  3. Paul, Dec 7th, 2006 at 4:06 am
    I could not agree more about the need for an integrated solution! I got a really thoughtful response from Peter Murray-Rust and friends, and I feel kind of bad about not acting on it, but putting random InChI designations at the bottom of all our blog posts doesn’t seem worth it to me. I think that CML is indeed the future, and I look forward to the day of being able to download a CML plugin for WordPress that will take care of everything for us lazy bloggers.
PMR: There is no doubt there are technical problems and they will require some early adopters to solve. I have tried to hide the InChI – it is an effort and is fragile. Given that I have problems with simple computer code in WordPress I expect the same with chemistry. However we have some new ideas of how to take this away from the WordPress process.
  4. Chris, Dec 9th, 2006 at 12:23 pm
    The argument against SMILES seems to be that it is not an Open Format and that it is possible to represent a single molecule with multiple SMILES strings. For my part I can read and write SMILES (and SMARTS and SMIRKS). I find InChI impenetrable and I don’t think there is syntax for substructure or similarity queries; in addition I don’t think there is a system for describing reactions. I’ve started to add SMILES to my web pages in the hope that someone will build an index at some point. I guess it would help if there was a SMILES tag.
PMR: SMILES was a groundbreaking language when it came out. In general I have no problem with non-Open formats if there are free tools to manage them. There is a canonicalization algorithm for SMILES but it is closed and proprietary. I have regularly discussed the value of making it openly available with Daylight management but they are not prepared to do this. This is a legitimate business approach – control the market through trade secrets. In the current case, however, it has the practical downside that several groups have created incompatible “canonical” SMILES.
The main virtue of InChI is that it is a public Open Canonicalization algorithm. It’s perfectly possible to convert InChI to SMILES if you want. It would not be “canonical SMILES” in the strict sense, but it would be canonical. That may, in fact, be a useful approach for certain types of compound. As InChI has a richer set of concepts than SMILES there may be some information loss.
In summary, if Daylight had made the SMILES algorithm public and it had been used responsibly I doubt very much whether we would have InChI. It has been driven by the lack of interoperability in chemistry – coming in some part from government agencies and the publishing community.
InChI is by definition impenetrable. It’s an identifier. Do you find DOI, ISBN, security certificates impenetrable? I hope so 🙂
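To make the canonicalization point concrete, here is a minimal sketch in Python. It is a toy, not the InChI algorithm and not Daylight's: it assumes a tiny SMILES subset (one-letter atoms C, N, O, S, single bonds, branches in parentheses, no rings, no charges), and the function names `parse_smiles` and `canonical_key` are invented for this illustration. It shows how several valid strings for one molecule can be reduced to a single key that anyone can recompute.

```python
def parse_smiles(smiles):
    """Parse a toy SMILES subset into (atoms, bonds).

    Assumption: only one-letter atoms, implicit single bonds and
    parenthesised branches are supported. No rings, no charges.
    """
    atoms, bonds = [], {}
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)       # remember the branch point
        elif ch == ")":
            prev = stack.pop()       # return to the branch point
        elif ch in "CNOS":
            idx = len(atoms)
            atoms.append(ch)
            bonds[idx] = []
            if prev is not None:     # bond the new atom to the previous one
                bonds[prev].append(idx)
                bonds[idx].append(prev)
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def canonical_key(smiles):
    """Return one key per molecule, however the input string was written.

    The acyclic molecule is rooted at every atom in turn, each rooted
    tree is written out with its branches sorted, and the smallest
    string wins. This works for trees only.
    """
    atoms, bonds = parse_smiles(smiles)

    def encode(node, parent):
        children = sorted(encode(n, node) for n in bonds[node] if n != parent)
        return atoms[node] + "".join(f"({c})" for c in children)

    return min(encode(root, None) for root in range(len(atoms)))


# Three ways of writing ethanol give the same key; dimethyl ether differs.
print(canonical_key("CCO"), canonical_key("OCC"), canonical_key("C(O)C"))
print(canonical_key("COC"))
```

The key itself is not meant to be read by humans, which is the point made above about identifiers. What matters is that the procedure is public, so two groups running it independently get the same answer.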
  5. kinasepro, Dec 9th, 2006 at 7:03 pm
    InChI and CML may well be the future, and no-one will embrace it more than me, but SMILES is the present. For people working in the field not to understand that boggles the mind!

PMR: I’m not sure who “people working in the field” are. If it includes me, then I fully understand it. I am simply trying to bring the future to the present a bit quicker and a bit more predictably. 🙂

  1. I’ve experimented on this site a little with smiles. For instance a google search of the following string brings you here:
    O=C(C2=CN=C(NC3=NC(C)=NC(N4CCN(CCO)CC4)=C3)S2)NC1=C(C)C=CC=C1Cl
    Of note I’m not the only one with that string on the web! Maybe that’s an important compound? Sadly google indexed that page under my SRC tag rather than as a standalone page. Put that together with the fact that smiles strings are not substructure searchable via google and it’s clear to me that google is not ready to be a chemistry informatics platform. It’s sad really, because it doesn’t seem to me that it would be that difficult for them to make SMILES strings substructure searchable via the same algorithm the PDB, relibase, aldrich and everybody else is using.

PMR: This is a very important point and at the heart of the problem. Google works by indexing text. It’s good at it and can distinguish different roles for text and can look for substrings. This is a simple, powerful model. But at present it doesn’t index other objects (faces, maps, etc.). These are both harder and require specialist software. By contrast PDB, Relibase, Aldrich do index chemical structures. That means that they have to have specialist software running on their servers. Which means a business model. And that someone has to pay somewhere. PDB gets a grant, Relibase is commercial, Aldrich will see this as the basis for selling more compounds. All completely valid. But there is no business reason for Google to invest in chemistry-specific software – as I said, chemistry is too small for Google to bother with. It’s not helped by the fact that all the information is proprietary and that one of the major chemical information suppliers (CAS/ACS) sued Google. So unless you convince them differently – and I have gently tried – it won’t happen.
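A small sketch of what "indexing text" means for chemistry, in Python. The page names and contents are hypothetical, and `text_search` is only a stand-in for a text index, not a description of how Google works internally. Each page mentions ethanol, written as a different but equally valid SMILES string.

```python
# Hypothetical mini "web": three pages that each mention ethanol,
# written as three different but equally valid SMILES strings.
PAGES = {
    "page-a": "Ethanol, SMILES: CCO",
    "page-b": "Ethanol, SMILES: OCC",
    "page-c": "Ethanol, SMILES: C(O)C",
}


def text_search(pages, query):
    """Return the names of pages whose text contains the query string."""
    return sorted(name for name, text in pages.items() if query in text)


# Whole-molecule lookup: only the character-for-character match is found.
print(text_search(PAGES, "CCO"))

# "Substructure" lookup for a C-O fragment: all three molecules contain it,
# but a substring match only sees it where the characters happen to be adjacent.
print(text_search(PAGES, "CO"))
```

Both searches return a single page, although all three pages describe the same molecule. Closing that gap needs software that understands the string as a graph of atoms and bonds, which is the specialist software referred to above.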
So this is all about the introduction of new technology. The primary messages from the chemistry community are something like:

  • We’re happy with what we’ve got – it’s worked for the last 20 years and will go on doing so. Yes, for a little while.
  • When it’s necessary CambridgeSoft, Chemical Abstracts, Elsevier will develop a new technology and we’ll pay them to use it. Unfortunately I don’t see any movement from any of these to embrace the new Web metaphors. Biology, geoscience, etc. are working hard to develop the semantic web in their subjects – apart from a few of us no one in chemistry is.
  • Well, it’s a bit of a mess, but it’s not at the top of my priorities. I’ll come back in a few years.
There are movements in chemistry, particularly in three areas:
  • computational chemistry. We are having a visit of COST D37 (EU) to Cambridge tomorrow to create an interoperable infrastructure for computational chemistry. It will be based on communal agreements and use XML/CML as the infrastructure.
  • chemoinformatics. The Open Source community (e.g. Blue Obelisk) supports both current (legacy) formats (SMILES, Mol, etc.) and also CML/InChI. This can provide a smooth path towards the wider adoption of these newer approaches, including toolkits. The toolkits are free, which some see as a disadvantage, in which case you will have to convince the commercial suppliers to create them.
  • publishing. Commercial publishing is universally based on XML (and variants) so it is easy for them to include CML and related systems. I won’t give details but I’d be surprised if there weren’t major changes in the next 2-3 years here which I hope will answer some of the objections raised here.

There are also general major drivers elsewhere for the abandonment of legacy formats. They include the semantic web, RSS, institutional repositories, archival, etc. These efforts require interoperability and freely available tools – you can’t archive – say – a binary chemistry file and expect it to be readable in 5 years’ time. There are a lot of people to whom that matters.
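As an illustration of why a text-based markup archives better than a binary file, here is a minimal sketch in Python that reads a small CML-style fragment using only the standard library. The fragment is deliberately simplified (hydrogens omitted, nothing but atoms and bonds) and has not been validated against the CML schema, so treat the element and attribute names as illustrative. The function name `read_molecule` is invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used for the lookups below.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Simplified, unvalidated CML-style description of ethanol (heavy atoms only).
ETHANOL_CML = """\
<molecule xmlns="http://www.xml-cml.org/schema" id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""


def read_molecule(cml_text):
    """Return ({atom id: element symbol}, [(atom id, atom id), ...])."""
    root = ET.fromstring(cml_text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", CML_NS)
    }
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall("cml:bondArray/cml:bond", CML_NS)
    ]
    return atoms, bonds


atoms, bonds = read_molecule(ETHANOL_CML)
print(atoms)
print(bonds)
```

Nothing here depends on a chemistry toolkit or a vendor's program. Any generic XML parser, in any language, can recover the atoms and bonds, and that is what makes the file a reasonable thing to put in a repository.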
So I’m not telling anyone to do anything – I’m putting ideas, protocols and tools where they may wish to pick them up. If 5% of a community is enthusiastic that’s a good beginning. It worries me that the pharma industry has no concept of interoperability. But I’ve said that already.


5 Responses to Why bother with new technology?

  1. Chris says:

    Hi,
    Craig James of eMolecules has donated SMILES canonicalisation code to OpenBabel (http://sourceforge.net/mailarchive/message.php?msg_id=37211914), I’ve not compared it to Daylight but at least it is Open Source 🙂
    eMolecules aims to index the world of Chemistry and uses OpenBabel under the hood, this used to be Chmoogle (google for chemists) but due to legal action they decided to change the name 🙁
    I’m not sure the issue is new technology, but more the fact you need the whole widget before it is useful: primary data, database, indexing terms, query language etc. I suspect the authors and commenters on blogs are among the most computer literate and most willing to embrace new technology but until you have the whole widget it may be a struggle.

  2. pm286 says:

    (1) Thanks very much Chris…

    Craig James of eMolecules has donated SMILES canonicalisation code to OpenBabel (http://sourceforge.net/mailarchive/message.php?msg_id=37211914), I’ve not compared it to Daylight but at least it is Open Source 🙂

    Yes indeed – and of course I’m a keen supporter of Openbabel. I suspect a lot of people also depend a lot on Openbabel but – for some reason – don’t let us know. If a significant number of people said:
    “yes we are prepared to use Open Source”,
    “yes we use and like Openbabel”,
    “yes, we’ll use this algorithm for canonicalization AND PUBLICIZE IT”
    that would help a great deal, but at the moment they don’t. I think if Openbabel disappeared tomorrow there would be a lot of complaints – but I’m simply asking for people to be positive.

    eMolecules aims to index the world of Chemistry and uses OpenBabel under the hood, this used to be Chmoogle (google for chemists) but due to legal action they decided to change the name 🙁

    As I said, this is a third party solution. If they have a sustainable business model and convince enough people, good luck to them. But always remember that you will work with their offering – which could change, or fail to develop, or be removed. How many of you are able to use Chime under Firefox?

    I’m not sure the issue is new technology, but more the fact you need the whole widget before it is useful: primary data, database, indexing terms, query language etc. I suspect the authors and commenters on blogs are among the most computer literate and most willing to embrace new technology but until you have the whole widget it may be a struggle.

    You are right, of course, and we know only too well it is a struggle! However we continue to make steady progress and I think we are close to the point where things get their own momentum. We are reinforced by progress in neighbouring domains – yes, I get some funding for/from Digital Libraries and I’m proud of it.
    I’m open to any method of speeding up the progress. We have most of it built – we need a mixture of sponsor, early adopter community, and demonstrators that will continue to convince people. As long as we get 5-10% interest in a group we are happy – the Internet amplifies the ease of finding people who think the same way.

  3. Deepak says:

    It is often frustrating to me, and I am speaking in generalities, that scientists often seem the most averse to accepting change. Take for example Phil Bourne’s attempts to bring markup languages to protein structure. I think the debate seems to boil down to the following: human readable formats vs. machine readable formats. Markup languages are machine readable and not easily accessible to humans and scientists just seem to stay away from them. If there were readers/writers/generators that would make the format almost transparent to the user, wouldn’t the barrier to adoption go down significantly?

  4. pm286 says:

    (3) Thanks very much Deepak.
    Yes, you’re quite right. I think it’s a matter of time – it’s just how long? Over the last 20 years the world has got used to hierarchical file systems, hyperlinks, client server, etc. Markup and semantics have to come – I suspect it will be something like social computing that drives it.
    And a younger generation, perhaps. I look to the 18-year-olds

  5. Pingback: The (Chemical Information) World is Flat
