Open Data, Open Science. Closed Data...

(I have been fighting the blogging software - on several occasions it has published a blank post. So please excuse these bits of the "learning curve". I shall now write my posts in an editor and paste them. This is a repost in case you got a garbled one earlier.)

I am speaking at the ACS on Sunday on the general theme of eChemistry - the application of eScience - Grid - cyberinfrastructure to chemistry. Unfortunately that's fairly simple - outside the Blue Obelisk community (more of that later) and a very few early adopters there is very little. By eChemistry I mean more than simply compiling in-house data and running programs - I mean semantically enriched chemistry that machines can help to process. By contrast there are huge and exciting developments in bioscience, geoscience and many other fields. So I'll be asking why this is.

The single fundamental requirement in eScience is that there is shared data. Ideally this should be semantic, and that's a challenge, but at least it should be there and shared. In chemistry there is virtually none. What there is has almost all come from bioscience (e.g. NCI and PubChem) and some of the US government agencies. However mainstream chemistry is totally uninterested in sharing chemical data, and when it needs data it expects to have to pay private sector providers. As a result innovation in eChemistry and chemoinformatics is stifled - more of this in later posts.

This is exemplified by a question from John Irwin on the Indiana CHMINF-L list (I doubt it has been archived yet - when it is perhaps someone could add the link). John has compiled a wonderful list of compounds (ZINC) from a wide variety of sources such as chemical suppliers and made it available to PubChem - as a result of this and similar efforts PubChem has ca 5 million compounds (information, not physical samples). He quite reasonably asks whether we can do the same for chemical reactions. Read on...

At 02:14 07/09/2006, John J. Irwin wrote:

JOHN.. Dear CHMINF-L Gurus

PETER..This is a very exciting question, John.

PETER..We have been developing the technology to do this and many of the components are now available. It won't give you 100% recall or precision, but it could get a lot. However I suspect that if the technology is deployed we will have the lawyers after us immediately because it dares to actually read full-text papers automatically and that is not allowed, except by a few journals such as Molecules and Beilstein New Journal.

JOHN..I'd like to know whether a particular compound, or more generally, a particular ring system / scaffold, has ever been reported synthesized. I'd like to pass e.g. a SMARTS pattern and get back a package including the literature citation or patent identifier, and perhaps an XML structure containing reagents and reaction conditions. I'd like to do this millions, possibly billions of times to build a database of "been there, done that" scaffold space.
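As a rough illustration of the kind of "package" John describes, here is a minimal Python sketch that serialises one search hit as XML. It does not perform the SMARTS match itself, and it is not CML or any existing schema: the element names (hit, citation, reagents, conditions), the function build_hit and all the values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_hit(citation, reagents, conditions):
    """Serialise one hypothetical search hit as an XML string.

    The element names used here are invented for illustration only.
    """
    hit = ET.Element("hit")
    ET.SubElement(hit, "citation").text = citation

    # One <reagent> child per reagent name.
    reagent_list = ET.SubElement(hit, "reagents")
    for name in reagents:
        ET.SubElement(reagent_list, "reagent").text = name

    # Conditions are stored as named key/value pairs.
    condition_list = ET.SubElement(hit, "conditions")
    for key, value in conditions.items():
        ET.SubElement(condition_list, "condition", name=key).text = value

    return ET.tostring(hit, encoding="unicode")


# Placeholder values, not taken from any real paper or patent.
xml_text = build_hit(
    citation="placeholder citation",
    reagents=["reagent A", "reagent B"],
    conditions={"temperature": "25 C", "solvent": "placeholder solvent"},
)
print(xml_text)
```

A real service would return one such record per matching reaction, so that millions of queries could be aggregated by a script rather than read by hand.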

PETER..I shall be presenting the technology at ACS on Sunday - at least as much as I can get into 30 mins. It is early days, and some of the steps are not yet well developed.

JOHN..Is anything approximating this possible? In particular, can you direct me to how to script queries (e.g. SMARTS match on the product of a reaction) to CAS?

PETER..The ideal situation would be if all publishers put connection tables and reactions in full semantic form in their papers. This is technically possible - it's just a question of will and getting a new business model. Then you would simply set a robot to read all the molecules and reactions from every published paper and aggregate the content. The search technology is widely available.

PETER..There are two tiny difficulties.
* the journals do not encode structures and reactions in a meaningful machine-interpretable form. There are some slight signs they are interested in doing so - if so we have all the technology (when I say "we" I mean the Blue Obelisk group in general)
* you aren't allowed to spider the full text. Probably.

PETER..What we have done - and what I shall be reporting - is to spider all the published crystal structures from journals that allow this. We haven't spidered the ACS because they stamp copyright on the factual data deposited as supplemental and no-one except me and Henry has challenged this. Personally I regard this as illegal and certainly unacceptable but while communities like CHMINF-L accept this there is not a lot that 2 individuals can do other than make a fuss. But the crystal structures deposited with the RSC and Acta Cryst are freely extractable and we now have ca 50,000. Moreover we have done exactly what you want and extracted all the fragments from them. This means that perhaps 100,000 chemical fragments are browsable without additional software (obviously we use InChI). So IF we had the structures from synthetic papers the problem would be solved.
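One reason an identifier like InChI makes fragments browsable without special software is that it is a plain text string, so exact lookup reduces to string matching. A minimal Python sketch of that idea follows. It is not the actual system described above: the FragmentIndex class and the source identifiers such as "structure-0001" are hypothetical, and the InChI strings for benzene and water are used only as familiar examples.

```python
from collections import defaultdict


class FragmentIndex:
    """Map an InChI string to the set of sources it was found in."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add(self, inchi, source_id):
        # The InChI string itself is the key; no chemistry toolkit is needed.
        self._sources[inchi].add(source_id)

    def lookup(self, inchi):
        # .get avoids creating an empty entry for unknown fragments.
        return sorted(self._sources.get(inchi, ()))

    def __len__(self):
        return len(self._sources)


BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
WATER = "InChI=1S/H2O/h1H2"

index = FragmentIndex()
index.add(BENZENE, "structure-0001")  # hypothetical source identifiers
index.add(BENZENE, "structure-0002")
index.add(WATER, "structure-0002")

print(index.lookup(BENZENE))
```

This only covers exact matches on a whole fragment; substructure search would still need a chemistry toolkit.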

PETER..Note, of course, that this does not just give comprehensive coverage of the modern literature, it gives immediate comprehensive coverage of the modern literature. Our robots can report a new structure within 5 minutes of it being published.

PETER..The holy grail is semantic chemical publishing - what can we do before then? We have to use full-text. Unfortunately there is no Open software that can interpret chemical diagrams. I think it would be great to have some - it's not trivial and you won't get 100%. And even then the problem doesn't finish as it can be very difficult to link the graphical schemes to the compounds - e.g. which numbers in a scheme relate to the compound identifier rather than an atom label, quantifier, etc. Graphical reaction schemes and Markush structures in current chemical publishing are often very effective chemical obfuscation tools. I hope to be able to show some small steps to de-obfuscation.

PETER..So before the grail arrives we have to get the structures out somehow, without a connection table. Peter Corbett is addressing this here through OSCAR, which translates names to structure. It runs at over 50% and could. OSCAR will read a paper and where possible create a complete connection table. Obviously it's only as good as the authors' naming and when they have got that wrong - and it happens - OSCAR will get the wrong structure. But it is a step forward.

PETER..Reactions are more tricky. This is because chemists write in unnecessarily convoluted language: "To a solution of X was added 3 g of Y", which is equivalent to "To my dog was donated a bone by me" (instead of "I gave my dog a bone", which is the sensible way). If we wrote "I added 3 g of Y to X" current grammars could parse it, but this absurd mandation of the passive makes it a lot harder and we have to write a passive chemical grammar. But when we have cracked it, we should be able to extract reactions from full-text.
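To make the point about the passive concrete, here is a minimal Python sketch of what a single rule of such a passive chemical grammar might look like. It is not OSCAR's grammar or any existing parser: the pattern PASSIVE_ADD, the function names and the list of units are assumptions made up for illustration, and the rule only handles the one sentence shape quoted above.

```python
import re

# Hypothetical rule for one passive construction only:
#   "To a solution of X was added 3 g of Y"
PASSIVE_ADD = re.compile(
    r"^To (?:a |the )?(?:solution of )?(?P<target>.+?) "
    r"(?:was|were) added "
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|kg|mL|L|mmol|mol) of "
    r"(?P<reagent>.+?)\.?$"
)


def parse_passive_addition(sentence):
    """Return a dict describing the addition step, or None if no match."""
    match = PASSIVE_ADD.match(sentence.strip())
    if match is None:
        return None
    return {
        "action": "add",
        "reagent": match.group("reagent"),
        "amount": float(match.group("amount")),
        "unit": match.group("unit"),
        "target": match.group("target"),
    }


def to_active(step):
    """Rewrite a parsed step in the simpler active form."""
    return "Add {amount:g} {unit} of {reagent} to {target}".format(**step)


step = parse_passive_addition("To a solution of X was added 3 g of Y")
print(step)
print(to_active(step))
```

A single regular expression like this breaks down quickly on real experimental sections (nested clauses, several reagents, temperatures and times), which is why a proper grammar is needed.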

JOHN..Thanks, and sorry if this question seems naive.

PETER..It's a perfectly sensible question and very exciting, but be prepared for disinterest and opposition from most of the community. We've been collaborating with Indiana on the use of a distributed OSCAR system and there are lots of areas where other people could help as long as they don't mind working with Open Source.

JOHN..John Irwin
UCSF Pharmaceutical Chemistry

PETER..If you are at SF we must meet. Apart from my talk on Sunday I'm pretty free other than a Blue Obelisk beer evening meeting on Tuesday - I am sure you'd be very welcome.

Note: SPARC set up a list on Open Data for which I am the moderator. Technical difficulties meant I haven't been able to do much there. The business of integrating the technical moderation into my email system was just too complicated for me. Maybe this is a good time to rekindle my involvement.


5 Responses to Open Data, Open Science. Closed Data...

  1. Bill Hooker says:

    If you and Jean-Claude Bradley don't already know each other, you should meet.

  2. Peter,

    do you know this blog article?

    Joerg

  3. Peter,

    previous comment post was fishy, I need to get familiar with your blog;-)

    do you know this blog article?
    Good News and Bad News for Open Access Publishing

    Joerg

  4. Egon says:

    Just a general comment: please put in more a-href's whenever you can. Especially when discussing articles: put in a link to the article, possibly using the DOI, e.g. http://dx.doi.org/10.1021/ci060138m.
    This makes the reading more pleasant, and actually increases the semantics. At least for humans. Use microformats and you increase the semantics even more.

  5. Bill,
    I'll be at the ACS next week so hopefully I'll get to meet Peter and many of the people posting here.
