Text-mining at ERBI : Nothing is 100%; please comment

I was delighted to be asked to speak at a meeting of ERBI in Cambridge yesterday evening. ERBI is (roughly) a get-together of scientists and IT people in the dynamic biotech companies from the Cambridge region (“The Health, Wealth and Growth of Biotech in the [Eastern] Region”). Here was the program:
ERBI IT Special Interest Group Meeting – ‘Text Mining – Finding Buried Treasure’

17:00-17:35 – Richard Kidd, Head of Informatics, Royal Society of Chemistry
‘Prospecting chemical and biochemical literature’

The RSC’s Project Prospect, which was the first application of semantic web technologies to primary research publishing, won the 2007 ALPSP/Charlesworth Award for Publishing Innovation. We will discuss the problems with the conventional publication process which we tried to address, the development process, and successes and failures in applying new standards. We will look at the InChI and identifying chemical entities using text mining, using existing ontologies and building new ones, and their real-life application.
17:35-18:10 – Phil Hastings, Director Business Development, Linguamatics
Finding Answers from Text for Life Sciences

An overview of the application of Natural Language Processing (NLP) to text mining for researchers and information specialists, its potential impact and benefits. The presentation will include case studies from pharma/biotech. We will also include some insight into current challenges and potential opportunities for text mining in the future.
18:10-18:45 – Julie Barnes, Chief Scientific Officer, Biowisdom Ltd
‘A new information format for a new information age’

Julie will present opportunities to generate a new format for information, enabling us to better exploit the realms of historic literature and electronic information available. Case studies pertaining to drug safety will highlight the analytical power of assertional metadata for generating new insights for the purpose of pharmaceutical R&D.
18:45-19:20 – Peter Murray-Rust, Unilever Centre for Informatics, Cambridge University
‘The Chemical Semantic Web’

The semantic web is set to change the way we think about and use information. By providing explicit descriptions of concepts we make them accessible to machines and open up the possibility of simple reasoning. Chemical Markup Language (CML) can describe substances, reactions, molecules, solid state, spectroscopy and recipes. If chemistry is published in this way we can then use machines to do most of the tedious and error-prone work such as searching, transforming between formats, and integrating into documents. The presentation will include a number of interactive demonstrations.

PMR: The evening had a very nice logical flow. Although not consciously planned, each presenter was able to rely on the audience (half scientists, half IT people – most from SME biotechs or IT companies). So – as always at the start of a meeting – I didn’t know what I was going to say. Who was the audience? If they had all been managers, there would be no point in giving the geek-candy. If scientists, go easy on the XML. If IT people, not too many slides filled with -ases, -ins, -ates, -osides, MAPKKKs, ERBs, etc. (Richard showed a lovely slide incorporating gene names such as “sleepy, bashful”, etc. I guessed it was from Drosophila (this community has a history of “amusing” names for genes). I was wrong – it was zebra fish.) So the genomicists are competing for the Minister of silly names. Great fun, unless you are doing text mining. It’s difficult enough parsing “He filled the apparatus” (personal pronoun or element?). A good tactic used to be to throw away the common English words in a chunk of text – you can’t do that now or the zebra fish goes down the plughole.
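A minimal sketch of why throwing away common English words now backfires. The word list and the first sentence are made up for illustration (only “sleepy”, “bashful” and “He filled the apparatus” come from the talk), and this is not how OSCAR or any real text-mining tool works:

```python
# Hypothetical list of "common English words" to discard (illustration only).
COMMON_WORDS = {"the", "was", "in", "he", "a", "and", "sleepy", "bashful"}

# Gene names from Richard's slide; they also happen to be ordinary English words.
GENE_NAMES = {"sleepy", "bashful"}


def strip_common_words(text, common_words):
    """Return the tokens of `text` that are not common words (case-insensitive)."""
    tokens = text.replace(".", " ").replace(",", " ").split()
    return [t for t in tokens if t.lower() not in common_words]


# Invented sentence: both gene names are thrown away with the common words.
kept = strip_common_words("The sleepy mutant was bashful in the assay.", COMMON_WORDS)
lost_genes = sorted(g for g in GENE_NAMES if g not in {t.lower() for t in kept})

# "He" might be helium, but the filter treats it as a pronoun and drops it too.
kept_he = strip_common_words("He filled the apparatus", COMMON_WORDS)

print(kept, lost_genes, kept_he)
```

Both gene names and the possible element symbol vanish before any chemistry-aware or biology-aware step ever sees them.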
So I changed my theme to “Chemistry in Documents” and painted the picture of publishing completely semantic chemistry. I also made it clear that nothing in this area is 100% correct. We have to adapt to this idea. There is no “right structure” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.
So I set the audience a question. Here’s a chunk of text from a thesis we – or rather OSCAR-the-journal-eating-robot – is reading. There are no tricks – it’s exactly as is. I asked them to say how many chemical entities there were in this chunk. (Ideally we should ask for where they start and end, but I just asked for a show of hands for the total count.)
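One way to picture “nothing is 100%” is to attach a confidence to every candidate name and see how the number of entities depends on where the line is drawn. This is only a sketch: the candidate words and their scores are invented for illustration and are not OSCAR output.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    confidence: float  # made-up probability that this span names a chemical entity


# Hypothetical candidates and scores (assumptions, not real tool output).
candidates = [
    Candidate("benzene", 0.99),
    Candidate("lead", 0.50),       # metal, or the verb "to lead"?
    Candidate("He", 0.30),         # helium, or the pronoun?
    Candidate("apparatus", 0.02),
]


def count_entities(cands, threshold):
    """Count the candidates whose confidence is at least `threshold`."""
    return sum(1 for c in cands if c.confidence >= threshold)


strict = count_entities(candidates, 0.90)   # only near-certain names
lenient = count_entities(candidates, 0.25)  # also accept doubtful ones
print(strict, lenient)
```

The same text yields different counts depending on the threshold, so a single “right” number is only meaningful once the cut-off is stated.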

To a solution of crude bisynol 85 (1.05 g, assume 2.34 mmol) in dichloromethane (50 cm3) were added 4 Å molecular sieves (1.17 g), 4-methylmorpholine N-oxide (410 mg, 3.52 mmol) and TPAP (82 mg, 0.23 mmol). The reaction mixture was stirred at ambient temperature for 1 h. The crude reaction mixture was filtered through a plug of silica, washed with diethyl ether (100 cm3) and concentrated under reduced pressure. Gradient flash column chromatography (Petroleum ether:diethyl ether, 100:0 → 95:5) afforded 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one 86 (750 mg, 32% over two steps) as a yellow oil:

PMR: The audience gave answers varying between 4 and 11. To be fair, some were not scientists, and although they’d had an hour and a half of slides from the others they were not used to reading this sort of stuff.
So how many do YOU think there are? Just a number between 4 and 11, although you can add comments if you wish. This competition is not open to Peter Corbett, Colin Batchelor, their friends or colleagues.


12 Responses to Text-mining at ERBI : Nothing is 100%; please comment

  1. Martin Griffies says:

    9.
    I wish that I’d known about the event in advance: I’d have made the effort to attend.

  2. pm286 says:

    (1) I know. I should have blogged it earlier. Sorry. But I had a lot of other things… We could meet in a pub sometime

  3. pm286 says:

    [Quiz] Peter Corbett and I have discussed “the answer” but ideally I’d like some of you to submit comments with numbers in. You don’t have to give your name.

  4. justme says:

    Hmm….my offer is 9.

  5. pm286 says:

    (4) Thanks for getting the ball rolling…

  6. Mat Todd says:

    I get 15, assuming a tungsten-arsenic alloy.

  7. pm286 says:

    (6) No – this is not a lexical game (so “was” is not a chemical entity). These are real chemical entities. There are no tricks.

  8. 9, so excluding the sieves. Obviously, I hope to be wrong. Otherwise we just have more boring stuff 🙂

  9. pm286 says:

    (9) Many thanks.
    Text-mining IS boring. That’s why we try to leave it to robots.

  10. Mat Todd says:

    No tricks – nine. “Reaction mixture” is a chemical entity, in that I can see it, and it’s full of chemicals, but it’s vague.

  11. Chris Rusbridge says:

    My guess was 8, leaving out process things such as “solution”…

  12. Pingback: Unilever Centre for Molecular Informatics, Cambridge - Ramblings » Blog Archive » A challenge for Chemists and OOXML
