LBM2011 Singapore; a milestone in Text-mining and Natural Language Processing, OSCAR, OPSIN, ChemicalTagger

Yesterday I gave an invited plenary lecture at LBM 2011 (http://lbm2011.biopathway.org/), the Fourth International Symposium on Languages in Biology and Medicine, held at Nanyang Technological University, Singapore, on 14th and 15th December 2011.

The meeting was on Natural Language Processing – using computational techniques to “understand” science in publications and get machines to help us with the often boring and error-prone part of extracting detailed meaning. It’s an exciting field and progress is steady.

The meeting itself was great. Very high standard of talks. I understood most of them both in their intent and their methodology. And a really great atmosphere: about 30 people, relaxed and prepared to exchange ideas.

I realised the night before that this lecture represented a milestone in my NLP/TextMining career so I took considerable time to adjust it to the audience and the occasion. Normally in my lectures I don’t know what I am going to say (and choose from HTML slides). This time I wanted to pay tribute to all the people who have contributed so much over the years and have brought us to this milestone. So here’s the second slide of my talk:

OSCAR2: Joe Townsend, Fraser Norton, Sam Adams, Chris Waudby, RSC
OPSIN: Daniel Lowe
Sciborg/OSCAR3: Anne Copestake, Simone Teufel, Peter Corbett, RSC, IUCr, Nature PG
Ontologies: Nico Adams, Colin Batchelor (RSC)
OSCAR4: David Jessop, Egon Willighagen, BalaKrishna Kolluru, OMII/EPSRC, NaCTeM, JISC
ChemicalTagger: Lezan Hawizy
CML: Henry Rzepa
And the BlueObelisk (CDK, Bioclipse, OpenBabel, etc.)
http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch
http://www.jcheminf.com/series/semantic_mol_future
http://www.dspace.cam.ac.uk/handle/1810/238302 (DavidJ thesis)

(I’ve only included each person once). If there is anyone who is omitted let me know.

I’d like first to thank the people in the Centre I have had the chance to work with (including of course Jim Downing). They’ve been unusual in a good way in that they haven’t been obsessed with academic competition. They have worked as a team, creating joint products, and they have also put a high premium on creating things that are useful and work. That’s not so common in academia and this group has traded H-indexes for software and systems that are out there and being used. If only academia gave credit for that they would be stars. That time will come.

I’m proud to say that they’ve all joined high-tech UK companies which have to be part of our future. Making digital things that people want to buy. Thermodynamics owes more to the steam engine than the steam engine to thermodynamics. We should be learning from the companies that this group has gone into. I’m proud of the software engineering that Jim introduced and that the group adopted without mandates or coercion but simply because it was so evidently right.

That pride has gone into the three products, OPSIN, OSCAR4 and ChemicalTagger.

In a real sense we can draw a line under them. They work, they are “out there” and they are used. We don’t know how much they are used because people are so secretive. I’d guess that there are probably 20-100 installations of OSCAR. We get little feedback because the software works. (We’ve got no formal feedback from OSCAR2 and we know that it’s widely used).

And I’m going to be unusually boastful, because it’s for them, not me:

OPSIN, OSCAR and ChemicalTagger are the best in their class that we know about. There may be private confidential programs that we don’t know about, but hey! Because they are Open Source people don’t try to compete and duplicate the functionality. They re-use it. So OSCAR is used in Bioclipse, and used at EBI for their chemical databases and ontology. Open source doesn’t necessarily make a program functionally better per se but it allows other people to work on it. More testers, more bugs discovered, more progress.

Why can we draw a line?

Because essentially we have done what we needed to. We’ve built them as frameworks and we are confident that the frameworks will work for some time before they need refactoring (everything needs refactoring). So if you think OSCAR4 has less functionality than OSCAR3, that’s because it’s modular. There is no point in US writing web interfaces that you need to put on your server. Instead we have written an API that is so simple and powerful it’s 2 lines of code in its basic form. Easy to understand, easy to test, easy to install, easy to customise.

There’s lots more to do, but it doesn’t involve rewriting the programs. OSCAR is designed to be extended through APIs. If you want to use a new corpus there’s an API. A new dictionary/lexicon? An API. A new machine-learning algorithm? Yes, an API. It should be hours, not years to reconfigure. Here’s how to do it:

ChemicalEntityRecogniser myRecogniser = new PatternRecogniser();
Oscar oscar = new Oscar();
oscar.setRecogniser(myRecogniser);
oscar.setDictionaryRegistry(myDictionaryRegistry);
List<ResolvedNamedEntity> entities = oscar.findResolvableEntities(text);

Five lines of code (of course someone has to write that recogniser and the dictionary but then you can plug and play them). So if you want OSCAR to use Conditional Random Fields, find an Open Source library (there are lots) and bolt it in as a Recogniser.
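To make the plug-and-play idea concrete, here is a minimal, self-contained sketch of the pattern in plain Java. It deliberately does not use OSCAR’s real classes: Recogniser, SuffixRecogniser, DictionaryRecogniser and Pipeline are hypothetical names invented for illustration, and the suffix regex is a toy, not a serious chemical name recogniser. The point is only that the pipeline talks to an interface, so a different recogniser can be swapped in without touching the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PluggableRecogniserSketch {

    /** Hypothetical plug-in point; an assumption, not OSCAR's actual interface. */
    interface Recogniser {
        List<String> findEntities(String text);
    }

    /** Toy recogniser: flags words ending in a few common chemical suffixes. */
    static class SuffixRecogniser implements Recogniser {
        private static final Pattern SUFFIX =
                Pattern.compile("\\b\\w+(?:ane|ene|ol|one|ide)\\b");

        @Override
        public List<String> findEntities(String text) {
            List<String> found = new ArrayList<>();
            Matcher m = SUFFIX.matcher(text);
            while (m.find()) {
                found.add(m.group());
            }
            return found;
        }
    }

    /** Toy recogniser: reports tokens that appear in a supplied lexicon. */
    static class DictionaryRecogniser implements Recogniser {
        private final Set<String> lexicon;

        DictionaryRecogniser(Set<String> lexicon) {
            this.lexicon = lexicon;
        }

        @Override
        public List<String> findEntities(String text) {
            List<String> found = new ArrayList<>();
            for (String token : text.split("\\W+")) {
                if (lexicon.contains(token.toLowerCase())) {
                    found.add(token);
                }
            }
            return found;
        }
    }

    /** Hypothetical pipeline: depends only on the Recogniser interface. */
    static class Pipeline {
        private Recogniser recogniser = new SuffixRecogniser();

        void setRecogniser(Recogniser recogniser) {
            this.recogniser = recogniser;
        }

        List<String> findEntities(String text) {
            return recogniser.findEntities(text);
        }
    }

    public static void main(String[] args) {
        String text = "Dissolve aspirin in ethanol";
        Pipeline pipeline = new Pipeline();

        // Default recogniser (suffix-based).
        System.out.println(pipeline.findEntities(text)); // prints [ethanol]

        // Swap in a dictionary-based recogniser; the pipeline code is unchanged.
        Set<String> lexicon = new HashSet<>(Arrays.asList("aspirin", "water"));
        pipeline.setRecogniser(new DictionaryRecogniser(lexicon));
        System.out.println(pipeline.findEntities(text)); // prints [aspirin]
    }
}
```

A Conditional Random Fields recogniser would slot in the same way: wrap the library behind the interface and hand it to the pipeline.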

Yes, group, I am proud of you and yesterday was the day I said so publicly. I used your own (PowerPoint) slides!

So where does it leave us? What does the signpost point to?

I said yesterday that Language Processing research had two goals and I’ll prepend a third:

LP research develops new approaches to LP. Our main contribution here has been to add chemistry and we’ve covered most of what’s involved. There is no reason why the technology shouldn’t be extended to different human languages, different corpora. We’ve not made any great Chomskian-like breakthroughs in understanding language itself. We haven’t been able to compare chemical corpora because we don’t have a collection of Open ones.
LP uses discourse to give insights into the science itself. I’d hoped to do that in chemistry, but the universal refusal of chemistry publishers to provide Open corpora has meant that we have been restricted to patents. And patents, designed to conceal as well as to reveal, are not where new ideas in the fundamentals of chemistry come from. Contrast bioscience, where language is a primary tool for understanding the discipline.
LP as a tool provides new useful knowledge to the whole world. Here again we are stymied by the publishers. As this blog has shown, publishers are unwilling to make papers Openly available or to allow the extracted knowledge to be Open. At a conservative estimate publishers have held back LP by a decade.

I then divided chemical LP into two areas:

Chemical LP in chemical corpora. There’s nothing useful we can do here until the scientific literature is Opened. There are 10 million syntheses published a year, and even being pessimistic, PubCrawler and OSCAR could analyse 2 million of these (I think it’s higher). Richard Whitby in Dial-a-Molecule could use all this in his project for designing the molecules we will need in the future. But we are simply legally forbidden.

So I am giving up LP in chemical documents. There is no point. Some commercial companies will possibly use OSCAR or OSCAR-like tools to do a small bit of this, but necessarily inefficiently. We can forget the idea of a chemical Bingle. Chemistry remains stagnated.
Chemical LP in biological documents. Unlike the chemists, biologists really need and want LP/Textmining. They are also hampered by the restrictive practices, but they can probably work out the scope (so long as they don’t publish the extracted data; all our data are belong to the publishers). There are areas such as metabolism (where Peter Corbett had some great and easily implementable ideas) which would yield massive results. Metabolism has to do with why drugs work and why they don’t work. It matters. Lots of it is in the existing literature but technically and legally locked up, gathering dust.

So I am encouraging the bioscientists to use our software. I am happy to work with anyone; I am not slavishly tied to generating REF points. There is some valuable chemistry to discover by mining the bio-literature.

And I intend to go more into the patient-oriented literature and to use LP to help the scholarly poor. Because it may help them to become scholarly richer in spite of everything. And I picked up quite a lot about medical LP at the meeting so I’m fired up.
