OPSIN/OSCAR: you + us = we; please help

I’m exploring how you and we may be able to work together to improve OSCAR and OPSIN. Even if you aren’t interested in chemical names, you may find the general principles useful.
One of the drawbacks of full Open Source and Open Access is that you don’t always get feedback on whether what you are doing is appreciated. There is a small measure of downloads, accesses, and so on but you can only guess at the motivation and followup. When we go to meetings people (often in industry) say things like: “Oh we use OSCAR for identifying compounds and it’s very useful”, “We started using InChI and it’s saved us lots of money”, etc. But unless someone actually tells us what they are doing and what they want we don’t know and can only guess.
Sometimes we get letters of support, and some explore whether they can offer help. I may be able to expand later, but here are a number of positive things you can do:
(a) Simply tell us that you are using OPSIN and use some words that show us why. If possible we can publish this anonymously: “we regularly use OSCAR and have integrated it into our Y process”. That gives us motivation to continue in the dark hours of the night, to refactor, to document, etc.
(b) Contribute in-kind data. I’ll expand on “how” later, but the sorts of things we would like are:

  • names with connection tables (especially if not in Pubchem)
  • ontologies (e.g. similar to ChEBI)
  • tutorials
  • regular expressions for specific tasks (though Peter often has better approaches)
  • insight into document structuring (e.g. the variability and commonality of organization in theses)
  • corpora (especially annotated). These are very valuable but should not be approached casually
  • acronyms
  • numbering systems, both arbitrary and algorithmic
  • journal- and thesis-specific regular expressions

(c) Contribute in-kind code. You can, of course, do this anyway through Sourceforge, but it is best to coordinate. Major areas where OPSIN needs work are:

  • stereochemistry (even R- and S- cannot be parsed at present)
  • bridged ring systems (e.g. bicyclo[3.2.1]octane). (I wrote an algorithm for this at one stage but it’s not in OPSIN)
  • fused rings. (This can be fairly hairy, but partial lookup can help a lot)
  • specialised vocabularies (e.g. saccharides, nucleic acids).
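
To make the bridged-ring item concrete, here is a minimal sketch of turning a simple von Baeyer name such as bicyclo[3.2.1]octane into a connection table. It is not the algorithm mentioned above and it is not OPSIN code; the function name, the small stem table, and the restriction to saturated, unsubstituted bicyclo names are all assumptions made for illustration. It relies only on the standard numbering rule: number from one bridgehead around the main ring via the longest bridge, then number the main bridge starting from the atom nearest bridgehead 1.

```python
import re

# Hypothetical, deliberately tiny stem table; a real parser needs a full grammar.
STEMS = {"but": 4, "pent": 5, "hex": 6, "hept": 7, "oct": 8, "non": 9, "dec": 10}


def bicyclo_connection_table(name):
    """Return (atom_count, bonds) for a saturated 'bicyclo[x.y.z]alkane' name.

    Bonds are (low, high) pairs of 1-based atom numbers.
    """
    match = re.fullmatch(r"bicyclo\[(\d+)\.(\d+)\.(\d+)\]([a-z]+)ane", name)
    if match is None:
        raise ValueError(f"not a simple bicyclo name: {name!r}")
    x, y, z = (int(match.group(i)) for i in (1, 2, 3))
    atom_count = x + y + z + 2  # three bridges plus two bridgeheads
    if not (x >= y >= z and y >= 1):
        raise ValueError("bridge sizes must be in descending order, with y >= 1")
    if STEMS.get(match.group(4)) != atom_count:
        raise ValueError("stem does not match the bridge descriptor")

    ring_size = x + y + 2        # main ring: two longest bridges + bridgeheads
    second_bridgehead = x + 2    # reached after walking the longest bridge

    # Main ring: 1-2-...-ring_size, closed back to atom 1.
    bonds = [(i, i + 1) for i in range(1, ring_size)]
    bonds.append((1, ring_size))

    # Main bridge: atom 1, then any bridge atoms, then the second bridgehead.
    # With z == 0 this collapses to a direct bond between the bridgeheads.
    bridge = [1] + list(range(ring_size + 1, atom_count + 1)) + [second_bridgehead]
    bonds += [tuple(sorted(pair)) for pair in zip(bridge, bridge[1:])]
    return atom_count, sorted(bonds)
```

For bicyclo[3.2.1]octane this gives a seven-membered main ring with bridgeheads at atoms 1 and 5, and atom 8 bonded to both. Unsaturation, substituents, heteroatoms and polycyclic (tricyclo and higher) names are deliberately out of scope.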

(d) Support the community.
(e) Invite one of us to talk with your organization.
(f) Financial support. This can range from a summer student (a few thousand USD) to larger structured projects, perhaps involving other partners and initiatives (JISC, FP7, NSF, etc.). One of the enormous attractions of Openness is that you immediately get all the benefits of everyone else involved. (If, in our work with RSC, Nature and IUCr, we had insisted on all IP being held in silos we would never have had OSCAR; it’s that simple.)
SciBorg and OPSIN
It’s valuable to understand where OSCAR fits into the larger picture. In 2004 Ann Copestake and Simone Teufel (from the Computer Laboratory), together with Andy Parker (Cambridge eScience) and me, bid for and got an EPSRC grant which we called “SciBorg”. It includes the very real and positive collaboration of Nature, the International Union of Crystallography, and the Royal Society of Chemistry. The objectives are:

  1. To develop a natural-language oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.

  2. To use this language as a basis for robust and extensible extraction of information from scientific texts.

  3. To model scientific argumentation and citation purpose in order to support novel modes of information access.

  4. To demonstrate the applicability of this infrastructure in a real-world eScience environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.

The project is larger and more visionary than simply extracting chemical names from text. I like to use the phrase “machine-understanding of scientific literature”, though I may be pulled up for over-stressing “understanding”. The idea is to use a variety of techniques to understand the deep structure of the language at sentence, paragraph, and document level, and to go beyond the linguistic form to infer motivation: “why is this citation important?”, “is this paper challenging conventional views?”.
To do this we need to understand chemical language, and that is where OSCAR3 comes in. It works out the role of chemical words and phrases so that more powerful tools can interpret the larger context. So “methane” is a noun (CM) while “methyl” is an adjective (CJ) and “methylated” is a reaction/verb. (You might mention “methylated spirits”, but exceptions and ambiguity are all part of the fun of parsing human-generated language.) And OSCAR gives these decisions a probability (the “demethylates” in “P450 demethylates caffeine” is more likely to be a verb than the “methylated” in “methylated spirit”). But in the grander scheme of things SciBorg does not usually need to know what the actual compounds are in order to work out the deep structure of the language. (Obviously as we advance there is a chance to add validity and inference, so that “C-14 demethylation of lanosterol” is meaningful whereas “C-15…” is not, but we can’t do it all yet.)
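
As a toy illustration of what “working out the role” of a chemical word can mean, the sketch below guesses a role from a word’s suffix. It is not how OSCAR3 does it: the suffix table and the function name are invented for this example, it produces no probabilities, and it will happily mislabel ordinary English words (“control” ends in “ol”). The CM and CJ labels are the ones used above.

```python
# Hypothetical suffix table, checked in order; verb-like endings come first
# so that "methylated" is not caught by a noun or adjective rule.
SUFFIX_ROLES = [
    (("ated", "ates", "ating"), "reaction/verb"),
    (("yl",), "CJ"),                              # chemical adjective
    (("ane", "ene", "yne", "ine", "ol"), "CM"),   # chemical noun
]


def guess_role(token):
    """Return a crude role label for a single word, or None if no rule fires."""
    word = token.lower()
    for suffixes, role in SUFFIX_ROLES:
        if word.endswith(suffixes):  # str.endswith accepts a tuple of suffixes
            return role
    return None


if __name__ == "__main__":
    for word in ["methane", "methyl", "methylated", "demethylates", "spirits"]:
        print(word, guess_role(word))
```

A rule table like this cannot tell “methylated spirits” from a genuine methylation, which is exactly why the decision needs context and a probability rather than a fixed lookup.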
The point is that Peter is only partially working on OSCAR. Along the way he has developed a lot of clever tricks and this account does not do him justice. But we now need to take OSCAR/OPSIN forward on a broader front and this post addresses some of this.
OSCAR and OPSIN have become complex. Peter Corbett inherited bleeding-edge code and data which at that time had bits from all sorts of authors (Joe Townsend, Fraser Norton, Sam Adams, Chris Waudby, Richard Marsh, James Bell, Vanessa de Sousa (who wrote “Nessie”), Justin Davies, and PeterMR). It covered document structure, import of legacy, regular expressions, data checking, name2structure, lexicons and other bits and pieces.
The original OSCAR (sometimes OSCAR-1 or even OSCAR-2) was primarily based on regular expressions and data checking. OSCAR-3 continues to identify this part but does not support the OSCAR-2 GUI, nor do the regexes fit all journals and theses; they were aimed at RSC articles. So Justin and Richard created OSCAR-DATA, which consumes the data section output of OSCAR3 and then applies rules and presents it. We have a sustainability path for that. More later.
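
For readers who have not seen the “regular expressions and data checking” style, here is a minimal sketch of the idea applied to melting points in an experimental section. The pattern, the function name and the 10-degree width threshold are all assumptions chosen for illustration; none of them is taken from OSCAR, and a pattern tuned to one journal’s house style will miss others, which is the limitation described above.

```python
import re

# Illustrative pattern only: "m.p. 123-125 °C", "mp 140 °C", "M.p.: 98.5–99 °C".
MP_PATTERN = re.compile(
    r"\b[Mm]\.?\s?p\.?:?\s*(\d+(?:\.\d+)?)\s*(?:[-–]\s*(\d+(?:\.\d+)?))?\s*°\s?C"
)


def extract_melting_points(text):
    """Find melting points and attach simple sanity-check messages to each."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        # A single value is treated as a zero-width range.
        high = float(match.group(2)) if match.group(2) else low
        problems = []
        if high < low:
            problems.append("range is reversed")
        if high - low > 10:  # arbitrary threshold, an assumption of this sketch
            problems.append("range is unusually wide")
        results.append({"low": low, "high": high, "problems": problems})
    return results


if __name__ == "__main__":
    sample = "White solid, m.p. 123-125 °C. Second batch mp 140-132 °C."
    for record in extract_melting_points(sample):
        print(record)
```

The extraction step and the checking step are kept separate on purpose: the regex is journal-specific, while the checks apply to any source.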
Meanwhile OSCAR3 is being used for a wider range of applications than simply journal articles, in particular patents, internal documents and theses. So we have been able to get funding from Unilever for David and Lezan (who joins us on Monday) and from JISC for Alan and Diana. That gives us a critical mass in the more direct support for OSCAR3 functionality.
The first thing is refactoring. No-one likes refactoring, but it feels good when you’ve done it. Jim is designing the overall approach and he is able to do this in a way I can only marvel at. It uses design patterns and testing (of course), combined with lightweight services (REST, Web 2.0) and JFDI pragmatism. We’ll expose more of this later but some of the themes are likely to be:

  • use all the Open strategies (SVN, Eclipse, JUnit, Maven, etc.)
  • use open source components where possible (PDFBox, Lucene, JUMBO, CDK, etc.)
  • explore how to modularise them where necessary (e.g. CDK)
  • separate components and communicate through XML or RDF (e.g. OSCAR3 -> OSCAR-DATA can be almost completely decoupled)
  • devise a framework based on standard approaches (e.g. Eclipse)
  • actively design and highlight and explain extension points. This is essential if others are to add code and resources in parallel

So we have to do this for ourselves.
If the community can give us clear indications of what they can contribute then this will help us to identify the extensions.
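
The “separate components and communicate through XML” theme can be sketched as follows. This is only an illustration of the decoupling principle: the element and attribute names (dataSection, property, type, units) are invented here and are not the format that OSCAR3 or OSCAR-DATA actually exchange.

```python
import xml.etree.ElementTree as ET


def produce_data_section(records):
    """Producer side: serialise extracted values to an XML string."""
    root = ET.Element("dataSection")  # hypothetical element name
    for record in records:
        item = ET.SubElement(
            root, "property", {"type": record["type"], "units": record["units"]}
        )
        item.text = str(record["value"])
    return ET.tostring(root, encoding="unicode")


def consume_data_section(xml_text):
    """Consumer side: knows only the XML, not the producer's internals."""
    root = ET.fromstring(xml_text)
    return [
        {
            "type": element.get("type"),
            "units": element.get("units"),
            "value": float(element.text),
        }
        for element in root.findall("property")
    ]


if __name__ == "__main__":
    xml_text = produce_data_section(
        [{"type": "meltingPoint", "units": "celsius", "value": 123.0}]
    )
    print(xml_text)
    print(consume_data_section(xml_text))
```

Because the two functions share nothing but the document format, either side can be rewritten, or contributed by someone else, without touching the other. That is the practical meaning of an extension point.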

This entry was posted in chemistry, open issues, oscar, programming for scientists, XML.
