What is OSCAR4 and why we created it


On Wednesday we are launching OSCAR4 (/pmr/2011/04/08/oscar4-launch/). OSCAR4 has involved a very large amount of work (“refactoring”), which has resulted in some changes to the surface functionality and a huge change to the architecture.

What does that mean? Essentially it means that we (the extended group of Egon, Daniel, David, Lezan, Sam, Bala and others, with not much PMR) have almost completely ripped out the guts of OSCAR3 and replaced them with a series of modules that are engineered to work reliably and interoperate. Here are two steam engines of roughly the same format. The first (Heath Robinson, from Wikipedia) is made of sealing wax and string, glue, wood and a prayer,

while the second (http://uk.wikipedia.org/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Lego_Creator_4837_-_Mini_Trains.jpg) is made from reusable components. That’s a major difference between OSCAR3 and OSCAR4: we now have something that can be extended and can interoperate. (That is not to belittle the efforts of the authors of OSCAR1, 2 and 3, who have built excellent software that is useful and widely used.) But every piece of software tends to become bloated, and refactoring is an essential part of software engineering. The world changes and expectations change. Fred Brooks says (http://www.softwarequotes.com/printableshowquotes.aspx?id=556): “Plan to throw one away; you will anyhow.” So it is time for OSCAR4.

What’s the difference? OSCAR4 consists of the “core” of OSCAR3, which is the main language engine. We’ve removed the following from the core:

  • Chemical substructure and similarity search. Structure isn’t fundamental to the language processing (unlike OPSIN, where chemical structure matters). So searching for entities can be done through decoupled services or other libraries.
  • Scrapbook. A place where people can keep the structures in their documents. Again we decouple this: we could, for example, now use Chem#.
  • Lookup from PubMed. Again this can be decoupled.
  • Annotation. This is useful for training models but doesn’t need to be part of the main libraries.

Everything else has been kept in some form. We’ve also added:

  • Configurable Lexicons (Dictionaries, …). This allows anyone to add their own names and structures.
  • Configurable workflow (perhaps the most powerful refactoring). This means that you can swap in your own Tokenizer, Hyphenator, Machine-learning model, Dictionaries, and ontologies (name – identifier pairs). It makes OSCAR compatible with tools such as UIMA.
  • ChemicalTagger (the chemical phrase analyser from Lezan Hawizy). Although this isn’t formally part of the core it’s very likely to be used in conjunction with OSCAR4. This combination is a very powerful chemical language analyser.
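
To give a flavour of what “configurable workflow” means, here is a minimal sketch in plain Java. It does not use the OSCAR4 API: the names `WorkflowSketch`, `Tokenizer`, `Lexicon`, `Pipeline`, `WHITESPACE` and `HYPHEN_AWARE` are all invented for illustration, and the identifiers `EX:1` and `EX:2` are made up. The only point is that the pipeline receives its tokenizer and its lexicon from the caller, so either can be swapped without touching the rest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these types come from OSCAR4.
public class WorkflowSketch {

    /** Hypothetical plug-in point: splits raw text into tokens. */
    interface Tokenizer {
        List<String> tokenize(String text);
    }

    /** Hypothetical lexicon: name to identifier pairs, matched case-insensitively. */
    static class Lexicon {
        private final Map<String, String> entries = new LinkedHashMap<>();

        Lexicon add(String name, String identifier) {
            entries.put(name.toLowerCase(), identifier);
            return this;
        }

        /** Returns the identifier for a token, or null if the token is unknown. */
        String lookup(String token) {
            return entries.get(token.toLowerCase());
        }
    }

    /** A tiny pipeline whose tokenizer and lexicon are supplied by the caller. */
    static class Pipeline {
        private final Tokenizer tokenizer;
        private final Lexicon lexicon;

        Pipeline(Tokenizer tokenizer, Lexicon lexicon) {
            this.tokenizer = tokenizer;
            this.lexicon = lexicon;
        }

        /** Returns "token -> identifier" for every token found in the lexicon. */
        List<String> findEntities(String text) {
            List<String> hits = new ArrayList<>();
            for (String token : tokenizer.tokenize(text)) {
                String id = lexicon.lookup(token);
                if (id != null) {
                    hits.add(token + " -> " + id);
                }
            }
            return hits;
        }
    }

    /** Splits on whitespace and simple punctuation, but keeps hyphenated words whole. */
    static final Tokenizer WHITESPACE =
            text -> Arrays.asList(text.trim().split("[\\s.,;]+"));

    /** Same as WHITESPACE, but also breaks words at hyphens. */
    static final Tokenizer HYPHEN_AWARE =
            text -> Arrays.asList(text.trim().split("[\\s.,;-]+"));

    public static void main(String[] args) {
        Lexicon lexicon = new Lexicon()
                .add("benzene", "EX:1")   // made-up identifier
                .add("ethanol", "EX:2");  // made-up identifier

        String text = "a benzene-ethanol mixture";

        // Same lexicon, two different tokenizers: only the second splits the hyphenated pair.
        System.out.println(new Pipeline(WHITESPACE, lexicon).findEntities(text));
        System.out.println(new Pipeline(HYPHEN_AWARE, lexicon).findEntities(text));
    }
}
```

With the whitespace tokenizer the hyphenated phrase stays one token and nothing is found; with the hyphen-aware one both names are found. Real chemical tokenization is far subtler than this (hyphens inside systematic names must not be split), which is exactly why being able to swap the component matters.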

On Wednesday we’ll take you through this (hopefully including those online). It’s important to realise that OSCAR4 is a library of components, not an application. (It’s easy to build applications, of course). So OSCAR is not a web server (but can be bolted into one). It’s not a mobile app but should be capable of being included in one. Etc. Think gearboxes and axles, not cars. The scientist of the future will build their applications from components, just as they build their glassware from ground-glass components. OSCAR4 is designed as a tool that can be included in any application, whether Open or commercial (it’s Open Source).

This means understanding how to bolt things together. It’s not hard, any more than building trains from Meccano™. We’ll give some simple examples of how to process a document, how ChemicalTagger works, how one might create a server. We’ll show how to use Java, understand the docs, etc. And how to extend and modify OSCAR4 without fear of breaking it.




