Open Data Means Better Science

I am delighted that BioMed Central – an Open Access publisher (i.e. almost all its content is required to be fully Open Access – CC-BY) – has adopted and promoted the idea of Open Data. They have done me the honour, twice, of asking me to present their Open Data award (to the authors who provide the most impressive use of Open Data). A little while ago they asked for simple phrases to go with the buttons and banners on the site, the community made suggestions – and they chose this one.

Simply – if your data are Open then the science is better. Why?

  • Others can check your work
  • Others can more easily evaluate your ideas and methodology. Science is about communication, and Open Data helps scientific communication.
  • Others can re-use your data. You cannot guess what others may see in your “boring” data.

So this banner appears in the ad space of BMC. And the Open Data button appears on many papers (i.e. the ones that have data specifically attached).

The banner links back to the Panton Principles (http://pantonprinciples.org/), which were formulated by four of us and finally launched from the Panton Arms pub in Cambridge.

 

What’s Open Data?

http://www.biomedcentral.com/info/about/openaccess/#opendata

By open data we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. We encourage the use of fully open formats wherever possible.

Anyone is free:

  • to copy, distribute, and display the work;
  • to make derivative works;
  • to make commercial use of the work;

Under the following conditions: Attribution

  • the original author must be given credit;
  • for any reuse or distribution, it must be made clear to others what the license terms of this work are;
  • any of these conditions can be waived if the author gives permission.

Statutory fair use and other rights are in no way affected by the above.

Simple. All you have to do is add the Open Data button and the licence to your data.

Open Data is coming and will be increasingly mandatory. Funders require grantees to have data management policies. Here’s an example of the steady progress (it’s about Open Access rather than data, but the two march together).

 

EPSRC joins other major funding agencies with new open access policy

Earlier this month, the Engineering and Physical Sciences Research Council (EPSRC) announced its ‘Policy on Access to Research Outputs’, stating that all EPSRC-funded research must be published as open access (OA) documents. From September 1st 2011, all research must be published as either ‘Gold’ or ‘Green’ OA with the decision resting with the author.

EPSRC joins numerous major funding agencies, such as the Wellcome Trust, NIH and NSF, in adopting an OA mandate. This policy has been adopted “in recognition of the need for increased availability and accessibility of publicly funded research findings.” Further information is available on their website.

This is an example of a meme that is spreading virally. Publishers, funders and authors will increasingly see the OPEN DATA button. If they don’t know what it is, they’ll click and find out. Then they’ll think. And every time they see it they’ll think. And, sooner or later, they will realise that Open Data is the only way forward for data.

 


Open Bibliographic Workshop at #OKCon2011

We’ve just run our workshop on Open Bibliography at OKCon (Open Bibliographic Data Workshop by Peter Murray-Rust, Mark McGillivray & Adrian Pohl).

Mark McGillivray has written a great account of the Open Bibliography project, what we have achieved, what the tools are, what the content is, etc at http://openbiblio.net/2011/06/30/final-product-post-open-bibliography/. Mark has also kept an Etherpad of the workshop (I’ll post the URL later).

[The breadth of interest at OKCon is enormous and here, for example, is a slice of the programme (http://okcon.org/2011/programme): science, government, technology, legal, arts, culture, etc. So many things are going on that I haven’t been able to attend many of the star speakers.]

Anyway… our workshop was 90 minutes (a good allocation) and well attended – about 25+ people, all interested in making bibliography Open. There’s general agreement on what bibliographic (meta)data is and why it should be Open. We collected ideas on what the benefits will be and how to make it happen.

So we are pulling together a strategy on how to engage with the various stakeholders and move as quickly as possible to showing that Open Bibliography is possible and valuable. Meanwhile, some snippets from Mark’s post:

Open bibliographic datasets

  • Cambridge University Library – This dataset consists of MARC 21 output in a single file, comprising around 180000 records. More info… (get the data)
  • British Library – The British National Bibliography contains about 3 million records – covering every book published in the UK since 1950. More info… (get the data; query the data)
  • International Union of Crystallography – Crystallographic research journal publications metadata from Acta Cryst E. More info… (get the data; query the data; view the data)
  • PubMed Central – The PMC Medline dataset contains about 19 million records, representing roughly 98% of PMC publications. More info… (get the data; view the data)

 

Products demonstrating the value of Open Bibliography

OpenBiblio / Bibliographica

Bibliographica is an open catalogue of books with integrated bibliography tools for example to allow you to create your own collections and work with Wikipedia. Search our instance to find metadata about anything in the British National Bibliography. More information is available about the collections tool and the Wikipedia tool.

It is possible to create a living map of scholarship, and we show three examples carried out with our bibliographic datasets.

This is a geo-temporal bibliography from the full Medline dataset. Bibliographic records have been extracted by year and geo-spatial co-ordinates located on a grid. The frequency of publications in each grid square is represented by vertical bars. (Note: Only a proportion of the entries in the full dataset have been used and readers should not draw serious conclusions from this prototype). (A demonstration screencast is available at http://vimeo.com/benosteen/medline; the full interactive resource is accessible with Firefox 4 or Google Chrome, at http://benosteen.com/globe.)

 

This example shows a citation map of papers recursively referencing Wakefield’s paper on the adverse effects of MMR vaccination. A full analysis requires not just the act of citation but the sentiment, and initial inspection shows that the immediate papers had a negative sentiment i.e. were critical of the paper. Wakefield’s paper was eventually withdrawn but the other papers in the map still exist. It should be noted that recursive citation can often build a false sense of value for a distantly-cited object.

This is a geo-temporal bibliographic map for crystallography. The IUCr’s Open Access articles are an excellent resource as their bibliography is well-defined and the authors and affiliations well-identified. The records are plotted here on an interactive map where a slider determines the current timeslice and plots each week’s publications on a map of the world. Each publication is linked back to the original article. (The full interactive resource is available at http://benosteen.com/timemap/index.)

These visualisations show independent publications, but when the semantic facets on the data have been extracted it will be straightforward to aggregate by region, by date and to create linkages between locations.

 

Benefits of Open Bibliography products

Anyone with a vested interest in research and publication can benefit from these open data and open software products – academic researchers from students through to professors, as well as academic administrators and software developers, are better served by having open access to the metadata that helps describe and map the environments in which they operate. The key reasons and use cases which motivate our commitment to open bibliography are:

  1. Access to Information. Open Bibliography empowers and encourages individuals and organisations of various sizes to contribute, edit, improve, link to and enhance the value of public domain bibliographic records.
  2. Error detection and correction. A community supporting the practice of Open Bibliography will rapidly add means of checking and validating the quality of open bibliographic data.
  3. Publication of small bibliographic datasets. It is common for individuals, departments and organisations to provide definitive lists of bibliographic records.
  4. Merging bibliographic collections. With open data, we can enable referencing and linking of records between collections.
  5. A bibliographic node in the Linked Open Data cloud. Communities can add their own linked and annotated bibliographic material to an open LOD cloud.
  6. Collaboration with other bibliographic organisations. Reference manager and identifier systems such as Zotero, Mendeley, CrossRef, and academic libraries and library organisations.
  7. Mapping scholarly research and activity. Open Bibliography can provide definitive records against which publication assessments can be collated, and by which collaborations can be identified.
  8. An Open catalogue of Open scholarship. Since the bibliographic record for an article is Open, it can be annotated to show the Openness of the article itself, thus bibliographic data can be openly enhanced to show to what extent a paper is open and freely available.
  9. Cataloguing diverse materials related to bibliographic records. We see the opportunity to list databases, websites, review articles and other information which the community may find valuable, and to associate such lists with open bibliographic records.
  10. Use and development of machine learning methods for bibliographic data processing. Widespread availability of open bibliographic data in machine-readable formats should rapidly promote the use and development of machine-learning algorithms.
  11. Promotion of community information services. Widespread availability of open bibliographic web services will make it easier for those interested in promoting the development of scientific communities to develop and maintain subject-specific community information.

And more…


Software and demos for Cheminformatics Open Source meeting at EBI

#wwmm #quixotechem #blueobelisk

I do not present software through Powerpoint but through living links and demos. These all worked when they were entered here. All demos decay…

 

We (http://wwmm.ch.cam.ac.uk) have built a wide-ranging series of components to support Open chemistry. They are mainly based on Chemical Markup Language (CML).

  • Recent presentation from Sam Adams at ACS covers much of the ground (especially Chempound)

http://www-pmr.ch.cam.ac.uk/mediawiki/images/9/94/Sam-Adams-CLARION-ACS-April2011.pdf

 

Multicomponent demos include:

  • Searching a Chempound repository for a Chem4Word document and using OPSIN to translate name2structure
  • Running Lensfield on the command line to parse a directory of compchem output into CML

All demos and software are Open (Artistic, GPL, CC-BY, PDDL, CC0).

 

The components are based on the Chemical Markup Language (CML) infrastructure and include:

OSCAR4, a modular system for textmining;

Chem4Word

OPSIN, a name2structure converter (Daniel Lowe et al.);

ChemicalTagger, a natural language system for chemistry;

JUMBOConverters which process collections of legacy material (including computational logfiles) into semantic form;

Chempound, a semantic RDF repository for any chemistry;

Crystaleye, an automatic aggregator of crystal structures and publications;

Lensfield, a make facility for data.

Metaprint, an Open source tool for predicting sites of metabolism

JNI-InChI (Sam Adams)

See also:

Much of this is being formally written up for a special issue of J. Cheminformatics (BMC).


Cheminformatics – presentation at EBI – why we must espouse Openness

#quixotechem

#blueobelisk

I’ve been invited to talk to a group of cheminformaticians – mainly pharma – at the European Bioinformatics Institute today. The topic of the 3-day meeting is “Open Source”.

The simplistic view of Cheminformatics is:

  • Discover data. This is extremely difficult and its quality is highly variable
  • Extract “features” and properties
  • Extract molecular components and calculate “features”
  • Develop a machine-learning model
  • Analyze the output of the model
  • Possibly, though rarely, develop a human-understandable hypothesis

The components in this process are normally all CLOSED. That leads to non-reproducibility, poor or non-existent hypotheses, sloppiness (including inadvertent data selection) and fraud.

Only Open Data, Open Specs and Open Source can challenge this.

This is an exciting opportunity to promote the value of Open Source and I think that I’ll be talking to the – at least partially – converted. There seems to be a realisation in Pharma that the closed approaches of the past are not very successful and that we need to complement (not replace) them with Open approaches. This realisation is shown most prominently by the Open Drug discovery projects – GSK has donated a large chunk of data into the Open domain, and Mat Todd is spearheading a very exciting Open project on developing new anti-malarials. But these are few.

I’m going to argue that Open approaches are beneficial on a purely utilitarian basis. There are also ethical, moral and legal reasons why we should use Openness – for example when the work is publicly funded – but I think the utilitarian case alone is compelling.

I have the privilege of kicking off so I’ll try to cover at least:

  • The current state of cheminformatics and the role of Openness. I’ll argue that we now have at least one exemplar of every necessary component in software, but that we are shackled by restrictive practices in creating and disseminating data and metadata. I’ll also note (with sadness) the almost zero-effort put into any metadata and ontologies compared with bioinformatics and again the restrictive practices of those who should be trying to help the community.
  • A vision of a semantic framework for chemistry that would enormously enhance development time, quality, validation and innovation
  • Demonstrations of some of our own components
  • Suggestions of ways forward that would allow the pharma industry to support Open Data, Open Specifications and Open Source (the Blue Obelisk’s ODOSOS).

The current “market” in cheminformatics (if we include computational chemistry, databases, literature resources etc.) runs to over USD 1 billion (I’d be glad of a better estimate). Much of this is paid by the pharma industry – a smaller (but increasingly painful) amount by academia. Unlike bioinformatics, very little of this feeds back into a better, more innovative infrastructure. Indeed much of the current product development is in integration and widget frosting rather than fundamental design. These are important, but not at the cost of a stagnant research effort.

The problem – as often – is that the economics are broken. We do not measure the opportunity costs, the cost of broken information. An audit of cheminformatics and chemical information would show that it is grossly inefficient when measured against the public and private goods it produces.

We need a better business model and I hope that we can explore that. I don’t have a magic bullet, but I shall avoid the trap of taking my Open output into yet-another closed system. Free-as-in-beer (gratis) usually goes down ratholes of licences, restrictions and badly engineered solutions. Free-as-in-speech (libre) is required.

The bioscientists have it in information. Chemistry does not. But with a 1 billion dollar market we should be able to change that.

If we have the will …

Here is my abstract, anyway. It’s a closed meeting, so I don’t know how much I can report and whether we use Chatham House rule.

 

Open Source Chemoinformatics

 

Peter Murray-Rust1

 

1Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK

 

In many disciplines it is routine to require both data and software to be available for reviewers or readers for Open validation and re-use. Chemical software, and the representation of chemical data, are based on well-established principles and most of the common algorithms are completely understood and published. The implementation of chemistry as Open Source is therefore possible and has many advantages:

  • The source code is always available and the algorithms or defects are transparent
  • The assumptions in running software (e.g. parameterization) are also clear
  • It is possible to modularize the computation so that different algorithms, data and strategies can be varied with minimal effort.
  • Publishing and validating the results becomes easier.
  • Re-use of previous work, including program outputs and knowledgebases is possible.

The chemistry Open Source community, epitomized by the Blue Obelisk (http://www.blueobelisk.org), aims at creating re-usable, interoperable components of high, tested quality which are developable by the community. In general there are Open Source components for almost all of the widely used algorithms and processes.

 

Data are equally important, though harder to acquire even when published conventionally as they are obscured by copyright or monopolies which restrict access and use. It is possible to extract much information from conventional publications using machines. We can also build aggregator and publisher systems for data encouraged by funder requirements.

 

We (http://wwmm.ch.cam.ac.uk) have built a series of components to support Open chemistry. They are based on the Chemical Markup Language (CML) infrastructure and include: OSCAR4, a modular system for textmining; OPSIN, a name2structure converter; JUMBOConverters which process collections of legacy material (including computational logfiles) into semantic form; Chempound, a semantic RDF repository for any chemistry; Crystaleye, an automatic aggregator of crystal structures and publications; Lensfield, a make facility for data. All interoperate with Blue Obelisk software.

 

I shall also review virtual communities (e.g. Quixote, http://quixote.wikispot.org/Front_Page for computational chem) and the social principles for successful modern Open Source and Data.


 


What are the formal restrictions on text-mining?

#oscar4 #okfn #pantonpapers

A little while ago I suggested that we create whitepapers (“Panton Papers”, /pmr/2010/07/24/open-data-the-concept-of-panton-papers/ ) to help our development of open science. We’ve come up with some titles and I’ve drafted one on text-mining /pmr/2011/03/28/draft-panton-paper-on-textmining/ . There’s now a useful response from Todd Vision on the Open-science discussion list (http://lists.okfn.org/pipermail/open-science/2011-April/000698.html )

Peter’s draft whitepaper on text-mining is badly needed and nicely put. I was particularly interested in this passage:

 

“The provision of journal articles is controlled not only by copyright but also (for most scientists) the contracts signed by the institution. These contracts are usually not public. We believe (from anecdotal evidence) that there are clauses forbidding the use of systematic machine crawling of articles, even for legitimate scientific purposes.”

 

We have also heard tell of the existence of such clauses, but also have not been able to secure first-hand evidence for them. It would be very nice to promote this from “anecdotal” to “documented”, and I would like here to put out a wider plea for anyone who might be able to provide the language of these contractual re[s]trictions. Alternatively, I would welcome suggestions for how we are to know what exactly we are prohibited from doing in light of the confidential nature of the contracts.

 

If copyright holders really wish to enforce such restrictions, it seems odd that their very existence is little more than a rumor. Can secret restrictions be legally enforced?

 

Todd

 

So I’d very much like to have authoritative evidence in this area. If anyone has first-hand evidence of the FORMAL restrictions on text-mining please let us know on this blog or on open-science. Ideally this would be hard, documented facts, but if institutions are actually debarred from publishing this information then I, as a reader of the material, am in an interesting position. So before poking around in my usual fashion I’d like any evidence that’s publishable without fear of lawsuits.

If not I will have to try to find this out the hard way.

 

 

 


OSCAR4 launch roundup

#oscar4

IMO the OSCAR4 launch was a great success. We had visitors from outside the Unilever Centre and also remotely on Twitter and the streaming video. The talks were very well presented and were all captured on stream and more permanently. I have had a peek at the recordings and they will be useful in their own right – the text is a little small in places in mine (but the blog is public). For me there was a real surprise – captured on video. I compared OSCAR3 to OSCAR4 as a Heath Robinson train to a Lego™ train. As I switched back to the web page there was a Google splash page with Stephenson’s Rocket – completely unplanned. For a few secs I thought it was deliberate but it’s just a coincidence. Follow the web page http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch to see details of where the videos are available.

Here are the famous cupcakes and the journal-eating-robot…

OSCAR4 is a library – we trailed bits of its API – and Dave Jessop did an excellent job of pulling out the essentials and showing how, with a few commands, we could search for chemical terms, customize the dictionaries and ontologies, and create chemical structures.

The first task was to extract NamedEntities from text. If you know what text is, and know what a NamedEntity is, then it’s simple. Here’s the code – the whole code:

 

        Oscar oscar = new Oscar();
        List<NamedEntity> namedEntityList = oscar.findNamedEntities(text);

 

This 2-line program (idiomatic in almost all modern languages) says:

  • Create an Oscar object
  • Feed it some text and get a list of the named entities.

That’s it. If you understand the terms “text” and “named entity” then you can run Oscar. If you don’t know what a named entity is then just run the program and look at what comes out.

Simple.
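To make “just run the program and look at what comes out” concrete, here is the two-line example wrapped into a complete class. This is only a minimal sketch: the package paths are assumptions (they may differ in your OSCAR4 version), and each entity is printed via its default toString() rather than any particular accessor.

    // Minimal sketch of a runnable version of the two-line example above.
    // Package paths are assumptions and may differ in your OSCAR4 version.
    import java.util.List;
    import uk.ac.ch.wwmm.oscar.Oscar;                // assumed location of the Oscar class
    import uk.ac.ch.wwmm.oscar.document.NamedEntity; // assumed location of NamedEntity

    public class FindEntitiesExample {
        public static void main(String[] args) {
            String text = "Ethyl acetate and acetone both contain the carbonyl group.";
            Oscar oscar = new Oscar();                                          // create an Oscar object
            List<NamedEntity> namedEntityList = oscar.findNamedEntities(text);  // feed it some text
            for (NamedEntity namedEntity : namedEntityList) {
                System.out.println(namedEntity);  // print whatever comes out for each named entity
            }
        }
    }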

Here’s OSCAR munching a spectrum:

        String text = "1H NMR (d6DMSO, 400MHz) d 9.20 (1H, m), 8.72 (1H, s)," +
                " 8.58 (1H, m), 7.80 (1H, dd, J=8.3 and 4.2 Hz) ppm.";

 

Let’s take it in reverse order. In a well engineered library there’s often only one way you can fit the bits together. Like the Lego™ train.

 

Hmm, I need a DataParser (that was hinted at);

 

        List<DataAnnotation> annotationlist = DataParser.findData(procDoc);

 

So now I need to feed it a ProcessingDocument:

 

        ProcessingDocument procDoc = ProcessingDocumentFactory.getInstance().makeTokenisedDocument(tokeniser, text);

 

And to make one of those I need a Tokeniser. Ah, here’s one of those

 

        Tokeniser tokeniser = Tokeniser.getDefaultInstance();

 

So I made a default one. This is the convention-over-configuration stuff – use what is in the box and it will work, usually in a reasonable manner.

Put them together in the right order:

        Tokeniser tokeniser = Tokeniser.getDefaultInstance();
        ProcessingDocument procDoc =
                ProcessingDocumentFactory.getInstance().makeTokenisedDocument(tokeniser, text);
        List<DataAnnotation> annotationlist = DataParser.findData(procDoc);

 

And I have my DataAnnotation (only one, because there is only one spectrum). It’s in XML and can be easily converted into CML and so displayed in CML-compliant tools.
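Wrapped up with the input string and imports, the whole data-parsing example might look like the sketch below. The class and method names are exactly those used above; the package paths and the use of toString() for printing are assumptions.

    // Sketch: the spectrum-parsing pipeline above, assembled into one runnable class.
    // Package paths are assumptions and may differ in your OSCAR4 version.
    import java.util.List;
    import uk.ac.ch.wwmm.oscar.document.ProcessingDocument;        // assumed locations
    import uk.ac.ch.wwmm.oscar.document.ProcessingDocumentFactory;
    import uk.ac.ch.wwmm.oscar.document.Tokeniser;
    import uk.ac.ch.wwmm.oscardata.DataAnnotation;
    import uk.ac.ch.wwmm.oscardata.DataParser;

    public class ParseSpectrumExample {
        public static void main(String[] args) {
            String text = "1H NMR (d6DMSO, 400MHz) d 9.20 (1H, m), 8.72 (1H, s)," +
                    " 8.58 (1H, m), 7.80 (1H, dd, J=8.3 and 4.2 Hz) ppm.";

            Tokeniser tokeniser = Tokeniser.getDefaultInstance();   // default tokeniser
            ProcessingDocument procDoc = ProcessingDocumentFactory
                    .getInstance().makeTokenisedDocument(tokeniser, text);
            List<DataAnnotation> annotationlist = DataParser.findData(procDoc);

            for (DataAnnotation annotation : annotationlist) {
                System.out.println(annotation);  // one annotation here: the NMR spectrum, as XML
            }
        }
    }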

Simple?

So here we are in the Panton afterwards..

 

 


OSCAR: The Journal-eating robot

#oscar4

Follow http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch for all news and URLs (live stream). Hashtag #oscar4.

To introduce the launch of OSCAR4 I will give a short timeline and hopefully credit those who have made contributions.

When I joined the Unilever centre I had the vision that the whole of chemical information could be used as a basis for intelligent software agents – a limited form of “artificial intelligence”. By collecting the world’s information, machines would be able to make comparisons, find trends, lookup information, etc. We would use a mixture of algorithms, brute force, canned knowledge, machine learning, tree-pruning, etc.

The knowledge would come from 2 main sources:

  • Data files converted into semantic form
  • Extraction of information from “written” discourse. This post deals with the latter – theses, articles, books, catalogues, patents, etc.

I had started this in ca 1997 when working with the WHO on markup of diseases and applying simple regular expressions. (Like many people I thought regexes would solve everything – they don’t!)

In 2002 we started a series of summer projects, which have had support from the RSC, Nature, and the IUCr among others. We started on the (very formulaic) language used for chemical synthesis to build a tool for:

  • Checking the validity of the information reported in the synthesis
  • Extracting and systematising the information.

This was started by Joe Townsend and Fraser Norton and at the end of the summer they had a working applet.

2003

Sam Adams and Chris Waudby joined us and turned the applet into an application (below); Vanessa de Sousa created “Nessie”, a tool which linked names from documents to collections of names (such as collections of catalogues; at that stage these collections were quite rare).


Here is OSCAR-the-journal-eating-robot in 2004 eating a bit of journal, analysing it, validating it and transforming the data into CML (e.g. a spectrum).

By 2004 OSCAR was in wide use on the RSC site (http://www.rsc.org/Publishing/Journals/guidelines/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/index.asp ) and could be downloaded for local work. During this period the variants of OSCAR were loosely named OSCAR, OSCAR-DATA, OSCAR1 and OSCAR2. I pay tribute to the main authors (Joe Townsend, Chris Waudby and Sam Adams) who have produced a complex piece of software that has run without maintenance or complaint for 7 years – probably a record in chemistry. At this time Chris developed an n-gram approach to identifying chemical names (e.g. finding strings such as “xyl” and “ybd” which are very diagnostic for chemistry).

By 2005 OSCAR’s fame had spread and (with Ann Copestake and Simone Teufel) we won an EPSRC grant (“Sciborg“) to develop natural language processing in chemistry and science (http://www.cl.cam.ac.uk/research/nl/sciborg/www/ ). We partnered with RSC, IUCr and Nature PG all of whom collaborated by making corpora available. By that time we had learnt that the problem of understanding language in general was very difficult indeed. However chemistry has many restricted domains and that gave a spectrum of challenges. Part of SciBorg was, therefore, to develop chemical entity recognition (i.e. recognising chemical names and terms in text). Simple lookup only solves part of this as (a) there are many names for the same compound (b) hypothetical or recently discovered/synthesised compounds are not in dictionaries. This work was largely carried out by Peter Corbett (with help from the RSC) and the tool became known as OSCAR3.

Entity recognition is not an exact science and several methods need to be combined: lookup, lexical form (does the word look “chemical”), name2Structure (can we translate the name to a chemical?), context (does it make sense in the sentence?) and machine learning (training OSCAR with documents which have been annotated by humans). There are many combinations and Peter used those which were most relevant in 2005. OSCAR rapidly became known as a high-quality tool.

This required large amounts of careful, tedious work. A corpus had to be selected and then annotated by humans. Strict guidelines were prepared; here’s an example from Peter Corbett and Colin Batchelor (RSC). What is a compound? There are some 30 pages like this.


The humans agreed at about 90% precision/recall, setting the limit for machine annotation (OSCAR is about 85%). OSCAR can do more than just chemical compounds and here is a typical result:

 

In 2008 Daniel Lowe took up the name2Structure challenge and has developed OPSIN to be the world’s most accurate name2structure translator, with a wide range of vocabularies. OPSIN was bundled as part of OSCAR3, which also acquired many other features (web server, chemical search, annotation, etc.). We were joined by Lezan Hawizy, who initiated Unilever-funded work on extracting polymer information from the literature. Joe Townsend worked on the JISC SPECTRaT project, looking at extracting information from theses. In 2009 we won a JISC grant (Cheta) to see whether humans or machines were better at marking up chemistry in text, and work proceeded jointly at Cambridge (Lezan Hawizy) and the National Centre for Text Mining (NaCTeM, Manchester).

In late 2009 OMII contracted to start the refactoring of OSCAR and in 2010 we won an EPSRC grant to continue this work in conjunction with Christoph Steinbeck (ChEBI at EBI), and Dietrich Rebholz. The hackforce included David Jessop, Lezan Hawizy and Egon Willighagen resulting in OSCAR4.

OSCAR is widely used but much of our feedback is anecdotal. It’s been used to annotate PubMed (20 million abstracts) and we have used it to analyse > 100,000 patents from the EPO (“the green chain reaction”; example: http://greenchain.ch.cam.ac.uk/patents/results/2009/solventFrequency.htm ). In 2010 we were funded by EPSRC “Pathways to Impact” to extend OSCAR to atmospheric chemistry (Hannah Barjat and Glenn Carver).

Besides the people mentioned above thanks to Jonathan Goodman, Bobby Glen, Nico Adams, Jim Downing.

Summary:

Thanks to our funders and supporters: EPSRC, DTI, Unilever, RSC, Nature, IUCr, JISC.

OSCAR4 now has a robust architecture and can be extended for work in a variety of disciplines. It interfaces well with other Open Source tools (e.g. from the Blue Obelisk) and can be customised as a component (e.g. in Bioclipse) or using OS tools and its own components. The main challenges are:

  • The commonest form of scientific document, PDF, is very badly suited to any information extraction.
  • Almost all publishers forbid the routine extraction of information from their publications.

 


OSCAR4: how you can make a library work for you

#oscar4

As we’ve said, OSCAR4 is a set of library components that can be reconfigured in many ways (rather than a single monolithic application). On Wednesday (in Cambridge or remotely) we’ll be looking at how to bolt the bits together. We’ll make a Java IDE (Integrated Development Environment, such as Eclipse, NetBeans or IntelliJ) available to participants. (Actually this is probably the hardest part of the whole thing – getting your environment working.)

Now you have to get OSCAR4 from https://bitbucket.org/wwmm/oscar4/. It’s managed by a system called Mercurial which looks after what to download and how to configure it on your system. Create a directory on your system and download OSCAR4 (Bitbucket even tells you how:

hg clone https://petermr@bitbucket.org/wwmm/oscar4 )

 

When you’ve done that you’ll find a series of directories representing the main sub-projects in OSCAR (which hopefully have meaningful names):

(Exactly how this displays will depend on the options in your IDE.)

OSCAR4 provides the simplest options through the oscar4 package. We use the “convention over configuration” approach (http://en.wikipedia.org/wiki/Convention_over_configuration). What’s that? Basically it says that you create default actions, default naming conventions, etc. so that you can run the system as easily as possible and get the most useful results. If you want something more sophisticated you change the configuration. Here’s the structure:

The main class is Oscar (actually uk.ac.ch.wwmm.oscar.Oscar) and it has a series of methods. This is not as bad as it looks. The get methods (getters) and the setters can be ignored for now (because we will use the default). So this leaves us with only 4 methods we might use. Here’s the most important:


    /**
     * Wrapper method for identification of named entities. It calls the methods
     * normalise(String), #tokenise(String), and #recogniseNamedEntities(List).
     *
     * @param input the input text.
     * @return the recognised chemical entities.
     */
    public List<NamedEntity> findNamedEntities(String input);

 

Note that this already calls tokenise() and recogniseNamedEntities(List). So we don’t need to worry about these at first reading. The final method is:


    /**
     * Wrapper method for the identification of chemical named entities
     * and their resolution to connection tables. It calls the methods
     * normalise(String), tokenise(String), recogniseNamedEntities(List), and
     * ChemNameDictRegistry.resolveNamedEntity(NamedEntity).
     *
     * @param input String with input.
     * @return the recognised chemical entities as a List of ResolvedNamedEntitys,
     * containing only those named entities of type NamedEntityType.COMPOUND
     * that could be resolved to connection tables using the current dictionary registry.
     */
    public List<ResolvedNamedEntity> findResolvableEntities(String input);

 

So, if we understand the meanings of the nouns, this is almost English prose. Identifying Chemical Named Entities is the primary purpose of OSCAR4. Normalising the string is necessary, but we’ll trust OSCAR. Tokenization is splitting the sentence into words (not necessarily just at spaces). And the ChemNameDictRegistry is a registry of ChemNameDicts. These are dictionaries mapping chemical names to structures (e.g. “methanol” -> CH3OH). So, in this case, instead of returning NamedEntities, we resolve them into structures before returning.
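As a minimal sketch of how that second method might be called (the package paths, including the location of ResolvedNamedEntity, are assumptions; the results are printed via toString() rather than any particular accessor):

    // Sketch: resolving chemical names to structures with findResolvableEntities().
    // Package paths are assumptions and may differ in your OSCAR4 version.
    import java.util.List;
    import uk.ac.ch.wwmm.oscar.Oscar;                                     // path as given in this post
    import uk.ac.ch.wwmm.oscar.chemnamedict.entities.ResolvedNamedEntity; // assumed location

    public class ResolveExample {
        public static void main(String[] args) {
            String text = "The product was recrystallised from methanol.";
            Oscar oscar = new Oscar();
            // Returns only COMPOUND entities that the ChemNameDictRegistry could
            // resolve to connection tables (e.g. "methanol" -> CH3OH).
            List<ResolvedNamedEntity> resolved = oscar.findResolvableEntities(text);
            for (ResolvedNamedEntity entity : resolved) {
                System.out.println(entity);  // rely on toString(); no specific accessor assumed
            }
        }
    }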

 

Simple?!

 

So the first exercise is:

 

    /**
     * Use the Oscar API class to identify named entities in
     * the text string below. Print each to the console.
     */
    public static void main(String[] args) {

        String text = "Ethyl acetate and acetone both contain the carbonyl group.";
        //TODO Implement your method here!

    }

 

To get you started, I mention that you have to create an OSCAR, and then select the appropriate method:

 

Oscar oscar = new Oscar();

oscar.doSomething(argument);

 

and that’s all there is to it…


What is OSCAR4 and why we created it

#oscar4

On Wednesday we are launching OSCAR4 (/pmr/2011/04/08/oscar4-launch/) . OSCAR4 has involved a very large amount of work (“refactoring”) which has resulted in some change to the surface functionality and a huge change to the architecture.

What does that mean? Essentially it means that we (the extended group of Egon, Daniel, David, Lezan, Sam, Bala and others, with not much PMR) have almost completely ripped out the guts of OSCAR3 and replaced them with a series of modules that are engineered to work reliably and interoperate. Here are two steam engines of roughly the same format. The first (Heath Robinson, from Wikipedia) is made of sealing wax and string, glue, wood and a prayer


while the second (http://uk.wikipedia.org/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Lego_Creator_4837_-_Mini_Trains.jpg) is made from reusable components. That’s a major difference between OSCAR3 and OSCAR4 – we now have something that can be extended and interoperate. (That is not to belittle the efforts of the authors of OSCARs 1, 2 and 3, who have built excellent software that is useful and widely used. But every piece of software tends to become bloated, and refactoring is an essential part of software engineering. The world changes and expectations change. Fred Brooks says (http://www.softwarequotes.com/printableshowquotes.aspx?id=556): “Plan to throw one away; you will anyhow.”) So time for OSCAR4.

What’s the difference? OSCAR4 consists of a “core” of OSCAR3 which is the main language engine. We’ve removed the following from the core:

  • Chemical substructure and similarity search. Structure isn’t fundamental to the language processing (unlike OPSIN where chemical structure matters). So searching for entities can be done through decoupled services or other libraries.
  • Scrapbook. A place where people can keep the structures in their documents. Again we decouple this – we could, for example, now use Chem#.
  • Lookup from Pubmed. Again this can be decoupled.
  • Annotation. This is useful for training models but doesn’t need to be part of the main libraries

Everything else has been kept in some form. We’ve also added:

  • Configurable Lexicons (Dictionaries, …). This allows anyone to add their own names and structures
  • Configurable workflow (perhaps the most powerful refactoring). This means that you can swap in your own Tokenizer, Hyphenator, Machine-learning model, Dictionaries, and ontologies (name – identifier pairs). It makes OSCAR compatible with tools such as UIMA.
  • ChemicalTagger (the chemical phrase analyser from Lezan Hawizy). Although this isn’t formally part of the core it’s very likely to be used in conjunction with OSCAR4. This combination is a very powerful chemical language analyser.

On Wednesday we’ll take you through this (hopefully including those online). It’s important to realise that OSCAR4 is a library of components, not an application. (It’s easy to build applications, of course). So OSCAR is not a web server (but can be bolted into one). It’s not a mobile app but should be capable of being included in one. Etc. Think gearboxes and axles, not cars. The scientist of the future will build their applications from components, just as they build their glassware from ground-glass components. OSCAR4 is designed as a tool that can be included in any application, whether Open or commercial (it’s Open Source).

This means understanding how to bolt things together. It’s not hard, any more than building trains from Meccano™. We’ll give some simple examples of how to process a document, how ChemicalTagger works, how one might create a server. We’ll show how to use Java, understand the docs, etc. And how to extend and modify OSCAR4 without fear of breaking it.
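To give a flavour of “bolting OSCAR into a web server”, here is a minimal sketch using the JDK’s built-in HttpServer. The Oscar calls are the ones described in these posts; the package paths, port and URL path are illustrative assumptions, not the examples planned for Wednesday.

    // Sketch: wrapping the OSCAR4 library in a tiny HTTP service (JDK HttpServer).
    // Oscar package paths are assumptions; the port and path are illustrative.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import com.sun.net.httpserver.HttpServer;

    import uk.ac.ch.wwmm.oscar.Oscar;                // assumed location
    import uk.ac.ch.wwmm.oscar.document.NamedEntity; // assumed location

    public class OscarServerSketch {
        public static void main(String[] args) throws IOException {
            Oscar oscar = new Oscar();  // one Oscar instance, reused for every request
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/oscar", exchange -> {
                // Treat the POSTed body as plain text to be analysed.
                String text;
                try (InputStream in = exchange.getRequestBody()) {
                    text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
                StringBuilder response = new StringBuilder();
                List<NamedEntity> entities = oscar.findNamedEntities(text);
                for (NamedEntity entity : entities) {
                    response.append(entity).append('\n');  // toString() only; no accessor assumed
                }
                byte[] bytes = response.toString().getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(bytes);
                }
            });
            server.start();  // e.g. POST text to http://localhost:8080/oscar
        }
    }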

 

 

 

 


NWChem and Quixote: a first JUMBO template-based parser

#quixotechem #nwchem #jumboconverters

As I’ve mentioned (/pmr/2011/04/09/nwchem-a-fully-open-source-compchem-code-from-pnnl/ ) I now see NWChem as my flagship project for the Open Source Quixote (http://quixote.wikispot.org/Front_Page ) project to create an open source semantic framework for computational chemistry.

I was really flattered and encouraged by the NWChem group at PNNL and so spent much of the time on the way back (plane, taxi) and over the weekend hacking a JUMBO parser (/pmr/2011/03/24/extracting-data-from-scientific-calculations-and-experimental-log-files/ ). This is quite different from normal parsers written in C, Perl, Python, Java, etc. as it’s declarative. There is no need for the parser writer to write a single line of procedural code.

Here’s an example. I want to parse:

Brillouin zone point: 1

weight= 1.000000

k =< 0.000 0.000 0.000> . <b1,b2,b3>

=< 0.000 0.000 0.000>

.. terminated by a blank line ..

Assuming that the phrase “Brillouin zone point:” only occurs in this setting (and there are things I can do if it’s more complicated), I find it with a regular expression:

<template id="brillouinzp" repeat="*" pattern="\s*Brillouin zone point:\s*" endPattern="\s*">

This gives the starting regex and ending regex (\s* means any number of spaces) and the template can be repeated as many times as wanted (*). In the case of NWChem it works very well. Then, to parse the 4 lines (records), we use per-line regexes (only active in the current scope, brillouinzp):

<record id="brill">\s*Brillouin zone point:{I,n:brill.zone}</record>
<record id="weight">\s*weight={F,n:brill.wt}</record>
<record id="k">\s*k\s*=&lt;\s*{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>
<record id="k1">\s*=&lt;{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>

This extracts the fields into CML elements (cml:scalar in this case) each identified with a unique ID. The {…} are shorthands for regular expressions for Integer, Float, etc. with an associated QName. The format n:foo maps onto a unique URI, with xmlns:n=”http://www.xml-cml.org/dictionary/nwchem“. (This dictionary will shortly exist publicly, but we hope the definitive one will be managed by the NWChem group – they know what the terms mean!)
Every field must have an entry in the NWChem dictionary and so far we have extracted about 200 terms from the example files – I expect this to reach about 300-400. Thus the entry with id="brill.wt" will describe the floating point number (F) that is the weight of the zone point.

There are tools to extract arrays, matrices and molecules and to transform to CML where necessary (the above will probably be transformed into concise and semantic CML). Finally we terminate the template:

</template>

This declarative approach (inspired by XSLT) has many advantages over the procedural:

  • It’s much quicker to write and more concise
  • It’s much easier to see what is going on without delving into the code
  • There are no IF statements to take care of unexpected or variable chunks of output
  • It’s much easier to document (it’s all in XML and can be processed by standard tools)
  • It’s easier to create sub-parsers for special cases
  • There is no loss of information – unparsed records are reported as such
  • It maps easily onto a dictionary structure
  • It preserves the implicit hierarchical structure of documents
  • It would be possible to generate some templates directly from a corpus of documents
  • It provides an excellent input for Avogadro and Jmol

It requires the authors to learn regex, but they would have to do that anyway. Its main limitations are:

  • It’s based on lines (records) and does not work well where line ends are simply wrapping whitespace
  • It relies on distinctive phrases (especially alphabetic) – it’s not designed for dense numeric output (though it will work for some)

     

There are about 120 templates so far for NWChem and it has stood up well to new examples. It has parsed files of ca. 1 MByte in a few seconds. (Remember that these files can take days to compute, so the time is trivial.) So I’m convinced that it works, and scales. I don’t yet know how easy others will find it, but we’ve had good first impressions.

I will keep you in touch. Open Data, Open Source and Open Standards is coming to computational chemistry!

 
