I’ve been invited to talk to a group of cheminformaticians – mainly pharma – at the European Bioinformatics Institute today. The topic of the 3-day meeting is “Open Source”.
The simplistic view of Cheminformatics is:
- Discover data. This is extremely difficult, and the quality of the data is highly variable
- Extract “features” and properties
- Extract molecular components and calculate “features”
- Develop a machine-learning model
- Analyze the output of the model
- Possibly, though rarely, develop a human-understandable hypothesis
The components in this process are normally all CLOSED. That leads to non-reproducibility, poor or non-existent hypotheses, sloppiness (including inadvertent data selection) and fraud.
Only Open Data, Open Specs and Open Source can challenge this.
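To make the steps listed above concrete, here is a minimal sketch in Python of the workflow as a chain of swappable functions. Everything in it is an assumption for illustration: the four data points are invented, `count_carbons` is a toy feature rather than real chemical perception, and a least-squares line stands in for the machine-learning model. The only point is that when data, features and model are all visible, anyone can rerun the analysis and inspect the residuals.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy data: (SMILES string, property value). Values are invented.
DATA: List[Tuple[str, float]] = [
    ("CCO", 1.0),
    ("CCCO", 2.0),
    ("CCCCO", 3.0),
    ("CCCCCO", 4.0),
]


def count_carbons(smiles: str) -> float:
    """Toy feature: count upper-case 'C' characters.

    This is not real chemistry perception (it would miscount 'Cl', for example).
    """
    return float(smiles.count("C"))


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def run_pipeline(
    data: Sequence[Tuple[str, float]],
    featurize: Callable[[str], float],
    fit: Callable[[Sequence[float], Sequence[float]], Tuple[float, float]],
) -> Dict[str, object]:
    """Run features -> model -> analysis, with every step passed in explicitly."""
    xs = [featurize(smiles) for smiles, _ in data]
    ys = [value for _, value in data]
    slope, intercept = fit(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return {"slope": slope, "intercept": intercept, "residuals": residuals}


result = run_pipeline(DATA, count_carbons, fit_line)
print(result)
```

Because `featurize` and `fit` are ordinary arguments, a reader can swap in a different descriptor or model and compare the results against the same data.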
This is an exciting opportunity to promote the value of Open Source and I think that I’ll be talking to the – at least partially – converted. There seems to be a realisation in Pharma that the closed approaches of the past are not very successful and that we need to complement (not replace) them with Open approaches. This realisation is shown most prominently by the Open Drug discovery projects – GSK has donated a large chunk of data into the Open domain, and Mat Todd is spearheading a very exciting Open project on developing new anti-malarials. But these are few.
I’m going to argue that Open approaches are beneficial on a purely utilitarian basis. There are also ethical, moral and legal reasons why we should use Openness – for example when the work is publicly funded – but I think the utilitarian case alone is compelling.
I have the privilege of kicking off so I’ll try to cover at least:
- The current state of cheminformatics and the role of Openness. I’ll argue that we now have at least one exemplar of every necessary component in software, but that we are shackled by restrictive practices in creating and disseminating data and metadata. I’ll also note (with sadness) the almost zero effort put into metadata and ontologies compared with bioinformatics, and again the restrictive practices of those who should be trying to help the community.
- A vision of a semantic framework for chemistry that would enormously reduce development time and improve quality, validation and innovation
- Demonstrations of some of our own components
- Suggestions of ways forward that would allow the pharma industry to support Open Data, Open Specifications and Open Source (the Blue Obelisk’s ODOSOS).
The current “market” in cheminformatics (if we include computational chemistry, databases, literature resources etc.) runs to over USD 1 billion (I’d be glad of a better estimate). Much of this is paid by the pharma industry – a smaller (but increasingly painful) amount by academia. Unlike bioinformatics, very little of this feeds back into a better, more innovative infrastructure. Indeed much of the current product development is in integration and widget frosting rather than fundamental design. These are important but not at the cost of a stagnant research effort.
The problem – as often – is that the economics are broken. We do not measure the opportunity costs, the cost of broken information. An audit of cheminformatics and chemical information would show that it is grossly inefficient when measured against the public and private goods it produces.
We need a better business model and I hope that we can explore that. I don’t have a magic bullet, but I shall avoid the trap of taking my Open output into yet-another closed system. Free-as-in-beer (gratis) usually goes down ratholes of licences, restrictions and badly engineered solutions. Free-as-in-speech (libre) is required.
The bioscientists have it in information. Chemistry does not. But with a 1 billion dollar market we should be able to change that.
If we have the will …
Here is my abstract, anyway. It’s a closed meeting, so I don’t know how much I can report or whether we are using the Chatham House Rule.
Open Source Chemoinformatics
Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
In many disciplines it is routine to require both data and software to be available for reviewers or readers for Open validation and re-use. Chemical software, and the representation of chemical data, are based on well-established principles and most of the common algorithms are completely understood and published. The implementation of chemistry as Open Source is therefore possible and has many advantages:
- The source code is always available and the algorithms or defects are transparent
- The assumptions in running software (e.g. parameterization) are also clear
- It is possible to modularize the computation so that different algorithms, data and strategies can be varied with minimal effort.
- Publishing and validating the results becomes easier.
- Re-use of previous work, including program outputs and knowledgebases, is possible.
The chemistry Open Source community, epitomized by the Blue Obelisk (http://www.blueobelisk.org), aims at creating re-usable, interoperable components of high, tested quality which are developable by the community. In general there are Open Source components for almost all of the widely used algorithms and processes.
Data are equally important, though harder to acquire even when published conventionally as they are obscured by copyright or monopolies which restrict access and use. It is possible to extract much information from conventional publications using machines. We can also build aggregator and publisher systems for data encouraged by funder requirements.
We (http://wwmm.ch.cam.ac.uk) have built a series of components to support Open chemistry. They are based on the Chemical Markup Language (CML) infrastructure and include: OSCAR4, a modular system for textmining; OPSIN, a name2structure converter; JUMBOConverters which process collections of legacy material (including computational logfiles) into semantic form; Chempound, a semantic RDF repository for any chemistry; Crystaleye, an automatic aggregator of crystal structures and publications; Lensfield, a make facility for data. All interoperate with Blue Obelisk software.
I shall also review virtual communities (e.g. Quixote, http://quixote.wikispot.org/Front_Page, for computational chemistry) and the social principles for successful modern Open Source and Open Data.