Since about 2001 we have been working on a general, fluid concept which we labelled “World Wide Molecular Matrix”. (We actually put in a grant application under that name to the then-new UK eScience programme – it didn’t get funded but helped to sort out some ideas. In fact, things turned out better, as we got more limited, but more tractable, funding from UK eScience via the Cambridge eScience Centre and DTI.)
The WWMM is described on our home page and, interestingly, also in Wikipedia. (I don’t know who started this – that’s one of the great things about WP. But it’s gratifying to think that someone believes it’s worth an article.)
However the WWMM has taken its own course. When we had the idea in 2000/1 it was very much driven by the idea of music-sharing peer2peer systems. We believed that the same would work naturally for chemistry – simply promote the idea, produce some simple examples of how it might work and it would “build itself”. Of course it didn’t, and there are many reasons – some of which I have blogged and most of which come down to how chemists think and behave. And I made my normal mistake, which is to assume that getting from idea to deployment is trivial.
But the WWMM is now starting to take off. This is not due to any major single breakthrough but to a whole number of related ideas and technologies. They include:
- CML is now robust, supported and widely deployed
- InChI has catalysed the community to the realisation that metadata is valuable, works and is worth the effort
- PubChem has shown that large free resources can be deployed and that people will contribute to them
- The Open Access movement continues to gain ground
- The concept of data-reuse is much stronger
- The younger generation thinks in radically different ways and will not put up with the technical and social dysfunction of chemical providers
- Google has shown the value of free-text indexing
- Institutional Repositories are a reality
- We have received funding (JISC/SPECTRa) to help preserve scientific data from loss.
- We have continued to develop new informatics methods such as OSCAR3 (Peter Corbett)
- The Royal Society of Chemistry has promoted the idea of hypertextual documents (Project Prospect)
- The Blue Obelisk has pulled together much (most?) of the Open Source innovation in chemistry
- The Crystallography Open Database has pioneered the idea of self-contributed crystal structures – ca 480,000 structures
- The biosciences are fed up with the conservatism of chemistry and are funding their own Open chemical resources (e.g. ChEBI chemistry ontology)
- The chemical blogosphere has blossomed into a mature, thoughtful, responsive community of new thinkers.
(and more that I have probably forgotten).
The point is that none of this was available in 2002. (If we had had large-scale funding for a detailed managed project we’d have ended up somewhere different from where we are now. And perhaps out of sync…)
So why is the WWMM starting to take off? Isn’t PubChem in fact the WWMM? It’s clearly part of it. But the WWMM is broader – it’s not a single entity, good as PubChem is – it’s a set of technical metadata, social aspirations and protocols that allow collaborative knowledge-driven chemistry to flourish at near-zero cost.
I’m not claiming that the whole of the chemical net is WWMM – it’s a smaller part. It is the idea that, in near-zero-cost collaboratively planned activities, chemists can create and share certain types of data and create certain types of shared knowledge. For example the crystallography, computational chemistry and spectroscopy data (thanks to SPECTRa and other contributions) represent a starting point.
Nick Day in our group has created a very exciting WWMM resource component, CrystalEye, which he presented at the March 2007 meeting of the ACS. CrystalEye (originally called CMLCrystBase) consists of a robot that harvests (scrapes) all legally allowed crystal structures in current publications (back to ca 1992). [Certain publishers (ACS, Wiley, Springer) forbid the re-use of their data so we don’t scrape them]. We are also adding in the Crystallography Open Database (we have to remove duplicates). I’ll blog more of that later.
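To give a flavour of what “removing duplicates” can involve when two collections are merged, here is a minimal sketch. It is not how CrystalEye does it; the record layout (`id`, `formula`, `cell`), the function names and the matching rule (same formula, cell lengths equal after rounding) are all assumptions made for illustration.

```python
def dedup_key(entry, tol=0.01):
    """Build a hypothetical matching key for a crystal structure record.

    Assumption: two entries are "the same" if their formulae match once
    spaces are removed and their cell lengths agree after rounding to `tol`.
    """
    cell = tuple(round(length / tol) for length in entry["cell"])
    return (entry["formula"].replace(" ", ""), cell)


def merge_unique(*sources):
    """Merge several lists of records, keeping the first entry seen per key."""
    seen = {}
    for source in sources:
        for entry in source:
            seen.setdefault(dedup_key(entry), entry)
    return list(seen.values())


# Invented example records, not real data.
harvested = [{"id": "a", "formula": "C6 H6", "cell": (7.44, 9.55, 6.92)}]
other_db = [
    {"id": "b", "formula": "C6H6", "cell": (7.441, 9.551, 6.921)},
    {"id": "c", "formula": "C2 H6 O", "cell": (5.0, 6.0, 7.0)},
]
merged = merge_unique(harvested, other_db)
```

Rounding into bins is crude: two near-identical values can fall either side of a bin boundary, so a real system would want a proper tolerance comparison and probably more than the cell lengths.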
So the Openness of some publishers, and the contributions of individuals, are an important part of the WWMM. So is the ability of OSCAR3 to scrape those parts of chemical articles that can reasonably be considered to be factual data (and I’m getting more bullish on what we can scrape and some cunning ways of doing it). But the real power will come from the regular, zero-cost automatic contribution of ordinary chemists in departments. If enough chemical theses are (a) Openly exposed and (b) in semantic form with agreed syntax (CML) and metadata then the concept is proved. The work with repositories (OAI-PMH and OAI-ORE, together with the proven value of free-text indexing) creates the infrastructure for the matrix – the chemistry is layered on top. If all chemical data collected in chemistry departments is exposed in the same way then we will have built the matrix.
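As a rough illustration of the repository layer, the sketch below builds an OAI-PMH `ListRecords` request and pulls identifiers and Dublin Core titles out of a response. The verb, the `metadataPrefix`, `set` and `from` arguments and the two namespaces come from the OAI-PMH protocol; the repository address, the sample record and the function names are made up, and the response is a hand-written sample rather than real output (no network call is made).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract the identifier and Dublin Core titles of each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        ident = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [t.text for t in rec.iter(f"{{{DC_NS}}}title")]
        records.append({"identifier": ident, "titles": titles})
    return records


# Hand-written sample response for a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:thesis-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical chemistry thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://repository.example.org/oai")
records = parse_records(SAMPLE)
```

The chemistry itself (CML files, InChIs and so on) would sit in or alongside such records; this sketch only covers the generic harvesting step underneath.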
So I’m working to help accelerate it. The toolkit is looking pretty good. Nick’s CrystalEye is great, and the JUMBO and CDK toolkits which power it are also used in SPECTRa. The metadata seems robust; CML-RSS works and can tailor output to individual chemists on demand. Chemical substructure searching is still a challenge, but can be managed for medium-size collections or through PubChem as a linkbase. And I suspect we shall create a standoff chemical search toolkit.
But the fundamental aspect of the matrix is that it is decentralised. All our software is Open and cloneable. Thus anyone can set up a CrystalEye or SPECTRa server. Anyone can download and run an OSCAR3 server (takes ca 10-15 minutes). In this way the whole chemical social computing community (which includes bioscientists, LIS etc.) can share the effort and excitement.
What’s missing?
- Senior chemists
- 20th-Century data and information aggregators
- 20th century chemical software companies
- The pharmaceutical companies
Does this matter? History will tell, but I doubt it.
It isn’t, of course, completely zero-cost, just as Wikipedia isn’t. But it’s not easy to get dedicated chemistry funding. JISC, EPSRC, DTI and Unilever have helped us in Cambridge. Some of the other collaborators have had useful funding. But the group in Koeln (DE) under Christoph Steinbeck, which has made an enormous contribution (CDK, NMRShiftDB, etc.), has had its funding terminated. So if there are organisations that are interested in supporting social computing in one of the most exciting current developments, let us know.
Peter, great to see that WWMM is starting to gain some momentum. I certainly think there are various behavioural properties of chemists that have held many back from taking part in blogs and wikis and general web 2.0 type stuff. But, I interviewed Steve Bachrach for Reactive Reports and he admits to having been reluctant to get involved but has now recognised (through reading your blog) that there is value in these technologies and is soon to start his own. The interview is here
Meanwhile, I take it you saw the launch of the chemistry database search tool ChemSpider.com, which the developers reckon could provide a one-stop shop for chemistry database searching (PubChem, ChEBI, academic, NFP, and commercial). It’s got 10 million+ entries so far, but I think they’re adding to it all the time. (I have to confess an interest here, as they gave me some webspace to host a new chemistry blog – Spinneret)
db