#quixotechem #wwmm #jiscxyz
Last week we agreed that a small, very agile group of co-believers would put together a system for collecting, converting, validating, and publishing Open Data for computational chemistry, described by the codeword “Quixote”. This is not a fantasy – it’s based on a 10-year vision which colleagues and I put together and called the “World Wide Molecular Matrix”. I’ve talked about this on various occasions and it’s even got its own Wikipedia entry (http://en.wikipedia.org/wiki/WorldWide_Molecular_Matrix – not my contribution, which is as it should be). We put together a proposal to the UK eScience program in (I think) 2001 which outlined the approach. Like so much else, the original design is lost to me, though it may be in the bowels of the EPSRC grant system. We got as far as presenting it to the great-and-the-good of the program but it failed (at least partly) because it didn’t have “grid-stretch”. [I have been critical of the GRID’s concentration on tera-this and peta-that and the absurdly complex GLOBUS system, but I can’t complain too much because the program gave us 6 FTE-years of funding for the “Molecular Standards for the Grid” project, which has helped to build the foundations of our current work.] Actually it’s probably a good thing that it failed – we would have had a project which involved herding a lot of cats and where the technology – and even more the culture – simply wasn’t ready. And one of my features is that I underestimate the time to create software systems – it seems to be the boring bits that take the time.
But I think the time has now come where the WWMM can start to take off. It can use crystallography and compchem separately and together as the substrates and then gradually move toward organic synthesis, spectroscopy and materials. We need to build well-engineered, lightweight, portable, self-evident modules and I think we can do this. As an example, when we built an early prototype it used WSDL and other heavyweight approaches (there was a 7-layer software stack of components which were meant to connect services and do automatic negotiation – as agile as a battle tank). We were told that SOAP was the way forward. And GLOBUS. And certificates. We were brainwashed into accepting a level of technology which was vastly more complex than we needed (and which I suspect has frequently failed in practice). Oh, and Upper Level Ontologies, levels of trust – all the stuff from the full W3C layer cake.
What’s changed is that the bottom-up approach has gone lightweight. REST is simple (I hacked the Green Chain Reaction in REST – with necessary help from Sam Adams). The new approach to Linked Open Data is that we should do it first and then look for the heavy ontology stuff later – if at all. Of course there are basics such as an ID system. But URLs don’t have to resolve. Ontological systems don’t have to be provably consistent. The emergent intelligent web is a mixture of machines and humans, not first-order predicate logic on closed systems. There’s a rush towards key-value systems – MongoDB, GoogleData, and so on. Just create the triples and the rest can be added later.
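To make “just create the triples” concrete, here is a minimal sketch in Python. It is only an illustration: the identifiers (`mol:water`), the property names (`prop:formula`) and the `match` helper are all invented for this example and are not part of any Quixote or Chem# code.

```python
# A toy triple store: each statement is (subject, predicate, object).
# All identifiers and property names are hypothetical placeholders.
triples = [
    ("mol:water", "prop:formula", "H2O"),
    ("mol:water", "prop:program", "ExampleCode"),
    ("mol:methane", "prop:formula", "CH4"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# New statements can be added later without touching any schema.
triples.append(("mol:methane", "prop:pointGroup", "Td"))

# Key-value style lookup: everything we know about methane.
for subject, predicate, value in match(triples, s="mol:methane"):
    print(predicate, "=", value)
```

Nothing here needs an ontology or a resolvable URL up front; the identifiers only have to be used consistently, and richer semantics can be layered on afterwards.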
What’s also happened is Openness. If your systems are Open you don’t have to engineer complex human protocols – “who can use my data?” – “anyone!” (that’s why certificates fail). Of course you have to protect your servers from vandalism and of course you have to find the funding somewhere. But Openness encourages gifts – it works both ways as the large providers are keen to see their systems used in public view.
And the costs are falling sharply. I can aggregate the whole of published crystallography on my laptop’s hard drive. Compchem is currently even smaller (mainly because people don’t publish data). Video resources dwarf many areas of science, so most concerns about size, bandwidth, etc. are unnecessary.
And our models of data storage are changing. The WWMM was inspired by Napster – the sharing of files across the network. The Napster model worked technically (though it required contributors to give access to local resources, which can be seen as a security risk and which we cannot replicate by default). What killed Napster was the lawyers. And that’s why the methods of data distribution and sharing have an impaired image – because they can be used for “illegal” sharing of “intellectual property”. I use these terms without comment. I believe in copyright, but I also challenge the digital gold rush that we’ve seen in the last 20 years and the insatiable desire of organizations to possess material that is morally the property of the human race. That’s a major motivation of the WWMM – to make scientific data fully Open – no walled gardens, however pretty. Data can and will be free. So we see and applaud the development of Biotorrents, Mercurial and Git and many Open storage locations such as BitBucket. These all work towards a distributed knowledge resource system without a centre and without controls. Your power is your moral power, the gift economy.
And that is also where the Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) comes to help. Over the last 6 years we have built a loose bottom-up infrastructure where most of the components are deployed. And because we believe in a component-based approach rather than monoliths it is straightforward to reconfigure these parts. The Quixote system will use several Blue Obelisk contributions.
And we have a lot of experience in our group in engineering the new generation of information systems for science. This started with a JISC project, SPECTRa, a collaboration between Cambridge and Imperial and between chemistry and libraries, which has seeded the creation of a component-based approach. Several of these projects turned out to be more complex than we thought. People didn’t behave in the way we thought they should, so we’ve adjusted to people rather than enforcing our views. That takes time and at times it looks as if there is no progress. But the latest components are based on previous prototypes and we are confident that they now have a real chance of being adopted.
To keep the post short, I’ll simply list them and discuss in detail later:
- Lensfield. The brainchild of Jim Downing: a declarative make-like build system for aggregating, converting, transforming and reorganizing files. Originally designed in Clojure (a functional language that runs on the Java Virtual Machine), it has now been rebuilt by Sam Adams as a simpler system (Lensfield2). This doesn’t have the full richness and beauty of Clojure – which may come later – but it works. The Green Chain Reaction used this philosophy and processed tens or hundreds of thousands of files in a distributed environment.
- Emma. The embargo manager. Because data moves along the axis of private->Open we need to manage the time and the manner of its publication. This isn’t easy and with support from JISC (CLARION) we’ve built an Embargo manager. This will be highly valuable in Quixote because people need a way of staging release.
- Chem# (pronounced “ChemPound”). A CML-RDF repository of the chemistry – based on molecules. We can associate crystallography, spectra, and in this case compchem and properties. The repository exposes a SPARQL endpoint. This means that a simple key-value approach can be used to search for numeric or string properties. And we couple this to a chemical search system based on (Open) Blue Obelisk components.
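The staged-release idea behind Emma in the list above can be sketched in a few lines of Python. This is not Emma’s actual design or API: the `Record` class, its `release_on` field and the `publishable` helper are hypothetical names chosen only to show the private->Open transition as a date check.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Record:
    """A hypothetical embargoed dataset entry."""
    identifier: str
    release_on: Optional[date]  # None: stays private until a date is set


def is_open(record: Record, today: date) -> bool:
    """A record becomes Open on or after its release date."""
    return record.release_on is not None and today >= record.release_on


def publishable(records: List[Record], today: date) -> List[str]:
    """Identifiers of the records that may be exposed today."""
    return [r.identifier for r in records if is_open(r, today)]


# Invented example records, for illustration only.
records = [
    Record("calc-001", date(2010, 9, 1)),   # embargo already expired
    Record("calc-002", date(2011, 1, 1)),   # still embargoed
    Record("calc-003", None),               # no release date chosen yet
]

print(publishable(records, date(2010, 10, 1)))
```

A real embargo manager also has to deal with who sets the date and how release is triggered, which is where the hard part lies; the sketch only covers the final check.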
The intention is that these components can be easily deployed and managed without our permission (after all they are Open). They will act as a local resource for people to manage their compchem storage. They can be used to push data either to local servers or to community Chem# repositories which we shall start to set up. Using Nick Day’s pub-crawler technology (which builds crystaleye every night) we can crawl the exposed web for compchem, hopefully exposed through Emma-based servers.
We hope this prompts publishers and editors to start insisting that scientists publish compchem data with their manuscripts. The tools are appearing – is the communal will-to-publish equally encouraging?