I have suggested that the Scholarly Revolution should be decentralized and communicating and this post gives an example of why and how. “Decentralized” means that no one person or subgroup is critical to its operation and more importantly its continued operation. It also means that we do not have to agree on everything (and we certainly shan’t – the mess in “Open Access” should be a clear warning). We should not have to rely on key components – for example building the roads before the houses and then finding people don’t want to live where the roads are but where they can cross the river.
The good news is that information infrastructure can be very cheap and – certainly at an early stage – can be radically altered (refactored) if the community wants. The key thing is COMMUNICATION. As long as we know what other people are doing and saying, many of the difficulties are solved.
So here’s an example of a bottom-up community that works. It costs me 20 pounds a year to run – that’s less than a dinner. It’s growing and it’s changing the world of chemical science. It would continue to run and flourish if I weren’t able to be involved. Everyone has their own homestead (http://en.wikipedia.org/wiki/Homesteading_the_Noosphere ) but there is also a commons. It’s a bazaar (http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar – if you don’t know this, read it – it’s Open). There are other similar metaphors – “cooperative”, “tietotalkoot” (http://p2pfoundation.net/Rural_Cooperation_and_the_Online_Swarm ), “marketplace”, etc.
I’ve written it as part of a chapter we’ve offered for the http://www.openforumacademy.org/ because it’s more important to spread new ideas than gain impact factor.
Bottom-up Open Chemistry – the Blue Obelisk
Chemical software and data are a major activity, almost certainly exceeding 1 billion USD per year. But almost all of it is Closed, represented mainly by domain-specific software companies and traditional STM publishers. This is often aggressively protected; when the NIH set up an Open[*] database of chemicals and compounds the American Chemical Society (ACS) lobbied politically to have this curtailed, and threatened Wikipedia with legal action for publishing the widely used CAS identifiers for chemicals. A major software producer will take legal action against licensees who publish program output, including bugs.
A number of independent, often unfunded, chemical hacker activities grew up during the 1990s and by 2000 a handful of codes were available, but there was little continuity or coordination. We used to meet occasionally at ACS meetings and in 2006 we met in a bar near the large Blue Obelisk in Horton Plaza, San Diego. We felt that we had a consensus of philosophy, that the world undervalued our software and that we had the potential to change the future. We then agreed to loosely coordinate (not pool) our efforts. I suggested the name “Blue Obelisk” and our mantra ODOSOS – “Open Data, Open Standards, Open Source”. To support this we created a Wiki and a mailing list, and agreed to meet for dinner whenever we had a critical mass. There is no budget, no membership, no formal mechanisms – the mantra is our collective and very powerful DNA.
This has proved extremely successful and might work in other disciplines. We have about twenty projects which are happy to be counted as Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) and which fit our ODOSOS criteria. Our dinners are open to all – and closed source providers have attended and been relaxed. In 2006 we published a paper outlining our components. Recently we reviewed this in a 2011 paper with about 20 groups as authors.
When a person or organization does something meritorious (normally an identifiable software product or data resource) I award a quartz Blue Obelisk (remarkably these are common and inexpensive). These loose traditions work. We now have software components in most of the chemical infrastructure for pharmaceuticals and increasingly in materials. The biggest problem is data – chemists do not publish machine-computable data (though they should), instead embedding a subset in formal (Closed Access) publications. We have machine extraction software but risk being prosecuted for extracting data.
Governance is minimal and we have been blessedly spared from either factionalism or imperialism. Each project is self-contained but uses other B/O libraries where possible or, more recently, runs them as web services. The main language is Java, followed by Python and C(++) – with some historical FORTRAN. There is generally a leader for each project and, while the Benevolent Dictator for Life (BDFL) model occurs, the commonest is “Doctor Who”, where the Doctor hands on to a successor at irregular intervals.
Originally dismissed as cranks, we are now taken seriously. Companies (e.g. Kitware, NY, and CCG) contribute significant amounts of code and, as importantly, the critical mass of internal and external confidence. National labs (e.g. PNNL in the US) have been awarded Blue Obelisks for collaborating on Open Source. We know that our code is widely used in pharma companies but we have few metrics (a common problem of Open Source in secretive industries).
As with all volunteer Open Source projects we do not have clear timelines, but progress over the last 5 years has been very good. It’s possible to find high-quality components, complete with unit and regression tests, in most subdomains.
The main problem we face is that chemistry (surprisingly) often does not engineer its own solutions but prefers to buy them. This puts a value on shrink-wrapping and hand-held maintenance which gratis Open Source cannot easily provide. Academics producing new code often get little credit and it’s worse when they reengineer existing solutions, even when the result is markedly superior. It’s also difficult to get funding (“it’s a solved problem”). The fragmented nature of the commercial domain makes semantic interoperability very difficult – companies protect legacy walled-garden approaches. The internal messes created by unvalidated variants of legacy files in the pharma industry (e.g. when a merger requires data reconciliation) have probably cost well over 100 million dollars in human effort, while the B/O could have provided common semantics.
However I think we are approaching a breakthrough. Chemical software has made few objective advances in the last 10-15 years so that we now have implemented most of the major algorithms. For an organization which takes a responsible view of costs and values innovation, the Blue Obelisk can be an attractive part of a solution.
Guha, R; Howard, MT; Hutchison, GR; Murray-Rust, P; Rzepa, H; Steinbeck, C; Wegner, J; Willighagen, EL (2006). “The Blue Obelisk - interoperability in chemical informatics”. Journal of Chemical Information and Modeling.
O’Boyle, N; Guha, R; Willighagen, EL; Adams, SE; Alvarsson, J; Bradley, JC; Filippov, IV; Hanson, RM et al. (2011). “Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on”. Journal of Cheminformatics.