I've just got back from a wonderful meeting in Zaragoza on "Databases for Quantum Chemistry" (http://neptuno.unizar.es/events/qcdatabases2010/program.html ). [Don't switch off – most of the points here apply generally to scientific repositories and Open Knowledge.]
Quantum Chemistry addresses how we can model chemical systems (molecules, ensembles, solids) of interest to chemistry, biology and materials science. To do that we have to solve Schroedinger's equation (http://en.wikipedia.org/wiki/Quantum_chemistry ) for our system. This cannot be solved analytically (except for the hydrogen atom), so approximations must be made and there are zillions of different approaches. All of these involve numerical methods and all scale badly (e.g. the time and space taken may go up as the fourth power of system size, or even worse).
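To make that scaling concrete, here is a toy calculation. It assumes a simple fourth-power cost model for illustration only – real methods have their own exponents and prefactors:

```python
# Toy illustration of fourth-power scaling in quantum chemistry.
# If a calculation on a system of size N costs roughly c * N**4,
# doubling the system multiplies the cost by 2**4 = 16.

def relative_cost(n_small, n_large, power=4):
    """Ratio of computational cost when growing a system from n_small to n_large."""
    return (n_large / n_small) ** power

print(relative_cost(100, 200))   # doubling the system: 16x the work
print(relative_cost(100, 1000))  # 10x larger system: 10,000x the work
```

This is why choice of method matters so much: a calculation that is cheap on a small molecule can become utterly intractable on a modestly larger one.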
The approach has been very successful in the right hands, but it is also often applied without thought and can give misleading results. There is a wide variety of programs which make different assumptions and which take hugely different amounts of time and resources. Choosing the right methods and parameters for a study is critical.
Millions (probably hundreds of millions) of calculations are run each year and are a major use of supercomputing, grids, clusters, clouds, etc. A great deal of work goes into making sure the results are "correct", often checked to 12 decimal places or more. People try to develop new methods that give "better" answers and have to be absolutely sure there are no bugs in the program. So testing is critical.
Very large numbers of papers are published which rely in part or in full on compchem results. Yet, surprisingly, the data are often never published Openly. In contrast, for some disciplines (such as crystallography) it's mandatory to publish supplemental information or deposit data in databases. Journals and their editors will not accept papers that make assertions without formal evidence. But, for whatever reason, this isn't generally the culture and practice in compchem.
But now we have a chance to change it. There's a growing realisation that data MUST be published. There are lots of reasons (and I'll cover them in another post). The meeting had about 30 participants – mainly, but not exclusively, from Europe – and all agreed that, in principle, it was highly beneficial to publish data at the time of publication.
There are lots of difficulties and lots of problems. Databases have been attempted before and have not worked out. The field is large and diverse. Some participants were involved in method development and wanted resources suitable for that. Others were primarily interested in using the methods for scientific and engineering applications. Some required results which had been shown to be "correct"; others were interested in collecting and indexing all public data. Some felt we should use tried and tested database tools; others wanted to use web-oriented approaches.
For that reason I am using the term knowledgebase, so that there is no preconception of what the final architecture should look like.
I was invited to give a demonstration of working software. I and colleagues have been working for many years using CML, RDF, semantics and several other emerging approaches and applying these to a wide range of chemistry applications including compchem. So, recently, in collaboration with Chemical Engineering in Cambridge we have built a lightweight approach to compchem repositories (see e.g. http://como.cheng.cam.ac.uk/index.php?Page=cmcc ). We've also shown (in the Green Chain Reaction, http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction ) that we can bring together volunteers to create a knowledgebase with no more than a standard web server.
I called my presentation "A quixotic approach to computational chemistry knowledgebases". When I explained my quest to liberate scientific information into the Open, a close scientist friend (of great standing) asked "where was my Sancho Panza?" – implying that I was a Don Quixote. I'm tickled by the idea and, since the meeting was in Aragon, it seemed an appropriate title. Since many people in chemistry already regard some of my ideas as barmy, there is everything to gain.
It was a great meeting and a number of us found compelling common ground. So common that it is not an Impossible Dream to see computational chemistry data made Open through web technology. The spirit of Openness has advanced hugely in the last 5 years and there is a groundswell that is unstoppable.
The mechanics are simple. We build it from the bottom up. We pool what we already have and show the world what we can do. And the result will be compelling.
We've given ourselves a month to get a prototype working. Working (sic). We're meeting in Cambridge in a month's time – the date happened to be fixed and that avoids the delays that happen when you try to arrange a communal get-together. As always everything is – or will be when it's created – in the Open.
Who owns the project? No-one and everyone. It's a meritocracy – those who contribute help to decide what we do. No top-down planning – but bottom-up hard work to a tight deadline. So, for those who like to see how Web2.718281828... projects work, here's our history. It has to be zero cost and zero barrier.
- I set up an Etherpad on the OKFN site at http://okfnpad.org/zcam2010 – Etherpads take 15 seconds to create and anyone can play
- Pablo Echenique – one of the organizers and guiding lights of the meeting – has set up a Wiki at
- Pablo has also set up a mailing list at http://groups.google.com/group/quixote-qcdb
- We are planning to set up a prototype repository at http://wwmm.ch.cam.ac.uk
[I suggested the name Quixote for the project and it's been well received so that's what we are going with.]
I have also mailed some of the Blue Obelisk and they have started to collect their resources.
So in summary, what we intend to show on October 21 is:
- A collection of thousands of Open datafiles produced by a range of compchem programs.
- Parsers to convert such files into a common abstraction, probably based on CML and maybe Q5COST
- Tools to collect files from users directories (based on Green Chain experience and code, i.e. Lensfield)
- Abstraction of the commonest attributes found in compchem (energy, dipole, structure, etc.). This maps onto dictionaries and ontologies
- Automated processing (perhaps again based on Lensfield)
- Compelling user interfaces (maybe Avogadro, COMO, etc.)
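The parsing step above can be sketched very simply. This is an illustrative toy only – the real parsers will target the actual output formats of compchem programs and emit CML – but it shows the idea of pulling common attributes out of a log file into a record keyed by dictionary terms (the term names and log format here are invented for the example):

```python
import re

# Hypothetical dictionary terms mapped to patterns for a made-up log format.
# Real parsers would handle actual program outputs (Gaussian, NWChem, etc.)
# and emit CML against agreed dictionaries.
PATTERNS = {
    "compchem:totalEnergy": re.compile(r"Total energy\s*=\s*(-?\d+\.\d+)"),
    "compchem:dipoleMoment": re.compile(r"Dipole moment\s*=\s*(-?\d+\.\d+)"),
}

def parse_log(text):
    """Extract whichever known attributes appear in the log text."""
    record = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[term] = float(match.group(1))
    return record

sample = """\
Total energy = -76.026760
Dipole moment = 1.8546
"""
print(parse_log(sample))
```

The point of the common abstraction is exactly this: once every program's output is reduced to the same named attributes, indexing, searching and comparing across programs becomes straightforward.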
By giving ourselves a fixed deadline and working in an Open environment we should make rapid progress.
When we have shown that it is straightforward to capture compchem data we'll then engage with the publishing process to see how and where the supplemental data can be captured. This is a chance for an enthusiastic University or national repository to make an offer, but we have alternative plans if they don't.
We'll fill in some of the details later.
I'll tag this #quixotechem