For the last month we have been developing an Open, distributed, automated Knowledgebase for computational chemistry. Millions of valuable data files are created each year, but almost none are published. Yet they are amongst the best, the most reproducible science in any discipline. They give believable results about the real world.
They aren’t published because there’s a lack of vision in the computational chemistry community. Unlike crystallographers, computational chemists haven’t espoused community databases. They haven’t seen the value of communal dictionaries to explain the concepts. There’s a variety of ad hoc interconversion systems rather than a unified approach.
We have now produced a prototype of this Knowledgebase. We can now automate and chain together the following steps:
- Search for unpublished data files
- Convert them to chemical markup language (CML)
- Validate the data against dictionaries
- Convert the results to RDF
- Upload them to a different Open database
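As a rough illustration of how the middle steps above might chain together, here is a minimal Python sketch. It is not the Quixote code: the dictionary terms (`cc:totalEnergy`, `cc:basisSet`), the function names and the simplified CML-like XML are all assumptions made up for this example, and the search and upload steps are left out entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary: term id -> type the value must parse as.
# Real compchem dictionaries are far richer; these entries are invented.
DICTIONARY = {"cc:totalEnergy": float, "cc:basisSet": str}


def to_cml_like(properties):
    """Wrap parsed values in a simplified, CML-like XML fragment."""
    module = ET.Element("module")
    for term, value in properties.items():
        prop = ET.SubElement(module, "property", dictRef=term)
        ET.SubElement(prop, "scalar").text = str(value)
    return module


def validate(module, dictionary):
    """Return a list of problems; an empty list means the fragment passed."""
    problems = []
    for prop in module.iter("property"):
        term = prop.get("dictRef")
        if term not in dictionary:
            problems.append(f"unknown term: {term}")
            continue
        try:
            # Check that the stored text parses as the expected type.
            dictionary[term](prop.find("scalar").text)
        except ValueError:
            problems.append(f"bad value for {term}")
    return problems


def to_triples(subject, module):
    """Emit one (subject, predicate, object) tuple per property."""
    return [
        (subject, prop.get("dictRef"), prop.find("scalar").text)
        for prop in module.iter("property")
    ]


def process(subject, properties, dictionary=DICTIONARY):
    """Convert, validate, then produce triples only if validation passed."""
    module = to_cml_like(properties)
    problems = validate(module, dictionary)
    if problems:
        return {"ok": False, "problems": problems, "triples": []}
    return {"ok": True, "problems": [], "triples": to_triples(subject, module)}
```

Calling `process("job1", {"cc:totalEnergy": -76.4})` returns the triples for that job, while a term missing from the dictionary is reported as a problem and nothing is passed on. Validation sits between conversion and RDF so that only data the dictionaries can explain reaches the database.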
This can be scaled, from hundreds of files to hundreds of thousands. You can run the Quixote system on your own file store and discover all your old, unpublished compchem files. Find what’s in them. You can index your disk while you sleep.
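To give a feel for what indexing a file store could involve, here is a small Python sketch that walks a directory tree and guesses which program wrote each output file. It is only an assumption about how such a scan might work, not how Quixote does it: the file extensions, the marker strings and the function names are all chosen for illustration.

```python
import os

# Assumed extensions for compchem output files; adjust for your own codes.
CANDIDATE_EXTENSIONS = {".log", ".out"}

# Assumed banner text near the top of each program's output (illustrative).
MARKERS = {
    "gaussian": "Entering Gaussian System",
    "nwchem": "Northwest Computational Chemistry Package",
}


def sniff_code(path, max_chars=65536):
    """Guess the producing program from the first max_chars characters."""
    with open(path, "r", errors="replace") as handle:
        head = handle.read(max_chars)
    for code, marker in MARKERS.items():
        if marker in head:
            return code
    return None


def index_file_store(root):
    """Map each recognised output file under root to its guessed program."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension not in CANDIDATE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                code = sniff_code(path)
            except OSError:
                # Unreadable files are skipped rather than stopping the scan.
                continue
            if code is not None:
                index[path] = code
    return index
```

Running `index_file_store("/path/to/your/files")` returns a dictionary of file paths and guessed codes. Only the head of each file is read, so a scan like this stays cheap even on a large disk.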
This means that the world’s compchem is effectively a distributed Open database. Automated systems can trawl the web and find what’s new in the servers; aggregate and re-distribute. Transform and re-purpose.
So all it needs is your belief. Data matters. Data can now be cited. You can publish and cite your data. Everyone benefits.
Where are we going now? There are some technical things to do (http://okfnpad.org/quixote20101021). Sam Adams will be bolting in his Chem# repository that comes out of the CLARION and #jiscxyz program. The Open Source Avogadro program will be a central tool on the client side for accessing and creating information.
In particular the project met its first milestone of creating a viable prototype within a month. There are bugs (e.g. running from behind a firewall with certificates can be a problem). We need more variety. And more sites (we currently only have 2). We need more people wanting to manage their compchem data better.
The project scales horizontally. We can add in new codes. We’d like to include NWChem – it’s Open source. NWChem volunteers please! We’d like example files from our current codes. We’d like people to help edit the compchem dictionaries.
We’ve submitted an abstract to the ACS meeting. We can’t make it Open as then we would be debarred from submitting it. But it talks about the project and how it will transform the semantic infrastructure of compchem.
But we are going forward. We plan a full meeting in March. We’re setting up a newsletter. We’ll have chemical search before year-end. We’ll have molecular orbitals extracted and displayed in Avogadro.
Lots of thanks to lots of people:
Jorge, Pablo, Jens, Lance, Mark, Marcus, Tamas, Sam, Weerapong. Six of whom I hadn’t met 5 weeks ago. Who are now actively working on our communal project. And thanks to the Blue Obelisk for creating so many of the components.