The Quixote knowledgebase for compchem continues- and Open Bibliography??

#quixotechem #blueobelisk

It’s a few days since we announced that we had prototyped a distributed knowledgebase for computational chemistry – Quixote. We’ve already had useful and positive feedback. Here’s Anna:

Anna says:

October 25, 2010 at 8:12 am

Finally! I’ve been bemoaning the lack of and obviousness of a global compchem database for quite a while without the resources to do it. Where can I alpha-test, please? Well done guys.

Anna – you can become part of this; just go to and join the list and tell us what you would like and can offer. You don’t say whether you want to use an existing knowledgebase (e.g. for reference, starting geometries, teaching, etc.) or whether you want to store and publish your own work. We’d be particularly interested in collections of legacy log files that you think would be of general interest. (At present we’d see the files being saved in specific collections so that people have a natural way of browsing them, but of course the whole resource will be searchable).

I’ve also had an offer from Henry Rzepa. He has reposited several thousand files in DSpace. Trouble is that once a collection is in DSpace you can’t get it out. That happened to me. So my challenge to the DSpace repositarians is:

“How do I download 5500 entries in DSpace?”

I do not, obviously, have time to click through all of them. I would be prepared to spend 30 minutes of my time. No, since it’s Henry, I am happy to spend 60 minutes. If it’s easier than I thought I will gently revise my opinion of Institutional Repositories for data. (But this is only one of the reasons why IRs don’t service scientific data). If it’s effectively impossible without writing my own scraper then I shall continue to look elsewhere. (After all this is partly why we built Quixote).

Then I have had a wonderful message from another correspondent, a mainstream organic chemist. They write:

I am just starting out as a lecturer in … and I am increasingly aware of the problems of how I (and other members of the compchem community) handle and report our calculations. With my new principal investigator hat on, I am particularly concerned that:

(a) when students/post-docs come and go, their compchem data is not lost forever on their laptops/desktops

(b) rather than reading a doc/xls/pdf file summary of computations I will want to check the input/output files, particularly with new students to make sure they are doing things correctly

(c) when we publish work other groups are able to re-use our data easily (how much time have I wasted reconstructing input files from weird PDF formatting??!)

(d) that my work is reproducible. 

I currently have output from Gaussian, NWChem, Orca, Gamess-US, Macromodel, Jaguar, Tinker on various machines/servers/external disks, which is clearly sub-optimal. The idea of some nice QM data in a searchable repository would be pretty cool too for the force-field community, as all the ab initio parameterization was done using HF and MP2 with small basis sets for small molecules – I would imagine in principle many of the calculations required to parameterize/reparameterize a forcefield with more exotic molecules and more sophisticated levels of theory have been done by various people already….

I was recently tasked by the journal[…] to see if authors were following their editorial guidelines on compchem Supporting Info – from a sample of ten recent papers using DFT calculations we found that the requisite Cartesians [coordinates of atoms], absolute energy and imaginary frequencies (for [transition states]) were usually present although not always (in spite of a check list that authors have to use). Deposition of the original files in the same way as we do with cifs [crystallographic data files, required by journals] already would solve this problem. Another little anecdote – I wanted to visualize the MOs of retinal the other day, so I had to calculate them myself – hundreds of people must have done this already at some point, so if I could search for this data rather than waste hours of time/electricity that would be much better.


…if I can get involved/test or supply certain file formats/ raise awareness then please let me know! 


The work with the Journal is extremely valuable. As my correspondent says, all we have to do is persuade the Journal to publish the supporting info. We (in the JISCXYZ project) can validate it. Compchem is the best of all fields (even better than crystallography) for an almost complete data validation before refereeing. It’s even possible to do it oneself and sign the result.
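
As a rough illustration of the kind of automated check this makes possible, here is a minimal Python sketch that tests whether a deposited record carries the items my correspondent lists: Cartesian coordinates, an absolute energy and, for transition states, an imaginary frequency. The record layout, the field names and the function `check_record` are all hypothetical, invented for this example; they are not part of Quixote or of any journal's guidelines. The sketch also assumes that imaginary modes are stored as negative wavenumbers, and that a transition state should show exactly one of them (the usual expectation for a first-order saddle point, not something the check list above spells out).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CompchemRecord:
    """Hypothetical summary of the supporting info for one calculation."""
    # One (element symbol, x, y, z) tuple per atom.
    cartesians: List[Tuple[str, float, float, float]] = field(default_factory=list)
    absolute_energy_hartree: Optional[float] = None
    # Assumption: imaginary modes are stored as negative wavenumbers (cm^-1).
    frequencies_cm1: List[float] = field(default_factory=list)
    is_transition_state: bool = False


def check_record(rec: CompchemRecord) -> List[str]:
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    if not rec.cartesians:
        problems.append("missing Cartesian coordinates")
    if rec.absolute_energy_hartree is None:
        problems.append("missing absolute energy")
    if rec.is_transition_state:
        # Count the modes flagged as imaginary (negative by our convention).
        n_imag = sum(1 for f in rec.frequencies_cm1 if f < 0)
        if n_imag != 1:
            problems.append(
                f"transition state has {n_imag} imaginary frequencies, expected 1"
            )
    return problems
```

A real validator would of course read these values from the original log files rather than from a hand-built record, which is exactly why depositing the files matters.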

And another correspondent, this time from the Blue Obelisk. [I asked about REST and some of the problems with firewalls.]

I am forwarding this message to […] developers list since it is very similar to what […]project  is about.

Perhaps we can collaborate with reusing/extending existing […] REST services with similar functionality and sharing experience with development of  web services based on RDF .

One of my colleagues was managing certificate authority for EGEE grids and running a certificate authority, so there is such experience.

And just for the record,  distributed security in REST  is not trivial nor simple at all … there are no well accepted solutions currently.
This might be the major obstacle in front of any REST approach aiming at distributed services, rather than single REST site  , as most major commercial REST services these days.

We are currently adopting not optimal (centralized) solution in […] and GEANT-wise [A European GRID project and infrastructure] I am involved in a small group preparing an RFC for a protocol for REST security .  

Just sharing what we are struggling with for two years already, having tens of distributed REST services over Europe , 5 independent implementations in two languages, covering at least half of the functionality listed in your email.

This is all really exciting. You start to see how volunteers start to make major contributions to the project.

I am sure the redacted names will become public soon …

Now – if only we could get the same sort of excitement for bibliography. I’ve posted that we now have millions of Open Bibliographic records. Currently I have interest from 2 chemists, a zoologist, a mathematician, an economist, and several hackers. I met 4 librarians at JISC on Friday and bounced up to each and said “We’ve now got the British Library bibliography! Open! 3 million records. What can we do together?”

The answer was that they weren’t interested at all. “Why should we be interested in some other library’s catalogue?” “Bibliography isn’t interesting.” I was gobsmacked. Bibliography is the soul of scholarship. I thought that by collecting bibliography and turning it into an intelligent semantic resource we would start a new era in the library.

I really don’t know now what I am going to say to the Research librarians in Edinburgh.

2 Responses to The Quixote knowledgebase for compchem continues- and Open Bibliography??

  2. Bram Luyten says:

    For DSpace 1.6.2, you can export collections through a command line interface program. So you will require access to the DSpace server.
    [dspace]/bin/dspace export --type=COLLECTION --id=collID --dest=dest_dir --number=seq_num
    Short form:
    [dspace]/bin/dspace export -t COLLECTION -i CollID or Handle -d /path/to/destination -n Some_number
    more info and settings, refer to sections 5.2.18 and 8.3 of
    (my apologies if this shows up as a duplicate comment)
