petermr's blog

A Scientist and the Web


Archive for October, 2010

Beyond the PDF. Can you read this? Can your computer?

Saturday, October 30th, 2010


I am delighted to have been invited to a very important meeting in January that Phil Bourne and Anita de Waard are organizing. Why do we want to go Beyond the PDF? Isn’t the PDF the epitome of human creativity and aesthetics? Doesn’t it produce the most elegant means of communication the world has ever seen?

Here’s some chemical information. Sorry, a Chemical Hamburger. No, it’s actually a chemical cow-pat. A machine wrote it and it was sparkling clear.

  • How did it get to this state?
  • Can you read it? Would you bet your life on your interpretation?
  • Can your machine read it? I’d be interested to know how well OCR does.

FortranFormat, iChemLabs and Quixote (and a Test)

Friday, October 29th, 2010


In the Quixote project we are systematically developing Java (and other) Open Source tools for managing the input, output, transformation, storage, searching and publication of computational chemistry. We intend to parse complete logfiles (not just the juicy bits) and so come across a lot of really grotty FORTRAN output. If you are thinking of writing a tool to parse FORTRAN output:

  • Stop
  • Don’t

Because there are so many beartraps that you will fall into unless you have actively worked with FORTRAN.

Here’s a simple question:

How many INTEGERs have been output in this 7-character string output by a FORTRAN program?


There is an answer. (No, Unicode is not involved – these are honest to goodness EBCDIC characters punched into a standard Hollerith card…) But the answer is not trivial.
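To show why the answer is not trivial, here is a minimal Python sketch. It assumes the program wrote its integers with a fixed-width edit descriptor such as `I7` or `7I1`. The record `1234567` and the helper `split_fixed_width` are hypothetical, chosen only for illustration; they are not the string from the question. The same seven characters decode to a different number of INTEGERs depending on the FORMAT statement, and the FORMAT statement is not in the output.

```python
def split_fixed_width(record, width):
    """Cut a FORTRAN output record into integer fields of a fixed width.

    Fields are located by column position, not by whitespace, because
    FORTRAN writes adjacent fields with no separator when a value fills
    its whole field.
    """
    values = []
    for start in range(0, len(record), width):
        field = record[start:start + width]
        if field.strip() == "":
            continue  # blank field: treated as "nothing written" (an assumption)
        if set(field) == {"*"}:
            values.append(None)  # asterisks mean the value overflowed the field
            continue
        values.append(int(field))
    return values


record = "1234567"  # hypothetical 7-character record
print(split_fixed_width(record, 7))  # interpreted as one I7 field
print(split_fixed_width(record, 1))  # interpreted as seven I1 fields
```

A parser therefore needs the field widths from the program's source or documentation; splitting on whitespace only works by luck.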

So I am starting with GamessUS punch files and running a tutorial on how to write a parser. Among other things we have to parse chunks of Fortran output. So before writing my own I went to StackOverflow and asked:

Within about 7 minutes I had an answer. I accepted this as the best answer. That gives me two points. And I have since got 2 votes! So I get 2*5+2 =12 points. Not bad.

So public thanks to iChemLabs and Kevin J. Theisen. It’s great to see a chemical software company producing Open Source. It means one less set of wasted days duplicating what other people have done. How many other companies have Open Source that would save everyone labour and allow us to do innovative work? Because writing a FORTRAN parser isn’t fun and isn’t innovative.

FWIW it looks to have been done properly – with a BNF and JavaCC. That means that the code is written by a Compiler Compiler. All that is required is the FORTRAN language spec and, of course, code to parse the different elements.

I’m planning to mavenize this and put it in a repository. I’m allowed to do that as it’s Open Source. All I have to do is say thank you and keep the licence attached to the code distro.


Span, Span, Span, Span, Span, loverly Span.

Friday, October 29th, 2010


Here is some code. Why won’t it compile?

void transformAtomArraysIntoFrequencies() {

        Nodes atomArrayNodes = xmlInput.query(

                “./*[local-name()='atomArray' and @dictRef='gamessuk:normal_coordinates']“);





Gillean (see comments) has cracked it.

I cut-and-pasted it out of Eclipse into Word (where I think the quotes were honoured as U+0022). It was then published DIRECTLY by Word to WordPress. One of them corrupted it.

The underlying source in WordPress looks like:

<span style=”color:#7f0055″><strong>void</strong><span style=”color:black”> transformAtomArraysIntoFrequencies() {</span>

                    </span></span></span></span></p><p><span style=”color:black; font-family:Courier New; font-size:10pt”> Nodes atomArrayNodes = <span style=”color:#0000c0″>xmlInput<span style=”color:black”>.query(</span>

            </span></span></p><p><span style=”color:black; font-family:Courier New; font-size:10pt”> <span style=”color:#2a00ff”>”./*[local-name()='atomArray' and @dictRef='gamessuk:normal_coordinates']“<span style=”color:black”>);

(The quote is still U+0022 at this stage.) Note the emulation of the Vikings in span,span,span,span, span loverly span. What garbage. WordPress is one of the most awful tools I have ever used for editing. It corrupts everything. (Ok it’s probably an old version – I am not in control).

Then if we look at the page source of the post we find

 <span style=”color:#2a00ff”>“./*[local-name()='atomArray' and @dictRef='gamessuk:normal_coordinates']“<span…


ARGHH! ARGHH! WordPress has changed the quotes into smart quotes (U+201C, which HTML writes as &#8220;). ARGHHH. I can’t stop it. Blink, and it reverts. You cannot cut the heads off the hydra. It regrows. Sometimes it multiplies. And you end up with rows of empty Spans.


It has beaten me. Which is why I don’t post code any more.


(Yes I have tried code formatters and other things in WordPress. No use).
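For readers who copy code out of a page that has been mangled like this, here is a minimal Python sketch for spotting and undoing the damage. It is not a WordPress fix. The names `find_smart_quotes` and `restore_plain_quotes` are made up for this example, and the mapping covers only the four common typographic quote characters.

```python
# Typographic quotes and the plain ASCII characters a compiler expects.
SMART_TO_PLAIN = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def find_smart_quotes(text):
    """Return (offset, code point) for every typographic quote in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text)
            if ch in SMART_TO_PLAIN]


def restore_plain_quotes(text):
    """Replace typographic quotes with U+0022 / U+0027."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))


broken = "\u201c./*[local-name()='atomArray']\u201c"
print(find_smart_quotes(broken))
print(restore_plain_quotes(broken))
```

This should only be run over code. In ordinary prose the curly quotes are usually intended, and flattening them loses information.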

A common problem in informatics (UPDATE)

Thursday, October 28th, 2010


Amusement (hopefully). You don’t have to understand theoretical chemistry to take part.

The following ASCII text represents the output of a theoretical chemistry program. It contains an error. This error was not created by the program (and I have removed its name (XXXXXXXXXXX) so as not to cast any aspersions on it). The error is typical of a general problem in scientific information.


N-N= 1.312300318470D+03 E-N=-4.587475340442D+03 KE= 8.364019136890D+02

AllDun Frequency-dependent properties on file 20721 Mask= 2 NFrqRd= 1 NDeriv= 1 LenFil= 12:

Frequencies= 0.077357

Property number 2 — FD Optical Rotation Tensor frequency 1 0.077357:

1 2 3

1 -0.139106D+02 0.342269D+02 0.258275D+01

2 0.381206D+02 0.255976D+02 0.312180D+02

3 0.157839D+02 0.161937D+03 -0.135731D+02


Job cpu time: 0 days 1 hours 53 minutes 7.4 seconds.

File lengths (MBytes): RWF= 419 Int= 0 D2E= 0 Chk= 6 Scr= 1

Normal termination of XXXXXXXXXXX at Fri Oct 03 15:18:55 2010.



The error can be detected by computer more easily than by humans.

Please indicate:

  1. The error
  2. How you think the error might have arisen in practice
  3. Whether there might be other undetected errors in the document


Two people have found the error. Well done. They have not hypothesised how it might have occurred. This requires a flash of inspiration and/or common exposure to this very common problem.



The error is in:

Normal termination of XXXXXXXXXXX at Fri Oct 03 15:18:55 2010.

This is an impossible date (in any current Chronology). So how could it have happened? And I promise you that this type of problem occurs zillions of times every day. And no, it’s not human mistyping (though that destroys and corrupts science very effectively).


UPDATE: The date was actually Fri Oct 08 15:18:55 2010 . Does that give any clues as to what could have happened?
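This is the kind of check a machine does effortlessly and a human rarely does at all. Below is a minimal Python sketch that tests whether the day name in such a timestamp agrees with the calendar date. The function name `weekday_matches` is hypothetical, and the sketch assumes the five-field layout seen in the log line above (day name, month name, day, clock time, year). It only detects the inconsistency; it says nothing about how the error arose.

```python
import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def weekday_matches(stamp):
    """Check that the day name in 'Fri Oct 03 15:18:55 2010' fits the date."""
    day_name, month_name, day, _clock, year = stamp.rstrip(".").split()
    date = datetime.date(int(year), MONTHS.index(month_name) + 1, int(day))
    # date.weekday() counts from Monday = 0, matching the order of DAYS.
    return DAYS[date.weekday()] == day_name


print(weekday_matches("Fri Oct 03 15:18:55 2010."))  # the date as published
print(weekday_matches("Fri Oct 08 15:18:55 2010"))   # the date as originally written
```

The names are hard-coded rather than taken from the locale so that the check behaves the same on every machine.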

Chemical MIME and the role of the IETF

Thursday, October 28th, 2010

I’ve just described Chemical MIME – not in great detail, more to illustrate a highly virulent meme. Chemical MIME is what Fowler would call a “sturdy indefensible”. It breaks the rules, but it is used, it works and it upsets few except pedants.

Until now. Read on, even if you are not a chemist, because it’s a general problem in modern informatics. And we need your help. I don’t know how to solve it.

Egon has pointed out a problem. It’s hit the KDE bug list. It’s a software bug, not a chemical bug:



Bug 235563 – invalid MIME type in /usr/share/applications/kde4/kalzium.desktop





Description From Laurent Bonnaud 2010-04-27 19:19:39

Version: 2.3.80 (using 4.4.2 (KDE 4.4.2), Kubuntu packages)

Compiler: cc

OS: Linux (i686) release 2.6.32-21-generic-pae


Here is the problem:


# update-desktop-database


Error in file “/usr/share/applications/kde4/kalzium.desktop”: “chemical/x-cml”

is an invalid MIME type (“chemical” is an unregistered media type)


What does this mean?


It means that a server has labelled a file with the MIME type (Content-Type) as chemical/x-cml


And that the application software has said that’s invalid.


And the application software is pedantically right.


So, best beloved…


In the early days of the Internet when ordinary people hacked servers and small furry penguins were small furry penguins, there was a brilliant idea to label content with its type. It was a brilliant idea and it still is a brilliant idea. It means that anyone in the world, on whatever platform, getting documents from whatever server could determine their type. All you had to do was add a simple text-string and the machines would recognise it.


So if you were transmitting a piece of text, you could label it “text/plain”. And an image might be labelled “image/png”. If you didn’t do this then you couldn’t know whether the bit stream was meant to be displayed as text (e.g. in the body of a mail message) or as an image accompanying the mail.


Mail? I thought we were on browsers?


No. This far predates the browser. Reading the history gives you an idea of the enormous contribution made to the Internet and the modern world by the great body of those dedicated to interoperability. The Internet is based on RFCs.


RFCs? Without RFCs there would be no HTML. There would be no Google. No Facebook. No HTTP. No Wikipedia. No online pornography. There would be a bickering mass of companies fighting in a sludge of non-interoperability. Everyone would have their own server spec. Everyone would have their own client spec. I remember that time. It was awful. A Holy Roman Empire of isolated barons.


One of the greatest achievements of the twentieth century was the Internet. And it succeeded because of the IETF. The IETF?


Their goal: “The goal of the IETF is to make the Internet work better.”


Their motto: “Rough consensus and running code”. This is a great step towards the democratisation of the world through technology. It has led not only to a working system of physics and software but has also become a touchstone for this century’s democracy. It’s exemplified in Wikipedia. It means listening to the other person’s point of view. And agreeing to come away with something that works.


In the IETF system, anyone can put forward a proposal. It’s called a draft. Here is ours:

Document type: Old Internet-Draft (Individual document)

Document history: Draft expired

New version available: draft-rzepa-chemical-mime-type-01 (diff from -00)

This Internet-Draft is no longer active.

The purpose of this Internet Draft is to propose an update to Internet RFC 1521 to include a new primary content-type to be known as chemical. RFC 1521[1] describes mechanisms for specifying and describing the format of Internet Message Bodies via content-type/subtype pairs. We believe that chemical defines a fundamental type of content with unique presentational and processing aspects. We outline the typical expected uses of such a content type and propose a number of chemical sub-types. This document updates IETF Internet Draft draft-rzepa-chemical-mime-type-00.txt in which this specific proposal was made, incorporates suggestions received during the initial discussion period and indicates scientific support for and uptake of this proposal[2-7].

Henry Rzepa

P. Murray-Rust

B. J. Whitaker

We put the idea into the IETF framework. It was a Draft, not an RFC. We had 6 months to convince the IETF. Henry went to a meeting. There was lots of discussion. One suggestion was that it could be used to send recreational drugs over the network. (Since I was working for Glaxo I wasn’t wild about being associated with this idea and it was not pursued in the body of the draft!).


The draft had a lot of supporters but it failed to get critical mass. It lapsed. Not enough rough consensus.


MIME is an excellent idea but its implementation does not allow easy extensibility. There’s a hardcoded set of types of the form foo/bar – seven major types and many secondary ones. Everyone knows that hierarchical classification systems break down sooner or sooner.


MIME’s extensibility was through “x-”. So suppose you had a new image format called penguin (designed to transmit pictures of penguins), you might write “image/x-penguin”. At some stage in the future it might become accepted as a standard part of MIME.


So we started creating chemical/x-pdb, chemical/x-cml, etc. The idea took off. There are probably hundreds of millions of documents labelled with chemical MIME. OK, the IETF didn’t want to know about them but they worked. The MIME system did not appear to require known MIME types.


And they have worked for 15 years.


Until, apparently, now. The software above checks primary MIME types. “chemical” isn’t one of them. So it throws an exception. It’s “right”.


But it’s not helpful.
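To make the failure concrete, here is a minimal Python sketch of a strict check on the primary type. It is NOT the code of `update-desktop-database`. The function `check_mime_type`, its `allow_x_top_level` switch and the exact rule are assumptions made for illustration, and the set of primary types is just the seven defined in RFC 2046.

```python
# The seven top-level media types defined in RFC 2046. Types registered
# later (for example "model") are deliberately left out of this sketch.
RFC2046_TOP_LEVEL = {
    "text", "image", "audio", "video",
    "application", "multipart", "message",
}


def check_mime_type(mime_type, allow_x_top_level=False):
    """Return "ok" or a complaint, judging only the primary (top-level) type."""
    top, sep, sub = mime_type.partition("/")
    if not sep or not top or not sub:
        return "malformed"
    top = top.lower()
    if top in RFC2046_TOP_LEVEL:
        return "ok"
    if allow_x_top_level and top.startswith("x-"):
        return "ok"  # hypothetical leniency for experimental primary types
    return f'"{top}" is an unregistered media type'


print(check_mime_type("image/x-penguin"))
print(check_mime_type("chemical/x-cml"))
print(check_mime_type("x-chemical/pdb", allow_x_top_level=True))
```

Under this rule the “x-” escape hatch only helps in the subtype position, which is why image/x-penguin passes and chemical/x-cml does not.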


What to do? I really don’t know. I can think of the following:

  • Go back to the IETF. Chance of success? 0.00000001
  • Get the chemical world to change to another MIME type (it’s possible that “x-chemical/pdb” would be allowed. But it might not). It would destroy hundreds of millions of working documents.
  • Fix the behaviour of KDE. Chance of success 0.00001
  • Ignore the problem
  • Try some awful kludgy workaround


How important is this problem? I don’t know. Is MIME becoming stricter? I doubt it. Are more systems validating it? ??


In so far as it is a problem it reflects the lack of community approach in chemistry. The chemical software industry is based largely on non-interoperability and lock-in. All the approaches – and there aren’t many – come from outside either the software vendors or the pharma industry. Chemical MIME; CML; The Blue Obelisk; InChI. None of these have been industry-led. They succeed to the extent that they fill an essential need. Pharma ought to care – it doesn’t publicly show it. Software industry ought to care. It doesn’t until it’s forced to. I am not surprised by this – standards come when the industry is in a mess and they are essential, and we are at that stage now.


There will be a considerable number of new MIME types registered as a result of the Quixote project. We need to know the precise types of computational input and output. We do this without the active help of the companies producing the tools that create these files.


For Chemical MIME we will keep buggering on.






Read the comments…


What makes an Internet meme? Chemical MIME and CML… “We must just KBO.”

Wednesday, October 27th, 2010

#quixotechem #jiscxyz

I am still intrigued (?amazed) by the unpredictability of new ideas and technology on the Internet. All I do is fire off memes and see what happens. I’m reasonably experienced in creating memes. And even more experienced at non-memes. I’m no better at predicting success than anyone else.

For example in 1994 Henry Rzepa and I developed Chemical MIME (a way of supplying chemical typing in mail and server headers). It didn’t take off in the IETF but when we released a package of free software and specs it raced through the Internet in weeks. Admittedly at that time the scientific Internet (the number of sites) was smallish, but configuring clients and distributing software was also difficult. I surmise that the meme (like a virus) has to have 3 essential features:

  • The ability to infect someone, simply by its own power. For Chemical MIME this was the ability to display and rotate beautiful coloured molecules and to ask scientific questions. There was hardly a chemist in 1994 who would not go “wow!”. So infection was facile.
  • A near-zero or zero cost of replication. If the host has to expend a lot of energy to create replicas of the memes then the process slows. (For example if the host has to manufacture “stuff” – as in the RepRaps – then most people will not replicate.) In the case of Chemical MIME all the host has to do is to clone the incoming material – this involves obtaining and mounting copies of the displayed molecules and free software.
  • The desire to replicate and mutate. In this case the host wants to show the world what *they* can do. They create displays of their own molecules, which may be even more striking than the ones they saw. They also help to improve the process of replication – better tutorials, slicker web pages, improved software. So the process accelerates and prospers.

When we created Chemical Markup Language (in ca. 1994) we thought it might be rapidly copied – that in a year or two everyone would be using it. It took until 1997 for XML to be created, but after that the world seemed to explode. We heard of financial consortia for XML which closed doors after 2 weeks. Tremendous hype. So clearly chemistry would be sucked up in the rush? Not quite.

In fact the same is true of MathML. It’s also taken its time. But both CML and MathML make steady progress. We keep hearing of new users and applications and we keep developing our toolkit. There isn’t an alternative to XML or similar language to represent the complexity of chemical documents (most of the legacy approaches deal only with molecules, or possibly simply reactions). CML can represent whole computations, crystal structures, preparations, etc.

“To everything there is a season” – I learnt at school (Ecclesiastes 3:1-8 NKJV) and timing is critical on the Internet. Because the Internet doesn’t care WHO wins, it just cares that someone does. So there was bound to be an Internet Encyclopedia, but Jimmy Wales’s wasn’t the first. Google was nowhere near the first search engine, but it hit an optimum of timing, design and performance. In rapidly changing fields – such as social networking – you have to hit everything at the right time. But some things have a different timescale. Launched too early, they fail to take off (or take root). Launched too late, and the early birds beat the latecomers. Like biological ecosystems we should expect great wastage. Ultimately if a meme has a potential place in an ecosystem its progenitors must keep launching it until it succeeds.

In WWII, Winston Churchill came up with the simple formula “We must KBO” (Keep Buggering On). It’s a simple, heartening formula. It evens out the highs and the many lows. It’s based on absolute faith that one will succeed. During the low periods you have to keep going, however boring, hopeless, however many setbacks. Rather than the heady explosion of Chemical MIME, CML has been a long long slog. There’s never been any Gartner curve (Hype cycle – Wikipedia). No peak of inflated expectations; no trough of disillusionment. Mainly long slog, with general apathy, sometimes hostility, and occasional positive moments.


There have been many friends on the journey: The Blue Obelisk; the Earth Science community; Microsoft Research; the bioscience community; and others. The time is now coming. We’re continually finding people who are starting to use CML. We’ve got a million lines of Open code written. There are clear applications – semantic publishing, computational chemistry, crystallography which simply cannot be done by other technologies.

And there’s the Open semantic revolution. Conventional tools cannot support either of these – it needs a rich structured language technology with vocabulary support.

It’s the ultra-rapid takeoff of Quixote which has finally convinced me that the technology is mature enough for success. We’re on JUMBO5. Stable. Schema 2.4. Stable. Tools for much of the computational work. 200,000 structures in Crystaleye. 200,000 downloads of Chem4Word.

There’s a lot more “buggering on” required. “Blood on the floor at midnight” it feels like to me. Rewriting the code for the 5th time isn’t fun. The Blue Obelisk has been a lifesaver. The breathing space in Chem4Word has given us a stable, viable, robust validating toolkit. The Earth science experience has given us belief in a whole sector – compchem. Quixote has been a delight.

We have tunnelled through to the Slope of Enlightenment (in Gartner’s terms). If you want to join in, now is the start of a highly productive time.



REST in peace in the Matrix

Tuesday, October 26th, 2010

#quixote #jiscxyz

Many of the ideas that we’ve had on the World Wide Molecular Matrix are now starting to become possible. In my innocence in 2002 I thought that imaginings were one step from reality. The bits in between were so easy to conceive that they wouldn’t take much time.

They’ve taken 10 years.

However I can look back and find that many of the ideas are pretty much unchanged. Here’s a picture crafted at least 5 years ago which describes different types of site (server) in the WWMM. The details aren’t important.

It’s the technology that has risen up to meet them. And the general acceptability within the community. Back then it was WSDL and UDDI (imagine that!), oh and SOAP. And Portals. It’s taken courage to strip all that away and go back to the simple ideas. REST, schema-less designs, flexible vocabularies.

But most of all that the whole system is Open. I hadn’t realised back then how much of a drag AAA was. Authentication, Authorization, Accounting. They kill projects. Much of the eScience program was struggling with these monsters.

Of course if you want to transfer money between banks you need this. That’s why we pay bankers enormous salaries and bonuses. But for Open science we don’t need anything. A few social controls. Keep the spammers and wreckers out. Make sure people don’t DOS the system.

So Sam Adams has planned all this and I’m convinced his design is what modern eScience should be. (We also owe a lot of this to Jim Downing). All the bits are there. We need the following sorts of server (they are not quite what is in the picture, but share the general idea):

  • One that allows scientists to publish their data to the Open web. Pablo Echenique has already done this in the Quixote project. But not everyone is allowed to run a server.
  • A server to which anyone can upload. Anyone who is not a spammer. Sam will tell us how that can be done easily.
  • A server that scrapes the exposed web. This can be Pablo-type sites, or journals, or anything. Even Institutional repositories if they expose an iterator over the data (which most do not). Its results are exposed and read-only. It offers search and indexing.
  • A customisable repository with embargo. Chem#, pronounced Chempound. It’s the result of several JISC projects – SPECTRa, CLARION, JISCXYZ – and it’s coming together now. A few bits to come but RSN. It will allow people to store their data responsibly while they need it and archive it later.

The WWMM is not restricted to molecules. The architecture will handle anything that’s semantic. It hates PDF. It hates Powerpoint (unless in XML). It likes anything in text and in XML. It’s not wild about images yet. The day will come shortly when images are semantic.

And if you want to find out more, just join the Quixote list. You don’t have to be a chemist. You just have to enjoy seeing Open scientific data.

“books” On the Cambridge Train

Tuesday, October 26th, 2010

I’m coming back from JISC and again sitting on the floor among the Bromptons. Alice and Bob are in their regular seats. They must get out earlier than me or rush along platform Zero faster than the average punter. (The 1645 is not a good train to arrive just-in-time for unless you like bicycles). Anyway I catch part of their conversation.

A: So what are you reading?

B: “Jane Austen’s Pride and Prejudice”

A: You mean “Pride and Prejudice”.

B: No it’s called “Jane Austen’s Pride and Prejudice”.

A: Who’s the author, then. I thought it was Jane Austen.

B: It’s by “eyePoodle”.

A: ?????

B: Yes, it’s the name of the company that makes this e-reader.

A: But the book is by Jane Austen, right?

B: Well sort of. She wrote most of the words, but eyePoodle actually wrote the book.

A: You mean they copied her words.

B: No, they’ve actually changed them to give a better user-experience.

A: What the hell is a “user experience”?

B: It’s what you get when you buy an eyePoodle.

A: OK – well how does it start? I know this by heart from first-year English Literature. It should be “It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.”

B: Well more or less. It says “It’s generally true that a wealthy man needs a wife”.

A: !!!!! That’s appalling. Why have they edited it?

B: The long words don’t fit the screen well. So they’ve shortened some of them. And they’ve made the sentences easier to read.

A: They’ve ruined it. Let me see.

B: You won’t be able to read it.

A: Yes I can – I don’t need glasses.

B: Only I can read it.

A: Bullshit. Give it to me…

B: OK …

A: [Stares at blank screen]. How do I switch it on?

B: It is on. I told you only I can read it.

A: What do you mean?

B: It’s DRM’ed.

A: ?????

B: Digital Rights Management. Only I can read it. I have to hold my thumb over this fingerprint reader.

A: OK, pass it over and pass your thumb over.

B: It’ll be the wrong way up.

A: No it won’t. Look

B: No my THUMB will be the wrong way up. You’ll have to sit on my lap…

A: Easy tiger…

B: Anyway I’ll have to get back to reading.

A: Why the rush?

B: I’ve only got 4 hours left.

A: ?????

B: You only get the book for 24 hours. Then you have to pay more.

A: So who owns the book?

B: eyePoodle. They’ve started buying up books for the eyePoodle. That’s why the title is slightly different. Then they can copyright it.

A: You mean that because they’ve rewritten it they can copyright it?

B: Yes, and every 60 years or so they’ll alter a few words and recopyright it. Great business. I’ve got shares in eyePoodle.

A: Well I’ll go to the library and get the real book. And I’ll make my own copy as Jane Austen’s been dead for years.

B: You can’t – they got rid of the books and replaced them with eyePoodles.

A: I’ll go straight to my Reader Services and DEMAND a copy.

B: Sorry it’s now called “Vendor Services”…

… Royston …Next stop Cambridge …


Posted on OKF open-bibliography:

No – I hadn’t seen this before writing the blog.


Monday, October 25th, 2010

I asked a question about stereochemistry and owe the community an answer. The answer is my answer. It may or may not be “right”. I don’t like the word “right” in science. But I hope it’s acceptable to those who think about the problem. I showed two pictures

And asked what the relationship of these molecules was. The “right” answer was to be that it was impossible to tell as there is a stereocentre in the middle of the molecule that is undefined. But it was suggested that because the two molecules were drawn in the same way then we might (not “should” but “could”) assume that the stereocentre was consistent. In which case we could say that although we didn’t know what the molecules actually were we could say they were geometrical isomers, not enantiomers.

If I had drawn all the centres explicitly

Then we could say “definitely” that the two were geometrical isomers (cis/trans, configuration, … I am using terminology).

There’s an assumption that conformation doesn’t play a role – that the cyclobutane ring flips rapidly enough to “average” the structure. Here’s Wikipedia’s example of cyclohexanes:

Alicyclic compounds can also display cis-trans isomerism. As an example of a geometric isomer due to a ring structure, consider 1,2-dichlorocyclohexane:



Note that the cis- compound as drawn can have enantiomers. What we all “know” is that at room temperature they interconvert so that the molecule is not optically active. But if we cool it down or look at a very short time scale then it would, indeed, have enantiomers. So we have to be very careful in how we phrase the questions because people make assumptions. And my assumption is often not your assumption.

==== Next divertissement ====

I am now writing parsers for compchem log files. This engages some of the pleasure/hate centres of my brain in the same way as Sudoku does – and it’s slightly more productive. Here’s a typical bit of a log file. It doesn’t matter what the numbers are or what they mean

…. . . . omitted . . .

28 H 2.700772 3.229400 5.731856 4.467482 7.448261

29 H 1.099958 2.072781 5.129949 3.144928 5.344895

30 H 1.920507 0.965845 5.216951 4.614282 6.231286

21 22 23 24 25

21 H 0.000000

22 H 1.777203 0.000000

23 H 5.002806 5.524624 0.000000

24 H 6.096661 6.315358 1.774278 0.000000

25 H 7.214158 8.252435 4.711394 4.561542 0.000000

26 H 8.228138 9.136704 6.730950 6.277461 2.501917

27 H 7.915186 8.417597 7.182659 6.464504 4.302447

28 H 6.508693 6.586577 5.931001 5.094904 4.950885

29 H 4.817286 4.590808 4.296008 3.705276 5.455187

30 H 6.292404 5.863642 3.978192 2.711028 6.081788

. . .Snipped.. .

Question: (a) what would be the next line – precisely – and (b) what would the line after that start with?

And how would you write a parser for it?
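One possible starting point, sketched in Python under loud assumptions. It treats any line made up only of integers as a column header, and any line starting with an atom number and an element symbol as a data row. It also assumes the tokens are separated by whitespace, which is exactly the sort of thing FORTRAN fixed-width output does not guarantee. The function name `parse_distance_blocks` is hypothetical, and this is not the Quixote parser.

```python
def parse_distance_blocks(lines):
    """Collect (row atom, column atom) -> distance from a matrix printed
    as a series of lower-triangular blocks."""
    distances = {}
    columns = []  # column atom numbers from the most recent header line
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue
        if all(tok.isdigit() for tok in tokens):
            columns = [int(tok) for tok in tokens]  # e.g. "21 22 23 24 25"
            continue
        is_row = (len(tokens) >= 3 and tokens[0].isdigit()
                  and tokens[1].isalpha())
        if not is_row or not columns:
            continue  # commentary, or rows whose header was snipped away
        row = int(tokens[0])
        for column, tok in zip(columns, tokens[2:]):
            distances[(row, column)] = float(tok)
    return distances


sample = """\
 28 H 2.700772 3.229400 5.731856 4.467482 7.448261
 21 22 23 24 25
 21 H 0.000000
 22 H 1.777203 0.000000
 30 H 6.292404 5.863642 3.978192 2.711028 6.081788
""".splitlines()

table = parse_distance_blocks(sample)
print(table[(22, 21)])
```

Rows that appear before any header (like atom 28 at the top of the sample) are dropped rather than guessed at, because without the header there is no way to know which columns they belong to.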

The Quixote knowledgebase for compchem continues- and Open Bibliography??

Monday, October 25th, 2010

#quixotechem #blueobelisk

It’s a few days since we announced that we had prototyped a distributed knowledgebase for computational chemistry – Quixote. We’ve already had useful and positive feedback. Here’s Anna:

Anna says:

October 25, 2010 at 8:12 am

Finally! I’ve been bemoaning the lack of and obviousness of a global compchem database for quite a while without the resources to do it. Where can I alpha-test, please? Well done guys.

Anna – you can become part of this; just join the list and tell us what you would like and can offer. You don’t say whether you want to use an existing knowledgebase (e.g. for reference, starting geometries, teaching, etc.) or whether you want to store and publish your own work. We’d be particularly interested in collections of legacy log files that you think would be of general interest. (At present we’d see the files being saved in specific collections so that people have a natural way of browsing them, but of course the whole resource will be searchable).

I’ve also had an offer from Henry Rzepa. He has reposited several thousand files in DSpace. Trouble is that once a collection is in DSpace you can’t get it out. That happened to me. So my challenge to the DSpace repositarians is:

“How do I download 5500 entries in DSpace?”

I do not, obviously, have time to click through all of them. I would be prepared to spend 30 minutes of my time. No, since it’s Henry, I am happy to spend 60 minutes. If it’s easier than I thought I will gently revise my opinion of Institutional Repositories for data. (But this is only one of the reasons why IRs don’t service scientific data). If it’s effectively impossible without writing my own scraper then I shall continue to look elsewhere. (After all this is partly why we built Quixote).

Then I have had a wonderful correspondence from another correspondent. This is a mainstream organic chemist. They write:

I am just starting out as a lecturer in … and I am increasingly aware of the problems of how I (and other members of the compchem community) handle and report our calculations. With my new principal investigator hat on, I am particularly concerned that:

(a) when students/post-docs come and go, their compchem data is not lost forever on their laptops/desktops

(b) rather than reading a doc/xls/pdf file summary of computations I will want to check the input/output files, particularly with new students to make sure they are doing things correctly

(c) when we publish work other groups are able to re-use our data easily (how much time have I wasted reconstructing input files from weird PDF formatting??!)

(d) that my work is reproducible. 

I currently have output from Gaussian, NWChem, Orca, Gamess-US, Macromodel, Jaguar, Tinker on various machines/servers/external disks, which is clearly sub-optimal. The idea of some nice QM data in a searchable repository would be pretty cool too for the force-field community, as all the ab initio parameterization was done using HF and MP2 with small basis sets for small molecules – I would imagine in principle many of the calculations required to parameterize/reparameterize a forcefield with more exotic molecules and more sophisticated levels of theory have been done by various people already….

I was recently tasked by the journal[…] to see if authors were following their editorial guidelines on compchem. Supporting Info – from a sample of ten recent papers using DFT calculations we found that the requisite Cartesians [coordinates of atoms], absolute energy and imaginary frequencies (for [transition states]) were usually present although not always (in spite of a check list that authors have to use). Deposition of the original files in the same way as we do with cifs [crystallographic data files, required by journals] already would solve this problem. Another little anecdote – I wanted to visualize the MOs of retinal the other day, so I had to calculate them myself – hundreds of people must have done this already at some point, so if I could search for this data rather than waste hours of  time/electricity that would be much better.


…if I can get involved/test or supply certain file formats/ raise awareness then please let me know! 


The work with the Journal is extremely valuable. As my correspondent says, all we have to do is persuade the Journal to publish the supporting info. We (in the JISCXYZ project) can validate it. Compchem is the best of all fields (even better than crystallography) for an almost complete data validation before refereeing. It’s even possible to do it oneself and sign the result.
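To give a flavour of what such a pre-refereeing check might look like, here is a minimal sketch in Python of the checklist my correspondent describes (Cartesians, absolute energy, imaginary frequencies for transition states). Everything named here is hypothetical: the `SupportingInfo` record and `check_supporting_info` function are illustrations, not a JISCXYZ tool or a journal schema. I am also assuming that imaginary modes are written as negative wavenumbers, and that a transition state should show exactly one of them; that is the usual textbook criterion, not something the journal guidelines quoted above spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SupportingInfo:
    """Hypothetical summary of what one reported calculation contains."""
    label: str
    # (element symbol, x, y, z) for each atom
    cartesians: List[Tuple[str, float, float, float]] = field(default_factory=list)
    absolute_energy_hartree: Optional[float] = None
    # Assumption: imaginary modes are recorded as negative wavenumbers
    frequencies_cm1: Optional[List[float]] = None
    is_transition_state: bool = False


def check_supporting_info(info: SupportingInfo) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not info.cartesians:
        problems.append("missing Cartesian coordinates")
    if info.absolute_energy_hartree is None:
        problems.append("missing absolute energy")
    if info.is_transition_state:
        if info.frequencies_cm1 is None:
            problems.append("missing frequencies for transition state")
        else:
            n_imag = sum(1 for f in info.frequencies_cm1 if f < 0)
            # Assumption: a transition state has exactly one imaginary mode
            if n_imag != 1:
                problems.append(
                    f"expected exactly 1 imaginary frequency, found {n_imag}"
                )
    return problems
```

A real validator would read these values from the deposited output files rather than from a hand-filled record, which is exactly why depositing the original files matters.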

And another correspondent, this time from the Blue Obelisk. [I asked about REST and some of the problems with firewalls.]

I am forwarding this message to […] developers list since it is very similar to what […] project is about.

Perhaps we can collaborate with reusing/extending existing […] REST services with similar functionality and sharing experience with development of web services based on RDF.

One of my colleagues was managing certificate authority for EGEE grids and running a certificate authority, so there is such experience.

And just for the record, distributed security in REST is not trivial nor simple at all … there are no well accepted solutions currently.
This might be the major obstacle in front of any REST approach aiming at distributed services, rather than single REST site, as most major commercial REST services these days.

We are currently adopting not optimal (centralized) solution in […] and GEANT-wise [A European GRID project and infrastructure] I am involved in a small group preparing an RFC for a protocol for REST security.

Just sharing what we are struggling with for two years already, having tens of distributed REST services over Europe, 5 independent implementations in two languages, covering at least half of the functionality listed in your email.

This is all really exciting. You can see how volunteers are starting to make major contributions to the project.

I am sure the redacted names will become public soon …

Now – if only we could get the same sort of excitement for bibliography. I’ve posted that we now have millions of Open Bibliographic records. Currently I have interest from 2 chemists, a zoologist, a mathematician, an economist, and several hackers. I met 4 librarians at JISC on Friday and bounced up to each and said “We’ve now got the British Library bibliography! Open! 3 million records. What can we do together?”

The answer was that they weren’t interested at all. “Why should we be interested in some other library’s catalogue?” “Bibliography isn’t interesting.” I was gobsmacked. Bibliography is the soul of scholarship. I thought that by collecting bibliography and turning it into an intelligent semantic resource we would start a new era in the library.

I really don’t know now what I am going to say to the Research librarians in Edinburgh.