#quixotechem #nwchem #jumboconverters
As I’ve mentioned (/pmr/2011/04/09/nwchem-a-fully-open-source-compchem-code-from-pnnl/ ) I now see NWChem as my flagship project for the Open Source Quixote (http://quixote.wikispot.org/Front_Page ) project to create an open source semantic framework for computational chemistry.
I was really flattered and encouraged by the NWChem group at PNNL, and so spent much of the journey back (plane, taxi) and the weekend hacking a JUMBO parser (/pmr/2011/03/24/extracting-data-from-scientific-calculations-and-experimental-log-files/ ). This is quite different from normal parsers written in C, Perl, Python, Java, etc., as it’s declarative. There is no need for the parser writer to write a single line of procedural code.
Here’s an example. I want to parse:
Brillouin zone point: 1
weight= 1.000000
k =< 0.000 0.000 0.000> . <b1,b2,b3>
=< 0.000 0.000 0.000>
.. terminated by a blank line ..
Assuming that the phrase “Brillouin zone point:” only occurs in this setting (and there are things I can do if it’s more complicated), I find it with a regular expression:
<template id="brillouinzp" repeat="*" pattern="\s*Brillouin zone point:\s*" endPattern="\s*">
This gives the starting and ending regexes (\s* means any amount of whitespace, including none, so the end pattern matches a blank line), and the template can be repeated as many times as needed (*). In the case of NWChem it works very well. Then to parse the 4 lines (records) we use per-line regexes, which are only active in the current scope (brillouinzp):
<record id="brill">\s*Brillouin zone point:{I,n:brill.zone}</record>
<record id="weight">\s*weight={F,n:brill.wt}</record>
<record id="k">\s*k\s*=<\s*{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>
<record id="k1">\s*=<{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>
This extracts the fields into CML elements (cml:scalar in this case) each identified with a unique ID. The {…} are shorthands for regular expressions for Integer, Float, etc. with an associated QName. The format n:foo maps onto a unique URI, with xmlns:n="http://www.xml-cml.org/dictionary/nwchem". (This dictionary will shortly exist publicly, but we hope the definitive one will be managed by the NWChem group – they know what the terms mean!)
Every field must have an entry in the NWChem dictionary, and so far we have extracted about 200 terms from the example files (I expect this to reach about 300-400). Thus the entry with id="brill.wt" will describe the floating point number (F) that is the weight of the zone point.
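To make the {…} shorthand concrete, here is a rough Python sketch of how a record could be expanded into an ordinary regex and applied to one line. This is not JUMBO's code. The regex fragments for I and F, the rule that each field skips leading whitespace, and the names TYPE_PATTERNS, compile_record and parse_line are all assumptions made for illustration, and the result is a plain list of tuples rather than cml:scalar elements.

```python
import re

# Assumed regex fragments for the type codes; JUMBO's own definitions may differ.
TYPE_PATTERNS = {
    "I": r"[+-]?\d+",
    "F": r"[+-]?\d*\.?\d+(?:[eE][+-]?\d+)?",
}
CASTS = {"I": int, "F": float}

# Prefix-to-namespace binding taken from the xmlns:n declaration in the post.
NAMESPACES = {"n": "http://www.xml-cml.org/dictionary/nwchem"}

# Matches a shorthand field such as {F,n:brill.wt}.
FIELD = re.compile(r"\{([A-Z]),(\w+):([\w.]+)\}")


def compile_record(record):
    """Expand every {TYPE,prefix:name} field into a capturing group."""
    fields = []

    def expand(match):
        type_code, prefix, local_name = match.groups()
        fields.append((type_code, NAMESPACES[prefix], local_name))
        # Assumption: a field may be preceded by whitespace on the line.
        return r"\s*(" + TYPE_PATTERNS[type_code] + ")"

    return re.compile(FIELD.sub(expand, record)), fields


def parse_line(record, line):
    """Return (namespace, name, value) tuples, or None if the line does not match."""
    regex, fields = compile_record(record)
    match = regex.fullmatch(line)
    if match is None:
        return None
    return [
        (namespace, name, CASTS[type_code](text))
        for (type_code, namespace, name), text in zip(fields, match.groups())
    ]


if __name__ == "__main__":
    print(parse_line(r"\s*weight={F,n:brill.wt}", "weight= 1.000000"))
```

Each tuple carries the namespace and local name, so a value can be looked up in the dictionary by its QName. A line that does not match returns None instead of raising, so the caller can decide how to report it.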
There are tools to extract arrays, matrices and molecules and to transform to CML where necessary (the above will probably be transformed into concise and semantic CML). Finally we terminate the template:
</template>
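The way a template's pattern and endPattern carve the output into repeated chunks can be imitated in a few lines of Python. Again this is only a sketch under assumptions, not how JUMBO is implemented: I assume the start pattern only has to match the beginning of a line, the end pattern has to match a whole line, and the terminating line is not kept in the chunk. The function name split_chunks is hypothetical, and the second zone point in the sample is made-up data.

```python
import re


def split_chunks(lines, start_pattern, end_pattern):
    """Group lines into chunks delimited by a start regex and an end regex.

    Returns (chunks, unparsed), where unparsed holds every line that fell
    outside a chunk, so nothing is silently dropped.
    """
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    chunks, unparsed = [], []
    current = None  # the chunk being collected, or None when outside one
    for line in lines:
        if current is None:
            if start.match(line):  # assumed: prefix match opens a chunk
                current = [line]
            else:
                unparsed.append(line)
        elif end.fullmatch(line):  # assumed: whole-line match closes it
            chunks.append(current)
            current = None
        else:
            current.append(line)
    if current is not None:  # input ended before the end pattern was seen
        chunks.append(current)
    return chunks, unparsed


SAMPLE = [
    "some header text",
    "Brillouin zone point: 1",
    "weight= 1.000000",
    "k =< 0.000 0.000 0.000> . <b1,b2,b3>",
    "=< 0.000 0.000 0.000>",
    "",
    "Brillouin zone point: 2",  # made-up second block to show repetition
    "weight= 0.500000",
    "",
    "some trailer text",
]

if __name__ == "__main__":
    chunks, unparsed = split_chunks(SAMPLE, r"\s*Brillouin zone point:\s*", r"\s*")
    print(len(chunks), "chunks;", len(unparsed), "unparsed lines")
```

The per-line record regexes would then be applied only to the lines inside each chunk, which is what keeps them scoped to the template.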
This declarative approach (inspired by XSLT) has many advantages over the procedural:
- It’s much quicker to write and more concise
- It’s much easier to see what is going on without delving into the code
- There are no IF statements to take care of unexpected or variable chunks of output
- It’s much easier to document (it’s all in XML and can be processed by standard tools)
- It’s easier to create sub-parsers for special cases
- There is no loss of information – unparsed records are reported as such
- It maps easily onto a dictionary structure
- It preserves the implicit hierarchical structure of documents
- It would be possible to generate some templates directly from a corpus of documents
- It provides an excellent input for Avogadro and Jmol
It requires the authors to learn regex, but they would have to do that anyway. Its main limitations are:
- It’s based on lines (records) and does not work well where line ends are simply wrapping whitespace
- It relies on distinctive phrases (especially alphabetic) – it’s not designed for dense numeric output (though it will work for some)
There are about 120 templates so far for NWChem and the approach has stood up well to new examples. It has parsed files of ca. 1 MByte in a few seconds. (Remember that these files can take days to compute, so the parsing time is trivial.) So I’m convinced that it works, and scales. I don’t yet know how easy others will find it, but we’ve had good first impressions.
I will keep you posted. Open Data, Open Source and Open Standards are coming to computational chemistry!