Extracting data from scientific calculations and experimental log files

Many scientists have to work with data produced by programs written in the era of FORTRAN IV, which brought with it a mindset of punched cards for input and lineprinter for output. Both of these were designed primarily for humans: the human punched the cards, fed them into the machine and got fanfold paper out. (I actually go back to when programs were keyed in on toggle switches and paper tape, but…). The output was expected to be read by humans and, if there was something useful in it, typed up again. This was good for the machine, but not much fun for the human.

Not much has changed. In computational chemistry much of the information is only available in log files (page-oriented, 80 or 132 columns wide). Countless scientists in 2011 still retype data from log files.

The problem is that the files are machine-readable (ASCII if you are lucky) but not machine-understandable. Here’s a typical chunk from computational chemistry (I HOPE WordPress keeps the formatting on your browser).

———————————————————————

Rotational constants (GHZ): 0.8124822 0.2870678 0.2733351

Standard basis: CC-pVDZ (5D, 7F)

There are 312 symmetry adapted basis functions of A symmetry.

Integral buffers will be 131072 words long.

Raffenetti 2 integral format.

Two-electron integral symmetry is turned on.

312 basis functions, 678 primitive gaussians, 330 cartesian basis functions

64 alpha electrons 64 beta electrons

nuclear repulsion energy 1312.3003184698 Hartrees.

NAtoms= 30 NActive= 30 NUniq= 30 SFac= 1.00D+00 NAtFMM= 50 NAOKFM=F Big=F

——————————————————————————

 

Some of this information is important, some less so. I have no idea what some of it means. Some of the text is meaningful and varies, some is boilerplate. How do we get the information out? Let’s try this line:

nuclear repulsion energy 1312.3003184698 Hartrees.

 

You can “grep” it (i.e. run a UNIX-type search over it). That works for small amounts of discrete information. It works in most cases but is highly fragile: a small change in the wording or spacing of the output is enough to break the search.

You can write a Python program to read this stuff (but only if you are a Pythonista). That approach scales poorly, is difficult to maintain and is not transparent to other users. But it’s probably the commonest approach.
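To make the hand-written approach concrete, here is a minimal sketch of what such a Python reader might look like for the single line above. The function name and the regex are my own illustration, not taken from any existing package, and the pattern assumes the wording of the line never changes.

```python
import re

# Hypothetical pattern for one specific line of the log excerpt.
# It assumes the exact wording "nuclear repulsion energy ... Hartrees."
NRE_PATTERN = re.compile(r"nuclear repulsion energy\s+(-?\d+\.\d+)\s+Hartrees\.")

def find_nuclear_repulsion_energy(log_text):
    """Return the first nuclear repulsion energy in log_text, or None."""
    match = NRE_PATTERN.search(log_text)
    return float(match.group(1)) if match else None

sample = "nuclear repulsion energy 1312.3003184698 Hartrees."
energy = find_nuclear_repulsion_energy(sample)
```

One function like this is easy. The trouble starts when every quantity in every program’s output needs its own pattern and its own code.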

Or you can use a framework designed to extract this sort of information. I have been trying to find one for 10 years (yes, I have asked on Stack Overflow and Googled, so if there is one it’s well hidden).

So I have been forced to build my own. (If you tell me there is a better solution I’ll be delighted). It’s a declarative approach. That means the user doesn’t have to understand Java, Python or any other procedural language. They create a set of instructions that define the output. I use XML because XML is my golden hammer, but a purist would use LISP. So the declarative approach asserts what the result should be based on the input:

<record id="nre"> nuclear repulsion energy {F20.10,compchem:nuclearRepulsionEnergy,unit:hartree} Hartrees.</record>

That’s it. The F20.10 means a floating-point number of width 20 with 10 decimal places (this is standard FORTRAN anyway). The compchem:nuclearRepulsionEnergy denotes an entry in the compchem dictionary (more about dictionaries later) and the unit:hartree declares the units. The dictionary will check that Hartrees are energy units. The result looks like:
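For readers who wonder how a template like this could be turned into something executable, here is a rough Python sketch. It is not how JUMBOConverters works internally; it is only my illustration of the idea. The helper names (template_to_regex, apply_template) are hypothetical, it handles a single field per template, and it matches whitespace loosely rather than enforcing the FORTRAN column width.

```python
import re
import xml.etree.ElementTree as ET

# Matches a field such as {F20.10,compchem:nuclearRepulsionEnergy,unit:hartree}
FIELD = re.compile(r"\{F(\d+)\.(\d+),([^,}]+),([^,}]+)\}")

def template_to_regex(template):
    """Turn a one-field template into (compiled regex, dictRef, units)."""
    field = FIELD.search(template)
    if field is None:
        raise ValueError("template contains no {Fw.d,dictRef,unit} field")
    _width, decimals, dict_ref, units = field.groups()
    before = template[:field.start()].strip()
    after = template[field.end():].strip()
    # Assumption: accept up to `decimals` digits after the point and
    # ignore the declared width, so spacing in the log may vary.
    number = r"(-?\d+\.\d{1,%s})" % decimals
    pieces = [re.escape(before), number, re.escape(after)]
    pattern = r"\s+".join(p for p in pieces if p)
    return re.compile(pattern), dict_ref, units

def apply_template(template, line):
    """Return a <scalar> element for line, or None if it does not match."""
    regex, dict_ref, units = template_to_regex(template)
    match = regex.search(line)
    if match is None:
        return None
    scalar = ET.Element(
        "scalar",
        {"dictRef": dict_ref, "units": units, "dataType": "xsd:double"},
    )
    scalar.text = match.group(1)
    return scalar

template = ("nuclear repulsion energy "
            "{F20.10,compchem:nuclearRepulsionEnergy,unit:hartree} Hartrees.")
line = " nuclear repulsion energy      1312.3003184698 Hartrees."
element = apply_template(template, line)
print(ET.tostring(element, encoding="unicode"))
```

The point of the sketch is that the person writing the template never sees the regex or the element-building code; they only state what the line looks like and what the number means.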

<scalar dictRef="compchem:nuclearRepulsionEnergy" units="unit:hartree" dataType="xsd:double">1312.3003184698</scalar>

This is held in an XML DOM and can be searched and processed by a wide range of tools.
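As one example of such processing, here is a small Python sketch that pulls the value back out using the standard library’s ElementTree. The surrounding module element is an assumption made up for the example, and I have left out XML namespaces to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document: the <module> root is invented for
# illustration and no XML namespace is assumed.
document = """<module>
  <scalar dictRef="compchem:nuclearRepulsionEnergy" units="unit:hartree" dataType="xsd:double">1312.3003184698</scalar>
</module>"""

root = ET.fromstring(document)

# Find every scalar tagged with the dictionary entry of interest.
hits = root.findall(".//scalar[@dictRef='compchem:nuclearRepulsionEnergy']")

# Pair each numeric value with its declared units.
energies = [(float(s.text), s.get("units")) for s in hits]
```

Because the search is on the dictionary reference rather than on the wording of the log file, the same query works whichever program produced the number.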

The reason I have written a framework is that I need parsers for all the compchem programs (>20). A conventional approach would take lots of programmers with fragmentation of effort. The declarative approach is much quicker and almost anyone can use it. (You may have to learn some simple regexes but that’s all).

Yesterday we proved that JUMBOConverters works and that people who hadn’t seen it before could pick it up quickly.

So I’m writing this in case there are groups outside compchem who need to parse “lineprinter” output. Cameron Neylon has given me two files, and writing parsers for them should take only minutes. I’ll show you this in later posts.

JUMBO is, of course, Open Source.
