Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 

There are other evils than PDF: what’s the problem here?

I am writing a parser for the #quixotechem project – in this case NWChem output. The output is generated AFAIK by FORTRAN. I am having difficulty parsing it. Why?

Here is some text I can parse (deliberately a snapshot from a text processor):

And here is the evil output:

What unexpected horror (or semi-unexpected, as I’ve had it before) has caused me to waste a lot of time?

EDIT:

Here’s a very strong hint. This is what I get when I load it into my text editor. Whatever has happened? And what could I do to make it at least human readable?

And what piece of code did I have to write last night to solve the problem in future?

 

8 Responses to “There are other evils than PDF: what’s the problem here?”

  1. [...] This post was mentioned on Twitter by Davor Vidic, Natalia Reinic. Natalia Reinic said: There are other evils than PDF: what’s the problem here?: I am writing a parser for the #quixotechem project – i… http://bit.ly/ft3Iga [...]

  2. David Jones says:

    These “how did I go wrong” puzzles are always a bit tricky. Perhaps you were doing it in C#?

    • baoilleach says:

      The field width is not clear in the “evil” output…?

    • pm286 says:

      Thanks for the comments. The language is irrelevant – this horror could have been committed in any language.

      • David Jones says:

        Aghh! Tabs! Will the evil never cease?

        Amusingly, and further illustrating the point, your post looks completely different in Google Reader (RSS) and on the web (HTML).

        I agree with Dan Hagon (below): Regular expressions are better for parsing this sort of stuff, even if it looks fixed width.

        Disappointed to discover that my favourite language, Python, doesn’t have a really easy “untabify” module / function / IOWrapper.

        Tabs are evil / fixed column output is evil / Fortran is evil. We knew that.

        • baoilleach says:

          @David: Re Python, not sure what you need for untabify, but I regularly use .split() on strings, or you could .replace("\t", " ").

  3. Dan Hagon says:

    Two questions: what character encoding is the file written in and are those whitespace characters (a) spaces, (b) tabs, or (c) just simply unprintable? If I were parsing this I’d use a regex such as "^(\w+\s)+(\d+)\s(\d+)$" and select groups 2 and 3. I’d dump the resulting data into a structured file format such as XML and make it human-readable by providing an XSLT style sheet that displays the data as HTML.
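The suggestions in the comments above (untabify the line, then parse with a regex rather than by fixed columns) can be sketched in Python. This is only an illustration under stated assumptions: the NWChem lines themselves are not reproduced in the post, so the sample strings are made up, the function names `untabify`, `find_tabs` and `parse_row` are hypothetical, and the tab stop of 8 is a guess at what the writing program intended. The pattern is a whitespace-tolerant variant of the one suggested above (`\s+` in place of `\s`), not the commenter's exact regex.

```python
import re

def untabify(line, tabsize=8):
    """Expand tabs to spaces up to the next tab stop.

    Uses the built-in str.expandtabs; tabsize=8 is an assumption.
    """
    return line.expandtabs(tabsize)

def find_tabs(lines):
    """Return the 1-based numbers of the lines that contain a tab."""
    return [n for n, line in enumerate(lines, start=1) if "\t" in line]

# One or more label words, then two integer fields, separated by any
# run of spaces and/or tabs (so column positions do not matter).
ROW = re.compile(r"^\s*((?:\w+\s+)+)(\d+)\s+(\d+)\s*$")

def parse_row(line):
    """Return (label, first_int, second_int), or None if the line does not match."""
    m = ROW.match(line)
    if m is None:
        return None
    label = " ".join(m.group(1).split())  # collapse tabs/spaces in the label
    return label, int(m.group(2)), int(m.group(3))

if __name__ == "__main__":
    # Hypothetical lines, not taken from real NWChem output.
    sample = ["alpha beta  12  34", "alpha\tbeta\t12\t34"]
    print(find_tabs(sample))
    for line in sample:
        print(repr(untabify(line)), parse_row(line))
```

Because the pattern accepts any run of whitespace between fields, the tabbed and the space-padded line parse to the same values; `untabify` is then only needed to make the file readable in an editor, while `find_tabs` reports where the tabs are.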
