There are other evils than PDF: what’s the problem here?

Posted on January 30, 2011 by pm286

I am writing a parser for the #quixotechem project – in this case NWChem output. The output is generated AFAIK by FORTRAN. I am having difficulty parsing it. Why?

Here is some text I can parse (deliberately a snapshot from a text processor):

And here is the evil output:

What unexpected horror (or semi-unexpected, as I’ve had it before) has caused me to waste a lot of time?

EDIT:

Here’s a very strong hint. This is what I get when I load it into my text editor. Whatever has happened? And what could I do to make it at least human readable?

And what piece of code did I have to write last night to solve the problem in future?


8 Responses to There are other evils than PDF: what’s the problem here?

David Jones says:

January 31, 2011 at 2:23 pm

These “how did I go wrong” puzzles are always a bit tricky. Perhaps yours was doing it in C#?

- baoilleach says:
  
  January 31, 2011 at 6:15 pm
  
  The field width is not clear in the “evil” output…?
  
- pm286 says:
  
  January 31, 2011 at 6:28 pm
  
  Thanks for the comments. The language is irrelevant – this horror could have been committed in any language.
  
  - David Jones says:
    
    February 2, 2011 at 10:19 am
    
    Aghh! Tabs! Will the evil never cease?
    Amusingly, and further illustrating the point, your post looks completely different in Google Reader (RSS) and on the web (HTML).
    I agree with Dan Hagon (below): Regular expressions are better for parsing this sort of stuff, even if it looks fixed width.
    Disappointed to discover that my favourite language, Python, doesn’t have a really easy “untabify” module / function / IOWrapper.
    Tabs are evil / fixed column output is evil / Fortran is evil. We knew that.
    
    - baoilleach says:
      
      February 2, 2011 at 2:03 pm
      
      @David: Re Python, not sure what you need for untabify, but I regularly use .split() on strings, or you could .replace("\t", " ").
      
      - David Jones says:
        
        February 2, 2011 at 2:31 pm
        
        Well, something that would make the unix utility “expand” a one-liner in Python. And that’s not «' '.join(foo.split())»
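For what it is worth, Python's built-in `str.expandtabs()` comes close to the "expand" one-liner asked for in this thread: it pads each tab out to the next tab stop rather than swapping it for a single space. The sketch below is only an illustration; the `untabify` wrapper and the sample line are hypothetical and are not taken from the NWChem output in the post.

```python
def untabify(text, tabsize=8):
    """Replace each tab with spaces up to the next tab stop, like Unix `expand`.

    str.expandtabs() resets its column counter at every newline, so a whole
    multi-line string can be passed in one call.
    """
    return text.expandtabs(tabsize)


# Hypothetical tab-separated line, not real program output.
sample = "ab\tcde\t42\n"

aligned = untabify(sample)               # columns stay aligned at multiples of 8
squashed = sample.replace("\t", " ")     # one space per tab: alignment is lost

print(repr(aligned))
print(repr(squashed))
```

The difference from `.replace("\t", " ")` is that the number of spaces inserted depends on the current column, which is what a fixed-width parser needs.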
Dan Hagon says:

January 31, 2011 at 11:06 pm

Two questions: what character encoding is the file written in and are those whitespace characters (a) spaces, (b) tabs, or (c) just simply unprintable? If I were parsing this I’d use a regex such as “^(\w+\s)+(\d+)\s(\d+)$” and select groups 2 and 3. I’d dump the resulting data into a structured file format such as XML and make it human-readable by providing an XSLT style sheet that displays the data as HTML.

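A minimal sketch of the regex approach suggested in this comment, using the pattern exactly as written and selecting groups 2 and 3. The function name `parse_two_ints` and the sample lines are hypothetical, not real NWChem output. Because `\s` matches a tab as readily as a space, the same pattern accepts both variants; note that a bare `\s` matches exactly one whitespace character, so runs of spaces would need `\s+` instead.

```python
import re

# Pattern taken verbatim from the comment: some words, then two integers.
LINE_RE = re.compile(r"^(\w+\s)+(\d+)\s(\d+)$")


def parse_two_ints(line):
    """Return the two trailing integers (groups 2 and 3), or None if no match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(3))


# Hypothetical lines: one separated by spaces, one by tabs.
with_spaces = "basis functions 24 7"
with_tabs = "basis functions\t24\t7"

print(parse_two_ints(with_spaces))
print(parse_two_ints(with_tabs))
```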