PDFBox and Hamburgers – the story continues

I blogged recently about how I used PDFBox to turn chemical theses (in hamburger PDF) into text. I have now found some interesting (and I think exciting) developments – but I’d like a reality check.
I downloaded a thesis from a well-known institution. The spectra were bitmaps (and not very pretty when magnified). But the text seemed to be real text – it was selectable in Adobe Reader and PDFBox emitted it as ASCII. But I noticed some strange things. As an example:
H2O 13NMR
might come out as
H
2
OJ
3
NMR
– why the “J”? So I looked closer and it seemed that the text was also a bitmap. On magnification it gave jaggy characters and the “J” was just a bad reading of the “1” of “13”. I’d assumed the document was Word converted to PDF – a sort of undead or still warm hamburger, but no – it’s a totally cold dead hamburger.
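The line splitting, at least, can be patched up after the fact. Here is a minimal sketch in Python, assuming that any line consisting only of digits is a sub/superscript fragment and that the line after it continues the same token. The function name `rejoin_fragments` is made up for illustration and is not part of PDFBox, and the heuristic will wrongly glue genuine digit-only lines (page numbers, table cells), so treat it as a starting point only.

```python
def rejoin_fragments(extracted: str) -> str:
    """Glue sub/superscript fragments back onto one line.

    Assumption (not from PDFBox): a line made up only of digits is a
    sub/superscript that the extractor emitted on its own line, and the
    line after it continues the same token.
    """
    merged = []
    glue_next = False  # True when the previous line was a digit-only fragment
    for raw in extracted.splitlines():
        line = raw.strip()
        if not line:
            # Keep blank lines as hard breaks; never glue across them.
            merged.append("")
            glue_next = False
            continue
        is_fragment = line.isdigit()
        if merged and merged[-1] and (is_fragment or glue_next):
            merged[-1] += line  # attach to the token being rebuilt
        else:
            merged.append(line)
        glue_next = is_fragment
    return "\n".join(merged)


print(rejoin_fragments("H\n2\nOJ\n3\nNMR"))  # H2OJ3NMR
```

This only undoes the splitting: the space between the two terms is lost, and the misread “J” stays, because that error was baked in when the page was scanned.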
So does PDFBox have an OCR facility inside it? Or is there some magic inside the PDF I don’t understand (it’s all binary gibberish)? If PDFBox can do OCR without even blinking, that is awesome.
But – as you can see – OCR corrupts. Or rather scanning paper documents corrupts.
So – please – when you ingest Electronic Theses and Dissertations (ETDs), please use a semantic format. Please insist the student hands over a semantic electronic version. After all, you can withhold their degree or do other horrible things to them. It’s one of the few places where academia still retains some publishing power. So please show how it should be done properly 🙂


2 Responses to PDFBox and Hamburgers – the story continues

  1. Jim Downing says:

    Test comment

  2. FYI, work is in progress to fix the subscript issue, keep an eye out for the next version.
    PDFBox does not contain OCR, but it does have some magic 🙂
    Ben
