PDFBox and OCR

Ben Litchfield is the(?) current guru of PDFBox and has sent me an update on it. (I copy his comment here because, although I think Jim has fixed the blog (thanks, Jim), I won’t take chances.)
Name: Ben Litchfield URI: http://www.pdfbox.org/ | IP: 170.37.224.2 | Date: April 17, 2007
FYI, work is in progress to fix the subscript issue, keep an eye out for the next version.
PDFBox does not contain OCR, but it does have some magic :)
Ben
====
I’m at the Coalition for Networked Information meeting in Phoenix (and may blog on that) and talked to Glen Newton (from Canada), who suggested that the library had digitised the thesis *and* run OCR over it, then overlaid the OCR text on the TIFF bitmap. This makes sense.
I managed to hack a simple heuristic for subscripts and it worked better than I had hoped. So now I am able to extract chemistry out of PDF theses.
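For readers wondering what such a heuristic might look like, here is a minimal sketch. It is not necessarily the heuristic described above, and it does not use the PDFBox API. It assumes each character on a line has already been extracted with a position and font size; the `Glyph` record, the `mark_subscripts` function and the `drop` and `shrink` thresholds are all hypothetical names and values chosen for illustration. It also assumes subscripts are set in a smaller font than the main text and that y increases upward, as in PDF user space.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """One extracted character (hypothetical record, not a PDFBox type)."""
    text: str
    x: float     # left edge of the character
    y: float     # baseline; assumed to increase upward
    size: float  # font size in points


def mark_subscripts(glyphs, drop=0.15, shrink=0.85):
    """Return the line's text with subscript runs wrapped in <sub>...</sub>.

    A glyph counts as a subscript when it is both smaller than the main
    text (size below shrink * main size) and sits below the main baseline
    by more than drop * main size. Both thresholds are guesses.
    """
    if not glyphs:
        return ""
    main_size = max(g.size for g in glyphs)
    # Estimate the baseline from full-size glyphs only, so a line with
    # many subscripts (e.g. C6H12O6) does not skew the estimate.
    main = [g for g in glyphs if g.size >= shrink * main_size]
    base_y = median(g.y for g in main)

    out = []
    in_sub = False
    for g in sorted(glyphs, key=lambda g: g.x):
        is_sub = (g.size < shrink * main_size
                  and (base_y - g.y) > drop * main_size)
        if is_sub and not in_sub:
            out.append("<sub>")
        if not is_sub and in_sub:
            out.append("</sub>")
        out.append(g.text)
        in_sub = is_sub
    if in_sub:
        out.append("</sub>")
    return "".join(out)


water = [Glyph("H", 0, 100, 12), Glyph("2", 8, 97, 8), Glyph("O", 13, 100, 12)]
print(mark_subscripts(water))
```

A sketch like this will misfire on footnote markers, small caps and PDFs that lower subscripts without shrinking them, so the thresholds would need tuning against real theses.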
BUT – it causes great pain. So please don’t deposit PDF-only. I am saddened to hear so many people at this meeting (which is mainly Library and Information Scientists) talking about “depositing the PDF in the repository” as if PDF was some god-given information object structure.
I am really excited that Ben has answered; it shows how rapidly the blogosphere helps people make contact. Yes, I could have mailed the PDFBox list (and I probably shall), but the blog also reaches the people who are creating PDFs…
I sometimes get people who offer services but say they aren’t chemists. Here’s an opportunity.
Is there anyone interested in helping develop PDF2XML for chemistry? You don’t need to know any chemistry, but you do need to be excited about de-obfuscation (the reverse of what PDF does to chemistry). For example, at the moment I need a way of removing graphical boxes from the manuscript, as the characters in them bleed into the text.
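To make the box problem concrete, here is one possible starting point, offered as a rough sketch only. It assumes the rectangles of the graphical boxes and the positions of the extracted characters are already known in the same coordinate system; finding the boxes in the PDF is the hard part and is not attempted here. The `Box` and `Char` records and the `drop_boxed_chars` function are hypothetical names, not part of PDFBox.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle of a graphical box (hypothetical record)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        # True when the point lies inside or on the edge of the rectangle.
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Char:
    """One extracted character with a reference point (hypothetical record)."""
    text: str
    x: float
    y: float


def drop_boxed_chars(chars, boxes):
    """Keep only the characters whose reference point is outside every box."""
    return [c for c in chars if not any(b.contains(c.x, c.y) for b in boxes)]


chars = [Char("A", 10, 10), Char("B", 50, 50), Char("C", 90, 10)]
boxes = [Box(40, 40, 60, 60)]
print("".join(c.text for c in drop_boxed_chars(chars, boxes)))
```

Testing a single reference point per character is a simplification; characters that straddle a box edge would need an overlap test on their bounding boxes instead.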
P.


5 Responses to PDFBox and OCR

  1. Jim Downing says:

    Another test comment – PMR is the email for this working?
    jim

  2. Chris Rusbridge says:

    Peter, PDF has many advantages for longevity, and some significant (hamburger) disadvantages as you have pointed out. I’m not yet certain that “publish in XML” is enough. In fact, most big publishers do have XML versions of their papers… in their own DTDs. Portico (http://www.portico.org/) converts publisher content from publisher DTDs into theirs, which is based on (the same as?) the NLM DTD (see http://dtd.nlm.nih.gov/), which is getting some mileage, so maybe that’s the way to go.
    However, PDF is a complex beast. It allows for lots of things to be included that I don’t yet understand. Would it be possible for authors to include in their PDF files some sort of microformat information on the chemistry involved? Then instead of scraping it out, your tools could politely request it!

  3. pm286 says:

    1. Jim, what do you mean?
    2. Chris – I have sympathy here – having been a guinea pig for the awful Publikon system that BMC tried on authors.
    However, I am against any extension of PDF, especially used as a compound document format for political reasons. Polite requests are increasingly going to be ineffective against the more aggressive commercial publishers – we need to hang onto our own.

  4. ojd20 says:

    Peter, I was really just checking that the comment system was working again…

  5. ojd20 says:

    (2) I wonder how much of PDF’s suitability for archiving stems from the destruction of the semantic content making documents difficult to edit.
    On a side note, having just tried to get Open Source tools and different Adobe tools and versions talking to each other (between myself and a small printing firm), I’m less sure than I was whether we’re standardizing on PDF or Acrobat.
    I’ve no idea whether PDF could handle compound documents as you describe – but I suspect that both OpenDocument and Office Open XML formats could. PDF could include chemical metadata in the XMP block, but I’ve no idea whether it has an internal addressing system (akin to XLink) that would enable you to add metadata to specific paragraphs / words.
