petermr's blog

A Scientist and the Web

 

Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Yesterday we had a truly marvellous hackathon http://scholrev.org/hackathon/ in Montpellier, in between the workshops and the main Eur Semantic Web Conference. The purpose was to bring together a number of groups who value semantic scholarship and want to free information from the traditional forms of publication. I’ll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea of Jailbreaking was to bring together people who have developed systems, tools, protocols and communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being http://pdfbox.apache.org/ , and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It’s a big effort to create a PDF2XML system (the end goal). I am credited with the phrase “turning a hamburger into a cow” but it’s someone else’s. If we sat down to plan PDF2XML, we’d conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that’s the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many “I”s are now “we”. It’s the same liberating feeling as 7 years ago when we created the http://en.wikipedia.org/wiki/Blue_Obelisk community for chemistry. Jailbreaking has many of the same shared values, though its members come from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed source components. At some stage they will either be withdrawn or there will be critical things you want to change or add and that’s simply not possible. And licensing closed source in an open project is a nightmare. It’s an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion because I can’t make any use of it. Some people use its output, and that’s fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not “my program is better than your program”, but “our tools are better than my system”. So here is a brief overview of the open players who came together (I may miss some individuals, please comment if I have done you an injustice). I’ll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (http://code.google.com/p/lapdftext/ , Gully Burns).
    Gully was in Los Angeles – in the middle of the night – and showed great stamina :-) In true hacking spirit I used the time to find out about Gully’s system. I downloaded it and couldn’t get it to install (needed java-6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2) so maybe I can use it in my work rather than competing.
  • CERMINE
    http://sciencesoft.web.cern.ch/node/120 . I’ve already blogged this but here we had the lead Dominika Tkaczyk live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI2 at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro, Biotea (http://biotea.idiginfo.org/ ), Greg Riccardi and colleagues. They are working on suicide in the context of Veterans’ admin documents and provided us with an Open corpus of many hundred PDFs. (Some were good, some were really awful.) Alex and Casey ran the workshop with great energy, preparation, food, beer, etc. and arranged the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process: validating our processes, and also possibly tweaking final outputs. We connected to Daniel Lombraña González of http://crowdcrafting.org/ who took us through the process of building a distributed volunteer community. There was a lot of interest and we shall be designing clear crowdcrafting-friendly tasks (e.g. “draw a rectangle round the title”, “highlight corrupted characters”, “how many references are there”, etc.)

  • CITALO
    http://wit.istc.cnr.it:8080/tools/citalo . This system deduces the type of the citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing, or match keywords. AMI2 has several unique features (I’ll explain later), including character identification, extraction of graphics (graphics are not images), image extraction, sub- and superscripts, bold and italic. (Most of the other systems ignore graphics completely and many also ignore bold/italic.)

So we have a wonderful synthesis of people and projects and tools. We all want to collaborate and are all happy to put community success as the goal, not individual competition. (And the exciting thing is that it’s publishable and will be heavily cited. We have shown this in the Blue Obelisk publications where the first has 300 citations and I’d predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially in a loose collaboration, while planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.

 
