Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Yesterday we had a truly marvellous hackathon (http://scholrev.org/hackathon/) in Montpellier, in between the workshops and the main Eur Semantic Web Conference. The purpose was to bring together a number of groups who value semantic scholarship and free information from the traditional forms of publication. I’ll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea of Jailbreaking was to bring together people who have developed systems, tools, protocols and communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being http://pdfbox.apache.org/, and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It’s a big effort to create a PDF2XML system (the end goal). I am credited with the phrase “turning a hamburger into a cow” but it’s someone else’s. If we sat down to plan PDF2XML, we’d conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that’s the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many “I“s are now we. It’s the same liberating feeling as 7 years ago when we created the http://en.wikipedia.org/wiki/Blue_Obelisk community for chemistry. Jailbreaking has many of the shared values, though coming from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed source components. At some stage they will either be withdrawn or there will be critical things you want to change or add, and that’s simply not possible. And licensing closed source in an open project is a nightmare. It’s an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion because I can’t make any use of it. Some people use its output, and that’s fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not “my program is better than your program”, but “our tools are better than my system”. So here is a brief overview of the open players who came together (I may miss some individuals; please comment if I have done you an injustice). I’ll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (http://code.google.com/p/lapdftext/, Gully Burns). Gully was in Los Angeles – in the middle of the night – and showed great stamina :-) In true hacking spirit I used the time to find out about Gully’s system. I downloaded it and couldn’t get it to install (it needed Java 6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2) so maybe I can use it in my work rather than competing.
  • CERMINE (http://sciencesoft.web.cern.ch/node/120). I’ve already blogged this, but here we had the lead, Dominika Tkaczyk, live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro, Biotea (http://biotea.idiginfo.org/), Greg Riccardi and colleagues. They are working on suicide in the context of Veterans’ admin documents and provided us with an Open corpus of many hundreds of PDFs. (Some were good, some were really awful.) Alex and Casey ran the workshop with great energy, preparation, food, beer, etc. and arranged the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process: validating our processes, and also possibly tweaking final outputs. We connected to Daniel Lombraña González of http://crowdcrafting.org/, who took us through the process of building a distributed volunteer community. There was a lot of interest and we shall be designing clear crowdcrafting-friendly tasks (e.g. “draw a rectangle round the title”, “highlight corrupted characters”, “how many references are there”, etc.).
  • CITALO (http://wit.istc.cnr.it:8080/tools/citalo). This system deduces the type of the citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing, or match keywords. AMI2 has several unique features (I’ll explain later), including character identification, graphics extraction (graphics are not images), image extraction, sub- and superscripts, bold and italic. (Most of the other systems ignore graphics completely and many also ignore bold/italic.)

So we have a wonderful synthesis of people and projects and tools. We all want to collaborate and are all happy to put community success as the goal, not individual competition. (And the exciting thing is that it’s publishable and will be heavily cited. We have shown this in the Blue Obelisk publications, where the first has 300 citations, and I’d predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially in a loose collaboration, while planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.

 

11 thoughts on “Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1”

  1. Rahul Jha

    Hi,

    This is Rahul from University of Michigan. Our research group, CLAIR (http://clair.si.umich.edu/homepage/) works a lot with scientific publications and we are constantly struggling to extract structured data from scientific PDFs (e.g. citation links). So I’d love to help.

    I’d also like to draw your attention to a tool called ParsCit (http://wing.comp.nus.edu.sg/parsCit/). It extracts structure from scientific PDFs after they have been converted to text. It’s not perfect, but it works well enough in most cases; I use it a lot.

    1. pm286 Post author

      We’d love you to join – it’s open to all. There will be lots of little ways to help; even testing current tools will really help.

      I can’t get the links for ParsCit to work… Is it still current (it was 2004)? There are people in our group who are interested in doing this.

      1. Rahul Jha

        The ParsCit link seems to work for me. There could be an access restriction; are you able to view the parent domain (http://wing.comp.nus.edu.sg)? I can mail a copy of the software if needed.

        To provide more details, my current workflow involves converting PDF to text using PDFBox, then removing tables, figures etc. using ParsCit (since we are mostly interested in the body text), and then running a set of ad-hoc Perl scripts for error correction, e.g. hyphen removal (“develop -ment” -> “development”), splitting joined words (“andspeechrecognition” -> “and speech recognition”), etc. I can provide my code for this, or integrate it with your code if you have a post-processing module.
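
        The error-correction step described above (hyphen removal and splitting joined words) could be sketched roughly as follows. This is only an illustration in Python rather than Perl, not the actual scripts: the function names and the tiny `VOCAB` word list are assumptions, and a real pipeline would load a much larger dictionary.

```python
import re

# Hypothetical mini-vocabulary for illustration; a real run needs a large word list.
VOCAB = {"and", "speech", "recognition", "development"}


def join_hyphenated(text):
    """Rejoin words split at a hyphenation break, e.g. 'develop -ment' or 'develop-\\nment'."""
    # Match a hyphen with whitespace on exactly one side, sitting between word characters.
    # A spaced dash such as 'a - b' is left alone.
    return re.sub(r"(\w)(?:-\s+|\s+-)(\w)", r"\1\2", text)


def split_joined(token, vocab=VOCAB):
    """Split a run-together token into vocabulary words; return it unchanged if no full split exists."""
    n = len(token)
    best = [None] * (n + 1)  # best[i] holds a list of words covering token[:i], if any
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and token[j:i].lower() in vocab:
                best[i] = best[j] + [token[j:i]]
                break
    return " ".join(best[n]) if best[n] else token


def clean(text):
    """Apply both corrections; note that this also collapses runs of whitespace."""
    text = join_hyphenated(text)
    return " ".join(split_joined(tok) for tok in text.split())


if __name__ == "__main__":
    print(clean("develop -ment andspeechrecognition"))
```

        Two caveats with this sketch: it also strips the hyphen from genuinely hyphenated compounds that happen to break at a line end, and tokens with attached punctuation are not split, so the output would still need checking against the source.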

  2. Pingback: Jailbreaking the PDF; a wonderful hackathon and a community leap … | Richard Kastelein - Creative Technology and building the bridge

  3. Pingback: Unilever Centre for Molecular Informatics, Cambridge - Hack4ac, Content-mining and open-science in Oxford « petermr's blog

    1. pm286 Post author

      Marc,
      Incredible – fantastic. Unfortunately I shall be in Lithuania. But you are welcome to all of my software – we are working with Tabula (see next post, I hope). I have found Gov PDFs are technically MUCH better than those of scientific publishers. Also I am developing graph and chart hacking so that CSV files can be extracted.

      I am with you in spirit and would probably be able to Skype in, especially if anyone wanted to use the software and needed help.

      1. Marc

        Thanks, Peter. For what it’s worth, we are going to encourage remote participation by publishing some of the challenges ahead of the hackathon and including remote submissions in judging.

        Also I am starting a Resources page at http://pdfliberation.wordpress.com/2013/11/15/hackathon/. If you want me to add your tools (or any others that performed well at your hackathon), I hope you can send details to marc[at]publicsectorcredit[dot]org.

  4. Pingback: PDF Liberation Hackathon – DC, SF and Wordwide – Jan. 17-19, 2014 | PDF Liberation

  5. Pingback: Global Hackathon Seeks PDF Extraction Solutions
