Chris Rusbridge and I have been indulging in a constructive debate about whether PDF is a useful archiving tool. Chris, as readers know, runs the Digital Curation Centre. I’ll reproduce the latest post from his blog and intersperse with comments. But just before that I should make it clear that I am not religiously opposed to PDF, just to the present incarnation of PDF and the mindset that it engenders in publishers, repositarians, and readers. (Authors generally do not use PDF).
[Robotic mining] is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.
However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily through deliberate obscurity, but perhaps as a side effect of the process leading to the PDF).
PMR: Agreed. An expansion is “most articles are authored in Word/OO or LaTeX and converted to PDF for the purposes of publishing.”
Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man[*], but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I’ve referred to his arguments in the past, and we’ve been having a discussion about it over the past few days (see here, its comments, and here).
PMR: [*] Alma Swan does me the honour to quote this in her talks 🙂
I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.
PMR: I don’t think scientists care about PDF. It’s something that comes down the wire. If it came down in Word they wouldn’t blink. So it’s the publishers, not the readers. And most authors create Word. They tip it into the publisher’s site, which converts it to PDF.
PMR: Having said that, “PDF” is rapidly moving from a trademark to an English word. Rather than “send the manuscript” it’s “send the PDF”. Just like “please send us your Powerpoints”.
One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
PMR: This is completely possible at the technical level. My collaborator Henry Rzepa is keen on using PDF as a compound document format and metadata container. It can do it. But nobody does, and it will certainly require tools that have to be bought.
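To make the robot's side of this concrete, here is a minimal Python sketch for the HTML case only (it says nothing about how PDF would do it). It assumes a made-up convention in which a chemical name is wrapped in a span carrying RDFa-style `property` and `content` attributes; the `chem:inchi` term, the class name and the sample sentence are all hypothetical and not taken from any agreed vocabulary. With tagging like this, the robot no longer has to guess which words are chemicals.

```python
from html.parser import HTMLParser


class InChIExtractor(HTMLParser):
    """Collect (visible text, InChI) pairs from tagged spans.

    Assumption: chemicals are marked as
    <span property="chem:inchi" content="InChI=...">name</span>.
    "chem:inchi" is a hypothetical term used only for this sketch.
    Nested tagged spans are not handled.
    """

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None  # InChI of the tagged span we are inside, if any
        self._text = []       # visible text collected inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("property") == "chem:inchi":
            self._current = attrs.get("content")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append(("".join(self._text).strip(), self._current))
            self._current = None


def extract_inchis(xhtml):
    """Return a list of (name as printed, InChI) pairs found in the markup."""
    parser = InChIExtractor()
    parser.feed(xhtml)
    parser.close()
    return parser.found


SAMPLE = (
    "<p>The reaction was run in "
    '<span property="chem:inchi" content="InChI=1S/CH4O/c1-2/h2H,1H3">methanol</span>'
    " at room temperature.</p>"
)

print(extract_inchis(SAMPLE))
```

The human reader still sees only the word "methanol"; the identifier rides along silently in the markup, which is the point of the microformat approach.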
PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF’s determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.
I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.
PMR: I’m not asking for XML. I’m asking for either XHTML or Word (or OOXML).
CR: I think we should tackle this in several ways:
- try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
PMR: I am one hundred percent in favour of this. The problem is that most publishers are one hundred percent against it. For business reasons, not technical. Because they are worried that people might steal their content (oops, the content that we wrote).
- try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
PMR: Most large publishers have an XML format. It’s trivial for them to create HTML and many do. This is a business problem, not a technical one.
- tackle more domain ontologies to get agreements on semantics
PMR: Agreed. This is orthogonal to whether we use PDF, Word or clay tablets (many ancient civilisations used markup).
- work on microformats and related approaches to allow semantics to be silently encoded in documents
PMR: Absolutely agreed. It needs tools but we have some cunning plans for chemistry which will be revealed shortly.
- try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
- try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.
PMR: The first part already exists. I would not espouse the second as I don’t want to have to purchase another set of tools for something that should be free.
Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science…
PMR: It will work at some stage – the stage when the publishers want to help scientists in their endeavour rather than prevent them taking the next logical step because it might impact on subscriptions or be extra work. The W3C community, the Googles, Flickrs, etc. do all this already. They have semantic linked data. It’s just that the scientific publishing Tardis is still stuck in the nineteenth century. It looks lovely from the outside.