petermr's blog

A Scientist and the Web


Why PDF is a Hamburger

In a recent comment Chris Rusbridge asks:
April 29th, 2008 at 4:47 pm

I’ve been thinking about a blog post related to your hamburger rants. But the more I try to think it through, the murkier it gets. Is the problem that PDF cannot store the semantic information? I’m not sure, but I’m beginning to suspect maybe not, ie PDF can. Is the problem that the tools that build the PDFs don’t encode the semantic information? Probably. Is the semantic information available in the publisher’s file from which the PDF is built? Possibly to probably, depending on the publisher and their DTD/schema. Is the semantic information available in the author’s file? Probably not to possibly, depending on author tools (I’m not sure what chemists use to write these days; Word would presumably be dire in this respect unless there is a chemistry plug-in; LaTeX can get great results in math and CS, but I’m not sure how semantic, as opposed to display-oriented, the markup is). And even if this were to all happen, does chemistry have the agreed vocabulary, cf the Gene Ontology in bio-sciences, to make the information truly “semantic”? And…

PMR: Thank you Chris. It’s a good time to revisit this. There are several aspects.

(From Wikipedia) The PDF combines three technologies:

  • A sub-set of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.
  • and…

One of the major problems with PDF accessibility is that PDF documents have three distinct views, which, depending on the document’s creation, can be inconsistent with each other. The three views are (i) the physical view, (ii) the tags view, and (iii) the content view. The physical view is displayed and printed (what most people consider a PDF document). The tags view is what screen readers read (useful for people with poor eyesight). The content view is displayed when the document is re-flowed to Acrobat (useful for people with mobility disability). For a PDF document to be accessible, the three views must be consistent with each other.

… so why is this a problem?

First let me dispose of the claim that “PDF is only bad if it’s authored with tools from a Moscow sweat-shop. Good PDF is fit for any purpose”. PDF is concerned with positioning objects on the page for sighted humans to read. Yes, there are the two other views, but they are often inconsistent or impenetrable. Because most of us are sighted the problem does not grate, but for those who are not it can be very difficult. Let’s assume I have the text string “PDF”. In an ordinary text document (including HTML and Word) the “P” comes first, then the “D”, then the “F”. In PDF it’s allowable (and we have found it!) to have the following instructions in the following order:

  1. position the “F” at coordinate (100,200)
  2. position the “D” at coordinate (90.3, 200)
  3. position the “P” at coordinate (81.2, 200)

When drawn out on screen the F would be drawn first, then the D, then the P. The final result would read naturally to the eye, but a speech synthesizer would read the characters in the order “F”, “D”, “P”. I believe that the US government was sufficiently concerned about accessibility that they required Adobe to make alterations to the PDF software so that the characters would be read aloud in the right order. This is the Eric Morecambe syndrome (in response to André Previn telling him he is “playing all the wrong notes”):

I am playing all the right notes, but not necessarily in the right order.
Eric Morecambe
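
The out-of-order “PDF” example above can be sketched in a few lines of Python. This is only an illustration, not how any real PDF library works: the `Glyph` type and both function names are hypothetical, and the sort assumes that all glyphs on one line share exactly the same y coordinate and that text runs left to right.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """One positioned character (a hypothetical, simplified model)."""
    char: str
    x: float
    y: float


def drawing_order(glyphs):
    """Text as a naive reader sees it: in the order the glyphs are painted."""
    return "".join(g.char for g in glyphs)


def reading_order(glyphs):
    """Recover the visual order by sorting on position.

    Assumption: larger y means higher on the page (PDF's default user
    space has its origin at the bottom left), and glyphs on the same
    line have identical y values.
    """
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))
    return "".join(g.char for g in ordered)


# The three instructions from the list above, in the order they occur.
glyphs = [
    Glyph("F", 100.0, 200.0),
    Glyph("D", 90.3, 200.0),
    Glyph("P", 81.2, 200.0),
]

print(drawing_order(glyphs))  # FDP
print(reading_order(glyphs))  # PDF
```

Even this toy case needs geometry to get three letters back in order. Real pages, with columns, superscripts and rotated text, make the guesswork much harder.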

This spills over into all common syntactic constructs. Run-of-the-mill PDF has no concept of a paragraph, a line end or other common constructs. This gets worse with technical documents – you cannot tell where the diagrams or tables are or even if they are diagrams and tables. HTML got it 90% right – it has concepts such as “img”, “table”, “p”. PDF generally does not.
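
By way of contrast, here is a minimal sketch of how little work it takes to ask an HTML document where its paragraphs, tables and images are. It uses only Python's standard `html.parser` module. The `StructureCounter` class, the `count_structure` function and the sample markup are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Count the structural elements the markup declares explicitly."""

    STRUCTURAL = {"p", "table", "img"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser hands us lower-cased tag names.
        if tag in self.STRUCTURAL:
            self.counts[tag] += 1


def count_structure(html):
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Hypothetical sample document.
sample = (
    "<p>Intro</p>"
    "<table><tr><td>1</td></tr></table>"
    '<p>See <img src="fig1.png" alt="Figure 1"></p>'
)

print(count_structure(sample))
```

No comparable one-liner exists for run-of-the-mill PDF, because the page description simply does not say which marks form a table or a figure.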

To reiterate, PDF is a cheap and reliable way of transporting a printed page from one site to another and an inexpensive way of storing pages without paper. Beyond that it gets much less valuable very rapidly.

There’s a general problem with semantic information. If I write “the casus belli is very important” with some of the words in italics, the emphasis tells me that those words carry semantic information. It doesn’t tell me what this information is. You have to guess. Often we cannot guess, or we guess wrong. This type of semantics is also very fragile: if the phrase is cut-n-pasted you’ll probably lose the italics in most systems. If, however, you use HTML and write class="latin" and class="authorEmphasis", the semantics are stated explicitly and are preserved. So HTML can, with care, carry semantics. PDF generally cannot.
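
To make the class-attribute point concrete, here is a small sketch that pulls the labelled phrases back out of marked-up text using Python's standard `html.parser`. Several things here are assumptions for illustration: the `span` elements, the choice of “very” as the author-emphasised word, and the `ClassTextExtractor` name. The parser also assumes every element has a closing tag, which holds for this sample but not for HTML in general.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect (class, text) pairs for text inside class-labelled elements."""

    def __init__(self):
        super().__init__()
        self._stack = []     # class of each open element, None if it has none
        self.labelled = []   # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Record text only when the innermost open element carries a class.
        if self._stack and self._stack[-1]:
            self.labelled.append((self._stack[-1], data))


# Hypothetical markup for the example sentence.
html = (
    '<p>the <span class="latin">casus belli</span> is '
    '<span class="authorEmphasis">very</span> important</p>'
)

extractor = ClassTextExtractor()
extractor.feed(html)
extractor.close()
print(extractor.labelled)
```

The reason for each emphasis survives as data: a program can tell the Latin phrase from the author's stress without guessing, and it still can after the text has been restyled.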

To answer your other points rapidly (I will come back to them in more detail): I used to think Word was dire. Word 2007 has changed that. It can be used as an XML authoring tool. Not always easily, but it preserves basic semantics. And as for a chemical plugin to Word…

…I’ve run out of time :-)
