petermr's blog

A Scientist and the Web

 

Why PDF is a Hamburger

In a recent comment Chris Rusbridge asks:
April 29th, 2008 at 4:47 pm

I’ve been thinking about a blog post related to your hamburger rants. But the more I try to think it through, the murkier it gets. Is the problem that PDF cannot store the semantic information? I’m not sure, but I’m beginning to suspect maybe not, ie PDF can. Is the problem that the tools that build the PDFs don’t encode the semantic information? Probably. Is the semantic information available in the publisher’s file from which the PDF is built? Possibly to probably, depending on the publisher and their DTD/schema. Is the semantic information available in the author’s file? Probably not to possibly, depending on author tools (I’m not sure what chemists use to write these days; Word would presumably be dire in this respect unless there is a chemistry plug-in; LaTeX can get great results in math and CS, but I’m not sure how semantic, as opposed to display-oriented, the markup is). And even if this were to all happen, does chemistry have the agreed vocabulary, cf the Gene Ontology in bio-sciences, to make the information truly “semantic”? And…

PMR: Thank you Chris. It’s a good time to revisit this. There are several aspects.

(From Wikipedia) The PDF combines three technologies:

  • A sub-set of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.
  • and…

One of the major problems with PDF accessibility is that PDF documents have three distinct views, which, depending on the document’s creation, can be inconsistent with each other. The three views are (i) the physical view, (ii) the tags view, and (iii) the content view. The physical view is displayed and printed (what most people consider a PDF document). The tags view is what screen readers read (useful for people with poor eyesight). The content view is displayed when the document is re-flowed to Acrobat (useful for people with mobility disability). For a PDF document to be accessible, the three views must be consistent with each other.

… so why is this a problem?

First let me dispose of the argument that “PDF is only bad if it’s authored with tools from a Moscow sweat-shop. Good PDF is fit for any purpose”. PDF is concerned with positioning objects on the page for sighted humans to read. Yes, there are the two other views, but they are often inconsistent or impenetrable. Because most of us are sighted the problem does not grate, but for those who are not it can be very difficult. Let’s assume I have the text string “PDF”. In a normal ASCII document (including HTML and Word) the “P” comes first, then the “D”, then the “F”. In PDF it’s allowable (and we have found it!) to have the following instructions in the following order:

  1. position the “F” at coordinate (100,200)
  2. position the “D” at coordinate (90.3, 200)
  3. position the “P” at coordinate (81.2, 200)

When drawn out on screen the F would be drawn first, then the D, then the P. The final result would read naturally, but a speech synthesizer would read the characters out in the order “F”, “D”, “P”. I believe that the US government was sufficiently concerned about accessibility that they required Adobe to make alterations to the PDF software so that the characters would be read aloud in the right order. This is the Eric Morecambe syndrome (in response to André Previn telling him he is “playing all the wrong notes”):

I am playing all the right notes, but not necessarily in the right order.
Eric Morecambe
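
The out-of-order “PDF” example above can be sketched in a few lines of Python. This is only an illustration, not how any real PDF library works: the Glyph record and both functions are hypothetical names, and the coordinates are the ones from the list above. It shows that reading the instructions in stream order gives the wrong string, while sorting by position recovers the visual order for a single left-to-right line.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical record for one positioned character on a page."""
    char: str
    x: float
    y: float


# The drawing instructions in the order they appear in the file.
stream_order = [
    Glyph("F", 100.0, 200.0),
    Glyph("D", 90.3, 200.0),
    Glyph("P", 81.2, 200.0),
]


def naive_text(glyphs):
    """Concatenate characters in stream order, as a naive extractor would."""
    return "".join(g.char for g in glyphs)


def visual_text(glyphs):
    """Guess reading order: top of page first (larger y), then left to right.

    This is a heuristic that assumes simple left-to-right text on
    well-separated lines; it is not a general solution.
    """
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))
    return "".join(g.char for g in ordered)


print(naive_text(stream_order))   # FDP
print(visual_text(stream_order))  # PDF
```

The sort is itself guesswork: it works here because all three glyphs sit on one line, but columns, tables and rotated text would defeat it, which is the point being made.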

This spills over into all common syntactic constructs. Run-of-the-mill PDF has no concept of a paragraph, a line end or other common constructs. This gets worse with technical documents – you cannot tell where the diagrams or tables are or even if they are diagrams and tables. HTML got it 90% right – it has concepts such as “img”, “table”, “p”. PDF generally does not.

To reiterate, PDF is a cheap and reliable way of transporting a printed page from one site to another and an inexpensive way of storing pages without paper. Beyond that it gets much less valuable very rapidly.

There’s a general problem with semantic information. If I write “the casus belli is very important” the emphasis (italics) tells me that the words carry semantic information. It doesn’t tell me what this information is. You have to guess. Often we cannot guess, or we guess wrong. This type of semantics is very fragile – if the phrase is cut-n-pasted you’ll probably lose the italics in most systems. If, however, you use HTML and write class="latin" and class="authorEmphasis" you immediately see that the semantics are preserved. So HTML can, with care, carry semantics. PDF generally cannot.
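
As a rough sketch of what “preserved” means here, the following Python uses the standard library's html.parser to pull the class names back out of a marked-up version of the sentence. The snippet itself is an assumption: the post does not say exactly which words carried which class, so “casus belli” is taken as the Latin phrase and “very” as the author's emphasis, and ClassTextCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect (class, text) pairs for text inside elements that have a class.

    Minimal sketch: it assumes well-nested markup with no void elements
    such as <img> or <br>, which would unbalance the simple stack.
    """

    def __init__(self):
        super().__init__()
        self._stack = []  # class value (or None) for each open element
        self.found = []   # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing element with a class.
        for cls in reversed(self._stack):
            if cls is not None:
                self.found.append((cls, data))
                break


# Assumed markup for the example sentence.
snippet = (
    '<p>the <span class="latin">casus belli</span> is '
    '<span class="authorEmphasis">very</span> important</p>'
)

collector = ClassTextCollector()
collector.feed(snippet)
collector.close()
print(collector.found)
# [('latin', 'casus belli'), ('authorEmphasis', 'very')]
```

Both phrases would render as italics, yet a program can still tell why each one is italic, and that distinction survives cut-and-paste of the markup.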

To answer your other points rapidly (I will come back to them in more detail): I used to think Word was dire. Word 2007 has changed that. It can be used as an XML authoring tool. Not always easily, but it preserves basic semantics. And as for a chemical plugin to Word…

…I’ve run out of time :-)

2 Responses to “Why PDF is a Hamburger”

  1. Chris Rusbridge says:

    Well Peter, thanks for amplifying your answer, but I still think you miss the point. Despite your “First let me dispose of…” comment above, much of your post says that many PDF files are badly structured. I absolutely agree, but it wasn’t the point I was trying to make (and anyway it’s true of most other formats; I’m staggered how few people use Word styles properly, for instance, yet without them many simple transformations become impossible).

    What I’m asking is:

    IF we have good authoring tools (preferably able to tag scientific information semantically)

    AND we have a good publishing workflow that preserves and perhaps standardises that semantic information

    AND we have good tools for converting the in-process document to PDF, XHTML etc

    THEN COULD the PDF contain semantic information, in the form of metadata, tags, RDFa, microformats, etc?

    (Or conversely, if we don’t use decent authoring tools, don’t care about encoding the semantic information in the first place, don’t care about document structure, use cobbled-together publishing systems ignoring standard DTDs, what does it matter if we use manky Russian PDF converters since there’s no semantic information there anyway!)

    … and if PDF cannot currently contain semantic information at a fine enough level, what would need to be added to make it possible? But there’s no point going down this route (try to get PDF better for science) if the earlier parts of the workflow make sustaining semantic information too hard.

    BTW, from the PDF/A standard (ISO 19005-1): “The future use of, and access to, these objects depends upon maintaining their visual appearance as well as their higher-order properties, such as the logical organization of pages, sections, and paragraphs, machine recoverable text stream in natural reading order, and a variety of administrative, preservation and descriptive metadata.” So a conforming PDF/A file should at least be able to recover text in the natural order… I do suspect PDF metadata is at the wrong level of granularity, although PDF reference says it can apply to various levels of object, but I don’t know…

  2. The authoring tools really are the key here. I’m less worried about the filetype (provided it can work) as long as the tools can be built to make the authoring process work. But as Peter has said, the work on ICE appears to me to be the most interesting in the space.
