Chris Rusbridge (Director of the Digital Curation Centre) has added another thoughtful comment which has helped me clarify my ideas.
Chris Rusbridge Says:
April 30th, 2008 at 5:00 pm
Well Peter, thanks for amplifying your answer, but I still think you miss the point. Despite your “First let me dispose of…” comment above, much of your post says that many PDF files are badly structured. I absolutely agree, but it wasn’t the point I was trying to make (and anyway it’s true of most other formats; I’m staggered how few people use Word styles properly, for instance, yet without them many simple transformations become impossible).
PMR: I agree – we needn’t pursue that point further.
What I’m asking is:
IF we have good authoring tools (preferably able to tag scientific information semantically)
AND we have a good publishing workflow that preserves and perhaps standardises that semantic information
AND we have good tools for converting the in-process document to PDF, XHTML etc
THEN COULD the PDF contain semantic information, in the form of metadata, tags, RDFa, microformats, etc?
PMR: Potentially yes, but in practice I think we have missed the opportunity and it won’t happen. There may yet be technical and social advances that prove me wrong. The baseline is that we need structured documents which can act as compound documents and to which semantics can be added. And I agree that we can separate the creation of semantic information from the final object (at least in most cases). And that in many cases it isn’t easy to create the semantic information. And it is conceivable that this could be done with PDF – it is a compound document format and it can hold metadata (DC, MPEG, etc.). But it is neither the best format for this, nor is it widely used for it, so it would require a great deal of effort to satisfy me on those two points.
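To be concrete about what “holding metadata” amounts to in practice, here is a rough sketch using PDFBox (of which more below). The calls are from a later Apache PDFBox 2.x-style API, and the file and field names are purely illustrative, so treat it as a sketch rather than a recipe. The point to notice is that everything hangs off the document as a whole – there is no natural place to say “this paragraph is an experimental section” or “this string is a molecule”.

```java
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

// Sketch only: Apache PDFBox 2.x-style API; the file names and field
// values are illustrative placeholders.
public class AddPdfMetadata {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("thesis.pdf"))) {
            PDDocumentInformation info = doc.getDocumentInformation();

            // Standard document-level fields (roughly Dublin-Core-ish).
            info.setTitle("Synthesis and characterisation of ...");
            info.setAuthor("A. N. Author");
            info.setSubject("chemistry; crystallography");

            // Arbitrary key/value pairs can be added, but they still
            // describe the whole document, not a paragraph, a table or
            // an individual chemical structure within it.
            info.setCustomMetadataValue("dc:rights", "CC-BY");

            doc.save(new File("thesis-with-metadata.pdf"));
        }
    }
}
```

That is fine for a library catalogue record; it is not fine-grained semantics.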
In SPECTRaT we looked at this and had heated discussions about whether we could make PDF/A work as a container document. Henry Rzepa has made several attempts to use PDF as a container for chemistry. I don’t think it’s going to fly. And we came to the strong conclusion that if we wanted to involve machines in helping us get useful information out of theses, then PDF – in its varieties – doesn’t help.
(Or conversely, if we don’t use decent authoring tools, don’t care about encoding the semantic information in the first place, don’t care about document structure, and use cobbled-together publishing systems ignoring standard DTDs, what does it matter if we use manky Russian PDF converters, since there’s no semantic information there anyway!)
PMR: That’s a pragmatic approach. It keeps scientific information in the C20.
… and if PDF cannot currently contain semantic information at a fine enough level, what would need to be added to make it possible? But there’s no point going down this route (trying to make PDF better for science) if the earlier parts of the workflow make sustaining semantic information too hard.
PMR: As you know, I favour XML documents. These have many advantages over PDF – more below.
BTW, from the PDF/A standard (ISO 19005-1): “The future use of, and access to, these objects depends upon maintaining their visual appearance as well as their higher-order properties, such as the logical organization of pages, sections, and paragraphs, machine recoverable text stream in natural reading order, and a variety of administrative, preservation and descriptive metadata.” So a conforming PDF/A file should at least be able to recover text in the natural order… I do suspect PDF metadata is at the wrong level of granularity, although the PDF reference says it can apply to various levels of object, but I don’t know…
PMR: The fundamental problem is that PDF is a graphically oriented language; scholarly work is only a small part of what it is used for. It’s perfectly OK in PDF to have text reading vertically, in huge or tiny fonts, with arbitrary graphics strokes and primitives. It has no sense of the separation of content from presentation that is the key design principle of XML.
It is extremely difficult to read the average PDF. For a start, it’s often encrypted or compressed, so you cannot read it without bespoke software. In contrast, XML documents are designed to be readable by humans without special software – and I often do read them that way. It’s not fun, but it’s perfectly possible to read MathML or CML in a normal text editor.
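To see what the “machine recoverable text stream” of the PDF/A quote above buys you with free software, here is a sketch using the best of those free tools, PDFBox (more on it in a moment). The API shown is the later Apache PDFBox 2.x style, so the exact calls are illustrative and the file name is a placeholder. What comes back is a single flat string: the reading order may well be right, but nothing marks where a section, paragraph or table begins.

```java
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Sketch only: Apache PDFBox 2.x-style API; "thesis.pdf" is a placeholder.
public class ExtractTextStream {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("thesis.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Ask for text in approximate reading order rather than the
            // order the drawing operators happen to appear in the file.
            stripper.setSortByPosition(true);
            String text = stripper.getText(doc);

            // 'text' is just a flat character stream: no sections, no
            // paragraphs, no tables - only whatever line breaks the
            // heuristics decided to insert.
            System.out.println(text);
        }
    }
}
```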
There are virtually no useful freely available tools for PDF, while there are zillions for XML. Every computer on the planet now has an XML DOM and SAX engine (I take some credit for the latter). For PDF there are heroes such as Ben Litchfield, who created PDFBox, and I’ve worked a lot with it, but in general the problem of reading PDF into a machine is horrendous. For example: where does one paragraph end and the next one start? This is impossible to determine in PDF – there is no concept of a paragraph. There isn’t even a concept of a word; there are only heuristics saying that if one character is sufficiently close to another on the screen then it’s probably part of the same word. While I accept that there may – somewhere – be expensive PDF tools that can add information on where the words and paragraphs are in a document, and expensive tools that can decode this, there is nothing in general use. In contrast, every teenage web-hacker can use HTML to define where words start and end and where the paragraphs are.
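To show what I mean by heuristics, here is a sketch of what the machine actually sees when PDFBox reads a page: a stream of glyphs with coordinates. (Again this is the later Apache PDFBox 2.x-style API and the input file is a placeholder.) Whether two glyphs belong to the same word, or two lines to the same paragraph, has to be guessed from these numbers, because the file itself says nothing about it.

```java
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Sketch only: Apache PDFBox 2.x-style API; "paper.pdf" is a placeholder.
public class ShowGlyphPositions extends PDFTextStripper {

    public ShowGlyphPositions() throws Exception {
        super();
    }

    // PDFBox hands us runs of characters together with their page
    // coordinates - that is all the structure PDF itself provides.
    @Override
    protected void writeString(String text, List<TextPosition> positions) {
        for (TextPosition tp : positions) {
            System.out.printf("'%s' at x=%.1f y=%.1f%n",
                    tp.getUnicode(), tp.getXDirAdj(), tp.getYDirAdj());
        }
        // Whether a gap between two glyphs is "a space", and whether a
        // vertical jump is "a new paragraph", must be inferred from these
        // numbers; nothing in the file says so.
    }

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("paper.pdf"))) {
            new ShowGlyphPositions().getText(doc);
        }
    }
}
```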
Maybe some people don’t mind not being able to determine word boundaries, paragraphs, tables, graphics, lists and so on. But scientists have to care. XHTML solves all these problems: it’s not perfect, but with microformats it can be made into a very sophisticated approach. It’s universal, free and fit for purpose.
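And the contrast with XHTML: because the markup says explicitly where each paragraph begins and ends – and, with microformats, where each data item sits – a few lines of completely standard code can walk the structure. This is only a sketch (the file name and the microformat class name are invented for illustration), but the SAX API it uses ships with every Java runtime.

```java
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: "paper.xhtml" and the class attribute value are
// illustrative; the SAX API itself is standard JAXP.
public class CountParagraphs {
    public static void main(String[] args) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            int paragraphs = 0;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Paragraph boundaries are explicit in the markup...
                if ("p".equals(localName) || "p".equals(qName)) {
                    paragraphs++;
                }
                // ...and microformat-style class attributes can mark
                // fine-grained items, e.g. class="chem-compound".
            }

            @Override
            public void endDocument() {
                System.out.println("paragraphs: " + paragraphs);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new java.io.File("paper.xhtml"), handler);
    }
}
```

No coordinates, no guessing: the structure is simply there.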