(From: Carol Jackson [..email ..] [via Jim Downing]
Subject: Latest DPC Technology Watch Report – ‘PDF should be used to preserve information for the future’. To: DIGITAL-PRESERVATION@jiscmail.ac.uk)
PDF should be used to preserve information for the future
Good news: the already popular PDF file format, adopted by consumers and business alike, is one of the most logical formats to preserve today’s electronic information for tomorrow.
According to the latest report released today by the Digital Preservation Coalition (DPC), Portable Document Format (PDF) is one of the best file formats to preserve electronic documents and ensure their survival for the future. This announcement will allow information officers to follow a standardised approach for preserving electronic documents.
Information management and long-term preservation are major issues facing consumers and businesses in the 21st century. This report is one of a series in which the DPC aims to think about and address the challenges facing us.
This report reviews PDF and the newly introduced PDF/Archive (PDF/A) format as a potential solution to the problem of long-term digital preservation. It suggests adopting PDF/A for archiving electronic documents, as the standard will help preservation and retrieval in the future. It concludes that this can only succeed when combined with a comprehensive records management programme and formally established records procedures.
Betsy Fanning, author of the report and director of standards at AIIM, comments, “A standardised approach to preserving electronic documents would be a welcome development for organisations. Without this we could be walking blindly into a digital black hole.”
The National Archives works closely with the DPC on issues surrounding digital preservation and will continue to do so. Adrian Brown, head of digital preservation at The National Archives, said: “This report highlights the challenges we all face in a digital age. Using PDF/A as a standard will help information officers ensure that key business data survives. But it should never be viewed as the Holy Grail. It is merely a tool in the armoury of a well thought out records management policy.”
The report is a call to action: organisations need to act now and look hard at their information policies and procedures to anticipate the demand for their content (documents and records) in the future. Everybody has different criteria, types and uses for documentation, so you need to find an approach that works for your organisation.
If you would like to read the full report please go to the Digital Preservation Coalition website. This can be accessed here: www.dpconline.org/graphics/reports/index.html#twr0802
PMR: I am not an expert in digital curation and am reluctant to criticize a body devoted to it. I am sure that they know in great detail how difficult it is to extract information from PDF, whatever the version. We’ve been looking at theses – bitmapped, born digital etc. and PDF is vastly more difficult than Word for information extraction. Vastly. Our programs such as OSCAR can read documents in Word but lose much of the information when they try to read PDF.
So yes, I can see that PDF is useful for preservation. Whether it’s better than XML I doubt. I’d like to see the argument. Whether PDF is any use after it’s been preserved is much less clear. Yes, if the document is pored over by human scholars. We’d hate to lose Shakespeare or similar.
But there are 1,000,000 scientific articles per year (give or take a bit), and 15 million abstracts in PubMed. Assuming they are preserved in PDF, how can we currently make full sense of them? If they were also in XML, HTML, Word, or LaTeX we’d be able to index them. It’s not that we cannot index PDF at all; it’s just that we lose much more information.
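To make the difference concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the element names (`compound`, `temperature`) are invented rather than taken from any real schema, and the "PDF" string is a hand-written stand-in for what a text extractor might return, not the output of any particular tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup for a fragment of an article.
# The element names are invented for this example.
XML_SOURCE = """<article>
  <title>Synthesis of aspirin</title>
  <p>We treated <compound>salicylic acid</compound> with
  <compound>acetic anhydride</compound> at
  <temperature unit="C">85</temperature>.</p>
</article>"""

# The same fragment as flat text with layout line breaks and
# end-of-line hyphenation, as text recovered from a PDF often looks.
PDF_TEXT = (
    "Synthesis of aspirin\n"
    "We treated salicylic acid with acetic anhy-\n"
    "dride at 85 C."
)


def index_xml(source):
    """Build a structured index from the marked-up version."""
    root = ET.fromstring(source)
    return {
        "title": root.findtext("title"),
        # Normalise internal whitespace in each compound name.
        "compounds": [" ".join(c.text.split()) for c in root.iter("compound")],
        "temperatures": [
            (float(t.text), t.get("unit")) for t in root.iter("temperature")
        ],
    }


def index_plain_text(text):
    """Without markup, the simplest index is a set of lowercased words."""
    return sorted({word.strip(".,").lower() for word in text.split()})


if __name__ == "__main__":
    print(index_xml(XML_SOURCE))
    print(index_plain_text(PDF_TEXT))
```

The XML index knows which strings are compound names and that 85 is a temperature in degrees C. The flat-text index has only words, and the hyphenated line break has split "anhydride" into two tokens that a search for the whole word would miss. Real extractors can repair some of this with heuristics, but the markup makes the guesswork unnecessary.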
So I’m not arguing that PDF shouldn’t be used. But please please use a semantic format as well. And think about re-use as well as preservation.