On Monday I shall be talking at Colorado State University on the theme of “Digital preservation of the scientific record” – probably not the precise title. Digital preservation (WP):
“ refers to the management of digital information over time. Unlike the preservation of paper or microfilm, the preservation of digital information demands ongoing attention. This constant input of effort, time, and money to handle rapid technological and organisational advance is considered the main stumbling block for preserving digital information beyond a couple of years. Indeed, while we are still able to read our written heritage from several thousand years ago, the digital information created merely a decade ago is in serious danger of being lost.
Digital preservation can therefore be seen as the set of processes and activities that ensure the continued access to information and all kinds of records, scientific and cultural heritage existing in digital formats.”
This is hard. Let’s assume we actually know what we wish to preserve (I’ll blog about that later). I’ve lost most of my digital past. Every time I moved employers (especially when they kicked me out) I was unable to transfer my record. They no longer maintain it. I have snippets (digital potsherds) cached in various web search engines – I came across one today in Clusty (another future post…). It was from 1996, and carried a Birkbeck address. It probably only exists in the fragmented digital strata – certainly the machine at Birkbeck carrying it is no more.
And, of course, there is context and process. If the bytes are in ASCII and use English that’s a start. But many are binary. And there is much context (metadata) that is lost. There are useful prosthetics – thus I link to Wikipedia wherever possible and I can assume that other named entities in my discourse are trivially discoverable on today’s web. But will today’s web persist?
Many people talk of the sheer volume of data – that’s not the problem – it’s the complexity and interrelatedness. I got a mail yesterday asking:
“I have a C++ program and data
What data repositories are available?”
and I realised I couldn’t answer the question! I know how to archive Open Source software (I use SourceForge), and I know how to archive certain types of bio-data (protein sequences, etc.). I expect that there are repositories that hold genotypes but I doubt that they accept them unless they are coupled to a publication.
So I feel slightly awkward but have to say that we haven’t yet got good solutions. (I got very annoyed at the Glasgow meeting last year on digital scholarship when a smooth vendor of repositories told us how easy it was to put scientific digital objects into their system. And I let the meeting know that I didn’t think it was trivial.)