Open Data: Datument submitted to Elsevier's Serials Review

I have just finished writing an invited article for Serials Review – Elsevier. I’m making an exception and submitting to a closed access publisher because (a) this is a special issue – from the invitation from Connie Foster:

*Serials Review*
Serials Review (v.30, no.4, 2004) was a focus issue on Open Access. It remains one of the most heavily downloaded issues and articles even now. Open Access remains a “hot topic” and fundamental discussion in scholarly communication. Your names were suggested by either current board members or previous contributors to the Open Access issue.
At the time of that publication, editors and authors envisioned revisiting the Open Access environment a few years hence since issues, publisher responses, “experiments,” and government mandates were or are in flux.

PMR: and (b) we are all allowed to retain copyright.
[I’ll discuss the message later. This post is about the medium. And how today’s medium doesn’t carry messages very well at all.]
First to publicly thank Connie Foster for her patience. I warned her that I would not submit a conventional manuscript because I wanted to show what Scientific Data are actually like. And you can’t do that in a PDF, can you?
So I asked ahead of time if I could submit HTML. It caused the publisher (Elsevier) a lot of huffing and puffing. The answer seemed to be “yes”, but when I came to submit the manuscript the submission system only accepted dead documents. So I’ve ended up mailing it to Connie.
The document is a datument – a term that Henry Rzepa and I coined about 4 years ago (From Hypermedia to Datuments: Murray-Rust and Rzepa: JoDI). It emphasizes that information should be seamless – not arbitrarily split into “full-text” and “data” because it’s easier for twentieth century publishers. (I return to this in a later post). The ideal medium for datuments is XML – for example using ICE (Integrated Content Environment) and that’s why I’m going to visit Peter Sefton and colleagues.
But the simple way to create datuments is in valid XHTML. Every editor in the world should now produce XHTML so there is no reason not to do it. It’s a standard. It’s in billions of machines over the world. It’s got everything we need. You see hundreds of examples every day.
XHTML manages:

images (it’s done this for 15 years)
multimedia (also for 15 years)
hyperlinks (for 15 years)
interactive objects (also for 15 years, though with some scratchy syntax)
foreign namespaces – probably about 10 years
vector graphics (SVG) nearly 10 years

It also manages STYLES. You don’t have to put the style in the content. You put it in a stylesheet. So my datument doesn’t have styles. Elsevier can add those if it wants. Personally I like reading black text on a white background – I know it’s very old-fashioned, but that’s how I was educated.
Also, since it’s in XML you can repurpose it. Extract just the images. Or discard the applet. Or reorganise the order of author’s names. Or mash it with another paper. Or extract the data. Or…
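As a rough sketch of what this kind of repurposing could look like, the snippet below parses a small XHTML document with Python's standard library, lists the image sources and discards an applet. The sample document, its file names and the helper names `image_sources` and `discard_applets` are invented for illustration; they are not taken from the actual datument.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"x": XHTML_NS}

# Hypothetical datument, invented for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example datument</title></head>
  <body>
    <p>Some text <img src="figures/spectrum.png" alt="spectrum"/></p>
    <div><applet code="Viewer.class" width="200" height="200"/></div>
    <p>More text <img src="figures/molecule.png" alt="molecule"/></p>
  </body>
</html>"""


def image_sources(root):
    """Return the src attribute of every XHTML img element, in document order."""
    return [img.get("src") for img in root.iterfind(".//x:img", NS)]


def discard_applets(root):
    """Remove every XHTML applet element and return how many were removed.

    Note: text directly following a removed applet (its tail) is dropped too,
    which is acceptable for a sketch but not for production use.
    """
    removed = 0
    applet_tag = f"{{{XHTML_NS}}}applet"
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == applet_tag:
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
print(image_sources(root))
print(discard_applets(root))
```

This only works because the document is well-formed XML; the same few lines would not be possible against a PDF.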
So XHTML is a liberating medium in which to publish, while PDF is a dead, restricting and dismal medium. So having created my manuscript as a standard XHTML hyperdocument – no technology that isn’t at least 10 years old – I try to submit it. Doesn’t work. Publisher doesn’t like HTML. This seems barmy since they actually publish in HTML.
I am not prepared to transform the datument into PDF. It destroys the whole point of the article. It would be like publishing a movie as a single snapshot. Or a recording of a song using only a score. So I’ve had to zip it up and send it as email. Which is what we do every day anyway.
[In passing – why this elaborate ritual with the publishers’ technology? Authors have been producing acceptable manuscripts in HTML for years. Why publish in double-column PDF? I didn’t ask for it. It is purely for the benefit of the publishers. To help their branding. (It’s not even to make their life easier, as I’ll show later because it doesn’t).]
So, as a good Open Access advocate I have reposited it in the Cambridge DSpace. DSpace does not deal with hyperdocuments (please tell me I’m wrong). I would have to go through all the documents and find the relative URLs and expand them to the Cambridge DSpace base URL. This, of course, means that the documents are not portable. So I had to reposit a ZIP file. 15 years after the invention of HTML and we cannot reposit HTML hyperdocuments.
[UPDATE: I have since found that it does accept HTML so we’ll see how it comes out. ]
[UPDATE2: Yes, it accepts HTML, but no the links don’t work. You have to know the address of each image before you deposit them. Then you have to edit the main paper to make them work. Which means it breaks if you export it. So basically you cannot reposit normal HTML in DSpace and expect it to work.]
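The link editing described above can be sketched in a few lines of Python, assuming (hypothetically) that every deposited file ends up under one known base URL. The base URL, the sample document and the helper name `absolutize_links` are all made up for illustration; a real repository may give each file its own address, in which case a per-file lookup would be needed instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical repository base URL; not a real DSpace address.
BASE_URL = "https://repository.example.org/bitstream/0000/0000/"

# Hypothetical document with two relative links and one absolute link.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="data/table1.xml">data</a>
       <img src="figures/spectrum.png" alt="spectrum"/></p>
    <p><a href="http://www.w3.org/">absolute link</a></p>
  </body>
</html>"""


def absolutize_links(root, base_url):
    """Expand relative href/src values against base_url; return the number changed."""
    changed = 0
    for element in root.iter():
        for attribute in ("href", "src"):
            value = element.get(attribute)
            if value is None:
                continue
            absolute = urljoin(base_url, value)
            if absolute != value:  # already-absolute links are left alone
                element.set(attribute, absolute)
                changed += 1
    return changed


# Serialise XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)
root = ET.fromstring(SAMPLE)
print(absolutize_links(root, BASE_URL))
print(ET.tostring(root, encoding="unicode"))
```

The rewritten document now points at one server only, which is exactly the loss of portability complained about here: export it somewhere else and every link has to be edited again.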
So, dear reader, if you are a human, and want to read the file, download the zip file, unzip it, point your browser at it, swear at me when the browser breaks.
[UPDATE: Bill says it breaks. I don’t understand this.]
And, dear reader, if you are a robot you have no option but to ignore it. It’s a zip file. It’s potentially evil. And anyway you wouldn’t know what you were indexing or looking for. So maybe I will give you the top part of the HTML to look at. You won’t see the pictures, but you probably don’t care at this stage, though in a few years you will.
I also tried to reposit it at Nature Precedings. They wouldn’t let me post a zip file. Only DOC, PPT, PDF. Oh dear.

14 Responses to Open Data: Datument submitted to Elsevier's Serials Review

bill says:

January 5, 2008 at 1:07 pm

It didn’t break my browser but it didn’t work right either — most of the actual data were inaccessible.

pm286 says:

January 5, 2008 at 1:55 pm

(1) I am trying to upload to DSpace under HTML but every time it gets near the end it crashes. So much for “one-click” repositories. It’s taken me 45 minutes.

Peter Sefton says:

January 5, 2008 at 9:49 pm

Your DSpace seems to be offline at the moment. I also tried resolving the handle through handle.net and that timed out too.

pm286 says:

January 6, 2008 at 1:32 am

(3) DSpace has gone walkabout today. I wonder whether I crashed it or just a coincidence.

Dorothea Salo says:

January 6, 2008 at 5:03 am

Bwa-hahahahaha. DSpace is wonderfully perverse. It always goes kaboom at times of maximum inconvenience. If we’re ever at a pub together, ask me about the day I took off work to go help out the Library of Congress. (Then duck.)
DSpace can take HTML, but with some caveats — internal links need to be relative rather than absolute, is the main one. I typically do a bit more data-massage than that even, but so it goes.

pm286 says:

January 6, 2008 at 11:07 am

(5) DSpace is back online at:
http://www.dspace.cam.ac.uk/bitstream/1810/194892/1/opendata.html
but the links don’t work and never will.
(1) Bill – can you tell me the symptoms?

alf says:

January 6, 2008 at 2:59 pm

Why don’t you host it on your own server?

pm286 says:

January 6, 2008 at 7:02 pm

(8) We will at some stage – when I get in on Monday, probably

bill says:

January 6, 2008 at 11:54 pm

I didn’t keep the unzipped or zip files, but what seemed to happen was that the html document opened fine but couldn’t call up any of the images or interactive stuff from the local folder. If you put the link back up I can re-download it and be more specific.

pm286 says:

January 7, 2008 at 12:01 am

(10) Bill,
doesn’t http://www.dspace.cam.ac.uk/handle/1810/194890
work?

Jim Downing says:

January 7, 2008 at 9:57 am

(3): DSpace@Cambridge don’t use Handle, but decided to commit to their URLs instead.

bill says:

January 7, 2008 at 9:58 am

I am an idiot twice over:
1. that link works fine; and
2. when I move the unzipped folder out of the temp directory, everything works. (That’s irritating, since if it’s on the damn computer it’s on the damn computer — but I’ve seen this before, where things have to be moved out of the temp directory to work. I hate Windows.)

Jill says:

January 7, 2008 at 3:59 pm

So essentially, the real problem is not twentieth century publishing practices, but the software industry (term encompasses both browser developers as well as Microsoft). You wonder why publishers don’t accept material in this format? You are blaming the wrong parties in this. The technology has progressed but imperfectly and with great gaps in what can be readily managed. Publishers are supporting multiple formats and re-purposing content within the constraints of the demands on any institution or enterprise. They may not be without blame certainly, but they aren’t the villains of the piece. In fact, there are no real villains, just a variety of players.