Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 

Posts Tagged ‘open data’

Can I data- and Text-mine Pubmed Central?

Saturday, April 5th, 2008
Until last week I had assumed that the NIH policy on access to publicly funded research grants full Open Access rights to anyone in the world. The works will be deposited in Pubmed Central (PubMed Central site). Pubmed Central has its own definition of “open access” and generally uses the phrase “public access” – which is operationally unclear. Last week I learned at Dagstuhl that data- and text-mining of Pubmed Central was blocked by the site itself – delegates had found that there is a maximum of two papers that can be downloaded before the IP address is blocked. I’d very much like clarification (as I have found the NIH sites and elsewhere extremely difficult to navigate on a consistent basis). There is no explicit mention of the right to download material for data-mining and a lot of verbiage about “consistency with publishers’ policies” which is no help to scientists like me.

So – simply – when the flood of public depositions comes on stream after April 7 (obviously with some delay) can I text-mine them? This is important. Biology is in critical need of machine help in reading papers. The bioscience community spends tens of millions of dollars (a figure mentioned at Dagstuhl) on annotating genomes including the ontologies and lexicons. Without this we simply do not understand much of the science being published. It is hugely costly to use humans for this. When George Bush signed the mandate he clearly envisaged that the information should be used for the benefit of human health … and this means text-mining.

So – simply – can I run my robots over the material deposited by mandate?
  1. Yes – without question or fear of reprisal.
  2. No – not at all.
  3. Well – um – err – it depends on each individual paper and each individual publisher and nobody can give a clear answer.
The current answer appears to be 2 (I will be cut off mechanically). I suspect the real answer is 3. Note that although our group has been able to write robots that can understand chemistry we are a long way from understanding publishers’ policies on access (mainly because many are designed to be unhelpful). So it is impossible to do bulk mining as we cannot differentiate publisher policies. Please tell me I am wrong and that it’s really 1. If not, should we not prepare a case to the NIH – they have asked for submissions – asking them to assert that the policy is 1 and to make it clear? Perhaps the Open Knowledge Foundation should create a submission. If the NIH aren’t prepared to do this then the “victory” is only the first step in a long struggle for liberating data.

Open Data: Datument submitted to Elsevier’s Serials Review

Saturday, January 5th, 2008
I have just finished writing an invited article for Serials Review – Elsevier (I’m making an exception and submitting to a closed access publisher because (a) this is a special issue – from the invitation from Connie Foster:
Serials Review (v.30, no.4, 2004) was a focus issue on Open Access. It remains one of the most heavily downloaded issues and articles even now. Open Access remains a “hot topic” and fundamental discussion in scholarly communication. Your names were suggested by either current board members or previous contributors to the Open Access issue. At the time of that publication, editors and authors envisioned revisiting the Open Access environment a few years hence since issues, publisher responses, “experiments,” and government mandates were or are in flux.
PMR: and (b) we are all allowed to retain copyright). [I'll discuss the message later. This post is about the medium. And how today's medium doesn't carry messages very well at all.]

First, to publicly thank Connie Foster for her patience. I warned her that I would not submit a conventional manuscript because I wanted to show what Scientific Data are actually like. And you can’t do that in a PDF, can you? So I asked ahead of time if I could submit HTML. It caused the publisher (Elsevier) a lot of huffing and puffing. The answer seemed to be “yes”, but when I came to submit the manuscript the submission system only accepted dead documents. So I’ve ended up mailing it to Connie.

The document is a datument – a term that Henry Rzepa and I coined about 4 years ago (From Hypermedia to Datuments: Murray-Rust and Rzepa: JoDI). It emphasizes that information should be seamless – not arbitrarily split into “full-text” and “data” because it’s easier for twentieth century publishers. (I return to this in a later post.) The ideal medium for datuments is XML – for example using ICE (Integrated Content Environment) – and that’s why I’m going to visit Peter Sefton and colleagues. But the simple way to create datuments is in valid XHTML. Every editor in the world should now produce XHTML so there is no reason not to do it. It’s a standard. It’s in billions of machines over the world. It’s got everything we need. You see hundreds of examples every day. XHTML manages:
  • images (it’s done this for 15 years)
  • multimedia (also for 15 years)
  • hyperlinks (for 15 years)
  • interactive objects (also for 15 years, though with some scratchy syntax)
  • foreign namespaces (probably about 10 years)
  • vector graphics (SVG) (nearly 10 years)
It also manages STYLES. You don’t have to put the style in the content. You put it in a stylesheet. So my datument doesn’t have styles. Elsevier can add those if it wants. Personally I like reading black text on a white background – I know it’s very old-fashioned, but that’s how I was educated. Also, since it’s in XML you can repurpose it. Extract just the images. Or discard the applet. Or reorganise the order of authors’ names. Or mash it with another paper. Or extract the data. Or… So XHTML is a liberating medium in which to publish while PDF is a dead, restricting and dismal medium.

So having created my manuscript as a standard XHTML hyperdocument – no technology that isn’t at least 10 years old – I try to submit it. Doesn’t work. Publisher doesn’t like HTML. This seems barmy since they actually publish in HTML. I am not prepared to transform the datument into PDF. It destroys the whole point of the article. It would be like publishing a movie as a single snapshot. Or a recording of a song using only a score. So I’ve had to zip it up and send it as email. Which is what we do every day anyway. [In passing - why this elaborate ritual with the publishers' technology? Authors have been producing acceptable manuscripts in HTML for years. Why publish in double-column PDF? I didn't ask for it. It is purely for the benefit of the publishers. To help their branding. (It's not even to make their life easier, as I'll show later because it doesn't).]

So, as a good Open Access advocate I have reposited it in the Cambridge DSpace. DSpace does not deal with hyperdocuments (please tell me I’m wrong). I would have to go through all the documents and find the relative URLs and expand them to the Cambridge DSpace base URL. This, of course, means that the documents are not portable. So I had to reposit a ZIP file. 15 years after the invention of HTML and we cannot reposit HTML hyperdocuments. [UPDATE: I have since found that it does accept HTML so we'll see how it comes out.] [UPDATE2: Yes, it accepts HTML, but no, the links don't work. You have to know the address of each image before you deposit them. Then you have to edit the main paper to make them work. Which means it breaks if you export it. So basically you cannot reposit normal HTML in DSpace and expect it to work.]

So, dear reader, if you are a human, and want to read the file, download the zip file, unzip it, point your browser at it, swear at me when the browser breaks. [UPDATE: Bill says it breaks. I don't understand this.] And, dear reader, if you are a robot you have no option but to ignore it. It’s a zip file. It’s potentially evil. And anyway you wouldn’t know what you were indexing or looking for. So maybe I will give you the top part of the HTML to look at. You won’t see the pictures, but you probably don’t care at this stage, though in a few years you will.

I also tried to reposit it at Nature Precedings. They wouldn’t let me post a zip file. Only DOC, PPT, PDF. Oh dear.
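
The link-fixing chore described above for DSpace (finding each relative URL and expanding it to the repository's base URL) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `absolutize_links`, the regular expression and the example base URL are hypothetical, none of this reflects how DSpace itself behaves, and a real tool would probably use a proper HTML/XML parser rather than a regex.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper: matches src="..." or href="..." attributes
# (single or double quoted). A regex is a simplification; it is not
# a full HTML parser.
_LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:src|href))\s*=\s*(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def absolutize_links(html: str, base_url: str) -> str:
    """Expand relative src/href values in `html` against `base_url`."""

    def _fix(match: re.Match) -> str:
        url = match.group("url")
        # Leave in-page anchors such as "#top" untouched.
        if url.startswith("#"):
            return match.group(0)
        # urljoin returns already-absolute URLs unchanged.
        absolute = urljoin(base_url, url)
        q = match.group("q")
        return f'{match.group("attr")}={q}{absolute}{q}'

    return _LINK_ATTR.sub(_fix, html)


if __name__ == "__main__":
    # The base URL below is an assumed placeholder, not a real deposit.
    page = '<p><img src="fig1.png"/> <a href="data/table.xml">data</a></p>'
    print(absolutize_links(page, "https://repository.example.org/bitstream/1234/"))
```

Note that the rewritten page is only valid at that one base URL, which is exactly the portability problem: export the files elsewhere and every link has to be edited again.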

Open Data: I want my data back!

Friday, January 4th, 2008
Although I am mainly concerned with campaigning for data associated with scholarly publishing to be Open, the term Open Data has also been used in conjunction with personal data “given” or “lent” to third parties (see Open Data – Wikipedia, which contains Jon Bosak’s quote “I want my data back”). Here is a good example, from Paul Miller of Talis, of the problems of getting one’s personal data (and possibly other people’s) back: Scoble, Facebook, Plaxo, open data; time for change?. Excerpts (read the whole post for the details):

 

I am of course talking, like so many others, about Robert Scoble being barred from Facebook for using an as-yet unlaunched capability of Plaxo that clearly and unambiguously breached Facebook’s Terms and Conditions. It all began with a ‘tweet’ from Robert Scoble, about the time that post-holiday blues kicked in for those returning to work this (UK) morning;
“Oh, oh, Facebook blocked my account because I was hitting it with a script. Naughty, naughty Scoble!”
Twitter exploded, closely followed by large chunks of the blogosphere. … Minutiae aside, the whole affair raises a couple of points pertinent to one of the biggest issues for 2008; ownership, portability and openness of data.
  • I want to be able to take my data from a service such as Facebook, and use it somewhere else. That’s what Marc Canter has been arguing forever, along with the AttentionTrust, OpenSocial (to a degree), DataPortability.org and many more. That’s part of the rationale behind all the work we’ve been doing on the Open Data Commons, too. However, whether I want to or not, doing it the way Scoble did is a breach of the terms and conditions of Facebook; terms and conditions to which I – and he – signed up when we chose to use the site. If you don’t like the terms, don’t use the service. It’s as simple as that;
  • Even were I allowed to export ‘my’ data, there’s a fuzzy line between that which is mine and that which isn’t. The fact that I am a Facebook friend with Nova Spivack certainly should be mine to take wherever I choose. The contact details Nova chooses to surface to me as part of that relationship, however? Are they mine to take with me, or his to control where I can surface them? There’s clearly work to do there, although it’s interesting that ‘even’ people such as Tara Hunt are reacting (also on Twitter, of course) with;
“I’m appalled that someone can take my info 2 other networks w/o my permission. Rights belong 2 friends, too.”

PMR: I have no additional comments on this other than to say it’s going to take hard work, forethought to anticipate problems of this sort, and probably a lot of legal work. Kudos to Paul and Talis and their collaborators for helping in these general areas.

 

In science it’s easy. Our data are ours. They don’t belong to Wiley, ACS, Elsevier, Springer. I’ve just finished a paper on this which you should all see shortly.

 

We want our data back.

 

And in future we want to make sure we don’t give away our rights to them. Is that a simple message for 2008?

 

 
