Should Data Repositories be Open?

  Robin Rice Says: March 5th, 2008 at 4:51 pm

Hi Peter,

    I’ve a question. What happened to the impetus for open data in this abstract? This looks like a useful set of solutions for storing/managing/curating data within research centres but not necessarily for disseminating or publishing that data. Repository services could play a role with that, by packaging up some of those long tail datasets and making them accessible now and in the future (after the researchers have moved on to new projects), by using the embargo features that repository software offers to make data available after the date of publication of a paper on which it's based, or by creating metadata records for discovery, with access controlled by the researcher, as you suggest is often necessary.

PMR: Obviously I'm a fan of Open Data and Open Access but I don't take it as axiomatic that all Repositories must be completely Open. The primary purpose (IMO) is that (scientific) repositories preserve information and that they should try to capture all meaningful output from an institution. Much of this is, necessarily, not Open in the first instance. There are, for example, theses (and the data associated with them) that are closed because they contain commercially sensitive information, personally sensitive information, etc., and universities have managed this concern for many years. So it's reasonable that some information may stay closed for a considerable time.

There is also a pragmatic aspect. Many scientists (e.g. in chemistry) would never put their data in an Open repository at the beginning. They fear being scooped (perhaps even by their own colleagues), being banned from publication by publishers who regard this as prior disclosure, or invalidating a patent application. To overcome that we have created an embargo process so that data can be stored and only disseminated later (in our eCrystals meeting with UKOLN and Soton, 3 years was reported as probably tolerable to chemists). I hope that by carefully choosing the protocol it may be possible to lower this time gradually but it takes time and data.

Then - when the data come out of embargo - should they always be Open? I'd say yes, but there may be domain or community norms that militate against that, particularly in fields containing human data.

What is axiomatic, however, is that if we don't capture it at all, then we cannot ever disseminate it, so my emphasis is on capture.

When giving the talk I do not feel bound to the precise topics in the abstract - so I'll probably mention Open Data. What is on my mind at the moment is the critical need to correct the assumption that Institutional Repositories as currently set up will address the data capture problem. They won't - and if they try they will be much less successful than the IRs have been at capturing PDFs or other fulltext. So the need for a new breed of Data Repositories is clear. They will look very different from IRs if they are going to succeed.

2 Responses to Should Data Repositories be Open?

  1. Robin Rice says:

    Thanks for the response to my comments, Peter. Of course you've got a lot of issues in there, and I look forward to hearing the keynote, but on the openness question, yes I think it's undeniable that not all data can or should be made open. I just couldn't help but notice its absence in your abstract to a conference about Open Repositories, and I have heard you credited as the originator of the term Open Data as well.

    It's interesting that despite recent drivers such as the OECD declaration about open access to publicly funded research data, it is not at all generally accepted that data should be made openly available. I've heard Kevin Schurer, the director of the UK Data Archive, argue that "managed access" is preferable (where an account has to be created before access is granted) so that usage can be tracked and the data contributors can know who has downloaded it and for what purpose. Even the Open Data Foundation (not to be confused with the Open Knowledge Foundation) claims that 'open data' is about using open standards and open source software to provide data, not that the data itself is open access.

    But of course there is much utility to be had in making data open access, especially through mashup applications on the web, etc. which can't function behind restricted access websites. In our project DISC-UK DataShare, we're looking for academics at our institutions (Universities of Edinburgh, Southampton, Oxford, and LSE) who are keen to make their data available through open access repositories (or in mashups). We aim to help them create 'public-use' versions of their data that are anonymised, well-documented, etc. In other words, the aim is not so much to provide a space for them to analyse their data and keep working on it as to disseminate it. I am wondering if the 'data repositories' you are describing are repositories or indeed something more akin to Virtual Research Environments, something that would precede data going into an (access) repository? Well, we have to start somewhere, and again, I'm looking forward to seeing your examples for data capture in your keynote.
    Robin Rice
    University of Edinburgh

