Repositories for Scientific Data (at OR08)

I am honoured to have been asked by Liz Lyon and Les Carr to give the opening keynote at Open Repositories 2008 at Southampton. In our group we have been very actively trying to work out what repositories in lab subjects might look like. It is quite clear that the current approach to institutional repositories, which was designed to manage fulltext, cannot and should not be extended for data. The theme of my presentation will be the need for a complementary, but separate, system of Data Repositories. The Abstract I have submitted is:

“Repositories for Scientific Data”
“Scientists are producing data at an ever increasing rate (“the data deluge”) due to automated instruments, image capture and simulation tools. This holds the promise of “data-driven science” where scientific discovery can be made by linking or mining existing data. The reality is, unfortunately, that almost all this data is lost. Although some publishers welcome data as an adjunct to “fulltext”, many do not and most do not have the domain expertise to store and curate the data. And although “big science” (such as high energy physics, geospatial imaging, genomics and structural biology) can often provide domain repositories (e.g. in bioinformatics) most science (“the long tail”) cannot.
There is an urgent need to address this problem. Current Institutional Repositories (IRs) are geared to storing and disseminating scholarly manuscripts and while some are prepared to accept other digital artefacts the practice is fragmented and does not scale. We need to define “Data Repositories” (DRs) which serve the interests of the scientists directly. This is highly domain-dependent and there is no one-size-fits-all solution. However there are some general principles.
* The DRs must be intimately embedded in the current practice of the scientists – ideally they should be invisible to them.
* They must directly support the scientific effort and be seen as doing so rather than being confused with metrics, business processes, etc.
* The people running them should be physically present in the scientific laboratories (“wearing lab coats”).
It is important not to overcomplicate with unnecessary middleware and metadata. The typical informatics toolset of a scientist includes Word/LaTeX, Excel, and the good old filing system – which with huge storage comes back into its own. Free text indexing tools will do as good a job of creating domain metadata as humans. Many departments are starting to introduce backup systems such as Active Directory, Samba or SVN which satisfy the most important user of the repository – the scientist themself. HTTP/REST is good enough for many departments. These tools are an excellent starting point to engage the scientists and show there is real benefit.
This is a new field and I shall review some of the current approaches, including work from our own group (in chemistry and crystallography). It is critical that prototypes are developed with sustainability in mind. This is difficult (it is rarely possible to get direct grants) but the tools are often well known and easy/free to install. In many cases it may be possible to “hide” the costs of data capture in other accepted activities (“backup”, “publication”, “thesis preparation”, “instrument maintenance”, “analytical services”, etc.). Good prior design is much cheaper than retrofitting “repositories” and can be seen to have an immediate benefit on quality of data, re-use and mashups, speed of thesis preparation, etc. Indeed, if good principles of data management are brought into the teaching and learning process (e.g. in final year projects) then the students themselves will provide much of the innovation and tools.
On the assumption that we can have an Internet connection there will be live demonstrations.

Much of this is due to my colleagues especially Jim Downing. I believe that there has been too much over-engineering and that we should look to simpler approaches based on common tools where possible. In many cases what the scientist wants is a “bit-bucket” where the data can be stored in the knowledge that they won’t be automatically lost when the desktop crashes or they change laptop. Most scientists will not have worked with a versioning system such as SVN and this may be an important productivity tool for managing manuscripts and theses. Access control is an unavoidable necessity (the lab down the corridor may be your worst competitors…) and it highlights the central requirement for any repository system – enough people embedded in the department who can actually fix the glueware on a regular basis. This is the central and costly challenge we have to solve.
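To make the “bit-bucket” idea concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not a description of any system our group runs: the function name `deposit`, the `manifest.tsv` log and the directory layout are all hypothetical. It copies a file into a store (which could sit on a backed-up departmental share) under a name derived from its content hash, so an edited file becomes a new entry rather than overwriting the old one.

```python
import hashlib
import shutil
import time
from pathlib import Path


def deposit(source: Path, store: Path) -> Path:
    """Copy `source` into `store` under a name derived from its content hash.

    Hypothetical helper: identical content is stored only once, and changed
    content gets a new name, so earlier versions are never overwritten.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()

    # Spread files over subdirectories named by the first two hex digits.
    target_dir = store / digest[:2]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest}{source.suffix}"

    if not target.exists():
        shutil.copy2(source, target)  # copy2 also preserves timestamps

    # Append a line to a plain-text manifest (assumed name: manifest.tsv)
    # recording when each original file name was deposited.
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with (store / "manifest.tsv").open("a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{source.name}\t{digest}\n")

    return target
```

A scientist (or a script run by the instrument) would call something like `deposit(Path("spectrum.csv"), Path("/shared/bitbucket"))`, where both paths are made up for the example. This gives none of the access control or versioning history that SVN provides; it only shows how little is needed to stop data being lost when a laptop dies.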
(I’m tagging this as OR08 – I couldn’t find any other suggestion)
UPDATE. After I wrote this I realised I need to say more about RDF and ORE and will do so.


7 Responses to Repositories for Scientific Data (at OR08)

  1. Well, I’m now even more regretful that I can’t be there. I think the program committee chose its keynoter well, and I hope you knock ’em dead!

  2. pm286 says:

    (1) Thanks Dorothea – I am sure we’ll meet one day.

  3. Robin Rice says:

    Hi Peter,
    I’ve a question. What happened to the impetus for open data in this abstract? This looks like a useful set of solutions for storing/managing/curating data within research centres but not necessarily for disseminating or publishing that data. Repository services could play a role with that, either by packaging up some of those long tail datasets and making them accessible now and in the future (after the researchers have moved on to new projects), or by using the embargo features that repository software offers to make data available after the date of publication of a paper on which it’s based, or by creating metadata records for discovery, with access controlled by the researcher, as you suggest is often necessary.

  4. Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Should Data Repositories be Open?

  5. Gary King says:

    I suggest you look at the Dataverse Network project which is a version of what you are interested in. We have tackled these and many other related issues. See DVN is an open source project.
    Gary King
    p.s. thanks to Christian Zimmermann for pointing me here.

  6. Pingback: Science Library Pad

  7. Pingback: Open Repositories 2008 « pintiniblog
