Repositories and Scientific Data (for OR2008)

I have been invited to give a keynote lecture at Open Repositories 2008 (see the programme – about 25% down) and have chosen the title “Repositories for Scientific Data”. I’d value help from the repositarian blogosphere and elsewhere.
My thesis is that the current approach for Institutional Repositories will not translate easily to the capture of scientific data and related research output. In some fields of “big science” (e.g. High Energy Physics) the problem is or will be solved by the community and their funders, and institutions have effectively no role. However much – probably most – science is done in labs, which are the primary unit of allegiance. Typical disciplines are chemistry, materials science, biochemistry, cell biology, neuroscience, etc. These labs are often focussed on local and short-term issues rather than long-term archival, dissemination of data to the community, etc. Typical worries are:

  • My grad student has just left without warning – can I find her spectra?
  • How can we rerun the stats that our visitor last year did for us?
  • My laptop has just crashed and I’ve lost all the images from the microscope
  • My chosen journal had to retract papers due to recent scientific malpractice. Now they want me to send them all my supporting data to prove I have adopted correct procedures. This will take me an extra month to retype in their format.

If we are to capture and preserve science we have to do it to support the scientist, not because the institution thinks it is a good idea (even if it is a good idea). So we have to embed the data capture directly into the laboratory. Of course in many cases there is a key role for the Department, particularly when – as in chemistry – there is a huge investment in analytical services (crystallography, spectroscopy, computing).
I am developing this theme for the presentation and would be very grateful for anecdotal or other information as to where the institution or department has developed a data capture system which ultimately feeds into medium-term (probably Open) preservation. Two emerging examples are Monash, which has acquired a petabyte for storage of University scientific data and will layer a series of access mechanisms (SVN, Active Directory, Samba, RDB, SRB, etc.) on top of it, and Oxford, which has recently announced a Data Repository.
If you have material that will help give a balanced picture of data reposition in institutions I’d be grateful for email (or comments on the blog, but I’ll be offline for a few days from Monday). I’m aware that some disciplines have domain repositories independently of institutions (e.g. HEP, bio-sequences, genes, structures, etc. and David Shotton’s image repository for biology) – I’m after cases where the institution has invested in departmental or lab facilities which are actually being used.
Many thanks in advance.

2 Responses to Repositories and Scientific Data (for OR2008)

  1. Chris Rusbridge says:

    Peter, I too have doubts on whether the “standard” IR platforms are much use for data. We know (as in So’ton crystallography) that with imagination, they can be used. I’ve read folks saying DSpace is not particularly useful, but FEDORA (being more customisable and flexible) may be more so.
    In Edinburgh, the ECDF service does have a major data component, but it’s more like huge chunks of disk space for active data than a repository. And I reported last year about re-thinking an Archive Service, but again they were thinking more of reliably keeping bits than anything departmental.
    There are, of course, many people building databases, some with public interfaces, and they probably think of them as repositories in some sense.
    Finally, the main subject data services do not make much use of repository software, but instead keep data in filestore structures, and tend to have a separate dataset catalogue. Of course in many cases this is at least in part because they predate the repository platforms.
    And post finally, data stresses repositories in very different ways, in particular in extremes of scale in total size, object size, object numbers, deposit rates, re-use rates, change rates (yes!) and computational load…
    Sorry I can’t be there, but I would very much like to know the result.

  2. pm286 says:

    (1) Many thanks – I’ll try to include these comments.
