I have been invited to give a keynote lecture at Open Repositories 2008 (see the programme – about 25% down) and have chosen the title “Repositories for Scientific Data”. I’d value help from the repositarian blogosphere and elsewhere.
My thesis is that the current approach for Institutional Repositories will not translate easily to the capture of scientific data and related research output. In some fields of “big science” (e.g. High Energy Physics) the problem is or will be solved by the community and their funders, and institutions have effectively no role. However much – probably most – science is done in labs, which are the primary unit of allegiance. Typical disciplines are chemistry, materials science, biochemistry, cell biology, neuroscience, etc. These labs are often focussed on local and short-term issues rather than long-term archival, dissemination of data to the community, etc. Typical worries are:
- My grad student has just left without warning – can I find her spectra?
- How can we rerun the stats that our visitor last year did for us?
- My laptop has just crashed and I’ve lost all the images from the microscope.
- My chosen journal had to retract papers due to recent scientific malpractice. Now they want me to send them all my supporting data to prove I have adopted correct procedures. This will take me an extra month to retype in their format.
If we are to capture and preserve science we have to do it to support the scientist, not because the institution thinks it is a good idea (even if it is a good idea). So we have to embed the data capture directly into the laboratory. Of course in many cases there is a key role for the Department, particularly when – as in chemistry – there is a huge investment in analytical services (crystallography, spectroscopy, computing).
I am developing this theme for the presentation and would be very grateful for anecdotal or other information about cases where the institution or department has developed a data capture system which ultimately feeds into medium-term (probably Open) preservation. Two emerging examples are Monash, which has acquired a petabyte for storage of University scientific data and will layer a series of access mechanisms (SVN, Active Directory, Samba, RDB, SRB, etc.) on top of it, and Oxford, which has recently announced a Data Repository.
If you have material that will help give a balanced picture of data reposition in institutions I’d be grateful for email (or comments on the blog, but I’ll be offline for a few days from Monday). I’m aware that some disciplines have domain repositories independent of institutions (e.g. HEP, bio-sequences, genes, structures, etc., and David Shotton’s image repository for biology). I’m after cases where the institution has invested in departmental or lab facilities which are actually being used.
Many thanks in advance.