I am honoured to have been asked by Liz Lyon and Les Carr to give the opening keynote at Open Repositories 2008 at Southampton. In our group we have been very actively trying to work out what repositories in lab subjects might look like. It is quite clear that the current approach to institutional repositories, which was designed to manage fulltext, cannot and should not be extended for data. The theme of my presentation will be the need for a complementary, but separate, system of Data Repositories. The Abstract I have submitted is:
“Repositories for Scientific Data”
“Scientists are producing data at an ever increasing rate (“the data deluge”) due to automated instruments, image capture and simulation tools. This holds the promise of “data-driven science” where scientific discovery can be made by linking or mining existing data. The reality is, unfortunately, that almost all this data is lost. Although some publishers welcome data as an adjunct to “fulltext”, many do not and most do not have the domain expertise to store and curate the data. And although “big science” (such as high energy physics, geospatial imaging, genomics and structural biology) can often provide domain repositories (e.g. in bioinformatics) most science (“the long tail”) cannot.
There is an urgent need to address this problem. Current Institutional Repositories (IRs) are geared to storing and disseminating scholarly manuscripts and while some are prepared to accept other digital artefacts the practice is fragmented and does not scale. We need to define “Data Repositories” (DRs) which serve the interests of the scientists directly. This is highly domain-dependent and there is no one-size-fits-all solution. However there are some general principles.
* The DRs must be intimately embedded in the current practice of the scientists – ideally they should be invisible to them.
* They must directly support the scientific effort and be seen as doing so, rather than being confused with metrics, business processes, etc.
* The people running them should be physically present in the scientific laboratories (“wearing lab coats”).
It is important not to overcomplicate with unnecessary middleware and metadata. The typical informatics toolset of a scientist includes Word/LaTeX, Excel, and the good old filing system – which, with today’s huge storage, comes back into its own. Free text indexing tools will do as good a job of creating domain metadata as humans. Many departments are starting to introduce backup systems such as Active Directory, Samba or SVN which satisfy the most important user of the repository – the scientist themself. HTTP/REST is good enough for many departments. These tools are an excellent starting point to engage the scientists and show there is real benefit.
This is a new field and I shall review some of the current approaches, including work from our own group (in chemistry and crystallography). It is critical that prototypes are developed with sustainability in mind. This is difficult (it is rarely possible to get direct grants) but the tools are often well known and easy/free to install. In many cases it may be possible to “hide” the costs of data capture in other accepted activities (“backup”, “publication”, “thesis preparation”, “instrument maintenance”, “analytical services”, etc.). Good prior design is much cheaper than retrofitting “repositories” and can be seen to have an immediate benefit for the quality of data, re-use and mashups, speed of thesis preparation, etc. Indeed, if good principles of data management are brought into the teaching and learning process (e.g. in final year projects) then the students themselves will provide much of the innovation and tools.
On the assumption that we can have an Internet connection there will be live demonstrations.
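To make the “free text indexing” point in the abstract concrete: a crude inverted index over the lab filing system is a few lines of code, and is often all the “domain metadata” a group needs to start finding its own files again. This is a minimal sketch of my own (the function name and the word-splitting pattern are illustrative assumptions, not part of any particular tool):

```python
import os
import re
from collections import defaultdict

def build_index(root):
    """Walk a filing system and build a crude inverted index:
    word -> set of file paths containing that word. A stand-in
    for 'free text indexing tools' over a lab's shared directory."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file: skip, don't fail the crawl
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(path)
    return index
```

A query is then just a set intersection, e.g. `index["crystal"] & index["nmr"]` – no schema, no middleware, and it degrades gracefully as the filing system grows.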
Much of this is due to my colleagues, especially Jim Downing. I believe that there has been too much over-engineering and that we should look to simpler approaches based on common tools where possible. In many cases what the scientist wants is a “bit-bucket” where the data can be stored in the knowledge that they won’t be lost when the desktop crashes or they change laptops. Most scientists will not have worked with a versioning system such as SVN and this may be an important productivity tool for managing manuscripts and theses. Access control is an unavoidable necessity (the lab down the corridor may be your worst competitors…) and it highlights the central requirement for any repository system – enough people embedded in the department who can actually fix the glueware on a regular basis. This is the central and costly challenge we have to solve.
(I’m tagging this as OR08 – I couldn’t find any other suggestion)
UPDATE. After I wrote this I realised I need to say more about RDF and ORE and will do so.