Dictated into Arcturus
This post is a first outline – not even a draft – of a proposed Panton Paper on “Repositories for Scientific Data”
[Note: This is likely to be controversial.]
Very soon we will need to decide where Scientific Data should be stored. Unless we solve this problem, many fields will not have effective access to open data because it is too large or too complex to be included in conventional publications. Some fields, such as bioscience, high energy physics and astronomy, have already made significant and valuable progress in setting up repositories for their scientific outputs (mainly data), but most fields rely on what can be included in the traditional print-like publication. In almost all cases this is inadequate, although the tireless work of organizations such as the International Union of Crystallography shows that it can be achieved in some cases.
I shall be blunt. The only place where Scientific Data should be stored is in domain-specific repositories.
Many people, especially those in the academic library community, have suggested that institutional repositories are the appropriate place to reposit data. Almost every scientist I have talked to believes that we need specialist repositories for each domain and that institutional repositories are not set up for, and cannot be adapted to serve, this purpose. There are several reasons.
- Scientists expect information to be global. They either assume that Google, Bing or other search engines will index all sources of information, or they look to specialist domain repositories such as the Protein Data Bank or sequence databases such as GenBank or Swiss-Prot. Until institutional repository managers can coordinate their services so that they appear as a single global resource, they will not be used by scientists either for deposition or for discovery.
- Most Scientific Data needs very expert and careful validation. This cannot be provided by the average institutional repository, which has no knowledge of the particular domain. Arbitrary deposition of data into repositories will simply reduce rather than increase their value.
- Scientific Data has its own specialized metadata. Again, this requires domain experts to create and manage it as part of the data deposition process.
- Scientific Data requires specialist search and discovery tools. For example, biological sequences are normally searched against a very large database of known sequences to see if they are novel. Such services are provided by the NIH and the EBI, for example. There is no way that these facilities can be duplicated except in specialist repositories.
- Scientific Data requires specialist tools for deposition and validation. These are likely to be developed by community efforts centred around global repositories rather than individual academic institutions.
Therefore this paper will address the question of how to build domain-specific repositories for Scientific Data.
- Scientific Data should be stored in specialist domain-specific repositories.
- Every sub-community in science should explore its own data management needs. It is certain that it will need to find sources of funding for this.
- The community will need to dedicate time and energy to the creation of data standards and metadata, such as markup languages and ontologies.
- The community will also need to create processes for validating data so that there is an expectation of an appropriate quality in their repository.
- The community will need to build specialist tools for the deposition of data. These tools should be as easy to use as possible (as otherwise data will not be reposited), and it should be noted that this is a resource-intensive requirement.
- The community will also need to develop discovery tools which go beyond text searching. These tools must be open source so that innovation, correction and validation can be carried out by the community.
- Arrangements should be made for the transfer of data from theses in institutional repositories into these domain-specific repositories. It is highly likely that this will involve validation which may become part of the future requirements for an acceptable thesis.
- It is critically important that scientific unions and societies are intimately involved in these repositories but they must not be allowed to gain monopoly status.
- It is also critical that publishers are active members and amend their processes such that data can be validated before publication and deposited seamlessly into the repositories. Publishers must not be allowed to dominate the data deposition process.
- It is important to think out the governance model of these domain-specific repositories.
- It is critical that the data and access to them are open in perpetuity. Once the money for the repositories has been raised, it can no longer be acceptable to charge the world for access.