Dictated into Arcturus
This post is a first outline – not even a draft – of a proposed Panton Paper on “Repositories for Scientific Data”.
[Note: This is likely to be controversial.]
Very soon we will need to decide where Scientific Data should be stored. Unless we solve this problem, many fields will not have effective access to open data because it is too large or too complex to be included in conventional publications. Some fields, such as bioscience, high energy physics and astronomy, have already made significant and valuable progress in setting up repositories for their scientific outputs (mainly data), but most fields rely on what can be included in the traditional print-like publication. In almost all cases this is inadequate, although the tireless work of organizations such as the International Union of Crystallography shows that it can be achieved in some cases.
I shall be blunt. The only place where Scientific Data should be stored is in domain-specific repositories.
Many people, especially those in the academic library community, have suggested that institutional repositories are the appropriate place to reposit data. Almost every scientist I have talked to believes that we need specialist repositories for each domain and that institutional repositories are not set up for, and cannot be adapted to serve, this purpose. There are several reasons.
- Scientists expect information to be global. They either assume that Google, Bing, or other search engines will index all sources of information, or they look to specialist domain repositories such as the Protein Data Bank or sequence databases such as GenBank or Swiss-Prot. Until repository managers can coordinate their services so that they appear as a single global resource, they will not be used by scientists either for deposition or for discovery.
- Most Scientific Data needs very expert and careful validation. This cannot be provided by the average institutional repository, which has no knowledge of the particular domain. Arbitrary deposition of data into repositories will simply reduce rather than increase their value.
- Scientific Data has its own specialized metadata. Again this requires domain experts to create and manage as part of the data deposition process.
- Scientific Data requires specialist search and discovery tools. For example, biological sequences are normally searched against a very large database of known sequences to see if they are novel. These services are provided by, for example, the NIH and the EBI. There is no way that these facilities can be duplicated except in specialist repositories.
- Scientific Data requires specialist tools for deposition and validation. These are likely to be developed by community efforts centred around global repositories rather than individual academic institutions.
Therefore this paper will address the question of how to build domain-specific repositories for Scientific data.
- Scientific data should be stored in specialist domain-specific repositories.
- Every sub community in science should explore the data management needs of its community. It is certain that they will need to find sources of funding for this.
- The community will need to dedicate time and energy to the creation of data standards and metadata, such as markup languages and ontologies.
- The community will also need to create processes for validating data so that there is an expectation of an appropriate quality in their repository.
- The community will need to build specialist tools for the deposition of data. These tools should be as easy to use as possible (as otherwise data will not be reposited) and it should be noted that this is a resource intensive requirement.
- The community will also need to develop discovery tools which go beyond text searching. And these tools must be open source so that innovation, correction and validation can be carried out by the community.
- Arrangements should be made for the transfer of data from theses in institutional repositories into these domain-specific repositories. It is highly likely that this will involve validation which may become part of the future requirements for an acceptable thesis.
- It is critically important that scientific unions and societies are intimately involved in these repositories but they must not be allowed to gain monopoly status.
- It is also critical that publishers are active members and amend their processes such that data can be validated before publication and deposited seamlessly into the repositories. Publishers must not be allowed to dominate the data deposition process.
- It is important to think out the governance model of these domain-specific repositories.
- It is critical that the data and access to them are open in perpetuity. However, if the money for the repositories is raised, it can no longer be acceptable to charge the world for access.
Peter does not mention it above, but you will find a fully explained example of how to create a domain-specific (chemistry) repository at 10.1021/ci7004737.
Other comments: We have used this repository to provide complete supporting data in readily accessible form for more than 20 primary scientific publications with publishers such as Nature Publishing Group, VCH (Wiley), RSC, ACS, Science. Each of these publishers had to be persuaded to a greater or lesser degree to integrate this data into the primary article (rather than more obscurely in supporting information), a location we insisted upon in order to give that data prominent visibility to the readers. If you are interested in seeing the effects, I have blogged on the topic.
I recently acted as a referee for a well-known publisher. The article was the analysis of quite large molecules, and I was quite keen to explore the proposed structures. Data had been provided in the form of a double-column, page-broken Acrobat file. I faced not a little work in converting this format to something I could use for the purpose. Since I knew the authors, I contacted them after my review process was complete (yes, thus breaking my anonymity), asking why they had provided the data in such an arcane and relatively unusable format. They were following the publisher guidelines. They did suggest that it should be the publishers themselves who offer a domain-specific repository for authors to use, since it is non-trivial for authors to establish a domain-specific repository themselves (and even within domains, there are many diverse requirements). I have my doubts, however, that such a model could be effectively deployed by the multiple publishers in chemistry any time soon. Meanwhile, for the vast majority of articles which have associated data submitted with them, the Internet revolution has yet to make much of an impact!
“as otherwise data will not be repository at”?
Sadly I can’t read the item Henry refers to without paying $30 for 48 hours’ access! Unless… No, can’t find this in a repository!
My real point is that Peter suggests an ideal that I fear cannot be realised in the broad. There are comparatively few existing domain-specific repositories, and most are extremely vulnerable. Witness what happened to the AHDS when the makeup of the policy committee changed slightly. Secondly, don’t think (please!) that domains are consistent; there can be endless divisiveness of approach between many subdomains. Thirdly, why should institutional data repositories not work, given the support of the institutional scholars? Fourthly, how can reasonably well-managed institutional data repositories not be federated so that the sub-domain parts of all the world appear as one? Fifthly, institutional data repositories do have a sustainability case, if linked to a library, an institutional mission, and that vital sense of scholarship disclosure.
I would never seek to undermine a domain repository that existed and worked, but I would hesitate to try to establish (and more importantly sustain) a domain repository where none existed. I would aim to establish IDRs and federate them. I’m not saying the former can’t be done, just that it is MUCH harder!
@Chris
I have to say that I broadly agree with your points, and that the best sustainability and access is offered by federated institutional / sub-institutional repos.
I don’t think this is the easy path, though. There are few IRs tackling data archiving at a significant level, and even fewer aggregated domain-specific meta-repositories.
In the spirit of paving the cow paths, the best route might be to look for ways to deliver institutional support to domain repositories.
Peter, you mention ‘open data’ twice in this blog entry, in the opening sentence and in the final sentence. In between you do not address how the extensive requirements can be achieved while continuing to provide open data. You propose to disregard the contribution that might be made by researchers’ institutions, yet intimate roles for scientific unions, societies and publishers. These are likely to provide services at a cost that is not compatible with open data. Since open is axiomatic to what you want, it doesn’t seem to add up here. I think we could, and will, see examples of more diversified structures, with IRs at the apex, to provide the expert data management and curation that you seek, but within our research institutions.