PP4_0.1: Comments on Repository Structure and location

Dictated into Arcturus

Responses to PPaper4_0.1.

I have had a number of useful comments on my suggestion that Scientific Data be reposited in domain-specific repositories, and a number of tweets to the effect that “PMR is dissing librarians yet again”. To the latter I’d ask the authors to reread what I actually said, which was that many librarians think that data should be put in IRs, whereas all the scientists I have spoken to think otherwise. This was a factual statement, not an attack. The meaningful comments are:

Chris Rusbridge says:

July 28, 2010 at 8:57 pm

My real point to make is that Peter suggests an ideal that I fear cannot be realised in the broad. There are comparatively few existing domain-specific repositories, and most are extremely vulnerable. Witness what happened to the AHDS when the makeup of the policy committee changed slightly. Secondly, don’t think (please!) that domains are consistent; there can be endless divisiveness of approach between many subdomains. Thirdly, why should institutional data repositories not work, given the support of the institutional scholars? Fourthly, how can reasonably well-managed institutional data repositories not be federated so that the sub-domain parts of all the world appear as one? Fifthly, institutional data repositories do have a sustainability case, if linked to a library, an institutional mission, and that vital sense of scholarship disclosure.

I would never seek to undermine a domain repository that existed and worked, but I would hesitate to try to establish (and more importantly sustain) a domain repository where none existed. I would aim to establish IDRs and federate them. I’m not saying the former can’t be done, just that it is MUCH harder!

Jim Downing says:

July 29, 2010 at 10:47 am

@Chris

I have to say that I broadly agree with your points, and that the best sustainability and access is offered by federated institutional / sub-institutional repos.

I don’t think this is the easy path, though. There are few IRs tackling data archiving at a significant level, and even fewer aggregated domain-specific meta-repositories.

In the spirit of paving the cow paths, the best route might be to look for ways to deliver institutional support to domain repositories.

Steve Hitchcock says:

July 29, 2010 at 10:53 am

Peter, You mention ‘open data’ twice in this blog entry, in the opening sentence and in the final sentence. In between you do not address how the extensive requirements can be achieved while continuing to provide open data. You propose to disregard the contribution that might be made by researchers’ institutions, yet intimate roles for scientific unions, societies and publishers. These are likely to provide services at a cost that is not compatible with open data. Since open is axiomatic to what you want, it doesn’t seem to add up here. I think we could, and will, see examples of more diversified structures, with IRs at the apex, to provide the expert data management and curation that you seek, but within our research institutions.

Firstly, none of this will be easy and it may well be impossible in most cases. I see no reason why institutions should not provide data repositories other than the fact that they do not currently do so and there is little sign of them making any progress. I can certainly conceive of a future where this happens – I just don’t see it happening. There *are* a number of domain-specific repositories, and yes, most of them are fragile. But that is to be compared with almost zero equivalents in IRs.

If you read my actual draft for the PPaper (between the rules) there is no mention of where the repository should be and who should finance it. I have simply made the point that data should not be stored in a general-purpose repository where there is no domain expertise. If you wish to make the point that it *should* be stored like that – without effective federation and without domain expertise for ingestion – I will continue to disagree. If you agree that it needs domain expertise then you will have to get that from practising scientists – there is no way that anyone outside the discipline (libraries, Google, Bing) can rightly manage the intricacies and detail.

The last thing that scientists want is their data spread over ca. 10,000 sites (because that is how many HE institutions there are worldwide). No scientist, editor, or journal that I have spoken to would countenance data being reposited in that way.

So if libraries (whom I did not attack) wish to be involved they have to engage with domain scientists. Libraries have the following positives to offer:

  • They have (at least currently) funding for IRs
  • They have some degree of permanency

Domain repositories have the following:

  • They have the technical trust of the community
  • They offer a single point of contact

My (rather tentative) solution is that libraries should actively try to take on one or two domain-specific repositories. Not more. Those repositories should correspond to world-class expertise on campus. For example, the Protein Data Bank (RCSB) is located at Rutgers. I have no idea whether the University supports it. But it is a single point of contact for the discipline.

The future is tough, however you look at it. But the fact that scientists are starting to set up their own repositories sends a message.

I am simply the messenger.


3 Responses to PP4_0.1: Comments on Repository Structure and location

  1. Chris Rusbridge says:

    Peter, on the tweet, may I respectfully point out (a) we only get 140 characters, (b) I was being deliberately provocative, (c) I didn’t mention librarians at all, and (d) my aim was to provoke people in the JISC Managing Research Data community (#jiscmrd) to read your post. Here’s what I wrote in my tweet:
    “Is @petermurrayrust right to dis institutional data repositories in http://wwmm.ch.cam.ac.uk/blogs…? I don’t agree, do you #jiscmrd?”
    Note, I mentioned institutional DATA repositories, which is mostly not what we’ve got, but what the JISC programme is trying to stimulate.
    I’d also like to remind folk that I did try to square the circle of sustainability and science involvement with this post some 15 months ago: http://digitalcuration.blogspot.com/2009/02/national-research-data-infrastructure.html. It’s definitely not right, but we do need some more serious analysis of the sustainability angles.

    • pm286 says:

      Chris,
      It was not you I was commenting on; it was:
      “@cardcc @petermurrayrust makes a habit of dissing anything that libraries are involved in #jiscmrd”
      from, I believe, a librarian.
      I do not make a habit of dissing libraries and I try to choose my words reasonably carefully.
      P.

  2. Kenji Takeda says:

    Dear Peter,
    Great discussion you have opened up! I believe that you are right, domain specialists need to be involved. My personal hope is that a new type of data librarian may emerge, and that these people may come from a mixture of discipline specialists and librarians. This is what happened/happens in scientific computing: some scientists move into the computing arena, and some computing people move into the scientific arena – as demonstrated by the eScience programme.
    I believe this will take around a decade, as open repositories have done. I also believe that a federated approach is the only sane architecture. You may be interested in reading Chris Gutteridge’s recent paper on the subject @ http://eprints.ecs.soton.ac.uk/20885/
    The University of Southampton is committed to open data, and as such we hope to be in the leading wave of open institutional data repositories. The eCrystals and our new Materials Data Centre (www.materialsdatacentre.com) projects are examples of disciplines being the focus.
    The whole point of our Institutional Data Management Blueprint project is to figure out how a whole institution can manage and openly publish its data. The genesis of this was that a bunch of us got together and realised that the institution is responsible for its data, and it therefore has the biggest incentive to manage, and publish it. Libraries have a role, as do publishers. I think forward-thinking libraries could jump right in here, as you suggest, but as always funding and priorities need to be matched.
    We have a long way to go, but we’re starting down the road. We’re learning all the time!!!
    Cheers,
    Kenji
    http://www.southamptondata.org
