berlin5: Open Data and institutional repositories

John Marks (ESF) introduced our session and set the scene on the need for Open Data and sharing. He stated strongly that it was essential that we had discipline-specific repositories for different branches of science. I share this view and blogged it recently (berlin5 : how to progress Open Data?).
My stance comes from meetings this year where I have talked to many people about institutional repositories. I ask them “why are you setting up an IR?” I have got about 8 distinct answers. Very few of them mention data.
Some of us addressed these issues at ETD2007. There are hundreds of different types of biological data, tens of chemistry data, hundreds of geoscience, etc. There is no way that these managers – with the best will in the world – will know how to manage them. So I wrote:

although there is quite a lot of activity in institutional digital repositories they won’t (and shouldn’t) address Data. It’s subject-specific and too complex for the average repository manager.

PMR: In response to this, Dorothea Salo (who has run the Caveat Lector blog for some years and has a strong following) commented:

  1. Dorothea Salo Says:

    Disagree somewhat that IRs and their managers shouldn’t address data, though I agree that for now it’s impractical because the software is so wretched and the technical infrastructure insufficiently scalable. Just because IR software in its current state is completely broken with regard to data doesn’t mean it must or should stay that way, though. Moreover, the notion that “domain knowledge” is the sole key to data curation is (bluntly) bunk, and nobody’s yet tested the assertion that it’s harder to teach a librarian domain knowledge than to teach a discipline-practitioner info management. Frankly, “it differs by discipline” doesn’t matter. So does everything else in librarianship, from reference transactions to collection development. We cope. It’s our job to. As for “too complex,” says who? And about which librarians? I think I’ve just been insulted.
    There’s nothing wrong with telling librarians — and the subset of librarians who are repository managers — that we need to brush up our game to deal with these issues. I have a plan in place to learn the principles of data curation for myself over the next year or so. I want to see more librarians planning the same!
    Looks like a good talk. Wish I could be there to hear it!

PMR: I haven’t met Dorothea but I’d like to – her blog is insightful and entertaining and she is unafraid to speak out. She’s also technically proficient in the IT skills required – XML, etc. And the last thing I want to do is upset and antagonize people like Dorothea.
But… There is no single human on the planet who knows how to reposit all of protein structures, variable stars, ice sheets, chemical structures. It needs much more than metadata. So what can a repository manager do? Putting the raw data into the repository without understanding it is not an option. It has to go into a system devised by experts in the discipline. And, for me, that means subject repositories. Maybe each university has a different one. Maybe they are national. Some, like the bioscience ones, will be international.


4 Responses to berlin5: Open Data and institutional repositories

  1. I should have added a smiley. I’m not really insulted. 🙂
    No, there is no one human and there can be no one human. In the medium-to-long term, though, I expect it’ll work out much as other discipline-dependent aspects of librarianship have: you end up with librarians with specific domain expertise tapped to help manage what comes out of a particular discipline.
    Also, it’s worth looking at the social sciences here. They manage their data via embedding librarians in their space, along with the relevant technical expertise. I think that’s a fantastic model, worth considering in other disciplines as well — it’s what’s already done with IT, after all! How these embedded librarians fit themselves together to create a campus-wide, cross-disciplinary preservation program is a fascinating question… but it seems a halfway-tractable problem, at least.
    I also think (possibly over-optimistically?) that some of these problems have more similarities than differences. Any librarian can start out with the right questions about a particular dataset: “What software makes this?” “Who owns that software?” “Are there standards in this area?” “Can the data be exported?” “Is there a codebook?” “What’s the history behind this data format?” “What were you doing with these data five years ago?” and so on.
    Last week I went to a presentation by Kevin Eliceiri, who is NOT a librarian but ought to be given an honorary MLS just on general principles. He is working on digital microscopy, and almost singlehandedly bringing open standards to a proprietary-vendor-dominated field. I don’t know what-all the scientists do with the pretty images he was showing us; microscopy isn’t my thing — but I understood every word he said about the challenges he’s facing, the formats he’s building, and the workflows he’s creating. You’d find them familiar too, though I don’t believe digital microscopy is quite your thing either.
    Don’t write us off. That’s all I’m asking. We don’t have to be the whole picture, but I see no reason we can’t be a useful part of it.

  2. DrZZ says:

    I think Dorothea makes an excellent and important point. I think way too much emphasis is placed on how the data could be used and not enough on making sure that the data CAN be used. Certainly how to make use of the data requires domain specific expertise and is vitally important, but a large part of the desire to archive is to preserve the data for uses that aren’t even thought of yet and I don’t see how arguing about not yet thought of requirements helps in the archiving process. I say this from experience. The NIH Roadmap Screening Center network is beginning to look very closely at how to archive all the data. Summary results are deposited in PubChem, but there is much more data that isn’t deposited and the question is what to do with it. In all the discussion I’ve participated in, I haven’t heard anyone ask the basic, but vital questions Dorothea asks above, and I think it’s reasonable to argue that it is because the discussions have only included domain experts and no one has the broader perspective that a librarian brings.

  3. Jakob says:

    From the librarian’s point of view I can tell you that archiving data is probably even more complex than it seems to be. From the computer scientist’s point of view I can tell you that the Semantic Web will enlighten us easily. From the Open Content movement’s point of view I can tell you that you should just license the data and make it available and usable for anyone – like you said: first make sure THAT the data CAN be used.
