I have 3-4 important talks to give in the next 2-3 weeks about Openness, Data and how they come together. As I often do I expose the ideas on this blog and hope for feedback. I’m going to highlight the meetings, but first I’m going to explain why IRs are not valuable for what I want to do.
Institutional Repositories – possibly for the last time
I’ve been writing a lot about “Repositories” and I am going to stop writing about them. My 6+ years of involvement with formal “repositories”, essentially Institutional Repositories (IRs), has convinced me that their design and philosophy are completely wrong for modern scientific research, especially data-driven research. They can’t be mended or adapted for science. Where their purposes are clear (and most do not have single clear purposes) they are designed and used for static book-like and article-like objects such as theses, e-articles and reports. The purposes are usually either preservation or research management for the university’s benefit. They are heavily institution-oriented and so only of interest to depositors and users closely involved with the institution. Nothing wrong with that, but some practitioners and advocates go beyond that and suggest they are general solutions – they are not and they will never be.
I’ve enjoyed being part of this community. I’ve been able to get some funding (JISC, Microsoft, thanks). The developer groups have been imaginative and energetic and there has been a small spinoff in generic software and protocols (e.g. SWORD2). But it’s clear after this time that the idea of “Repository” – as somewhere you deposit a fixed and final precious digital object unrelated to the rest of the content – is not what we now want in science. The concept of “repository” has never engaged scientists – they either think of “databases” or systems for sharing community resources. I use software repositories (http://en.wikipedia.org/wiki/Software_repository) but they are very different beasts and have many features that IRs don’t have, such as versioning, automatic installation software, and distributed repositories.
As a scientist I’ve tried to engage the IR community but with almost no success (other than words). There may have been opportunities for IRs to provide domain-specific solutions, but these haven’t caught the imagination of the IR managers. I have tried to suggest distributed iteration (e.g. for theses), and more generally the provision of born-digital theses (e.g. Word, not PDF). Again no common interests. I’ve tried to start discussion on support for scientists in laboratories (e.g. with distributed versioned data capture). Again no practical interest.
I was disappointed not to get any substantial feedback from my earlier post on Criteria for Data Repositories. I put quite a bit of effort into thinking about it and I got one comment. Clearly this is a community which doesn’t discuss things in public and where there is no sense of electronic community. Maybe occasionally on the DCC-list, but repositories shouldn’t be about curation, they should be about people – which they aren’t. There is, of course, no reason why anyone should reply to me. But it gives the clear message that IRs should not be involved in data – and they should make this clear.
Actually it seems more generally that there is little public discussion of IRs anywhere; not even an active general global mailing list. So I am talking in the wrong direction. It’s saddening how little activity Universities have had in the information revolution. IRs seem to be the largest area of university funding – there’s basically no interest in publication, no interest in data. There is no way of communicating to universities even though I am employed by one.
New directions
So from now on I’m going to be addressing the following:
- Groups of practising scientists, especially those building new tools for information
- Funders
- Enlightened scientific publishers (effectively those committed to Open Access/Data and models where the scientist is not just seen as a commodity for creating income)
Rather than use “Repository” I’ll introduce a new term “Sharer”, as in CodeSharer, DataSharer. I’ll develop these ideas over the next day or so and present them at the meetings.
First let me advertise a meeting in Zaragoza (http://grandir.com/EN/debatesessionSTM/ )
Debate session on STM research data management (Zaragoza, Aug 25)
25/08/2011
Next Thu Aug 25 a technical session organised under the auspices of GrandIR will be held in Zaragoza, Spain, for dealing with the management of STM research data, a yet relatively unexplored field in Spain. Along the meeting the current state of development of the Quixote Project will be also presented as an example. Quixote is a pioneering initiative for research data management in Quantum Chemistry in which several Spanish researchers are involved.
…
The meeting will have two sections: The first one will introduce the Quixote project, as well as existing national and international research data management initiatives. The talks will be short (15-20 minutes), with 10 extra minutes for questions. The second and core section of the meeting will be a discussion session, aiming to evaluate the needs of researchers and repository managers regarding data management repositories and tools, and to plan collaborations for creating a research data management infrastructure in Spain as a collection of repositories.
I am very appreciative of this – it’s less than a year since we conceived the Quixote system and here is the second formal meeting about it. More later.
And then a meeting in Madrid (Int. Union of Crystallography, Triennial) 2011-08-29
http://www.iucr.org/__data/iucr/lists/comcifs-l/msg00533.html
In honour of the 20th anniversary of CIF, the upcoming IUCr meeting in
Madrid will feature a COMCIFS-sponsored microsymposium entitled
“Scientific Data Archiving, Exchange and Retrieval in the 21st
Century”. We have three excellent invited speakers, Brian Matthews,
Brian MacMahon and Peter Murray-Rust. These speakers will discuss
various topics drawn from the past, present and future of scientific
data exchange and management.
Here I am trying to work out aspects of how a Data Journal ties into a DataSharer. I don’t know what I am going to say – some of it will be controversial and may upset some people. The general emphasis will be that primary Scientific data must be universally Open/libre. Since some organisations make their income by selling our data back to us I think we need some change of thought.
Finally I caught a tweet from Gigascience (a new Data Journal? with a DataSharer?):
Lots of useful advice (especially for us) on what makes a successful repository from @petermurrayrust, #opendata
So this is the direction I should now point in, perhaps. I’ll analyse their web site in a future post – there are some plus and minus things…
Hi Peter,
we greatly appreciate the useful advice and effort you’ve put into these posts, and we are also appreciative of any advice and feedback you can provide the GigaScience project as it’s very much work in progress. I’ll be at Science Online London if you want to chat in person, and I’d be very interested to pick your brains on these issues.
🙂 I have literally just posted an initial review of GigaScience.
Hi Peter,
Was just directed to your blog recently and have found some very interesting commentary on IRs from you as a working scientist. I agree that IRs probably aren’t what scientists need when their work is in progress. Have you heard of DataVerse, open software from Harvard, for scientists to store their data in as they are working on it, and what do you think of its usefulness for scientists? Our university is trying out an instance of it and hoping it will be of interest to members of our scientific research community. We’re aware from some of our reading (e.g.The Fourth Paradigm) that data repositories seem to be needed and are interested in further comment about their possible use, usefulness and development from scientists. More info on DataVerse: http://thedata.org/
Thanks for any comments you can provide.
S. Langlands
Thanks for commenting.
I haven’t used Dataverse – my gut feeling is that we need a range of software. I expect that Dataverse will solve some problems, but not others. The key thing is to get scientists involved – without that it won’t be possible to get the dynamics right. Feel free to ring or skype me if you want more info.
Hi, Peter … I am saddened to see you so upset with IR’s. Not because I offer any particular defense of IR’s, but because I sense you have been let down by a community that shares your values in data sharing and open access. I’ve followed your writing on data sharing over the years and wish to thank you for being a strong advocate for open access to research data. As a data professional working to preserve and provide access to research data, I have found your voice to have been important in advancing the principle of data sharing in science. While you may have abandoned IR’s, I beg you not to give up on those of us engaged in data curation.
As you well know, the production and management of information across the stages of the research lifecycle have resulted in the need for new specialists to assist scientists. Some of us are engaged in the production and management of data and its metadata, others are active in the ingest and preservation of data and still others are busy supporting access to data and its dissemination. Within this lifecycle, IR’s can play an important role in the preservation of research data. One cannot depend on individuals to preserve data for the long term; it takes an institutional commitment to ensure the long-term access to data. Thus an IR committed to preserving research data can be very valuable to science. So, is it IR’s that you feel are wrong for science or is a particular group of management in today’s IR’s that has driven you to this conclusion?
Thanks for your comments, much appreciated
>>>While you may have abandoned IR’s, I beg you not to give up on those of us engaged in data curation.
The IRs I am “upset” about are Universities which believe they can solve data problems and I have yet to find any university which has made a significant contribution (I would be delighted to have this proved wrong). The idea that each scientist uses their local university to store their data is not working and will not work for at least 5 years. Meanwhile Dryad, Tranche and Figshare are repositories which flourish.
>>>As you well know, the production and management of information across the stages of the research lifecycle have resulted in the need for new specialists to assist scientists. Some of us are engaged in the production and management of data and its metadata, others are active in the ingest and preservation of data and still others are busy supporting access to data and its dissemination. Within this lifecycle, IR’s can play an important role in the preservation of research data.
“can” possibly – “do” – I need to see evidence
>>>One cannot depend on individuals to preserve data for the long term; it takes an institutional commitment to ensure the long-term access to data. Thus an IR committed to preserving research data can be very valuable to science. So, is it IR’s that you feel are wrong for science or is a particular group of management in today’s IR’s that has driven you to this conclusion?
If a university IR steps forward and offers to host all open crystallography data (not just their own institutional data) then I shall be delighted. The size is not large – we managed it on our own server – 300,000 data sets of a few MB each – probably < 1 terabyte. I am proposing this tomorrow. If you make the offer within the next 5 hours I will announce it in lights.
So – make me the offer and I'll change my tune. But it's no good telling me that everything "can" be done by IRs when nothing "is".