As part of my analysis of what data repositories should look like, I look here at repos in general. There has been some useful feedback on my latest posts, mainly about Institutional Repositories (IRs), in the comments and on Twitter. Some people agree with me; others have suggested I have got things wrong. I am not against repositories – in fact I am strongly in favour of them. I question the role of many institutional repositories and, as readers know, I argue in favour of domain-specific repositories.

Institutional Repositories have only one thing in common – they are supported by cash and staff provided by the institution – and they are institution-centric. Here, for example, is Imperial College:

Welcome to Spiral, the Digital Repository for research output of Imperial College. Spiral primarily contains full text peer-reviewed versions of journal articles and conference papers produced by academic staff of Imperial College London, as well as PhD theses by students of Imperial College London.

In fact NONE of the theses are visible to people outside Imperial. If I go to “More Information”, it’s primarily about how to submit content (obviously only for Imperial people). If I go to the first item in “Chemistry” I find a “journal” which is actually a thesis from Cambridge. One useful feature is the “Top twenty downloads”.

By contrast my own University’s repository exists for a completely different purpose:

DSpace@Cambridge is the institutional repository of the University of Cambridge. The repository was established in 2003 to facilitate the deposit of digital content of a scholarly or heritage nature, allowing academics and their departments at the University to share and preserve this content in a managed environment.

I have actually uploaded ca 180,000 items. There is no download indicator so I have no indication that anyone has ever downloaded anything. (Actually I have had 1 email, which shows somebody downloaded something 2 months ago.) This, not surprisingly, is demotivating.

So, as a result, I am not highly motivated to explore Imperial as it is highly Imperial-centric. And I am not very motivated to deposit things in DSpace@Cam (I continue to do so, but out of a sense of duty, rather than because I want to).

By contrast Nature Precedings (run by Nature Publishing Group) runs a preprint server, and I have put papers in that, for example my ill-fated paper on Open data (which got buried behind the Elsevier paywall). People have read the NP offering. 11 voted for it. Now votes are not very scientific but they give me a slight warm fuzzy feeling. I would be interested to know the downloads (and I’m slightly surprised I can’t find that). The NP site is nicely presented.

The upshot is:

  • I don’t want to browse the Imperial repository – it makes me feel an outsider
  • I don’t want to upload to DSpace@Cam (it’s tedious and I have no evidence anyone reads it)
  • I do want to upload to Nature Precedings.

So 3 months ago I sent 11-15 papers off to BioMed Central (J. Cheminformatics). I put them all in DSpace@Cam. The reviews have mainly come in and I think all the papers will get published. So I think I’ll upload them also to Nature Precedings and get a feel of what the world thinks of them. I’ll also see whether BMC have a “precedings” – if not, maybe they should.

And I submitted a lot of work last night to another repository – I stayed up till 0100 because of the excitement of doing so. It’s called Bitbucket. It’s how I make sure my code is working, high quality and available to everyone. The main motive was to increase our collaboration with the European Bioinformatics Institute (Christoph Steinbeck’s group at ChEBI).

There is no reason why IRs should not be able to appeal to sections of the community. But I think very few appeal to any more than very small groups, mainly within the institution. And if the repository is not clear what its purpose is, then I suspect it won’t appeal to anyone.

So I’ll leave you with Ranganathan’s laws, modified for repos (authors now have a role that they did not have before):

  • Repositories are for use (by machines and/or humans)
  • Every entry its reader
  • Every author and every reader their entry
  • Save the time of the reader and the author
  • The repository is an evolving organism

If, for a given community of authors and readers, you can truly answer YES to every law, then you already have a successful repository. Bitbucket has. Stackoverflow has. Wikipedia has. Dryad has. Tranche has. NCBI and EBI have. CKAN has. ArXiv has. Chemspider has. Figshare looks promising. Nature Precedings (3500 entries) continues – I would expect more.

You cannot be everything to everyone and this is where IRs generally fail. If your main purpose is to manage the REF, say so. If it is to store theses and stop the rest of the world seeing them, say so. If it’s to create collections of important digital objects, say so. And don’t do the other things unless you are sure you can make a success of them.


  1. Steve Hitchcock says:

    Peter, It pains me to say it, but you have some valid criticisms of institutional repositories. As a passionate supporter of IRs my view is that IRs broadly exhibiting the problems you identify lack leadership, that is, institutional leadership (it’s in the term, *I*R). As a result many repository managers are left thrashing around to find a purpose and identity for the repository. And yet. If you look closely, you will find exemplary IRs. These may be typical open access IRs, or untypical but with a clear focus and implementation to match. From experience there are repository teams taking exemplary approaches, but it is not apparent in the resulting repository, yet. They will be rewarded if the institution recognises and backs their efforts. The wider repository community might be too slow to recognise, promote and emulate its best, and not always clear on the importance of their mission. Let’s be clear, you recognise the importance of the mission, ultimately the progress of science and wider academic endeavour by correct and proper management and access to all of its critical (now) digital outputs, yet your response to current perceived problems is to look for instant solutions without joining it all up. IRs can offer a joined-up approach to the mission. That is why they were conceived, and we too often and too easily lose sight of that. The way forward is to highlight and learn from the best IRs. That is the way to fix these problems, rather than abandon them.

  2. Chris Rusbridge says:

    Peter, my sincere apologies for not letting you know this before, but you vastly under-estimate the usage of the WWMM collection. The report I wrote on DSpace @ Cambridge belongs to the University Library, but I’m sure they will not mind if I quote one paragraph to you:
    “There have been concerns that the huge WWMM collection might be essentially unused. Because of the Handle-like structure of identifiers, it is not easy to count accesses to all the items in a collection. However, thanks to sterling work by the DSpace team, they were able to determine that accesses to the WWMM collection represent approximately 12.5% of total accesses over the 15-month period available. This is a respectably high usage rate, that justifies keeping the collection, and comes despite the fact that 104,000 of the 175,000 items in the collection received no accesses at all during that period (a classic long tail effect).”
    One in every eight accesses to the repository was to that collection.

