How much scientific content is there in IRs?

I have suggested (/pmr/2011/08/14/institutional-repositories-are-they-valuable-to-scientists/ ) that Institutional Repositories are not valuable for scientists. Chris Rusbridge, who used to run the Digital Curation Centre, has commented. I am replying here, rather than in the thread. My argument is that there is little content in most IRs (Soton, UCL and probably CompSci depts. are exceptions) and a fortiori even less scientific content. I have done a simple analysis to back this up.


[CR] August 14, 2011 at 8:23 pm
[IRs] are set up to meet rather different aims [from mine], which they do more or less well. I don’t think that means that they are not useful to scientists; indeed in my review of the repository at your university, I met researchers who claimed that repository was an essential tool for them. One told me it would “shatter his work” if it went away. As it happened, their needs were more closely aligned with the repository than yours appear to be.

PMR: I don’t doubt that there are a few places where repositories meet the needs of a few scientists, and a few places where the University (e.g. Soton) has put in significant resources and effort so that there is a critical mass of users. But the use of IRs by anyone, let alone scientists, is very small. Let’s take the Russell Group (the UK’s “top” 20 universities) and look at the contents of their repositories (items deposited, taken from the latest index of DOAR). I have searched for the name of the University and taken the IR which is most obviously the “central” one. Some may not be up to date, but in the absence of any useful TOC/index I have to take what is there.

University of Birmingham 730 items

University of Bristol 1208

University of Cambridge 190836 (almost all mine)

Cardiff University 757

University of Edinburgh 4124

University of Glasgow 21604

Imperial College London 1217 (of which ca 1000 are theses not on public view)

King’s College London (CompSci) 1984

University of Leeds + Sheffield + York ~8000

University of Liverpool 598

London School of Economics & Political Science 19228 + 72 theses

University of Manchester (Maths) 371

Newcastle University 6100

University of Nottingham 801

University of Oxford 2730

Queen’s University Belfast (Pol, Phil) 87

University of Southampton 13 repos, ca 60,000 items

University College London 196442

University of Warwick 3844

The median is around 2000-3000 items, accumulated over what I suspect is about 5 years.
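As a rough check on that figure, here is a minimal Python sketch that recomputes the median from the counts listed above. The short dictionary keys are my own labels, the approximate entries ("~8000", "ca 60,000") are taken at face value, LSE's 72 theses are added to its total, and the 5-year window is only the guess stated above, not a measured value.

```python
from statistics import median

# Item counts copied from the list above (DOAR figures as quoted in the post).
# Approximate entries are used as given; this is an illustration, not a survey.
items = {
    "Birmingham": 730,
    "Bristol": 1208,
    "Cambridge": 190836,
    "Cardiff": 757,
    "Edinburgh": 4124,
    "Glasgow": 21604,
    "Imperial": 1217,
    "King's (CompSci)": 1984,
    "Leeds + Sheffield + York": 8000,    # "~8000"
    "Liverpool": 598,
    "LSE": 19228 + 72,                   # items plus theses
    "Manchester (Maths)": 371,
    "Newcastle": 6100,
    "Nottingham": 801,
    "Oxford": 2730,
    "Queen's Belfast (Pol, Phil)": 87,
    "Southampton": 60000,                # "ca 60,000" across 13 repos
    "UCL": 196442,
    "Warwick": 3844,
}

YEARS = 5  # assumption: the post's guess at how long the IRs have been filling

median_items = median(items.values())
items_per_year = median_items / YEARS

print(f"repositories counted: {len(items)}")
print(f"median items:         {median_items}")
print(f"median items / year:  {items_per_year:.0f}")
```

With these inputs the median is Oxford's 2730 items, or roughly 550 items per year over 5 years. Using the median rather than the mean matters here, because Cambridge and UCL alone would otherwise dominate the result.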


[CR] Despite your gloom, I suspect repositories are more useful than you think. For a start, very often when you do a Google Scholar or equivalent search for a paper, you will end up at a version in a repository. It may have been self-archived, it may have been deposited by faculty or library staff on behalf of the author, it may even have been harvested from elsewhere. But there are enough (at least in the areas I look in) for it to be a surprise not to find a version. Some institutions (notably Southampton in the UK, and Michigan, MIT and others in the US) have been markedly successful in getting content into their repositories, and it reflects well on them.

PMR: I don’t think the UK figures above bear that out. They suggest a median of perhaps 500 items per year. Even if all these are papers (which I doubt: probably half are theses, plus an unknown amount of gray material) then that’s less than a paper per staff member per year. That means it’s extremely unlikely that any given paper will be in the repository. In computer science, perhaps. But not in bioscience or chemistry or materials.

[CR] Most institutional repositories are not data repositories; Cambridge is unusual in focusing more on data (well, on scholarly materials) than on outputs in the form of articles etc. Institutional data repositories are not yet widely available, but they are coming.

PMR: And, even if I were to believe that IR-data is coming, my argument is that this is the wrong way to do it.

[CR] Even they, however, are not particularly likely to provide “data publication and storage at all stages of the scientific endeavour”. I have argued in the past that they should (see posts in the Digital Curation Blog on “negative click repositories” etc), but in practice most centralised repositories are likely to focus on fairly static data for some time to come, for purely practical reasons. Personally, I think departments and research groups should be building more dynamic repositories and databases to support the earlier stages.

PMR: I’d be surprised if scientists used IRs for data, given that they don’t use them for full-text.

[CR] I also don’t believe that “single-domain” repositories will support all stages of the science endeavour. The best developed field here is bio-informatics, where you see a multitude (>1,000) of databases mostly highly specialised to the needs of certain data types. The more you cross institutions, the harder become some of the problems, particularly sustainability. Many of those 1,000 databases are precariously funded, despite their evident importance.

PMR: I agree that this is precarious. If IRs are less precarious (i.e. they have sustainable funding) then they should be used for domain-specific data on a global basis.

[CR] Some of your other comments are spot on. The repository movement really missed a trick early on in declaring the licence under which material is made available. Even though I pressed my institution, the best they could offer (with the resources they had) was to put up future content with a CC licence. And I also agree that the APIs could be much better; OAI-PMH was built around a two-layer architecture of data providers (repositories) and service providers… very few of whom materialised or are used. I just don’t know whether OAI-ORE is useful for the kinds of things you want to do.

PMR: I agree that this is all difficult, especially the sustainability. I shall be developing this in future posts.





2 Responses to How much scientific content is there in IRs?

  1. [REDACTED PMR; PLEASE DO NOT USE INSULTS ON THIS LIST OR I SHALL BLOCK THEM] There are ways to get facts about “famous” IRs like Southampton.
    Did you know that of 3145 “articles” (year: 2009), only 678 have a PDF? And that of the 678 PDFs, 319 are not Open Access?
