Institutional Repositories: are they valuable to scientists?

Posted on August 14, 2011 by pm286

I have had time to reflect on http://www.repositoryfringe.org/ (the meeting of repositarians in Edinburgh) and, having recently been concerned with the publishing of data (about which I shall post more later), I am posting my current analysis of the UK repository scene (I don’t know enough about elsewhere). I shall try to be objective, possibly constructive, but this will probably be a rather uncomfortable post. Before I start I’ll say that I have been committed in the past to working with my local repo and more generally the repo community.

I am going to comment (>> PMR) as a working research scientist who needs a repository for (a) collaboration and (b) data publication and storage at all stages of the scientific endeavour. My comments do not necessarily extend to other disciplines or other purposes.

Here are some basic motivations for repos (http://en.wikipedia.org/wiki/Institutional_repository ) :

to provide open access to institutional research output by self-archiving it; >>PMR: This hasn’t worked for science and isn’t going to. I have self-archived some of my publications pre-publication but not post-publication. Most publishers of chemistry do not permit post-publication, the process is complex, distracting and I know of no cases where scientists search in IRs for post-publication material.
to create global visibility for an institution’s scholarly research; >>PMR This is a useful function but IRs are generally poorly set up as showcases and there is so little science in most that I don’t go looking. (Why would I look at the output of the University of X? I might if they were headhunting me, but not otherwise)
to collect content in a single location; >>PMR this has no value for the average scientist. It is primarily (if at all) for institutional purposes such as managing the Assessment exercises
to store and preserve other institutional digital assets, including unpublished or otherwise easily lost (“grey”) literature (e.g., theses or technical reports). >>PMR. This is the only thing that might be useful to me *if I could discover the material easily and read it*. As an example of the non-use, Imperial College prevents anyone outside the institution reading any of their ca 1000 theses. This is not the norm, but it is impossible to answer the question “show me all UK theses”. The interfaces to the ca 200 UK IRs are hotch-potch and completely unnavigable by machine. So I agree with “store and preserve” (which is no use to most scientists in the modern world) but not “discover”.

And from Alma Swan: (I exclude topics above, teaching, measurement, showcasing):

Providing a workspace for work-in-progress, and for collaborative or large-scale projects; >>PMR This is something I have been urging repos to do as I think it’s the only thing that would provide something of value to the average scientist. If scientists used their university system for managing their work processes and data then they would have naturally engaged. But I think repos are running out of time and I think there are existing solutions which have a trajectory and will work.

If repos wish to engage with scientists I think the only real way forward is to help create *single* domain-specific repositories. Examples of these are Dryad, Tranche, PDB and the NCBI/EBI resources. The model would involve domain scientists running the [single] repository (let’s say for computational chemistry) and one or more traditional repos managing the sustainability. Note that scientists do not, in general, care about preservation beyond a few years at most. Scientists will not and should not put data directly into their own IR – it fragments the discipline and there are no good search tools.

So I have painted a fairly stark picture for IRs and science. They aren’t working and they aren’t going to work in their current form. The only area of possible interest is theses. To do this the IRs must, across all institutions:

Make their content Open. If the response is “it’s the student’s copyright, we can’t do anything” then we are not interested.
Label the Open content as open (machine-readable). It is *impossible* in any repository I have visited to find specifically Open material in bulk (i.e. by machine reading). So almost all thesis and other content in UK repositories is effectively closed.
Make it iterable. It should be possible to list everything in the repository systematically. Google does this but academics are usually forbidden to do so. Relying on Google to search University information is simply bottling the problem. I have floated this idea, got very little take up, even though it could be done in a week if the community put its effort into it. I doubt they will, but would be happy to be proved wrong.
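
As a minimal sketch of what "labelled" and "iterable" could look like to a machine: assume a repository answers OAI-PMH ListRecords requests in the oai_dc format and records its licence in dc:rights (many do not, which is the complaint above). The sample pages, the example.org identifiers, the `fetch_page` callable and all function names below are hypothetical, and treating a Creative Commons URL in dc:rights as the "Open" label is an assumption for illustration, not an agreed convention.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Parse one ListRecords page into (records, resumption_token)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        rights = [el.text or "" for el in rec.iterfind(".//dc:rights", NS)]
        records.append({"identifier": ident.strip(), "rights": rights})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing resumptionToken means this was the last page.
    return records, (token or "").strip() or None


def iterate_repository(fetch_page):
    """Yield every record, following resumption tokens page by page.

    fetch_page is a caller-supplied (hypothetical) function that takes a
    token (None for the first request) and returns the response XML text.
    """
    token = None
    while True:
        records, token = parse_list_records(fetch_page(token))
        yield from records
        if token is None:
            break


def is_labelled_open(record, markers=("creativecommons.org",)):
    """Assumption: a record is 'Open' if dc:rights mentions a known licence."""
    return any(m in r.lower() for r in record["rights"] for m in markers)


def _record(ident, rights):
    """Build one hypothetical sample record; rights may be None."""
    rights_xml = "<dc:rights>%s</dc:rights>" % rights if rights else ""
    return (
        "<record><header><identifier>%s</identifier></header><metadata>"
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">%s</oai_dc:dc>'
        "</metadata></record>" % (ident, rights_xml)
    )


def _page(records_xml, token):
    """Wrap sample records in a minimal ListRecords response."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        "%s<resumptionToken>%s</resumptionToken></ListRecords></OAI-PMH>"
        % (records_xml, token)
    )


# Two hypothetical pages standing in for a live repository.
SAMPLE_PAGES = {
    None: _page(
        _record("oai:example.org:1", "http://creativecommons.org/licenses/by/3.0/")
        + _record("oai:example.org:2", "All rights reserved"),
        "page2",
    ),
    "page2": _page(_record("oai:example.org:3", None), ""),
}

all_records = list(iterate_repository(SAMPLE_PAGES.get))
open_ids = [r["identifier"] for r in all_records if is_labelled_open(r)]
```

Against a real repository, `fetch_page` would issue the HTTP request with `verb=ListRecords` and either `metadataPrefix=oai_dc` or the `resumptionToken`. The sketch only shows that, once content is labelled and listable, finding the Open material in bulk is a few lines of work.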

On the assumption, therefore, that IRs have nothing to offer scientists either in data management or discovery, my next posts will turn to solutions from different sectors.


11 Responses to Institutional Repositories: are they valuable to scientists?

Stevan Harnad says:

August 14, 2011 at 3:05 pm

DEPOSIT INSTITUTIONALLY, HARVEST/SEARCH CENTRALLY
(1) PM-R has missed the main function of Institutional Repositories (IR): so researchers can make their peer-reviewed papers freely accessible to *all* their would-be users rather than, as now, only to those whose institution can afford subscription access to the journal in which they were published.
(2) Users don’t search institutional repositories directly. (That would be absurd, repository by repository!) They search central harvesters (free ones, such as CiteSeerX, Scirus or Google Scholar, or first paid ones, like Web of Science or Scopus, and then the free ones, to retrieve the “hits” that are OA).
(3) The solution for authors who wish to abide by the embargo of the (minority of) journals (< 40%) that embargo Open Access is to deposit the paper immediately upon acceptance anyway, set access as "Closed Access" instead of Open Access, and allow the repository's "email eprint request" Button to transmit and fulfill eprint requests (one click from user to request, one click from author to agree) semi-automatically. The Button (3a) provides "Almost-OA" and allows immediate deposit (of every paper, in every discipline, by every institution and funder) to be mandated universally, (3b) provides for researchers' urgent immediate access needs, and (3c) will eventually hasten the well-deserved natural death of the remaining immediate-OA embargoes.
Sale, A., Couture, M., Rodrigues, E., Carr, L. and Harnad, S. (2012) Open Access Mandates and the "Fair Dealing" Button. In: Dynamic Fair Dealing: Creating Canadian Culture Online (Rosemary J. Coombe & Darren Wershler, Eds.) http://eprints.ecs.soton.ac.uk/18511/
(4) PM-R is again thinking too narrowly in terms of his own (and perhaps his discipline's) special needs — (4a) libre OA re-use rights rather than just gratis OA free access, (4b) data-archiving — when he announces that [self-archiving refereed drafts in the author's institutional repository] "hasn’t worked for science and isn’t going to." And although they are long overdue, and I too am dissatisfied with their growth rate, it is most definitely premature to declare that mandating green gratis OA self-archiving "hasn’t worked for science and isn’t going to."
(5) Ironically, there are good reasons to believe that mandating the gratis green OA that PM-R declares insufficient is also the fastest and surest way to prepare the way to the libre gold OA for which PM-R yearns!
Stevan Harnad
EnablingOpenScholarship
http://www.openscholarship.org

- pm286 says:
  
  August 14, 2011 at 6:24 pm
  
  Thanks Stevan,
  Points noted
  
Chris Rusbridge says:

August 14, 2011 at 8:23 pm

Peter, you say you need a repository for (a) collaboration and (b) data publication and storage at all stages of the scientific endeavour. You are right, current institutional repositories, by and large, are not set up to meet these needs. They are set up to meet rather different aims, which they do more or less well. I don’t think that means that they are not useful to scientists; indeed in my review of the repository at your university, I met researchers who claimed that the repository was an essential tool for them. One told me it would “shatter his work” if it went away. As it happened, their needs were more closely aligned with the repository than yours appear to be.
Despite your gloom, I suspect repositories are more useful than you think. For a start, when you do a Google Scholar or equivalent search for a paper, you will often end up at a version in a repository. It may have been self-archived, it may have been deposited by faculty or library staff on behalf of the author, it may even have been harvested from elsewhere. But there are enough (at least in the areas I look in) for it to be a surprise not to find a version. Some institutions (notably Southampton in the UK, and Michigan, MIT and others in the US) have been markedly successful in getting content into their repositories, and it reflects well on them.
Most institutional repositories are not data repositories; Cambridge is unusual in focusing more on data (well, on scholarly materials) than on outputs in the form of articles etc. Institutional data repositories are not yet widely available, but they are coming. Even they, however, are not particularly likely to provide “data publication and storage at all stages of the scientific endeavour”. I have argued in the past that they should (see posts in the Digital Curation Blog on “negative click repositories” etc), but in practice most centralised repositories are likely to focus on fairly static data for some time to come, for purely practical reasons. Personally, I think departments and research groups should be building more dynamic repositories and databases to support the earlier stages.
I also don’t believe that “single-domain” repositories will support all stages of the science endeavour. The best developed field here is bio-informatics, where you see a multitude (>1,000) of databases mostly highly specialised to the needs of certain data types. The more you cross institutions, the harder become some of the problems, particularly sustainability. Many of those 1,000 databases are precariously funded, despite their evident importance.
Some of your other comments are spot on. The repository movement really missed a trick early on in declaring the licence under which material is made available. Even though I pressed my institution, the best they could offer (with the resources they had) was to put up future content with a CC licence. And I also agree that the APIs could be much better; OAI-PMH was built around a two-layer architecture of data providers (repositories) and service providers… very few of whom materialised or are used. I just don’t know whether OAI-ORE is useful for the kinds of things you want to do.

Chris Rusbridge says:

August 14, 2011 at 8:24 pm

Sorry, I meant to say repositories are not designed for collaboration either.

Phil Lord says:

August 15, 2011 at 10:30 am

The problem with *institutional* repositories is that they do not reflect the way that scientists work. I have lots of collaborators and interact with many people on a daily basis. But this is not based around my institution. This is just where I work and tied to the physical geography of where I live. It’s important to me personally, but says almost nothing at all about my work, it’s rarely even an issue of passing interest to people who might consume it.
So why, then, do we have institutional repositories? My feeling is two-fold. First, it reflects the management structure of science; in many cases IRs are simply there to feed the beast of RAE/REF. And, second, they reflect the old notion of a library, which, because it involved physical things, had to be institutional. Neither of these is a terribly good reason.

- pm286 says:
  
  August 15, 2011 at 4:58 pm
  
  >>So why, then, do we have institutional repositories? My feeling is two-fold. First, it reflects the management structure of science; in many cases IRs are simply there to feed the beast of RAE/REF. And, second, they reflect the old notion of a library, which, because it involved physical things, had to be institutional. Neither of these is a terribly good reason.
  Exactly my feelings. IRs have generally now converged on managing the REF and probably CRIS. These are no use to scientists, they are necessary distractions that have to be addressed as quickly as possible and then forgotten.
  
Gerry Lawson says:

August 15, 2011 at 3:46 pm

Hi Peter – I think you despair too soon about IRs
1. They are essential for Green OA since publishers don’t usually extend rights to pre-print and post-print deposit to thematic repositories. If your Library has a subscription to most mainstream journals this is perhaps not important – but the requirements of non-academic and developing world users should be considered – as should the situation when libraries can’t afford the big deals anymore.
2. IRs (with help from Sherpa-Romeo) are needed to handle legal issues of what can be deposited and what not.
3. Repository Aggregators like MIMAS-IRS (esp the faceted version), Driver, OAIster, OCLC Digital Gateway (in addition to those mentioned by Stevan Harnad) do/could Boolean searching of federated repositories – emulating the power of a Thematic repository
4. Web of Science Web Services Lite allows harvesting of metadata from WoS to populate IRs (I gather the capacity to use those comes as standard in the latest version of EPrints). Web Services Lite is needed for Abstracts and other fields – but many IR managers will find this useful to populate under-used repositories. A similar product is available from Elsevier – Spotlight I think.
5. There are many overlay services (download statistics, multiple deposit, academic names) which should significantly increase the attractiveness of repositories.
6. The EThOS aggregation service already offers a unified search engine for UK PhD theses – I agree however that some HEIs keep the full text embargo for up to 5 years and this is not acceptable. Research Councils are likely to strengthen our guidelines to funded students on this – and an embargo period of 1 year is under discussion.
7. We’ve discussed the need for Open Access to the Acknowledgement section of papers to allow data mining of funder and grant number information. To my mind this is best checked/edited/updated within an IR or CRIS system rather than in a central service.
8. Repositories can expose more complex information held in CRIS systems – which may include impacts, esteem and other measures needed for RCs and the REF.
9. Institutional Repositories are increasingly needed to handle dataset metadata – why reinvent the wheel in the management of these – clone the library IR systems and make links between publications and datasets.
10. Almost forgot – holders of certain EU FP7 grants are obliged to mount information on their outputs on their IRs – following the OpenAIRE format.

- pm286 says:
  
  August 15, 2011 at 4:55 pm
  
  Thanks for this Gerry, very useful
  GL>>Hi Peter – I think you despair too soon about IRs
  It is not the “Repository” word I have a problem with, it’s the “Institutional”. Change this to “national” or “NERC” or “FP7” or whatever and I don’t have a problem. I’ll go through your comments looking for places where the Repo must be institutional
  GL>>1. They are essential for Green OA since publishers don’t usually extend rights to pre-print and post-print deposit to thematic repositories. If your Library has a subscription to most mainstream journals this is perhaps not important – but the requirements of non-academic and developing world users should be considered – as should the situation when libraries can’t afford the big deals anymore.
  I am not against Green OA (except that it isn’t happening in science, at least in the UK). I assume that archiving in a Funder repository or UKPMC would be acceptable to a publisher and IMO a better solution
  >>2. IRs (with help from Sherpa-Romeo) are needed to handle legal issues of what can be deposited and what not.
  And I assume NERC could also manage the legal issues 🙂 – probably better than many IRs
  >>3. Repository Aggregators like MIMAS-IRS (esp the faceted version), Driver, OAIster, OCLC Digital Gateway (in addition to those mentioned by Stevan Harnad) do/could Boolean searching of federated repositories – emulating the power of a Thematic repository
  There’s the pity. They could, but they don’t and I see no likelihood. I could just about believe a federated repository system if it were actually deployed. And remember, I want to develop my own searches (based on chemistry). I can do the indexing myself. I just can’t iterate over the material.
  >>4. Web of Science Web Services Lite allows harvesting of metadata from WoS to populate IRs (I gather the capacity to use those comes as standard in the latest version of EPrints). Web Services Lite is needed for Abstracts and other fields – but many IR managers will find this useful to populate under-used repositories. A similar product is available from Elsevier – Spotlight I think.
  I am ignorant of this. Are you allowed to populate Open repositories with fulltext copies of the original articles? Because if not, an IR filled only with metadata isn’t much use.
  >>5. There are many overlay services (download statistics, multiple deposit, academic names) which should significantly increase the attractiveness of repositories.
  Where is the repo that does download stats? It’s one of the many things that should have been available a long time ago.
  >>6. The EThOS aggregation service already offers a unified search engine for UK PhD theses – I agree however that some HEIs keep the full text embargo for up to 5 years and this is not acceptable. Research Councils are likely to strengthen our guidelines to funded students on this – and an embargo period of 1 year is under discussion.
  I have found it very difficult to search for theses in the UK, even through EThOS. By contrast SURF in the NL makes it trivial.
  >>7. We’ve discussed the need for Open Access to the Acknowledgement section of papers to allow data mining of funder and grant number information. To my mind this is best checked/edited/updated within an IR or CRIS system rather than in a central service.
  That may be a fair point. It isn’t, however, necessary to build a repository to do it.
  >>8. Repositories can expose more complex information held in CRIS systems – which may include impacts, esteem and other measures needed for RCs and the REF.
  Again a possibly useful point. But here the benefit is primarily to the institution not the scientist.
  >>9. Institutional Repositories are increasingly needed to handle dataset metadata – why reinvent the wheel in the management of these – clone the library IR systems and make links between publications and datasets.
  I have yet to find any IRs which manage significant amounts of data. My argument is that these should be in repositories that understand the domains.
  >>10. Almost forgot – holders of certain EU FP7 grants are obliged to mount information on their outputs on their IRs – following the OpenAIRE format.
  Does this stipulate that the repo must be institutional?
  I shall continue the data theme. There will need to be a major change in thinking before I am convinced they are the best place for data.
  
- Phil Lord says:
  
  August 16, 2011 at 9:19 am
  
  “Institutional Repositories are increasingly needed to handle dataset metadata – why reinvent the wheel in the management of these – clone the library IR systems and make links between publications and datasets.”
  Data and publications are really not the same thing. Next gen sequencing technology (or, as it is increasingly becoming known, sequencing technology) can produce enough data in a day or so to put the British Library to shame. I don’t store my sequence metadata in BibTeX, so why should the library IR be any use? So, you need different types of repository for different types of data.
  Which is the key point really. The repository should reflect the nature of the data in it, as opposed to the employment history of the people who produced it.
  
  - pm286 says:
    
    August 16, 2011 at 9:27 am
    
    Again, exactly my views. I will be expanding them over the next few posts.
    At present the only advantage that Institutional Repositories have over non-Institutional ones is that the Institution has agreed (explicitly or implicitly) to provide the funding. Otherwise they have no innate advantages and, since they address a different business (managing the University’s research output) from what I am interested in, they generally don’t provide what I want.
    