Criteria for successful Repositories

[I overwrote this, but have recovered the content from Google’s cache. Thanks to Google]

 

In thinking about data repositories I have been thinking about repositories in general and what makes them successful. I’m going to start by suggesting some principles for success. These are based on the repositories I use and the repositories set up in our group.

  • A repository should have a clear, single purpose. Bitbucket is for repositing and editing code. Stackoverflow is for programming questions. EMSL is for atomic basis functions. Crystaleye is for crystal structures. CKAN is for public metadata. Chemspider is for chemical compounds. Purpose is more important than technology. A repository which tries to do more than one thing will have enormous difficulty. Creating one because everyone else is doing it is most unlikely to be useful.
  • People should want to put stuff in a repository. The want is almost always self-interest, or is driven by actual coercion (employment contract or legal requirements). Voluntary and altruistic expectations don’t work.
  • People should want to get stuff out of the repository. It’s the difference between this and this. [Jim Downing]
  • The Community using the repository should be clear. This can be a person, a research group, an institution, a government agency, a nation, or “the web”. You cannot by default expect the world to be interested in a repository designed for a particular group. A university is not, generally, a cohesive community and does not have a clear purpose.
  • Most successful repositories are started by one or more identifiable people. The repos I have mentioned are all started by identifiable, passionate people. They may change to institutional mode later, but in the formative period it is the drive of individuals that sets the basis for the repository. Committees are not generally visionaries.
  • The repository should generate a community that has a sense of ownership. Typical features are mailing lists and the movement of high-profile users into development roles (usually spare-time). I feel I have a small ownership of Wikipedia and I could have more if I wanted.
  • The repository should be a dynamic organism. Content should be versionable and editable by “anyone” using the repository. Who that is depends on the purpose, and obviously some data cannot be edited for legal or historical reasons. But a repository without history and editing is effectively “dead trees”, not a living Web organism.
  • If you want people outside the “community” to become involved, work to make this happen. The home page of a repository sends a positive or negative message to newcomers as to whether they are welcome. Provide examples, software, news, discussion, etc. Otherwise plan that this is a repository for the limited community only.
  • Make the repository successful rapidly. A repository that grows rapidly will attract new blood, new material, new ideas. Some don’t and you shouldn’t be afraid to let them wither. I am doing that with some of the things I have started while trying to grow others. A successful repository will attract sustainability.
  • Plan for the “now”, not for future generations. There is a modest role for repositories that preserve but they are incompatible with current use. I want to use things now. There is a role for digitization, but it is expensive, slow, layered with legality and generally out of sync with much of the internet.
  • Make the rights on the repository universal and completely clear. A repository is either completely Open/libre (OKDefinition), completely gratis (viewable but with no rights), or should be regarded as closed. A repository where some of the items are Open, some visible and some restricted is effectively restricted. Machines are a primary user of repositories. They cannot read and cannot understand licences.
  • Build repositories for machines as well as humans. Assuming that users will navigate by reading metadata and clicking on it restricts the use to humans. The value of data repositories will be that they can be used by machines; if not, they will fail.
  • Make it clear what the extent of the repository is, and make it iterable. It should be easy to find out how many items there are in a repository and get a list of all of them. For an Open repository it should be possible to visit every one automatically (iteration).
  • An open/libre repository should be cloneable/forkable. It should be possible to copy the whole or part of the content of a repository. There may be technical problems with this, but the repository owners should be prepared to help make this possible as far as resources allow. This is the only way of protecting Openness.
  • Encourage the community to innovate in searching and displaying the contents of the repository. If you delegate your search strategy to Bingle, you will get what Bingle provides. If you want to search for any concept or data that Bingle isn’t interested in, you won’t be able to. Wikipedia has flourished in part because people have done clever things with it.
  • Do not rely on traditional metadata strategy. Depositors hate metadata. Full-text provides better metadata than hand-crafted metadata. Machines produce better metadata than people (a machine can, for example, tell whether something is a FORTRAN program, an electron micrograph, a map…). Leave it to machines.
  • Build for evolution, not stasis. If your repository looks the same as it did 5 years ago then it is either obviously successful with a vibrant community, or needs tearing down.
  • Give depositors massive feedback. Add download counters. Allow visitors to comment/vote. Posting material and getting zero feedback is a massive turnoff (I know).
  • Create a system of unique identifiers that can be turned into URIs or URLs. It’s really important that every item can be identified independently of its method of storage.
  • Please add your own. This list is not comprehensive.
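Several of the principles above (a single clear licence, a known extent, iteration, cloning, and identifiers that are independent of storage) can be sketched in a few lines. This is only a toy illustration under my own assumptions, not how any of the repositories mentioned actually work: the class names, the `repo.example.org` base URI and the item identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Item:
    identifier: str  # stable ID, independent of where the content is stored
    licence: str     # machine-readable licence label, inherited from the repository
    content: str


@dataclass
class Repository:
    base_uri: str    # hypothetical base, e.g. "https://repo.example.org/item/"
    licence: str     # one licence for everything, so a machine never has to guess
    _items: Dict[str, Item] = field(default_factory=dict)

    def add(self, identifier: str, content: str) -> Item:
        """Store content under a stable identifier with the repository licence."""
        item = Item(identifier, self.licence, content)
        self._items[identifier] = item
        return item

    def __len__(self) -> int:
        """The extent of the repository: how many items it holds."""
        return len(self._items)

    def __iter__(self) -> Iterator[Item]:
        """Visit every item automatically, in insertion order."""
        return iter(self._items.values())

    def uri_for(self, identifier: str) -> str:
        """Turn an identifier into a URI without exposing the storage layout."""
        return self.base_uri + identifier

    def clone(self) -> "Repository":
        """Copy the whole repository; the copy can then evolve independently."""
        copy = Repository(self.base_uri, self.licence)
        for item in self:
            copy.add(item.identifier, item.content)
        return copy


repo = Repository("https://repo.example.org/item/", "CC0-1.0")
repo.add("cif-0001", "first crystal structure")
repo.add("cif-0002", "second crystal structure")

print(len(repo))                 # how many items are there?
for item in repo:                # a machine can walk the whole collection
    print(repo.uri_for(item.identifier), item.licence)

fork = repo.clone()              # anyone can take a full copy
```

The point is not the code but the contract: a machine can ask how big the repository is, walk all of it, resolve every identifier to a URI, read one licence, and take a copy.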

Having been to repo fringe, I had hoped for more feedback and public discussion about repositories. (In writing this I realise that repo fringe is effectively about institutional repositories, not repositories in general.) Yes, I’m provocative, and my views are not shared by everyone or even the majority. But in that case say something. I am one of the very few academics who is in any way trying to communicate with the IR community. I am one of the even smaller number of scientists who even know what repository means. Institutional repositories have exactly two (UPDATED) things in their favour:

  • They have significant resource commitment from their institutions.
  • They have created an excellent community including some star developers.

That is a great deal more than most other repositories have, which rely on building critical mass out of marginal funds and trying as hard as possible to create a community. When an institution funds a repository it presumably has a clear reason for doing it. In most cases I have no idea what that is. It seems increasingly that it is aimed at the management of research. That’s fine, research should be managed, and management takes money. But it bores most academics stiff. It bores the non-academic community unless they are government or funding agencies. If repositories are being used to support the Research Assessment process (REF), say so – and stick to that single purpose. Don’t also use them to try to advertise the University, or pretend they are for the benefit of academics (they generally aren’t) or that they take scholarship to the “public” (they don’t), or that they promote Open Access (they don’t, only hard political slogging does that), or that they are preserving scholarship for future generations. If you want to do any of these things build separate repos. If they don’t have a clear institutional basis – and I can only think of the REF as in that category – build domain or national repos. They are cheaper, and will allow much better discovery.

 

NL has a single repository for the country’s theses. UK has 100 universities each with their own, limited approach. I can find theses in the NL, not in the UK. Why does every university have to have its own approach? (Yes, it’s politics, and yes, every university has statutes going back to …, and yes, every student is the copyright owner and copyright is the new religion so it trumps everything. Did anyone lobby Hargreaves to have a national approach to copyright for theses?) This post is still looking for positive ways forward.

 

If IRs want to be involved with data then they should seek to host domain-specific repositories. I’ve had no take-up on this idea but I keep hoping.

 


5 Responses to Criteria for successful Repositories

  1. Neil Stewart says:

    Hi Peter,
    This is an excellent post, and I appreciate your engagement with IRs and your attempts to rethink the way in which the IR community approaches things. Full disclosure: I am an IR manager, so am speaking from that position.
    A couple of points spring to mind. First of all, I don’t think there’s necessarily a binary opposition between developing IRs and developing subject repositories. It’s perfectly possible to harvest selectively from IRs to subject-based portals, as per the recently launched Economists Online: http://www.economistsonline.org/home. I think this is something that will (or at least should) be done more over time, and is in fact essential to expose IR content- it seems to me that, for the reasons you list, IRs can’t usefully be “standalone” services. These initiatives require the kind of high-level political will that us IR managers aren’t generally in the position to bring to bear (sadly!).
    On the question of theses, I agree that the UK has been slow on the uptake here, but this is partly an issue of scale, and partly a result of the way the EThOS project has approached things. The BL has actually taken a relatively robust approach by digitising on demand, and only taking down if the copyright holder objects. However the EThOS portal cannot currently be considered a repository in the sense of making its contents openly available, and until the BL changes its approach so far as this is concerned, this will remain a problem. In terms of content, it seems to me that over time, as EThOS gets hold of more content and more institutions digitise back-runs of theses, the gaps will be filled. There’s also a governance issue here: some universities still aren’t mandating electronic submission of theses, which is in my view hard to believe, but there we go. More generally, the infrastructure for making theses available is actually already in place- DART Europe http://www.dart-europe.eu/ provides a pan-European portal for this.

    • pm286 says:

      Many Thanks and quick comments.
      >>>A couple of points spring to mind. First of all, I don’t think there’s necessarily a binary opposition between developing IRs and developing subject repositories. It’s perfectly possible to harvest selectively from IRs to subject-based portals, as per the recently launched Economists Online: http://www.economistsonline.org/home. I think this is something that will (or at least should) be done more over time, and is in fact essential to expose IR content- it seems to me that, for the reasons you list, IRs can’t usefully be “standalone” services. These initiatives require the kind of high-level political will that us IR managers aren’t generally in the position to bring to bear (sadly!).
      It is impossible to harvest material that doesn’t exist and that’s the current problem. It would make much more sense to have domain or national collections (these are much more likely to be populated) and for the IRs to extract what they want for whatever reason. People simply don’t and won’t use IRs for this purpose. Biologists, astronomers, chemists use domain-repos.
      Until the IRs actually help the researcher with their work, they won’t be used.
      >>>On the question of theses, I agree that the UK has been slow on the uptake here, but this is partly an issue of scale, and partly a result of the way the EThOS project has approached things. The BL has actually taken a relatively robust approach by digitising on demand, and only taking down if the copyright holder objects. However the EThOS portal cannot currently be considered a repository in the sense of making its contents openly available, and until the BL changes its approach so far as this is concerned, this will remain a problem. In terms of content, tt seems to me that over time, as EThOS gets hold of more content and more institutions digitise back-runs of theses, the gaps will be filled. There’s also a governance issue here: some universities still aren’t mandating electronic submission of theses, which is in my view hard to believe, but there we go. More generally, the infrastructure for making theses available is actually already in place- DART Europe http://www.dart-europe.eu/ provides a pan-European portal for this.
      I am completely uninterested in digitized theses and so are almost all scientists. We should be aggregating born-digital theses in semantic form in a digital repository. Digitization is a 19th-C approach to a 21st-C problem and opportunity. It may have some value for A+H, it has none for science. I have visited and presented my concerns to EThOS – they aren’t able to address any of them – so they aren’t relevant to the modern scientific world.
      I would mandate deposition into DART. But I am a control freak. The general view is that we should continue muddling and find excuses why things aren’t possible. That’s why we need individuals to build repos.

  2. Jim Downing says:

    I’d add, really close to the top: “People should want to get stuff out of the repository.” It’s the difference between this and this.

  3. Pingback: Thoughts on Institutional Repositories from Peter Murray-Rust | e-Science Community
