PP4: More on where data should be reposited

Scraped into Arcturus

More debate on where data repositories should be located

Chris Rusbridge says:

August 8, 2010 at 11:39 am


Let’s set the data to one side for the moment and think about the two models for science outputs (articles). It doesn’t greatly matter if articles are deposited in institutional repositories, domain repositories or one grand central repository (ArXiV on steroids). And however many there are, they don’t really have to be federated (I know I mentioned the word earlier, but I was thinking in a looser context). In fact, the OAI-PMH protocol on which repositories are based was originally based around the idea of federated repositories: there were going to be data providers (repositories) and service providers (specialist search sites, like OAISTER). The service providers would harvest the metadata from the data providers, and build their search and other services on that basis.
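As a rough illustration of the harvesting model described above, here is a minimal Python sketch of what a service provider does: build an OAI-PMH `ListRecords` request and pull Dublin Core fields out of the response. The repository URL, the record identifier and the sample response are hypothetical and written only for illustration; a real harvester would also fetch the URL over HTTP and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a service provider requests to harvest metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, titles, subjects and types from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        records.append({
            "identifier": record.findtext(f"{OAI_NS}header/{OAI_NS}identifier"),
            "titles": [e.text for e in record.iter(DC_NS + "title")],
            "subjects": [e.text for e in record.iter(DC_NS + "subject")],
            "types": [e.text for e in record.iter(DC_NS + "type")],
        })
    return records


# Hypothetical response body, trimmed to a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis title</dc:title>
          <dc:subject>Chemistry</dc:subject>
          <dc:type>Thesis</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvest_url = list_records_url("https://repository.example.org/oai")
harvested = parse_records(SAMPLE_RESPONSE)
```

The service provider only ever sees this metadata, so its search can be no better than the fields the data provider chose to fill in.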

This was a great idea of its time, but let’s see what happened. OAISTER and a few other search providers do exist, and let you search the repositories they know about. They generally do a MUCH better job than the previous distributed search paradigm, Z39.50 (and later SRU/SRW), which were very prone to realtime and metadata mismatch failures. But sadly, they turn out to do a MUCH WORSE job than the best search engines, like Google et al.

So now we have a large and growing set of repositories, being indexed by Google, and searched by millions or billions every day. And lo! It works well! […]

PMR: This is a useful analysis. Essentially the message is that a textual document can be discovered and indexed by major search engines outside the repository system. Effectively for discovery all you have to do is post your material on the web and very shortly (hours) it will be discovered. For example this blog is indexed within about 2 hours.

So what is the value of the text-oriented repository system? It has some value, but it’s often not what authors want. And authors will generally only post something if there is value to them. Here are some of the points which have been used to promote the value of IRs:

  • A single place for authors to place material. If authors wish their work to be discovered they will generally create specific web pages – and these will be discovered and indexed. A repository takes time to learn; the early ones were unnatural to use and offered no flexibility in the type of material. So authors don’t use them unless there is a specific benefit.
  • Higher visibility for citations. This is true and valuable where the original material is closed access. But depositing is often a hassle – it’s not a natural process and it’s not surprising that most people I talk to have no familiarity with their repo.
  • Archival. Scientific authors have no interest in archival. They want their stuff disseminated now. If they publish in a conventional journal it is de facto archived. In general repositories have no facilities for archiving blogs, web pages, hypermedia.
  • Metadata-driven federated discovery. It would be nice if this happened. My simple test: “find all exposed chemistry theses in UK universities”. This is a simple, important, legitimate aspiration. I want it for my Green Chain Reaction Challenge. It depends on two simple fields, “thesis” and “chemistry”, which I would hope would be among the top metadata fields. Yet I cannot do this on even a single site, let alone a federation of UK repos.
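To show how little the chemistry-theses test asks for, here is a minimal Python sketch of the query as it would look if repositories exposed consistent “type” and “subject” fields. The record layout, field names and sample records are all hypothetical and exist only to illustrate the filter; they do not describe any real repository.

```python
def is_chemistry_thesis(record):
    """True if the record is typed as a thesis and has chemistry as a subject."""
    types = {t.strip().lower() for t in record.get("types", [])}
    subjects = {s.strip().lower() for s in record.get("subjects", [])}
    return "thesis" in types and "chemistry" in subjects


def find_chemistry_theses(records):
    """Return only the records that pass the two-field test."""
    return [r for r in records if is_chemistry_thesis(r)]


# Hypothetical harvested records (invented for this example).
sample_records = [
    {"identifier": "rec-1", "types": ["Thesis"], "subjects": ["Chemistry"]},
    {"identifier": "rec-2", "types": ["Article"], "subjects": ["Chemistry"]},
    {"identifier": "rec-3", "types": ["Thesis"], "subjects": ["History"]},
    {"identifier": "rec-4", "types": ["thesis "], "subjects": ["chemistry", "Catalysis"]},
]

chemistry_theses = find_chemistry_theses(sample_records)
```

The filter itself is trivial. It only works when every repository fills in both fields using the same vocabulary, and that is the part that is missing.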

So the primary roles for repos appear to be archival and citation enhancement.

So to data. As I remember you found out with some of your molecule deposits in the Cambridge repository, data can have a problem: much data will not be indexed by search engines (it “cannot speak for itself” as text does), and hence is not inherently findable. But as we also found out then, if you stick it in some sort of repository with some kind of standardised metadata, the latter becomes indexable and hence searchable. So now well-curated data can perhaps be found even if it cannot speak.

Let’s clarify terms – a general repository such as DSpace has the opportunity for the author to manually add metadata terms for discovery. It has little opportunity to support structured semantic content and so the metadata is associated not with data items but with pages relating to collections of data items. I have put (or rather Jim Downing has put) 150,000 data sets into our DSpace, with 150,000 splash pages. The pages do not interpret the data – they may give some very limited discovery metadata (though at that stage not enough even to retrieve the data automatically). I cannot, for example, discover an ionization potential between 5 and 10 eV. And yet the data are clearly visible.
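The ionization-potential query is the sort of thing a numeric index makes easy. The following Python sketch shows one simple way to do it, assuming the values had been extracted from the data sets in the first place (which a splash page does not do). The class name, the data set identifiers and the values in eV are invented for illustration.

```python
from bisect import bisect_left, bisect_right


class NumericIndex:
    """Sorted index from a numeric property to data set identifiers."""

    def __init__(self, entries):
        # entries: iterable of (value, dataset_id) pairs
        self._entries = sorted(entries)
        self._values = [value for value, _ in self._entries]

    def range(self, low, high):
        """Return identifiers whose value lies in the closed interval [low, high]."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return [dataset_id for _, dataset_id in self._entries[start:end]]


# Invented ionization potentials in eV, keyed to hypothetical identifiers.
ionization_index = NumericIndex([
    (12.6, "dataset-0005"),
    (4.2, "dataset-0001"),
    (7.3, "dataset-0003"),
    (10.0, "dataset-0004"),
    (5.0, "dataset-0002"),
])

between_5_and_10 = ionization_index.range(5, 10)
```

The lookup is two binary searches over a sorted list. The hard part is not the search but getting typed values with units out of the deposited files.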

Of course there are lots more problems with data. Data are different from text in many ways. Scale factors can vary enormously: size, numbers of objects, rate of deposit, rate of access and use, rate of change, etc by many orders of magnitude. Raw data, processed data, combined data, interpolated data (eg the UARS Level 0 to Level 3 distinctions). Many more ways, I guess. So the repository infrastructure built for text will only be useful for a small subset of data. It might be a useful subset (eg the data representing figures and tables and otherwise substantiating the findings of a scientific article), but there will be much data that’s not appropriate. I think that’s a great place to start promoting the value of depositing data, as it links so closely to the value point of the science enterprise for so many scientists: the article.

I will agree that articles are a good place to start. First, they are the natural way that many scientists communicate their findings. It’s a pity that most publishers do not care about publishing the data properly and, worse, claim ownership of everything, thus stultifying any attempt to make advances here. The association of tables, figures, spectra, molecules, etc with text is a good way of providing many types of data. The data, of course, are not properly indexed and will rely on things like captions. Scientific units, etc are almost always lost. Numbers are squashed into pixels or PDF.

Now I really do think the institutional versus domain data repository dichotomy is a false one. I don’t think I know of any “single” domain repository. There are more than a thousand databases in nucleic acids research alone. There are dozens of data centres in the social sciences, and many more in climate-related areas. There are half a dozen funded by NERC in the natural sciences. But there are still many domain areas without them, and no likelihood of them being set up. In those cases data repositories linked to institutions, faculties, research groups etc are appropriate. If they are properly managed (yes, with adequate domain involvement), the data can be found and used by those who need them.

There’s a misunderstanding here. All the examples given here are what I call domain repositories. They usually cater for a few types of data, and data that are understood by those running the repository. The examples are perfect illustrations of why domain repositories already exist and already work. A nucleic acid repository accepts nucleic acid sequences, not evolutionary biology. A crystallography repository accepts crystallography, not evolutionary biology.

So, if there are domain repositories for your domain, use the most appropriate for your kind of data. If there aren’t, and it is even reasonably appropriate, you could try your institution, especially for the “data behind the graph”. Or you could spend your energy persuading your sub-domain to create and sustain its own.

And that’s exactly what I have been asking for consistently. It takes three things, all beginning with “M”:

  • Motivation (people need to want to do it, and I am trying to show this in chemistry and some other areas). It’s not easy but it’s coming. The greatest motivations are coming from domains which care about published data quality (patchy in chemistry, climate; strong in crystallography, proteomics) and funders who insist.
  • Methods. The data management, the metadata, and the discovery mechanisms have to be in place. Bunging data in an unstructured repository is a waste of time. There has to be a domain-specific discovery tool, whether it’s graph substructure search (for chemistry), numeric indexing (e.g. for crystallography), or a triple store (e.g. for key-value data).
  • Money. It’s not a zero cost operation. That’s hard and there is no general solution.


Perhaps the really hard part is persuading folk to want to save and share their data in the first place!

That’s what I am trying to do through Panton Papers and elsewise. But we only get one shot. If we don’t do it properly then recovering will be almost impossible. I’m sorry to say it but IRs in universities have done nothing for scientists (except perhaps in the few universities which have actually tried to promote Open Access).

You rightly say that many domains (and let’s use this to cover any sub-sub… domain) do not have their repositories. Their problems will not be solved by simply putting complex data into general-purpose repositories – they will be worsened because they won’t gain anything.

However, IRs have funding (I don’t know for how much longer, but they could do it). They could reach out to domains, but it would have to be on a specific basis. It might be done in conjunction with an Open Access publisher (most closed access publishers currently have a model of “owning data” and selling it back to the community). It would have to be a model where the content was not “owned” by the particular institution. (Indeed the whole idea that data can be fragmented on the basis of people’s employers rather than the natural structure is clearly unworkable.) And in many institutions that probably fails on inter-institutional politics at the first hurdle – why should “we” host “their” data?

So for those who still believe that federated IRs provide us with a natural scalable solution, just tell me how to get chemistry theses in the UK. It’s a natural, simple, important request. The only way I can do it is by writing my own crawler and text-mining engine.

And I will probably be told that I don’t have the legal rights to do it.




One Response to PP4: More on where data should be reposited

  1. Steve Hitchcock says:

    > I’m sorry to say it but IRs in universities have done nothing for scientists (except perhaps in the few universities which have actually tried to promote Open Access).
    Peter, There are three components here: the IR software (primarily a set of interfaces), the people who run the IR service, and the scientists. In your exceptions you acknowledge it is possible to make these three components work, in the right combination and circumstances. Surely the best approach is to try and emulate these examples – admittedly, these components may not be as interchangeable as we might like – noting that the scientists have just as much a role as the other players.
