Data and Institutional Repositories

One of the themes of ETD2007 was a strong emphasis on institutional repositories (IRs). Not surprising, since they are topical and a natural place to put theses and dissertations. Almost everyone there – many from the Library and Information Services (LIS) community – had built, or was building, IRs. I asked a lot of people why. Many were doing it because everyone else was, or because there was funding, or for similar pragmatic reasons. But beyond that the motives varied considerably. They included:

  • To promote the institution and its work
  • To make the work more visible
  • To manage business processes (e.g. thesis submission or research assessment exercises)
  • To satisfy sponsors and funding bodies
  • To preserve and archive the work
  • To curate the work

and more. The point is that there is no single purpose, and therefore IR software and systems have to cope with a lot of different demands.
The first-generation IRs (ePrints, DSpace, Fedora) addressed the reposition of single eManuscripts (“PDFs”) with associated metadata. This now seems to work quite well technically, although there are few real metrics on whether it enhances exposure, and compliance in many institutions is poor. There is also a major problem with some Closed Access publishers (e.g. in chemistry) formally forbidding Open reposition. So the major problems are social.
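For concreteness, this is roughly the shape of the metadata a first-generation IR holds alongside the PDF. A minimal sketch in Python, emitting simple Dublin Core elements – the thesis title, author and handle below are invented:

    # A minimal sketch (not any particular repository's ingest API): the kind of
    # Dublin Core record a first-generation IR stores alongside a single PDF.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    record = ET.Element("record")
    for element, value in [
        ("title",      "Crystal structures of some substituted pyridines"),  # invented title
        ("creator",    "A. N. Other"),                                        # invented author
        ("type",       "Thesis"),
        ("format",     "application/pdf"),
        ("identifier", "http://repository.example.org/handle/1234/5678"),     # invented handle
    ]:
        ET.SubElement(record, "{%s}%s" % (DC, element)).text = value

    print(ET.tostring(record, encoding="unicode"))
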
Recently the LIS community has started to highlight the possibility of repositing data. This is welcome, but needs careful thought – here are a few comments.
Many scholars produce large amounts of valuable data. In some cases the data are far more important than the full text. For example, Nick Day’s crystallographic repository CrystalEye contains 100,000 structures from the published literature and, although it links back to the text, can be used without it. This is also true of crystallographic data collected from departmental services, as in our SPECTRa system. With the right metadata, data can often stand alone.
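To make “stand alone” concrete: a crystallographic CIF file carries its own cell parameters and bibliographic pointers alongside the numbers, so even a trivial parser can use it without the article. A sketch only – the tag names are standard CIF items, the values are invented, and real CIF has loops and multi-line values that this ignores:

    # A toy parser for simple CIF "_tag value" items; it ignores loops and
    # multi-line values. The point is that the data file itself tells you the
    # cell, the formula and where the structure was published.
    cif_text = """
    data_example
    _journal_name_full        'Acta Crystallographica Section E'
    _cell_length_a            10.234
    _cell_length_b            11.567
    _cell_length_c            12.890
    _chemical_formula_sum     'C6 H5 N O2'
    """

    def parse_simple_cif_items(text):
        items = {}
        for line in text.splitlines():
            line = line.strip()
            if line.startswith("_"):
                tag, _, value = line.partition(" ")
                items[tag] = value.strip().strip("'")
        return items

    items = parse_simple_cif_items(cif_text)
    print(items["_cell_length_a"], items["_journal_name_full"])
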
All scientists – and especially those with sad experiences of data loss (i.e. almost all) – are keen for their data to be stored safely and indefinitely. And most would like other scientists to re-use their data. This needs a bit of courage: the main drawbacks are:

  • re-analysis could show the data or the conclusions were flawed
  • re-analysis could discover exciting science that the author had missed
  • journals could refuse to publish the work (isn’t it tedious to have to mention this in every post :-()

But many communities have faced up to it. The biosciences require deposition of many sorts of data when articles are published. A good example is the RCSB Protein Data Bank, which has a very carefully thought-out and tested policy and process. If you are thinking of setting up a data repository, this document (and many more like it) should be required reading. It works, but it’s not trivial – 205 pages. It requires specialist staff and constant feedback to and from the community. Here’s a nice, honest chunk:

If it is so simple, why is this manual so long?
In the best of worlds all information would be exchanged using both syntactically and semantically precise protocols. The reality of the current state of information exchange is somewhat different. Most applications still rely on the poorly defined semantics of the PDB format. Although there has been significant effort to define and standardize the information exchange in the crystallographic community using mmCIF, the application of this method of exchange is just beginning to be employed in crystallographic software.
Currently, there is wide variation in specification of deposited coordinate and structure factor data. Owing to this uncertainty, it is not practical to attempt fully automated and unsupervised processing of coordinate data. This unfortunately limits the functionality that can be presented to the depositor community through Web tools like ADIT, as it would be undesirable to have depositors deal with the unanticipated software failures arising from data format ambiguities. Rather than provide for an automated pipeline for processing and validating coordinate data, coordinate data processing is performed under the supervision of an annotator. Typical anomalies in coordinate data can be identified and rectified by the annotator in a few steps permitting subsequent processing to be performed in an automated fashion.
The remainder of this manual describes in detail each of the data processing steps. The data assembly step in which incoming information is organized and encoded in a standard form is described in the next section.
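
It’s worth dwelling on what “identified and rectified by the annotator” hides. Here is a toy sketch – nothing like the PDB’s real pipeline – of the kind of automated sanity check involved, assuming the classic fixed-column ATOM record layout (the example line is invented):

    # A toy check on fixed-column PDB ATOM/HETATM records: flag unparseable
    # coordinates or an occupancy outside [0, 1]. Real annotation covers far
    # more than this.
    def check_atom_record(line):
        problems = []
        if not (line.startswith("ATOM") or line.startswith("HETATM")):
            return problems
        try:
            for start, end in ((30, 38), (38, 46), (46, 54)):
                float(line[start:end])          # x, y, z must parse as numbers
            occupancy = float(line[54:60])
        except ValueError:
            return ["coordinates or occupancy not parseable"]
        if not 0.0 <= occupancy <= 1.0:
            problems.append("occupancy %s outside [0, 1]" % occupancy)
        return problems

    example = "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.50 20.00           N"
    print(check_atom_record(example))   # -> ['occupancy 1.5 outside [0, 1]']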

So data reposition is both highly desirable and complex. I’m not offering easy solutions. But what is not useful is the facile idea that data can simply be reposited in current IRs. (I heard this suggested by a commercial repository supplier at the Digital Scholarship meeting in Glasgow last year. He showed superficial slides of genomes, star maps, etc. and implied that all this could be reposited in IRs. It can’t, and I felt compelled to say so.)
At ETD2007 we had a panel session on repositing data. Lars Jensen gave a very useful overview of bioscientific data – here are my own points from it:

  • There is a huge amount of bioscience data in repositories
  • These are specialist sites, normally national or international
  • There is a commitment to the long term
  • Much bioscience is done from the data in these repositories
  • The data in them are complex
  • The community puts much effort into defining the semantics and ontologies
  • Specialist staff are required to manage the reposition and maintenance
  • There are hundreds of different types of data – each requires a large amount of effort
  • The relationships between the data are both complex and exceedingly valuable

All bioscientists are aware of these repositories (they don’t normally use this term – often “data bank”, “gene bank”, etc. are used). They would always look to them when depositing their data. Moreover, the community has convinced the journals to enforce the reposition of data by authors.
Some other disciplines have similar approaches – e.g. astronomers have the International Virtual Observatory Alliance. But most don’t. So can IRs help?
I’d like to think they can, but I’m not sure. My current view is that data (and especially metadata) – at this stage in human scholarship – have to be managed by the domains, not the institutions. So if we want chemical repositories, the chemical community should take the lead. Data should first be captured in departments (e.g. by SPECTRa), because that is where the data are collected, analysed and – in the first instance – re-used. For some other domains it’s different – capture might happen at a particular large facility (synchrotron, telescope, outstation, etc.).
Some will argue that chemistry already operates this domain-specific model. Large abstracters aggregate our data (which are given away for free) and then sell them back to us. In the 20th century this was the only model, but on the distributed web it breaks down. It’s too expensive and does not allow community ontologies to be developed (the only Open ones in chemistry are developed by biologists). And it’s selective and does not help the individual researcher or department.
Three years ago I thought it would be a great idea to archive our data in our DSpace repository. It wasn’t trivial to put in 250,000 objects. It’s proving even harder to get them out (OAI-PMH is not designed for complex and compound objects).
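For what it’s worth, this is roughly what getting them back out over OAI-PMH looks like – a minimal sketch against a hypothetical endpoint. Every item comes back as flat Dublin Core, one resumption page at a time, which is precisely where compound objects suffer:

    # A minimal OAI-PMH harvesting loop (the endpoint URL is hypothetical).
    # Records arrive as flat metadata, page by page via resumption tokens;
    # there is no natural way to pull a compound object back in one piece.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    BASE_URL = "http://repository.example.org/oai/request"   # hypothetical endpoint

    def harvest(base_url, metadata_prefix="oai_dc"):
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.fromstring(response.read())
            for record in tree.iter(OAI + "record"):
                yield record
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # for record in harvest(BASE_URL):
    #     print(record.find(".//" + OAI + "identifier").text)
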
Joe Townsend, who works with me, will submit his thesis very shortly. He wants to preserve his data – 20 GBytes. So do I. I think it could be very useful for other chemists and eScientists. But where to put it? If we put it in DSpace it may be preserved but it won’t be re-usable. If he puts it on CD it requires dozens of actual CDs. And they will decay. We have to do something – and we are open to suggestions.
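Whatever the eventual home, the minimum seems to be a fixity record that travels with the files. A rough sketch (the directory path is hypothetical) that writes a SHA-256 manifest in the spirit of BagIt-style packaging, so that wherever the 20 GBytes end up, bit-rot is at least detectable:

    # Walk a data directory and write a SHA-256 manifest: one line per file,
    # "digest  relative/path", reading in 1 MB chunks to cope with large files.
    import hashlib
    import pathlib

    def write_manifest(data_dir, manifest_path="manifest-sha256.txt"):
        data_dir = pathlib.Path(data_dir)
        with open(manifest_path, "w") as manifest:
            for path in sorted(data_dir.rglob("*")):
                if not path.is_file():
                    continue
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                manifest.write("%s  %s\n" % (digest.hexdigest(), path.relative_to(data_dir)))

    # write_manifest("thesis-data/")   # hypothetical directory of CML, SVG, etc.
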
So we have to have a new model – and funding. Here are some constraints in chemistry – your mileage may vary:

  • There must be support for the development of Open ontologies and protocols. One model is to encourage groups that are already active and then transfer the maintenance to International Unions or Learned Societies (though this can be difficult when they are also income-generating Closed Access publishers).
  • Funders must make sure that the work they support is preserved. Hundreds of millions or more are spent in chemistry departments to create high-quality data, and most of these data are lost. It’s short-sighted to argue that the data need only live until the paper is published.
  • Departments must own the initial preservation of these data. This costs money. I think the simplest solution is for funders to mandate the Open preservation of data (cf. the Wellcome Trust).
  • The institutions must support the generic preservation process. This requires departments actually talking to LIS staff. It also requires LIS staff who can converse with scientists on equal terms. This is hard, but essential.

Where the data finally end up is irrelevant as long as they are well managed. There may, indeed, be more than one copy. Some could be tuned for discoverability.
So the simple message is:

  • save your data
  • don’t simply put it in your repository

I wish I could suggest better how to do this well.


2 Responses to Data and Institutional Repositories

  1. What type of data is in those 20GB?

  2. pm286 says:

    (1)
    Mainly CML, I think. 10,000 GAMESS calculations converted into CMLComp. There’s also some SVG plots of the results with hyperlinks to all the calculations. Since the compounds are from the published literature people might be interested in them 🙂
