Useful constructive feedback on my strawman PP4 for where scientific data should be stored:
Kenji Takeda says:
August 2, 2010 at 9:39 pm
Dear Peter,
Great discussion you have opened up! I believe that you are right: domain specialists need to be involved. My personal hope is that a new type of data librarian may emerge, and that these people may come from a mixture of discipline specialists and librarians. This is what happened/happens in scientific computing: some scientists move into the computing arena, and some computing people move into the scientific arena, as demonstrated by the eScience programme.
I believe this will take around a decade, as open repositories have done. I also believe that a federated approach is the only sane architecture. You may be interested in reading Chris Gutteridge’s recent paper on the subject at http://eprints.ecs.soton.ac.uk/20885/
The University of Southampton is committed to open data, and as such we hope to be in the leading wave of open institutional data repositories. eCrystals and our new Materials Data Centre (www.materialsdatacentre.com) projects are examples of disciplines being the focus.
The whole point of our Institutional Data Management Blueprint project is to figure out how a whole institution can manage and openly publish its data. The genesis of this was that a bunch of us got together and realised that the institution is responsible for its data, and it therefore has the biggest incentive to manage and publish it. Libraries have a role, as do publishers. I think forward-thinking libraries could jump right in here, as you suggest, but as always funding and priorities need to be matched.
We have a long way to go, but we’re starting down the road. We’re learning all the time!!!
Cheers,
Kenji
http://www.southamptondata.org
OK, I’ll set out the possibilities as I see them. I’d like to get agreement on what is being proposed before discussing the merits of each.
Federated Repositories
Here we conceive of a network of repositories worldwide (it has to be worldwide, as disciplines are worldwide). The simplest model to conceive is one repository per research institution, funded out of central university funds (themselves top-sliced from grants). [There are somewhere between 1,000 and 10,000 institutions; I’d appreciate a better figure. This is what we have got.] They are used in certain disciplines for storing digital artefacts (mainly in arts and humanities) but not in science. There is variable compliance in repositing author manuscripts, either raw or final: Southampton and Queensland UT are very high, Cambridge probably near zero.
When an author has a data set they reposit it locally. The repository is federated to all 9999 other repos and immediately alerts them (RSS?) to a new data set. So now all repos can get the dataset.
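To make that concrete, here is a minimal sketch of what such feed-based federation might look like, assuming each peer repository publishes an RSS/Atom feed of new deposits at a well-known URL. The peer URLs and field names below are invented for illustration, not a real system:

```python
# A minimal sketch of feed-based federation between repositories.
# Assumes each peer publishes an RSS/Atom feed of new deposits at a
# well-known URL -- the peer URLs below are invented for illustration.
import feedparser  # pip install feedparser

PEER_FEEDS = [
    "https://repo.example-university-a.ac.uk/datasets/feed",
    "https://repo.example-university-b.ac.uk/datasets/feed",
]

seen = set()  # a real repository would persist this between polls

def poll_peers():
    """Poll every peer feed and report datasets we have not yet seen."""
    for url in PEER_FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            entry_id = entry.get("id") or entry.get("link")
            if entry_id and entry_id not in seen:
                seen.add(entry_id)
                # A real repository would now fetch and index the dataset.
                print("New dataset:", entry.get("title"), "at", entry.get("link"))

if __name__ == "__main__":
    poll_peers()
```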
Pros:
- Insts already fund repos so we “only” have to scale it
- Insts have a permanency
- Insts can interact directly with authors and help or beat them to submit.
- Authors have a one-stop shop for any type of data
Cons:
- There is very little, if any, current federation of repos
- It will take years/decades for 9999 repos to get to a state where this federation is seamless
- All repos have to play. If some don’t, then their authors have nowhere to reposit data
- The individual repos have no domain knowledge of the data and can only provide generic support
- Users have to have a portal into the system
- There is no simple mechanism for adding domain support for search, etc.
- Journals have to deal with 9999 different places to reposit
- Compliance will have to be monitored by Insts, will be highly variable and is unlikely to be acceptable to funders for some time.
- It’s very difficult to run distributed communal projects
Domain repositories
These already exist, some for more than 20 years. My guess is that there are in the region of 30-300; again a figure would be useful. There is a single site for each discipline (there may be mirroring or some slight national/continental federation, but it’s small). For example the world uses 3 genome sites (US, EU, JP). PubMedCentral uses 2 so far (PMC and UKPMC). The author finds the single repo for their discipline and submits the data. Increasingly the data are reviewed and validated by specialist personnel and software. The world has a single point of deposition and a single search point.
Pros:
- One clear deposition point
- One clear search and retrieval point
- Specialist help for both
- Major interest in the community (e.g. special sessions at meetings)
- The system works in many cases and is very highly valued by the community
- Journals understand it and can manage the interaction
- Funders can easily monitor compliance
- It’s easy-ish to run distributed communal projects. For example OpenStreetMap is sponsored by ULCC (Univ London Comp. Centre).
Cons:
- A different repo needed for each discipline or subdiscipline
- Funding is highly discipline-dependent and always difficult (unlike the current top-slicing support for IRs).
- There is no guaranteed permanence.
Ultimately in the giant scientific semantic web the systems will converge (the domain experts could receive material from Insts as well as directly). But that is 10 years or more away.
To build a federated system of IRs for this will take at least 5 years and probably more (see the current rate of progress in IRs; this is not judgmental, simply good project-metric practice). And even then it will not be complete, and there will be many, many gaps. But to take this route we would need to tell people, for that whole period: hang on, federated IRs are coming, and we can’t do discipline repos until they are here.
The alternative – which is happening whatever you feel about it – is that domains are scraping their pennies together, blagging space where they can find it. Journals are supporting it. Funders such as Wellcome are supporting it (UKPMC).
Pragmatically, therefore, I see that if we are to capture the drive for data we have to take it to the disciplines and not to the Insts. Of course there will be overlap and collaboration, but it will be by domain. Show me any scientist who is arguing otherwise – I haven’t found one. And scientists know it will cost money.
I blogged about this myself years ago now.
http://www.russet.org.uk/blog/2007/06/institutional-and-subject-archives/
For the record, I think that both of your two options are wrong. We should have neither institutional nor subject archives. These should be part of the data model, not the architecture. In the ideal world, a single repository backed by a distributed and federated model would be the way I would go. This is how much free software is distributed, for instance.
But, I’d much rather have discipline-based repositories than institutional. Who pays me has less impact on the way that I work than what I am doing.
Hmmm. I think I agree with Phil that both the options are wrong as you describe them. I’m not sure I understand Phil’s reason, though!
Let’s set the data to one side for the moment and think about the two models for science outputs (articles). It doesn’t greatly matter if articles are deposited in institutional repositories, domain repositories or one grand central repository (arXiv on steroids). And however many there are, they don’t really have to be federated (I know I mentioned the word earlier, but I was thinking in a looser context). In fact, the OAI-PMH protocol on which repositories are based was originally designed around the idea of federated repositories: there were going to be data providers (repositories) and service providers (specialist search sites, like OAIster). The service providers would harvest the metadata from the data providers, and build their search and other services on that basis.
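For concreteness, here is a rough sketch of the harvesting half of that model: a service provider pulling Dublin Core metadata from a data provider over OAI-PMH. The verb, metadataPrefix and resumptionToken mechanics are standard OAI-PMH; the endpoint URL is invented:

```python
# A sketch of OAI-PMH harvesting: a service provider pulling Dublin
# Core metadata from a data provider (repository).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
BASE = "https://repository.example.ac.uk/oai"  # hypothetical endpoint

def harvest(base_url):
    """Yield (identifier, title) for every record the repository exposes."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    while url:
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        for record in root.iter(OAI + "record"):
            identifier = record.findtext(".//" + OAI + "identifier")
            title = record.findtext(".//" + DC + "title")
            yield identifier, title
        # Large result sets are paged: follow the resumptionToken if present.
        token = root.findtext(".//" + OAI + "resumptionToken")
        url = (base_url + "?verb=ListRecords&resumptionToken="
               + urllib.parse.quote(token)) if token else None

for identifier, title in harvest(BASE):
    print(identifier, title)
```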
This was a great idea of its time, but let’s see what happened. OAIster and a few other search providers do exist, and let you search the repositories they know about. They generally do a MUCH better job than the previous distributed-search paradigm, Z39.50 (and later SRU/SRW), which were very prone to real-time and metadata-mismatch failures. But sadly, they turn out to do a MUCH WORSE job than the best search engines, like Google et al.
So now we have a large and growing set of repositories, being indexed by Google, and searched by millions or billions every day. And lo! It works well! It is simple, loosely coupled and only lightly federated. Yet if you put a paper in the Cambridge repository, I can find it the very next day. You could put that paper on your personal web site, or on a blog as Henry Rzepa does, and I could still find it. But it would only persist as long as the web site or blog does (I know Henry is worrying about this).
So to data. As I remember, you found out with some of your molecule deposits in the Cambridge repository that data can have a problem: much data will not be indexed by search engines (it “cannot speak for itself” as text does), and hence is not inherently findable. But as we also found out then, if you stick it in some sort of repository with some kind of standardised metadata, the metadata becomes indexable and hence searchable. So now well-curated data can perhaps be found even if it cannot speak.
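By way of illustration, here is a tiny sketch of what “some kind of standardised metadata” could look like: building a Dublin Core record for a data file. All the field values are invented; oai_dc is the minimal metadata format every OAI-PMH repository must support:

```python
# A toy example of standardised metadata: a Dublin Core record for a
# data file. All field values are invented for illustration.
from xml.etree.ElementTree import Element, SubElement, tostring

DC = "http://purl.org/dc/elements/1.1/"

def dc_record(fields):
    """Build a simple Dublin Core XML fragment from a dict of fields."""
    record = Element("metadata")
    for name, value in fields.items():
        SubElement(record, "{%s}%s" % (DC, name)).text = value
    return tostring(record, encoding="unicode")

print(dc_record({
    "title": "Crystal structures for compound X (invented example)",
    "creator": "A. Researcher",
    "type": "Dataset",
    "format": "chemical/x-cml",
    "identifier": "https://repository.example.ac.uk/id/12345",
}))
```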
Of course there are lots more problems with data. Data are different from text in many ways. Scale factors can vary by many orders of magnitude: size, numbers of objects, rate of deposit, rate of access and use, rate of change, etc. There are raw data, processed data, combined data, interpolated data (e.g. the UARS Level 0 to Level 3 distinctions). Many more ways, I guess. So the repository infrastructure built for text will only be useful for a small subset of data. It might be a useful subset (e.g. the data representing figures and tables and otherwise substantiating the findings of a scientific article), but there will be much data for which it is not appropriate. I think that subset is a great place to start promoting the value of depositing data, as it links so closely to the value point of the science enterprise for so many scientists: the article.
Now I really do think the institutional versus domain data repository dichotomy is a false one. I don’t think I know of any “single” domain repository. There are more than a thousand databases in nucleic acids research alone. There are dozens of data centres in the social sciences, and many more in climate-related areas. There are half a dozen funded by NERC in the natural sciences. But there are still many domain areas without them, and no likelihood of them being set up. In those cases data repositories linked to institutions, faculties, research groups etc are appropriate. If they are properly managed (yes, with adequate domain involvement), the data can be found and used by those who need them.
So, if there are domain repositories for your domain, use the most appropriate for your kind of data. If there aren’t, and it is even reasonably appropriate, you could try your institution, especially for the “data behind the graph”. Or you could spend your energy persuading your sub-domain to create and sustain its own.
Perhaps the really hard part is persuading folk to want to save and share their data in the first place!