Dictated into Arcturus
The Green Chain Reaction will soon be generating a lot of high-quality structured data. The question is how and where to store it. To give an idea of the scope, let me illustrate this with the patent data.
The European Patent Office publishes about 100 patents each week in the categories we are interested in. Our current software downloads the index, extracts all patents, and selects those classified as chemistry. Each patent contains anywhere between 5 and 500 files (the large numbers arise because the chemical structures are represented as graphical images, usually TIFFs). So this means about 10,000 files each week, in a well-structured hierarchy. The absolute size is not large: about 100 MB per index. We arrange the raw and processed data in a directory structure for each index so that it can easily be exposed on the web. Every document will have a unique identifier, so it is straightforward to transform these into URIs and URLs. This means we will be able to create Linked Open Data with little extra effort.
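As a purely hypothetical sketch of that identifier-to-URL mapping: the directory root, hostname and example publication number below are placeholders rather than our actual layout, but they show how one identifier can serve as file path, URI and web address at the same time.

    import os

    BASE_DIR = "/data/greenchain/patents"                # assumed local root
    BASE_URL = "http://example.org/greenchain/patents"   # placeholder host

    def locations(index_week, patent_id):
        """Return the directory and the URL for one patent in one weekly index."""
        return (os.path.join(BASE_DIR, index_week, patent_id),
                "{0}/{1}/{2}/".format(BASE_URL, index_week, patent_id))

    patent_dir, patent_url = locations("2010-33", "EP1234567")
    print(patent_dir)    # /data/greenchain/patents/2010-33/EP1234567
    print(patent_url)    # http://example.org/greenchain/patents/2010-33/EP1234567/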
So the whole project might deliver
10 years × 50 weeks × 10,000 files = 5,000,000 files (ca. 50 gigabytes)
This shouldn’t terrify anybody; in fact I routinely hold subsets of this on my laptop. It’s simply a set of structured directories which can be held on a file system and exposed as web pages. There is no need for relational databases or other engines to deliver them. (Of course we shall also build indexes, which may require engines such as triple stores.)
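To illustrate how little machinery is needed, here is a minimal sketch using Python’s built-in http.server (the directory path is a placeholder): it exposes the whole tree as browsable web pages with no database behind it.

    import functools
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    # Serve the (assumed) patent tree on port 8000; every directory and file
    # becomes a plain, browsable web page.
    handler = functools.partial(SimpleHTTPRequestHandler,
                                directory="/data/greenchain/patents")
    HTTPServer(("", 8000), handler).serve_forever()

A real deployment would of course sit behind a proper web server, but the point stands: a static file tree is enough.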
The question is where to store it. We have recently been discussing whether data should be stored in domain-specific repositories (DSRs), in institutional repositories (IRs), in both, or in neither. This is where we’d like your help.
I don’t know how to store five million web pages in an institutional repository. It ought to be easy. (Jim Downing and I tried with 200,000 files in DSpace and it was a failure, because of the insistence on splash pages, which destroy the natural structure.) It’s critical to store them as web pages so that they are indexed by the major search engines. We shall also index them chemically and by data. It’s obviously a valid type of digital hyperobject to store in a repository, and our request must be similar to many others that scientists are likely to make.
We could also store them in a domain repository. I don’t know of any Open domain repository for chemical patents (there are many closed ones and a few that are free but not open). It’s possible that we could create a service equivalent to the one we provide for Crystaleye (http://wwmm.ch.cam.ac.uk/crystaleye). However, this does not address the problem of long-term archiving (although, assuming this experiment is successful, I don’t think there will be any problem finding people who wish to help).
Or we could store it through the Open Knowledge Foundation and its CKAN repository of metadata. CKAN is not normally used for storing data per se, so this would be a departure and the OKF would need to discuss it. It wouldn’t be my first choice, but it’s certainly better than not storing the results at all.
Or we could store it through something like BioTorrent (http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0010071 and http://www.biotorrents.net/ ). This is a new and exciting service which tackles the problem of sharing open data files in a community. One of its purposes is to solve the problem of distributing very large files, but it may also be suitable for distributing a very large number of small files – I don’t yet know but I’m sure I will get feedback from the community. If this is the best technical solution then I don’t think I would have much difficulty persuading them that chemical patents were a useful source of linked open data for the Bio community.
Or some other organizations that I haven’t thought of might offer to host the data. Examples could be the British Library, or UKPubMedCentral (UKPMC), or a bioinformatics institute, …
… or you. (I tried with 200,000 files in DSpace and it was a failure).
It would be a major step forward in Open data-driven science if we could find some common answers to this problem.
I would say that the actual place does not matter so much; just ensure the collection is properly annotated with author/contributor information and, where appropriate, either copyright/license information or a public domain waiver (PDDL, CC0).
That said, the more distribution the better. So I’d say start with BioTorrent, which you then link to from CKAN. That way you contribute to building communities around those efforts, which is just as important.
If you have RDF around them, make sure to put up a SPARQL endpoint too, and again add a link to that in the CKAN entry.
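For instance, a minimal sketch of querying such an endpoint with the SPARQLWrapper library; the endpoint URL and the use of Dublin Core titles are assumptions, only the SPARQLWrapper calls themselves are real API.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; adjust the query to whatever vocabulary the RDF uses.
    sparql = SPARQLWrapper("http://example.org/greenchain/sparql")
    sparql.setQuery("""
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?patent ?title WHERE { ?patent dc:title ?title } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["patent"]["value"], row["title"]["value"])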
Peter, some further thoughts on this in my blog:
http://chem-bla-ics.blogspot.com/2010/08/molecular-chemometrics-principles-3.html
Ask Google where to put them. A Google Code project might be appropriate: you get some basic filespace (a few GB I think) but they can boost it if you have a reason. Or they might have some other more suitable place. Try google.org as well.
Torrent systems are specialized for large files, and are not indexed by the search engines.
When you say “archiving”, what requirements do you have in mind? Disaster recovery (for which you could stick big compressed tarballs somewhere resilient such as DSpace, Amazon S3, or the OKFN; your data volume is so small by today’s standards that you can stick the whole lot in a corner of a hard disk and not worry about it)? Persistent URIs?
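As a concrete sketch of the tarball option (the bucket name and paths are purely illustrative, and it assumes the boto3 package and AWS credentials):

    import tarfile
    import boto3

    week = "2010-33"                                   # one weekly index (placeholder)
    archive = "greenchain-{0}.tar.gz".format(week)

    # Pack the week's directory into a single compressed tarball ...
    with tarfile.open(archive, "w:gz") as tar:
        tar.add("/data/greenchain/patents/" + week, arcname=week)

    # ... and push it to a (hypothetical) S3 bucket for disaster recovery.
    boto3.client("s3").upload_file(archive, "greenchain-archive", archive)

One tarball per weekly index keeps the object count to a few hundred over the whole project, which any of those services will handle comfortably.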
Good initiative Peter. I hope that some repositories will step up to the plate and provide long term hosting of the data. But in case they don’t, I’d happily provide the space and bandwidth necessary for hosting 50GB.