I’m currently at a meeting on Computational Chemistry where we are looking at how to store, search and disseminate our results. http://neptuno.unizar.es/events/qcdatabases2010/program.html The problem is a very general one:
A community creates results and wants to make the raw results available under Open licence on the web. The results don’t all have to be in the same place. Value can be added later.
One solution is to publish this as supplemental data for publications. (The crystallographers require this and it’s worked for 30 years.) But the comp chem people have somewhat larger results – perhaps 1-100 TB/year. And they don’t want the hassle (particularly in the US) of hosting it themselves, because they are worried about security (being hacked).
So where can we find a few terabytes of storage? Can university repositories provide this? Would they host data from other universities? Could domain-specific repositories (e.g. Tranche, Dryad) manage this scale of data?
Last time I asked for help on this blog I got no replies, and we had to build our own virtual machine and run a webserver. We shouldn’t have to do this. Surely there is a general academic solution – or do we have to buy resources from Amazon? If so, how much does it cost per TB-year?
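To make the “cost per TB-year” question concrete, here is a back-of-the-envelope sketch. The per-GB-month rate is an assumption for illustration only, not a quoted Amazon price – plug in whatever the provider currently publishes.

```python
# Back-of-the-envelope cost of renting commodity object storage (e.g. Amazon S3).
# The rate below is purely illustrative -- prices change and vary by region and
# tier, so substitute the provider's current published figure.
price_per_gb_month = 0.15          # assumed rate in USD per GB-month
gb_per_tb = 1024                   # binary convention; use 1000 for decimal TB
cost_per_tb_year = price_per_gb_month * gb_per_tb * 12
print(f"~${cost_per_tb_year:,.0f} per TB-year at ${price_per_gb_month}/GB-month")
# At 1-100 TB of new comp chem results per year, the bill scales linearly.
```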
If we can solve this simple problem then we can make rapid advances in comp chem.
Simple web pages, no repository, no RDB, no nothing.
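And “simple web pages, no repository, no RDB” really can be that simple. A minimal sketch using Python’s standard-library static file server (the “data” directory is a hypothetical location for the raw result files, and the directory argument needs Python 3.7+):

```python
# Minimal static file server for a directory of raw results -- no repository,
# no relational database, just files exposed over HTTP.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# "data" is a hypothetical directory holding the raw result files.
handler = partial(SimpleHTTPRequestHandler, directory="data")
HTTPServer(("", 8000), handler).serve_forever()
```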
UPDATE
Paul Miller has tweeted a really exciting possibility:
http://aws.amazon.com/publicdatasets/
At first sight this looks very much like what we want. It’s public, it draws the community together, and it’s Open. Any downside?
P.
Peter – I am just looking into this exact issue. For our open ELN () our university is worried about long-term storage of data and safety information. Paper lab books are meant to be stored in our department for 30 years – there is legislation on this. If we are to be allowed to transfer to ELNs rather than paper we need some kind of reassurance that the ELN will be backed up and stored for a similar length of time. That’s quite a challenge.
I’m sure there are similar rules for every university, and at the moment the safety people and records management people don’t realise that researchers are increasingly not writing down what they’re doing on paper.
In short – we need a solution to this. Would be very interested to hear of any other university that has a solution. Either data storage or the ability to archive web pages permanently.
Sorry – the ELN is here: http://www.ourexperiment.org/racemic_pzq
Starting with a few strawman axioms:
Cost = Hosting anything has a real cost in time, effort and money.
Usefulness = Not all data is useful and worth hosting.
Appeal = Researchers are good at predicting what is useful to them, but not what is useful to the rest of the community.
Characterisation = Data becomes useful only when a community finds it so.
Provenance = No data is 100% accurate, so who created it, and how, when and by what methods, is important in letting users work out how confident they can be in the data.
These are just my viewpoints on the situation, and they lead me to some intermediate points:
– data should be as open as possible so we can see if it is useful.
– to understand the data, we also need to know about its provenance and history.
– If the aim of project/research funding is to further human knowledge in a given area, then isn’t it within the funding remit to cover the cost of sharing that information?
– Currently, there is no financial reason for an institution to care about a researcher’s work after they have left.
So, what can be done about it?
– Institutions acknowledge that hosting and sharing data is necessary and their responsibility as much as providing a JANET connection, heating and floor space to researchers.
– Research funding must include a cost for the initial dissemination of the data for a fixed term of 5 years.
– Institutions to use this funding to maintain their own http://data.XXXXX.ac.uk site. Nothing fancy is needed. It could even be an FTP site, with READMEs detailing the contents. (NB provenance/history is STILL needed alongside the data, however – see the sketch after this list.)
– Put data under an open (as in http://opendefinition.org) licence, so that popular or useful data is copied and shared by those who care about it (sharing the cost for further dissemination once they discover the data’s usefulness.)
– When the initial hosting term draws to an end, the researcher is offered a choice: pay for more hosting, have the data archived with a library or specialist organisation for a fee, or be sent a copy of the original data.
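As a sketch of how little structure such a data.XXXXX.ac.uk (or FTP) site needs: one directory per dataset, each with a README carrying basic provenance. The layout and the field names (Creator, Date, Method, Licence) are my assumptions, not any agreed standard – the point is only that the provenance check can be trivially automated.

```python
# Sketch: warn about dataset directories whose README is missing or lacks
# basic provenance fields. Field names are illustrative, not a standard.
from pathlib import Path

REQUIRED_FIELDS = ("Creator:", "Date:", "Method:", "Licence:")

def check_provenance(root: str) -> None:
    """Report datasets under `root` with no README or with missing fields."""
    for dataset in sorted(Path(root).iterdir()):
        if not dataset.is_dir():
            continue
        readme = dataset / "README"
        if not readme.exists():
            print(f"{dataset.name}: no README")
            continue
        text = readme.read_text(errors="replace")
        missing = [f for f in REQUIRED_FIELDS if f not in text]
        if missing:
            print(f"{dataset.name}: README missing {', '.join(missing)}")

if __name__ == "__main__":
    check_provenance("data")   # hypothetical local copy of the data site
```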
Note the introduction of money at almost every step – this is important, as there are no data publishers yet and so the ongoing cost of a publication isn’t as hidden as it is with journals and subscriptions.
This is only an idea for the initial phase of some data’s lifetime – how archives and libraries could support themselves in the longer term, or how to preserve access to the data in 10, 20 or even 50 years’ time, are difficult questions with no sweeping general answers.
Is there a National Data Store in the UK? (e.g. in Ireland, http://www.e-inis.ie/)