scifoo: data-driven science and storage

I managed to get to a few sessions at scifoo outside my immediate interests. Two of them were on the Large Synoptic Survey Telescope and on Google’s ability and willingness to manage scientific data. They come together because the astronomers are producing hundreds of terabytes every day(?) and academia isn’t always the most suitable place to manage the data, so some of them have considered, or started, shipping it to Google. Obviously it has to be Open Data: there cannot be human-related restrictions that require management.

Everyone thinks they are being overwhelmed with data. Where do we keep it temporarily? Can we find it in a year’s time? Should we expect CrystalEye data to remain on WWMM indefinitely? But our problems are minute compared with the astronomers’, which are probably three orders of magnitude greater.

How would you obtain the bandwidth to ship data to someone like Google? Remarkably, the fastest way to transmit it is on hard disk. Four 750 GB disks (i.e. 3 TB) fit nicely into a padded box and can be shipped by any major shipping company. And the cost of disk storage is decreasing at 78% per year.
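To see why a box of disks can beat a network link, here is a rough sketch of the effective bandwidth of shipping. The 24-hour transit time is purely my assumption for illustration (it is not a figure from the session), and the function name is made up. It also ignores the time needed to copy the data onto and off the disks.

```python
def shipped_bandwidth_mbit_s(disks, gb_per_disk, transit_hours):
    """Effective throughput, in megabits per second, of shipping disks.

    Uses decimal units (1 GB = 10**9 bytes) and ignores copy time.
    """
    total_bits = disks * gb_per_disk * 10**9 * 8
    seconds = transit_hours * 3600
    return total_bits / seconds / 10**6


# Four 750 GB disks, with an ASSUMED 24-hour door-to-door delivery.
print(round(shipped_bandwidth_mbit_s(4, 750, 24), 1))  # 277.8
```

Under that assumption the box sustains a few hundred megabits per second, and the figure scales linearly with the number of disks you pack.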

I’m tempted to start putting our data into the “cloud” in this way. It’s Open, so we don’t mind what happens to it (as long as we are recognised as the original creators). It’s peanuts for the large players. If we allocate a megabyte for each new published compound (structure, spectra, crystallography, computation, links, etc., and the full text if we are allowed) and assume a million compounds a year, that is just ONE terabyte. The whole of the world’s new chemical data each year can fit on a single disk! What the astronomers collect in one minute!
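The arithmetic is simple enough to check in a few lines. This is only a back-of-envelope sketch using the guesses above (a megabyte per compound, a million compounds a year); the function name is made up for illustration and the units are decimal (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).

```python
def yearly_storage_tb(compounds_per_year, mb_per_compound):
    """Estimated storage per year in terabytes (decimal units)."""
    total_bytes = compounds_per_year * mb_per_compound * 10**6
    return total_bytes / 10**12


# One megabyte per compound, one million new compounds a year.
print(yearly_storage_tb(1_000_000, 1))  # 1.0
```

Changing either guess scales the answer linearly, so even ten megabytes per compound would only come to ten terabytes a year.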

But before we all rush off to do this we must think about semantics and metadata. The astronomers have been doing this for years. They haven’t solved it fully, but they’ve made a lot of progress and have some communal dictionaries and ontologies.

So we could have all the world’s chemical information on our desktops or access it through GYM (Google/Yahoo/Microsoft).
I wonder why we don’t.

