petermr's blog

A Scientist and the Web


scifoo: data-driven science and storage

I managed to get to a few sessions at scifoo outside my immediate concerns, two of which were on the Large Synoptic Survey Telescope and on Google’s ability and willingness to manage scientific data. They come together because the astronomers are producing hundreds of terabytes every day(?) and academia isn’t always the most suitable place to manage the data. So some of them have considered, or started, shipping it to Google. Obviously it has to be Open Data; there cannot be human-related restrictions that would require management.

Everyone thinks they are being overwhelmed with data. Where do we keep it temporarily? Can we find it in a year’s time? Should we expect CrystalEye data to remain on WWMM indefinitely? But our problems are minute compared with the astronomers’, whose data volumes are probably three orders of magnitude greater.

How would you obtain the bandwidth to ship data to someone like Google? Remarkably, the fastest way to transmit it is on hard disk. Four 750 GByte disks (i.e. 3 TB) fit nicely into a padded box and can be shipped by any major shipping company. And disk storage cost is decreasing at 78% per year.
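As a rough sanity check of the “sneakernet” claim, the effective bandwidth of couriering that padded box can be computed directly. This is a sketch under one assumption not in the post: a hypothetical 24-hour delivery time.

```python
# Effective bandwidth of shipping 3 TB of disks by courier.
# The 24-hour delivery time is an assumed figure, not from the post.
total_bytes = 3 * 10**12          # 3 TB (decimal terabytes)
delivery_seconds = 24 * 3600      # assumed one-day shipment

bits_per_second = total_bytes * 8 / delivery_seconds
print(round(bits_per_second / 10**6), "Mbit/s")  # ~278 Mbit/s
```

Even with a full day in transit, the box sustains hundreds of megabits per second — far beyond most academic uplinks of the time.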

I’m tempted to start putting our data into the “cloud” in this way. It’s Open, so we don’t mind what happens to it (as long as we are recognised as the original creators). It’s peanuts for the large players. If we allocate a megabyte for each new published compound (structure, spectra, crystallography, computation, links, etc., and the full-text if we are allowed) and assume a million compounds a year, that is just ONE terabyte. The whole of the world’s new chemical data each year can fit on a single disk! What the astronomers collect in one minute!
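The back-of-envelope arithmetic above can be written out explicitly:

```python
# Yearly volume of new published chemical data, using the post's
# figures: ~1 MB per compound, ~1 million new compounds per year.
bytes_per_compound = 1 * 10**6    # 1 MB (decimal)
compounds_per_year = 1_000_000

total_bytes = bytes_per_compound * compounds_per_year
print(total_bytes / 10**12, "TB")  # 1.0 TB
```

One terabyte a year — comfortably inside a single modern disk.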

But before we all rush off to do this we must think about semantics and metadata. The astronomers have been doing this for years. They haven’t solved it fully, but they’ve made a lot of progress and have some communal dictionaries and ontologies.

So we could have all the world’s chemical information on our desktops or access it through GYM (Google/Yahoo/Microsoft).
I wonder why we don’t.

7 Responses to “scifoo: data-driven science and storage”

  1. jjd323 says:

    Peter, I think your allocation of 1 MB per molecule is far too low. The amount of data it is possible to generate from a single molecule is significantly higher than this. For example, if you were to calculate and store a large number of local energy minima, something you might want to do when following the folding of a protein, you already have a huge amount of data from just one aspect of studying the molecule. Another case might be storing the original data from a 2D NMR experiment; these can be several megabytes even for relatively small molecules.

    Perhaps we ought to be optimistic and allocate 100 megabytes per molecule, or even a gigabyte.

  2. pm286 says:

    (1) You’re right of course… I was really thinking about what is publicly available. Yes, I have no problem with calculating a gigabyte for each molecule. But we don’t have these at the moment, though I’m happy to think about automating it as we have done. So let’s say 100 MByte * 1 million = 100 terabytes – that’s reasonable.

    Of course we could run extended aqueous MD for every molecule, but I’m not sure we could make much use of it. And when we know what we want to do, it would probably be easier to recalculate.

    There are ideas here which I shan’t post publicly…

  3. [...] The data will probably be provided on a Google Code like page, and anyone should be able to get access to the data. There was talk of allowing people to build applications of the data. As Peter Murray-Rust noted, putting the data in the cloud is definitely enticing to some (I would add Amazon to his list as well). Like many others, I am curious to see where this goes. Quite a few people, and not just the astrophysics variety were very interested in what Google has to offer. [...]

  4. Dan says:

    Beware the difference between marketing and engineering:

    Quoted: “4 750GByte disks (i.e. 3Tb)”

    750 GByte drives are shipped as 750,000,000,000-byte drives. Divide by 1024^3 (i.e. 2^30), and you’re actually buying 698 GiB drives. Make sure you buy too much, and not too little. :)
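    Dan’s conversion checks out; a minimal sketch (assuming the marketed capacity is exactly 750 × 10^9 bytes):

    ```python
    # "Marketing" gigabytes are decimal; file systems report binary GiB.
    marketed_bytes = 750 * 10**9      # what the box says: 750 GB
    gib = marketed_bytes / 2**30      # what the OS reports, in GiB

    print(round(gib), "GiB")          # 698 GiB
    ```

    The ~7% gap between decimal and binary units is exactly why you should over-provision when sizing a shipment of disks.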

  5. [...] to scientists and access to the data will be free for all. The project, known as Palimpsest and first previewed to the scientific community at the Science Foo camp at the Googleplex last August, missed its [...]


Leave a Reply