I managed to get to a few sessions at scifoo outside my immediate interests, two of which were on the Large Synoptic Survey Telescope and on Google’s ability and willingness to manage scientific data. They come together because the astronomers are producing hundreds of terabytes every day(?) and academia isn’t always the most suitable place to manage the data. So some of them have considered, or started, shipping it to Google. Obviously it has to be Open Data: there cannot be human-related restrictions that require management.
Everyone thinks they are being overwhelmed with data. Where to keep it temporarily? Can we find it in a year’s time? Should we expect CrystalEye data to remain on WWMM indefinitely? But our problems are minute compared with the astronomers’, which are probably three orders of magnitude greater.
How would you obtain bandwidth to ship data to someone like Google? Remarkably the fastest way to transmit it is on hard disk. 4 750GByte disks (i.e. 3Tb) fit nicely into a padded box and can be shipped by any major shipping company. And disk storage cost is decreasing at 78% per year.
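To see why a box of disks can beat a network link, here is a rough back-of-envelope sketch. The 24-hour transit time is purely an assumption for illustration (the post does not give one), and the function name is made up for this example.

```python
def effective_bandwidth_mbit_s(total_bytes: float, transit_hours: float) -> float:
    """Average throughput, in megabits per second, of data shipped physically."""
    bits = total_bytes * 8
    seconds = transit_hours * 3600
    return bits / seconds / 1e6


# Four 750 GByte disks, using the decimal (marketing) gigabyte of 10**9 bytes.
payload_bytes = 4 * 750e9

# ASSUMPTION: the courier takes 24 hours door to door.
rate = effective_bandwidth_mbit_s(payload_bytes, transit_hours=24)
print(f"{rate:.1f} Mbit/s")  # about 277.8 Mbit/s sustained for a full day
```

The figure ignores the time needed to copy data on and off the disks, so it is an upper bound on what the box actually delivers.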
I’m tempted to start putting our data into the “cloud” in this way. It’s Open, so we don’t mind what happens to it (as long as we are recognised as the original creators). It’s peanuts for the large players. If we allocate a megabyte for each new published compound (structure, spectra, crystallography, computation, links, etc. and the full-text if we are allowed) and assume a million compounds a year that is just ONE terabyte. The whole of the world’s new chemical data each year can fit on a single disk! What the astronomers collect in one minute!
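The arithmetic behind that estimate is simple enough to sketch. The helper name and the decimal unit constants below are assumptions for illustration; the per-compound allowance and the compound count are the figures from the paragraph above.

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def annual_volume_tb(bytes_per_compound: int, compounds_per_year: int) -> float:
    """Total yearly data volume in terabytes for a fixed per-compound allowance."""
    return bytes_per_compound * compounds_per_year / TB


# One megabyte per compound, one million new compounds a year.
print(annual_volume_tb(1 * MB, 1_000_000))  # 1.0
```

Changing the first argument shows how quickly the total scales if the per-compound allowance turns out to be too small.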
But before we all rush off to do this we must think about semantics and metadata. The astronomers have been doing this for years. They haven’t solved it fully, but they’ve made a lot of progress and have some communal dictionaries and ontologies.
So we could have all the world’s chemical information on our desktops or access it through GYM (Google/Yahoo/Microsoft).
I wonder why we don’t.
Peter, I think your allocation of 1 MB per molecule is far too low. The amount of data it is possible to generate from a single molecule is significantly higher than this. For example, if you were to calculate and store a large number of local energy minima, something you might want to do for following the folding in a protein, you already have a huge amount of data just from one aspect of study for the molecule. Another case might be if you were going to store the original data from a 2D NMR experiment; these can be several megabytes even for relatively small molecules.
Perhaps we ought to be optimistic and allocate 100 megabytes per molecule, or even a gigabyte.
(1) You’re right of course… I was really thinking about what is publicly available. Yes, I have no problems with calculating a gigabyte for each molecule. But we don’t have these at the moment though I’m happy to think about automating it as we have done. So let’s say 100 Mbyte * 1 million = 100 terabytes – that’s reasonable.
Of course we can run extended aqueous MD for every molecule but I’m not sure we could make much use. And when we know what we want to do it would probably be easier to recalculate this.
There are ideas here which I shan’t post publicly…
Beware the difference between marketing and engineering:
Quoted: “4 750GByte disks (i.e. 3Tb)”
750 GByte drives are shipped as 750,000,000,000 byte drives. Divide by 1024^3 (i.e. 2^30), and you’re actually buying 698 GiB drives. Make sure you buy too much, and not too little. 🙂
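The conversion in this comment can be checked with a short sketch. The function name is made up for illustration; the only inputs are the 750 GByte drive size and the four-disk box from the post.

```python
def decimal_gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gigabytes * 10**9 / 2**30


per_drive_gib = decimal_gb_to_gib(750)
print(f"{per_drive_gib:.1f} GiB per drive")  # 698.5 GiB per drive

# Four such drives, expressed in tebibytes (2**40 bytes).
box_tib = 4 * per_drive_gib / 1024
print(f"{box_tib:.2f} TiB per box")  # 2.73 TiB per box
```

So a box sold as 3 terabytes holds a little under 2.73 TiB when measured in binary units.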