Open Data genetics and Open Data astronomy

The world is exploding with Open Data posts…
From the Data Strategy blog:

I started writing this post on open data astronomy some time ago, and damn… I got scooped by Read/WriteWeb today with their article on Galaxy Zoo and other “distributed brain” projects. Galaxy Zoo, like Stardust@Home and Clickworkers, asks volunteers over the Web to label astronomy features (galaxies, moon craters, etc.) on images. These data help astronomers do better analysis. I actually had referenced Clickworkers for my PhD thesis under the Open Mind Initiative.
Also see my previous post on open data genetics.

PMR: which reads:

Open data genetics

I had just read about the Personal Genome Project (PGP) a couple days ago, and it’s a really interesting open data project. According to its Wikipedia entry:

The project will publish the genotype (the full DNA sequence of all 46 chromosomes) of the volunteers, along with extensive information about their phenotype: medical records, various measurements, MRI images, etc. All data will be freely available over the Internet, so that researchers can test various hypotheses about the relationship between genotype and phenotype.

The published data will include identifiable information such as the volunteers’ names. The reason for doing so is that they can’t guarantee anonymity anyway when one’s genotype and phenotype are already open. In an interview in Technology Review, the project’s founder, Harvard University’s George Church, said:

We and others have raised concerns about the difficulty of maintaining anonymity [in medical records]. You promise subjects you will make the information anonymous, but it’s becoming increasingly easy to re-identify an individual. This project will hopefully raise consciousness on what we need to do to encourage insurance companies and government and employers to make this safer. This has already been done in some countries, so it’s just a matter of policy.

The first volunteers will be tenured human geneticists, who best understand the risks and benefits of this project. Harvard Medical School’s Institutional Review Board has given the project permission to start, and it sounds like they will review its progress before the project recruits a broader set of volunteers.

So we have seen “Open Data Foo” as a useful and accurate term to define data-driven science in a discipline Foo. Obviously the detail will depend on the discipline – astronomy and genetics are very different – but the key features are:

  • The data must be absolutely Open (see Open Data on Wikipedia) – no licence restrictions, no need to ask permissions, no restrictions on what can be done with the data, no “non-commercial” restriction.
  • The science is data-driven (i.e. no foreseeable requirement to collect more experimental data). Obviously Science is not predictable, and it may turn out that a real-world experiment is needed to answer a question. But that’s Science.
  • The experiment and its interpretation take place in full public view. Ideally anyone can take part, though too many people can be difficult to control and management may be important. Genetics may require more organization because people are involved.

I’ve got very excited by this idea. It’s a wonderful way of communicating science. Volunteers can come from anywhere (and we’ve found this in the Blue Obelisk – not everyone is a chemical hacker). They’ll find out that science is hard, unforgiving, and often gives no “results”.
So why don’t we offer our own CrystalEye (http://wwmm.ch.cam.ac.uk/crystaleye) as an Open Data Crystallography project? There’s a lot that anyone can do – you don’t have to be a crystallographer – hackers, visualizers, statisticians, RDFers, etc. – all could make an impact. I’ll blog about this in a day or so.

