(Open) Data in crystallography

I’ll try to post at least twice on what I shall say at the RSC on Thursday. (FWIW at least two readers have recently applied to go to the meeting – I should have started blogging earlier…)
The posts look to a positive future based on the ready availability of high-quality chemical data as a result of the information and instrumentation revolution. For those who espouse this view there’s a great future. For those who have traditionally seen data as something hard-won and chargeable there are turbulent times. The reality is that they will have to change. Not because I tell them so, but because the world does. Not explicitly, but through the relentless change in the way we do things – credit cards vs banknotes, mobile phones over landlines (they are knocking down the iconic red telephone boxes in the UK) – you know the list as well as I do.
So the model of “you publish your data in fragmented form; we type it up and sell it back to the community” is no longer necessary or viable. We’ve seen gigabillion companies flourish in the last 10 years based on openly available data. (Yes, we have concerns about some of the Openness, but that’s another post.) There isn’t a market for micropayments for scientific information. (I exclude 40 USD to read a paper as it is hardly “micro” – it’s quite a good meal in some places.)
When I started scientific research, data were hard to come by. I was a crystallographer and I’m going to use that discipline because I understand it, because it contrasts different approaches well, and because it has exceptionally high-quality data.
To solve a crystal structure I had to record the diffraction pattern. This was done using methods developed by Karl Weissenberg (I was appalled that I couldn’t find a Wikipedia article). The diffraction pattern looks like the ones here (or search Google for “Weissenberg photograph”). I solved 6 crystal structures for my doctorate and each might produce several thousand “spots”. The spots were of different intensities, and these could be used to determine the positions of the atoms in the structure – incredible, even now. The intensities were measured by eye (usually compared against a calibrated scale). So I measured some tens of thousands of spots.
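The link between atomic positions and spot intensities is the structure factor: F(hkl) is a sum of waves scattered by each atom, and the measured intensity is proportional to |F|². A minimal sketch in modern Python (a toy illustration, nothing like what we had then; the atoms, positions, and constant scattering factors below are made up):

```python
import cmath

def structure_factor(hkl, atoms):
    """F(hkl) = sum_j f_j * exp(2*pi*i*(h*x_j + k*y_j + l*z_j)).

    atoms: list of (f, x, y, z) tuples with fractional coordinates;
    f is a (grossly simplified) constant scattering factor per atom.
    """
    h, k, l = hkl
    return sum(f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
               for f, x, y, z in atoms)

def intensity(hkl, atoms):
    # The spot intensity is proportional to |F|^2 -- the phase of F
    # is lost in the measurement, which is the famous "phase problem".
    return abs(structure_factor(hkl, atoms)) ** 2

# Two hypothetical atoms in the unit cell (invented positions)
atoms = [(6.0, 0.1, 0.2, 0.3), (8.0, 0.4, 0.1, 0.7)]
print(intensity((1, 0, 0), atoms))
```

Because only |F|² survives, going from tens of thousands of intensities back to coordinates required real ingenuity, not just arithmetic.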
This was the raw data. It mattered. It was hard-won, every spot hand-recorded in the lab book. Then each was typed (on a teletype) onto punched tape and fed into a Mercury computer (later the KDF9). When the structure was published every spot was published in the journal. It took pages, but the journals and editors required them.
Why? Because they were the primary record of the experiment. They were the proof that you had made the right deductions about the atomic positions. Over-ambitious or sloppy claims were regularly demolished by self-appointed critics such as Jerry Donohue and Richard Marsh (and several others). If they saw a structure they didn’t feel was correct they would re-type the data and re-analyse the structure. No-one likes being corrected in public print, but we all accepted it was completely appropriate. It’s that public criticism which has helped to keep crystallography at the top of data quality.
There’s a complementary aspect. As science evolves it’s often possible to re-analyse existing data. So, for example, many Weissenberg photographs recorded so-called anomalous dispersion, which can be used to determine the chirality of molecules. In many cases the effect was clear enough to observe in retrospect, but the authors weren’t aware of the phenomenon. It would be possible to revisit the data and re-analyse it. Similarly the advance of theoretical methods and programs in crystallography means that more effects can be corrected or analysed in the computation. If I revisited my DPhil data I could use anisotropic refinement to get improved coordinates. It’s probably not worth it – cheaper to re-synthesize and re-collect the data. But it’s possible.
Then the journals started to worry about page space. These intensities took up a lot of space. So they were no longer published (although they might be microfiched). And then even the final coordinates were no longer published – the only way you could get them was by requesting microfiches from the publishers.
And this culture has often stayed with us. There is now no technical reason why all the scientific data shouldn’t be part of the publication. But it usually isn’t. And that leads to several consequences:

  • Loss of data
  • A market in data

I’ll address these later.
