Monthly Archives: June 2008

Festival of Crystallography at Wellcome

Last Thursday a group of us went to see the Wellcome Collection (at 183 Euston Road, built in 1932). It's literally across the road from Euston Station. We were all crystallographers and had worked with the great women and men crystallographers of the 20th Century. The particular reason is that the institute is housing an exhibition of a remarkable collaboration between crystallographers and the Festival of Britain in 1951.

I remember being taken to the Festival and being wowed by it - it was a vision of the future and how science and technology could change the world. I don't remember all the parts and I don't think I remember the crystallography. The crystallographic community, spurred by Helen Megaw at Cambridge, donated some of their output to be used as visual displays ("patterns") and this is now exhibited as From Atoms to Patterns. There were no computer displays in crystallography at that time, so they couldn't use the graphics that we have now. Instead they created displays in fabrics and synthetic materials and glass. I can't reproduce them here (copyright) but please click through to see what I'm talking about. Better still, if you are in London and have half an hour before catching a train at Euston drop in. It's free. And the permanent exhibition from Henry Wellcome is also most interesting.

Crystallography is very beautiful. That's why I got into it as a teenager - I got excited by polyhedra and my chemistry teacher gave me Phillips' book on crystallography to read. I made physical models of all 32 point groups. Perhaps there was a subliminal echo of the festival in that activity.

Several of the patterns had been provided by Dorothy Crowfoot Hodgkin - all of us had worked with Dorothy - some for many years. She'd been asked for patterns and the question of copyright came up. Dorothy wrote:

"I feel rather doubtful whether I own any copyright of a pattern perpetuated by nature".

Elsevier: The grand challenge

Some of you will have noticed that Elsevier has launched a competition:


Demonstrate your best ideas for how scientific research articles should be presented on the web and compete to win great prizes!


We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.

I and my colleagues are excited by this and I have written off to Elsevier to ask more about the content of the dataset. However, what can you do with 5000 articles covering the whole of science? Citation analysis has already been done. What else is general to all the disciplines? Well, we have some ideas and we aren't giving them away here, but here's one you might like to work on.

Let's assume that the "fulltext" is chosen randomly from all items on the Elsevier site marked as "fulltext PDF". These are, of course, chargeable at 31.50 USD. So I've done a pilot study on the latest issues of Tetrahedron.

There are 29 fulltext articles (all 31.50 USD) and here are five of them (16% of the total):

Editorial board
Page IFC
Open Preview Purchase PDF (71 K) | Related Articles
2. You are not entitled to access the full text of this document
Graphical contents list
Pages 7445-7451
Corrigendum to “The potential of intermolecular N···O interactions of nitro groups in crystal engineering, as revealed by structures of hexakis(4-nitrophenyl)benzene” [Tetrahedron 63(28) (2007) 6603–6613]
Page 7650
Eric Gagnon, Thierry Maris, Kenneth E. Maly, James D. Wuest
Open Preview Purchase PDF (68 K) | Related Articles
28. You are not entitled to access the full text of this document
Page I
Open Preview Purchase PDF (60 K) | Related Articles
29. You are not entitled to access the full text of this document
IBC: Guide for Authors
Page IBC
Open Preview Purchase PDF (238 K) | Related Articles

So it would normally cost 157.50 USD to read these. Hopefully 16% of the Elsevier Article 2.0 dataset will be of this category and we'll be able to read and analyse the fulltext for free. That's about 800 articles, so enough to work on. I do hope Elsevier have included them, because they are clearly worth paying for. Indeed the total cost of these articles would be 25,000 USD and we can get them for free!
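For anyone who wants to check the sums, here is a minimal sketch of the arithmetic. It simply takes this post's own working figures as given (31.50 USD per item, five sample items, a 16% share, and a dataset of 5000 articles); none of these come from Elsevier's dataset itself. Prices are held in cents so that no floating-point rounding creeps in.

```python
# Working figures taken from the post itself; they are assumptions, not Elsevier data.
PRICE_CENTS = 3150          # 31.50 USD per "fulltext" item
SAMPLE_ITEMS = 5            # the five items listed above
SHARE_PERCENT = 16          # share of such items assumed in the post
DATASET_SIZE = 5000         # dataset size used in the post's estimate


def usd(cents: int) -> str:
    """Format an integer number of cents as a USD string."""
    return f"{cents // 100}.{cents % 100:02d} USD"


sample_cost = SAMPLE_ITEMS * PRICE_CENTS
expected_items = DATASET_SIZE * SHARE_PERCENT // 100
total_cost = expected_items * PRICE_CENTS

print("Cost of the five sample items:", usd(sample_cost))
print("Expected items of this kind in the dataset:", expected_items)
print("Cost of those items at list price:", usd(total_cost))
```

The last figure comes out a little above the round 25,000 USD quoted above.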

I'll be continuing my little adventure into which other publishers charge non-subscribers for:

  • Editorial Board Info
  • List of abstracts
  • Corrigenda
  • calendar
  • Guide for authors

and whether they have thought of other ways of making money. After all Web 2.0 is all about making money, isn't it?

Elsevier: How much is a corrigendum worth?

Here's a diversion - as I am sat watching England lose to New Zealand... (Note that I read this without University access as it's easier to realise what it's like not to have it).

I'm interested in how easy or difficult it is to find crystallographic structures in various publishers and I've turned to Tetrahedron - an Elsevier journal. Tetrahedron does not allow - or at least does not seem to encourage - its authors to expose their crystallographic data as supplemental information. I wrote to them last year asking if they would consider doing so. After all RSC, IUCr and ACS do. None of the five editors replied. Well - I'm only a reader.

Anyway I turned to the latest issue. No luck (I didn't expect any). But there was one "article" of interest:

Corrigendum to “The potential of intermolecular N···O interactions of nitro groups in crystal engineering, as revealed by structures of hexakis(4-nitrophenyl)benzene” [Tetrahedron 63(28) (2007) 6603–6613]
Page 7650
Eric Gagnon, Thierry Maris, Kenneth E. Maly, James D. Wuest
Open Preview Purchase PDF (68 K) | Related Articles

Maybe I can find a (corrected) structure in the corrigendum. After all, that is free...

... oops...

The corrigendum will cost me 31.50 USD.

I have to pay 31.50 USD to read the original article and then another 31.50 USD to find out it's wrong. I'd never have thought of such a clever way to make money.

I wonder how many words there are in it. If there are fewer than 31 (excluding the author rubric) then it's over a dollar a word. Perhaps someone can give this information in a comment (I don't want to get sued for comparing the price of articles.)


For some reason the WordPress software has been stripping the paragraph markup from the posts - they show fine in the editor and local viewer but are stripped when published. I'll re-edit as many back as I have time for and see if I can fix the problem.

Nature Network Blogging Conference

I was at Nature yesterday and talked to Matt Brown who runs Nature Network. He's setting up a meeting - see Science Blogging Conference: Full Steam Ahead. The date seems to have crystallised on 2008-08-30 in London at the Royal Institution.

This is a great way to spend a Saturday - the RI is worth a visit in itself. And so is London.

We sat in a cafe opposite the British Museum and talked about why blogs and social networks do and don't work.

Nature Network also runs RL meetings and I'm much in favour of this. Blogs and other social networks can be good ways of meeting and taking a shared purpose forward. Are they a good way of developing that social purpose? Sometimes, but sometimes it helps to meet IRL.

How about a grand challenge project? Open drug discovery? With Open Notebook science? GSK has just put (Gavin Baker reports) all their cell lines into the "public domain" (not sure exactly what licence or antilicence, and I'll blog that later). What could the blogosphere do with that? It could be enormous.

And, by chance, Matt studied at York and worked in the lab run by crystallographers Guy and Eleanor Dodson who I was just off to meet at the Wellcome building. They remember him.

But the Wellcome and crystallography in 1951 is a separate post.

Another reason why Data must be Open

Ben Goldacre (The Guardian columnist on "Bad Science") has unearthed a superb interchange between a scientist and the creationists. Richard Lenski's replies are tours de force and quite apart from rebutting the criticisms they could be read by undergraduate scientists as an example of excellent scientific practice. It's worth emphasizing that good scientists live in the daily concern that their data must be validatable, reproducible.

The discussion is classic and a must-read. It's long and according to your makeup will leave you laughing, weeping or raging. I append Ben's summary below. But read the discussion.

Inter alia RL was attacked by the creationists for failing to provide data to support his claims. RL replies that the data were in the paper but his attacker wilfully failed to understand this (and possibly failed to read the paper anyway). It is absolutely clear that RL provided everything that any responsible scientist would.

Here are some snippets:

Schlafly (Creationist): Submission guidelines for the Proceedings of the National Academy of Science state that "(viii) Materials and Data Availability. To allow others to replicate and build on work published in PNAS, authors must make materials, data, and associated protocols available to readers. Authors must disclose upon submission of the manuscript any restrictions on the availability of materials or information." Also, your work was apparently funded by taxpayers, providing further reason for making the data publicly available.

Please post the data supporting your remarkable claims so that we can review it, and note where in the data you find justification for your conclusions.


Dear Mr. Schlafly:

I suggest you might want to read our paper itself, which is available for download at most university libraries and is also posted as publication #180 on my website. Here’s a brief summary that addresses your three points....

All these issues and the supporting methods and data are covered in our paper.

Schlafly: Dear Prof. Lenski, This is my second request for your data underlying your recent paper...

If the data are voluminous, then I particularly request access to the data that was made available to the peer reviewers of your paper, and to the data relating to the period during which the bacterial colony supposedly developed Cit+. As before, I’m requesting the organized data themselves, not the graphs and summaries set forth in the paper and referenced in your first reply to me. Note that several times your paper expressly states, "data not shown."

RL: Finally, let me now turn to our data. As I said before, the relevant methods and data about the evolution of the citrate-using bacteria are in our paper. In three places in our paper, we did say “data not shown”, which is common in scientific papers owing to limitations in page length, especially for secondary or minor points. None of the places where we made such references concern the existence of the citrate-using bacteria; they concern only certain secondary properties of those bacteria. We will gladly post those additional data on my website.

PMR: RL has taken great pains (many pages) to refute the claims of fraud and to assert that the data were visible: BUT this was only possible because the fulltext of the paper was available.

Imagine what would have happened if RL had replied:

"I am sorry, but the article was published in a closed access journal and I have no rights to make the text, which contains the data, publicly available. You will just have to believe that the reviewers and editors agreed with my arguments and that the data supported it. Or each of you and your acolytes will have to purchase the article at a cost of 30 USD from the publisher. And don't post it on your website - even just the graphs containing the data - or the publishers will send legal letters to you".

So, in this case, Open visibility was essential to RL's successful defence. However the "data not shown" was a potential serious weakness, which was not imposed by the author but by the publication process. Electrons and magnetic disks are infinitely cheaper than "pages" and there is no reason whatever to have "data not shown" in a modern article.

Except of course if, as a publisher, you want to stop us re-using it for free...


All time classic creationist pwnage (PMR: follow this link)

June 24th, 2008 by Ben Goldacre in bad science |

Richard Lenski is a biologist who recently found evidence for the emergence of new traits among E.coli bacteria, in a fascinating experiment which he has described in a paper in PNAS (best lay coverage here). His results look a bit like evolution. You will note that his paper includes the original data.

Andrew Schlafly is a startlingly predictable right wing christian activist who runs Conservapedia. I highly recommend a look around there if you’ve not already had the pleasure, because even the people who run Conservapedia find it hard to tell whether the edits are being made by god-fearing americans or naughty satirists.

Schlafly read Lenski. He got angry. He demanded the original data. It was pointed out to him that the original data was in the paper. He demanded the original data again. With menaces.

The following exchange is mirrored humbly and verbatim in case of disappearance. It represents pwnage on a scale most of us can only dream of. (PMR: read under All time classic creationist pwnage)

Crystallography Open Database

The Crystallography Open Database is an early and excellent example of the way that a community can start to help itself and make its data open.

I've taken most of my information from the website (although I also met the founder, Armel Le Bail, 3 years ago). The COD arose out of the frustration of a number of scientists that much published crystallographic data was not Openly or (usually) freely available. There are clear contrasts in the discipline in that the structures of protein molecules are Openly available in the RCSB Protein Data Bank but that other crystals (organic, organometallic, inorganic, metals, etc.) are not. Le Bail wrote to these databases and to the International Union of Crystallography asking that crystallographic data should be made freely available. There is a considerable correspondence on the web site (here). There is also a petition of 1000+ signatures (including mine) requesting that crystallographic data be made openly available.

I shan't discuss the history here, but simply outline the current situation. An author publishes a crystal structure as part or all of an experiment. Almost all journals require the data to be made available, inter alia to validate the experiment and to allow it to be reproduced. (It's also proved extremely valuable for re-use). In the past the only place the data could be stored electronically was these databases, which not unreasonably had to recover costs. Now, however, many journals publish the data directly on their websites (RSC, ACS, IUCr, most but not all Open Access journals, American Mineralogist). Others (Wiley, Springer, Elsevier) do not publish this data (though ironically they may publish other data such as spectra) but require it to be sent to the appropriate crystallographic database. The data is then not Openly available, and is mainly used by annual subscription (although individual entries may sometimes be free). Many scientists do not have access (including those who only have an occasional use). This is now acting as an impediment to access rather than a support.

Examples (mostly in 2005) from the COD petitioners include:

We are doing a lot of molecular simulations on all kinds of minerals and access to crystallography structure data bases is crucial for our work...USA

As the author of, any public access to mineral data promotes and encourages further understanding of the mineral (material) sciences....USA

The information on data basis is generated by persons who, in same cases, have to pay for obtaining it. It is not reasonable; it should be freely available!...Brazil

Because I am only an occasional user of crystallographic databases (for example, during the mineralogy section of my soil chemistry course) I cannot justify paying full price for access to a database. I generally just use crystal structures that are prepackaged with CrystalMaker. Open access would allow my students and I to learn to use the crystallography software for a large variety of soil minerals. ... (USA)

The data is all obtained by the scientific community and should be available without charge. As a fact of nature it should not be copywritable.... USA

I am a Crystallographer and our University is suffering from a constant lack of funding - so the databases are not being updated and are also not easily available... Australia

This is really a good proposal for people from developing countries.... India

I trust there should be no economical barriers to knowledge, and databases are essential tools for the scientific community and should be accessibile to everyone.... Italia

Egypt is a devoloping Country and We canot have the regular Crystallographic data base which we realy need it in our work accordingly if the crystallographic community allow us to have it on line it will be a big achievement for the third world countries...Egypt

As a result of our poor exchange rate, researchers in South Africa have to pay a very high price to obtain these databases. As a young academic and researcher, I could not afford the CSD in the first two years of my career, which made crystallographic research very difficult. Even now, obtaining the CSD is extremely expensive, and a large percentage of my research funds is used for this.... South Africa

... and many more.

So the volunteers at the COD started soliciting contributions, and - I think - typing up some historical and classic structures. (It is quite difficult to get Open structures for the sorts of materials required in undergraduate teaching, especially minerals). As I understand it the COD sources include:

  • typing up public information
  • donations from various regular sources - Am. Mineral and others
  • donations from individual groups and laboratories.

The COD has now reached an extremely impressive 70,000+ structures and it's enormously valuable for our work and for CrystalEye. It seems to cover all fields of crystallography (except, of course, proteins). We have assimilated a great deal of COD into CrystalEye. Here's a typical example:

The CIF:
Neue Verbindungen mit Ba6 Ln2 (M3+)4 O15-Typ: Ba6 Nd2 Fe4 O15, Ba5 Sr
La2 Fe4 O15 und Ba5 Sr Nd2 Fe4 O15
'Mevs, H'
'Mueller-Buschbaum, Hk'
_journal_name_full 'Journal of the Less-Common Metals'
_journal_volume 158
_journal_year 1990
_journal_page_first 147
_journal_page_last 152

which you can view in CrystalEye.
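The bibliographic part of a CIF is easy to read by machine because each item is a tag starting with an underscore followed by a value. Here is a minimal sketch, assuming only single-line `_tag value` items like the journal lines above; the function name is made up for illustration, and this is nowhere near a full CIF parser (no loops, no multi-line text fields, no data blocks) and is not the code CrystalEye uses.

```python
def parse_simple_cif_items(text: str) -> dict:
    """Collect single-line '_tag value' items from a CIF fragment.

    Illustrative helper only: loops, semicolon-delimited text fields and
    data_ block headers are deliberately ignored.
    """
    items = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("_"):
            continue  # not a data item we understand
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # tag without a value on the same line
        tag, value = parts[0], parts[1].strip()
        # Strip one pair of matching quotes around the value, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        items[tag] = value
    return items


# The journal items from the example above.
fragment = """\
_journal_name_full 'Journal of the Less-Common Metals'
_journal_volume 158
_journal_year 1990
_journal_page_first 147
_journal_page_last 152
"""

bib = parse_simple_cif_items(fragment)
print(bib["_journal_name_full"], bib["_journal_volume"], bib["_journal_year"])
```

All values come back as strings; converting the volume or year to numbers is left to the caller.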

The COD comes as individual CIFs in large ZIP files and we have done bulk imports when releases became available. We are now using the data for Nick Day's MOPAC calculations.

Note that at present the COD data is in the database but is not indexed by bibliography (though it is available by bond-length search). That's because CrystalEye was not set up as a repository and the main bibliographic indexing system is based on regular publishers. It's something we are reviewing over the summer for our in-house crystallographic repository and then we shall be able to ingest CIFs from arbitrary sources.

We also have a problem in that many entries are duplicates of structures we already have. That's because they were published in journals and also submitted to the COD. We don't want to put them in twice, but it's not easy to create a uniquifier when metadata are missing - we currently use structural formula and cell dimensions.
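To give a flavour of what a uniquifier built from formula and cell dimensions might look like, here is a minimal sketch. It is not the CrystalEye code: the function names, the rounding to two decimal places and the cell values in the usage lines are all assumptions made for illustration. The formula is normalised so that element order does not matter, and the cell dimensions are rounded so that small numerical differences between two depositions give the same key.

```python
import re
from collections import Counter


def normalise_formula(formula: str) -> str:
    """Return the formula with elements in alphabetical order.

    For example 'Ba6 Nd2 Fe4 O15' becomes 'Ba6 Fe4 Nd2 O15'.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] += float(number) if number else 1.0
    return " ".join(f"{el}{counts[el]:g}" for el in sorted(counts))


def dedup_key(formula: str, cell: tuple, decimals: int = 2) -> tuple:
    """Build a hashable key from a formula and (a, b, c, alpha, beta, gamma).

    The rounding precision is an arbitrary choice for this sketch.
    """
    return (normalise_formula(formula), tuple(round(x, decimals) for x in cell))


# Hypothetical cell dimensions, invented for illustration only.
from_journal = dedup_key("Ba6 Nd2 Fe4 O15", (11.704, 11.704, 7.012, 90.0, 90.0, 120.0))
from_cod = dedup_key("Fe4 O15 Ba6 Nd2", (11.7041, 11.7039, 7.0119, 90.0, 90.0, 120.0))
print(from_journal == from_cod)
```

Rounding is a blunt tool: two values that straddle a rounding boundary end up with different keys, so a real implementation would more likely compare cells within a tolerance than rely on exact key equality.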

The data in the COD is variable. Some are duplicates as mentioned. Some are old structures retyped, some are new depositions. Metadata varies a lot. At present some CIFs are syntactically incorrect and we have various heuristics to read them. (The COD does not edit the CIFs it receives - at least it didn't). So quality is a problem.

However it is extremely useful. Many of our inorganic materials come directly from there. So if you are using CrystalEye you are also using almost all of the COD. We'll be gradually adding metadata to all our data so that the COD is more prominent and acknowledged. Meanwhile many thanks.

Which licence should usyd use?

I had the pleasure of meeting Mat Todd in Sydney this year - a very pleasant day. Mat's an organic chemist and very keyed into the ideas of semantic publication, sharing information, etc. He runs a blog - the Synaptic Leap - which looks at some of these issues. Here he needs suggestions about licensing.

Which Creative Commons Licence?

We're drawing up a contract (with WHO and the ARC) to cover our new grant (and hence this site). Our business office would like to know which Creative Commons licence is most suitable. I was assuming Attribution 3.0 unported, since this allows sharing and remixing under attribution. On the face of it, a better alternative is Attribution-Share Alike 3.0 Unported, since this also requires that anyone using the research has to distribute their own work under a similar licence.

Anyone have any views on this for science research? Is the share-alike clause unduly restrictive? What if a company, reading results posted on these pages, would like to develop those results for a different for-profit reason that is unrelated to our original research interest - would a 'share-alike' licence prevent that?

PMR: It's a great idea to post this on a blog - I'm relaying it here so it gets to other audiences, especially from the SC and OKF communities.

My own take is very strongly influenced by the Open Knowledge Foundation (OKF) and Science Commons (who are an offshoot - not a fork - of CC). I'd strongly urge CC-BY (attribution) for the "fulltext" and the Science Commons + Open Knowledge approach to the data - no licence at all but instruments such as PDDL which make it clear that the data are Open and Free as in air.

"Non-commercial" immediately raises questions as to what is "commercial". This is almost impossible to define, so generally all it causes is problems. Is a non-profit that sells goods and services commercial? If an academic runs a course and charges fees is that commercial? And so on...

I started this blog with CC-NC, not because of me but because I was worried that readers might not wish to contribute. Then I was persuaded to change to CC-BY and haven't regretted it. Now it's clear that anything on this blog is free and open, as long as you acknowledge the author (not always me). If you want to set it to music, compile a dictionary of incorrect English usage, or use it as a public key for cryptography, you can do it. You can create an anthology of all CC-BY blogs and sell them. You can set up a web service which charges people access to these blogs...

NC-SA is a recursive nightmare. I have already covered myself in shame for having mistakenly aired views of OKF members here, but at least it got the problem into the open. CC-SA has implications for any other digital artefact it is involved with. It is theoretically viral.

For the data we should use PDDL. From WP on Science Commons:

Using data and CC licenses

Science Commons launched on 16 December 2007 the Protocol for Implementing Open Access Data in conjunction with the Public Domain Dedication License and the Open Knowledge Foundation.

The Protocol is of note because, rather than relying on copyright licenses such as the Creative Commons licenses and the GNU GPL, it provides a rationale and methodology for reconstructing the public domain of data.

PMR: This can be coupled with "community norms" - the idea that authors express - in non-legal terms - what they feel is reasonable and not reasonable to do with the data. They can't override the basic freedoms (e.g. authors cannot prevent export to countries their governments don't approve of). But in some areas there are complex problems - e.g. the use of human data - and it is important to develop protocols for these. My hope is that communities will start to pick up what practices are seen as acceptable and help to formalize them, hopefully with the help of Learned Societies and International Scientific Unions.

Library Workshop for virtual scholars

I always enjoy having visitors to the Unilever Centre and encourage people to visit. Yesterday we had a visit from ca 16 staff and Masters students from the Pratt Institute in New York. They were here as part of a 2-week visit to Britain hosted by Anthony Watkinson of University College in London. They had a background of cultural studies and essentially no science. Nonetheless Anthony had asked me to put on something that would give them an insight into some of the practices and challenges in scientific scholarly publishing and related issues.

I hadn't heard of the Institute, so I went to WP and found:

Charles Pratt (1830-1891) was an early pioneer of the natural oil industry in the United States. He was founder of Astral Oil Works in the Greenpoint section of Brooklyn, New York. He joined with his protégé Henry H. Rogers to form Charles Pratt and Company in 1867. Both companies became part of John D. Rockefeller's Standard Oil in 1874.

Pratt is credited with recognizing the growing need for trained industrial workers in a changing economy. In 1886, he founded and endowed the Pratt Institute, which opened in Brooklyn in 1887.

Charles Pratt, Founder

This resonates with other visionaries of the time - I have a long association with Birkbeck College in London:

Working as a doctor in London, Birkbeck, with others, established the London Mechanics Institute in November 1823 - of which he was the first President. The Mechanics Institute concept was quickly adopted in numerous other cities and towns across the UK and overseas, but his association with the ground-breaking London institution was marked by it being renamed the Birkbeck Literary and Scientific Institution in 1866 (now, as Birkbeck College, part of the University of London).

Well, the one similarity is that we are "in a changing economy" so I hope that Charles Pratt would have looked favourably on what we did...

We are very well equipped for workshops at UCC and have 16 machines and a projector/beamer which can be booked for sessions. So rather than my pontificating I prepared some hands-on. You can do this at home - it only needs a web-browser.

I started by asking them what significance they might attach to the number:


Not all of them got it immediately, so I asked them to Google for it and, of course, it's a DOI for a scientific article. I then asked them to see what the components were - abstract, full text in HTML, full text in PDF, etc. Could they read the full text? Yes. If they went back to a hotel could they read it? They quickly realised no. Could they tell from the display that the difference was due to the fact that the university had a subscription to the journal? Yes - there was a rubric saying so, though I doubt that many undergraduates or staff in most institutions would notice it.

How did the DOI work? We found the DOI site. What did it provide? Could we have done all this through Google? etc. How did we know a DOI was unique? How do you identify a book? By ISBN... yes, but how do most of the population identify a book? By its Amazon stock number. Oh...
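For readers who want the mechanics: a DOI is a prefix beginning "10." (identifying the registrant), a slash, and a suffix chosen by the registrant, and a resolver turns it into a web address by appending it to the resolver's base URL. Here is a minimal sketch of that structure. The function names are made up for illustration, the resolver base is assumed to be https://doi.org/, and the DOI in the usage lines is an illustrative one, not the one we used in the workshop.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Build the address a browser would visit to resolve the DOI."""
    split_doi(doi)  # raises ValueError if the string is not DOI-shaped
    return resolver + quote(doi, safe="/")


# An illustrative DOI, chosen only to show the shape.
example = "10.1000/182"
print(split_doi(example))
print(resolver_url(example))
```

Nothing here contacts the network; actually following the resulting URL is what lands you on the publisher's page, which is where the question of who can read the full text begins.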

That's the sort of disruption that is changing the role of libraries. Names and addresses in TimBL's world are conflated. The reality is the web. Current reality is an illusion unless it has a URI (==URL).

So now we know what the contents of a paper are, here were some exercises - with discussion in between. (You can use them if you want - this blog is CC-BY).

Goals. To investigate how data is published in leading journals.

Each team (2-3 people) should pick a publisher from:

  • Royal Society of Chemistry (Org. Biomol. Chem)
  • ACS (J. Org Chem)
  • Wiley (Angew. Chemie)
  • Beilstein Journal of Organic Chemistry
  • J Heterocyclic Chemistry
  • Molecules (MDPI)

And see if you can answer the questions:

  • Is the Journal Open or Closed access?
  • Can you access the fulltext? If so is it because you are on the Cambridge network?
  • Does the journal publish data embedded in full-text?
  • Does it publish data as supplemental/supporting info/data?
  • Is there a licence?
  • Can you understand it?
  • Does the author retain copyright?
  • Is the supplemental data copyrighted? By whom?

In some cases the answers are easy. In some cases we genuinely have no idea of the answers. Some publishers are very helpful... so maybe the rest could try to make their policies clearer.

Then we looked at the technology of scientific data.

  • Download OSCAR/Experimental data checker from RSC site (Google for it - I deliberately don't give URLs any more)
  • Who wrote it? (Answer: some very bright chemistry undergraduates here)
  • What is the copyright?
  • What is the licence? (Open Source)
  • Run it. This worked a dream on Windows - clicking the jar fired up OSCAR.
  • Use it to extract data from one of the papers you have found (the paragraph needs to describe "Synthesis..." or "preparation of ...")

I indicated that text and information extraction also applied to all disciplines. Some asked why journal X did not publish XHTML. I have no idea. Inertia? Bad for business?...

Next exercise:

  • Load CrystalEye (Google)
  • Find the latest issue of your journal.
  • Is it abstracted by CrystalEye?
  • If not, why not? (Because the publisher does not allow or support the publication of crystallographic data as supplemental information)
  • Pick another journal. Find the latest issue in the TOC.
  • Pick a paper.
  • Marvel at Jmol. (Open Source molecular viewer from the Blue Obelisk)
  • follow the DOI in CrystalEye to find the article.
  • Where in the article is the crystal structure described?
  • Where is the CIF file (crystallographic information file)?
  • What is its copyright? (Some publishers add their copyright to these files of facts. Did we agree with this? No, we didn't.)

So now we have a clear idea of the importance of data and the role (positive and negative) of the scholarly publisher. We went into more fluid debate... Did we need fulltext in experimental reports? I showed them some of the word-free chemistry publishing we have been doing. Who is going to pay for Open Access? Who, indeed? What's the turnover of scholarly publishing? (6 billion current units.) What's the turnover of academia? A lot more (I'd like figures, but let's scale by 100). Could the deans, provosts, principals, vice-chancellors, etc. start to take control of this economy? (Observation from the group: many of them are on the boards of prestigious publishers.) What will libraries do about subscriptions (TA) and funder-pays (OA) at the same time? Won't they pay twice? We couldn't answer that one.

But I bet that Charles Pratt or George Birkbeck would have.

Data-driven science and repositories: consideration of errors

The main theme of the current posts is to show how Open publication of data aids scientific research. Our particular domain is chemical crystallography, but these posts contains ideas which I hope have wider applicability and I will skim over the more technical details. There may, however, be some posts wher I need to explain some concepts.

As we have blogged earlier (CrystalEye - an example of a data repository), CrystalEye was developed by Nick Day as part of his PhD work. The primary aim is to see if large amounts of data - larger than a human can inspect - can be reliably used for scientific work. Before describing this I shall briefly review "errors" and indicate the implications for data repositories.

I'm restricting my discussion to physical science, where I believe that in general an experiment is repeatable by other scientists and should give consistent data. However we all know that the scientific literature contains "errors". This is a very general term (and often too judgmental) but there are centuries of work systematising the detection and categorisation - such as in the science of metrology. My discussion will be very superficial and is not intended to be a systematic or authoritative coverage; it's more an indication to data-driven scientists and data repositarians of issues they should address.

"Errors" can include:

  • variance in the original experiments. All scientific measurements should be quoted with error estimates, which can often be obtained by repeated measurement.
  • systematic errors (bias) in the measurements. Sometimes the causes are known but often they are not. Bias is often discovered when measurements are made in different laboratories or with different methods and equipment. Miscalibration of instruments is a common cause.
  • misunderstanding or misreporting of the physical quantity or measurement. For example in chemistry there are several concepts of "bond length" - the distance between 2 atoms - and they are fundamentally different. One effect is due to the uncertainty principle - atoms do not occupy a fixed position even at absolute zero.
  • omission of relevant independent variables. Thus a crystal structure varies with temperature and pressure. Often these are not explicitly recorded - there is often a default assumption that measurements are done in "normal conditions" - about 25 deg C and 1 atmosphere. But many theoretical calculations relate to absolute zero and no pressure.
  • omission of units of measurement. This should never happen, but many computer programs still emit raw numbers and assume the user knows what the units are.
  • Transcription and typographical errors. These are still common. Many chemists still measure spectra with rulers. Many scientists write numbers in a lab book and type them wrongly. Many computer operations fail to report invalid input or produce corrupted output. For example we used a well-known theoretical program which takes free format input limited to 80 characters on each line. However lines longer than this were not flagged as errors but silently ignored, which led to gross errors that were hard to detect. Even copying files - perhaps by cut-and-paste - can corrupt information.
  • Our inability to describe effects comprehensively. In crystallography, for example, it is frequently found that atoms are "disordered" - a simple picture is that they are sometimes in place A and sometimes in place B. Whether they hop between these places or whether the disorder is a statistical average over a macroscopic crystal may not be known. A full treatment of disorder may be difficult and expensive and include weeks of work on a neutron source (which needs a nuclear reactor).
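
The overlong input lines mentioned in the list above are the kind of problem a small pre-check can catch before a program is run. Here is a minimal sketch in Python, assuming an 80-character limit; the function name find_overlong_lines and the sample input are made up for illustration and are not part of any program discussed here:

```python
def find_overlong_lines(text, limit=80):
    """Return (line_number, length) for every line longer than `limit` characters."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > limit:
            problems.append((number, len(line)))
    return problems


# Illustrative input only: the second line is deliberately too long.
sample = "C 0.0 0.0 0.0\n" + "H " + "1.0 " * 25 + "\n"

for number, length in find_overlong_lines(sample):
    print(f"line {number}: {length} characters exceeds the limit")
```

Reporting the line number rather than silently dropping the line gives the user a chance to fix the input before any results are produced.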

We therefore need to know which of these are important. If typographical errors are very low (e.g. less than 1% probability in a data set) we can concentrate on effects which occur more frequently (say 20% of the time). If there is a typo in every data set we may have to use statistical methods to detect them or even abandon the effort. If we estimate a quantity by two different methods and the variance between them is low, then this gives confidence in the precision of each (though says nothing about the accuracy).
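
As a rough illustration of comparing two methods, the sketch below takes paired estimates of the same quantities, looks at the differences between them, and flags pairs that disagree far more than is typical. It is not the analysis used in our work: the function name compare_methods, the threshold of 3 and the numbers are all assumptions chosen for the example, and the spread is estimated with the median absolute deviation so that one gross error does not hide itself.

```python
import statistics


def compare_methods(first, second, threshold=3.0):
    """Compare paired estimates of the same quantities from two methods.

    Returns the median difference, a robust estimate of the spread of the
    differences, and the indices of pairs that disagree unusually strongly.
    """
    diffs = [a - b for a, b in zip(first, second)]
    centre = statistics.median(diffs)
    # Median absolute deviation, scaled by 1.4826 so that it is comparable
    # to a standard deviation for normally distributed differences.
    spread = 1.4826 * statistics.median([abs(d - centre) for d in diffs])
    outliers = [i for i, d in enumerate(diffs)
                if abs(d - centre) > threshold * spread]
    return centre, spread, outliers


# Illustrative values only (not real measurements): the last pair
# imitates a typographical error in one of the two data sets.
measured = [1.54, 1.39, 1.21, 1.09, 1.43]
computed = [1.53, 1.40, 1.20, 1.10, 1.93]

centre, spread, outliers = compare_methods(measured, computed)
print("suspect pairs:", outliers)
```

A small spread says the two methods are consistent with each other (precision); it says nothing about whether both share the same bias (accuracy).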

Nick Day showed this approach in his online analysis of measured and computed 13C chemical shifts (Open NMR: Nick Day’s “final” results). This showed a range of "errors" in both the measured data and the computed data. However it was possible to find many data for which each approach reinforced the validity of the other. It was also possible to find outliers and detect the effects responsible for them (not just "explain them away").

Nick's NMR work built on the work that Joe Townsend did by comparing molecular structures in crystals with computed structures in the gas phase. These are not identical concepts but are similar enough that Joe was able to develop rules showing when they could be regarded as "agreeing". Nick has now been doing this with crystal structures and their computed structures using theoretical methods.

I'll be blogging about this. It won't be formally Open Notebook Science but it will be pre-publication in the same way as the NMR work. The next posts will review where we get our data from and why we need Open Data publication.