Availability of Crystallographic Data and Errors therein

There has been a lengthy correspondence on the SimBioSys blog about the availability of crystallographic data from the Cambridge Crystallographic Data Centre (CCDC). It raises general concerns about access to scientific data, refers to this blog, and brings into focus our own CrystalEye resource. I shall highlight sections without comment.
I need to make things clear before I do this:

  • Although I am physically close to CCDC I have no formal contact and none of the CrystalEye work is informed by their data or practice.
  • Although I could have access to CCDC data and software I have not used it for many years. CrystalEye does not contain any data not publicly accessible on the web, nor does it use CCDC Refcodes (6- or 8-character codes which uniquely identify entries).

SimBioSys has two posts, the second of which was partly in response to private emails. The first:

SimBioSys: Crystal structure errors — in CSD too

[…] One would assume that the small molecule crystal structures of the Cambridge Structural Database (CSD) do not have such errors, since they have much higher resolution and dealing with small molecules. Let me correct that wrong assumption!
[…] In the past year we have collected some statistics from the CSD with the hope to improve the accuracy of our scoring function by using more reliable, more precise data. Unfortunately, we had to learn the hard way, that the CSD isn’t so clean either. We have found a lot of obvious errors, like some atom centers falling within 0.2 Angstrom or less from each other when the crystal packing transformations are applied, some completely impossible bond lengths and angles. We kept adding sanity checks to report and exclude data entries with various obvious errors. At the end of the automated cleaning process, we had almost 15% of the data dropped for one reason or another. Then we thought the remaining data is good, we can use it for collecting the statistics.
[…] Then Bashir has pulled out one of the worst examples where the X-ray structure had very high strain energy: CSD code [REDACTED] has two carboxylate groups as shown on the image [REDACTED]. The original structure from the CSD is displayed with thick bonds and the optimized one has thin bonds, you can see the optimization has twisted the two carboxylates out of the plane of the aromatic ring in order to avoid two lone pairs facing each other.
[…]
So, the moral of the story: we can’t even trust the high resolution CSD data, let alone the PDB.
ZZ
Update:
Since the posting of this blog entry, we have received 2 public comments – displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment – which explains the problems and directs the blame on my naivety for my wrong expectations about the data – was not displayed as prominently as the original article. To correct this problem, I will quote the entire comment here:

Author : J (IP: 131.111.113.139 , jenner.ccdc.cam.ac.uk) Says: June 16th, 2008 at 8:39 am
[… details snipped … PMR] Though admittedly for QUICNA note that the choice is inconclusive based on the 4 lists given: I think the hydrogen list may not account for deuteration.

The other main point raised was, that our CCDC license has expired since the data collection was made, therefore we can no longer use any data – even derived data – from the CSD. We certainly fully obey this cease and desist order and will not use any of the data – we have not made any publications containing data from CSD except for this blog entry (and I have now removed the code name and the image to comply with the order) and none of the released versions of our software contains such data either. […]
On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement.
[…]

PMR: The quoted FAQ on CrystalEye was written by me. The second post:

Public apology to CCDC

My previous post about errors in crystal structures has triggered strong reactions from CCDC (not only response post, direct email, but also email to my former PhD supervisor in the UK asking him for remedy and explanation). Apparently, they have interpreted my post as an attack on the quality of their services. Let me clarify first, that I have never intended to imply anything negative or derogatory about the CCDC services or software. My sincere apologies if my post came across that way. All I wanted to do is raise awareness in the docking/scoring community that small molecule crystallographic data is not free of errors. My understanding is, that the data deposited in CSD has been determined by thousands of people all over the world and published in various scientific journals, while CCDC aggregates the data and creates a comprehensive, validated and value-added database known as the Cambridge Structural Database (CSD), and the complete CSD System (CSDS) includes the CSD itself and associated software for search, visualisation and analysis of stored information. I acknowledge that CCDC provides a valuable service to the community and any error in the data is not their fault.

They have also sent us a “friendly reminder” that since our license to CSD has expired, according to the signed agreement we are not allowed to retain or use any data downloaded from CSD, not even any derived information or data. As I already stated in the update added to the previous blog entry, we have ceased using any data derived from CSD to comply with the license. I have even removed the image of the molecule from the post (since that can also be considered as derived data). We have not incorporated any data into our software. […] One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data. It is ironic that the links expressing the need for open data and the open repository happen to point to a web site within the same University where CCDC resides. ZZ

PMR: The links are mainly to entries on my blog discussing the need for Open Scientific data. I have not discussed the CCDC data or policy in those posts and to avoid confusion shall not do so here.
The position with CrystalEye data is as clear as I can make it. The data come from freely accessible files on various publishers’ websites. I do not believe the data in these files to be copyright. All raw and derived data are distributed as Open Data except for the CIFs on the ACS website, which the publisher claims as their copyright. If users want these they can visit the website.
We have spent 3 years analysing this data and we can confirm that the crystallographic literature contains many places where the user needs to be careful in re-using the data. I will not call them all “errors” – some are accurate determinations of uncertainty in the crystallographic experiment (e.g. disordered atoms or atoms whose identity is unknown). There are others where the author(s) have added constraints on atomic positions.
However some patterns emerge (much of this is due to Joe Townsend and Nick Day). The quality of reporting structures in the literature has increased dramatically in some journals over the last 5-10 years – led by the IUCr journals. The IUCr has provided the CheckCIF service which is regularly used by authors to check the data for known problems. Some of these can be addressed before submission, others are simply results of the experiment. (The quality of structures themselves has also probably increased though this is more difficult to measure).
We have found that the data on CrystalEye are sufficient for many types of data-driven science. Whether they are suitable for ZZ’s work I cannot say but they are available for him to re-use in perpetuity. CrystalEye has the advantage that it combines inorganic and organic structures, although users should be aware that some of the inorganic materials relate to very early structures.
On CrystalEye we deliberately do not change any data: everything is exactly as it was reported in the author’s CIF. However we have a system where authors can comment, and we expect to collect such comments. We are also developing automatic filters to create collections that achieve certain levels of fitness, and we have been using these for calibration with GAMESS and MOPAC calculations.
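One of the simplest automatic filters of this kind is the close-contact check that ZZ describes: flag any pair of atom centres that fall within 0.2 Å of each other once neighbouring unit cells are taken into account. The following Python is a minimal sketch of that idea only; it is not the CrystalEye or SimBioSys code. The function names, the input layout and the restriction to the 27 nearest cell translations are assumptions made for illustration. It also assumes that the space-group symmetry operators have already been applied and that fractional coordinates lie in [0, 1).

```python
import math
from itertools import combinations, product


def metric_tensor(a, b, c, alpha, beta, gamma):
    """Metric tensor of a unit cell (lengths in angstroms, angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return [
        [a * a, a * b * cg, a * c * cb],
        [a * b * cg, b * b, b * c * ca],
        [a * c * cb, b * c * ca, c * c],
    ]


def length(d, G):
    """Length in angstroms of a fractional-coordinate difference vector d."""
    return math.sqrt(sum(d[i] * G[i][j] * d[j] for i in range(3) for j in range(3)))


def close_contacts(atoms, cell, threshold=0.2):
    """Return (label1, label2, distance) for atom pairs closer than threshold.

    atoms: list of (label, (x, y, z)) in fractional coordinates, assumed to be
           already expanded by the symmetry operators (hypothetical input layout).
    cell:  (a, b, c, alpha, beta, gamma).
    """
    G = metric_tensor(*cell)
    # Translations to the cell itself and its 26 neighbours, so that
    # contacts across a cell face are not missed.
    shifts = list(product((-1, 0, 1), repeat=3))
    hits = []
    for (l1, p1), (l2, p2) in combinations(atoms, 2):
        best = min(
            length([p2[k] + s[k] - p1[k] for k in range(3)], G) for s in shifts
        )
        if best < threshold:
            hits.append((l1, l2, best))
    return hits
```

For example, in a 10 Å cubic cell, atoms at x = 0.0 and x = 0.99 are reported as a contact of about 0.1 Å, because the second atom's image in the neighbouring cell sits just across the cell face. An entry with any such hit could be set aside for inspection rather than altered, in keeping with the policy of leaving the authors' data unchanged.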
Users of CrystalEye should be aware that currently only about half the major publishers expose authors’ CIFs.