There has been a lengthy correspondence on the SimBioSys blog about the availability of crystallographic data from the Cambridge Crystallographic Data Centre (CCDC). It raises general concerns about access to scientific data, refers to this blog, and puts in focus our own CrystalEye resource. I shall highlight sections without comment.
I need to make things clear before I do this:
- Although I am physically close to CCDC I have no formal contact and none of the CrystalEye work is informed by their data or practice.
- Although I could have access to CCDC data and software I have not used it for many years. CrystalEye does not contain any data not publicly accessible on the web, nor does it use CCDC Refcodes (6- or 8-character codes which uniquely identify entries).
SimBioSys has two posts, the second of which was partially in response to private emails. The first post:
[...] One would assume that the small molecule crystal structures of the Cambridge Structural Database (CSD) do not have such errors, since they have much higher resolution and deal with small molecules. Let me correct that wrong assumption!
[...] In the past year we have collected some statistics from the CSD with the hope to improve the accuracy of our scoring function by using more reliable, more precise data. Unfortunately, we had to learn the hard way, that the CSD isn’t so clean either. We have found a lot of obvious errors, like some atom centers falling within 0.2 Angstrom or less from each other when the crystal packing transformations are applied, some completely impossible bond lengths and angles. We kept adding sanity checks to report and exclude data entries with various obvious errors. At the end of the automated cleaning process, we had almost 15% of the data dropped for one reason or another. Then we thought the remaining data is good, we can use it for collecting the statistics.
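As an illustration of the kind of sanity check described here (atom centers falling within 0.2 Angstrom of each other once the packing transformations are applied), the following is a minimal sketch and not SimBioSys's actual code. It assumes an orthorhombic cell given as three edge lengths, fractional coordinates, and symmetry operations supplied as a rotation matrix plus a translation; the function names and the input layout are hypothetical.

```python
import itertools
import math


def apply_symop(frac, rot, trans):
    """Apply a symmetry operation to fractional coordinates, wrapped into [0, 1)."""
    return tuple(
        (sum(rot[i][j] * frac[j] for j in range(3)) + trans[i]) % 1.0
        for i in range(3)
    )


def min_image_distance(f1, f2, cell):
    """Distance in Angstrom between two fractional positions.

    Assumes an orthorhombic cell (all angles 90 degrees), so each axis
    can be handled independently using the nearest periodic image.
    """
    d2 = 0.0
    for x1, x2, length in zip(f1, f2, cell):
        delta = (x1 - x2) % 1.0
        delta = min(delta, 1.0 - delta)
        d2 += (delta * length) ** 2
    return math.sqrt(d2)


def find_clashes(atoms, symops, cell, threshold=0.2):
    """Return (label1, op1, label2, op2, distance) for suspiciously close atoms.

    atoms  : list of (label, (x, y, z)) in fractional coordinates
    symops : list of (3x3 rotation, translation) pairs
    An atom coinciding exactly with its own image is treated as sitting on
    a special position and is not reported.
    """
    images = [
        (label, k, apply_symop(frac, rot, trans))
        for label, frac in atoms
        for k, (rot, trans) in enumerate(symops)
    ]
    clashes = []
    for (l1, k1, p1), (l2, k2, p2) in itertools.combinations(images, 2):
        d = min_image_distance(p1, p2, cell)
        if l1 == l2 and d < 1e-6:
            continue  # same atom on a special position
        if d < threshold:
            clashes.append((l1, k1, l2, k2, d))
    return clashes


IDENTITY = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.0, 0.0, 0.0))
INVERSION = ([[-1, 0, 0], [0, -1, 0], [0, 0, -1]], (0.0, 0.0, 0.0))

# Made-up example: O1 lies almost on top of the inverted image of C1.
example_atoms = [("C1", (0.25, 0.25, 0.25)), ("O1", (0.755, 0.75, 0.75))]
example_clashes = find_clashes(example_atoms, [IDENTITY, INVERSION], (10.0, 10.0, 10.0))
```

A real check would need the full cell metric for non-orthogonal cells and some handling of disorder, where partially occupied sites can legitimately sit close together.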
[...] Then Bashir pulled out one of the worst examples, where the X-ray structure had very high strain energy: CSD code [REDACTED] has two carboxylate groups as shown on the image [REDACTED]. The original structure from the CSD is displayed with thick bonds and the optimized one with thin bonds; you can see that the optimization has twisted the two carboxylates out of the plane of the aromatic ring in order to avoid two lone pairs facing each other.
So, the moral of the story: we can’t even trust the high resolution CSD data, let alone the PDB.
Since the posting of this blog entry, we have received 2 public comments (displayed in the standard way, like all comments, by the WordPress blog software) and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment, which explains the problems and directs the blame on my naivety for my wrong expectations about the data, was not displayed as prominently as the original article. To correct this problem, I will quote the entire comment here:
[... details snipped ... PMR] Though admittedly for QUICNA note that the choice is inconclusive based on the 4 lists given: I think the hydrogen list may not account for deuteration.
The other main point raised was that our CCDC license has expired since the data collection was made, therefore we can no longer use any data, even derived data, from the CSD. We certainly fully obey this cease and desist order and will not use any of the data. We have not made any publications containing data from the CSD except for this blog entry (and I have now removed the code name and the image to comply with the order), and none of the released versions of our software contains such data either. [...]
In my personal opinion, such restrictions on the use of scientific facts do not seem to make much sense. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made the mistake of signing such a draconian agreement.
PMR: The quoted FAQ on CrystalEye was written by me. The second post:
My previous post about errors in crystal structures has triggered strong reactions from CCDC (not only a response post and direct email, but also an email to my former PhD supervisor in the UK asking him for remedy and explanation). Apparently, they have interpreted my post as an attack on the quality of their services. Let me clarify first that I never intended to imply anything negative or derogatory about the CCDC services or software. My sincere apologies if my post came across that way. All I wanted to do is raise awareness in the docking/scoring community that small molecule crystallographic data is not free of errors. My understanding is that the data deposited in the CSD has been determined by thousands of people all over the world and published in various scientific journals, while CCDC aggregates the data and creates a comprehensive, validated and value-added database known as the Cambridge Structural Database (CSD), and the complete CSD System (CSDS) includes the CSD itself and associated software for search, visualisation and analysis of stored information. I acknowledge that CCDC provides a valuable service to the community and any error in the data is not their fault.
They have also sent us a “friendly reminder” that since our license to the CSD has expired, according to the signed agreement we are not allowed to retain or use any data downloaded from the CSD, not even any derived information or data. As I already stated in the update added to the previous blog entry, we have ceased using any data derived from the CSD to comply with the license. I have even removed the image of the molecule from the post (since that can also be considered as derived data). We have not incorporated any data into our software. [...] One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such as is available from CrystalEye. When even non-profit organizations (registered as charities) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even more strongly. It is difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company that sells services on the data. It is ironic that the links expressing the need for open data and the open repository happen to point to a web site within the same University where CCDC resides. ZZ