Monthly Archives: June 2008

Availability of Crystallographic Data and Errors therein

There has been a lengthy correspondence on the SimBioSys blog about the availability of crystallographic data from the Cambridge Crystallographic Data Centre (CCDC). It raises general concerns about access to scientific data, refers to this blog, and puts in focus our own CrystalEye resource. I shall highlight sections without comment.

I need to make things clear before I do this:

  • Although I am physically close to CCDC I have no formal contact and none of the CrystalEye work is informed by their data or practice.
  • Although I could have access to CCDC data and software I have not used it for many years. CrystalEye does not contain any data not publicly accessible on the web, nor does it use CCDC Refcodes (6- or 8-character codes which uniquely identify entries).

SimBioSys has two posts, the second of which was partially in response to private emails:

SimBioSys: Crystal structure errors — in CSD too

[...] One would assume that the small molecule crystal structures of the Cambridge Structural Database (CSD) do not have such errors, since they have much higher resolution and dealing with small molecules. Let me correct that wrong assumption!

[...] In the past year we have collected some statistics from the CSD with the hope to improve the accuracy of our scoring function by using more reliable, more precise data. Unfortunately, we had to learn the hard way, that the CSD isn’t so clean either. We have found a lot of obvious errors, like some atom centers falling within 0.2 Angstrom or less from each other when the crystal packing transformations are applied, some completely impossible bond lengths and angles. We kept adding sanity checks to report and exclude data entries with various obvious errors. At the end of the automated cleaning process, we had almost 15% of the data dropped for one reason or another. Then we thought the remaining data is good, we can use it for collecting the statistics.

[...] Then Bashir has pulled out one of the worst examples where the X-ray structure had very high strain energy: CSD code [REDACTED] has two carboxylate groups as shown on the image [REDACTED]. The original structure from the CSD is was displayed with thick bonds and the optimized one has thin bonds, you can see the optimization has twisted the two carboxylate out of the plane of the aromatic ring in order to avoid two lone pair facing each other.


So, the morale of the story: we can’t even trust the high resolution CSD data, let alone the PDB.



Since the posting of this blog entry, we have received 2 public comments — displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment — which explains the problems and directs the blame on my naivity for my wrong expectations about the data — was not displayed as prominently as the original article. To correct this problem, I will quote the entire comment here:

Author: J, June 16th, 2008 at 8:39 am

[... details snipped ... PMR] Though admittedly for QUICNA note that the choice is inconclusive based on the 4 lists given: I think the hydrogen list may not account for deuteration.

The other main point raised was, that our CCDC license has expired since the data collection was made, therefore we can no longer use any data — even derived data — from the CSD. We certainly fully obey this cease and desist order and will not use any of the data — we have not made any publications containing data from CSD except for this blog entry (and I have now removed the code name and the image to comply with the order) and none of the released versions of our software containes such data either. [...]

On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download“. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement.


PMR: The quoted FAQ on CrystalEye was written by me. The second post:

Public apology to CCDC

My previous post about errors in crystal structures have triggered strong reactions from CCDC (not only response post, direct email, but also email to my former PhD supervisor in the UK asking him for remedy and explanation). Apparently, they have interpreted my post as an attack on the quality of their services. Let me clarify first, that I have never intended to imply anything negative or derogatory about the CCDC services or software. My sincere apologies if my post came across that way. All I wanted to do is raise awareness in the docking/scoring community that small molecule crystallographic data is not free of errors. My understanding is, that the data deposited in CSD has been determined by thousands of people all over the world and published in various scientific journals, while CCDC aggregates the data and creates a comprehensive, validated and value-added database known as the Cambridge Structural Database (CSD), and the complete CSD System (CSDS) includes the CSD itself and associated software for search, visualisation and analysis of stored information. I acknowledge that CCDC provides a valuable service to the community and any error in the data is not their fault.

They have also sent us a “friendly reminder” that since our license to CSD has expired, according to the signed agreement we are not allowed to retain or use any data downloaded form CSD, not even any derived information or data. As I already stated in the update added to the previous blog entry, we have ceased using any data derived from CSD to comply with the license. I have even removed the image of the molecule from the post (since that can also be considered as derived data). We have not incorporated any data into our software. [...] One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data. It is ironic that the links expressing the need for open data and the open repository happens to point to a web site within the same University where CCDC resides. ZZ

PMR: The links are mainly to entries on my blog discussing the need for Open Scientific data. I have not discussed the CCDC data or policy in those posts and to avoid confusion shall not do so here.
The position with CrystalEye data is as clear as I can make it. The data come from freely accessible files on various publishers' websites. I do not believe the data in these files to be copyright. All raw and derived data are distributed as Open Data except for the CIFs on the ACS website which the publisher claims as their copyright. If users want these they can visit the website.
We have spent 3 years analysing this data and we can confirm that the crystallographic literature contains many places where the user needs to be careful in re-using them. I will not call them all "errors" - some are accurate determinations of uncertainty in the crystallographic experiment (e.g. disordered atoms or atoms whose identity is unknown). There are others where the author(s) have added constraints on atomic positions.
However some patterns emerge (much of this is due to Joe Townsend and Nick Day). The quality of reporting structures in the literature has increased dramatically in some journals over the last 5-10 years - led by the IUCr journals. The IUCr has provided the CheckCIF service which is regularly used by authors to check the data for known problems. Some of these can be addressed before submission, others are simply results of the experiment. (The quality of structures themselves has also probably increased though this is more difficult to measure).
We have found that the data on CrystalEye are sufficient for many types of data-driven science. Whether they are suitable for ZZ's work I cannot say but they are available for him to re-use in perpetuity. CrystalEye has the advantage that it combines inorganic and organic structures, although users should be aware that some of the inorganic materials relate to very early structures.
On CrystalEye we deliberately do not change any data which is exactly as it was reported in the author's CIF. However we have a system where authors can comment and we could expect to collect these. We are, however, developing automatic filters to create collections that achieve certain levels of fitness and we have been using these for calibration with GAMESS and MOPAC calculations.
Users of CrystalEye should be aware that currently only about half the major publishers expose authors' CIFs.
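As an illustration of what an automatic filter of this kind might look like, here is a minimal Python sketch of a single check: the close-contact test that ZZ describes in the quoted post (atom centres within 0.2 Angstrom of each other). It is not the CrystalEye code. The function names, the input layout (atom labels with Cartesian coordinates in Angstrom, symmetry operations already applied) and the default threshold are all assumptions made for the example.

```python
import itertools
import math


def close_contacts(atoms, threshold=0.2):
    """Return pairs of atom labels whose centres are closer than `threshold`.

    `atoms` is a list of (label, (x, y, z)) tuples in Cartesian Angstrom
    coordinates. The layout and the 0.2 default are assumptions for this
    sketch, not a CrystalEye convention.
    """
    flagged = []
    for (label_a, pos_a), (label_b, pos_b) in itertools.combinations(atoms, 2):
        if math.dist(pos_a, pos_b) < threshold:
            flagged.append((label_a, label_b))
    return flagged


def passes_filter(atoms, threshold=0.2):
    """An entry passes this one check if no pair of atoms is flagged."""
    return not close_contacts(atoms, threshold)
```

A real collection would combine several such checks (disorder flags, unassigned atoms, constrained positions) and, in keeping with the policy above, would select entries rather than alter them.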

CrystalEye in Chemspider - stereochemistry

In a previous post (CrystalEye links in Chemspider) and links I discussed the information and meta-information in Chemspider relating to an entry in CrystalEye. I took the first one in the collection. [I skip the second as I believe the crystallographer providing the entry for CE has misreported either the stereochemistry or the name]. I shall take the third, which links to this page. I shall here only discuss the identity of this compound.

On the CrystalEye site it is reported as a 2007 paper with the title L-Asparagine

and the structure given refers to the conventional use of L- in amino acids. There is no conventional structure diagram and there is no Flack parameter (which is definitive for absolute stereochemistry) - however I am assuming it is correct. As a result we computed the InChI and SMILES as:


InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)/t2-/m0/s1


[H]N([H])C(=O)C([H])([H])[C@]([H])([N+]([H])([H])[H])C(=O)[O-]

The last part of the InChI denotes the stereochemistry and the "@" in SMILES does the same. Checking the SMILES in the Daylight Depict facility gives:

SMILES: [H]N([H])C(=O)C([H])([H])[C@]([H])([N+]([H])([H])[H])C(=O)[O-]

which seems correct.

The Chemspider page for 231 is


Inherent Properties, Identifiers and References
ChemSpider ID: 231
Empirical Formula: C4H8N2O3
Molecular Weight: 132.1179
Nominal Mass: 132 Da
Average Mass: 132.1179 Da
Monoisotopic Mass: 132.053492 Da
Systematic Name: 2,4-diamino-4-oxo-butanoic acid
InChI: InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)

Names and Synonyms

200-735-9 [EINECS/ELINCS]
2058-58-4 [RN]
218-163-3 [EINECS/ELINCS]
2-Aminosuccinamic acid
a-Aminosuccinamic acid
Aspartic Acid b-Amide
2,4-diamino-4-oxobutanoic acid
221-521-1 [EINECS/ELINCS]
2-amino-3-carbamoylpropanoic acid
3130-87-8 [RN]
7006-34-0 [RN]
70-47-3 [RN]
alpha-aminosuccinamic acid
asparagine [Wiki]
Asparagine (VAN)
asparagine acid
Asparagine, DL-
aspartic acid beta-amide
DL-Asparagine monohydrate
DL-Aspartic acid 4-amide



The "3D" structure appears to show D-asparagine. The picture above appears to have no stereochemical information. From this brief survey I have not found any compounds on CS showing stereochemical information.

It seems that Chemspider strips this stereochemical information from the InChI, SMILES and coordinates that we provide on our site. This may not matter to many Chemspider visitors but it seems that it is not easy (or impossible) to find stereochemistry. Personally I am uncomfortable with DL-asparagine pointing to a structure on our site which is clearly not racemic.
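The difference between the two identifiers can be checked mechanically. The sketch below is my own illustration in Python, not anything Chemspider or CrystalEye runs. It splits an InChI into its "/"-separated layers and drops the stereochemical ones (/b, /t, /m, /s). Applied to the InChI we computed (with the stray spaces removed), it reproduces exactly the InChI shown on the Chemspider page for 231.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # double-bond, tetrahedral, mirror, stereo-type layers


def strip_stereo(inchi):
    """Return the InChI with its stereochemical layers removed."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


def has_stereo(inchi):
    """True if the InChI carries at least one stereochemical layer."""
    return strip_stereo(inchi) != inchi


CRYSTALEYE = "InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)/t2-/m0/s1"
CHEMSPIDER = "InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)"

print(strip_stereo(CRYSTALEYE) == CHEMSPIDER)
```

The formula layer always starts with an upper-case element symbol, so testing the lower-case layer prefixes is enough for this purpose. A production tool should use the InChI software itself rather than string surgery.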

Chemspider has listed under "Names and Synonyms":


DL-Asparagine monohydrate

To crystallographers these are completely different substances. Indeed there is a great deal of research as to what compounds form solvates and why (often called pseudopolymorphs). There is no clear way in which Chemspider can hold polymorphs, hydrates, salts, etc.

On CrystalEye every CIF data_block is treated as a distinct entry and those with different cell dimensions will be regarded as distinct polymorphs. We also hold polymorphs with different temperatures and pressures as distinct entries.

In conclusion therefore, many distinct entries in CrystalEye are mapped onto a single record in Chemspider, which therefore provides the same metadata for each. In many cases this metadata (chirality, hydration) will not be correct so users should be very careful.
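The many-to-one mapping can be made concrete with a small Python sketch. Everything in it is hypothetical: the entry ids are invented, and the three entries merely stand for the kinds of crystal discussed above (an anhydrous L form, an L monohydrate and a racemic monohydrate). The only thing the entries share is the stereo-free connectivity InChI, which is all that a record keyed on connectivity can see.

```python
from collections import defaultdict

ASN = "InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)"

# Hypothetical entries, invented for illustration only.
ENTRIES = [
    {"id": "entry-1", "connectivity": ASN, "chirality": "L", "hydrate": False},
    {"id": "entry-2", "connectivity": ASN, "chirality": "L", "hydrate": True},
    {"id": "entry-3", "connectivity": ASN, "chirality": "DL", "hydrate": True},
]


def map_to_records(entries):
    """Group entries by connectivity, as a connectivity-keyed record would."""
    records = defaultdict(list)
    for entry in entries:
        records[entry["connectivity"]].append(entry)
    return dict(records)


def ambiguous_fields(entries, fields=("chirality", "hydrate")):
    """List the metadata fields on which the grouped entries disagree."""
    return [name for name in fields if len({e[name] for e in entries}) > 1]


records = map_to_records(ENTRIES)
for key, group in records.items():
    print(len(group), "entries share one record; they disagree on", ambiguous_fields(group))
```

Any single value of chirality or hydration attached to the shared record is therefore wrong for at least one of the entries that point to it.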

CrystalEye links in Chemspider

I have agreed to review the CrystalEye data in Chemspider and before reading this post you should read the background carefully (CrystalEye and Chemspider). The main points are that CrystalEye was not designed to be redistributable and that the method that we were asked to use (InChI/URL/SDFile) leads to massive semantic loss and possible corruption. I have also undertaken to give an objective review and to make no judgments. This post will not endorse or criticize Chemspider per se.

I do not vouch for the accuracy of information in this post. I believe that what Chemspider had access to was connectionTable-URL pairs. It is possible to use these to download more information from our site if required.

I also stress very strongly that CrystalEye is a crystallographic site and that the relation of crystallography to chemistry is non-trivial.

I shall also assume that the CrystalEye collection in Chemspider might be discovered by someone who was not familiar with the organisation and motive of the site.

So to report... After a few minutes I found the collection of CrystalEye under (this link) - I hope it is the correct place to start. It links through to a page describing CrystalEye with further links to our homepage. It describes us as:

Physical Properties (including SAR/QSAR databases)
Information Aggregators

Organizational logo, personal photo or avatar, up to 50K in size. It will be shown on publicly accessible data source web page.

I do not know what "approved" means but it does not imply endorsement or approval by ourselves. (This is a general point for all aggregators).

The page starts with the heading (This may cause problems for some readers' browsers as it is wide):


24539 hits found in 29.86 seconds. Search terms: DATA_SOURCE in (CrystalEye) AND SingleComponent AND NonIsotopic

ID | Structure | Empirical Formula | Molecular Weight | Monoisotopic Mass, Da | LogP | ACD/LogD (pH 5.5) | ACD/LogD (pH 7.4)
116 | [structure links] | C4H9NO2 | 103.1198 | 103.063329 | ACD/LogP: -0.64; XLogP: -0.70; ALOGPS: -2.99 | -3.15 | -3.14


I interpret this to mean that there are about 25000-30000 entries from CE which have been put into Chemspider. I do not know the exact number as CS can have multiple links.

The only information in this table that came from CE is the connection table (but not the depiction) and jmol/"cell" - see below. The ID is a Chemspider ID, not a CE link. The other columns are presumably generated by a program. There is nothing on the page to indicate what they mean. The average crystallographer visiting the site might conclude this had nothing to do with crystallography and leave at this stage (there is no link to CrystalEye from this page). I assume the data are computed LogP which is outside most crystallographers' daily experience or requirements. I make no judgment on whether they are generally useful.

The "Structure" cell contained 4 links. I cannot depict them on this blog - you will have to click. The "jmol" (sic) linked to a page with 3 links "2D" "3D" and "Cell". Jmol (sic) is a 3D program and should not be used for 2D diagrams. The 3D link did not work in Firefox 2. In IE it appeared to create a molecule in real time, without reference to the crystallography. This is very seriously misleading as a user of a crystallographic resource would expect the molecular structure displayed to be the one in the crystal. The "cell" resource brought up Jmol and displayed the same cell information as on our site. The molecule was depicted with a spurious bond - I do not know whether this is an artefact of the crystallography or Chemspider.

The link "116" takes the reader to a page which combines all the information for this compound (GABA). In this case, but not others, there is probably agreement about its identity. I will attempt to copy salient points:


Inherent Properties, Identifiers and References
ChemSpider ID: 116
Empirical Formula: C4H9NO2
Molecular Weight: 103.1198
Nominal Mass: 103 Da
Average Mass: 103.1198 Da
Monoisotopic Mass: 103.063329 Da
Systematic Name: 4-aminobutanoic acid
InChI: InChI=1/C4H9NO2/c5-3-1-2-4(6)7/h1-3,5H2,(H,6,7)

Original Reference(s)

Gamma-aminobutyric acid (GABA) is an amino acid and the chief inhibitory neurotransmitter in the mammalian central nervous system. As such, GABA plays an important role in regulating neuronal [snipped, PMR]


The SMILES and InChI are presumably either used as the primary link for this page or computed. They are not taken from the CrystalEye site (which provides both). I do not know whether they have been verified against our site. Lower down we find:

CrystalEyeLink to Record

which does what it says - links to a page on the CE site. Some people will find the linkage of the textual information on GABA and the links to the crystal structure useful.

Lower down the page we find:


Names and Synonyms

200-258-6 [EINECS/ELINCS]
Acide amino-4- butyrique [French]
butanoic acid, 4-amino-
Butyric acid, 4-amino-
g-Aminobutyric Acid
g-Amino-n-butyric Acid
Gamma aminobutyrate
gamma Aminobutyric acid
gamma-Aminobutanoic acid
gamma-Aminobutryic acid
Gamma-aminobutyric acid [JAN]
gamma-amino-n-butyric acid
.gamma.-Aminobutanoic acid
.gamma.-Aminobutyric acid
.gamma.-Amino-N-butyric acid
4-Aminobutanoic acid
4-aminobutyric acid
4-amino-n-butyric acid
56-12-2 [RN]
Butanoic acid, 4-amino- (9CI)
gamma-Aminobutyric acid
gamma-Aminobutyric acid (JAN)
gamma-Aminobutyric acid-carboxy-14C
omega-Aminobutyric acid
Piperidic acid
Piperidinic acid
w-Aminobutyric acid



Database ID(s)

CCRIS 3721
DF 468
EPA Pesticide Chemical Code 030802
NSC 27418

Predicted Properties

ACD/LogP: -0.64
XLogP: -0.70
ALOGPS: -2.99
# of Rule of 5 Violations: 0
ACD/LogD (pH 5.5): -3.15
ACD/LogD (pH 7.4): -3.14
ACD/BCF (pH 5.5): 1
ACD/BCF (pH 7.4): 1
ACD/KOC (pH 5.5): 1
ACD/KOC (pH 7.4): 1
# H bond acceptors: 3
# H bond donors: 3
# Freely Rotating Bonds: 4
Polar Surface Area: 29.54 Å^2
Index of Refraction: 1.465
Molar Refractivity: 25.68 cm^3
Molar Volume: 92.8 cm^3
Polarizability: 10.18 x 10^-24 cm^3
Surface Tension: 46.2 dyne/cm
Density: 1.11 g/cm^3
Flash Point: 103.8 °C
Enthalpy of Vaporization: 53.43 kJ/mol
Boiling Point: 248 °C at 760 mmHg
Vapour Pressure: 0.00798 m


PMR: I will have more to say on names and synonyms later but for those here I have no particular comment.

Readers should make their own judgment about the value of predicted properties. They should note that GABA is a solid at room temperature and so the concept of surface tension and several other properties is irrelevant. I do not know whether the other properties refer to the solid or liquid states but personally I would not use them for anything. I also observe that a machine reading this page (and even some humans) could easily not notice that the properties were not observed.

In general - apart from the prediction of properties - the aggregation provided for this compound is probably useful to many people though probably not for mainstream crystallographers. It does, however, require a great deal of expert judgment to determine what properties are useful and which are seriously misleading. I would not, for example, recommend its use in undergraduate teaching.

I shall comment on two other entries later and in any case it's a good point to break as this blog struggles with cut and paste.

CrystalEye and Chemspider

Chemspider (Antony Williams) has asked on his blog for feedback on CrystalEye and I shall respond in some detail. I shall try to exclude any personal judgment and not make statements about the value of the process. In essence it will be the factual material that I would write were I asked to review it for a scientific journal, but I shall omit any judgment. I'm spending time on this review because it may hopefully be useful in making clear parts of our approach to chemical ontology.

Firstly some background. I have blogged about the genesis of CrystalEye which was solely to support our research. It happened that soon after it appeared there were requests for the data. CrystalEye was NOT created with the idea that it was a redistributable. It had no unique ID system and was also heavily hyperlinked, some being hardcoded.

The data is Openly available and can, in principle, be completely downloaded with wget or other spidering tools. There is no technical restriction though we would ask that people contact us beforehand or use a sensitive robot. However the site is updated daily and so it is extremely difficult to take a snapshot which has referential integrity - by the time that someone has finished spidering the data has changed considerably.
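By a "sensitive robot" I mean something along the lines of the following Python sketch, which honours robots.txt rules and pauses between requests. It is an illustration only: the function name, the user-agent string and the one-second delay are assumptions, and the actual download is passed in as a function so that the sketch can be run without touching any site.

```python
import time
from urllib import robotparser


def polite_crawl(urls, fetch, robots_lines, user_agent="example-bot",
                 delay=1.0, sleep=time.sleep):
    """Fetch each permitted URL in turn, pausing `delay` seconds between fetches.

    `fetch` is any callable taking a URL and returning its content;
    `robots_lines` is the robots.txt content as a list of lines.
    All names and defaults here are assumptions for the sketch.
    """
    rules = robotparser.RobotFileParser()
    rules.parse(robots_lines)
    pages = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # the site has asked robots to stay away from this path
        if pages:
            sleep(delay)  # wait between successive requests
        pages[url] = fetch(url)
    return pages
```

Politeness does not solve the snapshot problem: a slow, well-behaved crawl takes longer, so more of the site will have changed by the time it finishes.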

The data are complex. They consist of ca 100,000 entries, several million entries of derived data (fragments, etc.) and statistics (bond lengths). Much of this is difficult to redistribute in principle.

It is critical to realise that one of the primary parts of the research was the preservation of semantics thus all CIF data has been transformed into CML and we believe this is almost lossless. Therefore the CML pages (which are Openly downloadable) are the primary semantic resource.

At this stage we (mainly Jim and Nick) provided limited redistributability through the following mechanisms:

  • RSS feeds on new entries. This is easy to implement and, for example, I get a daily feed of some tens of new entries. That's available to anyone with a feedreader. So the simplest way to get entries is to subscribe to the RSS feed.
  • A tool - which Jim wrote specially - for downloading the "backlog". This tool is publicly available and should be capable of downloading the complete set of entries (but not the derived data)

This is the only mechanism we can produce for avoiding semantic loss and corruption.
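For anyone who would rather script the feed than use a feedreader, a minimal Python sketch follows. It assumes a plain RSS 2.0 layout (item elements with title and link children, no namespaces). I have not checked it against the real feed format here, and the sample feed and its URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up feed for illustration; the titles and URLs are hypothetical.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical new-entries feed</title>
    <item>
      <title>Example entry A</title>
      <link>http://example.org/entries/a</link>
    </item>
    <item>
      <title>Example entry B</title>
      <link>http://example.org/entries/b</link>
    </item>
  </channel>
</rss>"""


def new_entries(rss_text):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in new_entries(SAMPLE_FEED):
    print(title, link)
```

Each link would then be followed to the entry page, from which the CML can be downloaded.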

We were, however, asked if we could provide the data in SDF (MDL molfile format). This necessarily involves massive semantic loss as the format cannot hold crystallography and CrystalEye is a crystallographic site. Moreover many crystal structures do not fit into the molfile concept at all - diamond and graphite are clear examples. Many - perhaps most - crystals consist of two or more components ("moieties" in CIF). The identity of these cannot be preserved in molfiles.

We were therefore reluctant to convert the data to molfiles and/or InChI for these and many other reasons. Put simply, they are likely to lead to massive loss and corruption.

We excluded many CrystalEye entries from what we sent to Chemspider; what we sent was restricted as follows:

  • Only molecular crystals with one molecule in the crystallochemical unit.
  • No purely inorganic structures and I believe no metal-organic ones.

Moreover what was provided was essential link information, not data. Essentially these were connectionTable-URL pairs.
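To show the shape of what was exported, here is a Python sketch of a single-component filter producing such pairs. It is not the code Jim wrote. It assumes each entry carries the CIF item _chemical_formula_moiety (in which components are separated by commas), uses an InChI as a stand-in for the connection table, and uses invented example.org URLs.

```python
def moieties(formula_moiety):
    """Split a CIF _chemical_formula_moiety value into its components."""
    return [part.strip() for part in formula_moiety.split(",") if part.strip()]


def link_pairs(entries):
    """Return (connection table, URL) pairs for single-component entries only."""
    return [
        (entry["inchi"], entry["url"])
        for entry in entries
        if len(moieties(entry["formula_moiety"])) == 1
    ]


# Hypothetical entries: the URLs are invented; the hydrate would be excluded.
ENTRIES = [
    {
        "formula_moiety": "C4 H9 N O2",
        "inchi": "InChI=1/C4H9NO2/c5-3-1-2-4(6)7/h1-3,5H2,(H,6,7)",
        "url": "http://example.org/entries/gaba",
    },
    {
        "formula_moiety": "C4 H8 N2 O3, H2 O",
        "inchi": "InChI=1/C4H8N2O3/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,(H2,6,7)(H,8,9)",
        "url": "http://example.org/entries/asparagine-hydrate",
    },
]

print(link_pairs(ENTRIES))
```

Note how little survives: the pair says nothing about the cell, the temperature, or the components that were filtered out, which is the semantic loss described above.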

Since CrystalEye was not developed for export we have not checked the validity of the connection tables or their suitability for linking. This is non-trivial (and in many cases not possible with the data we have - it would require extraction of data from full-text). We therefore make no guarantee about the suitability of the InChI or the connection table for any purpose. As examples we add information on bond order, charges and stereochemistry. This requires heuristics which we know to fail in certain cases. We do not have any metrics on this, though you will see later that it happens.

ChemSpiderMan Says: June 22nd, 2008 at 12:43 am

Peter: Regarding “Jim Downing has put in considerable work to create a subset of CrystalEye for Chemspider who now wish use to review their site:” and “I have stressed several times that Nick is writing his thesis and has no time to review commercial sites - however I will do so sometime in the next few posts”.

Thanks to Jim for the considerable work, and I have already acknowledged it in an email to all of you.

It would be good to receive your feedback on the deposition of CrystalEye onto ChemSpider but you are under no obligation to do so. But, I would think that since ChemSpider is the first site other than CrystalEye itself (I believe this to be true) to host your data I thought you might want to look at it. If not that’s fine too.

Your modus operandi in regards to ChemSpider is to not provide direct feedback to us regarding any issues but rather to blog about any issues. My earlier email regarding providing feedback was a request to provide feedback directly to us on any obvious issues so we could resolve them. I understand your preference is to simply blog about them so I will monitor your site for the comments instead. I look forward to your comments.

PMR: You have made this point several times. I use my blog because (a) I know how it works (b) can add images to it (c) post long posts which IMO are not suitable for comments on other blogs (d) have a readership to which I address wider questions (e) know that you monitor this blog. In general I regard blog comments as useful for shortish comments which do not normally need large replies - this may be unusual but it's how I treat this blog.

CS: To Nick- good luck with the thesis and apologies if the request to check out CrystalEye on ChemSpider was a distraction. I believe one of the outcomes of CrystalEye is Open Data that could proliferate to other sites ESPECIALLY now that Jim has done the hard work to create the dataset for download. I would hope that having your work more highly exposed would be good for you and you would have more bragging rights in your thesis in regards to your contribution to the domain of crystallography and Open Data since over 5000 people per day frequenting ChemSpider could end up over on CrystalEye as a result of the connection. I think this is good for the project personally. I look forward to seeing the data shown up on PubChem, eMolecules and other sites shortly.

Either way, of “referred sites” driving traffic to ChemSpider what is interesting to observe is that is now in the top 10 of referring sites so the benefit is mutual. Thanks!

PMR: Although it's nice to have people visit CrystalEye - and this is not meant unkindly - the primary thing we have to focus on is whether its creation can be justified as a scientific resource which is worthy of being included in a thesis and whether the data is fit for purpose. We do not, in fact, have any idea of how many hits we get - this carries no weight with those who assess our science. Only publication and grants matter. Most of everything else is a distraction. If we end up with a new architecture for depositing data that could be of considerable interest to funders.

In the next post I shall give an objective analysis of the CrystalEye links in Chemspider.


Open data is essential for science

An important set of papers on Open Data and science:

The June issue of the Journal of Science Communication is now available (Peter Suber). OA-related articles:

The first is an editorial overview. Bora's has been blogged elsewhere and I'll concentrate on John's. You should read it rather than relying on me. But what I take is:
  • if we try and apply ANYTHING other than the public domain to scientific facts we shall not be able to manage scientific data. Problems include aggregation, restrictions (however reasonable) on re-use, cascading attribution, different jurisdictions
  • the public domain is NOT another licensing scheme. It is as free as the air we breathe. No-one has to ask permission to breathe. It is NOT copyright
  • It must be supplemented by community norms. Yes, you may legally do anything with this data, but if you do X and Y we shan't like it and this might affect future funding, collaborations, publishability, etc.
We have no alternative. Everything else descends into infinite recursion and hypotheticals. You cannot control how a marketing analyst might use meteorite data. Or what data sets are useful for developing new machine-learning techniques. Or how word frequency in scientific texts gives a greater understanding of the structure of the brain. Or...
We don't know how to do some of this in practice but it shouldn't stop us trying. The simplest thing is to add the "Open Data" sticker from the Open Knowledge Foundation as we do in CrystalEye. This says "This data is Open". You can do what you want. If you are the primary user please acknowledge us (we shan't sue if you don't but it's simple human courtesy). If you aggregate into another resource and "our" data was a major input it would be nice to have it acknowledged. If you take a few parts it's probably overkill to acknowledge the bits.
What can you do to help? If you are a scientist, add the "Open Data" sticker to your data. This stops it being possessed and controlled by third parties. Even if they do you can point to the fact it was labelled as that originally which I hope would resolve lawsuits. If you are a publisher who believes in Open Data, add some statement to your web site that makes it clear that the data are open. DON'T try to control it. DON'T add CC-NC licences. These are impossible to use anyway. Indeed don't use any CC licences.
If you are a publisher who understands what I have just written and fail to label the data that passes through your organ as "open" then you are actively impeding science. This is a strong and accurate statement. Your refusal to help support the free flow and reuse of data makes it harder for scientists to make discoveries, makes it harder for readers to judge the quality of data in your journals.
And, if at the same time you are making money by restricting access to scientific data created and provided by the scientific community (and not by yourself) then not only your effect, but your intent is also clear.

CrystalEye - an example of a data repository

I shall be writing a number of posts about (chemical) crystallography - which may be of wider interest to those interested in data quality assessment, robotic harvesting, robotic calculation, hyperlinking, repositories and the free access to scientific data. I'll start by talking abour CrystalEye - what it is and where it may be going.

We are generally interested in the area of data-driven, or data-enabled science in the scientific "long-tail". Can machines extract useful information from the hetereogeneous mass of data that increases daily. And - because we are chemists - we have chosen to do this in chemistry, although it has serious problems of restrictive access to data. The area which has turned out to be most fruiful has been chemical crystallography - the determination of the structures of "small molecules" by diffraction methods. In this we pay great tribute to the International Union of Crystallography which is probably unsurpassed in its commitment to data quality and data preservation. Moreover they are delightful people to work with.

The basic questions included:

  • Can machines aggregate enough public data to be useful? (We did not wish to use publisher-firewalled data in case of legal threats). The answer is definitely yes (10 years ago it would probably have been no). The method of aggregation is to spider/scrape the websites of publishers who expose the crystallographic data submitted by authors as supplemental data. (More later)
  • Is the data of high enough quality to do useful work with? This is difficult to answer without a lot of work, and that work has been put in by Joe Townsend, Nick Day, Mark Holt, Jim Downing and myself, with input from colleagues and IUCr. We have taken as measures (a) the syntactic quality of the data - and here different sites are very different. Acta Cryst in 2008 is excellent, as is RSC. The Crystallography Open Database (COD) is very variable with a small fraction of questionable material. (I should praise the COD for their commitment to Open Data, and COD remains the unique source for many inorganic structures. But the quality is dependent on what the depositors submit, whereas IUCr operates strict quality checks). Ideally we would not wish to make many subjective judgments and we have metrics on the syntactic quality in some journals. Publishing this may be problematic in case we encounter a legal response from publishers.
  • Is the data of scientific value? Does the automatic use of crystal structures provide enough information that can be extrapolated to chemistry? This is a field that people such as Jack Dunitz, Hans Buergi, Sam Motherwell and myself helped to start in the late 1970s and the answer is generally "yes". Before embarking on this, however, we have undertaken an intensive program to determine whether the information in any given structure "agrees" with other relevant data. We have done this by (a) abstracting individual molecules and doing high-quality QM calculations (Joe Townsend) and (b) calculating the complete crystal structure (MOPAC, Nick Day). These studies provide a great deal of information about errors in the data and problems in the calculation method. Joe has produced a protocol which can be used to determine whether a structure might be used for accurate work. In general, for non-metal compounds it is the crystallography which provides more variance. For the MOPAC calculations there is much more variance in the calculations and relatively little concern with crystallographic errors (though the program has discovered a few).

Nick Day chose to aggregate the whole of the visible crystallographic web and this now runs to ca 100,000 entries. (It is difficult to decide what a "structure" is - there are ca 80 different entries for SiO2 (silica) under different conditions, although most substances only have one entry.) He chose to use CML as the primary mechanism for holding the information and it would have been impossible to do the work without this. CML is lossless and is also now integrated into a number of computational chemistry programs.

We chose to expose the aggregated data to the world as "Open Data" since we feel it is fundamentally Open. As we are in the business of creating semantic chemistry we have also created RDF tools which help support it. Since we are also interested in variability of molecular structure we have computed chemical concepts from it - these are necessarily heuristic. The original data contains no explicit chemistry such as connection tables (though we are working with the IUCr on this) and this is the primary motive for adding concepts such as chemical bonds, stereochemistry, bond orders, InChI and SMILES. However there are likely to be arbitrary decisions and it is impossible to make claims on the correctness of the chemistry (we believe that for many organics this is > 99% but we have no formal metrics).

The initial reason for exposing CrystalEye was (a) because Nick has created a valuable resource in its own right (b) as an exemplar of Open Data. We are happy for anyone to do whatever they wish (subject to acknowledging us) but we make no claims for the data or its value.

While doing this we (mainly Nick, Jim and me) realised that the architecture of CrystalEye - based loosely on the filing system - was both simple and robust and was a lightweight alternative to the use of formal databases. It allowed browsing, and we have been able to add derived data searches (such as for intermolecular distances). We also added a substructure search (OpenBabel) which works very well for the 100,000 structures and is a good example of OB's value. We also added an RSS feed and you can get daily updates of all new freely visible chemical crystallography (see the CrystalEye page for links). And more recently Andrew Walkingshaw has converted the CML into RDF and indexed it under the Talis Platform.

So we believe that CrystalEye is an exemplar of the future of chemical repositories. It manages some, but not all, of the complex ontological relations needed in modern chemistry. We are about to start on the Microsoft/Cornell/LANL project "OREChem" which explores how ORE/RDF can be used for aggregated resources and CrystalEye should provide a very good exemplar. We are also starting on the eCrystals project led by Southampton and will be making CrystalEye part of that.

We also see CrystalEye as a starting point for the Departmental or domain repository for chemistry, and perhaps more widely for long-tail scientific data. To that end we have three summer students working in this area:

  • to develop graphical authoring tools for crystallographic publication (and hopefully deposition) funded by the IUCr.
  • to refactor CrystalEye. (It's the first system and the architecture will be overhauled. For example there is no unique ID in the system other than the file name or URL. Depositor IDs are often a nightmare with weird characters. We also want to separate the derived data (e.g. bondlengths)
  • to build on SPECTRa and CrystalEye to populate a Department Repository with crystallographic data. We hope this is an exemplar of what a long-tail scientific repository should be like - it will be operated by the service provider and will develop social protocols such as embargoes based on the people in the department. We hope that a large fraction of the Cambridge output can be exposed to the world through a CrystalEye interface.

The primary motivation for CrystalEye is still a working research tool. For example in his work with MOPAC Nick has found that certain atom pairs are poorly parameterised. This is not news - Jimmy Stewart ("Mr MOPAC") is well aware that some pairs need improving and CrystalEye allows us to do this. Because Nick has created statistics on all bondlengths in CE (millions) the system can easily answer questions like "Is a Na-Na distance of 1.2 A ever found in crystals?" Answer: no. "Is an I-N bondlength of 2.6 A ever found?" Answer: yes. It's three simple clicks on the website.
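The kind of lookup described above can be sketched in a few lines. This is only an illustration of the idea, not CrystalEye's implementation: the `DistanceIndex` class, the tolerance of 0.05 A and the three sample distances are all invented for the example.

```python
from collections import defaultdict


def pair_key(a, b):
    """Order-independent key for an atom pair, so N-I and I-N are the same."""
    return tuple(sorted((a, b)))


class DistanceIndex:
    """Hypothetical in-memory index of observed interatomic distances (in A)."""

    def __init__(self):
        self._lengths = defaultdict(list)

    def add(self, a, b, length):
        self._lengths[pair_key(a, b)].append(length)

    def ever_found(self, a, b, length, tol=0.05):
        """True if any recorded distance for the pair lies within tol of length."""
        observed = self._lengths.get(pair_key(a, b), [])
        return any(abs(x - length) <= tol for x in observed)


# Invented observations, purely for illustration; these are not CrystalEye data.
index = DistanceIndex()
index.add("I", "N", 2.58)
index.add("N", "I", 2.31)
index.add("Na", "Na", 3.70)

print(index.ever_found("Na", "Na", 1.2))
print(index.ever_found("I", "N", 2.6))
```

With millions of bondlengths a real system would presumably precompute histograms per atom pair rather than scan lists, but the question being asked is the same.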

Finally there is the emerging concern over whether crystallographic data (a) should be and (b) is free and Open. There is no technical reason against this - the costs are so marginal that they are negligible. It's simply a question of allowing or requiring another piece of supplemental information. So here, in anticipation of the discussion are some pointers:

(Major) Publishers exposing all crystallographic information:

  • IUCr
  • RSC
  • ACS

Publishers not exposing crystallographic information:

  • Elsevier (except 1 journal)
  • Wiley
  • Springer

Open primary aggregations of crystallographic information:

  • PDB (Protein Data Bank)

Closed aggregations

  • CCDC (Organic)
  • ICSD (Inorganic)

I hope to be contacting some of the closed sites by Open letters through this blog. I will treat them courteously and publish replies in full. I'd be delighted if any of them wish to make their position clear ahead of time - please mail me directly as well as posting a comment (pm286). The question for publishers who do not expose crystallographic data is simple:
"You already have a mandatory requirement for the publication of crystallographic information. Please can you add this to your web site as supplemental information as you do for all other information such as spectra and synthesis".

If the answer is "yes, certainly" I shall be delighted.

Update and emphasis on publishing and Crystallography

I have been off air for a week part of which is because we have been concentrating on crystallography. The next few posts will cover various aspects of this subject - but there are many which are general enough that readers interested in Open Access, repositories, etc will hopefully find something useful. There are at least the following themes.

  • Nick Day is finishing writing up his thesis, part of which has been the compilation of CrystalEye to provide material for his thesis. I'll blog about some of this but the headline messages are (a) that CrystalEye is a research tool (b) without the Open Data on publishers' sites and COD we couldn't have done the work (c) the fundamental architecture of CML has allowed Nick and Joe Townsend to develop high-throughput computational chemical crystallography - we have done over 20,000 calculations.
  • Jim Downing has put in considerable work to create a subset of CrystalEye for Chemspider who now wish us to review their site: "The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site." I have stressed several times that Nick is writing his thesis and has no time to review commercial sites - however I will do so sometime in the next few posts.
  • SimBioSys, and others have blogged about the availability of data from the CCDC (post). This raises some extremely serious points about the closing of data meant to be public.
  • As part of our ongoing collaboration with the IUCr - International Union of Crystallography - we have a summer student working on authoring tools and we are very grateful for this (CrystalEye was catalysed by an IUCr-sponsored student - the original project was highly overambitious, but Nick was able to build on it for his own work). We also have two other students sponsored by the department to build and populate a Departmental Crystallographic Repository - again more on that later.

Naming things

Two things from today - a presentation by one of my colleagues (revealed later) on "Naming things" - or a similar title - which isn't yet public and they'll be giving at a meeting next week. I won't give it away, except to say that they presented a beautiful and thoughtful talk.

Names are hard.

And a useful reply (to my post Chemical names and structures continued) from Rich Apodaca about the problem of Chemical Abstracts Services (CAS) identifiers. Identifiers are names, with the additional facet that they are usually controlled by an authority and may also (possibly) have IP restrictions (i.e. be copyrighted).

Peter, there is no authority on CAS numbers other than CAS itself. No third party, no matter how carefully they attempt to do it, can create an authoritative version of any subset of the CAS database. It’s impossible by definition.

PMR: I have always asserted exactly this. CAS has a right to develop whatever system of identifiers it likes. Only CAS can give this authority. I imagine there have been court cases where the identity of a compound depended inter alia on CAS identifiers and an expert witness could only use CAS to support evidence.

Having said that, there is a widespread need to use CAS numbers outside of the CAS Registry system. The question is how to best do this.

If we can’t have authority, at least we can use provenance. This is why Chempedia doesn’t merely give the CAS number associated with a structure (or vice versa), it gives the name and URL of the organization making the assertion. It’s a ternary, not binary relationship (cas number-structure-organization).

PMR: I fully agree with this analysis. However the assertion is often originally made by an organization which isn't mentioned and copying (including errors) is common.

And what you see is broad consensus on certain cas number/structure mappings, and disagreement on others.

Consider caffeine:

and cyclosporin:

as polar opposites in terms of consensus. If Chempedia had tried to be intelligent and hide the discrepancies, users would be misled. Recording and displaying the provenance of the data in plain sight lets humans judge for themselves what’s really happening.
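The ternary relationship (cas number-structure-organization) and the idea of showing consensus rather than hiding disagreement can be sketched as follows. This is not Chempedia's code; the `Assertion` record, the `consensus` function and all the placeholder values ("CAS-A", "structure-1", "source-X") are hypothetical and stand in for real registry numbers, structure keys and organizations.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One source's claim that a CAS number maps to a structure."""
    cas_number: str
    structure: str  # any structure key, e.g. an InChI or SMILES string
    source: str     # who is making the assertion


def consensus(assertions, cas_number):
    """Count how many distinct sources back each structure for one CAS number."""
    seen = {(a.structure, a.source) for a in assertions if a.cas_number == cas_number}
    return Counter(structure for structure, _ in seen)


# Placeholder data only: two sources agree on CAS-A and disagree on CAS-B.
assertions = [
    Assertion("CAS-A", "structure-1", "source-X"),
    Assertion("CAS-A", "structure-1", "source-Y"),
    Assertion("CAS-B", "structure-2", "source-X"),
    Assertion("CAS-B", "structure-3", "source-Y"),
]

print(consensus(assertions, "CAS-A"))
print(consensus(assertions, "CAS-B"))
```

Counting sources does not distinguish independent agreement from copying, so a count like this is a guide to what is being asserted, not a measure of truth.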

PMR: again agreed. However we must beware that consistency is not due to copying. Note that caffeine is the only compound out of 30 million+ that can be obtained freely from the CAS website.

CAS numbers are really nothing more than a name assigned to a structure. Like all names, it’s an opinion with the twist that in this case one organization (CAS) is always right. But the minute I communicate the answer CAS gives me to someone else, the information becomes unreliable. Recording the source of the CAS number/structure mapping is one way to determine reliability. It’s not a yes/no answer, but some shade of gray.

Not the way I’d prefer to have things work, but it’s what really happens. Designers of information systems need to factor this into the systems they create.

Any ideas on other ways to address this problem?

PMR: There is the additional problem that copying CAS numbers and making assertions about them could lead to copyright problems with CAS. So that is a major disincentive to a solution. There are many identifiers which are "CAS" but where the letters "CAS" do not appear - possibly for fear of copyright.

By far the best solution would be for CAS to take an Open approach to its CAS identifiers. This is the spirit of the current century. The major value of CAS - and other -  identifiers is for common compounds - to give names to chemicals that cannot be easily identified by other means, or where there are confusing names or chemical formulae. We don't really need a CAS number for water (H2O) - but we do benefit from an identifier for glucose.

It was good to see CAS allowing Wikipedians to use Scifinder (to which they subscribe) to check for CAS numbers against Wikipedia chemicals. (Note that Wikipedia is name-based, not identifier-based). I think it would enhance their business, as well as their standing, if they were to offer free lookup - and free re-use - for, say, 500,000 common chemicals. (Of course I'd like the whole lot, but let's start somewhere). This would require courage but would reinforce the authority that CAS already offers.

The alternative is that CAS numbers will continue to be spread around the web in a way that degrades their authenticity. Or that some other, more Open, authority will develop. That looks impossible, but so did Wikipedia.

Chemical names and structures continued

Antony Williams, Rich Apodaca and I have been having a debate on our blogs about how to identify chemicals (I choose this word so as not to be too specific) to a machine. Antony has a long and detailed response (Is there 100% in chemical names and compounds?) and Rich has commented (see comment there). Here are some points, but if you are interested in the machine aggregation of information on chemicals, read them in full.

CS: Recently I posted on whether or not there is “a right structure for a compound”. I talked about trade names and registered chemical entities and posited the question regarding “whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure.”

There were two responses…

1) Rich Apodaca commented: “you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.”


I, like Rich, am of the opinion that a CAS Number does uniquely identify a specific chemical entity, not necessarily a unique structure. Of course, CAS numbers can be confusing too as I have commented here. Aspirin, for example, has 6 CAS numbers! So Rich and I agree on this…can anyone from CAS confirm or not whether our belief is right?


[illustration of various names snipped...]

We are in absolute agreement about this issue. The names are not identical. One declares stereo and the other doesn’t. The question then is what synonyms are useful to the user of ChemSpider to locate the structure if they have a systematic name. One might assume that the more the merrier. There is an enormous number of variants of bracket styles and dashes that could give rise to probably dozens of names that are all consistent with the structure and the names shown come from different sources.

PMR: For certain purposes, it is valuable to collect as many names as possible, for example for location or lookup. But these should be accompanied by metadata. A similar example is:

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here:

Looks fine in my browser and pasted in here too: N-{2-[({5?-[(diméth?ylamino)m?éthyl]fur?an-2-yl}m?éthyl)sul?fanyl]éth?yl}-N’-mé?thyl-2-ni?troéthène?-1,1-diam?ine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example what does "pain" mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn't when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.
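The effect of the language qualifier can be shown with a minimal sketch. It does not use an RDF library; the `LangLiteral` class, the `interpret` function and the two-entry glossary are assumptions made up for the illustration, and only the "text"@lang notation is borrowed from RDF.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangLiteral:
    """A string plus an optional language tag, written RDF-style as "text"@lang."""
    text: str
    lang: Optional[str] = None

    def __str__(self):
        return f'"{self.text}"@{self.lang}' if self.lang else f'"{self.text}"'


# Tiny hypothetical glossary keyed by (word, language).
GLOSSES = {
    ("pain", "en"): "physical suffering",
    ("pain", "fr"): "bread",
}


def interpret(literal):
    """Return a gloss only when the language is known; refuse to guess otherwise."""
    if literal.lang is None:
        return None
    return GLOSSES.get((literal.text, literal.lang))


for lit in (LangLiteral("pain", "en"), LangLiteral("pain", "fr"), LangLiteral("pain")):
    print(lit, "->", interpret(lit))
```

The untagged literal is deliberately left uninterpreted: a machine that silently assumed English would be making exactly the mistake described above.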

CS: I look forward to seeing how Zantac and Ranitidine are handled in this new world - if it’s a structured ontology then it sounds like an integration of MeSH with structures? Wikipedia is over 5000 organics now and is the culmination of thousands of hours of work by many dedicated individuals. And is not error-free. Any other efforts will be prone to similar issues so it’s going to be a major undertaking and I look forward to the results. The ChEBI team are already doing a good job in this area. You can see an ontology Tree View here. So, I’m definitely excited to see what will be better! Exciting times.

PMR: We spent some time yesterday discussing our ontology for chemicals, which covers many of these points. It is not trivial to build one and not surprisingly we argue. I like Tails' ratio of 75% arguing and 25% building - that's certainly the position with ontologies. Rich Apodaca commented on the discussion:

Tony, you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.

But in practice (in databases, Excel spreadsheets, books, reviews, peer-reviewed articles, etc.), you’d find some disagreement about the structure that a particular identifier should be linked with, and vice-versa.

The disagreements would range from the baffling (completely wrong structure) to the annoying (wrong stereochemistry) to the amusing (ionized carboxylate vs. protonated).

For databases that aggregate content from diverse sources, the best practice may be to model this situation with a many-to-many relationship, rather than a one-to-one or even one-to-many.

In other words, CAS numbers, trade names, and IUPAC names may be better modeled as social networking-style tags than as unique identifiers. I’m not saying this is the way things should be - just that this is how situation appears to have evolved.
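A many-to-many model of identifiers and structures might look like the sketch below, using Python's built-in sqlite3 module. The table names, column names and placeholder rows are all hypothetical; this is one possible schema, not the one used by Chempedia or any other aggregator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identifier (
        id INTEGER PRIMARY KEY, kind TEXT, value TEXT, UNIQUE (kind, value)
    );
    CREATE TABLE structure (
        id INTEGER PRIMARY KEY, label TEXT UNIQUE
    );
    -- The link table carries the tag: which source attached which identifier
    -- to which structure. Neither side is forced to be unique.
    CREATE TABLE link (
        identifier_id INTEGER REFERENCES identifier(id),
        structure_id  INTEGER REFERENCES structure(id),
        source        TEXT,
        PRIMARY KEY (identifier_id, structure_id, source)
    );
""")


def link(kind, value, structure_label, source):
    """Record that a source attaches an identifier to a structure."""
    conn.execute("INSERT OR IGNORE INTO identifier (kind, value) VALUES (?, ?)",
                 (kind, value))
    conn.execute("INSERT OR IGNORE INTO structure (label) VALUES (?)",
                 (structure_label,))
    conn.execute(
        """INSERT OR IGNORE INTO link
           SELECT i.id, s.id, ? FROM identifier i, structure s
           WHERE i.kind = ? AND i.value = ? AND s.label = ?""",
        (source, kind, value, structure_label),
    )


def structures_for(kind, value):
    """All structures that any source has attached to this identifier."""
    rows = conn.execute(
        """SELECT DISTINCT s.label FROM structure s
           JOIN link l ON l.structure_id = s.id
           JOIN identifier i ON i.id = l.identifier_id
           WHERE i.kind = ? AND i.value = ? ORDER BY s.label""",
        (kind, value),
    )
    return [r[0] for r in rows]


# Placeholder rows: one trade name tagged onto two structures by two sources.
link("trade_name", "name-1", "structure-A", "source-X")
link("trade_name", "name-1", "structure-B", "source-Y")
link("cas", "number-1", "structure-A", "source-X")

print(structures_for("trade_name", "name-1"))
```

Keeping the source on the link row means a disagreement is stored as two rows rather than resolved by overwriting one of them.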

See this article, which discusses the problem as it applies to CAS numbers used in the wild and how Chempedia addresses it:

PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide).

We have to use authorities (provenance) in our information. Thus the statements:

Ranitidine is the Z-isomer


Ranitidine is the E-isomer

may be seen as contradictory. That's why people have suggested that RDF should have quads, not triples, such as

Antony_Williams asserts ranitidine hasIsomer Z

Wikipedia asserts ranitidine hasIsomer E

Both these are true. That is the language we should use in the semantic web.

PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.
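The difference between triples and quads can be made concrete with a short sketch. The two assertions are the ones given above; the `Quad` type and the helper functions are hypothetical and are not taken from any RDF toolkit.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    """A triple plus the authority asserting it."""
    source: str
    subject: str
    predicate: str
    obj: str


quads = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def values_by_source(quads, subject, predicate):
    """Who says what: each assertion stays attached to its source."""
    return {q.source: q.obj for q in quads
            if q.subject == subject and q.predicate == predicate}


def as_triples(quads):
    """Dropping the source leaves bare triples, which now look contradictory."""
    return {(q.subject, q.predicate, q.obj) for q in quads}


print(values_by_source(quads, "ranitidine", "hasIsomer"))
print(sorted(as_triples(quads)))
```

As quads, both statements are true records of who asserted what; as bare triples, a machine sees only that ranitidine has isomer Z and isomer E.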

There is no “right structure” for a compound

A few days ago I promised to respond to Antony Williams' post on associating chemical names with structures. I wrote:

“There is no “right structure (sic)” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”

I still hold by this statement and Antony's post reinforces my view. I'll post most of his and comment... [There is a question at the end that I'd like readers to comment on].

05:37 07/06/2008, Antony Williams,

I refer you back to the original post from which this comment was made as it is taken from a specific context.

Is this [PMR above] a true statement? In many cases I would agree but I have my own opinion in specific cases and let’s focus on the drug industry for a moment and trade names. First, let’s talk about me..and my identifiers. Depending who’s talking about me I am Tony, Antony, Dr Williams, Mr Williams, Dad, sweetheart, son, Tone, AJ, Bro’ and so on. However I am registered with a social security number and exist as a legal entity, a “registered” entity.

PMR: Although humans are peripheral to this discussion, it's actually very difficult to associate a human with identifiers. The UK is spending zillions of pounds on trying to do this and requiring everyone to have identity cards. They can be forged. They'll probably need to brand us with a number, and have us rebranded every year in case we try to laser it off.

CS: Now, Zantac is a registered trade name for the chemical here.

PMR: This points to a page in Chemspider which I shall refer to as page571454 for simplicity of dialog. It contains the header:

ChemSpider ID: 571454
Empirical Formula: C13H22N4O3S
Molecular Weight: 314.4038
Nominal Mass: 314 Da
Average Mass: 314.4038 Da
Monoisotopic Mass: 314.141261
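The masses in this header follow directly from the empirical formula, and the arithmetic can be checked with a short sketch. The atomic weights and isotope masses below are commonly tabulated values; that ChemSpider used this particular table is an assumption, and the function names are invented for the example.

```python
import re

# Commonly tabulated average atomic weights (assumed table, see note above).
AVERAGE = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994, "S": 32.065}

# Masses of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Split a simple formula such as C13H22N4O3S into element counts."""
    counts = {}
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(n or 1)
    return counts


def mass(formula, table):
    """Sum of (atomic mass x count) over the elements in the formula."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


formula = "C13H22N4O3S"
print("average:     ", mass(formula, AVERAGE))
print("monoisotopic:", mass(formula, MONOISOTOPIC))
```

With these tables the average mass agrees with the quoted 314.4038 to four decimal places and the monoisotopic mass agrees with 314.141261 to within about one part in the sixth decimal place. The parser handles only flat formulae without brackets or charges.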

CS: I am not an expert in the registration process but I believe that somewhere along the line a defined chemical entity is associated with that name. Whether the chemical entity has been appropriately elucidated by analytical technologies or not is a different question. What is registered as a compound, and associated with the name, is what that name defines.

PMR: I am not currently an expert in registration, but at one stage I worked closely with authorities such as FDA and WHO on registration of drugs so my comments may be out of date. I and colleagues also worked for several years on the structure of "ranitidine" - I'll clarify later.

CS: Now, there are a whole series of other names for the same compound - registry numbers, systematic names, organization numbers. See below

PMR: I will leave these here, and also add some from page571454:

Ranitidine [Wiki]

1,1-Ethenediamine, N-[2-[[[5-[(dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-, (Z)-

128345-62-0 [RN]

266-332-5 [EINECS]

66357-59-3 [RN]

GR 122311X

Ranitidine Base

PMR: The first point is that these are NOT exact synonyms. It is clear that




are not identical. One describes a compound whose stereochemistry is asserted, the other describes one where the stereochemistry is not asserted. Butene and 1-butene and 2-butene and (Z)-butene are all different. They all have different InChIs. Some of them may refer to the same concept in some contexts, but they are not synonyms. Fowler (Modern English Usage) says "perfect synonyms are extremely rare".

This is not nit-picking or logic chopping. If we are representing something in a machine, and we assert the two are to be used interchangeably then we have to be very sure that they can be. Adding a "(Z)" may appear a reasonable thing to do - in this case it is a disastrous act that corrupts information (I'll leave that till the next post).

The robotic aggregation of chemical names and identifiers, if done without metadata and ontology, corrupts information. That's a strong statement, but we can see it in the current case. First there is junk out there. Robotic name harvesting harvests junk. (Christoph Steinbeck described it in worse terms at the RSC meeting.) Here's a snip from page571454:

Validated by Experts, Validated by Users, Non-Validated, Removed by Users, Redirected by Users, Redirect Approved by Experts

Ranitidine [Wiki]



The "?" characters show up in my browser - I don't know what they are, but they are not normal "e"s (ASCII 101). The first name is not a synonym - I'm sorry, but it's junk. Associating junk with good information degrades the good information rather than increasing the quality of the junk (There is a more formal proof somewhere by Shannon - I believe - that machines cannot act as 100% proofreaders).

CS: I think that the Trade Name for a compound is definitive since its registered. Relative to the statement “There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”…my question is whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure. Thoughts anyone?

PMR: A trade name represents a product, not a compound and certainly not a connection table. In some cases it may refer to a pure substance, which itself is describable by a connection table, but these are not synonyms. And aggregating them as synonyms adds error rather than clarity.

However there is an even stronger reason why "Zantac" does not describe ranitidine. See the FDA page.

Zantac (Ranitidine Hydrochloride) Tablets

Zantac contains (not "is") ranitidine hydrochloride. Ranitidine is not ranitidine hydrochloride, any more than ammonia is ammonium chloride. Listing them together under synonyms corrupts information.

You may argue that an intelligent chemically educated chemist will know the difference and that may be true. But the current aggregations of chemicals (Chemspider, eMolecules, Chempedia) are designed for use by machines as well as humans.

And unless high-quality metadata is given, along with a structured ontology then machine aggregation of chemistry corrupts rather than enhances.

For that reason we are building molecular repositories based on metadata and ontologies. In the current era of the web it's becoming essential.

Now, I suggested that the "(Z)" should not have been added to "ranitidine" to indicate the stereochemistry. You can find pages out there with "(E)". What is the "correct structure"? Or is this a meaningless question?