I shall be writing a number of posts about (chemical) crystallography – which may be of wider interest to those concerned with data quality assessment, robotic harvesting, robotic calculation, hyperlinking, repositories and free access to scientific data. I’ll start by talking about CrystalEye – what it is and where it may be going.
We are generally interested in data-driven, or data-enabled, science in the scientific “long-tail”. Can machines extract useful information from the heterogeneous mass of data that increases daily? And – because we are chemists – we have chosen to do this in chemistry, although it has serious problems of restrictive access to data. The area which has turned out to be most fruitful is chemical crystallography – the determination of the structures of “small molecules” by diffraction methods. In this we pay great tribute to the International Union of Crystallography (IUCr), which is probably unsurpassed in its commitment to data quality and data preservation. Moreover they are delightful people to work with.
The basic questions included:
- Can machines aggregate enough public data to be useful? (We did not wish to use publisher-firewalled data in case of legal threats.) The answer is definitely yes (10 years ago it would probably have been no). The method of aggregation is to spider/scrape the websites of publishers who expose the crystallographic data submitted by authors as supplemental data. (More later.)
- Is the data of high enough quality to do useful work with? This is difficult to answer without a lot of work, and that work has been put in by Joe Townsend, Nick Day, Mark Holt, Jim Downing and myself, with input from colleagues and the IUCr. We have taken as one measure the syntactic quality of the data – and here different sites are very different. Acta Cryst in 2008 is excellent, as is RSC. The Crystallography Open Database (COD) is very variable, with a small fraction of questionable material. (I should praise the COD for their commitment to Open Data, and COD remains the unique source for many inorganic structures. But the quality is dependent on what the depositors submit, whereas IUCr operates strict quality checks.) Ideally we would not wish to make many subjective judgments, and we have metrics on the syntactic quality in some journals. Publishing these may be problematic in case we encounter a legal response from publishers.
- Is the data of scientific value? Does the automatic use of crystal structures provide enough information that can be extrapolated to chemistry? This is a field that people such as Jack Dunitz, Hans Buergi, Sam Motherwell and myself helped to start in the late 1970s and the answer is generally “yes”. Before embarking on this, however, we have undertaken an intensive program to determine whether the information in any given structure “agrees” with other relevant data. We have done this by (a) abstracting individual molecules and doing high-quality QM calculations (Joe Townsend) and (b) calculating the complete crystal structure (MOPAC, Nick Day). These studies provide a great deal of information about errors in the data and problems in the calculation method. Joe has produced a protocol which can be used to determine whether a structure might be used for accurate work. In general, for non-metal compounds it is the crystallography which provides more variance. For the MOPAC calculations there is much more variance in the calculations and relatively little concern with crystallographic errors (though the program has discovered a few).
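To make the spider/scrape step in the first question a little more concrete, here is a minimal sketch of one small part of it: pulling links to CIF files out of a publisher's HTML page. It is not the CrystalEye harvester. The class and function names, the `.cif` suffix test and the `example.org` URL are all assumptions made for illustration, and a real harvester would also need to respect each site's terms and robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CifLinkParser(HTMLParser):
    """Collect href targets of <a> tags that look like CIF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: supplemental crystallographic files end in ".cif"
        # (ignoring any query string). Real sites may differ.
        if href and href.lower().split("?")[0].endswith(".cif"):
            self.links.append(urljoin(self.base_url, href))


def find_cif_links(html, base_url):
    """Return absolute URLs of all CIF-like links found in an HTML string."""
    parser = CifLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical table-of-contents fragment; example.org is a placeholder.
sample = '<a href="/suppl/ab1234.cif">CIF</a> <a href="/paper.pdf">PDF</a>'
print(find_cif_links(sample, "https://example.org/issue/1"))
```

Only the standard library is used here, so the sketch can be tried on any saved table-of-contents page before deciding whether a heavier scraping framework is needed.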
Nick Day chose to aggregate the whole of the visible crystallographic web and this now runs to ca 100,000 entries. (It is difficult to decide what a “structure” is – there are ca 80 different entries for SiO2 (silica) under different conditions, although most substances only have one entry.) He chose to use CML as the primary mechanism for holding the information and it would have been impossible to do the work without this. CML is lossless and is now also integrated into a number of computational chemistry programs.
We chose to expose the aggregated data to the world as “Open Data” since we feel it is fundamentally Open. As we are in the business of creating semantic chemistry we have also created RDF tools which help support it. Since we are also interested in variability of molecular structure we have computed chemical concepts from it – these are necessarily heuristic. The original data contains no explicit chemistry such as connection tables (though we are working with the IUCr on this) and this is the primary motive for adding concepts such as chemical bonds, stereochemistry, bond orders, InChI and SMILES. However there are likely to be arbitrary decisions and it is impossible to make claims on the correctness of the chemistry (we believe that for many organics this is > 99% but we have no formal metrics).
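As an illustration of why derived chemistry is necessarily heuristic, here is a minimal sketch of one common way to guess bonds from coordinates alone: two atoms are called bonded if their distance is no more than the sum of their covalent radii plus a tolerance. This is a generic textbook heuristic, not necessarily the rule CrystalEye uses. The function name, the 0.4 Å tolerance and the tiny radius table are assumptions, and the sketch takes Cartesian coordinates in Å and ignores symmetry and unit-cell neighbours.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (small illustrative table).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

# Assumed tolerance in angstroms; changing it changes which contacts count as bonds.
TOLERANCE = 0.4


def perceive_bonds(atoms):
    """Return index pairs (i, j) of atoms judged to be bonded.

    atoms is a list of (element_symbol, (x, y, z)) in Cartesian angstroms.
    """
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# A rough water geometry, for demonstration only.
water = [
    ("O", (0.00, 0.00, 0.00)),
    ("H", (0.96, 0.00, 0.00)),
    ("H", (-0.24, 0.93, 0.00)),
]
print(perceive_bonds(water))
```

The arbitrary decisions mentioned above show up directly in the code: the choice of radii and tolerance decides whether a long contact is reported as a bond, which is why no formal claim of correctness can be attached to the result.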
The initial reason for exposing CrystalEye was (a) because Nick has created a valuable resource in its own right (b) as an exemplar of Open Data. We are happy for anyone to do whatever they wish (subject to acknowledging us) but we make no claims for the data or its value.
While doing this we (mainly Nick, Jim and me) realised that the architecture of CrystalEye – based loosely on the filing system – was both simple and robust and was a lightweight alternative to the use of formal databases. It allowed browsing, and we have been able to add derived data searches (such as for intermolecular distances). We also added a substructure search (OpenBabel) which works very well for the 100,000 structures and is a good example of OB’s value. We also added an RSS feed so that you can get daily updates of all new freely visible chemical crystallography (see the CrystalEye page for links). And more recently Andrew Walkingshaw has converted the CML into RDF and indexed it under the Talis Platform.
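To show what a repository “based loosely on the filing system” can look like in practice, here is a minimal sketch in which the directory hierarchy itself is the index. The layout (publisher/journal/year/issue/entry), the function names and the sample identifiers are all hypothetical and are not taken from the CrystalEye code.

```python
import tempfile
from pathlib import Path


def store_entry(root, publisher, journal, year, issue, entry_id, cml_text):
    """Write one CML document into a directory tree that doubles as the index."""
    entry_dir = Path(root) / publisher / journal / str(year) / str(issue) / entry_id
    entry_dir.mkdir(parents=True, exist_ok=True)
    path = entry_dir / f"{entry_id}.cml"
    path.write_text(cml_text, encoding="utf-8")
    return path


def list_entries(root, *prefix):
    """Browse by walking the tree below any prefix, e.g. (publisher, journal)."""
    base = Path(root).joinpath(*map(str, prefix))
    return sorted(p.stem for p in base.rglob("*.cml"))


# Demonstration in a throwaway directory; all names are placeholders.
with tempfile.TemporaryDirectory() as demo_root:
    store_entry(demo_root, "pub-a", "journal-x", 2008, "06", "ab1234", "<cml/>")
    store_entry(demo_root, "pub-a", "journal-x", 2008, "07", "cd5678", "<cml/>")
    store_entry(demo_root, "pub-b", "journal-y", 2008, "01", "ef9012", "<cml/>")
    print(list_entries(demo_root, "pub-a", "journal-x"))
```

Browsing is just directory listing, and derived data can sit in further files beside each entry, which is part of the appeal over a formal database for a system of this size.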
So we believe that CrystalEye is an exemplar of the future of chemical repositories. It manages some, but not all, of the complex ontological relations needed in modern chemistry. We are about to start on the Microsoft/Cornell/LANL project “OREChem”, which explores how ORE/RDF can be used for aggregated resources, and CrystalEye should provide a very good exemplar. We are also starting on the eCrystals project led by Southampton and will be making CrystalEye part of that.
We also see CrystalEye as a starting point for the Departmental or domain repository for chemistry, and perhaps more widely for long-tail scientific data. To that end we have three summer students working in this area:
- to develop graphical authoring tools for crystallographic publication (and hopefully deposition) funded by the IUCr.
- to refactor CrystalEye. (It’s the first system and the architecture will be overhauled. For example there is no unique ID in the system other than the file name or URL. Depositor IDs are often a nightmare with weird characters. We also want to separate the derived data (e.g. bondlengths).)
- to build on SPECTRa and CrystalEye to populate a Department Repository with crystallographic data. We hope this is an exemplar of what a long-tail scientific repository should be like – it will be operated by the service provider and will develop social protocols such as embargoes based on the people in the department. We hope that a large fraction of the Cambridge output can be exposed to the world through a CrystalEye interface.
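The ID problem mentioned in the refactoring item above can be sketched as follows: clean the depositor's identifier so it is safe in file names and URLs, then append a short hash of the source URL so that two entries with the same depositor label remain distinct. This is one possible scheme under stated assumptions, not the planned CrystalEye design. The function names, the allowed character set and the 12-character digest length are hypothetical.

```python
import hashlib
import re


def safe_name(depositor_id):
    """Replace runs of characters outside [A-Za-z0-9._-] with one underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", depositor_id).strip("_.")
    return cleaned or "unnamed"


def entry_id(source_url, depositor_id):
    """Build a readable, deterministic ID from the source URL and depositor ID."""
    payload = f"{source_url}\n{depositor_id}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"{safe_name(depositor_id)}-{digest}"


# Hypothetical depositor label with awkward characters; URLs are placeholders.
label = "comp 1/a (β-form)"
print(safe_name(label))
print(entry_id("https://example.org/suppl/ab1234.cif", label))
print(entry_id("https://example.org/suppl/cd5678.cif", label))
```

Keeping the cleaned depositor label in the ID preserves some human readability, while the hash suffix carries the uniqueness that the file name or URL carries today.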
The primary motivation for CrystalEye is still as a working research tool. For example, in his work with MOPAC Nick has found that certain atom pairs are poorly parameterised. This is not news – Jimmy Stewart (“Mr MOPAC”) is well aware that some pairs need improving, and CrystalEye allows us to do this. Because Nick has created statistics on all bondlengths in CE (millions), the system can easily answer questions like “Is a Na-Na distance of 1.2 Å ever found in crystals?” Answer: no. “Is an I-N bondlength of 2.6 Å ever found?” Answer: yes. It’s three simple clicks on the website.
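The kind of question above (“is this distance ever found for this atom pair?”) can be answered from a simple index of sorted bond lengths per element pair. The sketch below is only an illustration of the idea. The class name, the 0.05 Å default tolerance and above all the three sample distances are invented for the demonstration and are not CrystalEye data.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class BondLengthIndex:
    """Sorted bond lengths (in angstroms) keyed by unordered element pair."""

    def __init__(self):
        self._lengths = defaultdict(list)

    @staticmethod
    def _key(el_a, el_b):
        # Sort the symbols so that ("I", "N") and ("N", "I") share one key.
        return tuple(sorted((el_a, el_b)))

    def add(self, el_a, el_b, length):
        insort(self._lengths[self._key(el_a, el_b)], length)

    def observed(self, el_a, el_b, length, tol=0.05):
        """True if any stored length lies within tol of the requested length."""
        lengths = self._lengths.get(self._key(el_a, el_b), [])
        i = bisect_left(lengths, length - tol)
        return i < len(lengths) and lengths[i] <= length + tol


# Invented sample values, for demonstration only.
index = BondLengthIndex()
index.add("I", "N", 2.58)
index.add("N", "I", 2.63)
index.add("Na", "Na", 3.72)

print(index.observed("Na", "Na", 1.2))
print(index.observed("I", "N", 2.6))
```

Because each list is kept sorted, a lookup is a binary search rather than a scan, which matters once the index holds millions of bond lengths.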
Finally there is the emerging concern over whether crystallographic data (a) should be and (b) is free and Open. There is no technical reason against this – the marginal costs are negligible. It’s simply a question of allowing or requiring another piece of supplemental information. So here, in anticipation of the discussion, are some pointers:
(Major) Publishers exposing all crystallographic information:
- IUCr
- RSC
- ACS
Publishers not exposing crystallographic information:
- Elsevier (except 1 journal)
- Wiley
- Springer
Open primary aggregations of crystallographic information:
- PDB (Protein Data Bank)
Closed aggregations
- CCDC (Organic)
- ICSD (Inorganic)
I hope to be contacting some of the closed sites by Open letters through this blog. I will treat them courteously and publish replies in full. I’d be delighted if any of them wish to make their position clear ahead of time – please mail me directly as well as posting a comment (pm286). The question for publishers who do not expose crystallographic data is simple:
“You already have a mandatory requirement for the publication of crystallographic information. Please can you add this to your web site as supplemental information as you do for all other information such as spectra and synthesis”.
If the answer is “yes, certainly” I shall be delighted.