Egon Willighagen asks about collaborating on Nick Day’s CrystalEye knowledgebase. I reproduce his post – note that CrystalEye should be able to provide many examples to increase the size of data sets – and then discuss some of the advantages of using CrystalEye over conventional closed databases.
Automatic Classification of thousands of Crystal Structures
Clustering and classification of crystal structures is hot. Parkin hit the front cover of CrystEngComm with a story on Comparing entire crystal structures: structural genetic fingerprinting (DOI:10.1039/b704177b). Now, the story itself, while rather interesting and well written, has three major flaws:
- the data set is way too small
- the proposed proof-of-concept is not novel at all
- they do not cite me
Well, the latter sounds a bit boohoo, and it is 🙂 (BTW, I do like this paper.)
They propose the work as proof-of-concept, but use a very artificial data set of only 12 crystal structures (benzene and eleven polycyclic aromatic hydrocarbons, like naphthalene, anthracene, phenanthrene, triphenylene, pyrene, perylene, and coronene). While such a small set does make a nice example where you can still list all similarities (0.5*N*(N-1)), it is really too artificial.
Now, you may wonder if I am in the position to criticize this shortcoming, but I think I am. As part of my PhD work, I analyzed this problem myself, and two years ago published the paper Method for the computational comparison of crystal structures (DOI:10.1107/S0108768104028344). Apparently, Parkin was not aware of this publication and did not cite it. I should have gone to a crystallography conference with a poster, and advertised my work more. In this paper, I analyzed a data set with 48 crystal structures, manually validated by visual inspection, which meant comparing 1128 (!) crystal structure pairs. Took me two full weeks behind a Silicon Graphics. Yes, I really understand why they took only 12 structures 🙂
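As a quick check of the pair counts quoted above, here is a minimal Python sketch of the 0.5*N*(N-1) formula. The function name `n_pairs` is made up for illustration and is not taken from either paper.

```python
from itertools import combinations
from math import comb


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n structures: 0.5 * n * (n - 1)."""
    return n * (n - 1) // 2


for n in (12, 48):
    # Cross-check the formula against math.comb and explicit enumeration.
    enumerated = sum(1 for _ in combinations(range(n), 2))
    assert n_pairs(n) == comb(n, 2) == enumerated
    print(n, "structures ->", n_pairs(n), "pairs")
```

For 12 structures this gives 66 pairs, and for 48 structures it gives the 1128 pairs mentioned above. The quadratic growth is why manual validation stops being practical very quickly.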
However, there is more prior art. While my approach was based on a new radial distribution function-based whole crystal structure descriptor, my supervisor (Ron) used the more common powder diffraction pattern. In Representing Structural Databases in a Self-Organising Map (DOI:10.1107/S0108768105020331) he showed it to be a good enough descriptor for clustering thousands of crystal structures using a self-organizing map (SOM).
Last week, my second paper in crystallography appeared: Supervised Self-Organizing Maps in Crystal Property and Structure Prediction (DOI:10.1021/cg060872y). In this paper, we show how supervised SOMs (see DOI:10.1016/j.chemolab.2006.02.003) can be used for supervised classification and even for property prediction. Note that these supervised SOMs are truly
Finally, another advantage of this last work: the code is open source. The code for the unsupervised SOMs is available as R package: kohonen; and for powder diffraction patterns: wccsom. Details can be found in this R News issue. The first package is not actually limited to crystal structures, and can be used for any clustering problem. However, the articles mentioned here make use of simulated diffraction patterns, and I am not sure there are open source tools to generate those.
BTW, I would still be interested in teaming up with CrystalEye in one way or another, and couple these data analysis methods to live streams of new crystal structures. Nick, let me know if you are interested in exchanging ideas.
Getting back to Parkin’s paper, I do like the work. Hirshfeld surfaces are an interesting tool to visualize packing characteristics, and using them to describe a crystal structure sounds like an interesting idea indeed. I just hope that the method properly scales.
PMR: We would be delighted to see if CrystalEye can be used to help. A word of warning – Nick starts writing up at the end of this month and so the amount of effort is limited. However it would be great to be able to run – or re-run your study.
We see CrystalEye as part of the next generation of crystallographic knowledgebases. It already abstracts all current crystallography unless the publishers (Elsevier, Springer, Wiley in particular) prevent us doing this. It has several novel advantages over conventional compilations.
- It is free/Open to use in all senses
- it integrates inorganic and organic papers
- The complete experimental data (CIF) is available
- there are links back to the original publications
- it is available in RDF
- all new structures, and parts of structures, are available in RSS feeds
- it integrates the Crystallographic Open Database and identifies duplicates
- All data is available in XML-CML
- The chemistry has been automatically extracted and analysed giving a complete set of bond orders, charges where possible and chemical structure
- The structure can be decomposed into moieties (individual disjoint molecules or ions)
- each has a Jmol display with links to bond lengths, angles and torsions
- There will (soon) be complete histograms of all bond lengths with hyperlinks to all entries
- The major crystallography sources of “error” – disorder, constrained refinement, etc. are automatically identified.
Joe Townsend has developed a protocol whereby he can – with 99% accuracy – reliably identify those structures which have errors in coordinates less than 0.01 Angstrom. This is good enough for almost all modern crystal and chemical structure analysis. It means that chemical deductions can be reliably drawn from such structures without worrying that you are merely analysing experimental errors.
We have an automatic program of computing these structures by QM programs – GAMESS and MOPAC. These will initially identify any structures where computation and experiment disagree. In practice almost all the disagreements (for organic molecules) have been due to experiment, meaning that calculation is an effective means of checking the validity of structures.
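To make the idea of flagging disagreements concrete, here is a minimal Python sketch. It is not the CrystalEye or JUMBO code. The function name, the bond labels, the bond lengths and the 0.05 Angstrom tolerance are all invented for illustration.

```python
def flag_disagreements(experimental, computed, tolerance=0.05):
    """Return bonds whose experimental and computed lengths differ by more
    than `tolerance` (Angstrom). The default tolerance is an assumed value."""
    flagged = {}
    for bond, exp_length in experimental.items():
        if bond not in computed:
            continue  # no computed value to compare against
        difference = abs(exp_length - computed[bond])
        if difference > tolerance:
            flagged[bond] = round(difference, 3)
    return flagged


# Hypothetical bond lengths in Angstrom, invented purely for this example.
experimental = {"C1-C2": 1.39, "C2-O1": 1.10, "C1-N1": 1.47}
computed = {"C1-C2": 1.40, "C2-O1": 1.22, "C1-N1": 1.46}

print(flag_disagreements(experimental, computed))
```

With these made-up numbers only the C2-O1 bond is flagged, so a structure like this one would be set aside for a closer look at the experiment.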
There is a large, tested library of crystallographic software (CIF2CML and JUMBO) which deals with symmetry, geometry, bonding etc. This makes it easy to ask and answer many questions rapidly with small Java programs. Moreover CrystalEye has been translated to RDF so that the full power of the semantic web can be brought into play.
There are at least two funded collaborations just starting which will use CrystalEye and we have several offers of contributions from individuals and organisations (e.g. of theses). Please let us have ideas.