CrystalEye repository: technical aspects

There has been some confusion recently (post+comments, post+comments) about copying and redistributing CrystalEye. While some of this relates to legal, moral and ethical issues, there are major technical aspects that need to be understood. Here are some of them, leaving aside any “Open” issues.

  • CrystalEye is not a database; it is effectively a repository. We are developing this as a technology to support a number of in-house projects, and it is at an early stage. Jim is working actively on how to make the contents of a repository available, and at present he believes Atom is the best approach. We are therefore starting to mount Atom feeds for this purpose. Note that even repositories such as DSpace are not good at managing scientific data content: we put 150,000 calculations in the Cambridge DSpace and cannot extract them without writing code to access them. So this is a general problem with repositories, and we are in contact with others doing repository research.
  • CrystalEye is dynamic. It is updated every day. If someone starts spidering the complete content (probably > 1 million “files”, although “resources” is a better term) and is courteous (i.e. uses only a single thread with a delay between requests), it will probably take several weeks. At one request every two seconds, for example, a million resources take about 23 days. By that time the copy will be out of sync with the repository.
  • CrystalEye is complex, because combining chemistry with crystallography is complex. The raw material of CrystalEye is crystallography, not chemistry. Crystallography detects the position and nature of atoms (actually electrons, but atoms are the more useful concept here). It does not detect bonds (which are human concepts). It does not, except in the very best experiments, detect charges. So the primary data in CrystalEye is a list of atoms with fractional coordinates (not Cartesian coordinates).
  • There are several reasons why a simple list of atoms is not a good representation of chemistry. These include space-group translations, space group symmetry, disorder, special positions, partial occupancy, etc. It is a matter of judgement and heuristics as to whether these are present and how they map onto chemistry. CrystalEye is an experiment in high-throughput chemical heuristics in crystal structures.
  • CrystalEye uses these heuristics to “guess” bonds, bond orders and charges. We think it does a pretty good job, but it’s not perfect (we’d like to know where it fails). This is one of the main reasons for posting CrystalEye – how well does it work?
  • Many – if not most – crystals contain more than one “moiety”. Thus Na2SO4.10H2O contains 2 sodium cations, 1 sulfate anion and 10 water molecules. How we break this up affects what the InChI looks like. In this case we have probably got it right, but in many cases the InChI is a matter of judgement rather than fact. This is because chemistry is varied and complex.
  • Many scientists are interested in the crystal structure – the way the moieties pack together. Others are completely uninterested in this and only wish to know about individual moieties. We have to cater for all sorts of science – organic, metal organic, inorganic, materials science, nanotechnology, etc. All of these disciplines will want something completely different from a crystallographic repository.
  • Nick Day has also created a huge amount of derived data. The most obvious example is the fragments, where he has split the molecules into “natural” subunits. We expect these to be very useful to projects (such as Openbabel, FROG and BUSTR3D) that build 3D molecules from smaller fragments. In fact there are more fragments than entries.
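
To make the point about fractional coordinates concrete, here is a minimal sketch of the conversion from fractional to Cartesian coordinates. It is not CrystalEye code. The function name is hypothetical, and it assumes one common axis convention (a along x, b in the xy plane). It also ignores everything listed above that makes real structures hard: symmetry, disorder, special positions and partial occupancy.

```python
import math

def fractional_to_cartesian(frac, a, b, c, alpha, beta, gamma):
    """Convert fractional coordinates (u, v, w) to Cartesian (x, y, z).

    a, b, c are cell lengths; alpha, beta, gamma are cell angles in degrees.
    Assumed convention: a lies along x, b lies in the xy plane.
    """
    u, v, w = frac
    al, be, ga = (math.radians(t) for t in (alpha, beta, gamma))
    # Components of the b and c cell vectors in the Cartesian frame.
    bx, by = b * math.cos(ga), b * math.sin(ga)
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    # Position = u*a_vec + v*b_vec + w*c_vec.
    return (u * a + v * bx + w * cx, v * by + w * cy, w * cz)

# Illustrative cubic cell (made-up numbers, not a CrystalEye entry).
centre = fractional_to_cartesian((0.5, 0.5, 0.5), 10.0, 10.0, 10.0, 90, 90, 90)
```

For the cubic cell above the result is the cell centre, (5, 5, 5), to within floating-point rounding. For a non-orthogonal cell the same fractional coordinates give a quite different Cartesian position, which is why a bare list of fractional coordinates cannot be read as a molecule without the cell.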

So in extracting data from CrystalEye it is important to consider what the discipline is. I suspect that so far most of the requests have come from the molecular organic community, which often has a focus on drug design. They request “structures”, but there are no structures in CrystalEye, only entries with a variety of derived chemical concepts, some of which may be considered as “structures”. It is important to define precisely what is required before it can be provided.
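
As an illustration of why “structure” needs defining, here is a small sketch of how a single entry can yield several moieties once bonds have been guessed. It simply takes connected components of a bond graph. The function name and atom labels are hypothetical, and this is not how CrystalEye itself does the split; the real work lies in the heuristics that decide which bonds exist in the first place.

```python
from collections import defaultdict

def split_moieties(atoms, bonds):
    """Group atom labels into moieties (connected components of the bond graph)."""
    neighbours = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, moieties = set(), []
    for start in atoms:
        if start in seen:
            continue
        # Depth-first walk over everything bonded, directly or not, to start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            component.append(atom)
            for nxt in neighbours[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        moieties.append(sorted(component))
    return moieties

# Sodium sulfate without the waters: two cations and one anion (made-up labels).
atoms = ["Na1", "Na2", "S1", "O1", "O2", "O3", "O4"]
bonds = [("S1", "O1"), ("S1", "O2"), ("S1", "O3"), ("S1", "O4")]
moieties = split_moieties(atoms, bonds)
```

Here `moieties` holds three groups: each sodium on its own and the five atoms of the sulfate together. If the bond-guessing step had added an Na–O bond, the same entry would come out with fewer, larger moieties, and any identifier computed per moiety would change with it.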
We wish to make CrystalEye as useful as possible. Please remember, however, that it is the work of a single graduate student (Nick Day) who is now writing up. We are actively continuing to develop repository technology to fit CrystalEye, and Jim Downing will be blogging on this. We are also actively talking to three groups about sustainability, and we are thinking very hard about how to “copy” and “update” a dynamic repository. DSpace, Fedora and ePrints developers and managers will know that this isn’t easy; it’s a topic of research. It isn’t easy for CrystalEye either.
But it should be easy to use CrystalEye as installed here for the applications we have created (browse, bond search, substructure search and RSS feeds). If there are applications and extensions you are interested in we’d be delighted to know. You may have to write the code!
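
On the question of keeping a copy in step with a repository that changes daily, the following is a rough sketch of the kind of incremental update an Atom feed makes possible: read the feed and keep only entries updated since the last sync, rather than re-spidering everything. The Atom namespace and the `entry`, `id` and `updated` elements are standard Atom. The sample feed, its ids and the function name are invented for illustration and say nothing about how the CrystalEye feeds are actually laid out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed content; a real client would fetch this over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hypothetical feed</title>
  <entry><id>urn:example:entry-1</id><updated>2007-11-01T08:00:00Z</updated></entry>
  <entry><id>urn:example:entry-2</id><updated>2007-11-03T08:00:00Z</updated></entry>
</feed>"""

def entries_updated_since(feed_xml, last_sync):
    """Return (id, updated) pairs for feed entries newer than last_sync."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated")
        # Atom timestamps may end in "Z"; normalise so fromisoformat accepts them.
        stamp = datetime.fromisoformat(updated.replace("Z", "+00:00"))
        if stamp > last_sync:
            fresh.append((entry.findtext(ATOM + "id"), stamp))
    return fresh

last_sync = datetime(2007, 11, 2, tzinfo=timezone.utc)
fresh = entries_updated_since(SAMPLE_FEED, last_sync)
```

With the sample feed, only `urn:example:entry-2` is returned, since it is the only entry updated after the last sync. This covers additions and changes only; deletions and feed paging would need separate handling, which is part of why copying a dynamic repository is harder than it looks.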

One Response to CrystalEye repository: technical aspects

  1. Regarding the InChIs: I would prefer one InChI for each moiety, not one InChI for the full structure. Or not only, at least.
