The charges in the structure are indeed wrong. There are two challenges…Why chemistry-rich RSS feeds matter… data mining,
The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline that could have been caught by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services. Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-κ²N,O)copper(II)). Done? Checked it? You saw the problem, right? Good.
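As a rough illustration of the kind of agent meant here, the following Python sketch pulls out the entries of an Atom feed that link to a CML file, so that each file could then be fetched and checked. The sample feed, the example.org URLs and the use of a `chemical/x-cml` link type are assumptions made up for this example; the real CrystalEye feed may be laid out differently.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed content, invented for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example crystal structure feed</title>
  <entry>
    <title>Structure A</title>
    <id>urn:example:structure-a</id>
    <link rel="alternate" href="http://example.org/a/summary.html"/>
    <link rel="enclosure" type="chemical/x-cml" href="http://example.org/a/a.cml.xml"/>
  </entry>
  <entry>
    <title>Structure B</title>
    <id>urn:example:structure-b</id>
    <link rel="alternate" href="http://example.org/b/summary.html"/>
  </entry>
</feed>
"""

def cml_links(feed_xml):
    """Return (entry title, CML href) pairs for entries that link to a CML file."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        for link in entry.findall(ATOM + "link"):
            # Assumption: CML files are advertised with this media type.
            if link.get("type") == "chemical/x-cml":
                found.append((title, link.get("href")))
    return found

for title, href in cml_links(SAMPLE_FEED):
    print(title, href)
```

An agent built on this would download each linked file and run whatever sanity checks it cares about, for example that the formal charges add up.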
- For structures with more than one moiety (isolated fragment) it is formally impossible to know the charges if the author doesn’t give them. The authors can give them in _chemical_formula_moiety, but these are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges.)
- In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

The molecule itself is neutral. The easiest way is not to put any charges. Anything else is uncomfortable. We can have positive charges on the N’s, which is natural, but then there are two negative charges on the Cu. That’s formally correct, but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds, but that’s not happy either. And this is simple compared with many metal structures.
What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges, and that is what has happened. We’ll try to fix it.
But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…
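The electron-count remark can be made concrete for a single metal ion. This sketch uses the usual textbook rule that the d-electron count of a transition-metal ion is its group number minus its oxidation state, and then lists the unpaired electrons for the high-spin and low-spin cases. It assumes an idealised octahedral field, which is a simplification, and the function names and the small group table are mine, for illustration only.

```python
# Periodic-table group numbers for some first-row transition metals.
GROUP = {"Ti": 4, "V": 5, "Cr": 6, "Mn": 7, "Fe": 8,
         "Co": 9, "Ni": 10, "Cu": 11, "Zn": 12}

def d_electron_count(symbol, oxidation_state):
    """d-electron count = group number - oxidation state."""
    n = GROUP[symbol] - oxidation_state
    if not 0 <= n <= 10:
        raise ValueError("implausible oxidation state for " + symbol)
    return n

def unpaired_octahedral(d_count):
    """Return (high_spin, low_spin) unpaired-electron counts.

    Assumes an idealised octahedral field (three t2g, two eg orbitals).
    """
    high = d_count if d_count <= 5 else 10 - d_count
    if d_count <= 6:
        # Low spin: fill the three t2g orbitals completely before eg.
        low = d_count if d_count <= 3 else 6 - d_count
    else:
        eg = d_count - 6
        low = eg if eg <= 2 else 4 - eg
    return high, low

for metal, ox in [("Cu", 2), ("Fe", 2), ("Fe", 3)]:
    d = d_electron_count(metal, ox)
    print(metal, ox, "d%d" % d, unpaired_octahedral(d))
```

For Cu(II) both cases give one unpaired electron, so the answer is unambiguous. For Fe(II) and Fe(III) the two cases differ, which is one reason the spin state cannot be guessed from the formula alone.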
This was the process when Nick reported it at ACS. It’s changed slightly. CIFDOM is now CIFXML and it emits RAWCML. CML* represents CompleteCML. The 2D coordinates (but not the actual images) are held in CompleteCML. CompleteCML also contains the moieties, each with its own InChI.
To respond to your requests:
- link to CrystalEye entries (assuming the CMLs contain a CrystalEye identifier). [PMR: The Atom feed contains the link - see above]
- derive properties myself [PMR: certainly. You should be able to work directly on the crystal structure and/or the moieties. The only thing you don't have is the fragments.]
- test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet) [PMR: I think there will be objective problems with coordination compounds and organometallics. We normally rely on the author to give the overall charge on a moiety. If they don't we are usually hosed for coordination compounds.]
- detect uncommon geometries. [PMR: Certainly and we’d love to help. The bondlength plots already do this, and they are probably the first place to start. We would also plan to organise by fragments. Some fragments will have only one occurrence, others will have thousands (we would expect over 100,000 examples of phenyl groups as it can occur 10 times in some structures). Ideally we would do a cluster analysis for each fragment and then you could look for outliers. I did a brief hack at this some years back (in fact I think we corresponded). There is also enormous possibility in intermolecular interactions, perhaps a histogram of close contacts for X…Y.]
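To give a flavour of what looking for outliers could mean in practice, here is a minimal sketch that flags unusual values in a list of bond lengths using the median and the median absolute deviation (a modified z-score with the conventional 3.5 cutoff). This is a generic robust-statistics technique, not what CrystalEye does, and the bond lengths are invented numbers standing in for one fragment's data.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or at least half of them) are identical; nothing to scale by.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Invented aromatic C-C distances in angstroms, with one suspicious entry.
lengths = [1.385, 1.390, 1.392, 1.388, 1.395, 1.387, 1.391, 1.520]
print(mad_outliers(lengths))
```

A real analysis would work per fragment and per bond type, and a cluster analysis would cope better with fragments that have more than one legitimate geometry.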
And we invite collaboration.
InChI=1/C20H28O4/c1-10-11-5-6-12-19(4)8-7-14(21)18(2,3)13(19)9-15(22)20(12,16(10)23)17(11)24/h11-13,15,17,22,24H,1,5-9H2,2-4H3/t11-,12-,13+,15+,17+,19-,20-/m0/s1
Nick has drawn the dication first, but the others are drawable by scrolling.
There is NO InChI for the complete molecule (I’m not sure if this is deliberate), but there IS an InChI for the dication under “Moities” as there also is for the solvent. (The anions are missing from the moieties; this may be a CrystalEye bug or it may be an author problem.) InChI for dication:
InChI=1/C11H18N4/c1-10-12(3)5-7-14(10)9-15-8-6-13(4)11(15)2/h5-8H,9H2,1-4H3/q+2
InChI for solvent (CH3CN):
InChI=1/C2H3N/c1-2-3/h1H3
(Nick: BUG. The picrates are in the complete CML file but they don’t have InChIs and they don’t appear in the pages)
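Because the charge travels with the InChI, a consumer of the feed can read it back without a chemistry toolkit. The sketch below simply splits the string on `/` and looks for a layer beginning with `q`. It is not a full InChI parser, the function name is made up, and the handling of `;`-separated and `n*`-multiplied components is my assumption about how multi-component charge layers are written.

```python
def inchi_formula_and_charge(inchi):
    """Return (formula layer, total charge) from an InChI string."""
    layers = inchi.split("/")
    formula = layers[1] if len(layers) > 1 else ""
    charge = 0
    for layer in layers[2:]:
        if layer.startswith("q"):
            for part in layer[1:].split(";"):
                if not part:
                    continue  # uncharged component
                if "*" in part:
                    # Assumed form: "2*+1" means two components of charge +1.
                    count, value = part.split("*")
                    charge += int(count) * int(value)
                else:
                    charge += int(part)
    return formula, charge

dication = ("InChI=1/C11H18N4/c1-10-12(3)5-7-14(10)9-15-8-6-13(4)11(15)2"
            "/h5-8H,9H2,1-4H3/q+2")
solvent = "InChI=1/C2H3N/c1-2-3/h1H3"

print(inchi_formula_and_charge(dication))
print(inchi_formula_and_charge(solvent))
```

Summing these charges over all the moieties in an entry is a cheap check: if the total is not zero, either a moiety is missing (as the picrates are here) or a charge has been lost.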
The InChI is calculated for the major component.
Note that this is also disordered.
The InChI here represents the compound molecule as A2.B
InChI=1/2C16H14N4O2.C4H8O2/c2*1-11-19-16(22-20-11)15(9-17-10-18-21)14-8-4-6-12-5-2-3-7-13(12)14;1-2-6-4-3-5-1/h2*2-10,21H,1H3,(H,17,18);1-4H2/b2*15-9-;
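The A2.B composition can be read straight from the formula layer, where components are separated by `.` and a leading integer gives the multiplicity. A minimal sketch, again not a real InChI parser and with a function name of my own choosing:

```python
import re

def inchi_components(inchi):
    """Return [(count, formula), ...] from the formula layer of an InChI."""
    formula_layer = inchi.split("/")[1]
    components = []
    for part in formula_layer.split("."):
        # An optional leading integer is the number of copies of the component.
        match = re.match(r"(\d*)(.+)", part)
        count = int(match.group(1)) if match.group(1) else 1
        components.append((count, match.group(2)))
    return components

compound = ("InChI=1/2C16H14N4O2.C4H8O2/c2*1-11-19-16(22-20-11)15(9-17-10-18-21)"
            "14-8-4-6-12-5-2-3-7-13(12)14;1-2-6-4-3-5-1"
            "/h2*2-10,21H,1H3,(H,17,18);1-4H2/b2*15-9-;")

print(inchi_components(compound))
```

This gives two copies of C16H14N4O2 and one of C4H8O2, which is the A2.B stoichiometry described above.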
In summary, therefore, I think we should certainly have InChIs for the moieties (and I think we have, at least in principle). I am less clear how useful it is for the overall crystal structure (as in D). Note that for inorganic structures without discrete moieties there are no InChIs. I am looking for some with discrete moieties.
That’s enough for now. I’ll tackle fragments in the next post or so.