Andrew Dalke raised the issue of data corruption:
PMR: Here is a widespread example of corruption that is almost entirely down to the use of SD (MOL) files and/or SMILES in particular, although the problem is common to almost all legacy formats.
Nitric oxide (WP) is a very important molecule – it is an essential signalling molecule in the vascular system, and also a serious pollutant from transport. Its formula is NO, one nitrogen atom and one oxygen atom.
A large number of freely accessible databases give other formulas:
PMR: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information. Because SD files have no mechanism for indicating that an atom does not have implicit hydrogens, many programs are “clever” and add them according to “valence rules”. While these rules are OK for a subset of chemistry, they are a disaster for the rest. Nitric oxide is just one of many examples where they fail. So that is why I cannot answer Chemspider’s request for SD files of CrystalEye – I KNOW it will corrupt the information. It is possible that there is a simple algorithm that could pick out “most” of the entries that would not be corrupted, but it will not be watertight. That is why we have developed CML – it is designed to avoid corruption.
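To make the failure concrete, here is a minimal sketch of the kind of “valence rule” described above. It is not taken from any particular program: the `DEFAULT_VALENCE` table, the `guessed_formula` function and its `h_counts` parameter are all hypothetical names invented for this illustration, and real toolkits use more elaborate rules.

```python
from collections import Counter

# Hypothetical "normal" valences, for illustration only.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def guessed_formula(atoms, bonds, h_counts=None):
    """Return element counts after filling in hydrogens.

    atoms    -- list of element symbols, e.g. ["N", "O"]
    bonds    -- list of (atom_index, atom_index, bond_order) tuples
    h_counts -- optional explicit hydrogen count per atom; when given,
                no guessing takes place
    """
    # Sum the bond orders arriving at each atom.
    bond_sum = [0] * len(atoms)
    for a, b, order in bonds:
        bond_sum[a] += order
        bond_sum[b] += order

    formula = Counter(atoms)
    for i, element in enumerate(atoms):
        if h_counts is not None:
            n_h = h_counts[i]  # the file said how many H there are
        else:
            # The "clever" step: pad every atom up to its default valence.
            n_h = max(DEFAULT_VALENCE[element] - bond_sum[i], 0)
        if n_h:
            formula["H"] += n_h
    return dict(formula)


# Nitric oxide drawn as N=O, with nothing said about hydrogens.
print(guessed_formula(["N", "O"], [(0, 1, 2)]))
# The same structure when the hydrogen count is stated explicitly as zero.
print(guessed_formula(["N", "O"], [(0, 1, 2)], h_counts=[0, 0]))
```

With the guess switched on, the nitrogen is padded to valence 3 and the result is HNO; with explicit zero hydrogen counts the formula stays NO. The point of the sketch is only that the answer depends on whether the format lets the author state the hydrogen count.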
PMR: There are “obviously” parts of the information that can be omitted without corruption. An example is “iucr:_publ_contact_author_phone”. But what happens if you omit “occupancy” in an entry? It looks like:

Notice that the _chemical_formula_sum contains non-integral atom counts – this is common in crystal structures and is supported by the _atom_site_occupancy field in CIF, which is the last value before the two dots in each atom row.
_atom_site_type_symbol
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_adp_type
_atom_site_calc_flag
_atom_site_refinement_flags
_atom_site_occupancy
_atom_site_disorder_assembly
_atom_site_disorder_group
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
Confirm that Ni(1+0.8+0.8) => Ni2.6 and Te(1+1) => Te2. CML is designed to hold this without loss (through the occupancy attribute) but SD files, SMILES and almost all other legacy formats (except PDB and a few other crystallographic formats) are not. Therefore using SD to bundle this entry and transmit it is guaranteed to corrupt it.
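The arithmetic can be checked mechanically. The following is a rough sketch, not a CIF parser: it assumes the five atom rows above are whitespace-separated in the column order listed, and the names `occupancy_sums`, `strip_su` and `keep_occupancy` are made up for this example. It also ignores site multiplicity and simply adds up the rows, as the check in the text does.

```python
from collections import defaultdict

# The five atom rows from the entry, in the column order listed above.
ATOM_SITE_ROWS = """\
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
"""

# Zero-based position of _atom_site_occupancy among the twelve columns.
OCCUPANCY_COLUMN = 9


def strip_su(value):
    """Drop a trailing standard uncertainty, e.g. '0.80(2)' -> 0.8."""
    return float(value.split("(")[0])


def occupancy_sums(rows, keep_occupancy=True):
    """Sum site occupancies per element symbol.

    With keep_occupancy=False every site counts as one whole atom,
    which is what a format without an occupancy field would record.
    """
    totals = defaultdict(float)
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue
        element = fields[0]
        if keep_occupancy:
            totals[element] += strip_su(fields[OCCUPANCY_COLUMN])
        else:
            totals[element] += 1.0
    return dict(totals)


print(occupancy_sums(ATOM_SITE_ROWS))
print(occupancy_sums(ATOM_SITE_ROWS, keep_occupancy=False))
```

With occupancy kept, the nickel total comes to 2.6 (to within floating-point rounding) and tellurium to 2. With occupancy dropped, the same rows give three nickel atoms and two tellurium atoms, which is a different compound from the one in the entry.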
[Note added later. There is a well characterised HN=O molecule - see NIST Webbook - but it is nitrosyl hydride, not nitric oxide.]
This entry was posted on Sunday, November 4th, 2007 at 12:18 pm and is filed under chemistry, crystaleye.
PMR says: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information.
WR says: I agree that for the special purpose of crystallography SD-files may not be sufficient. I DO NOT agree with your 2nd sentence, that the above mentioned error of ‘N=O’ has been INTRODUCED BECAUSE of the use of SD-format ITSELF !!!!! In order to make things clear, I am talking about the representation of a molecular connectivity within SD-files. The use of SD-files has NOT corrupted this piece of information – the correct version is ‘The IMPROPER use of the SD/MOL-format has corrupted this piece of information’. The corruption has been caused by a HUMAN (not the format), who didn't read the description carefully and therefore dumped something wrong to the file. In SD-file format there is a column for the valency which exactly avoids this problem – obviously it was not filled with the correct value (same is true for the charge in the other ‘variant’) ….
I understand that you want to promote CML, etc. – but please don't mix up cause and consequence according to your needs !
SD-files have a lot of other problems, but at least 90% of the incomplete descriptions of structures are caused by humans, who use for example the ‘highlighting feature’ of bonds as representation for stereochemistry, etc., etc. ….. Other “real-world” problems with SD-files include e.g. bond types like ‘complex’ are missing, charges distributed over 2,3,…. atoms are missing, etc. – but the above mentioned errors CAN NOT be attributed to the SD-format definition !!! Details can be found on http://www.mdli.com (download-area)