CrystalEye: data loss and corruption through legacy files

Andrew Dalke raised the issue of data corruption:

  1. Andrew Dalke Says:
    November 4th, 2007 at 2:32 am
  2. PMR: Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data.
  3. “Corruption” is a strong word. Why not think of it as the way you wrote in your “Round-trip format conversion” wikipedia article?

PMR: Here is a widespread example of corruption which is almost entirely down to the use of legacy formats, SD (MOL) files and SMILES in particular. Nitric oxide (WP) is a very important molecule: it is an essential signalling molecule in the vascular system, and also a serious pollutant from transport. Its formula is NO, one nitrogen atom and one oxygen atom.
A large number of freely accessible databases give other formulas:

PMR: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information. Because SD files have no mechanism for indicating that an atom does not have implicit hydrogens, many programs are “clever” and add them according to “valence rules”. While these are OK for a subset of chemistry they are a disaster for others. Nitric oxide is just one of many examples where they fail. So that is why I cannot answer Chemspider’s request for SD files of CrystalEye – I KNOW it will corrupt the information. It is possible that there is a simple algorithm that could filter out “most” of the entries which would not be corrupted, but it will not be watertight. That is why we have developed CML – it is designed to avoid corruption.
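The failure mode can be sketched in a few lines. This is a toy model only: the default valences and the fill rule are assumptions chosen for illustration, not the algorithm of any particular program. It shows how a reader that is given N=O with no statement about hydrogens or radicals ends up with HNO.

```python
# Toy illustration: a naive "valence rule" that fills each atom up to a
# default valence with hydrogens. The table and the rule are assumptions.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(symbol, bond_order_sum, unpaired_electrons=0):
    """Hydrogens the naive rule would add to a single atom."""
    missing = DEFAULT_VALENCE[symbol] - bond_order_sum - unpaired_electrons
    return max(0, missing)


def formula_after_filling(atoms):
    """atoms: list of (symbol, bond_order_sum, unpaired_electrons) tuples."""
    h_total = sum(implicit_hydrogens(*atom) for atom in atoms)
    heavy = "".join(symbol for symbol, _, _ in atoms)
    prefix = "" if h_total == 0 else "H" if h_total == 1 else f"H{h_total}"
    return prefix + heavy


# N=O written with a double bond and nothing else: the rule invents a hydrogen.
as_read_naively = formula_after_filling([("N", 2, 0), ("O", 2, 0)])

# The same atoms when the unpaired electron on N is stated explicitly.
with_radical = formula_after_filling([("N", 2, 1), ("O", 2, 0)])
```

Here `as_read_naively` is "HNO" and `with_radical` is "NO": the difference lies entirely in information that the file either carried or did not.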

  1. When a document in one format is converted to another there is likely to be information loss. Is “information loss” necessarily “corruption”? From my experience in dealing with PDB files, which has some of these crystallographic properties, I think there can be meaningful information despite the information loss. So long as the tools and the users understand that there are limitations in the conversion.

PMR: There are “obviously” parts of the information that can be omitted without corruption. An example is “iucr:_publ_contact_author_phone”. But what happens if you omit “occupancy” in an entry ? It looks like:
nite.PNG
Notice that the _chemical_formula_sum contains non-integral atom counts. This is common in crystal structures and is supported by the _atom_site_occupancy item in CIF, which corresponds to the last field before the two dots in each row below.
_atom_site_type_symbol
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_adp_type
_atom_site_calc_flag
_atom_site_refinement_flags
_atom_site_occupancy
_atom_site_disorder_assembly
_atom_site_disorder_group
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
Confirm that Ni(1+0.8+0.8) => Ni2.6 and Te(1+1) => Te2. CML is designed to hold this without loss (through the occupancy attribute) but SD files, SMILES and almost all other legacy formats (except PDB and a few other crystallographic formats) are not. Therefore using SD to bundle this entry and transmit it is guaranteed to corrupt it.
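The arithmetic can be checked with a short sketch over the five rows quoted above. It assumes the rows are whitespace-separated in the column order listed, and it follows the simple sum used in this post, so it ignores site multiplicity; the function names are illustrative only.

```python
from collections import defaultdict
from decimal import Decimal

# Column order as listed in the CIF fragment above.
COLUMNS = [
    "_atom_site_type_symbol", "_atom_site_label",
    "_atom_site_fract_x", "_atom_site_fract_y", "_atom_site_fract_z",
    "_atom_site_U_iso_or_equiv", "_atom_site_adp_type",
    "_atom_site_calc_flag", "_atom_site_refinement_flags",
    "_atom_site_occupancy",
    "_atom_site_disorder_assembly", "_atom_site_disorder_group",
]

ROWS = """\
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
""".splitlines()


def occupancy_sums(rows, columns=COLUMNS):
    """Sum occupancies per element (site multiplicity is ignored)."""
    sym = columns.index("_atom_site_type_symbol")
    occ = columns.index("_atom_site_occupancy")
    totals = defaultdict(Decimal)
    for row in rows:
        fields = row.split()
        # Drop a trailing standard uncertainty such as "0.80(2)" if present.
        totals[fields[sym]] += Decimal(fields[occ].split("(")[0])
    return dict(totals)


def site_counts(rows, columns=COLUMNS):
    """What survives in a format with no occupancy: one whole atom per site."""
    sym = columns.index("_atom_site_type_symbol")
    counts = defaultdict(int)
    for row in rows:
        counts[row.split()[sym]] += 1
    return dict(counts)


with_occupancy = occupancy_sums(ROWS)   # Ni sums to 2.6, Te to 2
without_occupancy = site_counts(ROWS)   # Ni becomes 3, Te stays 2
```

Dropping the occupancy column turns Ni2.6 into Ni3, which is a different composition rather than merely a less detailed one.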
[Note added later. There is a well characterised HN=O molecule – see NIST Webbook – but it is nitrosyl hydride, not nitric oxide.]


8 Responses to CrystalEye: data loss and corruption through legacy files

  1. PMR says: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information.
    WR says: I agree, that for the special purpose of crystallography SD-files may be not sufficient. I DO NOT agree to your 2nd sentence that the above mentioned error of ‘N=O’ has been INTRODUCED BECAUSE of the use of SD-format ITSELF !!!!! In order to make things clear I am talking about the representation of a molecular connectivity within SD-files. The use of SD-files has NOT corrupted this piece of information – the correct version is ‘The IMPROPER use of the SD/MOL-format has corrupted this piece of information’. The corruption has been caused by a HUMAN (not the format), who didn’t read the description carefully and therefore dumped something wrong to the file. In SD-file format there is a column for the valency which exactly avoids this problem – obviously it was not filled with the correct value (same is true for the charge in the other ‘variant’) ….
    I understand that you want to promote CML, etc. – but please don’t mix up cause and consequence according to your needs !
    SD-files have a lot of other problems, but at least 90% of the incomplete description of structures are caused by humans, who use for example the ‘highlighting feature’ of bonds as representation for stereochemistry, etc., etc. ….. other “real-world” problems with SD-files include e.g. bond types like ‘complex’ are missing, charges distributed over 2,3,…. atoms are missing, etc. – but the above mentioned errors CAN NOT be attributed to the SD-format definition !!! Details can be found on http://www.mdli.com (download-area)

  2. pm286 says:

    (1) Thanks. I am familiar with the MDL CTFile format which states:
    vvv
    valence
    0 = no marking (default)
    (1 to 14) = (1 to 14)
    15 = zero valence
    [Generic] Shows number of bonds to this atom, including bonds to implied H’s.
    PMR: I agree that nobody uses this – I also think this is an unclear definition – what does “number of bonds” mean? Coordination or valence sum?
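Read literally, the quoted definition can be decoded as in the sketch below. The function names are illustrative, and the second function has to pick an answer to the open question: it assumes that “number of bonds” means the bond-order sum, which is only one of the two readings.

```python
def decode_valence_field(vvv):
    """Decode the vvv field as quoted: 0 = no marking, 1-14 = value, 15 = zero."""
    if vvv == 0:
        return None          # no marking (default)
    if vvv == 15:
        return 0             # special code for zero valence
    if 1 <= vvv <= 14:
        return vvv
    raise ValueError(f"vvv out of range: {vvv}")


def hydrogens_from_marking(vvv, bond_order_sum):
    """Implied H count, ASSUMING 'number of bonds' means the bond-order sum."""
    valence = decode_valence_field(vvv)
    if valence is None:
        return None          # unmarked: the reading program is left to guess
    return max(0, valence - bond_order_sum)


# N in N=O with vvv = 2: both bonds are accounted for, so no implied hydrogens.
marked = hydrogens_from_marking(2, 2)

# The same atom with vvv = 0 gives no answer at all.
unmarked = hydrogens_from_marking(0, 2)
```

Under the other reading (coordination number) a doubly bonded nitrogen with vvv = 2 would still have one bond slot free, which is exactly the ambiguity raised above.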

  3. I have loaded just for fun the 5 ‘versions’ of “nitric-oxide” into CSEARCH and afterwards dumped it to a MOLfile: (the lines holding ‘-‘ have been introduced manually for better readability)
    ———————— H-N=O —————————————-
    2 1 0 0 0 0 0 0 0 0999 V2000
    -3.4533 0.0933 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    3.5467 0.0933 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
    1 2 2 0 0 0 0
    M END
    ————————- N#O(+)—————————————-
    2 1 0 0 0 0 0 0 0 0999 V2000
    -3.4533 0.0933 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    3.5467 0.0933 0.0000 O 0 3 0 0 0 0 0 0 0 0 0 0
    1 2 3 0 0 0 0
    M CHG 1 2 1
    M END
    ————————– NO (no radical) ——————————
    2 1 0 0 0 0 0 0 0 0999 V2000
    -3.4533 0.0933 0.0000 N 0 0 0 0 0 2 0 0 0 0 0 0
    3.5467 0.0933 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
    1 2 2 0 0 0 0
    M END
    —————————- NO (radical) ——————————–
    2 1 0 0 0 0 0 0 0 0999 V2000
    -3.4533 0.0933 0.0000 N 0 4 0 0 0 0 0 0 0 0 0 0
    3.5467 0.0933 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
    1 2 2 0 0 0 0
    M RAD 1 1 1
    M END
    ————————— (-)N=O ———————————–
    2 1 0 0 0 0 0 0 0 0999 V2000
    -3.4533 0.0933 0.0000 N 0 5 0 0 0 0 0 0 0 0 0 0
    3.5467 0.0933 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
    1 2 2 0 0 0 0
    M CHG 1 1 -1
    M END
    ———————————————————————-
    M RAD / M CHG line has higher priority than the information within the ‘atomblock’
    Conclusion: All 4(5) variants can be properly handled by MOL/SD-file structure representation
    BTW: During writing this post which took me approx. 5 minutes I have dumped ca. 500,000 structures from CSEARCH to MOLfiles for backup-purposes ….. I agree, there are problems with inorganics and organometallics ……….
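The priority rule for charges can be sketched as follows. Two assumptions are made loudly here: the records are parsed by splitting on whitespace, because that is how they appear in this comment (real V2000 files use fixed-width columns), and only M CHG is handled, not M RAD. The charge-code table follows the CTfile documentation as I understand it.

```python
# Atom-block charge codes; code 4 marks a doublet radical and carries no charge.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def atom_charges(molfile_lines):
    """Return [(symbol, charge)], letting any M CHG line override the atom block.

    Assumes the first line is the counts line and fields are space-separated.
    """
    n_atoms = int(molfile_lines[0].split()[0])
    atom_fields = [line.split() for line in molfile_lines[1:1 + n_atoms]]
    symbols = [fields[3] for fields in atom_fields]
    charges = [CHARGE_CODES[int(fields[5])] for fields in atom_fields]

    chg_lines = [line.split() for line in molfile_lines
                 if line.split()[:2] == ["M", "CHG"]]
    if chg_lines:
        # Property lines win: start from zero and apply only what they state.
        charges = [0] * n_atoms
        for fields in chg_lines:
            pairs = fields[3:3 + 2 * int(fields[2])]
            for i in range(0, len(pairs), 2):
                charges[int(pairs[i]) - 1] = int(pairs[i + 1])
    return list(zip(symbols, charges))


cation = atom_charges([
    "2 1 0 0 0 0 0 0 0 0999 V2000",
    "-3.4533 0.0933 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0",
    "3.5467 0.0933 0.0000 O 0 3 0 0 0 0 0 0 0 0 0 0",
    "1 2 3 0 0 0 0",
    "M CHG 1 2 1",
    "M END",
])
```

For the N#O(+) record above this yields a charge of +1 on oxygen and 0 on nitrogen, whether the value is taken from the atom block or from the M CHG line.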

  4. pm286 says:

    (3) Thanks, Wolfgang.
    I agree that it is possible to manage NO correctly in SD files – the point is that most systems don’t do it. The most robust way is to include H’s explicitly which is always possible, though there is no way of stating that this represents all the H’s. The benefit of exclusion of H’s was that it reduced space, at the cost of this corruption.

  5. Andrew Dalke says:

    Crystallographic records are of course different than molecular formats and there are structures and cases which cannot be translated.
    You pointed out two such: treatment of implicit hydrogens/radicals, and occupancy. SD files can handle the first case, but “most systems don’t do it [correctly].” You do realize that if you generate the SD files then you get to influence that part of the process, and better ensure that any data “corruption” is minimized? You can even include fields in the tag section which help tools further on in the pipeline understand when and how a given connection table has problems.
    As for occupancy, yes, an SD file or a SMILES string is not a xtallographic data format. Some things don’t translate at all. How many records cannot be converted with any semblance of usability? 50%? 1%? 0.001%? How many can be translated with some loss of chemistry but in a still usable form? How many can be translated with loss of “other” information, like “iucr:_publ_contact_author_phone”? Can the conversion problems be identified and noted automatically?
    The OpenEye people have done some pretty good work (as certainly have people at PubChem) in doing chemistry perception of PDB structures, including handling these strange cases. It’s not easy, but also not unexplored territory. And you’re getting some pretty strong feedback which suggests that people would like that data, and probably would like it enough to write their own conversion codes, which will silently corrupt things worse than anything you would do.
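One way the suggestion about tag fields might look is sketched below. The tag name CONVERSION_WARNINGS and both function names are hypothetical, and the check covers only the two problems named in this thread (partial occupancy and a disorder group); it makes no claim to catch every case.

```python
def conversion_warnings(sites):
    """sites: iterable of (label, occupancy, disorder_group); '.' means unset."""
    warnings = []
    for label, occupancy, disorder_group in sites:
        if float(occupancy) < 1.0:
            warnings.append(f"{label}: partial occupancy {occupancy}")
        if disorder_group != ".":
            warnings.append(f"{label}: disorder group {disorder_group}")
    return warnings


def sd_data_item(tag, values):
    """Format one SD data item: a '> <TAG>' header, value lines, a blank line."""
    return f"> <{tag}>\n" + "\n".join(values) + "\n\n"


# The nickel telluride sites from the post: two Ni sites are only 80% occupied.
sites = [
    ("Ni1", "1", "."), ("Ni2", "0.80", "."), ("Ni3", "0.80", "."),
    ("Te1", "1", "."), ("Te2", "1", "."),
]
found = conversion_warnings(sites)
# CONVERSION_WARNINGS is a made-up tag name for illustration.
item = sd_data_item("CONVERSION_WARNINGS", found) if found else ""
```

A downstream tool that does not know the tag will ignore it, so this can only reduce the damage for readers that choose to look.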

  6. pm286 says:

    (5) Thanks Andrew.
    I reckon 5-10% of entries have a disorder flag.
    “Can the conversion problems be identified and noted automatically?” Not sure; it is starting to become research.
    “chemistry perception of […] structures” – Nick has done a great job in perception from coordinates alone.
    People see CrystalEye from their own perspective which varies. I can think of at least the following high-value views:
    * a list of moieties indexed by InChIs (for pharma-like research)
    * a list of fragments – for molecular building
    * an index of cell dimensions (probably the most important thing that crystallographers want)
    * an index of bond lengths (for chemical research)
    * an index of structural types (for materials research)
    3-4 can be easily built from the CML files. There are codes such as Open Babel which transform CML to other formats.

  7. Andrew Dalke says:

    So 90% of the data is easily exportable to an SD file? Wouldn’t this subset be useful, and easily generated?
    InChIs handle moieties? Ahh, “moiety” here appears to be CIF derived terminology to mean “discrete bonded residue or ion”, and does not include fragment.
    I looked to see what CML says on this but I can’t find the spec. Last I looked at it was about 5 years ago, so I was curious to see what the current version was like. What I found were:
    An Open Babel page points to an empty wiki page. Going to the root takes me to a summary page which points me to the CML home page, from which I get “host not found”. I get DNS failures for http://www.xml-cml.org and for xml-cml.org, though WHOIS looks correct.
    Going to Google to look in the cache, I get “No results found for http://www.xml-cml.org” and it suggests that maybe I want http://www.xml-ces.org/ or xml.cxml.org/ . This tells me that the xml-cml.org site has been down for a while. Going to archive.org I find a page from 3 January 2006 which says to look at http://cml.sourceforge.net/ or the wiki pages at http://wwmm.ch.cam.ac.uk/moin/ , which pointed me to xml-cml.org in the first place.
    Going to http://cml.sourceforge.net/ I looked for “specification”. After a short distraction looking for it in the “Downloads” page, which is archaic, I thought to follow “schemas.” That looks like the specification, and partially explains why searching for “CML specification” didn’t find it. I think of schemas as something used for machine validation, and not the primary specification. For an example of the difference between a schema and a specification, what human wants to read the list of allowed element names three times? And then do the work to verify (as I just did) that all three are the same, and that they reflect the current IUPAC definitions and not the mid-1990s definition, where Rf meant #106 instead of #104?
    That document does not define “moiety”.
    Given that you have five high-value views, which are your priorities? Generating the InChIs will have effectively identical problems as generating an SD file, as will generating fragments.
    You’ve said that converting from CML to an SD file causes “corruption”. Is Open Babel good enough? If so, why not use it yourself?

  8. Peter, For your readers who might be puzzled by the exchange on SDF file and how “able” it is (though Wolfgang has adequately shown the details for NO) everyone is pointed to http://www.mdl.com/downloads/public/ctfile/ctfile.pdf
    ChemSpider does indeed have four entries. PubChem has three entries (you only highlighted two above)
    http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=945
    http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=84878
    http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=145068
    If you look back at the list of entries of ChemSpider that you did you’ll see four hits
    One is HNO, one the cation, one the anion, and one the radical. This is displayed here: http://www.chemspider.com/images/Nitric_oxide_pre_curation.png
    Following your comments and with a couple of minutes of curation by yours truly you will now see this result only: http://www.chemspider.com/RecordView.aspx?id=127983
    That’s the power of feedback and creating a community of curators.
    To me CML is a format. As with any format it is what the human feeds into it that matters. If users drew a structure and then attached the name nitric oxide it is absolutely IRRELEVANT whether the format can handle an appropriate NO representation. It is the structure-identifier pair that matters and this would be human generated association. Clearly Wolfgang has shown that you cannot blame mol or SDF representations.
    Back to the issue of trying to link back to CrystalEye from ChemSpider. I get that you don’t want to share the structure files by SDF. So, send us the CML files and IDs instead. Of course then I think we will go back to the forking off of the data. Considering that CrystalEye is Open Data it’s incredibly difficult to get access to.
