Automatic assignment of charges by JUMBO

Egon has spotted a bug in our code for assignment of charges to atoms:

Why chemistry-rich RSS feeds matter… data minging,

The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services.
Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)).
Done? Checked it? You saw the problem, right? Good.

The charges in the structure are indeed wrong. There are two challenges…

  • For structures with more than one moiety (isolated fragment) it is formally impossible to know the charges if the authors don’t give them. The authors can give them in _chemical_formula_moiety, but these are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges.)
  • In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

Summary page for crystal structure from DataBlock I in CIF xu2383sup1 from article xu2383 in issue 2008/01-00 of Acta Crystallographica, Section E.


 

The molecule itself is neutral. The easiest way is not to put any charges on it. Anything else is uncomfortable. We can have positive charges on the N’s, which is natural, but then there are two negative charges on the Cu. That’s formally correct, but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds, but that’s not happy either. And this is simple compared with many metal structures.

What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges, and that is what has happened. We’ll try to fix it.
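
To make the failure mode concrete, here is a minimal Python sketch of the dissociate / assign / add-back sequence with a total-charge check at the end. It is not the JUMBO code: the Atom class, the assign_charges function, the atom labels and the hard-wired charge tables are all hypothetical, and the real aromaticity and charge perception is replaced by a simple lookup. Only the coordinating atoms of the copper complex are listed; all other atoms are taken as neutral.

```python
from dataclasses import dataclass

METALS = {"Cu", "Fe"}  # illustrative subset only, not a complete list


@dataclass(frozen=True)
class Atom:
    label: str      # hypothetical atom label, e.g. "O1"
    element: str
    charge: int = 0


def total_charge(atoms):
    """Sum of formal charges over all atoms."""
    return sum(atom.charge for atom in atoms)


def assign_charges(atoms, ligand_charges, metal_charges, expected_total=0):
    """Dissociate metals, charge the ligands, add the metals back, then check."""
    # Step 1: dissociate the metal(s) from the organic part.
    metals = [a for a in atoms if a.element in METALS]
    ligand_atoms = [a for a in atoms if a.element not in METALS]

    # Step 2: stand-in for aromaticity and charge perception (a plain lookup).
    charged_ligands = [
        Atom(a.label, a.element, ligand_charges.get(a.label, 0))
        for a in ligand_atoms
    ]

    # Step 3: add the metal back, carrying its charge with it.
    charged_metals = [
        Atom(a.label, a.element, metal_charges.get(a.label, 0))
        for a in metals
    ]

    # Step 4: a forgotten charge shows up as a wrong total.
    result = charged_ligands + charged_metals
    found = total_charge(result)
    if found != expected_total:
        raise ValueError(f"total charge {found}, expected {expected_total}")
    return result


# Coordinating atoms only; labels are made up for this sketch.
complex_atoms = [
    Atom("Cu1", "Cu"),
    Atom("O1", "O"), Atom("O1a", "O"),
    Atom("N1", "N"), Atom("N1a", "N"),
]
carboxylates = {"O1": -1, "O1a": -1}

ok = assign_charges(complex_atoms, carboxylates, {"Cu1": 2})
print(total_charge(ok))  # 0: Cu(2+) balances two carboxylate O(-)

try:
    assign_charges(complex_atoms, carboxylates, {})  # metal charge forgotten
except ValueError as err:
    print(err)  # total charge -2, expected 0
```

The check relies on knowing the expected total, which is only safe for a single neutral moiety like this one.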

 

But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…
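
The d-electron arithmetic is simple enough to sketch. The snippet below uses the usual coordination-chemistry convention for 3d metals (atomic number, minus the 18-electron argon core, minus the oxidation state) and, as an assumption that does not come from the crystal data, an idealised octahedral field to list the unpaired electrons in the high-spin and low-spin cases. The function names are made up for illustration.

```python
ATOMIC_NUMBER = {"Fe": 26, "Cu": 29}  # only the two metals discussed here
ARGON_CORE = 18


def d_electron_count(element, oxidation_state):
    """d-electron count of a 3d metal ion by the usual counting convention."""
    return ATOMIC_NUMBER[element] - ARGON_CORE - oxidation_state


def unpaired_electrons(n_d, high_spin=True):
    """Unpaired electrons for d^n, ASSUMING an idealised octahedral field."""
    if not 0 <= n_d <= 10:
        raise ValueError("d-electron count must be between 0 and 10")
    if high_spin:
        # Fill all five d orbitals singly before pairing.
        return n_d if n_d <= 5 else 10 - n_d
    if n_d <= 6:
        # Low spin: the three t2g orbitals fill (and pair) first.
        return n_d if n_d <= 3 else 6 - n_d
    # t2g is full; the remaining electrons go into the two eg orbitals.
    e_g = n_d - 6
    return e_g if e_g <= 2 else 4 - e_g


for element, state in [("Cu", 2), ("Fe", 2), ("Fe", 3)]:
    n_d = d_electron_count(element, state)
    high = unpaired_electrons(n_d, high_spin=True)
    low = unpaired_electrons(n_d, high_spin=False)
    print(f"{element}({state}+): d{n_d}, unpaired high-spin={high}, low-spin={low}")
```

Under these assumptions Cu(II) gives one unpaired electron either way, whereas Fe(II) can have four or none and Fe(III) five or one, so the spin state cannot be read off the formula alone.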

 


6 Responses to Automatic assignment of charges by JUMBO

  1. I think there is an additional bug with the generation of SMILES versus InChI since it appears that stereo is reversed. Also, maybe it’s intentional that you generate InChIs based on unit cells rather than structures, but I don’t think it’s the best way to represent structures/compounds for indexing. I wrote about my observations tonight here (http://www.chemspider.com/blog/struggling-to-scrape-crystaleye.html). I understand it’s a research project of course so offer this information to help resolve the issues. I have some thoughts regarding how you might want to validate the SMILES/InChI differences and how regular expressions could help clean up the data if the unit cell capture is not intentional. Best wishes, ChemSpiderMan

  2. Peter, I have not been able to find time this weekend to set up a CMLRSS agent to process CrystalEye data, but hope to do so soon. Next weekend, next try. Good point about the electrons on copper; I actually even overlooked that problem in the 2D diagram, after enthusiastically seeing the link between the wrongly charged oxygens and the atom type perception I’m working on in the CDK…
    I remember the feed can give you only the updates… what I was wondering, did your group set up a Taverna node (local processor), to implement those hacks for the RSS services? Then I could simply use CDK-Taverna to set up the agent…
    Final thingy… is there a bug tracking system for CrystalEye, where the CrystalEye community can file problems and keep track of the state of those?

  3. Peter – as I just commented on Egon’s blog we use Bugzilla. (http://www.bugzilla.org/). It works very well for us.

  4. Jim Downing says:

    I’ve just applied for a sourceforge project site for CrystalEye. If we get approved we’ll run trackers there and I’ll ask Nick to upload the source in a spare moment when he’s sick of writing up.
    Egon – we haven’t set up a Taverna node.
    Have you considered hacking on my crystaleye harvester to get data from crystaleye?

  5. pm286 says:

    (all) There has been some offline correspondence as well. Nick and Jim may wish to add things.
    The present situation is that CrystalEye robots continue to collect data, convert it and index it. We are aware of a few bugs and plan to list them. However we do not plan bug fixes immediately.
    Nick developed CrystalEye as part of his thesis – he is now writing this up and only does further software work if it impacts on his thesis. The purpose of the thesis was – inter alia – to see how well calculations can reproduce crystal structures and so far none of the bugs impinge on that. We’re pleased to have them reported, of course and thank you all.
    Please note that in reporting bugs we wish to use unit tests to identify them so it will help if we have one or more specific instances with details (“Entry ddddd depicts stereochemistry X but the data show Y”). We then write a unit test which depends on the specific bug.
    Please also note that in a project with >1000000 data items it is not likely that there are problems – it is certain. The “id”s that authors give are often full of strange characters (and this causes many of our problems). Similarly there are chemical compounds that are beyond our anticipation. And there are simple data errors – a recent one I looked at had a PF8 anion (it was a disordered PF6).
    WRT multiple molecules per asymmetric unit (not actually a unit cell). Many crystals have several different molecules per asymmetric unit – a typical example might be copper sulfate Cu(OH2)4.H2O.SO4. Here the water of solvation is different from the other waters so needs to be explicit.
    In some cases the asymmetric unit has more than one identical molecule. Perhaps we should normalize to a single molecule, perhaps not. We have chosen not to. After all InChI was not designed to deal with aggregated systems.
    We have plans for further development of CrystalEye which includes changes to the software and which should be

  6. jat45 says:

    (5) The “identical” molecules in the asymmetric unit are sometimes sufficiently different that one will behave well when undergoing GAMESS geometry optimisations (i.e. the Hessian converges etc) whilst the other molecule will not.
    Of course, the InChI for each molecule is identical.
