Archive for January, 2008
CML Blog will restart
Thursday, January 31st, 2008
Automatic assignment of charges by JUMBO
Thursday, January 31st, 2008
The charges in the structure are indeed wrong. There are two challenges… Why chemistry-rich RSS feeds matter… data minging,
The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services. Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)). Done? Checked it? You saw the problem, right? Good.
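A "simple agent monitoring the RSS feed" could look roughly like the sketch below. It is only an illustration: the feed content, the entry identifiers and the `has_title` check are hypothetical placeholders, not the real feed or the real pipeline. A useful agent would plug in chemistry-aware checks (a charge-balance test, say) in place of the trivial one shown here.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen_guids):
    """Return RSS 2.0 items whose <guid> has not been seen before."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None or guid in seen_guids:
            continue
        seen_guids.add(guid)
        fresh.append({"guid": guid, "title": item.findtext("title", default="")})
    return fresh


def run_checks(items, checks):
    """Apply every check to every item; collect (guid, message) for failures."""
    problems = []
    for item in items:
        for check in checks:
            message = check(item)
            if message:
                problems.append((item["guid"], message))
    return problems


def has_title(item):
    """Placeholder check: a real agent would inspect the chemistry instead."""
    return None if item["title"].strip() else "entry has no title"


# Hypothetical feed content, used only to exercise the functions above.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>example structure feed</title>
<item><guid>entry-1</guid><title>Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)</title></item>
<item><guid>entry-2</guid><title></title></item>
</channel></rss>"""

seen = set()
items = new_items(SAMPLE_FEED, seen)
problems = run_checks(items, [has_title])
print(problems)  # [('entry-2', 'entry has no title')]
```

Because the set of seen identifiers is kept between polls, running `new_items` again on an unchanged feed returns nothing, so the agent only reports on entries it has not met before.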
- for structures with more than one moiety (isolated fragment) in the structure it is formally impossible to know the charges if the author doesn’t give them. The authors can give them in _chemical_formula_moiety but they are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges).
- In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

The molecule itself is neutral. The easiest way is not to put any charges. Anything else is uncomfortable. We can have + charges on the N’s which is natural, but then there are 2 – charges on the Cu. That’s formally correct but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds but that’s not happy either. And this is simple compared with many metal structures.
What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges and that is what has happened. We’ll try to fix it.
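One way to catch this kind of slip is to check that the formal charges still add up to the known total once the metal is back. The sketch below is not the JUMBO code; the atom labels and the particular charge assignment (a fully ionic picture with one negative charge per carboxylate oxygen) are assumptions chosen only to illustrate the bookkeeping.

```python
def charge_mismatch(formal_charges, expected_total=0):
    """Return (sum of formal charges) minus the expected total charge."""
    return sum(formal_charges.values()) - expected_total


# Hypothetical labels: charges assigned to the ligands while the metal
# was detached (one negative charge on each carboxylate oxygen).
ligands_only = {"O1": -1, "O2": -1}

# Metal added back, but its charge forgotten.
forgotten = dict(ligands_only, Cu1=0)

# Metal added back with the Cu(II) charge restored.
restored = dict(ligands_only, Cu1=+2)

print(charge_mismatch(forgotten))  # -2, so the neutral molecule is flagged
print(charge_mismatch(restored))   # 0, consistent with a neutral molecule
```

A non-zero result does not say which atom is wrong, only that the structure as written cannot be the neutral molecule.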
But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…
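The d9 arithmetic, and the reason Fe is harder, can be made concrete with a small sketch. It assumes an idealised octahedral field and the textbook high-spin and low-spin filling patterns, which is a simplification of real coordination geometry. The function names are made up for this illustration.

```python
# Periodic-table group numbers for the two metals discussed here.
GROUP_NUMBER = {"Fe": 8, "Cu": 11}


def d_electrons(symbol, oxidation_state):
    """d-electron count of a transition-metal ion: group number minus charge."""
    return GROUP_NUMBER[symbol] - oxidation_state


def unpaired_octahedral(n_d, high_spin):
    """Unpaired electrons for a d^n ion in an idealised octahedral field."""
    if not 0 <= n_d <= 10:
        raise ValueError("d-electron count must be between 0 and 10")
    if high_spin:
        # Fill all five d orbitals singly before pairing.
        return n_d if n_d <= 5 else 10 - n_d
    # Low spin: fill the three lower orbitals completely first.
    if n_d <= 3:
        return n_d
    if n_d <= 6:
        return 6 - n_d
    if n_d <= 8:
        return n_d - 6
    return 10 - n_d


cu = d_electrons("Cu", 2)
fe = d_electrons("Fe", 2)
print(cu, unpaired_octahedral(cu, True), unpaired_octahedral(cu, False))  # 9 1 1
print(fe, unpaired_octahedral(fe, True), unpaired_octahedral(fe, False))  # 6 4 0
```

For Cu(II) the answer is one unpaired electron whichever way the orbitals are filled. For Fe(II) it is four or zero depending on the spin state, and that is exactly the information that usually only appears in the text of the paper.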
What if chemistry data had been open?
Wednesday, January 30th, 2008
History the first… [...historical details snipped...]
Imagine a world with no GenBank, no PDB, no SwissProt, and no culture growing out of these of publicly funded freely available databases of biological information like Brenda, KEGG, etc etc. Would we still be living in the 90s, the 80s, or even the 70s compared to where we have got to?
History the second…
In the second half of the twentieth century synthetic organic chemistry went through an enormous technical revolution. … There was tremendous excitement as people realised that virtually any molecule could be made, if only the methodology could be figured out. Diseases could be expected to fall as the synthetic methodology was developed to match the advances in the biological understanding. The new biological databases were providing huge quantities of information that could aid in the targeting of synthetic approaches. However it was clear that quality control was critical and sharing of quality control data was going to make a huge difference to the rate of advance. So many new compounds were being generated that it was impossible for anyone to check on the quality and accuracy of characterisation data. So, in the early 80s, taking inspiration from the biological community a coalition of scientists, publishers, government funders, and pharmaceutical companies developed public databases of chemical characterisation data with mandatory deposition policies for any published work. Agreed data formats were a problem but relatively simple solutions were found fast enough to solve these problems…. [...] Ok. Possibly a little utopian, but my point is this. Imagine how far behind we would be without GenBank, PDB, and without the culture of publicly available databases that this embedded in the biological sciences. And now imagine how much further ahead chemical biology, organic synthesis, and drug discovery might have been with NMRBank, the Inhibitor Data Bank…
PMR: If only. And what makes it even more poignant is that in the 1970s the AI community developed many of their approaches around chemistry. DENDRAL, LHASA, etc. Years ahead of their time. But most AI relies on real-world knowledge and the chemists closed this and starved the efforts. Still we now know a lot of things that do and don’t work in CompSci. So as we start to prise chemistry data out of the silos we should be able to move very quickly…
CrystalEye RSS
Wednesday, January 30th, 2008

This is immediate – Nick’s robots determine when a new issue has come out and various publishers are talking to us about providing RSS feeds of new issues or new articles (talked with BMC and IUCr yesterday). It makes it a lot simpler.
The possibilities are enormous. All the information is semantic and can be turned into RDF. Andrew Walkingshaw has done this and in a later post he or I will show how to search for information contained in the CIF files. If you are only interested in Cu-N bonds there is a special feed exactly for that purpose.
APE2008 thoughts on domain repositories
Wednesday, January 30th, 2008
APE2008 more thoughts
Wednesday, January 30th, 2008
- role of librarians
- beyond the full-text
- legal and contractual stuff
Chemistry Repositories
Wednesday, January 30th, 2008
- ease of putting things in. It doesn’t require a priesthood (as so many relational databases do). You should be able to put in a wide range of things – theses, molecules, spectra, blogs, etc. You shouldn’t have to worry about datatypes, VARCHARS, third normal forms, etc.
- it should also be easy to get things out. That means a simple understandable structure to the repository. And being able to find the vocabulary used to describe the objects.
- flexibility. Web 2.0 teaches us that people will do things in different ways. Should a spectrum contain a molecule or should a molecule contain a spectrum? Some say one, some the other. So we have to support both. Sometimes required information is not available, so it must be omitted and that shouldn’t break the system.
- interoperability. If there are several repositories built by independent groups it should be possible for one lot to find out what the others have done without mailing them. And the machines should be able to work this out. That’s hard but not impossible.
- avoid preplanning. RDBs suffer from having to have a schema before you put data in. Repositories can describe a basic minimum and then we can work out later how to ingest or extract.
- power is more important than performance (at least for me.) I’d rather take many minutes to find something difficult than not be able to do it. When I started on relational databases for molecules it took all night to do a simple join. So everything is relative…
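The flexibility and no-preplanning points can be illustrated with a toy store of subject-predicate-object statements, in the spirit of RDF. This is a sketch under stated assumptions, not a description of any existing repository software: the `TripleStore` class, the identifiers and the predicate names are all invented for the example.

```python
from collections import defaultdict


class TripleStore:
    """Schema-free store of (subject, predicate, object) statements."""

    def __init__(self):
        self._by_subject = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record one statement, indexed from both ends."""
        self._by_subject[subject].add((predicate, obj))
        self._by_object[obj].add((predicate, subject))

    def related(self, node):
        """Everything linked to node, whichever way the link was written."""
        outgoing = {obj for _, obj in self._by_subject.get(node, ())}
        incoming = {subj for _, subj in self._by_object.get(node, ())}
        return outgoing | incoming


store = TripleStore()
# One depositor says the spectrum contains the molecule...
store.add("spectrum:1", "hasMolecule", "molecule:A")
# ...another says the molecule contains the spectrum.
store.add("molecule:B", "hasSpectrum", "spectrum:2")

print(store.related("molecule:A"))  # {'spectrum:1'}
print(store.related("molecule:B"))  # {'spectrum:2'}
print(store.related("molecule:C"))  # set(), missing data does not break anything
```

No schema is declared before the data goes in, both containment conventions are answered by the same query, and asking about something that was never deposited simply returns nothing.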
Semantic Chemical Computing
Wednesday, January 30th, 2008
Big Science and Long-tail Science
Tuesday, January 29th, 2008
The main business was digital libraries, repositories, Open publishing, etc. It’s clear how CERN with its mega-projects (“big science”) can manage ventures such as the SCOAP3 Open Access publishing venture. And the community will need somewhere to find the publications – so that is where repositories come in. There is no question that High-energy physics (HEP) needs its own domain repository. The coherence, the specialist metadata, the specialist data for re-use. HEPhysicists will not go to institutional repositories – they have their own metadata (SPIRES) and they will want to see the community providing the next generation. And we found a lot of complementarity between our approaches to repositories – as a matter of necessity we have had to develop tools for data-indexing, full-text mining, automatic metadata, etc.
But where do sciences such as chemistry, materials, nanotech, condensed matter, cell biology, biochemistry, neuroscience, etc. etc. fit? They aren’t “big science”. They often have no coherent communal voice. The publications are often closed. There is a shortage of data. But there are a LOT of them. I don’t know how many chemists there are in the world who read the literature but it’s vastly more than the 22,000 HEP scientists. How do we give a name to this activity? “Small science” is not complimentary; “lab science” describes much of it but is too fixed to buildings. Jim Downing came up with the idea of “Long Tail Science”. The Long Tail is the observation that in the modern web the tail of the distribution is often more important than the few large players. Large numbers of small units is an important concept. And it’s complimentary and complementary. So we are exploring how big science and long-tail science work together to communicate their knowledge.
Long-tail science needs its domain repositories – I am not sanguine that IRs can provide the metalayers (search, metadata, domain-specific knowledge, domain data) that are needed for effective discovery and re-use. We need our own domain champions. In bioscience it is provided by PubMed. I think we will see the emergence of similar repositories in other domains. I am on the road a lot so the frequency (and possibly intensity) of posts may decrease somewhat…
APE2008 – ARNE RICHTER: EGU and JACP
Sunday, January 27th, 2008
- authors compile entire work in digital
- all software free of charge
- camera-ready[*] - if publishers provide macros, then it can be compiled into journal style
- servers have customised XML files
- upload to all archives, databases, etc.
- process moves from author’s client to publisher’s server… NO classical peer-review.
- referees and public comments are published in Open Access form alongside discussion. So whole world does peer-review.
- NO SECRETS. Avoids referees hiding behind curtains
- manuscripts are higher quality and cause at least 50% less work
- ENTIRE process can be handled by internet control
- If author cannot master the technology, they can pay the publisher