Monthly Archives: January 2008

CML Blog will restart

There has been a long hiatus on the CML blog but I am now convinced it is the best way to discuss the general topics on CML and to leave cml-discuss for more technical ones. I shall make cross references from this blog for a few weeks so that those wanting CML can switch or add feeds.

For the present I would welcome general questions about CML to act as topics to spark discussion. Please post to the CML blog in the comments section.

More later

Automatic assignment of charges by JUMBO

Egon has spotted a bug in our code for assignment of charges to atoms:

Why chemistry-rich RSS feeds matter... data minging,

The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services.

Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)).

Done? Checked it? You saw the problem, right? Good.

The charges in the structure are indeed wrong. There are two challenges...

  • for structures with more than one moiety (isolated fragment) in the structure it is formally impossible to know the charges if the author doesn't give them. The authors can give them in _chemical_formula_moiety but they are often difficult to parse correctly and in any case they often aren't given. In those cases we don't try to assign charges. (The crystallographic experiment itself cannot determine charges).
  • In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it's usually impossible to do a good job. The molecule in question is:

Summary page for crystal structure from DataBlock I in CIF xu2383sup1 from article xu2383 in issue 2008/01-00 of Acta Crystallographica, Section E.


The molecule itself is neutral. The easiest way is not to put any charges. Anything else is uncomfortable. We can have + charges on the N's which is natural, but then there are 2 - charges on the Cu. That's formally correct but since the metal is usually described as Cu(II) it's not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds but that's not happy either. And this is simple compared with many metal structures.

What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it's easy to forget the charges and that is what has happened. We'll try to fix it.
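A minimal sketch of the kind of sanity check that would catch a forgotten metal charge. This is not JUMBO's code: the `Atom` record, `net_charge` and `check_charge_balance` are hypothetical names, and the formal charges follow just one of the possible bookkeeping conventions (anionic carboxylate oxygens with Cu as 2+).

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical atom record: element symbol plus formal charge."""
    element: str
    formal_charge: int = 0


def net_charge(atoms):
    """Sum of the formal charges over all atoms."""
    return sum(a.formal_charge for a in atoms)


def check_charge_balance(atoms, expected_total=0):
    """True if the formal charges add up to the expected molecular charge."""
    return net_charge(atoms) == expected_total


# Only the atoms relevant to the charge bookkeeping are listed here.
# Assumption: each ligand carries one anionic carboxylate oxygen.
ligand_atoms = [Atom("O", -1), Atom("N"), Atom("O", -1), Atom("N")]

# Metal added back without its charge (the bug) versus with it restored.
buggy = ligand_atoms + [Atom("Cu", 0)]
fixed = ligand_atoms + [Atom("Cu", +2)]

print(net_charge(buggy), check_charge_balance(buggy))
print(net_charge(fixed), check_charge_balance(fixed))
```

A check like this only says that something is inconsistent with a neutral molecule; it cannot say which of the possible charge assignments is the chemically sensible one.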


But in the end the only thing that matters is the total electron count and the spin state (which normally isn't given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it's virtually impossible to do anything automatic. We'll probably simply leave the charges off...
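The electron counting above can be sketched as follows. This is an illustration only, assuming the textbook rule that the d-electron count of a transition-metal ion is its group number minus its oxidation state, and assuming an idealised octahedral field for the spin states; the function names are made up for this example.

```python
# Periodic-table group numbers for the two metals discussed.
GROUP = {"Fe": 8, "Cu": 11}


def d_electrons(element, oxidation_state):
    """d-electron count = group number - oxidation state."""
    return GROUP[element] - oxidation_state


def unpaired_octahedral(n_d, low_spin=False):
    """Unpaired electrons for a d^n ion in an idealised octahedral field."""
    if not low_spin:
        # High spin: occupy all five d orbitals singly before pairing.
        return n_d if n_d <= 5 else 10 - n_d
    # Low spin: fill the three t2g orbitals completely before the two eg.
    t2g = min(n_d, 6)
    eg = n_d - t2g
    unpaired_t2g = t2g if t2g <= 3 else 6 - t2g
    unpaired_eg = eg if eg <= 2 else 4 - eg
    return unpaired_t2g + unpaired_eg


for metal, ox in [("Cu", 2), ("Fe", 2), ("Fe", 3)]:
    n = d_electrons(metal, ox)
    print(metal, ox, "d%d" % n,
          "high spin:", unpaired_octahedral(n),
          "low spin:", unpaired_octahedral(n, low_spin=True))
```

For d9 the answer is the same either way, which is why Cu(II) is the easy case. For Fe the count depends on both the oxidation state and the spin state, neither of which can be read off the crystal structure alone.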


What if chemistry data had been open?

When people ask me for examples of why Open Data matters, I always refer them to the Openness of bioscience - or at least those parts close to the Central Dogma (DNA-> RNA->Protein->Structure->Function). All those parts are Open. You can get any information that can conceivably be shoehorned into some formal description (some can't but most can). Now Cameron Neylon has done a useful review of what we would have missed if the progenitors of bioinformatics had gone down the closed route (it nearly happened at the time of ESTs).

Picture this…

History the first…

[...historical details snipped...] Imagine a world with no GenBank, no PDB, no SwissProt, and no culture growing out of these of publicly funded freely available databases of biological information like Brenda, KEGG, etc etc. Would we still be living in the 90s, the 80s, or even the 70s compared to where we have got to?
History the second…

In the second half of the twentieth century synthetic organic chemistry went through an enormous technical revolution. ...

There was tremendous excitement as people realised that virtually any molecule could be made, if only the methodology could be figured out. Diseases could be expected to fall as the synthetic methodology was developed to match the advances in the biological understanding. The new biological databases were providing huge quantities of information that could aid in the targeting of synthetic approaches. However it was clear that quality control was critical and sharing of quality control data was going to make a huge difference to the rate of advance. So many new compounds were being generated that it was impossible for anyone to check on the quality and accuracy of characterisation data. So, in the early 80s, taking inspiration from the biological community a coalition of scientists, publishers, government funders, and pharmaceutical companies developed public databases of chemical characterisation data with mandatory deposition policies for any published work. Agreed data formats were a problem but relatively simple solutions were found fast enough to solve these problems....

Ok. Possibly a little utopian, but my point is this. Imagine how far behind we would be without Genbank, PDB, and without the culture of publically available databases that this embedded in the biological sciences. And now imagine how much further ahead chemical biology, organic synthesis, and drug discovery might have been with NMRBank, the Inhibitor Data Bank…

PMR:  If only. And what makes it even more poignant is that in the 1970's the AI community developed many of their approaches round chemistry. DENDRAL, LHASA, etc. Years ahead of their time. But most AI relies on real-world knowledge and the chemists closed this and starved the efforts.

Still we now know a lot of things that do and don't work in CompSci. So as we start to prise chemistry data out of the silos we should be able to move very quickly...

CrystalEye RSS

Nick Day's CrystalEye system can be thought of as an open, robotically managed, robotically quality-reviewed, data, overlay, "journal". It's not a conventional journal, but it ticks most of the boxes. And it publishes a new set of information each day. So I have subscribed to some of the RSS feeds on the site. There are many hundreds - you can be alerted by journal, by chemical category, by bond type, by quality, etc. And they come naturally into the Feedreader. Here's one from today. (It caught my eye because I worked on copper N,O chelates during my doctorate and it's one of my favourite elements.) Here's what the feed delivered. No frills, no adverts, no javascript. Just simple science:

Summary page for crystal structure from DataBlock I in CIF xu2383sup1 from article xu2383 in issue 2008/01-00 of Acta Crystallographica, Section E.

and here's what you get if you follow the link:



This is immediate - Nick's robots determine when a new issue has come out and various publishers are talking to us about providing RSS feeds of new issues or new articles (talked with BMC and IUCr yesterday). It makes it a lot simpler.

The possibilities are enormous. All the information is semantic and can be turned into RDF. Andrew Walkingshaw has done this and in a later post he or I will show how to search for information contained in the CIF files.  If you are only interested in Cu-N bonds there is a special feed exactly for that purpose.
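To give a flavour of how little code a feed-watching agent needs, here is a minimal sketch that filters RSS items by bond type. The feed content, item titles and example.org links are invented for illustration and are not the real CrystalEye feed format; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 fragment; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical structure feed</title>
    <item>
      <title>Structure A</title>
      <link>http://example.org/a</link>
      <description>Bonds: Cu-N, Cu-O, C-N</description>
    </item>
    <item>
      <title>Structure B</title>
      <link>http://example.org/b</link>
      <description>Bonds: C-C, C-O</description>
    </item>
  </channel>
</rss>"""


def items_mentioning(feed_xml, keyword):
    """Return (title, link) for every item whose description contains keyword."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        if keyword in description:
            hits.append((item.findtext("title"), item.findtext("link")))
    return hits


for title, link in items_mentioning(SAMPLE_FEED, "Cu-N"):
    print(title, link)
```

The same loop could just as easily run a consistency check on each new entry instead of a keyword match, which is the kind of simple monitoring agent that could flag problems as soon as they are published.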

APE2008 thoughts on domain repositories

I'm sitting waiting for about 1 million files to transfer from one laptop to another - in the Computer Officer hideout where we have really strong coffee. I tend to twitch about such transfers - rather like a hermit crab - but I can spend the time blogging about APE2008 (see earlier posts APE2008 more thoughts and recursive links from that).

The final session on the last day was about money. I didn't take notes (no battery left). My impressions were that some new journals can manage on considerably reduced costs - a few hundred dollars. Of course there isn't a one-size-fits-all - it's clear that when a journal rejects 90% of submissions its costs are somewhat higher than one with a high acceptance rate. However some publishers spend a lot of money on things that IMO don't merit it. For example marketing - I remember a figure of 30% (not sure what of) but certainly many domains won't need that. And tutorials for information products. Should we be needing tutorials on modern products? How many five-year-olds need teaching how to use Google? Or Facebook? If you want help, ask the family. So one message to "author-pays" models is "challenge the costs". I'm going to stop using "author-pays" and substitute "organisation pays". The organisation might be a university, a funder, a learned society, a national organisation (e.g. JISC), the publisher themselves in hardship cases, and so on. Few authors pay, and shouldn't be expected to do so. This is clearly something that academia has to tackle for non-funded non-science subjects.

The next morning had a session:

Panel Discussion: What Matters? The Future Role of Libraries in Science and Society? Swallowed by OA Repositories, turned into University Presses or kept as Book Museums?

Here I have a problem. I appreciate that libraries have many roles and I'm a keen supporter. Guardianship of scholarship, preservation, access, etc. But this doesn't come across in science. I see librarians because I'm working on information-rich projects but if I didn't I wouldn't. How many PhD chemistry students will come to the library? (We have a lovely library in our building, funded by Unilever, and students like working there because it's quiet. But we wouldn't build the same facility today. And Henry tells me that Imperial has closed its departmental library. They have a nice quiet work area - with terminals - but it's not a library.) Librarians cannot make a new role out of being super-purchasing and contract officers for information - scientists neither see nor care. So I challenged the panel with this and similar points.

Science and technology move so fast that none of us can keep up. Subject librarians trained on the classical model cannot provide what scientists need. The bioscientists look to PubMed, EBI, PDB, etc as the repositories of knowledge - not to their institutions. What they need are information scientists embedded in their laboratories. People who know how to hack perl, python, Java, XML, RDF, RSS, etc. Where the flow of meta-information is from the scientist to the information scientists as well as the other way round. It's a tall order. But the average 18-year old does not look in a library for scientific information - they look to Google and Wikipedia (which is why I contribute when I can find time).

These views are reinforced by what the bioscientists and physicists are doing. They create domain repositories. They have large national or international organisations which are beneficent and wish to oversee the free movement of scientific information. With bio- it's PubMed and PubChem, NCBI, PDB, EBI, etc. and with physics it's arXiv and SCOAP3. These are domain repositories and that's what we critically need.

I can see that certain primary research will naturally go to IRs - mandated fulltext, theses, etc. But  many will see Pubmed and SCOAP3 as the primary places, not their institution. Even where the material is in IRs we need domain metadata tools to extract it properly. (How do you look for a sequence in your IR? a chemical substructure? a spectrum? a partial differential equation?) The problem will be solved in big science. But in long-tail science we need global or national domain repositories and we need departmental repositories for the initial capture. If there are embedded information scientists then that is one of the first things they can be doing to help the community.

... still a few hundred thousand files to go (these are all part of our molecular repository effort). Why's it on the laptop? Because it fits quite well on planes and trains...

APE2008 more thoughts

Because there was no electricity and wireless at the APE meeting (APE 2008) I took some notes, but they seem rather dry now and have lost some of the immediacy. So I shall use the meeting to catalyze some thoughts.

Michael Mabe - CEO STM - gave a useful presentation about facts in publishing, but they don't read well in this blog a week later.  The growth of publishing is not new - ca 3.5 percent for the last 300 years. So it's a good thing that we've gone digital or the whole world will be drowned in the Journal Event Horizon (cf Shoe Event Horizon). 1 million authors, over 1 billion article downloads. The primary motive of authors is to disseminate their ideas (it's reassuring to know that as we can plan new ways of doing it).

An afternoon session, with some snippets (rather random):

Ulrich Poeschel, Mainz,
in "bad papers" the main problem is carelessness, not fraud, etc.
overly superficial reports of experiment, non-traceable arguments
He described interactive/dynamic publishing where review has several stages (but I can't remember his journal - maybe it was Atmospheric Chemistry and Physics??)
Traditional peer review is not efficient today
editors and referees have limited capacity
too few editors and reviewers
traditional discussion papers are very rare - originally 1/20 papers were commented, now => 1/100
speed conflicts with thorough review
so develop speed first, review later
discussion paper = upper-class preprint (some pre-selection)
lengthier traditional peer-review later
referees can maintain anonymity - self-regulation works

rewards for well-prepared papers
most limiting factor is refereeing capacity
total rejection rate only ca 10%, so referees' effort is saved
deters careless papers - self-regulation through transparency
5 comments/paper - 1/4 papers get public comment
comment volume is 50% of publication

now #1 in atmospheric phys, #2 in geosciences

Catriona MacCallum: PLoS - ...
journals and discussion cannot capture discussions in ways that blogs do - blogs are self-selecting communities
TOPAZ is open source publishing software - makes connections between all components of publishing systems - blogs, documents, data, services ...

Linda Miller, Nature.

Purpose of peer-review is to decide where paper should be published
protects public (e.g. health and large policy)
avoids chasing spurious results
quality of review is decreasing

open trial for commenting on (PMR: I think) regular Nature papers.

12% of regular authors (PMR: I assume this is in Nature) accepted comment trial - mainly earth, eco, evo, physics
half papers had comments
comments' average score was 2/5 (i.e. comments weren't very good)
no chemistry, genomics
low awareness of trial

why do people not comment? Overwork - no incentive?

so we need:
motivation for PR
stable identifier for reviewer
high ratings on pubmed
checks and balances on retributions
critical mass of submissions

referees need to get credit
need to develop online reputation score
CVs should include this

Change is inevitable - except from a vending machine (Robert C Gallagher)

My next thoughts will hopefully include:

  • role of librarians
  • beyond the full-text
  • legal and contractual stuff

Chemistry Repositories

Richard Van Noorden - writing in the RSC's Chemistry World - has described the eChemistry repository project, Microsoft ventures into open access chemistry. This is very topical as Jim Downing, Jeremy Frey, Simon Coles and I are off to join the US members of the project at the weekend. It's exciting, challenging, but eminently feasible. So what are the new ideas?

The main theme is repositories. Rather a fuzzy term and therefore valuable as a welcoming and comforting idea. Some of the things that repositories should encourage are:

  • ease of putting things in. It doesn't require a priesthood (as so many relational databases do). You should be able to put in a wide range of things - theses, molecules, spectra, blogs, etc. You shouldn't have to worry about datatypes, VARCHARS, third normal forms, etc.
  • it should also be easy to get things out. That means a simple understandable structure to the repository. And being able to find the vocabulary used to describe the objects.
  • flexibility. Web 2.0 teaches us that people will do things in different ways. Should a spectrum contain a molecule or should a molecule contain a spectrum? Some say one, some the other. So we have to support both. Sometimes required information is not available, so it must be omitted and that shouldn't break the system.
  • interoperability. If there are several repositories built by independent groups it should be possible for one lot to find out what the others have done without mailing them. And the machines should be able to work this out. That's hard but not impossible.
  • avoid preplanning. RDBs suffer from having to have a schema before you put data in. Repositories can describe a basic minimum and then we can work out later how to ingest or extract.
  • power is more important than performance (at least for me). I'd rather take many minutes to find something difficult than not be able to do it. When I started on relational databases for molecules it took all night to do a simple join. So everything is relative...

The core to the project is the ORE - Object Re-use and Exchange (ORE Specification and User Guide). A lot of work has gone into this and it's been implemented at alpha, so we know it works. ORE is quite a meaty spec, but Jim understands it. Basically the repositories can be described in RDF and some subgraphs (or additional ones) are "named graphs" (e.g. Named Graphs / Semantic Web Interest Group) which are used to describe the subsets of data that you may be interested in. There is quite a strong constraint on naming conventions and you need to be well up with basic RDF. But then we can expect the power of the triple stores to start retrieving information in a flexible way. (As an example Andrew Walkingshaw has extracted 10 million triples from CrystalEye and shown that these can be rapidly searched for bibliographic and other info). Adding chemistry will be more challenging and I'm not sure how this integrates with RDF - but this is a research project. Maybe we'll precompute a number of indexes. And, in principle, RDF can be used to search substructures but I suspect it will be a little slow to start with.
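For readers new to triples and named graphs, the idea can be sketched with a toy in-memory store. This is purely illustrative: it is not ORE, not a real triple store, and not the CrystalEye data; every identifier and predicate below (`entry:1`, `ex:hasBond`, the graph names) is invented for the example.

```python
# Each named graph is a set of (subject, predicate, object) triples.
# All identifiers are hypothetical.
GRAPHS = {
    "graph:bibliography": {
        ("entry:1", "dc:title", "A copper(II) chelate"),
        ("entry:1", "dc:source", "Acta Crystallographica, Section E"),
        ("entry:2", "dc:title", "An organic salt"),
        ("entry:2", "dc:source", "Acta Crystallographica, Section E"),
    },
    "graph:chemistry": {
        ("entry:1", "ex:hasBond", "Cu-N"),
        ("entry:1", "ex:hasBond", "Cu-O"),
        ("entry:2", "ex:hasBond", "C-N"),
    },
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def titles_with_bond(graphs, bond):
    """Join two named graphs: titles of entries that contain the given bond."""
    chem = match(graphs["graph:chemistry"], p="ex:hasBond", o=bond)
    subjects = {s for s, _, _ in chem}
    titles = match(graphs["graph:bibliography"], p="dc:title")
    return sorted(o for s, _, o in titles if s in subjects)


print(titles_with_bond(GRAPHS, "Cu-N"))
```

A linear scan like this obviously would not cope with millions of triples; real triple stores index the patterns, which is where the precomputed indexes mentioned above come in.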

But maybe not... In which case we shall have made a very useful transition.

Semantic Chemical Computing

Several threads come together to confirm we are seeing a change in the external face of scientific computing. Not what goes on inside a program, but what can be seen from the outside. Within simple limits what goes on inside need not affect what is visible. The natural way now for a program to interface with other programs and with humans is to use a mixture of XML and RDF. XML provides a vocabulary and a simple grammar; RDF provides the logic of the data and application.

The COST D37 group has just met in Berlin (I blogged the last meeting - COST D37 Meeting in Rome). COST is about interoperability in Comp Chem and it's proceeding by collaborative work to fit XML/CML into FORTRAN programs - at present Dalton and Vamp. We do this by exchange visits paid by COST, so we are looking forward to having visitors in Cambridge shortly.

It coincided roughly with Toby White's session at NeSC in Edinburgh  on how to fit XML/CML into FORTRAN using his FoX library. I look forward to hearing how he got on.

And then, on Friday, we had a group meeting including outside visitors where the theme was RDF. I was very impressed by what the various members of the group had got up to - five or six mini-presentations. Molecular repositories, chemical synthesis, polymers, ontologies, natural language and term extraction. Andrew Walkingshaw showed the power of Golem which combines XPath with RDF to make a very powerful search tool. We are grateful to Talis for making their RDF engine available and when I have some hard URLs I'll blog how this works.

The main message is that the new technologies work. Certainly well enough to support collections in the order of 100,000 objects with many triples (Andrew had ca 10 megatriples). We are also making great progress in extracting chemistry out of free text (PDF is still awful, so please let's have Word, or even better XHTML and XML). Or LaTeX. But in any case most of the toolset is now well prototyped. More later...

Big Science and Long-tail Science

Jim Downing and I were privileged to be the guests of Salvatore Mele at CERN yesterday and to see the Atlas detector of the Large Hadron Collider. This is a "wow" experience - although I "knew" it was big, I hadn't realised how big. I felt like Arthur Dent watching the planet-building in The Hitchhiker's Guide to the Galaxy. It is enormous. And the detectors at the edges have a resolution of microns. I would have no idea how to go about building it. So many thanks to Salvatore and colleagues. And it gives me a feeling of ownership. I shall be looking for my own sponsored hadron (I've never seen one). So this is "Big Science" - big in mass, big in spending, big in organisation, with a bounded community. A recipe for success.


CMS detector for LHC


The main business was digital libraries, repositories, Open publishing, etc. It's clear how CERN with its mega-projects ("big science") can manage ventures such as the SCOAP3 Open Access publishing venture. And the community will need somewhere to find the publications - so that is where repositories come in.

There is no question that High-energy physics (HEP) needs its own domain repository. The coherence, the specialist metadata, the specialist data for re-use. HEPhysicists will not go to institutional repositories - they have their own metadata (SPIRES) and they will want to see the community providing the next generation. And we found a lot of complementarity between our approaches to repositories - as a matter of necessity we have had to develop tools for data-indexing, full-text mining, automatic metadata, etc.

But where do sciences such as chemistry, materials, nanotech, condensed matter, cell biology, biochemistry, neuroscience, etc. etc. fit? They aren't "big science". They often have no coherent communal voice. The publications are often closed. There is a shortage of data.

But there are a LOT of them. I don't know how many chemists there are in the world who read the literature but it's vastly more than the 22,000 HEP scientists. How do we give a name to this activity? "Small science" is not complimentary; "lab science" describes much of it but is too fixed to buildings.

Jim Downing came up with the idea of "Long Tail Science". The Long Tail is the observation that in the modern web the tail of the distribution is often more important than the few large players. Large numbers of small units is an important concept. And it's complimentary and complementary.

So we are exploring how big science and long-tail science work together to communicate their knowledge. Long-tail science needs its domain repositories - I am not sanguine that IRs can provide the metalayers (search, metadata, domain-specific knowledge, domain data) that are needed for effective discovery and re-use. We need our own domain champions. In bioscience it is provided by PubMed. I think we will see the emergence of similar repositories in other domains.

I am on the road a lot so the frequency (and possibly intensity) of posts may decrease somewhat...


I'm not keeping up with the backlog of things I have brought away from APE 2008 Academic Publishing in Europe "Quality & Publishing" - I find it difficult to comment several days after the event (Please can conferences install wireless for everyone. Then you are likely to get bloggers telling the world about what is going on.).  So the actual content here is bitty as it's real-time phrases rather than more joined-up prose.

The second plenary was from ARNE RICHTER: European Geosciences Union. He concentrated on the success of their OA journal - J. Atmospheric Chemistry and Physics. [see my comments below]

Points from the talk...

DFG 2005 survey showed that what matters for scientists was:

International worldwide distrib
reputation (NOT impact factor) matters
topical focus (e.g. journals)
quality of peer-review
long-term availability
***low or zero cost

The internet is key and effectively drives scientific communication to Open Access. There is enormous benefit when everything is Open. It's the only realistic platform for digital info - full multimedia. The default frame is landscape on screen with offscreen fonts for printing, NOT double column PDF. Search engines are only possible with the Internet and support actions against plagiarism and IPR violations (OA is fundamental here). There can be customized online information systems about new publications. Decreasing prices for electronic equipment drive increased utilization.

OA supported by scientific organizations. We need worldwide repositories. These will be domain-specific, e.g. topical data bases and archives (e.g. space science). They provide long-term archiving.

New models of publishing process for authors and publishers:

  • authors compile entire work in digital
  • all software free of charge
  • camera-ready[*]  -  if publishers provide macros, then it can be compiled into journal style
  • servers have customised XML files
  • upload to all archives, databases, etc.

[* PMR "camera-ready" extends easily into semantic content and I think this is the more appropriate term. The idea  is that authors author in a natural fashion, not driven by the needs of the journal.]

Atmospheric Chemistry and Physics Discussions is a born-digital publication

Online peer review...

  • process moves from author's client to publisher's server... NO classical peer-review.
  • referees' and public comments are published in Open Access form alongside discussion. So whole world does peer-review.
  • NO SECRETS. Avoids referees hiding behind curtains
  • manuscripts are of higher quality and cause at least 50% less work
  • ENTIRE process can be handled by internet control
  • If author cannot master the technology, they can pay the publisher

** the final decision for acceptance is made by editor

This is a service model, low cost. ACP uses Copernicus. The better the software gets the cheaper the article. => 300 EUR or less.

The EGU has shown how learned societies can have a major role. The message I took away was that when you have an enthusiastic and competent learned society (or international union) which is committed to communication and the support of its discipline then this is the ideal medium. This was reinforced during the meeting - there seem to be significant costs in conventional closed access publishing which simply go away for OA - one example is licence management - another is access control.

PS: I'm already familiar with this journal as I collaborate with atmospheric chemists in Cambridge and Leeds and STFC on semantic models of chemical reactions. With Michael Kohlhase I have started to mark up a paper into content MathML and CML. This would mean that a machine could read the paper and work out the chemical rates and mechanisms. So praise for ACP Atmospheric Chemistry and Physics.

PPS not sure how much more I shall recover from the meeting - I'll probably write a mini-summary.