Monthly Archives: October 2006

Inorganic InChIs

Mark Winter - who has done an enormous amount to promote web-based chemistry such as WebElements - makes an important point:

  1. Mark Winter Says:
    October 18th, 2006 at 10:18 am
    OK - having carefully and rather too obviously written in InChI and SMILES strings in a story about ozone at, and being an inorganic chemist who might want to write about a few inorganic species, I wondered how to write strings for, say, metal coordination complexes like the salt [Cr(OH2)6]Cl3. This compound is listed at PubChem at

    but shows a nonsense structure, and not being a fluent InChI reader I therefore distrusted the InChI string on that page. I looked at the above mentioned carcinogenic potency database and found

    where again the chemical structure drawn is nonsense and so again I have little confidence in the InChI string on that page.

    So how does one proceed for such species?

The structure in PubChem (CrCl3.6H2O) does not accurately reflect our current knowledge of the compound (though it was probably OK in 1850). It should be Cr(OH2)6(3+).3Cl-. InChI does not have any built-in chemical knowledge and calculates what it is given. It sometimes points out potential valence errors (e.g. CH5) but since it is capable of representing unusual chemistry it doesn't throw actual errors. So this particular problem is PubChem's, not InChI's. (Note that there is a small fraction of errors of many sorts in PubChem - there is inconsistency in structural representation and some blatant errors. For those who like an amusing name, try CID: 27 and similar). PubChem accepts contributions from many places and does not check chemical "validity". (These problems are well addressed by social computing...)

There is a more difficult problem for compounds without an agreed connection table. How do we represent "glucose"? It can have an open form and four ring forms (furanose and pyranose, alpha and beta). Similarly "aluminium chloride" can be AlCl3, Al2Cl6 or Al3+.3Cl-, etc. InChI represents all of these faithfully but does not provide means of navigating between them. And coordination compounds may be represented differently by different humans - there is clearly no simple approach here.

But InChI takes a useful intermediate approach - it can disconnect the metal from the ligands. While this reduces the amount of information, it will provide better chances of finding isomers in a search - it should be fairly easy to sort them out.

Organic Theses: Hamburger or Cow?

This is my first attempt to see if a chemistry thesis in PDF can yield any useful machine-processable information. I thank Natasha Schumann from Frankfurt for the thesis (see below for credits).

A typical chemical synthesis looks like this (screenshot of PDF thesis).


For non-chemists this consists of name and number of compound, recipe for synthesis, structural diagram (picture), number and analytical data (Mass Spec, Infrared and Ultraviolet). This is a very standard format and style. The ugly passive ("To a solution of X was added Y") is unfortunately universal (cf. "To my dog was donated a bone by me"). The image is not easily deconstructed (note the badly placed label "1" and "+" making machine interpretation almost impossible - that is where we need InChIs).

I then ran PDFBox on the manuscript. This does as good a job as can be expected and produces the ASCII representation.

This is not at all bad (obviously the diagram is useless) and the greek characters are trashed but the rest is fairly good. I fed this to OSCAR1; it took about 10 seconds to process the whole thesis. You can try this as well!

OSCAR has parsed most of the text (obviously it can't manage the diagram labels but the rest is sensible). It has extracted much of the name (fooled a bit by the greek characters) and pulled out everything in the text it was trained to find (nature, yield, melting point). It cannot manage the analytical data because the early OSCAR only read RSC journals, but OSCAR3 will do better and can be relatively easily trained to manage this format.
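To give a feel for this kind of extraction, here is a minimal sketch in Python. It is not how OSCAR works internally; the sample paragraph, the three patterns and the `extract` function are all made up for illustration, and a real tool would need far more variants than these.

```python
import re

# Made-up experimental paragraph in the usual thesis style (not taken from the thesis above).
SAMPLE = (
    "To a solution of X (1.0 g) in THF was added Y. "
    "The product was obtained as a yellow oil. Yield: 85%. "
    "A second crop gave white crystals, m.p. 112-114 °C."
)

# Hypothetical patterns, chosen only to fit the sample text.
YIELD_RE = re.compile(r"[Yy]ield:?\s*(\d+(?:\.\d+)?)\s*%")
MP_RE = re.compile(r"[Mm]\.?\s?p\.?:?\s*(\d+)\s*(?:-\s*(\d+))?\s*°\s*C")
NATURE_RE = re.compile(r"as an? ((?:\w+ )?(?:oil|solid|powder|crystals))")


def extract(text):
    """Return a dict of the fields found in one experimental paragraph."""
    found = {}
    m = YIELD_RE.search(text)
    if m:
        found["yield_percent"] = float(m.group(1))
    m = MP_RE.search(text)
    if m:
        low = int(m.group(1))
        # A single temperature is stored as a zero-width range.
        high = int(m.group(2)) if m.group(2) else low
        found["melting_point_c"] = (low, high)
    m = NATURE_RE.search(text)
    if m:
        found["nature"] = m.group(1)
    return found


if __name__ == "__main__":
    print(extract(SAMPLE))
```

Once PDFBox has trashed the degree sign or the greek letters, patterns like these stop matching, which is one reason the quality of the PDF matters so much.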

So first shots are better than I have got in the past. OSCAR found data for 40 compounds - ca. 4 per second. Assuming that there are many similar theses there is quite a lot it can do. But not all have PDF that behaves this well...

Acknowledgement (from PDF)
Chiral Retinoid Derivatives:
Synthesis and Structural Elucidation of a New Vitamin A Metabolite

Von der Fakultät für Lebenswissenschaften der Technischen Universität Carolo-Wilhelmina zu Braunschweig zur Erlangung des Grades einer
Doktorin der Naturwissenschaften (Dr. rer. nat.) genehmigte
D i s s e r t a t i o n von Madalina Andreea Stefan aus Ploiesti (Rumänien)

Presentation to Open Scholarship 2006

I am presenting this "talk" from the Web and including parts of my blog. This means I have to decide what I think I am going to say before I do or don't say it. You know by now what I think of PDF and Powerpoint. This talk is in HTML and can be trivially XMLised robotically. It should be preservable indefinitely.

==== what I intend to cover ====

Data as well as text is now ESSENTIAL - we should stop using "full-text" as it is dangerously destructive in science. "PDF" is an extremely effective way of doing this. We need compound documents (Henry Rzepa and I have coined the term datument).

Need automated, instant, access to and re-use of millions of published digital objects. The Harnad model of self-archiving on individual web pages with copyright retained by publishers is useless for modern robotic science.

Much scientific progress is made from the experiments of others by making connections, simulations, re-interpretation. We need semantic authoring. Librarians must support the complete publication process.


  • apathy and lack of vision - scientists (especially chemists) need demonstrators before people take us seriously
  • restrictive or FUDdy IPR. Enormously destructive of time and effort
  • emphasis on visual rendering rather than semantic content. Insidiously dangerous
  • broken economic model (anticommons)


Other initiatives:

  • SPARC - Open Data mailing list

What must be done


=======Previous posts and related blogs======

Open Data - the time has come

Open Source, Open Data and the Science Commons

Is "peer-review" holding back innovation?


Blogging and the chemical semantic web the blogs

My data or our data?

Science Commons

Science Anticommons

Hamburger House of Horrors

Horrible GIFS

Hamburgers and Cows - the cognitive style of PDF

Thanks (and XML value chain)

The cost of decaying Scientific data

OSCAR - the chemical data reviewer

Linus' Law and community peer-review
============= Live demos =========


OSCAR1 (applet version)
OSCAR3 (local demo)
Crystallography (not yet released)


DSpace (individual molecule)
chemstyle (needs MSIE)

===== what I actually said ====

Many thanks to William for recording all the talks and I am delighted to have this record made available. (I have not yet discussed copyright but I hope it can go in our repository :-)

Is "peer-review" holding back innovation?

As part of my talk at Open Scholarship I'm going to show two pieces of scholarly work of which I am proud, which I believe fit all the criteria of publication and for which I get no formal credit. (I also regard this blog as a scholarly work, and also get no credit)...

The first is an invited talk at Google. (Yes, I can claim some minor formal credit for an invited talk, but probably not to a company!) This was videoed and has received 1727 downloads and 12 5-star ratings. (Of course some of this may be done by robots or my friends, and probably some of them only watch the first few minutes, but there must be some serious viewers). It has everything a scientific publication requires:

  • accessible
  • peer-reviewable
  • formal record
  • re-usable
  • archivable (When I have time I'll put it in our DSpace...)

The second is our WorldWideMolecularMatrix (WWMM). This is an evolving system for open access to the world's molecules and properties and as part of it we have put 175,000 objects in the Cambridge DSpace. But it has never been formally published in a full paper. That's partly because it's not finished and partly because everyone can see it. Why publish it?

But it has been peer-reviewed! Someone - I have no idea whom - started a Wikipedia entry. I'm naturally proud of this. The entry quotes extensively from the talk I gave at OAI4 in 2005 at CERN ("CERN Workshop on Innovations in Scholarly Communication (OAI4)" ) (Video). Joanne Yeomans recorded this talk as a video and this has - I gather - been regularly accessed. Again it has most of the features of a publication - but I can't get any formal credit for it.

So to the current UK Research assessment exercise (RAE) - 4 citable papers in peer-reviewed journals does not allow for this type of innovation in scholarly publishing. Should I abandon the new approaches and concentrate on paper? It's what the management would like...

Open Scholarship 2006 - 2

My colleague and DSpace superguru Jim Downing has also blogged parts of the meeting:

These are some impressions of the Open Scholarship meeting so far... Some are notes, so it may be a bit jerky in places. I shan't blog all talks.

IRs have made massive progress in last year. Hundreds even thousands of institutions now have them. There are commercial technology offerings and commercial hosting services.

Stephen Pinfield (Nottingham) reviewed progress - 250 repos (2004) 790+ (2006). 12 million records worldwide. Self-archiving has become common and recently - catalysed by Wellcome Trust - journals have moved towards hybrid publishing. He emphasised the bit-by-bit nature of progress "We overestimate the importance of short-term change, and underestimate the significance of long term change" (after John Kay). Even publishers are starting to take OA axioms on board. Challenges:
* Cultural change - the biggest problem. The "awareness" problem is being solved. But lack of incentives for *individuals* - they accept the idea intellectually, but...
* Practical support - still not easy enough. Must be drag and drop, self-archiving by proxy
* IR and institutional strategy - IR must be part of institutional policy - so IR managers must engage with *whole research process*, not just dissemination. Promote the institution, liaise with industry...
* discipline differences. Ginsparg believes all will converge on repository model, but others believe we have to have different models for different disciplines (I believe this - PeterMR). Early adoption happened in specific domains.

* Is self-archiving publication? Publication is now becoming a process, not an event (I shall show this in my presentation - PeterMR).
* versioning. "version of record"? "self-published"
* quality control. Current IRs are quality neutral - but quality flagging is essential. Not homogeneous within single IR.
* Metadata - cannot be worldwide agreement. but need standards and coordination

* standards - OA standards are community owned so still fluid
* digital preservation - which versions? Is institution responsible for preservation, or national agency
* IPR - who owns copyright is not clear. Institution? Author? we are still ducking the questions
* business models. costing and funding?

Don't yet have enough examples of good service providers.

Open access is NOT just about access - it is about USE. (Dear to my heart - PeterMR)

Institutional vs Subject? Shouldn't matter, but until we get better services it does. (Agreed - PeterMR. I need to know where to look for thousands of articles in a subject)

* OA but otherwise limited change (Harnad model). No reason for anything to change
* Hybrid business model - income from input (publication charge)... cf Wellcome
* Deconstruct the journal. Quality control does not have to be done by publisher
* Overlay - virtual journals draw from IR. Maybe quality at time of assembly
* multi-layered process - screen - IR - submit to peer-review - then mounted - dialogue etc. Citation could determine course of future research. Demise of journal article?
* fluid communication model (this is me - I shall show it in my talk - PeterMR)

Bill Hubbard
(Open DOAR - 797+ repositories).
Quality assessment of repositories - does it have data? is it OA? broken links? metadata-only sources?

2/3 have no metadata policy, harvesting policy, some forbid robot harvesting. most don't allow commercial re-use of metadata. We need clear policies and DOAR hopes to have machine-readable policies in a few months.

Authors must find what they want in repositories.

A lot of repositories are run on marginal costs - not easy to get strategic funding. Learned societies had the opportunity to create subject repositories but have failed to respond.

Open Scholarship 2006 - 1

I'm at the University of Glasgow - in the splendid castellated Hunter Halls - for the European meeting on Open Scholarship. There are over 200 delegates - a mixture of librarians, information technologists, research funders, etc. Hardly any publishers - Biomed Central (which also manages repositories) being an exception. The theme is "New Challenges for Open Access Repositories". I'll try to blog highlights.

Having worked for many years in a Scottish University (Stirling) I'm delighted to highlight the great progress and national coherence in Scottish Open Access. This was emphasised in the Opening Keynote by Derek Law from Strathclyde University. (Posts from this meeting may be a bit jerky as I am taking notes as we go)...

Scotland - "The best small country in the world". Small countries can aspire to national solutions. Scotland has a history of declarations of freedom (Arbroath 1320) and is disproportionally strong in research (12.5% on UK metrics vs 8% of population).

Why is Scottish government interested in Open Access? Scottish education is venerated and OA is seen as providing: wider access, better value, quality measures. And there is no Dept. Trade and Industry in Scotland (which in England/UK is heavily lobbied by publishers and slows down OA). So, IRs with the right metadata will create a quality resource to market Scottish Resources. Even 2 cabinet members understand what "metadata" means. Sharing resonates with government.

Scottish Science Information Strategy - Open Access thread has flourished (SLIC - Scottish Library Inf. Council). 2004 declaration of Open Access stresses

...also exposing Scottish research to rest of world. "publicly funded work must be luckily accessible".

Use the Research Assessment Exercise (RAE) as a tool for mandating deposit. Glasgow has nearly 3000 entries in its IR. Scottish IRI-S has 3 out of 10 of top UK repositories. There will be a Cream Of Science project (cf. the Dutch one).

The publishers of the future will be a new generation and only the bravest of the current ones will survive.

and ended with a modified Declaration of Arbroath...
"for so long as 100 of us are left alive we will yield in no way to Elsevier domination"

However Stirling was where I made the biggest mistake of my scientific life - I first signed a form transferring the copyright of my work to a publisher (I think Acta Crystallographica). Why, in the early 1970s, did no-one in the academic sector foresee the problems? A simple refusal by universities to hand over copyright would have forestalled the commercial publishing industry with its ownership and, worse, its power to direct scholarship. Why were librarians, senior editors and principals silent? Can we be sure that our continued inability to control our own scholarship is not leading us into an even worse future?

What are the advantages of XML and why should I care? (text)

This is an attempt to explain why XML is important in a scientific context. I shall try to assemble as many reasons as possible, but there are also many other tutorials and overviews.

I believe that XML is a fundamental advance in our representation of knowledge. It's not the first time this has been attempted - for example you can do anything in LISP that you can do in XML and a good deal more. But XML has caught on and is now found in every modern machine on the planet.

Let's start with a typical piece of information:

Pavel Fiedler, Stanislav Böhm, Jiří Kulhánek and Otto Exner, Org. Biomol. Chem., 2006, 4, 2003

How do we interpret what this means? We guess that there are 4 authors (although it is not unknown for people to have "and" in their names), that the italic string is the abbreviation of a journal, that 4 is a journal number. But what are "2006" and "2003"? Unless you know that the first number is the year and the third the starting page (see RSC site) you have to guess. And many of you would guess wrong.

If, however, this is created as:

<author>Pavel Fiedler</author>

<author>Stanislav Böhm</author>

<author>Jiří Kulhánek</author>

<author>Otto Exner</author>

<journal>Org. Biomol. Chem.</journal>




you can see that each piece of information is clearly defined. There is no reliance on position, formatting or other elements of style to denote what something means.
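As a rough illustration of what "clearly defined" buys you, here is a short Python sketch that reads such a citation by element name rather than by position. The `<author>` and `<journal>` elements are the ones shown above; the wrapping `<reference>` element and the `<year>`, `<volume>` and `<page>` names are my assumptions for this sketch, not a published schema.

```python
import xml.etree.ElementTree as ET

# <reference>, <year>, <volume> and <page> are hypothetical names used for illustration.
CITATION = """
<reference>
  <author>Pavel Fiedler</author>
  <author>Stanislav Böhm</author>
  <author>Jiří Kulhánek</author>
  <author>Otto Exner</author>
  <journal>Org. Biomol. Chem.</journal>
  <year>2006</year>
  <volume>4</volume>
  <page>2003</page>
</reference>
"""


def read_citation(xml_text):
    """Pull each field out by its element name; no guessing from position or italics."""
    root = ET.fromstring(xml_text.strip())
    return {
        "authors": [a.text for a in root.findall("author")],
        "journal": root.findtext("journal"),
        "year": int(root.findtext("year")),
        "volume": int(root.findtext("volume")),
        "page": int(root.findtext("page")),
    }


if __name__ == "__main__":
    citation = read_citation(CITATION)
    print(citation["authors"])
    print(citation["year"], citation["volume"], citation["page"])
```

The program never has to decide whether "2003" is a year or a page: the markup says so.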

But isn't this harder to create and read? If everything is done by a human, perhaps. But almost all XML documents are authored by machines, either from editors or as the result of a program. And the good news is that the style - the italics, etc. - can be automatically added. XSLT allows very flexible and precise addition of style information through stylesheets.

So it won't surprise you that publishers actually create their content in XML. When you submit a Word or LaTeX document it gets converted into XML - either by software (which isn't always perfect) or by retyping :-( . The final formatting - either as PDF or HTML - can be done automatically by applying different stylesheets. So the document process is:

XML + PDFstylesheet -> PDF

XML + HTMLStylesheet -> HTML

The stylesheets don't depend on the actual document being processed and work for any instance. Of course it takes some work and care to create them, but most of you don't need to worry.
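The separation of content and style can be sketched in a few lines of Python. Real publishing pipelines would use XSLT stylesheets; here two plain Python functions stand in for the stylesheets, purely to show that one XML document can be rendered in several ways without being changed. The `<reference>` wrapper and the function names are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET
from html import escape

# One piece of content; the <reference> wrapper is a hypothetical name.
DOC = (
    "<reference>"
    "<author>Pavel Fiedler</author>"
    "<author>Otto Exner</author>"
    "<journal>Org. Biomol. Chem.</journal>"
    "</reference>"
)


def html_style(ref):
    """Stand-in for an HTML stylesheet: journal in italics, authors comma-separated."""
    authors = ", ".join(escape(a.text) for a in ref.findall("author"))
    journal = escape(ref.findtext("journal"))
    return f"<p>{authors}, <i>{journal}</i></p>"


def text_style(ref):
    """Stand-in for a plain-text stylesheet: no markup at all."""
    authors = "; ".join(a.text for a in ref.findall("author"))
    return f"{authors}. {ref.findtext('journal')}"


def render(xml_text, stylesheet):
    """Apply any stylesheet function to any document instance."""
    return stylesheet(ET.fromstring(xml_text))


if __name__ == "__main__":
    print(render(DOC, html_style))
    print(render(DOC, text_style))
```

Neither function knows anything about this particular citation, so both work on any document with the same element names.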

So for anyone working with documents, XML allows the content to be stored independently of the style. That's a great advantage also when it comes to preservation and archival. Because XML is standard, Open, ASCII, etc. it doesn't suffer from information loss when it is moved from one machine to another (how many of you have lost newline characters when going from Windows to Mac to Zip to Word, etc.?) It's possible to calculate a digital checksum for a canonical XML document so any corruption can be immediately spotted.
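Here is a small sketch of the checksum idea, assuming Python 3.8 or later (which provides `xml.etree.ElementTree.canonicalize`, an implementation of Canonical XML 2.0). The `volume` and `year` attributes and the choice of SHA-256 are mine, for illustration only.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_checksum(xml_text):
    """Hash the canonical form, so harmless serialisation differences do not matter."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same information, different attribute order, quoting and spacing inside the tag.
A = '<journal volume="4" year="2006">Org. Biomol. Chem.</journal>'
B = "<journal year='2006'   volume='4'>Org. Biomol. Chem.</journal>"
# Genuinely different content: one digit of the year has been corrupted.
C = '<journal volume="4" year="2003">Org. Biomol. Chem.</journal>'

if __name__ == "__main__":
    print(canonical_checksum(A) == canonical_checksum(B))  # same content
    print(canonical_checksum(A) == canonical_checksum(C))  # corrupted content
```

Hashing the raw bytes instead would flag A and B as different even though they carry identical information, which is why the canonical form matters.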

There are a number of other aspects. Notice that the second and third authors have diacritic marks in their names. XML supports a very wide range of encodings and character sets so is an international specification.
In later posts I'll show the power of XML for validation, how software can be applied and how data can be structured. Please feel free to add comments or questions.

What are the advantages of XML and why should I care? (0)

As I have blogged before we are looking at ways of improving the information infrastructure in our Centre. We're all very conscious of how little we know - I know I know very little and I'm quite prepared to admit it in public. Ignorance per se is not a crime - only wilful ignorance. As part of the process we created some self-help groups and the first feedback is that they would like a set of FAQs for a wide variety of questions. Remembering that this is a group of 40+ molecular informatics scientists I'll post some of the questions on an occasional basis. Because others can contribute to this blog maybe we'll build some communal FAQs...

So I cannot resist "What are the advantages of XML and why should I care?". I've invested several years of my life in developing XML, and layering Chemical Markup Language (CML) on top of it. So it's very dear to my heart. This post won't answer the general question directly so there will be more.

I got introduced to Markup Languages after WWW2. At WWW1 (1994) it was clear that HTML had succeeded very well with text and graphics but that more formality was required for other science disciplines. Recall that the early web was about science, not commerce and although TimBL saw the commercial potential, it was low key at that time. So Henry Rzepa went off to WWW2 and came back saying that people were talking about "something called SGML". It was also clear that CERN (where TBL developed HTML) was strong on SGML and that it could support complex documents. I had been struggling for several years with the need to formalize chemistry into a component-based system and with SGML this seemed possible. You could create your own tags for whatever you liked as long as you defined them formally (with a DTD).

So I created some sample CML documents with my own tags. That was the easy bit. The DTD (which defined the language) was harder but possible. The real difficulty was actually doing anything useful with SGML. You could read it and... agree it was correct... and send it to other people... but it didn't do anything. Why would chemists use it? At that stage they wouldn't....

The users of SGML were somewhat esoteric groups. Typical examples were the Text Encoding Initiative (a project to encode the world's literature in SGML). At the other end were creators of aircraft maintenance manuals. (Although there were hints that SGML could be used for anything it was primarily used for text). The good news was that almost all major publishers of scientific articles used SGML in the production process.
I soon realised that to do anything useful - especially for chemistry - required procedural code. And there was very little. Some of it was extremely expensive - one company wanted $250K (sic) for a site license. The main clients were technical publishers - e.g. in aerospace. So I started to write my own system without any idea what I had got into. I found myself having to refer to "parents and children" of parts of documents - this seemed very strange to me at the time. I was extremely grateful to Joe English who developed a system called CoST and gave me huge amounts of virtual help. Joe, you were very patient - hope all is well! However there were a few pioneers of Open Source like Joe and IMO they saved the day for SGML and paved the way for XML. Top of the list is James Clark - whom I've never physically met - but who has underpinned much of XML with his code and ideas. His nsgmls system was the only code that had the power I required and which could transform the (potentially incredibly complex) SGML documents into something tractable.

So by 1995 I had a system which could represent chemistry in SGML and process it with a mixture of tcl/CoST and nsgmls. It had fairly advanced graphics (in tk) and could even do document analysis of sorts. At that stage (another story) I was converted to Java and effectively wrote a complete system for CML/SGML in Java. This had a simple DOM, a menuing system and a tree widget (in AWT!) and could hold a complete chemical document.

Then, in 1996, Henry pointed me at a small activity on the W3C pages called XML. (Actually Henry and I had already used "XML" as part of CML, but we surrendered the term). I got myself onto the working group and was therefore one of about 100 people who contributed to the development of XML.

When XML was first started it was "SGML on the Web". It wasn't expected to be important and it wasn't even on the front page of the W3C. As SGML was seen as complex and limited, XML wasn't really expected to flourish.

XML's success was due to the foresight and energy of a number of people, but especially Jon Bosak - the "father of XML". Jon worked on technical document management in Sun (I hope that's right) and saw very clearly that XML was part of the future of the Web. He coordinated the effort, got funding and political support, and I remember his pride in showing the back cover of the first draft of XML - "sponsored by Sun and Microsoft". This was a great technical and political achievement.

Tim Bray - another champion and parent of XML - writes:

"It is to Jon Bosak's immense credit that he (like many of us) not only saw the need for simplification [of SGML] but (unlike anyone else) went and hounded the W3C until it became less trouble for them to give him his committee than to keep on saying SGML was irrelevant."

It was supported by one of the largest and most active virtual communities. Henry and I offered to run the mailing list, XML-DEV, on which much of the planning and software development took place. By insisting on Open software as a primary means of verification the spec was kept to a manageable, implementable size. This meant that, unlike SGML, XML could be implemented by the DPH ("Desperate Perl Hacker"). And it was....

... the rest is history. XML has become universal. Jon (I think) described it as "the digital dial tone" - i.e. wherever information is being passed on the web it will increasingly be in XML.

So that explains why I care :-) . Next post I'll explain why you should also care.

Blogging and the chemical semantic web

This post will explain how chemically-aware blogs can be indexed and searched. If you're not a chemist, but still interested in the semantic web, this may be interesting.

I revealed in recent posts that molecules in blogs can be indexed on their chemical structure, thus making the web chemically semantic. (I use the lower-case version to show that we are not using the heavyweight Semantic Web (OWL, triples, etc.) but something much more akin to microformats.) Anyway the idea is simple...

For any document containing chemistry, mark up the compounds with the InChI tag that can be guaranteed unique for each of these. I'm going to concentrate on blogs, but the idea extends to any web document. (I'll exclude most chemical papers as they are generally closed and so we can only access them with subscriptions and often are prevented legally from the indexing below).

The main ways of adding InChI tags are:

  • persuade the author to do this when they create the post. Most of the current types of chemical software either create InChIs or create a file that can be converted into InChIs (e.g. with our WWMM services). With practice this would probably take 1-2 extra minutes per compound, especially if we can create a drag-and-drop InChIfication service at Cambridge or elsewhere. The InChI (which is simply a text string) can either be added to the blog or hidden in the alt attributes of the imgs for the chemical structures. Again fairly straightforward (though I have had to fight my editor). And I think we can expect blog tools to become semantic - at least for microformats - during the next months.
  • extract the structure from the blog and turn it into InChI. This is harder (unless the authors use a robust format such as CML or possibly SMILES). One way is to interpret chemical names as structures - we'll explain our work on this later. But semantic authoring is far better.
  • extract a known Open chemical ID from the site. Pubchem is the only realistic approach (it has ca. 6 million compounds); CAS numbers are closed and copyright so cannot be used. If we do this, then I would suggest the Pubchem entry is indexed like this "CID: 2519" . (This is very easily cut-n-pasted from the pubchem site). I am normally hesitant to use IDs but I think we can make an exception for Pubchem.
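For bloggers who generate their HTML with a script, the first and third options above might look something like this in Python. This is only a sketch: the function names and the image file name are hypothetical, and the InChI shown is the one for methane.

```python
from html import escape


def img_with_inchi(src, inchi):
    """Build an <img> tag that carries the InChI in its alt attribute."""
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    # escape() protects the attribute values in case they contain quotes or ampersands.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(inchi, quote=True)}">'


def cid_label(cid):
    """Format a PubChem compound ID in the 'CID: 2519' style suggested above."""
    return f"CID: {int(cid)}"


if __name__ == "__main__":
    # "methane.png" is a made-up file name.
    print(img_with_inchi("methane.png", "InChI=1S/CH4/h1H4"))
    print(cid_label(2519))
```

The InChI itself should always come from chemical software, never be typed by hand; the sketch only shows where it ends up on the page.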

A good example of an InChIfied site is: the Carcinogenic Potency Database (CPDB) at Berkeley which contains a list of chemicals with a typical entry which shows the InChI (scroll to bottom part of page). This site consistently gets good hits on Google when searched by the InChI string (try it at our GoogleInchI server).
So, this post is to suggest to chemical bloggers that they add InChIs to their blogs. There are about 15 blogs that seem to have enough chemistry to make this worthwhile (I've taken these from Post Doc Ergo Propter Doc ) and I'd be grateful for comments on what I have misrepresented or what I've left out. The loose criteria for inclusion are (a) are there frequent chemical structure diagrams or (b) are there enough chemical names that are worth tagging.

I add:

but exclude RSS and CMLRSS feeds at this stage (though they will be the future of some chemical newsfeeds).

So this is to encourage chemical bloggers to add InChIs (or Pubchem CIDs) to your blogs. If you do, we can index your blogs and we'll be showing some more magic RSN.

The mystery unfolded - the molecules have been (and can be) found

(I think this was delayed by WordPress.)

  1. Jean-Claude and his students cracked a bit of it. Egon has explained it fully and provided the motivation...
  2. Egon Says:
    October 14th, 2006 at 7:55 pm
    I have not been able to track down all of the involved blogs, but my final guess would be that these molecules are taken from chemical blogs. The first one is from tenderbutton, the last one already recognized by J-C (thanx for the tip!). (Peter, please don’t say I’m wrong… :)

Yes - they are from the blogs I mentioned - Useful Molecules, Totally Synthetic, Org Prep Daily, and Tenderbutton. The second posting was the very first molecules on those blogs; the third was the most recent molecules (which were also in PubChem so I could copy the images).

The other theme (which some people hinted at) was that the molecules had InChIs. These were concealed in the alt attribute of the img so they weren't visible to humans. Paul Docherty (Totally Synthetic) dropped in for a chat last week and I showed him GoogleInChI. He was interested but worried that the InChI would take up too much space on the page. I first tried hiding it in:

<span style="display:none">InChI=...</span>

in the first molecules but I was advised that Google doesn't like non-displaying information. So in the second batch I hid it in the alt attribute and this seems to work. (Unfortunately WordPress seems to corrupt handcrafted HTML on images and some of the alts got overwritten, so that is why the earlier molecules didn't all work).

So the main message is that if you put InChIs in alt attributes, Google will index your blogs. This means that we have chemically aware blogs for the first time. If all the blogs do this we shall have a de facto chemical knowledgebase.
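To show how little an indexer needs to do, here is a sketch in Python of pulling InChIs out of img alt attributes. This is not the code behind GoogleInChI; the class name, the sample page and the image file names are invented for illustration, and the two InChIs are those of methane and water.

```python
from html.parser import HTMLParser


class InChIAltCollector(HTMLParser):
    """Collect every img alt attribute that looks like an InChI."""

    def __init__(self):
        super().__init__()
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "alt" and value and value.startswith("InChI="):
                self.inchis.append(value)


def find_inchis(html_text):
    """Return the InChIs found in a page, in document order."""
    collector = InChIAltCollector()
    collector.feed(html_text)
    collector.close()
    return collector.inchis


# Invented page fragment: two molecules and one ordinary image.
PAGE = (
    '<p>Methane <img src="a.png" alt="InChI=1S/CH4/h1H4"> '
    'a logo <img src="logo.png" alt="site logo"/> '
    'and water <img alt="InChI=1S/H2O/h1H2" src="w.png"></p>'
)

if __name__ == "__main__":
    for inchi in find_inchis(PAGE):
        print(inchi)
```

Images whose alt text is ordinary prose are simply ignored, so adding InChIs does no harm to the rest of the page.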