ChemAxiom – an ontology for chemistry

As I said earlier, Nico Adams would be blogging about his (impressive) chemical ontology ChemAxiom.

All ontologies are hard. It is difficult to reach a consensus in any domain, and no single person’s or organization’s ontology is likely to carry a community. That is the way to “ontological warfare”. So Nico has made a call for cooperation and has already got an offer from Michel Dumontier, who has also been pioneering this area. That’s the great thing about the web – it identifies potential collaborators within minutes…

Posted in Uncategorized | Leave a comment

Talis platform – triple strength

Talis have been one of the great supporters of Open Data and also have an impressive triple store. They’ve helped us – especially Andrew Walkingshaw – to load largish sets of triples into a queryable base. They’ve also done fantastic work by supporting the Open Knowledge Foundation and Science Commons to develop appropriate licences and protocols. So I am delighted to announce their very useful offer to host PDDL (Public Domain Dedication Licence)-compliant data. We’ll certainly be taking this up.

Hi Peter,

You may have seen this already, but I wanted to draw your attention to a new initiative from Talis that offers free hosting (and services) over public domain data.
There’s an announcement here:

http://blogs.talis.com/nodalities/2009/03/announcing-the-talis-connected-commons.php

The offer includes storage of up to 50 million RDF triples, complete with SPARQL endpoint. However the data must be licensed under the Open Data Commons PDDL,
or CC0 license to qualify. As both of these licenses are in line with the Science Commons protocol, I’m hoping that it will be of interest to the open science
community, and particularly the work you’ve been doing in chemistry. Would be interested to hear your thoughts on this.

Cheers,

L.



Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com

Posted in Uncategorized | 2 Comments

Software patents again… Oh dear

The price of freedom is eternal vigilance and the hydras have many heads. Just when you think PRISM is decapitated, up pops Conyers, and now it turns out that the good old European Patents directive is still alive.

Please kill it… The great thing is that e-democracy can be mobilised very quickly now

From FFII… (sorry about the formatting in my WordPress)

The question of software patents without democracy and the FFII response

In October 2008, the President of the European Patent Office (EPO) issued a Referral to its Enlarged Board of Appeal (EBoA) concerning the questions as to the examination and granting of software patents in Europe. In the absence of European legislative initiatives, the EBoA’s conclusion on this matter is likely to have the same effect as a software patent directive.

However, since this decision will be based on a purely legal interpretation of the European Patent Convention (EPC) by the EBoA, it will not be accompanied by more extensive political and economic debate.

As stated by the EPO, third parties may wish to use the opportunity to file written statements before the end of April
(http://tinyurl.com/chkljo)

We would like to ask you to consider writing a statement in the name of your company, organisation or as private person, and if possible also to support the action plan of the FFII (see below).

You can see statements already submitted by others at http://www.epo.org/patents/appeals/eba-decisions/referrals/pending.html

We offer a dedicated mailing list for discussions on the referral at

https://lists.ffii.org/mailman/listinfo/boa

and a petition page against software patents at

http://stopsoftwarepatents.eu/

With our action plan, we are funding two experts to work full-time on the issue and also produce detailed documentation about software patents in Europe, to be published in the near future. We need your contribution in order to do this. Please consider making a donation, marking it as ‘EBoA Referral’.

International bank data:

IBAN:    DE78701500000031112097
BIC:     SSKMDEMM
Country: Germany
Name:    FFII e.V.
Address: Blutenburgstr 17, DE 80636 Muenchen

Germany bank data:

Name:            FFII e.V.
Account:         31112097
Sort code (BLZ): 70150000

For using Paypal, see
http://ffii.org/Donations

Background information

At present there is no central jurisdiction for European or community patents. National court decisions are still not fully aligned with the European Patent Office’s (EPO) granting policy concerning software patents that has been developed by decisions of the EPO Boards of Appeal. The disparity between national patent enforcement courts and the EPO’s granting practice was one of the reasons why a directive on the patentability of computer-implemented inventions was proposed. This directive, as well as the 2000 attempt to change the European Patent Convention, was rejected not least because of the larger FFII network’s activities.

Despite the fact that several attempts to formally legalise software patents in Europe proved unsuccessful, the EPO still has not adapted to the developments in the political arena. The EPO still grants software patents under the application of loopholes created by its Boards of Appeal decisions.

The EPO’s granting practice gradually gains more acceptance in national courts thanks to a trickle down effect, while the legal certainty of national software patents remains to be determined. Validity rulings and opposition mostly reject questionable software patents out of novelty and inventive step considerations, but not on grounds of the substantive scope of patent law.

On October 22, 2008 the Enlarged Board of Appeal was asked by the President of the European Patent Office, Alison Brimelow (UK), for an opinion concerning the exclusion of computer programs as such according to Article 112(1)b EPC. She highlights that this matter is of fundamental importance as it defines the limits of patentability in the field of computing. The Referral is divided into four chapters. The first chapter describes the background to the Referral, the second chapter concerns definitions of auxiliary terms such as software, while part three includes four questions about substantive law interpretation.
Part four describes the legal framework and options for its development.
The President also added background information and an overview of BoA decisions related to this specific matter.

The FFII has a wiki page where comments on the questions can be added.

https://www.ffii.org/EPOReferral

The EPO Enlarged Board of Appeal decided to allow third parties to make statements concerning the points of law (November 11, 2008). We will provide legal considerations which challenge the controversial Boards of Appeal decisions and thus influence the decision-making process. In the absence of legislative clarifications, some courts in the UK recently accepted EPO ‘case law’. The opinion of the Enlarged Board of Appeal will create the precedent for all future legislative developments.

As there is no legislative scenario in sight which might overrule the EBoA in case it permits software patents, this particular Referral needs our attention. Other parties interested in software patents are going to submit comments in favour of software patents. Philips, in fact, has already done so.

Our action plan

We will submit entries to the Enlarged Board of Appeal in order to bring about a more balanced assessment, and to help the EBoA arrive at legal solutions that are closer to our expectations. Our communication targets are patent technocrats with a different belief system to which we need to adapt. So far we have concluded that several different strategies can be applied. We have discussed these extensively with patent experts. For strategic reasons we cannot make them public, suffice it to say that we are currently in the process of finding collaborators in our attempt to stop software patents.

Challenge

* Recent EPO legal patent literature has done little to challenge or even criticise the teachings of the EPO. Patent scholars from other professions such as political science, economics, etc. are hardly discussed in the legal literature. Patent professionals’ task is not normative legislature, but winning cases and applications. While there has been sustained disagreement with software patents in the field of business, legal literature still hardly reflects this shift.

* Inside the EPO there is no open debate and employees are bound by strict staff obligations (cmp. Communique 22). The EPO aggressively intervenes in political and scientific debates, while the patent community’s belief system is still largely determined by an unchallenged endorsement of software patents.

* The EBoA’s members are not necessarily eligible for judicial office, and some of them are merely technically qualified. The EBoA’s lack of independence is a known issue and an EPO reform is underway to make these bodies more independent. Some patent scholars altogether question the legal quality of EBoA reasoning.

* The political debate over patent law is largely blocked. The fact that no corresponding parliament report was issued in response to an official communication from the Commission about the future of Industrial Property policy testifies to this.

* Members of the EBoA will probably only accept legal considerations and solutions.

* The EPO’s dogmatic language is shielded against public criticism and, even for legally trained people, like a net in which one easily gets caught. Its reasoning is often based on logical fallacies and hidden value judgments.

* Patent law interpretation practice is expansive.
In an allegedly unclear situation, the patent community will always argue against exclusion from patentability. It lacks a negative definition of “invention” and a sound basis in legal teaching which could be used to explain why a field is not to be covered by patent law.
Patent professionals generally do not understand the economic rationale behind incentive system application, while economists often assume for their model that the patent system has the claimed effects.

* The EPO and its staff have a strong commercial bias in favour of granting patents and are hardly ever subjected to public scrutiny and control. Patent opposition is less than ideal due to free riding effects and associated risks and transparency gaps (cmp. Guellec07)

* Complicated institutional conflicts between German and UK patent traditions loom in the background of the Referral. De facto European patent policy and litigation is strongly dominated by UK and Germany stakeholders and traditions.

Conferences

The following conferences – among others which are not public – will be or have already been attended by some of our members.

Current Policy Issues in the Governance of the European Patent System
Venue: European Parliament, Rue Wiertz 60, Room Anna Lindt, P1A002, Brussels B-1047, BELGIUM
17 March 2009
Alison Brimelow : Closing remarks
www.europarl.europa.eu/stoa/events/workshop/20090317/programme_en.pdf

WIPO – STANDING COMMITTEE ON THE LAW OF PATENTS Geneva, March 23 to 27, 2009 (We have a written report available)

The future of intellectual property
Creativity and innovation in the digital era April 23rd -24th, 2009, Committee of the Regions, Brussels

Making IPR work for SMEs
27th of April 2009, Brussels
http://ec.europa.eu/enterprise/enterprise_policy/industry/ipr_conference.htm

Patinnova
April 28th-30th, Prague
Alison Brimelow opening it.
Workshop on patents and software
http://www.epo.org/about-us/events/epf2009.html

Measuring the value of IPR: theory, business practice and public policy September 24-25, 2009, Bologna Sponsored by the EPO. Alison Brimelow has been invited.
http://www.epip.eu/conferences/epip04/

How to support the FFII

The FFII is divided in working groups. We welcome new active people in our working groups which are listed at https://action.ffii.org

If you consider our work important but you are not able to help actively, you can become a passive sustaining member of the FFII, starting at 15 EUR per year. See

http://action.ffii.org/member_application

How to contact us

FFII e.V.
Blutenburgstr. 17
80636 Munich
Germany
https://www.ffii.org
office@ffii.org

Tel. +49 30 417 22 597
Fax: +49 30 417 22 597
IRC: #ffii @ irc.freenode.net
Blogs: http://planet.ffii.org/

Tax number: 143 / 843 / 17600 at the German tax office in Munich.
IBAN: DE78701500000031112097, SWIFT/BIC: SSKMDEMM Registered organisation in Munich, Amtsgericht München VR 16460
Board: Benjamin Henrion, Rene Mages, Ivan Villanueva, Andre Rebentisch,
Alex Macfie

Posted in "virtual communities", nmr, Uncategorized | Leave a comment

CML – semantics for pi-bonds

Rich Apodaca has asked how CML represents ferrocene. As there is no communal agreement on how to do this, CML has to support all possible current mainstream representations (the resolution of these is not a semantic, but ontological task). The remaining task is to represent sketch (b) in Metallome’s blog post.

The CML schema supports this through defining (i) a pi-bonded system and (ii) a bond from one or more atoms to this system. The schema asserts:

       <xsd:attributeGroup ref="atomRefs">
            <xsd:annotation>
                <xsd:documentation>
                    <h:div class="specific">This is designed for
                    multicentre bonds (as in delocalised systems or electron-deficient
                    centres). The semantics are experimental at this stage.
                    As an example, a B-H-B bond might be described as
                 &lt;bond atomRefs="b1 h2 b2"/&gt;.</h:div>
                </xsd:documentation>
            </xsd:annotation>
        </xsd:attributeGroup>
        <xsd:attributeGroup ref="bondRefs">
            <xsd:annotation>
                <xsd:documentation>
                 <h:div class="specific">This is designed for pi-bonds
                 and other systems where formal valence bonds are not drawn to
                 atoms. The semantics are experimental at this stage. As an example,
                 a Pt-|| bond (as the Pt-ethene bond in Zeise's salt) might
                 be described as &lt;bond atomRefs="pt1" bondRefs="b32"/&gt;.</h:div>
                </xsd:documentation>
            </xsd:annotation>
        </xsd:attributeGroup>

So we’ll define a pi-bond system for the first ring’s carbon atoms a1,a2,a3,a4,a5 and another for the second ring’s carbon atoms a11,a12,a13,a14,a15 (in the atom numbering used for ferrocene here, a6 to a10 are hydrogens):

<bond id="b12345" atomRefs="a1 a2 a3 a4 a5"/>
<bond id="b1112131415" atomRefs="a11 a12 a13 a14 a15"/>

and then bond the Fe (a0) to each separately:

<bond id="bpi1" atomRefs="a0" bondRefs="b12345"/>
<bond id="bpi2" atomRefs="a0" bondRefs="b1112131415"/>

Note how the use of pointers (refs) is a fundamental part of CML and makes much of the semantics tractable. Put it all together and we get:

<molecule id="mol123456789" title="ferrocene"
  xmlns='http://www.xml-cml.org/schema'>
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
    <atomArray>
      <atom id="a0" elementType="Fe"/>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="C"/>
      <atom id="a3" elementType="C"/>
      <atom id="a4" elementType="C"/>
      <atom id="a5" elementType="C"/>
      <atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/>
      <atom id="a8" elementType="H"/>
      <atom id="a9" elementType="H"/>
      <atom id="a10" elementType="H"/>
      <atom id="a11" elementType="C"/>
      <atom id="a12" elementType="C"/>
      <atom id="a13" elementType="C"/>
      <atom id="a14" elementType="C"/>
      <atom id="a15" elementType="C"/>
      <atom id="a16" elementType="H"/>
      <atom id="a17" elementType="H"/>
      <atom id="a18" elementType="H"/>
      <atom id="a19" elementType="H"/>
      <atom id="a20" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond id="a1_a2" atomRefs2="a1 a2"/>
      <bond id="a2_a3" atomRefs2="a2 a3"/>
      <bond id="a3_a4" atomRefs2="a3 a4"/>
      <bond id="a4_a5" atomRefs2="a4 a5"/>
      <bond id="a5_a1" atomRefs2="a5 a1"/>
      <bond id="a1_a6" atomRefs2="a1 a6"/>
      <bond id="a2_a7" atomRefs2="a2 a7"/>
      <bond id="a3_a8" atomRefs2="a3 a8"/>
      <bond id="a4_a9" atomRefs2="a4 a9"/>
      <bond id="a5_a10" atomRefs2="a5 a10"/>
      <bond id="a11_a12" atomRefs2="a11 a12"/>
      <bond id="a12_a13" atomRefs2="a12 a13"/>
      <bond id="a13_a14" atomRefs2="a13 a14"/>
      <bond id="a14_a15" atomRefs2="a14 a15"/>
      <bond id="a15_a11" atomRefs2="a15 a11"/>
      <bond id="a11_a16" atomRefs2="a11 a16"/>
      <bond id="a12_a17" atomRefs2="a12 a17"/>
      <bond id="a13_a18" atomRefs2="a13 a18"/>
      <bond id="a14_a19" atomRefs2="a14 a19"/>
      <bond id="a15_a20" atomRefs2="a15 a20"/>
      <bond id="b12345" atomRefs="a1 a2 a3 a4 a5"/>
      <bond id="b1112131415" atomRefs="a11 a12 a13 a14 a15"/>
      <bond id="bpi1" atomRefs="a0" bondRefs="b12345"/>
      <bond id="bpi2" atomRefs="a0" bondRefs="b1112131415"/>
    </bondArray>
</molecule>
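
To make the role of the pointers concrete, here is a minimal Python sketch (not part of any CML toolkit) that follows each bondRefs pointer back to the pi-system it names and reports which elements end up connected. It uses only the standard library. The function name resolve_pi_bonds, the cut-down molecule and the ids ring1 and ring2 are my own illustrative choices, and the sketch assumes the second ring’s carbons are a11 to a15 as in the atom array above.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Cut-down ferrocene: Fe plus the ten ring carbons only. Hydrogens and the
# ordinary two-centre bonds are omitted to keep the sketch short.
CML = """
<molecule id="ferrocene_sketch" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a0" elementType="Fe"/>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="C"/>
    <atom id="a4" elementType="C"/>
    <atom id="a5" elementType="C"/>
    <atom id="a11" elementType="C"/>
    <atom id="a12" elementType="C"/>
    <atom id="a13" elementType="C"/>
    <atom id="a14" elementType="C"/>
    <atom id="a15" elementType="C"/>
  </atomArray>
  <bondArray>
    <bond id="ring1" atomRefs="a1 a2 a3 a4 a5"/>
    <bond id="ring2" atomRefs="a11 a12 a13 a14 a15"/>
    <bond id="bpi1" atomRefs="a0" bondRefs="ring1"/>
    <bond id="bpi2" atomRefs="a0" bondRefs="ring2"/>
  </bondArray>
</molecule>
"""


def resolve_pi_bonds(molecule):
    """Map each bond that carries bondRefs to (centre elements, pi-system elements)."""
    elements = {
        atom.get("id"): atom.get("elementType")
        for atom in molecule.iterfind(".//cml:atom", CML_NS)
    }
    bonds = {bond.get("id"): bond for bond in molecule.iterfind(".//cml:bond", CML_NS)}
    resolved = {}
    for bond_id, bond in bonds.items():
        if bond.get("bondRefs") is None:
            continue  # an ordinary bond or a pi-system definition, not a pointer
        centre = bond.get("atomRefs").split()
        system_atoms = []
        for ref in bond.get("bondRefs").split():
            # Follow the pointer to the multicentre bond it names.
            system_atoms.extend(bonds[ref].get("atomRefs").split())
        resolved[bond_id] = (
            [elements[a] for a in centre],
            [elements[a] for a in system_atoms],
        )
    return resolved


pi_bonds = resolve_pi_bonds(ET.fromstring(CML))
for bond_id, (centre, system) in pi_bonds.items():
    print(bond_id, centre, "->", system)
```

Nothing here decides what the bond means chemically; the code only dereferences ids, which is all the representation promises.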

This completes our tour of four different representations of ferrocene. None have implicit semantics. They can only be reconciled through ontologies, not semantics – we have to assert that some authority says that they are equivalent (or different).

If we need we can give hints to the processing program. We could add an electron count to the pi-bonds:

<bond id="b12345" atomRefs="a1 a2 a3 a4 a5">
  <electron count="5"/>
</bond>
<bond id="b1112131415" atomRefs="a11 a12 a13 a14 a15">
  <electron count="5"/>
</bond>

if we like a neutral Fe and cps or

<bond id="b12345" atomRefs="a1 a2 a3 a4 a5">
  <electron count="6"/>
</bond>
<bond id="b1112131415" atomRefs="a11 a12 a13 a14 a15">
  <electron count="6"/>
</bond>

if we want a cp- and Fe2+ model.
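
As a rough illustration of how a processing program might use such hints, the Python sketch below reads the electron counts from both models and adds the metal’s valence electrons. The helper names (bond_array, pi_electron_counts) are hypothetical, and the metal electron counts (8 for neutral Fe, 6 for Fe2+) are supplied by me rather than taken from the CML; they are ordinary textbook values.

```python
import xml.etree.ElementTree as ET


def bond_array(count):
    """Build the two pi-system bonds, each carrying the given electron count hint."""
    return (
        "<bondArray>"
        f'<bond id="b12345" atomRefs="a1 a2 a3 a4 a5"><electron count="{count}"/></bond>'
        f'<bond id="b1112131415" atomRefs="a11 a12 a13 a14 a15"><electron count="{count}"/></bond>'
        "</bondArray>"
    )


def pi_electron_counts(bond_array_xml):
    """Return {bond id: total electrons declared on that bond}."""
    root = ET.fromstring(bond_array_xml)
    return {
        bond.get("id"): sum(int(e.get("count")) for e in bond.findall("electron"))
        for bond in root.findall("bond")
    }


neutral = pi_electron_counts(bond_array(5))  # neutral Fe and neutral cp
ionic = pi_electron_counts(bond_array(6))    # Fe2+ and cp-

# Assumed metal valence electron counts: Fe(0) has 8, Fe(2+) has 6.
neutral_total = sum(neutral.values()) + 8
ionic_total = sum(ionic.values()) + 6

print(neutral_total, ionic_total)
```

Both bookkeeping schemes arrive at the same total, which is one reason neither can be called the “right” one from the connection table alone.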

You may ask “how does CML search for ferrocenes”? CML doesn’t. It’s not a program, it’s a representation. For that you need CML-aware engines and that’s what the Open Source community has been developing…
… please join us and help to take us to the semantic future. It won’t happen with SD files.

Posted in "virtual communities", Uncategorized | Leave a comment

CML – semantic representation of molecular structure

I have been asked by Rich Apodaca to show how the various styles of representing ferrocene are possible within CML. Let me stress that these are different connection tables which the community variously uses to represent a single compound. There is no way that a connection table, per se, can indicate that there are alternative ways of representing the same information. At one level it’s like expecting the equation

x = 1;

to indicate that it’s semantically equivalent to

x - 1 = 0;

which requires normalization and ontology.

So here is how we represent two more (valid) representations of ferrocene. The first is effectively cp-Fe-cp where single bonds are used to link the iron to particular atoms of the cp. We’ll remove the sub-molecule structure and add bonds…

<molecule id="mol123456789" title="ferrocene" xmlns='http://www.xml-cml.org/schema'>
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
    <atomArray>
      <atom id="a0" elementType="Fe"/>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="C"/>
      <atom id="a3" elementType="C"/>
      <atom id="a4" elementType="C"/>
      <atom id="a5" elementType="C"/>
      <atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/>
      <atom id="a8" elementType="H"/>
      <atom id="a9" elementType="H"/>
      <atom id="a10" elementType="H"/>
      <atom id="a11" elementType="C"/>
      <atom id="a12" elementType="C"/>
      <atom id="a13" elementType="C"/>
      <atom id="a14" elementType="C"/>
      <atom id="a15" elementType="C"/>
      <atom id="a16" elementType="H"/>
      <atom id="a17" elementType="H"/>
      <atom id="a18" elementType="H"/>
      <atom id="a19" elementType="H"/>
      <atom id="a20" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond id="a1_a2" atomRefs2="a1 a2"/>
      <bond id="a2_a3" atomRefs2="a2 a3" order="D"/>
      <bond id="a3_a4" atomRefs2="a3 a4"/>
      <bond id="a4_a5" atomRefs2="a4 a5" order="D"/>
      <bond id="a5_a1" atomRefs2="a5 a1"/>
      <bond id="a1_a6" atomRefs2="a1 a6"/>
      <bond id="a2_a7" atomRefs2="a2 a7"/>
      <bond id="a3_a8" atomRefs2="a3 a8"/>
      <bond id="a4_a9" atomRefs2="a4 a9"/>
      <bond id="a5_a10" atomRefs2="a5 a10"/>
      <bond id="a11_a12" atomRefs2="a11 a12"/>
      <bond id="a12_a13" atomRefs2="a12 a13"/>
      <bond id="a13_a14" atomRefs2="a13 a14"/>
      <bond id="a14_a15" atomRefs2="a14 a15"/>
      <bond id="a15_a11" atomRefs2="a15 a11"/>
      <bond id="a11_a16" atomRefs2="a11 a16"/>
      <bond id="a12_a17" atomRefs2="a12 a17"/>
      <bond id="a13_a18" atomRefs2="a13 a18"/>
      <bond id="a14_a19" atomRefs2="a14 a19"/>
      <bond id="a15_a20" atomRefs2="a15 a20"/>
      <bond id="a0_a1" atomRefs2="a0 a1"/>
      <bond id="a0_a11" atomRefs2="a0 a11"/>
    </bondArray>
</molecule>

That’s fairly straightforward and here I have added some bond orders. I don’t terribly like doing this as it’s a rather meaningless retrofitting, especially when H atoms are explicit.

Here’s the approach using explicit bonds from Fe to all carbons (sketch (a)).

<molecule id="mol123456789" title="ferrocene" xmlns='http://www.xml-cml.org/schema'>
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
    <atomArray>
      <atom id="a0" elementType="Fe"/>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="C"/>
      <atom id="a3" elementType="C"/>
      <atom id="a4" elementType="C"/>
      <atom id="a5" elementType="C"/>
      <atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/>
      <atom id="a8" elementType="H"/>
      <atom id="a9" elementType="H"/>
      <atom id="a10" elementType="H"/>
      <atom id="a11" elementType="C"/>
      <atom id="a12" elementType="C"/>
      <atom id="a13" elementType="C"/>
      <atom id="a14" elementType="C"/>
      <atom id="a15" elementType="C"/>
      <atom id="a16" elementType="H"/>
      <atom id="a17" elementType="H"/>
      <atom id="a18" elementType="H"/>
      <atom id="a19" elementType="H"/>
      <atom id="a20" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond id="a1_a2" atomRefs2="a1 a2"/>
      <bond id="a2_a3" atomRefs2="a2 a3"/>
      <bond id="a3_a4" atomRefs2="a3 a4"/>
      <bond id="a4_a5" atomRefs2="a4 a5"/>
      <bond id="a5_a1" atomRefs2="a5 a1"/>
      <bond id="a1_a6" atomRefs2="a1 a6"/>
      <bond id="a2_a7" atomRefs2="a2 a7"/>
      <bond id="a3_a8" atomRefs2="a3 a8"/>
      <bond id="a4_a9" atomRefs2="a4 a9"/>
      <bond id="a5_a10" atomRefs2="a5 a10"/>
      <bond id="a11_a12" atomRefs2="a11 a12"/>
      <bond id="a12_a13" atomRefs2="a12 a13"/>
      <bond id="a13_a14" atomRefs2="a13 a14"/>
      <bond id="a14_a15" atomRefs2="a14 a15"/>
      <bond id="a15_a11" atomRefs2="a15 a11"/>
      <bond id="a11_a16" atomRefs2="a11 a16"/>
      <bond id="a12_a17" atomRefs2="a12 a17"/>
      <bond id="a13_a18" atomRefs2="a13 a18"/>
      <bond id="a14_a19" atomRefs2="a14 a19"/>
      <bond id="a15_a20" atomRefs2="a15 a20"/>
      <bond id="a0_a1" atomRefs2="a0 a1"/>
      <bond id="a0_a2" atomRefs2="a0 a2"/>
      <bond id="a0_a3" atomRefs2="a0 a3"/>
      <bond id="a0_a4" atomRefs2="a0 a4"/>
      <bond id="a0_a5" atomRefs2="a0 a5"/>
      <bond id="a0_a11" atomRefs2="a0 a11"/>
      <bond id="a0_a12" atomRefs2="a0 a12"/>
      <bond id="a0_a13" atomRefs2="a0 a13"/>
      <bond id="a0_a14" atomRefs2="a0 a14"/>
      <bond id="a0_a15" atomRefs2="a0 a15"/>
    </bondArray>
</molecule>

The bonds are not, of course, 2-electron bonds – but we haven’t said they are – that’s the strength of the semantic approach. If we really wanted to indicate that each bond had 4 / 5 electrons, CML would allow us to do it – see next example.
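
The point that a connection table cannot announce its own equivalents can be sketched in a few lines of Python. The sketch builds cut-down versions of the two representations above (hydrogens omitted, Fe bonded either to one carbon per ring or to all ten carbons) and compares them. The helper names build_cml, composition and coordination_number are hypothetical, and the ring carbons are assumed to be a1 to a5 and a11 to a15.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://www.xml-cml.org/schema"
RING1 = ["a1", "a2", "a3", "a4", "a5"]
RING2 = ["a11", "a12", "a13", "a14", "a15"]


def build_cml(fe_neighbours):
    """Ferrocene skeleton (no hydrogens) with Fe (a0) bonded to the given atoms."""
    atoms = ['<atom id="a0" elementType="Fe"/>']
    atoms += [f'<atom id="{c}" elementType="C"/>' for c in RING1 + RING2]
    bonds = []
    for ring in (RING1, RING2):
        for i, c in enumerate(ring):
            nxt = ring[(i + 1) % len(ring)]  # close the five-membered ring
            bonds.append(f'<bond id="{c}_{nxt}" atomRefs2="{c} {nxt}"/>')
    bonds += [f'<bond id="a0_{c}" atomRefs2="a0 {c}"/>' for c in fe_neighbours]
    return (
        f'<molecule xmlns="{NS}"><atomArray>{"".join(atoms)}</atomArray>'
        f'<bondArray>{"".join(bonds)}</bondArray></molecule>'
    )


def composition(cml):
    """Count atoms by element type."""
    root = ET.fromstring(cml)
    return Counter(a.get("elementType") for a in root.iter(f"{{{NS}}}atom"))


def coordination_number(cml, atom_id="a0"):
    """Number of two-centre bonds that mention atom_id."""
    root = ET.fromstring(cml)
    return sum(
        atom_id in b.get("atomRefs2").split() for b in root.iter(f"{{{NS}}}bond")
    )


sigma = build_cml(["a1", "a11"])   # cp-Fe-cp with single bonds
deca = build_cml(RING1 + RING2)    # Fe bonded to all ten carbons

same_atoms = composition(sigma) == composition(deca)
print(same_atoms, coordination_number(sigma), coordination_number(deca))
```

The atoms agree but the iron’s coordination number does not, so any statement that the two tables describe the same compound has to come from outside the tables.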

Posted in "virtual communities", Uncategorized | Leave a comment

Why we need chemistry ontologies

Mat Todd is an example of the new generation of organic chemists who are concerned about the broader picture of information. Here’s a recent comment, which I address:

Mat Todd says:

Peter, I think trying to pin down the exact nature of a substance and label it is important. I suspect it’s important because we need computers to be able to handle the data. But it reminds me of efforts to label vague concepts with names more generally. What is ‘British?’ How many hairs must I lose before I’m bald? At what wavelength does red become orange? To decide that such labels are important is half the battle.

Beyond the zeolite/clay examples above, there was an interesting episode in a recent synthesis of quinine from Robert Williams (10.1002/anie.200705421). To quote another site (http://tinyurl.com/dg8me8):

“following the old ways without the benefits of modern storage methods of reactive metals may have been critical in their success. Initially, their yield of quinine was very low. They suspected that the aluminium powder used as a reducing agent in the last step was the problem. It was too fresh! Leaving it in air for a short period leads to the formation of a coating of aluminium oxide. When the experiment was repeated with this powder, the yield matched that reported by Woodward and Doering.”

Even commercial reagents with the same labels can be a mixture of things in a time-dependent manner.

PMR: Exactly so.

And this is where ontologies come in. I’ve been keen on ontologies for over 10 years (and am credited with having used the term “ontological warfare”). I have been sceptical of using Upper Ontologies as I thought they would be too general and too incompatible. However Nico Adams has done a fantastic job of creating a broad and deep ontology (ChemAxiom) for mainstream chemistry, based on an upper Ontology (BFO). I am deliberately not saying more here, as it’s his shout.

However I can say that Nico addresses these problems – the changing nature of entities and concepts. The upper ontology is necessarily abstract and uses terms such as “Continuants” and “Occurrents” which can be addressed to the decaying aluminium above. I shall keep emphasizing that the reconciliation between variant names, structures, samples and substances can only be properly made through ontologies.

My own part is to create lower-level ontologies that harmonize with Nico’s. I’ve converted the CIF dictionary into an ontology which now acts to validate data instances, and created computational ontologies such as for Gaussian (in our COST program). It’s clear that their time and their technology have arrived.

Posted in "virtual communities", Uncategorized | Leave a comment

Chem4Word – why semantics are necessary

I was asked to explain how Chem4Word and CML could encode ferrocene. I’ll start by using Wikipedia to give a clear and accurate picture. Sorry for the cut-and-paste mess.

WP: Ferrocene is the organometallic compound with the formula Fe(C5H5)2. It is the prototypical metallocene, a type of organometallic chemical compound consisting of two cyclopentadienyl rings bound on opposite sides of a central metal atom.

Other names: dicyclopentadienyl iron
Identifiers
CAS number: 102-54-5
PubChem: 11985121
ChEBI: 30672

Very clear and tidy. By contrast the entries in Pubchem are a mess. That’s NOT Pubchem’s fault – it’s the non-semantic stuff that is sent by depositors. Again I shan’t bash the depositors too hard as they have voluntarily deposited their material – it is the awful non-semantic authoring tools they use and the absence of agreed conventions.

Chem4Word aims to raise the standard. You’ll note from the entries below that the formulae for some of these structures are grotesque (10 negative charges). C4W will give authors a clear indication of the molecular formulae and charges and encourage semantic validation.

Anyway here goes. These are all the different compound IDs associated with ferrocene. I assume that all these compounds are meant to be ferrocene but their formulae are garbled by the tools – note the absurd charges. CML prevents such garbling.


Ferrotsen; Catane; FERROCENE …
Compound ID: 7611
Source: LeadScope (LS-357)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe


FERROCENE; Bis(.eta.-cyclopentadienyl) iron
Compound ID: 11985121
Source: NIST Chemistry WebBook (3993653726)
IUPAC: cyclopenta-1,3-diene; cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe-6


FERROCENE; Di(cyclopentadienyl)iron; Bis(cyclopentadienyl)iron …
Compound ID: 10219726
Source: Sigma-Aldrich (F408_ALDRICH)
IUPAC: cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe


FERROCENE
Compound ID: 504306
Source: NIST Chemistry WebBook (1113374621)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe


Ferrotsen; FERROCENE; Dicyclopentadienyl iron …
Compound ID: 24196050
Source: DTP/NCI (209798)
IUPAC: cyclopenta-1,3-diene; iron
MW: 177.967880 g/mol | MF: C10H2Fe-10
Tested in BioAssays: All: 3, Active: 0; BioActivity Analysis


Ferrotsen; FERROCENE; Dicyclopentadienyl iron …
Compound ID: 5150118
Source: DTP/NCI (44012)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 177.967880 g/mol | MF: C10H2Fe-8
Tested in BioAssays: All: 1, Active: 0; BioActivity Analysis
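
The kind of check that would catch these entries can be sketched in Python. The sketch parses each MF string quoted above into element counts and a charge, recomputes the molecular weight, and flags anything that is not neutral C10H10Fe. The function names and the atomic weights are my own choices (the weights were picked because they reproduce the MW values quoted in the listing); this is not how Pubchem or Chem4Word actually validates anything.

```python
import re

# Assumed atomic weights; they reproduce the MW values quoted in the listing.
ATOMIC_WEIGHT = {"C": 12.0107, "H": 1.00794, "Fe": 55.845}

# Element symbols with optional counts, then an optional trailing charge.
FORMULA = re.compile(r"^((?:[A-Z][a-z]?\d*)+)([+-]\d*)?$")


def parse_formula(mf):
    """Split e.g. 'C10H2Fe-10' into ({'C': 10, 'H': 2, 'Fe': 1}, -10)."""
    match = FORMULA.match(mf)
    if match is None:
        raise ValueError(f"cannot parse formula: {mf!r}")
    body, charge_text = match.groups()
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", body):
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
    charge = 0
    if charge_text:
        magnitude = int(charge_text[1:] or 1)
        charge = magnitude if charge_text[0] == "+" else -magnitude
    return counts, charge


def molecular_weight(counts):
    """Sum atomic weights over the element counts."""
    return sum(ATOMIC_WEIGHT[symbol] * n for symbol, n in counts.items())


# Compound ID -> MF string, copied from the entries above.
ENTRIES = {
    7611: "C10H10Fe",
    11985121: "C10H10Fe-6",
    10219726: "C10H10Fe",
    504306: "C10H10Fe",
    24196050: "C10H2Fe-10",
    5150118: "C10H2Fe-8",
}

EXPECTED = ({"C": 10, "H": 10, "Fe": 1}, 0)  # neutral ferrocene

suspect = [cid for cid, mf in ENTRIES.items() if parse_formula(mf) != EXPECTED]
print(suspect)
for cid, mf in ENTRIES.items():
    counts, charge = parse_formula(mf)
    print(cid, mf, round(molecular_weight(counts), 5), charge)
```

The recomputed weights show where the lower MW comes from: the two DTP/NCI entries have lost eight hydrogens, and the lost hydrogens have turned into negative charges.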


Posted in "virtual communities", Uncategorized, XML | 1 Comment

CML – a semantic approach to chemistry

Rich Apodaca has asked me to show how CML can deal with metallocene compounds – and I’m happy to do this – it comes at a very good time. He points to the Metallome blog and I’ll copy some of the material on ferrocene. I’ll show the post and then explain the approach.

Metallome: Drawing ferrocene

Ferrocene was discovered in 1951 and we still do not know the proper way to draw it. CrossFire example recommends to connect every carbon atom of the ring to the central metal atom. Which is fair enough and will be a valid query for CrossFire Gmelin database. Similarly, both ChEBI and NIST Webbook use decacoordinate iron in ferrocene structure (a). In this representation, all carbon—carbon bonds are single. But, according to IUPAC Recommendations, section GR-1.7.2,
    coordination bonds to contiguous atoms (most commonly representing a form of π-bonding) should be drawn to indicate most clearly that special bonding pattern. Depictions that imply a regular covalent bond — and especially, depictions that show a regular covalent bond to each member of a delocalized system — are not acceptable.

In other words, the preferred representation is the one with bicoordinate iron and delocalised bond system (b). The problem with that is there is no agreed (as far as chemoinformaticians are concerned) way to do that, even though solutions for different applications (e.g. for Marvin Sketch) do exist. In MolBase, the coordination number of iron in ferrocene is 6 (and I do remember Mark Winter confirming that this is true). On yet another hand, Beilstein and ChemIDplus databases represent ferrocene as a standalone Fe2+ and two standalone cyclopenta-2,4-dienide anions (c), thus avoiding the question of coordination number altogether. Naturally, the decacoordinate-iron query will not work in Beilstein. (For InChI implications, see this discussion.)

[Figures: (a) ferrocene with 10-coordinate iron; (b) ferrocene with bi-coordinate iron; (c) ferrocene as three standalone entities]

PMR: Many thanks Kirill for this very clear explanation. The first and central point is that there is no agreed way to represent ferrocene, and the semantic approach honours this.

In CML we represent what we do know, and do that as fully as possible. Implicit semantics (i.e. information that has to be provided by the reader or reading program) creates enormous problems. A typical example of implicit semantics is omitting hydrogen atoms, and although CML allows this we are not allowing omitted H in Chem4Word.

So let’s build up systematically. What do we know? We know we have a molecule (ferrocene exists in the gas phase so we can talk of single molecules and don’t have to worry about substances at this stage).

<molecule id="mol123456789" title="ferrocene"
    xmlns='http://www.xml-cml.org/schema'/>

What an anticlimax! We knew that. But we have at least told the world we have a molecule. Let’s see what the world has to offer… off to Pubchem for a ferrocene search. This gives a huge number of entries and shows the chaos when we don’t have semantic chemistry – several entries have a formula of MF: C10H10Fe-6, which is obviously caused by a non-semantic program trying to work out the formula from non-semantic input.

The lesson is simple: If you care about quality and validity you must use a semantic approach.

So how do we know which is actually “ferrocene”? Simple answer – we don’t. We have a number of conflicting pieces of information – some have a name “ferrocene” but the formula associated with those names varies. Of course we as inorganic chemists know the “correct” formula, but it still means Pubchem (and much else) doesn’t act as a simple lookup. We have to add metadata – who asserted what. That’s where semantics starts to come in.

We’ll take the first entry:

SID: 49854569
Ferrotsen; Catane; FERROCENE …
Compound ID: 7611
Source: LeadScope (LS-357)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe

There is a lot of semantics we should encode here:

Pubchem ASSERTS that Leadscope (a depositor) has deposited a substance entry (SID: 49854569). Leadscope ASSERTS that this entry relates to the name “ferrocene”; that this entry relates to the name “catane” (and many more); that this entry is associated with formula C10H10Fe; that this entry has a connection table [identical with (c) above]; and so on.

The proper way to do this is with RDF using triples and/or reification, bnodes or quads. In this way we can see who asserted what, or who asserted who said what. This is not chopping logic – there is no correct connection table for ferrocene, there are only assertions made by authorities (here Leadscope, although they may have taken it from somewhere else unspecified).
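As a rough illustration of the "who asserted what" idea, here is a minimal sketch in plain Python rather than a real RDF toolkit. Each statement is stored as a quad: a triple plus the party making the assertion. The predicate names (`deposited`, `hasName`, `hasFormula`) and the helper functions are invented for this sketch and do not come from any published vocabulary.

```python
from collections import namedtuple

# A quad is an ordinary triple plus the party making the assertion.
Quad = namedtuple("Quad", ["asserter", "subject", "predicate", "obj"])

# Statements taken from the PubChem record quoted above.
# The predicate names are hypothetical, chosen only for this sketch.
quads = [
    Quad("PubChem", "LeadScope", "deposited", "SID:49854569"),
    Quad("LeadScope", "SID:49854569", "hasName", "ferrocene"),
    Quad("LeadScope", "SID:49854569", "hasName", "catane"),
    Quad("LeadScope", "SID:49854569", "hasFormula", "C10H10Fe"),
]


def asserted_by(asserter, store):
    """Return the (subject, predicate, object) triples a given party asserted."""
    return [(q.subject, q.predicate, q.obj) for q in store if q.asserter == asserter]


def who_says(subject, predicate, obj, store):
    """Return, sorted, every party that asserted the given triple."""
    return sorted({q.asserter for q in store
                   if (q.subject, q.predicate, q.obj) == (subject, predicate, obj)})


if __name__ == "__main__":
    for triple in asserted_by("LeadScope", quads):
        print(triple)
    print(who_says("SID:49854569", "hasFormula", "C10H10Fe", quads))
```

The point of the fourth field is that a conflicting formula from another depositor can sit in the same store without either statement being declared "the" answer.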

Let’s encode the formula. cml:formula is one of the benefits of CML and C4W has been written to manage formulae properly. There are 2 formulae, the sum of the atoms (concise) and an inline (which can be any text).

<molecule id="mol123456789" title="ferrocene" xmlns='http://www.xml-cml.org/schema'>
<formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
</molecule>
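A concise formula is easy for a program to consume because elements and counts simply alternate. The following is a minimal sketch, assuming the concise string carries no trailing charge field; the function names and the small table of rounded atomic weights are supplied here for illustration and are not part of CML or Chem4Word.

```python
# Rounded standard atomic weights (g/mol) for the three elements involved.
ATOMIC_WEIGHT = {"C": 12.0107, "H": 1.00794, "Fe": 55.845}


def parse_concise(concise):
    """Turn 'C 10 H 10 Fe 1' into {'C': 10, 'H': 10, 'Fe': 1}.

    Assumes strictly alternating element/count tokens (no charge field).
    """
    tokens = concise.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating element/count tokens")
    counts = {}
    for element, count in zip(tokens[0::2], tokens[1::2]):
        counts[element] = counts.get(element, 0) + int(count)
    return counts


def molecular_weight(counts):
    """Sum atomic weights over the element counts."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in counts.items())


if __name__ == "__main__":
    counts = parse_concise("C 10 H 10 Fe 1")
    print(counts)
    print(round(molecular_weight(counts), 4))
```

With these weights the sum agrees with the MW of 186.0314 g/mol quoted in the PubChem entry above.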

Now we come to the bonding. This representation has 3 species and CML supports sub-molecules (again an important feature). The iron is:

<molecule id="m1">
  <atomArray>
    <atom id="a0" elementType="Fe" formalCharge="2"/>
  </atomArray>
</molecule>

That tells us everything about the iron with no implicit semantics. The Cps could be represented in a number of ways. I’ll write

<molecule id="m2" formalCharge="-1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="C"/>
    <atom id="a4" elementType="C"/>
    <atom id="a5" elementType="C"/>
    <atom id="a6" elementType="H"/>
    <atom id="a7" elementType="H"/>
    <atom id="a8" elementType="H"/>
    <atom id="a9" elementType="H"/>
    <atom id="a10" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond id="a1_a2" atomRefs2="a1 a2"/>
    <bond id="a2_a3" atomRefs2="a2 a3"/>
    <bond id="a3_a4" atomRefs2="a3 a4"/>
    <bond id="a4_a5" atomRefs2="a4 a5"/>
    <bond id="a5_a1" atomRefs2="a5 a1"/>
    <bond id="a1_a6" atomRefs2="a1 a6"/>
    <bond id="a2_a7" atomRefs2="a2 a7"/>
    <bond id="a3_a8" atomRefs2="a3 a8"/>
    <bond id="a4_a9" atomRefs2="a4 a9"/>
    <bond id="a5_a10" atomRefs2="a5 a10"/>
  </bondArray>
</molecule>

and the other by analogy. Note that I have not added any bond orders – this is deliberate. The community will argue about whether the ring bonds are single, double, delocalised, pi, etc. They will argue where the charge should be put. So I have added exactly enough information that stops before they start fighting.

Putting it together we get the complete CML for this particular representation:

<molecule id="mol123456789" title="ferrocene" xmlns='http://www.xml-cml.org/schema'>
  <formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>
  <molecule id="m1">
    <atomArray>
      <atom id="a0" elementType="Fe" formalCharge="2"/>
    </atomArray>
  </molecule>
  <molecule id="m2" formalCharge="-1">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="C"/>
      <atom id="a3" elementType="C"/>
      <atom id="a4" elementType="C"/>
      <atom id="a5" elementType="C"/>
      <atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/>
      <atom id="a8" elementType="H"/>
      <atom id="a9" elementType="H"/>
      <atom id="a10" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond id="a1_a2" atomRefs2="a1 a2"/>
      <bond id="a2_a3" atomRefs2="a2 a3"/>
      <bond id="a3_a4" atomRefs2="a3 a4"/>
      <bond id="a4_a5" atomRefs2="a4 a5"/>
      <bond id="a5_a1" atomRefs2="a5 a1"/>
      <bond id="a1_a6" atomRefs2="a1 a6"/>
      <bond id="a2_a7" atomRefs2="a2 a7"/>
      <bond id="a3_a8" atomRefs2="a3 a8"/>
      <bond id="a4_a9" atomRefs2="a4 a9"/>
      <bond id="a5_a10" atomRefs2="a5 a10"/>
    </bondArray>
  </molecule>
  <molecule id="m3" formalCharge="-1">
    <atomArray>
      <atom id="a11" elementType="C"/>
      <atom id="a12" elementType="C"/>
      <atom id="a13" elementType="C"/>
      <atom id="a14" elementType="C"/>
      <atom id="a15" elementType="C"/>
      <atom id="a16" elementType="H"/>
      <atom id="a17" elementType="H"/>
      <atom id="a18" elementType="H"/>
      <atom id="a19" elementType="H"/>
      <atom id="a20" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond id="a11_a12" atomRefs2="a11 a12"/>
      <bond id="a12_a13" atomRefs2="a12 a13"/>
      <bond id="a13_a14" atomRefs2="a13 a14"/>
      <bond id="a14_a15" atomRefs2="a14 a15"/>
      <bond id="a15_a11" atomRefs2="a15 a11"/>
      <bond id="a11_a16" atomRefs2="a11 a16"/>
      <bond id="a12_a17" atomRefs2="a12 a17"/>
      <bond id="a13_a18" atomRefs2="a13 a18"/>
      <bond id="a14_a19" atomRefs2="a14 a19"/>
      <bond id="a15_a20" atomRefs2="a15 a20"/>
    </bondArray>
  </molecule>
</molecule>

Notice how CML has represented exactly what we know about this depiction. (If I had wanted I could have put in the precise positions of the double bonds and carbanion, but it’s probably counterproductive). It’s completely semantic, no implicit information.
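Because nothing is implicit, a program can check the document against itself. The sketch below uses only the Python standard library: it rebuilds the same CML as a string (the `cp_ring` helper is a convenience invented here), then confirms that the atom tally matches the concise formula, that the stated formal charges sum to zero, and that every bond refers to atoms that exist. It assumes each charge is stated at only one level (atom or sub-molecule), as in the example above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://www.xml-cml.org/schema}"


def cp_ring(mol_id, first):
    """Build the CML for one C5H5 fragment whose atom ids start at a<first>."""
    carbons = [f"a{first + i}" for i in range(5)]
    hydrogens = [f"a{first + 5 + i}" for i in range(5)]
    atoms = "".join(f'<atom id="{c}" elementType="C"/>' for c in carbons)
    atoms += "".join(f'<atom id="{h}" elementType="H"/>' for h in hydrogens)
    ring = [(carbons[i], carbons[(i + 1) % 5]) for i in range(5)]
    bonds = "".join(f'<bond id="{x}_{y}" atomRefs2="{x} {y}"/>'
                    for x, y in ring + list(zip(carbons, hydrogens)))
    return (f'<molecule id="{mol_id}" formalCharge="-1">'
            f'<atomArray>{atoms}</atomArray><bondArray>{bonds}</bondArray></molecule>')


CML = (
    '<molecule id="mol123456789" title="ferrocene" xmlns="http://www.xml-cml.org/schema">'
    '<formula concise="C 10 H 10 Fe 1" inline="Fe(C_5_H_5)_2_"/>'
    '<molecule id="m1"><atomArray>'
    '<atom id="a0" elementType="Fe" formalCharge="2"/>'
    '</atomArray></molecule>'
    + cp_ring("m2", 1) + cp_ring("m3", 11) +
    '</molecule>'
)


def check(cml_text):
    """Return (formula_matches, net_charge, dangling_bond_ids)."""
    root = ET.fromstring(cml_text)
    atoms = list(root.iter(NS + "atom"))
    # Element tally from every <atom>, at any depth.
    tally = Counter(a.get("elementType") for a in atoms)
    # Declared tally from the concise formula.
    tokens = root.find(NS + "formula").get("concise").split()
    declared = {el: int(n) for el, n in zip(tokens[0::2], tokens[1::2])}
    # Net charge: every formalCharge stated anywhere in the document.
    charge = sum(int(e.get("formalCharge", "0")) for e in root.iter())
    # Bonds must point only at atom ids that exist.
    ids = {a.get("id") for a in atoms}
    dangling = [b.get("id") for b in root.iter(NS + "bond")
                if not set(b.get("atomRefs2").split()) <= ids]
    return dict(tally) == declared, charge, dangling


if __name__ == "__main__":
    print(check(CML))
```

None of these checks needs to know anything about bond orders or where the negative charge "really" sits, which is the benefit of stopping where the agreed knowledge stops.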

That’s enough for now; I promise that in the next post, Rich, I will deal with pi-bonding etc. I hope you will agree that this is one valid representation of ferrocene.

Posted in "virtual communities", open notebook science, Uncategorized | Leave a comment

CAS and InChI – who can assign identifiers?

I’ve had two useful comments on CAS and InChI identifiers which have updated my knowledge (a feature of closed organizations and authorities is that updates often trickle out in small amounts, particularly if they represent an unwelcome progress towards openness). Before I comment in detail, some thoughts about identifiers.

WP gives:

With reference to a given (possibly implicit) set of objects, a unique identifier is any identifier which is guaranteed to be unique among all identifiers used for those objects and for a specific purpose. There are three main types of unique identifiers, each corresponding to a different generation strategy:

  • serial numbers, assigned incrementally
  • random numbers, selected from a number space much larger than the maximum (or expected) number of objects to be identified. Although not really unique, some identifiers of this type may be appropriate for identifying objects in many practical applications and are, with abuse of language, still referred to as “unique”
  • names or codes allocated by choice which are forced to be unique by keeping a central registry such as the EPC Information Services.

I shall omit (2) – the UUID generated algorithmically and, if long enough, “likely to be unique”.

The assignment of identifiers is a non-trivial task which requires expert knowledge of the domain and training for those assigning the identifiers. To avoid collisions and errors there is normally a single authority which is responsible – I am not aware of common identifier systems which are created collectively though it’s possible.

Identifier systems are the IP of the authority creating them and can be protected by copyright. This type of protection is common in many domains such as maps and chemistry. It is also important for regulatory and safety purposes that identifiers are well maintained so that any dispersal of the identifier system maintains integrity. For this reason many authorities will not allow their identifiers to be used by others without licence.

I’ll now comment on the comments, and add a summary.

  1. Rich Apodaca says:

    Peter, here are some other views about the limitations of InChI/InChIKey and the idea of an InChI resolver authority:

http://depth-first.com/articles/2008/12/02/five-questions-about-the-inchi-resolver

These are tough problems without easy answers.

I’d like to correct something you mention in this post:

>So Pubchem does not display any CAS numbers.
Not true. PubChem not only displays them, it lets anybody download their entire collection of CAS numbers (>350,000 at my last count) along with the rest of the PubChem database.

To see CAS numbers in Pubchem, you need to look at Substance summary pages, not Compound summary pages. For example, you’ll see the CAS number for caffeine (58-08-2) appears on this page:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=9684

This gives a nice CAS number lookup facility that is remarkably accurate:

http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem

These CAS numbers are added by individual depositors, and PubChem aggregates them. This feature was used to create a CAS number lookup facility in Chempedia with the ability to trace who ‘assigned’ which CAS number in PubChem (although the site is now down for major redesign):

http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia

You might also want to check out Common Chemistry, the free CAS number service created by CAS for the public to use:

http://zusammen.metamolecular.com/2009/03/31/sixty-four-free-chemistry-databases-part-6-common-chemistry-from-chemical-abstracts-service

PMR: First, many thanks for these corrections and updates (Common Chemistry came out very recently and I missed it)

Re Pubchem: I believe that most of the CAS numbers came from the NCI database where NCI paid CAS for the right to include the numbers. I do not know whether those can be re-used without infringing copyright – and I’d welcome authoritative information. I’m going to challenge “remarkably accurate”. This can only be asserted by CAS itself, as it is the only authority that can assert what a CAS number is. Alternatively they may, though I doubt it, have allowed individuals to use their Scifinder service as a way of checking CAS numbers. I suspect that this is forbidden by the terms of the contract.

When Wikipedia authors first checked their CAS numbers against Scifinder they were immediately told by CAS that they were in breach of contracts. There was an outcry (in which I took part) and CAS changed to allow this – I am not sure whether there is a limit, but I would be very surprised if widespread distribution of CAS numbers (against names or structures) was allowed in an authoritative manner.

“These CAS numbers are added by individual depositors”. This is part of the problem. No depositor has the authority to assert that a given substance is linked to a given CAS number. They can speculate, copy, etc. but mistakes will occur and there is no authority.
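What anyone can check without CAS is only the internal consistency of a registry number: the final digit is a checksum over the others (each digit weighted by its position counted from the right, summed, modulo 10). The sketch below, with a function name chosen here for illustration, applies that rule. A pass tells us the string is well formed and probably not mistyped; it says nothing about whether the number belongs to the substance a depositor attached it to.

```python
import re


def cas_check_digit_ok(cas):
    """True if the string has CAS layout and its check digit is consistent."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if m is None:
        return False
    body = m.group(1) + m.group(2)
    # Weight the rightmost body digit by 1, the next by 2, and so on.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == int(m.group(3))


if __name__ == "__main__":
    # 58-08-2 is the caffeine number quoted in the comment above.
    for candidate in ["58-08-2", "58-08-3", "not a number"]:
        print(candidate, cas_check_digit_ok(candidate))
```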

  2. Mat Todd says:

    …but is there anything inherently wrong with InChI? The behaviour of glucose in solution is arguably a chemical reaction, and therefore something that needs to be described by a network of InChIs, rather than being a limitation of InChIs themselves.

CAS numbers are used widely in chemical catalogues. They are useful for searching because they are short, and because they are unambiguous in what they are meant to describe. The shortest way of searching without CAS is molecular formula or drawing the structure, which are either longer or ambiguous. CAS can’t describe clays accurately either, beyond what one might buy from a supplier.

In the future, I’m going to search for chemical information using structures and networks of structures on web pages. For this I’m not going to care that InChIs are being used behind the scenes. What’s the upshot? Use InChIs and develop reaction networks. For fuzzy InChIs like clays – well, aren’t these cases minorities that can be worked out later? When was the last time you used a clay?

When was the last time you used a zeolite? Or a polymer? Probably last time you went into the lab.

Here we have the substance-molecule dichotomy very clearly. CAS states its numbers refer to substances. InChI necessarily refers to molecular structure. Many substances consist of several molecular structures. Many molecular formulae occur as more than one substance.

The mistake is as serious as equating a coding sequence to a protein structure. In many cases a 1:1 correspondence works; in many it is a completely wrong picture of our scientific knowledge. The same is true for chemistry.

It’s a very hard problem and requires a lot of work. Crowdsourcing InChIs and CAS may be the first generation and they will hopefully advance the political discussion to the extent where Openness is seen to be essential. At present I think the two most likely authorities are Pubchem and Wikipedia as there has to be a promise of sustainability. I think WP will do a very good job on ca 10,000 common chemical substances and molecules though it badly needs a coherent identifier scheme itself (indexing pages by natural language name does not constitute an identifier for an entity). Pubchem – rightly – captures all depositor metadata, but we have yet to work out how to identify the conflicts.

Pubchem has substances as well as molecular formulae for those compounds physically submitted to the Molecular Libraries. For the rest it has assertions from depositors which may or may not make it clear what substances if any were involved.

I completely support the InChI effort but it’s now time to take stock of the complexities as well as continuing to try to make chemical information Open.

Posted in "virtual communities", Uncategorized | 1 Comment

Identifiers: why we need them and when CAS and InChI don't mix

I have tried to write this without needing to know chemistry as there is an important political point.

I have been involved in InChI since the beginning and I am a great supporter of it. But it’s not a simple concept and it’s now being overused and badly used. I don’t know whether the situation is recoverable. But I can at least explain the problem.

First, chemistry is complicated. God or the LawsOfPhysics requires that. In principle every time you make a chemical compound it has a different composition.

That’s true for a huge number of compounds – a common one is clay. It has varying amounts of metal ions – sodium, potassium, calcium, aluminium, etc. Yet we are prepared to use names like “montmorillonite” to describe a single chemical concept even though it’s subtly different every time. But that’s beyond InChI.

Similarly we have glucose. We all know what that is – you can buy bottles at the drugstore and all are the “same stuff”. But when you put it in water you get a mixture of at least three species (open chain, alpha and beta isomers). But that’s beyond InChI.

Informatics for modern science requires precise description of concepts (ontology) and identifier systems. For example when we (at Glaxo) determined the structure of a drug bound to HIV protease the experiment has an identifier in the Protein data bank (1HTE). That identifier has been given by the Protein Data Bank, which validates the data, including versioning, thus acting as an authority. It’s more difficult to describe HIV protease. It varies between strains of the virus which mutates extremely rapidly – that’s one reason why it’s difficult to create drugs. So there are zillions of identifiers – ours was BH10, but they are generally classified under PF00077 (Sanger centre). The actual ids don’t matter – the point is that there have to be authorities and there have to be identifier systems.

But that’s not so easy in chemistry. The main problem is that there is no authority that catalogues chemistry and assigns open identifiers. It’s a very tricky problem because the identity of many compounds is very difficult. The most obvious problem – which no-one has formalised and which we are starting to do – is that there is an enormous difference between the macroscopic and microscopic – which I will refer to as “substances” and “molecules”. The conventional way to do this is through names, but names are variously applied to substances and molecules without distinguishing. It needed an authority to help define the problem and it needs an ontology.

Now the International Union of Pure and Applied Chemistry (IUPAC) knows there is a problem. It’s a worthy body and has honoured me by making me a fellow. It creates a large and detailed set of rules for naming compounds – from their molecular structure. But it doesn’t have the resources to name everything.

One useful way forward is to identify the molecular structures (or connection tables). It is very important to realise that not all compounds (e.g. sodium chloride) have connection tables and some like glucose have several. It’s only when a compound and a connection table have a 1:1 relationship that the InChI approach is useful. But for millions of compounds that’s roughly true. Unfortunately you have to know a lot of chemistry to know when it works and when it doesn’t. There’s no system in the world that manages it – that’s why we need an ontology.
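To make the three cases concrete, here is a minimal sketch of the bookkeeping an ontology would have to carry. The dictionary is a toy: the connection tables are represented only by labels, caffeine is included as an assumed example of the simple 1:1 case, and the function name is invented for this illustration.

```python
# Substance name -> labels of the connection tables asserted for it.
# Toy data: real entries would hold the connection tables themselves.
CONNECTION_TABLES = {
    "sodium chloride": [],                            # no molecular connection table
    "glucose": ["open chain", "alpha", "beta"],       # several species in water
    "caffeine": ["caffeine"],                         # assumed 1:1 case
}


def structure_identifier_applies(substance):
    """Classify a substance by how many connection tables describe it."""
    tables = CONNECTION_TABLES[substance]
    if not tables:
        return "no connection table"
    if len(tables) == 1:
        return "one-to-one"
    return "several connection tables"


if __name__ == "__main__":
    for name in CONNECTION_TABLES:
        print(name, "->", structure_identifier_applies(name))
```

Only the "one-to-one" case is one where a structure-derived identifier can safely stand in for the substance; deciding which case applies is exactly the part that needs chemical knowledge.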

But for those millions of compounds that pharma are interested in it’s often possible to draw structures that are good enough to identify the compound. The problem is that the structures can be drawn in different sizes, orientations, etc. So we create a connection table – which atoms are joined to which. This is an enormous advance and allows us to classify compounds and search them by computer.

But remember it helps in the macroscopic world only if there is a 1:1 relationship between CT and compound.

Now anyone can create a connection table and everyone will do it differently – they will call the atoms different names so no-one can compare them. But by using graph theory it is possible to “canonicalise” the CTs and one early paper by the Weiningers described exactly how, using a representation called SMILES.
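The idea of canonicalisation can be shown with a deliberately naive sketch. This is not the published SMILES algorithm or any production method: it simply tries every ordering of the atoms and keeps the smallest resulting description, which is only feasible for a handful of atoms. The function name and the toy connection tables (heavy atoms only, no bond orders) are assumptions made for the illustration.

```python
from itertools import permutations


def canonical_form(elements, bonds):
    """Return a labelling-independent form of a small connection table.

    elements: dict mapping atom name -> element symbol
    bonds:    iterable of (atom name, atom name) pairs
    Brute force over all atom orderings, so suitable for tiny examples only.
    """
    best = None
    for order in permutations(list(elements)):
        index = {name: i for i, name in enumerate(order)}
        symbols = tuple(elements[name] for name in order)
        edges = tuple(sorted(tuple(sorted((index[a], index[b])))
                             for a, b in bonds))
        candidate = (symbols, edges)
        if best is None or candidate < best:
            best = candidate
    return best


# The same C-C-O skeleton, named and ordered differently by two people.
ethanol_1 = ({"C1": "C", "C2": "C", "O1": "O"}, [("C1", "C2"), ("C2", "O1")])
ethanol_2 = ({"x": "O", "y": "C", "z": "C"}, [("y", "x"), ("z", "y")])
# A different skeleton with the same atoms: C-O-C.
ether = ({"a": "C", "b": "O", "c": "C"}, [("a", "b"), ("b", "c")])

if __name__ == "__main__":
    print(canonical_form(*ethanol_1) == canonical_form(*ethanol_2))
    print(canonical_form(*ethanol_1) == canonical_form(*ether))
```

Real canonicalisers avoid the factorial search, and the details of how they break ties are where implementations can disagree, which is why a published and open algorithm matters.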

The SMILES system was adopted by the pharma industry which used it to produce “canonical SMILES”. Now this looks like a good thing, but it wasn’t. First the program (DAYLIGHT) was closed and commercial, and secondly the program gave different answers from the algorithm published in the literature. Chaos.

Several of us tried to get the Daylight company to release the algorithm, but to no avail. They never have done and the closed nature has helped to hold chemistry back, as it discredited the idea of universal identifier systems and interoperability.

So IUPAC put together a working group to develop an Open alternative. I was on this group – contributed a lot of input. It’s been a political success in that it’s highly used. But the design has serious flaws.

First it tries to tackle the structure-compound problem by creating multiple representations for certain types of compound (tautomers). That complicates InChIs severely, without adding commensurate benefit. Secondly it tries to represent imperfect knowledge. One of the major problems in chemical representation is that people miss out hydrogen atoms. That saves writing time, but it’s lazy and it introduces untold errors. Another problem is that chemists often represent stereochemistry imprecisely, so it’s not uncommon to find many connection tables for the “same” compound or substance. That means that in practice there can be many InChIs for the same compound. There has been an attempt to fix this by declaring certain representations more fundamental, and this will help but the multiple-InChI problem still remains. Only an ontological approach will help.

It’s compounded because people found the long lengths of many InChIs were inconvenient and so created hashes (InChI-KEY). Here the different InChIs for the same compound have no similarity and they can only be compared by having a resolving authority. So we are now back to authorities, without the infrastructure or communal will to make them work.
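The loss of similarity under hashing is easy to demonstrate. The sketch below uses a generic SHA-256 digest from the Python standard library; it is not the actual InChIKey procedure, and the second string is a made-up variant of the ethanol InChI that differs in its final character, used only to stand in for "another InChI for the same compound".

```python
import hashlib


def digest(identifier):
    """Hex SHA-256 digest of an identifier string, truncated for display."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:16]


def shared_prefix(a, b):
    """Number of leading characters two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Standard InChI for ethanol, and a hypothetical variant (last character altered).
inchi_a = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
inchi_b = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H2"

if __name__ == "__main__":
    print("strings share", shared_prefix(inchi_a, inchi_b), "leading characters")
    print(digest(inchi_a))
    print(digest(inchi_b))
```

The two inputs are visibly near-identical, but nothing about the digests reveals that, so relating them needs a lookup table maintained by someone: in other words, a resolving authority.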

There are already many identifier-giving authorities in chemistry. Some identify substances (like safety authorities); others identify molecules. The American Chemical Society has the best known and largest identifier system – the CAS registry number. But it has several drawbacks. The way in which identifiers are assigned is not public so we cannot know whether a substance corresponds to a given CAS number. Then there is no public indication whether an identifier is for a connection table or a substance. And then CAS numbers are CAS’s IP.

Now that’s OK in that they have assigned them by the “sweat of their brow” so they copyright them. If you want to find out what structure corresponds to what number you have to pay 5.80 USD (last time I looked) per compound. If you have hundreds of thousands of compounds (such as a government agency responsible for health or environment) that’s a lot of dollars. So Pubchem does not display any CAS numbers.

Pubchem is – as far as I know – the only system that tries to distinguish compounds (CID) from substances (SID). Substance information is donated by suppliers of information and without an ontology it won’t be clear whether it’s molecule or substance – you have to guess from the type of depositor.

In summary, chemistry, like bioscience, is complicated. It needs the community to work out description systems and identifiers, but unlike bioscience, the political aspects are keeping it in the dark ages. Until we have a public Open authority for substance identifiers we can’t solve the chemical problem. And while the ACS lobbies against openness and the NIH (for whatever reason; PRISM, Pubchem, Conyers) it’s going to be tough.

But I know who my money is on.

Posted in Uncategorized | 4 Comments