Can you help OPSIN with disambiguating chemical names?

[This post is requesting input from the chemical community]

In creating our IUPACName2Structure converter OPSIN we need to cater for a diversity of usage. This arises from

  • ambiguity and multiple approaches in the IUPAC rules

  • authors of documents who ignore or adapt the IUPAC rules

  • programs which ignore or adapt IUPAC rules

It’s difficult for Daniel and me to assert that a given approach is better than another, so I’m turning to the court of informed public chemical opinion. Ideally we would like to get to a system where an OPSIN user community, rather like Wikipedia’s, develops a communal view based on the best reading of the literature and current practice. This post is really to set the scene. If any contributors to the IUPAC rules respond, their views should be given special weight.

In principle we should adopt IUPAC practice wherever we can determine with certainty what it is.

In practice we have run three of the more common commercial programs to see how they interpret (not generate) chemical names. There are differences of opinion (not just errors of implementation).

So here are two fundamental questions:

  • Do spaces in chemical names matter? If so what are the rules?

  • Do hyphens in chemical names matter? If so what are the rules?

Here is our standard example of ambiguous practice. The name chloroethylbenzene is, I think, a valid IUPAC name. But it is ambiguous and can represent 5 structures (one of which can have further stereoisomers). PubChem correctly lists all 5:

2: CID 69330
p-Chloroethylbenzene; 1-Chloro-4-ethylbenzene; Benzene, 1-chloro-4-ethyl- …
IUPAC: 1-chloro-4-ethylbenzene
MW: 140.610060 g/mol | MF: C8H9Cl

3: CID 6995
O-CHLOROETHYLBENZENE; 2-Ethylchlorobenzene; Benzene, 1-chloro-2-ethyl- …
IUPAC: 1-chloro-2-ethylbenzene
MW: 140.610060 g/mol | MF: C8H9Cl

4: CID 231496
Ethylchlorobenzene; Phenethyl chloride; (2-Chloroethyl)benzene …
IUPAC: 2-chloroethylbenzene
MW: 140.610060 g/mol | MF: C8H9Cl
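The count of five can be checked with a small sketch. It is only an illustration, not how OPSIN works: the site list, the `canonical` helper and the name templates are all assumptions made here. Ring positions are numbered with the ethyl group on carbon 1, and positions 2/6 and 3/5 are treated as equivalent by symmetry.

```python
# Hydrogen-bearing sites of ethylbenzene as (region, locant).
# Assumption: ring locants are counted with the ethyl group on ring carbon 1.
SITES = [("ring", 2), ("ring", 3), ("ring", 4), ("ring", 5), ("ring", 6),
         ("chain", 1), ("chain", 2)]

def canonical(site):
    """Map a site to the representative of its symmetry class.
    Ring positions 2/6 and 3/5 are mirror images of each other."""
    region, locant = site
    if region == "ring":
        return (region, min(locant, 8 - locant))
    return site

def name(site):
    """Build a fully located name for one chlorine on the given site."""
    region, locant = site
    if region == "ring":
        return f"1-chloro-{locant}-ethylbenzene"
    return f"({locant}-chloroethyl)benzene"

def monochloro_isomers():
    """Names of the distinct structures with a single chlorine."""
    representatives = sorted({canonical(s) for s in SITES})
    return [name(s) for s in representatives]

for isomer in monochloro_isomers():
    print(isomer)
```

This prints two side-chain structures and three ring structures. The (1-chloroethyl)benzene entry is the one with a stereocentre, which the sketch does not enumerate further.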

So we’d welcome comments answering the following and similar questions:

  • If OPSIN is given chloroethylbenzene, should it refuse to parse it because of ambiguity? (We intend that OPSIN should, as far as possible, show where the ambiguity occurs.)

  • If it does parse it, should it guess one of the five structures? (We are working on returning generic structures but that’s not for now).

  • Should the guess be random or should it be informed by some principles (including popularity of chloroethylbenzene as a synonym)?

  • Can punctuation help to remove the ambiguity? If so, which of the following are unambiguous or less ambiguous: chloro ethylbenzene, chloroethyl benzene, chloro ethyl benzene? Remember that the spaces may not be put in by the author, but by a word processor or technical editor. Similar considerations apply to: chloro-ethylbenzene, chloroethyl-benzene, chloro-ethyl-benzene. Remember that word processors and editors may hyphenate names.

  • What about incomplete locants? Which of the following are unambiguous? 1-chloroethylbenzene, 2-chloroethylbenzene, 3-chloroethylbenzene, 4-chloroethylbenzene.

  • Would the context influence you? Is a study of 2-chloroethylbenzene, 3-chloroethylbenzene and 4-chloroethylbenzene less ambiguous?
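One way to frame the incomplete-locant question is to ask how many of the five readings a single locant could point to. The sketch below does that under a loudly stated assumption: the locant refers to the chlorine, counted either along the ethyl side chain or around the ring with the ethyl group on carbon 1. The `READINGS` table and `readings_for` function are hypothetical and are not part of OPSIN. A different convention (for example, reading the locant as a ring chlorine with the ethyl group left unplaced) would give different answers.

```python
# Hypothetical table of the five readings of "chloroethylbenzene":
# (fully located name, where the chlorine sits, locant of the chlorine).
# Assumption: ring locants are counted with the ethyl group on carbon 1.
READINGS = [
    ("(1-chloroethyl)benzene", "chain", 1),
    ("(2-chloroethyl)benzene", "chain", 2),
    ("1-chloro-2-ethylbenzene", "ring", 2),
    ("1-chloro-3-ethylbenzene", "ring", 3),
    ("1-chloro-4-ethylbenzene", "ring", 4),
]

def readings_for(locant):
    """Return every reading whose chlorine could carry this locant."""
    return [name for name, _, loc in READINGS if loc == locant]

for n in (1, 2, 3, 4):
    found = readings_for(n)
    status = "unambiguous" if len(found) == 1 else "ambiguous"
    print(f"{n}-chloroethylbenzene: {status}: {found}")
```

Under this assumption only the locant 2 remains ambiguous, since it fits both the side chain and the ortho ring position. That is consistent with the PubChem entries above, where 2-chloroethylbenzene is given as the IUPAC name of (2-chloroethyl)benzene while o-chloroethylbenzene is a ring-substituted compound.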

Note that in some cases ambiguity can be resolved by enumerating the chemically sensible structures. Thus decachloroethylbenzene is unambiguous assuming normal valence rules, but any number other than 10 chlorine substituents is ambiguous. And for, say, tetrachloroethylbenzene there will be quite a lot of isomers! We should not load this fortuitous resolution onto OPSIN and I would throw decachloroethylbenzene out as ambiguous (no locants).
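To put rough numbers on this, here is a minimal counting sketch. It assumes normal valences, counts only constitutional isomers (stereoisomers are ignored), and treats ring placements related by the mirror through carbons 1 and 4 as the same structure. The function names are made up for this illustration.

```python
from itertools import combinations

# Ring carbons that carry a hydrogen, with the ethyl group on carbon 1.
RING = (2, 3, 4, 5, 6)

def ring_patterns(k):
    """Distinct ways to place k chlorines on the ring, treating
    mirror-image placements (2<->6, 3<->5) as the same pattern."""
    seen = set()
    for combo in combinations(RING, k):
        mirrored = tuple(sorted(8 - p for p in combo))
        seen.add(min(combo, mirrored))
    return len(seen)

def count_isomers(n):
    """Constitutional isomers of ethylbenzene with n chlorines replacing
    hydrogens. Stereoisomers are not counted."""
    total = 0
    for ch2 in range(3):        # 0 to 2 chlorines on the CH2 group
        for ch3 in range(4):    # 0 to 3 chlorines on the CH3 group
            k = n - ch2 - ch3   # the remainder must go on the ring
            if 0 <= k <= 5:
                total += ring_patterns(k)
    return total

for n in (1, 4, 10):
    print(n, count_isomers(n))
```

With these assumptions the sketch gives 5 structures for one chlorine, 44 for four, and exactly 1 for ten, so only the fully substituted name is resolved by valence alone.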

Your comments will be very useful and also any suggestions as to the governance of this in future.


4 Responses to Can you help OPSIN with disambiguating chemical names?

  1. Alan McNaught says:

    Peter – you might find the brief section on punctuation in the 1993 Guide to IUPAC Nomenclature of Organic Chemistry helpful: there are comments on the use of spaces and hyphens:
    http://www.acdlabs.com/iupac/nomenclature/93/r93_45.htm
    Alan

    • pm286 says:

      @Alan many thanks. I am sure Daniel will be delighted to have a guide and will try to obey the letter of the law (which is more than some programs do).
      I’ve now looked at this and I think OPSIN honours these. It’s clear that there are only certain places where spaces are required; elsewhere they are simply wrong. The elision of (required) hyphens and locants is common and often creates ambiguity.

  2. David Hall says:

    I think it should parse it, and guess the two most popular. This shows to the user that there is ambiguity, but gives some indication of likely structures. Admittedly, this is probably the hardest option. No matter what, it shouldn’t allow the user to believe the input is unambiguous.
