We’re looking to create an Open semantic resource for chemistry for a group of common chemicals – partly as a partner in the ORECHEM (Chemistry Repositories) project, and partly because we need it for our own work in machine understanding of chemical text (OSCAR). We are developing an RDF-based repository and want to populate it with semantic information. Initially this means a maximum of 2-3 thousand common chemicals with names, identifiers, chemical formulae of various types and the commonest (mainly physical) properties. Nothing particularly special – the sorts of things that undergraduates will come across – but it must be semantic and it must be Openly redistributable without permission (Open Data). Note that we cannot legally robotically access the major chemical databases maintained by CAS and Beilstein. Nor can we (yet) extract enough high-quality information from the published literature. We are close to having the technology, but still encounter the lawyer-barrier.
Initially we want to hold the following:
- chemical names
- chemical composition (brutto formula – When did this term start being used, and is there a definition?)
- chemical structure/connection table
- identifiers
- molecular mass
- physical properties
- (possibly) safety data
- links to other sites
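As a rough illustration of how one entry holding the items above might be flattened into subject/predicate/object statements for an RDF-style store, here is a minimal sketch. The class name, the field names and the `ex:` predicate names are all hypothetical, invented for this example; they are not the schema we actually use.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class ChemicalRecord:
    """One repository entry; every field name here is illustrative only."""
    key: str                                                    # hypothetical local key
    names: List[str] = field(default_factory=list)
    formula: Optional[str] = None                               # brutto formula
    connection_table: Optional[str] = None                      # e.g. SMILES or InChI
    identifiers: Dict[str, str] = field(default_factory=dict)   # authority -> identifier
    molecular_mass: Optional[float] = None
    properties: Dict[str, str] = field(default_factory=dict)    # property -> value with units
    safety: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)

    def triples(self) -> Iterator[Triple]:
        """Yield one triple per stored fact, skipping anything left empty."""
        subject = f"ex:{self.key}"
        for name in self.names:
            yield (subject, "ex:name", name)
        if self.formula is not None:
            yield (subject, "ex:formula", self.formula)
        if self.connection_table is not None:
            yield (subject, "ex:connectionTable", self.connection_table)
        for authority, identifier in self.identifiers.items():
            yield (subject, f"ex:identifier/{authority}", identifier)
        if self.molecular_mass is not None:
            yield (subject, "ex:molecularMass", str(self.molecular_mass))
        for prop, value in self.properties.items():
            yield (subject, f"ex:property/{prop}", value)
        for note in self.safety:
            yield (subject, "ex:safety", note)
        for url in self.links:
            yield (subject, "ex:link", url)


record = ChemicalRecord(key="aspirin", names=["aspirin"], formula="C9H8O4")
for triple in record.triples():
    print(triple)
```

Keeping identifiers as a mapping from authority to identifier is deliberate: it leaves room for several authorities to name the same substance differently.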
A useful starting point is Rich Apodaca’s list of free chemistry databases (and you can find many of the databases mentioned here in it). This is about 15 months old and there may be more now. I also include ChemSpider as it has some community contributions. Many databases are not really relevant as they are too large, do not have programmatic access, or are only partly chemical.
The obvious starting point is Wikipedia and we are working closely as part of Wikipedia to add semantics to the information. Indeed we would see the results of our endeavours as giving a resource which could be used to help WP in quality control. The current main problem is the inconsistency of the information, especially in the variable syntax and semantics of the InfoBox.
It would be nice to start with the NIST webbook but this is not Open – and there are copious copyright notices and indications that NIST may wish to charge in the future. This is unusual for US government works but NIST has a special dispensation to recover costs.
ChEBI is the most semantic Open resource, and has been assembled by humans, but there are many common chemicals not in it.
The various aggregators have very large numbers of molecules and therefore do not define a useful starting point for “common chemicals”.
All of these are potentially useful for enhancing information once it has been found.
The other major resource on the Web is MSDS (Material Safety Data Sheets). These collections are freely accessible but probably not Open. However they form a useful starting point. The two main ones are the Inchem site and the collection hosted on the Oxford University Physical and Theoretical Chemistry server (Chemical and Other Safety Information from the Physical Chemistry). Each has somewhere between 1000 and 2000 unique chemical compounds. Manufacturers are obliged to create MSDS for their products, and we can expect them to be accurate because they have some legal force.
How can we check the “correctness” of the information on web pages? In general we can’t. All we can do is to compare information and note where it agrees or disagrees. To go further we need to know who the “authority” is. We trust some authorities more than others for a whole variety of reasons. But in general there is no “right” or “wrong”; there are assertions made more or less strongly by authorities to which we give variable weights.
A good example is Wikipedia. I trust many of the articles in physical science to a high degree. I may modulate that with the age of the article and the number of different collaborators. This relies on the “wisdom of crowds”, but I think it works well in chemistry. Chemspider has harnessed the wisdom of crowds but I suspect that only a very small fraction of their entries have been human-curated, and I give an example below which seems to need attention.
I trust the suppliers of MSDS sheets … but read on.
In general the statement that “the formula of aspirin is C9H8O4” is unverifiable. I can, however, assert that the statement “Wikipedia asserts that the formula of aspirin is C9H8O4” is true. (To be picky I should give the date of this assertion.) Chemspider makes the same assertion. So does Oxford PTCL. And so do all the other ones I have checked. But there are lots of them, and for this we need robots. And, for less common compounds, as we’ll see, it doesn’t work out as well.
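To make the idea of weighted assertions concrete, here is a minimal sketch of how a robot might record who asserts what and add up the support for each value. The `Assertion` class, the `weigh` function and especially the trust weights are assumptions made up for this illustration; they are not measured values and not part of any existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Assertion:
    """'authority asserts that prop of subject is value', with an optional date."""
    authority: str
    subject: str
    prop: str
    value: str
    date: Optional[str] = None  # when the page was read; left unset in this sketch


def weigh(assertions: Iterable[Assertion],
          trust: Dict[str, float],
          default: float = 0.5) -> Dict[str, float]:
    """Sum the trust of every authority behind each distinct value."""
    totals: Dict[str, float] = defaultdict(float)
    for a in assertions:
        totals[a.value] += trust.get(a.authority, default)
    return dict(totals)


# The three sources named in the text all give the same formula for aspirin.
assertions = [
    Assertion("Wikipedia", "aspirin", "formula", "C9H8O4"),
    Assertion("Chemspider", "aspirin", "formula", "C9H8O4"),
    Assertion("Oxford PTCL", "aspirin", "formula", "C9H8O4"),
]

# Hypothetical weights, chosen only to show the mechanics.
trust = {"Wikipedia": 0.8, "Chemspider": 0.6, "Oxford PTCL": 0.7}

support = weigh(assertions, trust)
print(support)
```

If the authorities disagreed, `support` would contain one entry per competing value, and the robot would report the disagreement rather than declare any of them “right”.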
How can a robot identify a chemical on a web site? It’s got the following choices:
- Common Names. This is how Wikipedia organizes the top-level access to chemicals. But, as we know, most common chemicals have tens or hundreds of synonyms and some of these synonyms refer to more than one compound.
- Systematic name. This can be useful, and it’s what we use in OSCAR. But it’s hard work parsing the totality of chemical names as there are many dialects, sub-grammars, etc. There are no good metrics for this – I have heard reported values of ca. 60% for name recognition for the commercial packages (our OPSIN does reasonably well for simple compounds but needs a lot more work – it’s an area where volunteer contributions might scale).
- (Brutto) formula – e.g. C4H10O. This does not normally identify a compound completely but is a useful constraint – two compounds with different formulae can be held to be non-equivalent.
- Molecular mass. This is often reported and can usually be calculated from the brutto formula. Again it can be used as a constraint to assert non-equivalence.
- Connection tables (also serialized as SMILES and InChI). These work well for organic compounds, poorly for inorganic ones. But there can be different levels of precision (hydrogens, stereochemistry, etc.). Identical connection tables (after canonicalisation) can be held to show equivalence, but some compounds have several connection tables (e.g. glucose).
- Identifiers. Potentially, identifiers are the easiest and most powerful tool. An identifier is a unique string associated by an authority with a substance (not necessarily pure). If an authority(X) asserts that substance A(X) and substance B(X) have the same identifier then they can be said to be equivalent. There are many authorities making such assertions. Ultimately it is only the authority(X) who can make assertions about its identifiers. To be widely useful the authority should provide a lookup (resolution) service which is both human- and machine-accessible. In practice many authorities don’t do this or provide only a toll-access service. The identifiers are also often copyright and may or may not be copied. This often leads to other authorities(Y) who copy identifiers without permission and make their own assertions, which may or may not be compatible with those of authority(X). Frequently also the source of the identifier is not given. Thus many people who submit information to Pubchem give identifiers and these are listed as “[RN]” = registry number. For aspirin, for example, there seem to be many identifiers – in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN].
- Physical properties. It is generally assumed that for any pure compound many of the physical properties are invariant. (This is not true if it has solid polymorphs or similar metastable states, but it’s a very useful guide for non-equivalence.)
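The use of formula and molecular mass as constraints can be sketched in a few lines. This is only an illustration under simplifying assumptions: it handles flat formulae such as C4H10O (no brackets, charges or hydrates), the table of atomic masses is deliberately tiny, and the function names and the mass tolerance are my own choices for the example.

```python
import re
from collections import Counter

# Rounded standard atomic masses (g/mol) for a handful of common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

_FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count the atoms in a flat brutto formula, e.g. 'C4H10O'."""
    if not _FLAT_FORMULA.fullmatch(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    counts: Counter = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def molecular_mass(formula: str) -> float:
    """Calculate the molecular mass from the brutto formula."""
    total = 0.0
    for element, count in parse_formula(formula).items():
        if element not in ATOMIC_MASS:
            raise ValueError(f"no atomic mass stored for {element!r}")
        total += ATOMIC_MASS[element] * count
    return total


def formulae_differ(a: str, b: str) -> bool:
    """True means the two entries cannot be the same compound.

    False proves nothing: isomers share a formula.
    """
    return parse_formula(a) != parse_formula(b)


def masses_differ(mass_a: float, mass_b: float, tolerance: float = 0.05) -> bool:
    """True if two reported masses disagree by more than the tolerance."""
    return abs(mass_a - mass_b) > tolerance


print(formulae_differ("C4H10O", "C9H8O4"))
print(formulae_differ("C4H10O", "OC4H10"))
print(round(molecular_mass("C9H8O4"), 3))
```

Note that the comparison is on element counts, not on the string, so the same formula written in a different element order is not flagged as different.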
In the next post I’ll show how we get on with some typical exploration. It may show the scale of the problem we face in reconciling current chemical information.