Catalysed by a recent comment on a 2007-12 post (Exploring RDF and CML):
here’s an update on where we are with molecular repositories. We shall have a clearer idea when several of our group present at Open Repositories 2008 (OR08) in 10 days’ time (lots of progress can be made in 10 days). I’m omitting details here so as not to spoil the show next month.
- We are committed to RDF+XML/CML as the future for molecular information. This is the only way that we can manage such diverse information as documents, recipes, results of calculations, spectra, crystallography, physical and analytical properties, etc. The CML schema is now being used in many places and has remained stable for 15 months. Almost all parts have now been tested in the field (the main exception is isotopic variation – e.g. in geoscience). We can easily go from CML to RDF – the reverse is not always possible. The value of CML is that it is currently easier to use for chemical calculations as there is a knowable coherence of related concepts. Note that the CML community is developing a number of subdomains (“conventions”) which allow some degree of local autonomy, as in CMLComp.
- We are enthusiastic partners in the OREChem project (Chemistry Repositories, and from Jim Downing ORE! Unh! Huh! What is it good for?). This uses named (RDF) graphs to describe local collections (“Aggregates”) of URIs. The project will have several molecular repositories of which we shall contribute at least 2 (CrystalEye and our nascent “molecular repository”). All content will be Open Data.
- Jim Downing has developed a lightweight repository (MMRe) based on Atom/REST and RDF. I won’t give too much away except to say it is deployed and over the last few days Joe Townsend has been adding data from chemistry theses (SPECTRaT) and Lezan Hawizy has been adding our collection of “common molecules” (scraped from various sources). This can now be queried through SPARQL.
- Andrew Walkingshaw has converted the CML in CrystalEye to RDF – 100,000 entries and probably about 10 million triples. He’s been working with a well-known semantic web company (not sure if this is public yet) and has done some very exciting extraction and mashups. SPARQL searches work over this size. Andrew has also developed Golem – a system which extracts dictionary links (cml:@dictRef) from CML computations and is able to build dictionaries (ontologies) automatically and then to extract data.
- In the last four days Thomas Steinke has converted VAMP to emit CML. We have run a few hundred calculations automatically (by extracting molecules from the NMREye repository, converting them to input, running the calculation, and then converting to RDF). The results – which contain coordinates, energies and NMR peaks – are being fed into another local repository.
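To make the CML-to-RDF step in the list above concrete, here is a minimal sketch using only the Python standard library. It is not the code used for CrystalEye, Golem or the VAMP pipeline. The CML fragment, the `ex:` dictionary prefix, the `ex:formula` predicate and the base URI are all made up for illustration; only the CML namespace and the `dictRef` attribute come from CML itself. Each `dictRef` simply becomes the predicate of a triple.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical CML fragment: the id, dictRef values and numbers are invented.
CML_DOC = """
<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <formula concise="H 2 O 1"/>
  <property dictRef="ex:totalEnergy">
    <scalar units="ex:hartree">-76.4</scalar>
  </property>
  <property dictRef="ex:dipole">
    <scalar units="ex:debye">1.85</scalar>
  </property>
</molecule>
"""


def cml_to_triples(xml_text, base="http://example.org/mol/"):
    """Flatten one CML molecule into (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text.strip())
    subject = base + root.get("id")
    triples = []

    # The concise formula becomes a single literal-valued statement.
    formula = root.find(f"{{{CML_NS}}}formula")
    if formula is not None:
        triples.append((subject, "ex:formula", formula.get("concise")))

    # Every property that carries a dictRef becomes one triple whose
    # predicate is the dictionary reference itself.
    for prop in root.iter(f"{{{CML_NS}}}property"):
        scalar = prop.find(f"{{{CML_NS}}}scalar")
        dict_ref = prop.get("dictRef")
        if scalar is not None and dict_ref:
            triples.append((subject, dict_ref, float(scalar.text)))
    return triples


triples = cml_to_triples(CML_DOC)
predicates = sorted({p for _, p, _ in triples})
```

Note that the units are dropped in this flattening, which is one small example of why going back from RDF to CML is not always possible.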
So we have a variety of sources which will all be available. We face a number of exciting questions.
- How do we express a molecule in RDF? We are gradually converging on an “aggregate” where a molecule has identifiers, properties, and special resources such as chemical formula, the CML connection table, and a list of chemical names.
- How do we assign identifiers? This is a really hard problem. Although for many chemicals there is little doubt about the relationship between names, identities and properties, there cannot, in general, be a “correct” structure or a “unique URI” for a chemical. Look, for example, at “Phosphorus Pentoxide” (in WP). Experiment shows that there are several different forms, with different chemical connectivities. There are 2 formulae (P2O5 and P4O10), each with a different CAS number (Chemical Abstracts is a major authority in chemistry). Are these different chemicals or do they represent our changing chemical knowledge? Is one used for early publications and another for later ones? Only CAS can say when one number is used and not the other. It is because of this uncertainty that we cannot know exactly how many different chemicals there are in the CAS collection.
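A rough sketch of what such an “aggregate” might look like, and of why a name alone cannot serve as an identifier. The class, the `ex:` predicates, the URIs and the CAS placeholders are all hypothetical; this is one possible shape under those assumptions, not the model we have settled on.

```python
from dataclasses import dataclass, field


@dataclass
class MoleculeAggregate:
    """Hypothetical aggregate: one URI plus the resources hung off it."""
    uri: str
    formula: str
    names: list = field(default_factory=list)
    identifiers: dict = field(default_factory=dict)  # scheme -> value
    properties: dict = field(default_factory=dict)   # property -> value
    connection_table: str = ""                       # URI of a CML document

    def to_triples(self):
        triples = [(self.uri, "ex:formula", self.formula)]
        triples += [(self.uri, "ex:name", n) for n in self.names]
        triples += [(self.uri, f"ex:identifier/{scheme}", value)
                    for scheme, value in self.identifiers.items()]
        triples += [(self.uri, f"ex:property/{key}", value)
                    for key, value in self.properties.items()]
        if self.connection_table:
            triples.append((self.uri, "ex:connectionTable", self.connection_table))
        return triples


# Two aggregates sharing a name; the CAS values are placeholders, not real numbers.
p2o5 = MoleculeAggregate(
    uri="http://example.org/mol/p2o5",
    formula="P2O5",
    names=["phosphorus pentoxide"],
    identifiers={"cas": "CAS-PLACEHOLDER-1"},
)
p4o10 = MoleculeAggregate(
    uri="http://example.org/mol/p4o10",
    formula="P4O10",
    names=["phosphorus pentoxide"],
    identifiers={"cas": "CAS-PLACEHOLDER-2"},
    connection_table="http://example.org/cml/p4o10.cml",
)

all_triples = p2o5.to_triples() + p4o10.to_triples()


def subjects_with_name(triples, name):
    """All subject URIs that carry the given name."""
    return sorted({s for s, p, o in triples if p == "ex:name" and o == name})


matches = subjects_with_name(all_triples, "phosphorus pentoxide")
```

Looking up the name returns two subjects, so any scheme that mints a URI from the name has to decide which one it means.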
There cannot be a platonic semantic description of chemical identity – many chemicals do not have a “correct” structure. Antony Williams has been doing a heroic and valuable job in detecting inconsistencies in reporting chemical structure and resolving them where possible (eMolecules and ChemSpider – A Respectful Comparison of Capabilities). But he is not establishing a “correct” structure – he is making authoritative statements about the relationship between names, structures and identifiers.
This brings us to why RDF – probably in its quad form (i.e. with provenance) – is important to describe chemical structure.
- Many substances occur in several forms and there is no single structure. We hope that RDF can manage these relationships.
- Many name-to-structure assignments have changed over time as our experimental techniques become more powerful. Thus the C19 chemists would first write PO5 (atomic weights were not “correct”), then P2O5 and only after X-ray crystallography P4O10. To understand historical chemistry we have to know the relationships used at the time.
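The historical example above can be sketched as quads, i.e. triples with a fourth slot naming the graph (the provenance) in which the statement holds. The graph names, subject URI and `ex:formula` predicate below are hypothetical labels chosen for illustration; a real store would use proper named-graph URIs.

```python
from collections import namedtuple

Quad = namedtuple("Quad", "subject predicate object graph")

SUBJECT = "http://example.org/substance/phosphorus-pentoxide"

# One substance, three formulae, each asserted in a different (hypothetical) graph.
QUADS = [
    Quad(SUBJECT, "ex:formula", "PO5", "ex:era/early-atomic-weights"),
    Quad(SUBJECT, "ex:formula", "P2O5", "ex:era/revised-atomic-weights"),
    Quad(SUBJECT, "ex:formula", "P4O10", "ex:era/post-xray"),
]


def objects(quads, subject, predicate, graph=None):
    """Objects for (subject, predicate), optionally restricted to one graph."""
    return [q.object for q in quads
            if q.subject == subject
            and q.predicate == predicate
            and (graph is None or q.graph == graph)]


every_formula = objects(QUADS, SUBJECT, "ex:formula")
modern_formula = objects(QUADS, SUBJECT, "ex:formula", graph="ex:era/post-xray")
```

Without the graph, the three formulae look like a contradiction; with it, each statement is true in its own context and a historical paper can be read against the graph of its time.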
We have scraped about 10000 compounds from the web including Wikipedia and have a variety of triples associated with each. There is little overlap of triples – names, CAS, formulae are present or absent. So we now need to use RDF technology to reconcile this information. It’s a complex task and we will probably have to add weights/probabilities to some of the statements – some authorities are less reliable than others.
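One simple way to attach weights to statements is sketched here, assuming each scraped value is tagged with its source and each source has a reliability weight between 0 and 1. The source names, weights and default weight are invented for the example, and summing weights is just one possible scoring rule, not something we have settled on.

```python
from collections import defaultdict


def best_value(statements, source_weight, default_weight=0.1):
    """Pick the value with the highest total source weight.

    statements: list of (value, source) pairs for one subject and predicate.
    Returns (value, share), where share is that value's fraction of all weight.
    """
    scores = defaultdict(float)
    for value, source in statements:
        scores[value] += source_weight.get(source, default_weight)
    total = sum(scores.values())
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total


# Invented sources and weights, for illustration only.
weights = {"source-a": 0.9, "source-b": 0.6, "source-c": 0.5}
formula_statements = [
    ("P4O10", "source-a"),
    ("P2O5", "source-b"),
    ("P4O10", "source-c"),
]

value, share = best_value(formula_statements, weights)
```

The losing statements are not thrown away; keeping them alongside their weights lets a later, better authority change the outcome.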
In the first instance we’ll probably use some of the commonest identifiers to assert identity and that’s the version we should be releasing in a few days.
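Asserting identity through shared identifiers can be sketched with a small union-find: any two records that agree on a trusted identifier end up in the same cluster. The record layout, the choice of schemes and the CAS placeholders are assumptions made for the example.

```python
def cluster_by_identifier(records, schemes=("cas",)):
    """Group record indices that share a value under any of the given schemes."""
    parent = list(range(len(records)))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (scheme, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for scheme in schemes:
            value = record.get(scheme)
            if value is None:
                continue
            key = (scheme, value)
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented records; the CAS values are placeholders, not real numbers.
records = [
    {"source": "site-a", "name": "phosphorus pentoxide", "cas": "CAS-PLACEHOLDER-1"},
    {"source": "site-b", "name": "phosphoric anhydride", "cas": "CAS-PLACEHOLDER-1"},
    {"source": "site-c", "name": "phosphorus pentoxide", "cas": "CAS-PLACEHOLDER-2"},
    {"source": "site-d", "name": "water"},
]

by_cas = cluster_by_identifier(records)
by_cas_and_name = cluster_by_identifier(records, schemes=("cas", "name"))
```

Trusting only CAS keeps the two phosphorus oxide entries apart; adding names as identifiers merges them, which is exactly the kind of choice that decides how many “different chemicals” a collection contains.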