Wikipedia is (rightly) becoming the first place that people look for well-understood scientific information, including chemistry. Chemical compounds are particularly suited to this, as the concept is over 150 years old and it is universal practice to index parts of chemistry through compounds. In most cases the compound can be given one or more identifiers, though the relationship between these can be complex. Examples are names, serial numbers and other arbitrary IDs, and chemical structures.
Recently two derivative works of WP compounds were announced:
This post is primarily to welcome these developments and add some general comments.
- The style of the two sites is different and they appear to be completely independent. They are somewhat complementary: CS integrates the entries into a datacentric format; MM describes entries as monographs and has an emphasis on text and images. Neither site references the other AFAICS.
- I think both sites use the WP title and URL as the primary identifier in WP. WP also has a set of numeric identifiers which I think represents the internal WP uniquification system. This may matter at some time as WP entries can be deleted or moved while the identifiers are sacrosanct.
- Both sites have a search capability (I have not compared them). I may have missed it but there was no clear way to download results.
- It is not clear what the ingestion strategy is for either site. MM has a mechanism for humans to ingest entries at the same time as they author them on WP.
- I am not clear what data transformation (if any) is carried out automatically by the ingest process. Data in WP Infoboxes is still variable (the DBPedia 2008-02 release shows at least 4 different syntaxes for molecular mass). An ingestion program either has to deal with all lexical variants (quite a problem) or simply ingest the string. There is also potential confusion between minus, hyphen-minus, negative and ranges. Scientific units are not always easy to extract.
- Does either site have an RSS feed for new entries?
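To make the lexical-variant problem with infobox values concrete, here is a minimal Python sketch of the "deal with all the variants" option. The unit spellings it accepts, and the names `normalise_dashes` and `parse_molar_mass`, are my own assumptions for illustration; they are not the four syntaxes actually found in DBPedia, and neither site is known to work this way. Anything the parser cannot read unambiguously (a range, for instance) is returned as `None` so the raw string can be kept instead.

```python
import re
from typing import Optional

# Characters often typed where a minus sign or hyphen-minus is meant:
# U+2212 minus sign, U+2010 hyphen, U+2011 non-breaking hyphen,
# U+2012 figure dash, U+2013 en dash.
DASH_LIKE = "\u2212\u2010\u2011\u2012\u2013"
_DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_LIKE})

# Assumed (hypothetical) spellings: "g/mol", "g mol-1", "g·mol-1", "u", "Da".
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:g\s*/\s*mol|g[\s\u00b7]*mol\s*-1|u|Da)"
MASS_RE = re.compile(rf"({NUMBER})\s*({UNIT})?")


def normalise_dashes(text: str) -> str:
    """Map the dash-like characters above onto ASCII hyphen-minus."""
    return text.translate(_DASH_TABLE)


def parse_molar_mass(raw: str) -> Optional[float]:
    """Return the numeric value, or None if the string is not a single
    plain number with an optional recognised unit (e.g. it is a range)."""
    text = normalise_dashes(raw).strip()
    match = MASS_RE.fullmatch(text)
    if match is None:
        return None  # caller should fall back to storing the raw string
    return float(match.group(1))


if __name__ == "__main__":
    for sample in ["58.44 g/mol", "58.44 g\u00b7mol\u22121", "100\u2013120 g/mol"]:
        print(repr(sample), "->", parse_molar_mass(sample))
```

The design choice here is to refuse rather than guess: a string such as "100–120 g/mol" is rejected as a whole, which avoids silently reading a range as a single (or negative) value.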
Wikipedia has about 5000 compounds (the number is fuzzy because most people would not include proteins or nucleic acids, and probably not peptides). There are also many entries which describe substances with a range of constituents, such as petrol, polystyrene and many solid state compounds.
I have, in the past, downloaded WP data from the lists of organic and inorganic compounds (this totals considerably less than the ca 5000 in the two derivative sites). Is there a central page, preferably with RSS or a watch list, which lists those entries primarily considered to be chemical compounds?
Our own work on collections of common compounds using RDF is progressing well, though it has been technically harder than we thought, mainly due to variability in data input. We will use and acknowledge gratefully material from the sites above, and particularly from DBPedia (though there needs to be continued work in standardising the infoboxes to give consistent semantics). It is, however, critical that the process of copying or transclusion does not introduce errors (which I suspect is likely until there are consistent infoboxes). We shall, of course, make our results freely and Openly available, modulo the difficult issues which have been raised about data sharing and re-use.