I am delighted to be corrected in my statement about the number of compounds in Wikipedia:
(ChemConnector is written, I think, by Antony Williams, Chemspiderman).
[…] There are likely legal reasons for a number of these databases to have CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online, and I believe a number of them contain over 10,000 numbers, which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue, along with some of the other issues being discussed now?
PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”
The number of chemical substances in Wikipedia is actually MUCH higher than that… I know since I’ve been looking at them in detail, as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider and linked back to Wikipedia. This was done earlier by linking up chemical names, but it was far from perfect, so we are doing it in this more “curated” manner. The outcome of the work, thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChIString, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass. Manually curating ca. 5000 organics is ENOUGH of a challenge. I estimate the number of chemical compounds to be about 6500-7000, and it’s growing. So, it’s about 3-4 times bigger than PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet… it’s coming.
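To make the planned file more concrete, here is a minimal sketch of how one record of such an SDF file might be assembled. It is not the ChemSpider workflow itself: the function name `sdf_record`, the exact spelling of the data tags, and the choice of methane as the example are all assumptions made for illustration. A real pipeline would presumably generate the structure and identifiers with a cheminformatics toolkit rather than by hand.

```python
# Sketch only: builds one SDF record (molblock + data fields + "$$$$").
# The tag names below mirror the four fields listed in the post, but their
# exact spelling in the real file is an assumption.

# A minimal V2000 molblock for methane (a single carbon atom, implicit H).
METHANE_MOLBLOCK = "\n".join([
    "methane",                                  # line 1: molecule name
    "  sketch",                                 # line 2: program/timestamp line
    "",                                         # line 3: comment (blank)
    "  1  0  0  0  0  0  0  0  0  0999 V2000",  # counts line: 1 atom, 0 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Identifiers for methane; in practice these would be computed, not typed in.
METHANE_FIELDS = {
    "SMILES": "C",
    "Systematic Name": "methane",
    "InChIString": "InChI=1S/CH4/h1H4",
    "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def sdf_record(molblock: str, fields: dict) -> str:
    """Return one SDF record: the molblock, each data item, then '$$$$'."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")  # data header line
        lines.append(value)           # single-line value
        lines.append("")              # a blank line ends each data item
    lines.append("$$$$")              # record terminator
    return "\n".join(lines) + "\n"


record = sdf_record(METHANE_MOLBLOCK, METHANE_FIELDS)
print(record)
```

Concatenating one such record per compound gives the full file, and the four data items are the values that could later be copied back into the chemical infoboxes.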
PMR: I’m delighted to know this. I should perhaps have said “entries with infoboxes for chemical compounds and substances”. My estimate was taken from the formal lists on WP such as:
which had explicit pointers to entries.
The point of this was that I could automatically extract the infoboxes and convert them to RDF (which I have done). This RDF is then combined with other sources to show where they agree and where they disagree. The goal of this, with the full involvement of Wikipedians, is to create an RDF resource for the OREChem project (Chemistry Repositories). The resource will, of course, be Open from the start. Although the primary goal is to develop and test the ORE technology and design, we hope that it will also produce a top quality chemical resource with perhaps ca 10,000 compounds or substances.
We can, I believe, do this without violating either the CAS trademark or the terms of our SciFinder licence. The result will be a collection of compounds in widespread use (e.g. undergraduate work, commerce). It will have a superior informatics design to current chemical information sources and will be Open. It therefore has the technical potential to replace CAS numbers and information, and I’ll post more about this later.
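As a rough illustration of the infobox-to-RDF step mentioned above, the sketch below turns an already-extracted infobox (represented as a plain dictionary) into N-Triples and then compares two sources property by property. It is not the code actually used: the `http://example.org/chem#` namespace, the function names, and the infobox keys are hypothetical, and a real conversion would use an agreed vocabulary and a proper RDF library.

```python
# Sketch only: infobox dictionary -> N-Triples, plus a simple consistency check.
from urllib.parse import quote

# Hypothetical vocabulary namespace, used purely for illustration.
EX = "http://example.org/chem#"


def escape_literal(text: str) -> str:
    """Escape backslash, double quote and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def infobox_to_ntriples(title: str, infobox: dict) -> list:
    """Return one N-Triples line per infobox field for the given page title."""
    subject = "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))
    triples = []
    for key, value in infobox.items():
        predicate = EX + quote(key.strip().replace(" ", "_"))
        triples.append(f'<{subject}> <{predicate}> "{escape_literal(value)}" .')
    return triples


def find_inconsistencies(source_a: dict, source_b: dict) -> dict:
    """Return {property: (value_a, value_b)} for shared properties that differ."""
    shared = source_a.keys() & source_b.keys()
    return {key: (source_a[key], source_b[key])
            for key in sorted(shared)
            if source_a[key] != source_b[key]}


# Example data: an infobox for acetic acid and a second, hypothetical source
# that disagrees only in how the SMILES string is written.
wikipedia = {"Formula": "C2H4O2", "SMILES": "CC(=O)O"}
other_source = {"Formula": "C2H4O2", "SMILES": "CC(O)=O"}

for line in infobox_to_ntriples("Acetic acid", wikipedia):
    print(line)
print(find_inconsistencies(wikipedia, other_source))
```

The string comparison here is deliberately naive: the two SMILES strings in the example describe the same molecule yet are flagged as different, which is the sort of apparent inconsistency a canonical identifier such as the InChI would resolve.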