Wikipedia has ca 7000 chemical structures

I am delighted to be corrected in my statement about the number of compounds in Wikipedia:

chemconnector.com/chemunicating/the-curation-of-almost-5000-structures-on-wikipedia.html
(ChemConnector is written, I think, by Antony Williams, Chemspiderman).
[…] There are likely legal reasons for a number of these databases to have CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online, and, I believe, a number of them containing over 10,000 numbers which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue with some of the other issues being discussed now?
PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”
The number of chemical substances in Wikipedia is actually MUCH higher than that…I know since I’ve been looking at them, in detail as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider hooked up back to Wikipedia. This was done earlier by linking up chemical names but it was far from perfect so we are doing it in this more “curated” manner. The outcome from the work, and thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChIString, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass. ca. 5000 organics manually curated is ENOUGH of a challenge. I estimate the number of chemical compounds to be about 6500-7000, and it’s growing. So, it’s about a factor of 3-4 times bigger than PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet… it’s coming.
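The curated SDF file described in the quoted passage would carry four data fields per structure. A minimal sketch of how one such record could be assembled is given here. The tag names, the `sdf_record` helper and the methane example are assumptions for illustration only and do not reflect the actual ChemSpider workflow.

```python
# Hypothetical helper: append SDF data items to an existing molfile block.
# The molblock is treated as opaque text; nothing here validates chemistry.

METHANE_MOLBLOCK = "\n".join([
    "methane",
    "  illustration",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Field names follow the list in the quoted text; the exact tags are assumed.
METHANE_FIELDS = {
    "SMILES": "C",
    "Systematic Name": "methane",
    "InChIString": "InChI=1S/CH4/h1H4",
    "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def sdf_record(molblock, fields):
    """Return one SDF record: molblock, data items, then the $$$$ delimiter."""
    lines = [molblock.rstrip("\n")]
    for tag, value in fields.items():
        lines.append(f">  <{tag}>")  # data header line
        lines.append(value)          # data value (must not contain blank lines)
        lines.append("")             # blank line terminates the data item
    lines.append("$$$$")             # record delimiter
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sdf_record(METHANE_MOLBLOCK, METHANE_FIELDS), end="")
```

Concatenating the output of `sdf_record` for each curated compound gives a multi-record SDF file whose fields could then be mapped back onto the Chemical Boxes.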

PMR: I’m delighted to know this. I should perhaps have said “entries with infoboxes for chemical compounds and substances”. My estimate was taken from the formal lists on WP such as:
http://en.wikipedia.org/wiki/List_of_inorganic_compounds
which had explicit pointers to entries.
The point of this was that I could automatically extract the infoboxes and convert them to RDF (which I have done). This RDF is then combined with other sources to show where they agree and where they disagree. The goal of this – with the full involvement of Wikipedians – is to create an RDF resource for the OREChem project (Chemistry Repositories). The resource will, of course, be Open from the start. Although the primary goal is to develop and test the ORE technology and design, we hope that it will also produce a top quality chemical resource with – perhaps – ca 10,000 compounds or substances.
We can, I believe, do this without violating either the CAS trademark or the terms of our Scifinder licence. The result will be a collection of compounds in widespread use (e.g. undergraduate work, commerce). It will have a superior informatics design to current chemical information sources and will be Open. It therefore has the technical potential to replace CAS numbers and information and I’ll post more about this later.
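A rough sketch of the infobox-to-RDF step described above, not the code actually used for this work. It assumes a simplified, flat infobox with one `| key = value` pair per line (real chemical infoboxes are nested and messier), and the `http://example.org/chem/` predicate namespace, the field names and both function names are hypothetical.

```python
import re

# Simplified, hypothetical infobox text; real Wikipedia chemboxes are nested.
SAMPLE_INFOBOX = """{{Chembox
| Name = Methane
| Formula = CH4
| SMILES = C
| Appearance =
}}"""

# One "| key = value" pair per line. Only spaces and tabs are skipped around
# the "=" so that an empty value cannot swallow the following line.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_infobox(wikitext):
    """Return the non-empty key/value pairs found in the infobox text."""
    return {key: value for key, value in FIELD.findall(wikitext) if value}


def to_ntriples(subject_uri, fields, ns="http://example.org/chem/"):
    """Serialise the fields as N-Triples with plain string literals."""
    lines = []
    for key, value in sorted(fields.items()):
        # Escape backslashes and double quotes inside the literal.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{ns}{key}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    fields = parse_infobox(SAMPLE_INFOBOX)
    print(to_ntriples("http://en.wikipedia.org/wiki/Methane", fields))
```

Triples produced this way from different sources can be compared predicate by predicate, which is one simple way to surface the consistencies and inconsistencies mentioned above.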


3 Responses to Wikipedia has ca 7000 chemical structures

  1. To assist me with my curation efforts, can you highlight any errors/edits you have made to the Wikipedia data so that I don’t have to redo the work? I know you’d commented to WP:Chem that you had already done some curation and I’d love to mesh your curations with mine if possible. The outcome will be an OPEN SDF file for all to use. If you don’t have anything as yet that’s fine too. I’ll keep working on my file until the work is over.

  2. pm286 says:

    (1) We are not making any edits – but rather creating annotations. The main thing at present is to extract the data from the XML/infoBox without syntactic corruption (there are some inconsistencies). We aren’t putting anything back yet and are more likely to create annotations which people may like to act upon.

  3. Pingback: Science in the open » Who’s got the bottle?
