I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don’t know where to begin…
I was alerted by Rich Apodaca who blogged…
A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):
It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatves, or the existence of these desrivates as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as a building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR – rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].
With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.
Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed). Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quick.
I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.
Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I’d do what I could to help clean up chemoinformatics and therefore take a critical role of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.
[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I’d welcome correction and ‘ll respond in a neutral spirit.]
So to summarize the paper:
The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical formula. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous – in typsetting, web caches, size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:
- selection of data sets
- annotating these data sets with chemical “descriptors”
- [optionally] using machine learning algorithms to analyse or predict
- analyse the findings and prepresentation
My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole – chemoinformatics is close to pseudoscience.
- The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
- CAS sells a licence to academia (Scifinder) to query their databse . This does not allow re-use of the query results. Many institutions cannot afford the price.
- There are strict conditions of use. I do not know what they are in detail but I am 100% certain that I cannot download and use a signifcant part of the database for research, and publish the results. Therefore I cannot – under any circumstances attempt to replicate the work. If I attempted I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.