Closed Data at Chemical Abstracts leads to Bad Science

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired and so has achieved high prominence (CAS display this on their splash page). There are so many unsatisfactory things I don’t know where to begin…

I was alerted by Rich Apodaca who blogged…

Cheminformatics in the Popular Press: The Long Tail of Structural Scaffolds

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):
It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR – rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed). Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I’d do what I could to help clean up chemoinformatics and therefore take a critical view of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.

[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I’d welcome correction and I’ll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical formula. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous – in typesetting, web caches, size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
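For readers who have not met a power-law fit, here is a minimal sketch of what such a claim amounts to. It assumes the rank-frequency form (frequency proportional to rank raised to a negative exponent) and uses synthetic counts, not data from the paper; the function name `fit_power_law` is mine. A straight-line fit on log-log axes is a crude estimator and is not necessarily what the authors did.

```python
import math

def fit_power_law(frequencies):
    """Least-squares fit of log(frequency) against log(rank).

    Returns (slope, intercept); for a power law the slope is the
    negative of the exponent.
    """
    # Rank 1 is the most frequent item; zero counts cannot be logged.
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic framework counts (an assumption for illustration only):
# frequency = 1e6 / rank**1.5 for the 1000 most common "frameworks".
counts = [1e6 / rank ** 1.5 for rank in range(1, 1001)]
slope, intercept = fit_power_law(counts)
print(f"fitted exponent: {-slope:.3f}")
```

With the raw counts available anyone could rerun a fit like this, try other estimators, and test whether a different selection of compounds changes the exponent.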

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

(1) selection of data sets
(2) annotation of these data sets with chemical “descriptors”
(3) [optionally] use of machine learning algorithms to analyse or predict
(4) analysis of the findings and presentation

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole – chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. (1) is by far the most serious for Open Data advocates, so I’ll come back to it last. Of the other two I’ll simply say that:

(2) There was no description of how connection tables (molecular graphs) were created. These molecules apparently included inorganic compounds and the creation of CTs for these molecules is wildly variable or often not attempted. This immediately means that millions of data in the sample are meaningless. The authors also describe an “algorithm” for finding frameworks which is woolly and badly reported. Such algorithms are common – and many are Open as in CDK and JUMBO. The results of the study will depend on the algorithm and the textual description is completely inadequate to recode it. Example – is B2H6 a framework? I would have no idea.
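To show why the answer depends on both the algorithm and the connection table, here is a minimal sketch of one common style of framework finder: repeatedly strip terminal atoms until only ring systems and the chains linking them remain. This is not the authors' algorithm (which cannot be recoded from the paper), nor the CDK or JUMBO implementation; the function name `framework` and the two encodings of B2H6 are illustrative assumptions.

```python
def framework(bonds):
    """Strip terminal atoms from an undirected molecular graph.

    `bonds` is an iterable of (atom, atom) pairs. Atoms with at most one
    remaining neighbour are removed repeatedly; what is left is the set of
    atoms in rings plus any chains linking those rings.
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    terminal = [atom for atom, nbrs in neighbours.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbours:
            continue  # already removed on an earlier pass
        for nbr in neighbours.pop(atom):
            neighbours[nbr].discard(atom)
            if len(neighbours[nbr]) <= 1:
                terminal.append(nbr)
    return set(neighbours)

# Toluene, heavy atoms only: a six-membered ring (1-6) plus a methyl carbon (7).
toluene = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]
print(framework(toluene))  # the six ring atoms survive

# B2H6 encoded with explicit bridging hydrogens (illustrative encoding).
bridged = [("B1", "Hb1"), ("Hb1", "B2"), ("B2", "Hb2"), ("Hb2", "B1"),
           ("B1", "H1"), ("B1", "H2"), ("B2", "H3"), ("B2", "H4")]
# B2H6 encoded ethane-style with a direct B-B bond (illustrative encoding).
unbridged = [("B1", "B2"), ("B1", "H1"), ("B1", "H2"), ("B1", "H3"),
             ("B2", "H4"), ("B2", "H5"), ("B2", "H6")]
print(framework(bridged))    # a four-membered B-H-B-H ring remains
print(framework(unbridged))  # nothing remains
```

The same substance yields a four-atom ring framework under one encoding and no framework at all under the other, so without the connection tables and the exact code the reported frequencies cannot be checked.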

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data but this is an exception – I have no idea why). The data have been destroyed by conversion into PDF graphs (yes – this is why PDF corrupts – if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?

But by far the most serious concern is (1). How were the data selected?

The data come – according to the authors – from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
CAS sells a licence to academia (Scifinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
There are strict conditions of use. I do not know what they are in detail but I am 100% certain that I cannot download and use a significant part of the database for research, and publish the results. Therefore I cannot – under any circumstances – attempt to replicate the work. If I attempted I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper – such as they are – depend completely on selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6, 5, and 6-5 rings being present in enormous numbers). The authors may say – if they reply – that it’s “obvious” that “substance” (with < 253 atoms) excluded these – but that is a consequence of bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

3 Responses to Closed Data at Chemical Abstracts leads to Bad Science

Gavin Baker says:

March 17, 2009 at 10:51 pm

Peter, it looks to me like this is an Author Choice paper — i.e. that the authors have paid ACS to make the paper open access (permanently).

pm286 says:

March 17, 2009 at 11:00 pm

@gavin
You’re right. The Author Choice logo is small and I missed it. Probably as they are employees of ACS/CAS they get it free. After all it’s good advertising for CAS.

Egon Willighagen says:

March 18, 2009 at 9:20 pm

Not going into how the paper was written, or why they picked 253 as the limit for the heavy atom count (why not 250? how robust is 253? I would have chosen a nice round 250. Or does that make the statistics different? Would it then not fit a nice power law?), the work leaves plenty of questions indeed.
In my opinion it fails because it does not discuss the null-hypothesis: if we were ignorant of available chemicals and would make a random set of 24M chemicals, would we find a different distribution? In other words, is the found distribution actually significantly different from random sampling (I doubt it). The overall pattern is an expected pattern (from a random point of view), the simpler the framework the more likely it is. Figure 7 shows a few exceptions, the sterane skeleton for example.
The paper is not very convincing in showing that “the extreme unevenness in the way frameworks are distributed among organic compounds is somewhat surprising.” The authors already say themselves that “the most significant finding of this work is that the distribution of frameworks over all of organic chemistry conforms almost exactly to a power law.”
It seems that they basically did not find anything interesting… but hey, they have to publish as much as possible, like anyone else… (and unlike people from Cambridge, Oxford, Harvard, etc they have to fight harder to get cited too).
