Closed Data at Chemical Abstracts leads to Bad Science

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don’t know where to begin…

I was alerted by Rich Apodaca  who blogged…

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatves, or the existence of these desrivates as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as a building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR – rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed).  Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I’d do what I could to help clean up chemoinformatics and therefore take a critical role of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.

[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I’d welcome correction and ‘ll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical formula. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous – in typsetting, web caches, size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical “descriptors”
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analyse the findings and prepresentation

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole – chemoinformatics is close to pseudoscience.

This paper involved steps 1,2,4.  (1) is by far the most serious for Open Data advocates so I’ll simply say that
(2) There was  no description of how connection tables (molecular graphs) were created. These molecules apparently included inorgnaic compounds and the creation of CTs for these molecules is wildly variable or often non-attempted. This immediately means that millions of data in the sample are meaningless. The authors also describe an “algorithm” for finding frameworks which is woolly and badly reported. Such algorithms are common – and many are Open as in CDK and JUMBO. The results of the study will depend on the algorithm and the textual description is completely inadequate to recode it. Example – is B2H6 a framework? I would have no idea.

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data but this is an exception – I have no idea why not). The data have been destroyed into PDF graphs (yes – this is why PDF corrupts – if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?

But by far the most serious concern is (1). How were the data selected?

The data come – according to the authors – from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
  • CAS sells a licence to academia (Scifinder) to query their databse . This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail but I am 100% certain that I cannot download and use a signifcant part of the database for research, and publish the results. Therefore I cannot – under any circumstances attempt to replicate the work. If I attempted I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper – such as they are – depend completely on selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6, 5, and 6-5 rings being present in enormous numbers). The authors may say – if they reply – that it’s “obvious” that “substance” (with < 253 atoms) excluded these – but that is a consequence of  bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique – satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent – but loosening – stranglehold on chemical data). If they wish to collect our data and sell it back to us it’s tenable business model and I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.

This entry was posted in semanticWeb, Uncategorized, XML. Bookmark the permalink.