I was intending to blog about our collaboration with Dan Zaharevitz and colleagues in the DTP (Developmental Therapeutics Program) at the National Cancer Institute. Dan beat me to it in a CMLBlog comment (February 4th, 2008 at 5:02 pm) on “CML – what and why”, where he explains why the NCI has chosen to work with us on CML.
Dan and I first made contact ca. 5+ years ago. I think he had noticed my postings or contributions to CDK (Chemistry Development Kit) and asked what CML could do. We got into correspondence and as a result he supported Henry and me in the development of JUMBO – probably JUMBO 4.6.
It is refreshing to work with the NCI. Their agenda is ultimately simple – methods of combatting cancer. And they are very clear that the way to do this is through Openness – Open Data, Open Source, Open Standards. So it is wonderful to have a sponsor who says “we will help you to develop this code” and lets you make it Open – indeed this is virtually a requirement.
NCI is well known for pioneering the release of their data in Open form. For many years the NCI database – with about 250,000 compounds and associated biological data – was the only data that could be used for free in chemistry. This database was the logical predecessor of PubChem, which now has over 18 million compounds. (An important difference is that the NCI database relates to physical samples while many entries in PubChem do not.)
Dan’s support has been invaluable. Firstly, it has supported us in doing the work. Secondly, it gives us much moral support to continue. And thirdly, it has given us important feedback. Since CML has many uses (publishing, computation, crystallography), it’s very useful to have an organisation that wants to manage data. NCI is interested not only in chemical structure but also in associated data, including analytical data.
So it was great to sit in Dan’s splendid basement and review how he was using CML and how we jointly felt it might develop. CML details will follow on the CMLBlog.

  1. It’s a great feeling for me personally to see that people are now acknowledging 18 million structures in PubChem and to know that we at ChemSpider added about 7 million structures to that collection. When ChemSpider was first released there was a lot of commentary that ChemSpider was nothing but rewrapped PubChem data, and I was very clear about the fact that we would use them as our base dataset. I hope that people see that we have “given back” now. Now that we have single structure depositions and another wave of large depositions to come, I remain focused on developing our own dataset and ensuring that PubChem receives it in short order.

  2. pm286 says:

    (1) Indeed, and this will be very valuable. PubChem is now a magnet for Open deposition of structures. Moreover, PubChem have (IMO) the best model for deposition, which is essentially based on RDF-quad-like assertions.
    CompoundX hasConnectionTable C1 (authority=chemspider).
    CompoundX hasSynonym N1 (authority=chemspider).
    These statements are neither true nor false other than that they have been made. The linkage of a name to a connection table is an assertion made by an authority, not a fact.
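
The quad-like assertions in the comment above can be illustrated with a minimal Python sketch. This is not PubChem's actual deposition format or API; the `Assertion` type, the `claims_about` helper, the synonym `N2` and the authority `another-depositor` are all hypothetical names introduced only for illustration. The point it tries to show is that each statement carries its authority as a fourth field, so conflicting or overlapping claims can sit side by side without any of them being treated as fact.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    """A quad-like statement: subject, predicate, object, plus who asserted it."""
    subject: str
    predicate: str
    obj: str
    authority: str


# The first two assertions mirror the examples in the comment above.
# The third is a hypothetical second depositor, added only for illustration.
assertions = [
    Assertion("CompoundX", "hasConnectionTable", "C1", "chemspider"),
    Assertion("CompoundX", "hasSynonym", "N1", "chemspider"),
    Assertion("CompoundX", "hasSynonym", "N2", "another-depositor"),
]


def claims_about(subject, predicate, store):
    """Group the asserted values for (subject, predicate) by asserting authority."""
    by_authority = {}
    for a in store:
        if a.subject == subject and a.predicate == predicate:
            by_authority.setdefault(a.authority, []).append(a.obj)
    return by_authority


synonyms = claims_about("CompoundX", "hasSynonym", assertions)
```

Under these assumptions, a consumer of the data decides which authorities to trust at query time, rather than the store deciding which name is "correct" at deposition time.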

