Category Archives: XML

funding models for software, OSCAR meets OMII

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem: there isn't a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software, but chemistry doesn't. I can identify the following approaches (there may be combinations):

  • Institutional funding. That's the model that ICE: The Integrated Content Environment uses. The rationale is that the University has a pressing need for the tool and it's cost-effective to develop it in-house, as this allows important new features to be added.
  • Consortium funding. Often a natural progression from institutional funding. Thus all the major repository software (DSpace, EPrints, Fedora) and content/courseware systems (Moodle, Sakai) have a large formal member base of institutions paying subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating software. An excellent example of this is George Sheldrick's SHELX, where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions - in ca 1974 - when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados, there was a single array A(32768) in which different locations had defined meanings only in George's head. Add EQUIVALENCE and blank COMMON, and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route - an academic group either licenses the software to a commercial company, sets up a company themselves, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the Blue Obelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it's now having an important impact on chemical informatics - it encourages innovation and public control of quality. Most of the components are created on marginal costs. It's why we have taken the view that - from the start - all our software is Open. I'll deal with the pros and cons later, but note that not all OS projects are suited for crowdsourcing on day one - a reliable infrastructure needs to be created first.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research - gorillas can be friendly - who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We've been working together on Chem4Word, which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing the future of C4W with Lee Dirks, Alex Wade and Tony Hey.
  • Public targeted productisation. In this model there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this...

So what, why, who and where is OMII?

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK's software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative - ENGAGE - sponsored by JISC - whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE (info@omii.ac.uk) or visit their website (www.engage.ac.uk).

ENGAGE combines the expertise of OMII-UK and the NGS - the UK's foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. "The response from researchers has been very positive" says Chris Brown, project leader of the interview phase, "we are learning a lot about their perceptions of e-Research and the problems they have faced". Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.

The results of the interviews will be reviewed during ENGAGE's second phase. This phase will identify and publicise the 'big issues' that are hindering e-Research adoption, and the 'big wins' that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK's expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. "It's very early in the interview phase, but we're already learning that researchers want to be better informed of new developments and are keen for more training and support." says Chris Brown.

ENGAGE is a JISC-funded project that will collaborate with two other JISC projects - e-IUS and e-Uptake - to further e-Research community engagement within the UK. "To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them" says Neil Chue Hong, OMII-UK's director, "We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE."

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we've been working with OMII. We had our first f2f meeting on Thursday where Neil and two colleagues, Steve and Steve, came up from Southampton (that's where OMII is centred, although they have projects and colleagues elsewhere). We had a very useful session where OMII took ownership of the process of refactoring OSCAR and also of evangelising it. They've gone into OSCAR's architecture in depth and commented favourably on it. They are picking PeterC's brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)

And then there is the evangelism. Part of OMII-ENGAGE's remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I'll pass it onto the OMII folks.

... and now OPSIN...

Chem4Word - why semantics are necessary

I was asked to explain how Chem4Word and CML could encode ferrocene. I'll start by using Wikipedia to give a clear and accurate picture. Sorry for the cut-and-paste mess.

WP: Ferrocene is the organometallic compound with the formula Fe(C5H5)2. It is the prototypical metallocene, a type of organometallic chemical compound consisting of two cyclopentadienyl rings bound on opposite sides of a central metal atom.

Other names: dicyclopentadienyl iron
Identifiers:
CAS number: 102-54-5
PubChem: 11985121
ChEBI: 30672

Very clear and tidy. By contrast the entries in PubChem are a mess. That's NOT PubChem's fault - it's the non-semantic stuff that is sent by depositors. Again I shan't bash the depositors too hard as they have voluntarily deposited their material - it's the awful non-semantic authoring tools they use and the absence of agreed conventions.

Chem4Word aims to raise the standard. You'll note from the entries below that the formulae for some of these structures are grotesque (10 negative charges). C4W will give authors a clear indication of the molecular formulae and charges and encourage semantic validation.
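The kind of sanity check C4W could apply is trivial to sketch. The snippet below is illustrative only - `net_charge` and `looks_garbled` are hypothetical names, not C4W code, and the trailing '+n'/'-n' charge suffix is simply the convention visible in the PubChem listings that follow:

```python
import re

def net_charge(mf):
    """Extract the trailing net charge from a PubChem-style formula string.

    Assumes the convention seen in the listings below: an optional
    '+n' or '-n' suffix after the element counts (e.g. 'C10H10Fe-6').
    """
    m = re.search(r'([+-])(\d+)$', mf)
    if not m:
        return 0
    sign, magnitude = m.group(1), int(m.group(2))
    return magnitude if sign == '+' else -magnitude

def looks_garbled(mf, max_abs_charge=2):
    """Flag formulae whose net charge is implausibly large."""
    return abs(net_charge(mf)) > max_abs_charge

print(looks_garbled("C10H10Fe"))    # False - neutral ferrocene is fine
print(looks_garbled("C10H10Fe-6"))  # True  - six negative charges is absurd
```

Even a rule this crude would have caught every garbled entry shown below.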

Anyway here goes. These are all the different compound IDs associated with ferrocene. I assume that all these compounds are meant to be ferrocene but their formulae are garbled by the tools - note the absurd charges. CML prevents such garbling.


Ferrotsen; Catane; FERROCENE ...
Compound ID: 7611
Source: LeadScope (LS-357)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe


FERROCENE; Bis(.eta.-cyclopentadienyl) iron
Compound ID: 11985121
Source: NIST Chemistry WebBook (3993653726)
IUPAC: cyclopenta-1,3-diene; cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe-6


FERROCENE; Di(cyclopentadienyl)iron; Bis(cyclopentadienyl)iron ...
Compound ID: 10219726
Source: Sigma-Aldrich (F408_ALDRICH)
IUPAC: cyclopentane; iron
MW: 186.031400 g/mol | MF: C10H10Fe


FERROCENE
Compound ID: 504306
Source: NIST Chemistry WebBook (1113374621)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 186.031400 g/mol | MF: C10H10Fe


Ferrotsen; FERROCENE; Dicyclopentadienyl iron ...
Compound ID: 24196050
Source: DTP/NCI (209798)
IUPAC: cyclopenta-1,3-diene; iron
MW: 177.967880 g/mol | MF: C10H2Fe-10
Tested in BioAssays: All: 3, Active: 0; BioActivity Analysis


Ferrotsen; FERROCENE; Dicyclopentadienyl iron ...
Compound ID: 5150118
Source: DTP/NCI (44012)
IUPAC: cyclopenta-1,3-diene; iron(2+)
MW: 177.967880 g/mol | MF: C10H2Fe-8
Tested in BioAssays: All: 1, Active: 0; BioActivity Analysis
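For contrast, here is a sketch of how CML can carry the formula and charge explicitly, so there is nothing for downstream tools to garble. This is illustrative, not the canonical C4W encoding: the elements (molecule, formula, atomArray) are standard CML, but how best to represent the eta-5 iron-ring bonding is exactly the kind of semantic decision the agreed conventions need to settle.

```xml
<molecule xmlns="http://www.xml-cml.org/schema" id="ferrocene">
  <!-- the overall formula and net charge are stated explicitly -->
  <formula concise="C 10 H 10 Fe 1" formalCharge="0"/>
  <atomArray>
    <!-- Fe(2+) centre; the two cyclopentadienyl rings each carry -1 -->
    <atom id="fe1" elementType="Fe" formalCharge="2"/>
    <!-- ring atoms omitted for brevity in this sketch -->
  </atomArray>
</molecule>
```

Because the charges are explicit and machine-validatable, a tool can check that the atom charges sum to the declared formal charge before anything is deposited.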


"should theses be Open?"

Until now most theses reside in a dusty basement or on a supervisor's shelf, but we are in transition to a world where all theses are -potentially - Openly visible to anyone. Surely this is a good idea.

In principle, of course, anyone can see my thesis. It's badly written and the examiners rightly gave me a terrible time, but all the work in it was eventually published in peer-reviewed journals. In those days corrections meant ripping the thesis apart and Tippexing or rebinding. At the distance of some decades I'm now very happy if Oxford wishes to digitize it and put it on the web. Linguists can use it as a useful source of typos.

Do all academics feel that Open theses are a good idea?

Two recent anecdotes - paraphrased and anonymized:

Academic 1: "I wouldn't want people to see our theses - many of them are of terrible quality".

Academic 2: "I wouldn't want anyone to see our theses - there are so many good ideas that we don't want our competitors to see."

This makes the case strongly that Open theses will improve quality and the dissemination of science.

The library of the future - Guardian of Scholarship?

I am still working out my message for JISC on April 2nd on "The library of the future". I've had suggestions that I should re-ask this as "What are librarians for?" (Dorothea Salo) and "what can a library do?" (Chris). Thanks, and please keep the comments coming, but I am currently thinking more radically.

When I started in academia I got the impression (I don't know why) that Libraries (capital-L = formal role in organization) had a central role in guiding scholarship. That they were part of the governance of the university (and indeed some universities have Librarian positions which have the rank of senior professor - e.g. Deans of faculties). I have held onto this idea until it has become clear that it no longer holds. Libraries (and/or Librarians) no longer play this central role. That's very serious and seriously bad for academia as it has left a vacuum which few are trying to address and which is a contributor to the current problems.

I currently see very few - if any - Librarians who are major figures in current academia. Maybe there never was a golden age, but without such people the current trajectory of the Library is inexorably downward. I trace this decline to two major missed opportunities where, if we had had real guardians of scholarship, we would not be in the current mess - running scared of publishers and lawyers.

The first occasion was about 1972 (I'd be grateful for exact dates and publishers). I remember the first time I was asked to sign over copyright (either to the International Union of Crystallography or the Chemical Society (now RSC)). It looked fishy, but none of my colleagues spoke out. (OK, there was no blogosphere, but there were still ways of communicating rapidly - telephones). The community - led by the Librarians - should have (a) identified the threats and (b) mobilised the faculty. Both would have been easy. No publisher would have resisted - they were all primarily learned societies then - no PRISM. If the Universities had said "this is a bad idea, don't sign", we would never have had Maxwell, never had ownership of scholarship by for-profit organizations. Simple. But no-one saw it (at least not clearly enough to have reached a simple Lecturer).

The second occasion was the early 1990s - let's say 1993, when Mosaic trumpeted the Web. It was obvious to anyone who thought about the future that electronic publication was coming. The publishers were scared - they could see their business disappearing. Their only weapon was the transfer of copyright. The ghastly, stultifying Impact Factor had not been invented. People actually read papers to find out the worth of someone's research rather than getting machines to count ultravariable citations.

At that stage the Universities should have re-invented scholarly publishing. The Libraries and Librarians should have led the charge. I'm not saying they would have succeeded, but they should have tried. It was a time of optimism on the Web - the dotcom boom was starting. The goodwill was there, and the major universities had publishing houses. But they did nothing - and many contracted their University Presses.

There is still potential for revolution. But at every missed opportunity it's harder. All too many Librarians have to spend their time negotiating with publishers, making sure that students don't take too many photocopies, etc. If Institutional Repositories are an instrument of revolution (as they should have been) they haven't succeeded.

So, simply, the librarian of the future must be a revolutionary. They may or may not be Librarians. If Librarians are not revolutionaries they have little future.

In tomorrow's post I shall list about 10 people who I think are currently librarians of the future.

Closed Data at Chemical Abstracts leads to Bad Science

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don't know where to begin...

I was alerted by Rich Apodaca  who blogged...

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn't often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR - rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we'll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed).  Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I've managed to read it from home, so if you don't subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I'd do what I could to help clean up chemoinformatics and therefore take a critical role of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.

[If I am factually wrong on any point of Chemical Abstracts or Amer. Chemical Soc. policies etc. I'd welcome correction and I'll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical structures. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous - in typesetting, web caches, the size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
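A rank-frequency power law of this kind is easy to test informally: plot log(frequency) against log(rank) and look for a straight line. A minimal sketch, using synthetic Zipf-like counts (not the CAS data, which of course cannot be obtained):

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) vs log(rank).

    A roughly constant negative slope is the usual informal signature of
    a power-law (rich-get-richer) distribution; the slope estimates the
    exponent.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic framework counts obeying frequency ~ 1000 / rank
counts = [round(1000 / r) for r in range(1, 200)]
print(round(loglog_slope(counts), 1))  # close to -1.0
```

Serious power-law fitting needs far more care (maximum-likelihood estimators, goodness-of-fit tests), but the point is that without the underlying data even this informal check cannot be run against the paper's claim.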

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical "descriptors"
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analysing and presenting the findings

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole - chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. (1) is by far the most serious for Open Data advocates, so I'll return to it below.
(2) There was no description of how connection tables (molecular graphs) were created. The molecules apparently included inorganic compounds, for which the creation of CTs is wildly variable and often not attempted at all. This immediately means that millions of entries in the sample are meaningless. The authors also describe an "algorithm" for finding frameworks which is woolly and badly reported. Such algorithms are common - and many are Open, as in the CDK and JUMBO. The results of the study will depend on the algorithm, and the textual description is completely inadequate to recode it. Example: is B2H6 a framework? I would have no idea.
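To show how much hangs on the unstated details, here is one plausible reading of "framework" - iteratively prune terminal atoms until only ring systems and their linkers remain (a Murcko-style scaffold). This is my guess at an algorithm, not the paper's:

```python
def framework(adjacency):
    """Strip degree-1 (terminal) atoms repeatedly; return the atom ids
    that survive - the ring systems plus any chains linking them."""
    adj = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    while True:
        terminals = [a for a, nbrs in adj.items() if len(nbrs) <= 1]
        if not terminals:
            return set(adj)
        for a in terminals:
            for n in adj.pop(a):
                if n in adj:  # neighbour may already have been pruned
                    adj[n].discard(a)

# Toluene-like graph: a six-ring (atoms 0-5) plus one methyl carbon (6)
graph = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
graph[0].add(6)
graph[6] = {0}
print(sorted(framework(graph)))  # [0, 1, 2, 3, 4, 5] - the methyl is pruned
```

Whether B2H6 has a framework under this definition depends entirely on whether the bridging hydrogens are encoded as edges: if they are, a four-membered B-H-B-H ring survives; if not, everything is pruned away. That connection-table decision is exactly what the paper leaves unstated.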

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data but this is an exception - I have no idea why not). The data have been destroyed into PDF graphs (yes - this is why PDF corrupts - if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?

But by far the most serious concern is (1). How were the data selected?

The data come - according to the authors - from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable "sweat of brow" in collating it, but it's "our data".
  • CAS sells a licence to academia (SciFinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail, but I am 100% certain that I cannot download a significant part of the database, use it for research, and publish the results. Therefore I cannot - under any circumstances - attempt to replicate the work. If I attempted it I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper - such as they are - depend completely on the selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6, 5, and 6-5 rings being present in enormous numbers). The authors may say - if they reply - that it's "obvious" that "substances" (with < 253 atoms) excluded these - but that is a consequence of bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique - satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent - but loosening - stranglehold on chemical data). If they wish to collect our data and sell it back to us, that's a tenable business model, but I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.