Where should we get our computing?

It’s always fun to find one’s blogging picked up in places you didn’t know existed – that’s a great virtue of the trackback system. This is from insideHPC.com (“HPC news for supercomputing professionals. * Reading the HPC news, so you don’t have to.”).

[Murray-]Rust asks, “Where should we get our computing?”

A post from Murray Rust’s blog at the Unilever Cambridge Centre for Molecular Informatics asks this very question. He leans toward hosted providers:
HPC comes out of the “big science” tradition. CERN, NASA, etc. Where there are teams of engineers on hand to solve problems. Where there are managed projects with project managers. Where there are career staff to support the facilities. Where there are facilities.
…So who knows how to manage large-scale computing? The large companies. Amazon, etc. COST D37 had a talk from (I think) Rod Jones at CERN who said that in building computing systems he always benchmarked the costs against Amazon.
It’s an interesting post if you’re wanting to dip your toes in someone else’s perspective on large scale computing.

PMR: It’s actually very topical for us. Two events yesterday. I am handing over the chairmanship of the Computer Services Committee (in Chemistry) and was discussing with my replacement and one of the Computer Officers (COs) what we should be doing about servers, clusters and air-conditioned rooms. Our building was not designed ab initio – it has evolved over 60 years. New bits come and old bits go. So we are constantly having fun such as gutting rooms discovered to be full of asbestos, mercury and goodness knows what. Sometimes these are in the basement and get turned into server rooms. With lots of power to run the machines and lots more power to remove the heat generated by the first lot of power. And, for those of you who do not live on this planet, the cost of power has been increasing. So it costs money to run these rooms, and the rooms can’t be used for X-ray diffractometers or human scientists.
And every so often something goes wrong. Last week a fan belt went on the aircon. Things melted. The COs had to fix things. Opportunity cost. Money cost.
But we also had a visit from our eScience collaborators in Sydney – Peter Turner and Douglas du Boulay. They’ve been spending some weeks in Europe, including visiting NeSC, RAL and various other eSciency things and people. Should they use storage systems like SRB (NGS – SRB)? SRB is a distributed file system which allows you to spread your files across sites – in different continents, even. It became de rigueur in eScience projects in the UK to use SRB, for example in the eMinerals project. This project, which combines eScience and minerals and has ca 6-9 institutions in the UK and several overseas collaborators, was run from the Department of Earth Sciences in Cambridge. Martin Dove has done a great job of combining science and technology and we have directly imported ideas, technology and people to our Centre.
The point here is that they started off with Globus, SRB, etc. And they found they were a lot of hassle. And they didn’t always work. They worked most of the time, but when your files are 500 or 5000 kilometres away “nearly always working” isn’t good enough. Toby White joined us in the pub yesterday and expressed it clearly: “I don’t want my files 500 miles away”. His files are at Daresbury. Daresbury has suffered severely under the black hole fiasco and files may not always be instantly available. It’s a fact of the world we live in.
I believe my colleagues in chemical computing would agree. There is a psychological need to have most of the resources “physically close”. And that’s difficult to define, but it means more than a telephone to someone in another country who is answerable to a different manager and project.
What about a commercial service whose sole job is providing computing services? I think there are strong tensions here, not least over the money. How important is it for us to be able to tweak our CPUs, especially with new generations of GPUs and so on? If it’s important, we need things locally. And often we do.
But it will cost money. At present much of the money is fudged – the machines are “there anyway”. But it won’t always be like that.
So long-tail science – including chemistry – will increasingly need to choose between academic HPC facilities (maybe “free”, but maybe real money), local clusters – autonomy but probably costly – and the cloud. There won’t be a single answer but we shall certainly see the market changing.

Posted in Uncategorized | Leave a comment

Text-mining at ERBI : Nothing is 100%; please comment

I was delighted to be asked to speak at a meeting of ERBI in Cambridge yesterday evening. ERBI is (roughly) a get-together of scientists and IT people in the dynamic biotech companies from the Cambridge region (“The Health, Wealth and Growth of Biotech in the [Eastern] Region”). Here was the program:
ERBI IT Special Interest Group Meeting – ‘Text Mining – Finding Buried Treasure’

17:00-17:35 – Richard Kidd, Head of Informatics, Royal Society of Chemistry
‘Prospecting chemical and biochemical literature’

The RSC’s Project Prospect, which was the first application of semantic web technologies to primary research publishing, won the 2007 ALPSP/Charlesworth Award for Publishing Innovation. We will discuss the problems with the conventional publication process which we tried to address, the development process, and successes and failures in applying new standards. We will look at the InChI and identifying chemical entities using text mining, using existing ontologies and building new ones, and their real-life application.
17:35-18:10 – Phil Hastings, Director Business Development, Linguamatics
Finding Answers from Text for Life Sciences

An overview of the application of Natural Language Processing (NLP) to text mining for researchers and information specialists, its potential impact and benefits. The presentation will include case studies from pharma/biotech. We will also include some insight into current challenges and potential opportunities for text mining in the future.
18:10-18:45 – Julie Barnes, Chief Scientific Officer, Biowisdom Ltd
‘A new information format for a new information age’

Julie will present opportunities to generate a new format for information, enabling us to better exploit the realms of historic literature and electronic information available. Case studies pertaining to drug safety will highlight the analytical power of assertional metadata for generating new insights for the purpose of pharmaceutical R&D.
18:45-19:20 – Peter Murray-Rust, Unilever Centre for Informatics, Cambridge University
‘The Chemical Semantic Web’

The semantic web is set to change the way we think about and use information. By providing explicit descriptions of concepts we make them accessible to machines and open up the possibility of simple reasoning. Chemical Markup Language (CML) can describe substances, reactions, molecules, solid state, spectroscopy and recipes. If chemistry is published in this way we can then use machines to do most of the tedious and error-prone work such as searching, transforming between formats, and integrating into documents. The presentation will include a number of interactive demonstrations.

PMR: The evening had a very nice logical flow. Although not consciously planned, each presenter could build on what the audience (half scientists, half IT people – most from SME biotechs or IT companies) had already heard. So – as always at the start of a meeting – I didn’t know what I was going to say. Who was the audience? If they had all been managers there would be no point in giving the geek-candy. If scientists, go easy on the XML. If IT people, not too many slides filled with -ases, -ins, -ates, -osides, MAPKKKs, ERBs, etc. Richard showed a lovely slide incorporating gene names such as “sleepy” and “bashful”; I guessed it was from Drosophila (this community has a history of “amusing” names for genes). I was wrong – it was zebra fish. So the genomicists are competing for the Ministry of Silly Names. Great fun, unless you are doing text mining. It’s difficult enough parsing “He filled the apparatus” (personal pronoun or element?). A good tactic used to be to throw away the common English words in a chunk of text – you can’t do that now or the zebra fish goes down the plughole.
So I changed my theme to “Chemistry in Documents” and painted the picture of publishing completely semantic chemistry. I also made it clear that nothing in this area is 100% correct. We have to adapt to this idea. There is no “right structure” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.
So I set the audience a question. Here’s a chunk of text from a thesis we – or rather OSCAR-the-journal-eating-robot – is reading. There are no tricks – it’s exactly as is. I asked them to say how many chemical entities there were in this chunk. (Ideally we should ask where they start and end, but I just asked for a show of hands for the total count.)

To a solution of crude bisynol 85 (1.05 g, assume 2.34 mmol) in dichloromethane (50 cm3) were added 4 Å molecular sieves (1.17 g), 4-methylmorpholine N-oxide (410 mg, 3.52 mmol) and TPAP (82 mg, 0.23 mmol). The reaction mixture was stirred at ambient temperature for 1 h. The crude reaction mixture was filtered through a plug of silica, washed with diethyl ether (100 cm3) and concentrated under reduced pressure. Gradient flash column chromatography (Petroleum ether:diethyl ether, 100:0 → 95:5) afforded 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one 86 (750 mg, 32% over two steps) as a yellow oil:

PMR: The audience gave answers varying between 4 and 11. To be fair, some were not scientists and although they’d had an hour and a half of slides from the others they were not used to reading this sort of stuff.
So how many do YOU think there are? Just a number between 4 and 11, although you can add comments if you wish. This competition is not open to Peter Corbett, Colin Batchelor, their friends or colleagues.
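If you would rather let a machine have a guess, here is a deliberately naive sketch. It is not OSCAR; it just counts tokens containing a few chemistry-ish suffixes and reagent names that I picked by eye from the excerpt, and it will cheerfully over-count words like “solution” and “petroleum” (both contain “ol”) and mangle long systematic names, which rather neatly makes the point that nothing is 100%.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A deliberately naive chemical-entity spotter (illustration only, not OSCAR).
// It counts tokens containing a few chemistry-ish suffixes and reagent names.
// Expect false positives ("solution" and "petroleum" both contain "ol") and
// systematic names broken at punctuation: nothing in this game is 100%.
public class NaiveChemCounter {

    private static final Pattern CHEMMY = Pattern.compile(
            "\\b[A-Za-z0-9-]*(yl|ane|ene|yne|ol|one|oxide|ether|silica|TPAP)[A-Za-z0-9-]*\\b",
            Pattern.CASE_INSENSITIVE);

    public static int count(String text) {
        Matcher m = CHEMMY.matcher(text);
        int hits = 0;
        while (m.find()) {
            hits++;   // each candidate span counts once
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = String.join(" ", args);   // paste the excerpt as the argument
        System.out.println("Candidate chemical entities: " + count(text));
    }
}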

Posted in Uncategorized | 12 Comments

Open Serials Review

It’s out… Many thanks to Connie Foster (editor) for her patience. From Peter Suber:

17:33 04/06/2008, Gavin Baker,
The March issue of Serials Review (just out, apparently) is devoted to Open Access Revisited. The “revisited” refers to a theme issue of the journal from four years ago. The contents:

Update. Note that the contents are OA for a limited time — from the foreword:

… Elsevier has agreed to make this focus issue available in its published version for the next nine to twelve months as the sample issue of Serials Review to support the intent of the authors and the concept of Open Access. …

PMR: I’m off to read the others…

Posted in Uncategorized | Leave a comment

Ask what your repository can do for you

Chris Rusbridge has developed the idea that repositories need to be more than somewhere you put things, and suggests that they should offer help. This is great. There are lots of things he suggests that I could use. However, despite his title, I doubt it comes cheap. I’d like them all, but I will write YES where I personally would like this, to help Chris prioritize (they are all valuable but I’m only allowed 4 votes).

02:57 04/06/2008, Chris Rusbridge,
I’ve been at a meeting of research libraries here in Philadelphia these past two days; a topic that came up a bit was the sorts of services that libraries might offer individuals and research groups in managing their research collections. I was reminded about my post about internal Edinburgh proposals for an archive service, last year. Subsequent to that it struck me that there is quite a range of services that could be offered by some combination of Library and IT services; I mentioned some of these, and there seemed to be some resonance. There could well be more, but my list included:
  • a managed current storage system with “guaranteed” backup, possibly related to the unit or department rather than individual PMR: YES
  • a “bit bucket” archive for selected data files, to be kept in some sense as a record (perhaps representing some critical project phase) for extended periods, probably with mainly internal access (but possibly including access by external partners, ie “semi-internal”). Might conflate to… PMR: YES
  • a data repository, which I would see as containing all or most data in support of publication. This would need to be static (the data supports the publication and should represent it), but might need to have some kind of managed external access. This might extend to… PMR: YES
  • a full-blown digital preservation system, ie with some commitment to OAIS-type capabilities, keeping the data usable. As well as that we have the now customary (if not very full)…
  • publications repository, or perhaps this might grow to be…
  • a managed publications system providing support for joint development of papers and support for publication submission, and including retention & exposure of drafts or final versions as appropriate. PMR: YES

I really like the latter idea, which I have seen various references to. Perhaps we could persuade people to deposit if the cost of deposit was LESS than the cost of non-deposit. The negative-cost repository, I like that!


Posted in Uncategorized | Leave a comment

Quality is emerging in chemical software

An unplanned but very useful discussion on software quality has developed. In response to a remark I made that there was no tradition of quality in the chemical software industry, Zsolt Zsoldos (ZZ) has responded carefully and at length, and I’ll try to answer equally carefully. The discussion pings between the two blogs but I’ll copy major chunks.
Before I start I’ll re-emphasize what I said and make clear what I didn’t say. I believe the chemical software industry is useful, produces useful products, and that the customers value those products. What concerns me is the lack of public attention to quality.
I’m also only talking about chemistry. The debate has been widened to word processing and document formats. I’m not involved in developing these (though I work with them) and the general intention is that the work that I and colleagues do is Open. I do not have a religious aversion to closed-source products – in word processing or chemistry – but my point is that the quality of closed products may be difficult to measure.
By a tradition of quality I mean that there is a communal understanding that quality matters. Although quality is a wide term it is often difficult to discuss unless it is measured.

ZZ: Quality in chemical software – the debate continues

Peter Murray Rust has responded to my previous blog post and has raised some important points to which I have to respond, see comments section by section:

Quality in chemical software – a debate

PMR: The SimBioSys Blog has replied to my post about unit testing in a long and thoughtful post. I don’t know who the individual is but the company sells a number of chemical software packages, a lot of which I recognize from Peter Johnson’s research group at Leeds.
Let me introduce myself: I am Zsolt Zsoldos, Chief Scientific/Technical Officer at SimBioSys. As Peter MR has recognised correctly, some of the software we market has been developed in Peter Johnson’s research group at Leeds, including the Sprout de novo design software which was my PhD project and Peter Johnson was my supervisor, and he is a scientific adviser and a director on the board of SimBioSys. There are a number of publications listed here covering my post-PhD work at SimBioSys as well as various presentations I gave at conferences, just to give some background on my work.
PMR:
I’m confining my remarks to “chemoinformatics” software. I exclude quantum mechanics programs (which take considerable care to publish results and test against competitors) and instrumental software (such as for crystal structure determination and NMR). Any software which comes up against reality has to make sure it’s got the right answers as far as possible. But chemoinformatics largely computes non-observables.
Reproducibility of results and robustness is not the whole story of quality. There are tens of thousands of docking and QSAR studies done each year and many of them are published. Are they reproducible? I expect that if a different researcher in a different institution with different software ran the “same” calculation they would get different results.
I fail to see how the “tens of thousands” of docking studies are considered to compute “non-observables” when we have tens of thousands of X-ray crystal structures to compare against. How is that less of a reality to come up against than quantum mechanics? There are experimentally measured binding affinities to compare scoring results against. What better metric does QM have? There is no exact mathematical solution to the Schrödinger wave functions, so all QM software computes approximations and there is no absolute benchmark to compare against, because we cannot compute the exact solutions.

PMR: ZZ addresses this below in reporting a competition and I’ll continue there

Are the docking and QSAR study results reproducible ? With eHiTS and LASSO, the answer is definitely YES! I understand that many tools on the docking/QSAR market use stochastic (read random) methods and therefore their results are inherently unreproducible. Again, I can only speak with authority about our own software, which uses strictly deterministic and reproducible techniques. So if a different researcher in a different location runs our software on the same input they will get the same result. However, I do not see how one could run the “same calculation” using a different software. By definition, if you are using a different software (which embodies the calculation) then you are not running the same calculation. I can assure you the same is true for QM software as well, for the simple floating point error reasons I have explained in a previous blog post. So any different QM implementation will necessarily involve computation steps in different orders (as simple as summation in different order will suffice) and therefore get slightly different results.
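[ZZ’s point about summation order is easy to check for yourself; the sketch below is an illustration added here, not code from either blog. The same three numbers, summed in two different orders, give two different answers:]

// Floating-point addition is not associative: summing the same numbers in a
// different order can change the result, which is the effect ZZ describes.
public class SummationOrder {
    public static void main(String[] args) {
        double big = 1.0e16;                       // adding 1.0 to this is below double precision
        double leftToRight = (big + 1.0) - big;    // the 1.0 is absorbed, then cancelled: 0.0
        double reordered   = (big - big) + 1.0;    // cancel first, then add: 1.0
        System.out.println(leftToRight + " vs " + reordered);   // prints 0.0 vs 1.0
    }
}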

PMR: Leaving aside the stochastic aspect – which we agree on (and which makes quality assessment much harder) – my concern is not whether a given calculation is reproducible when confined to a manufacturer’s platform, but whether the results have been assessed as meaningful. Now I agree that this is not easy, but unless the manufacturers develop interoperable standards the quality of the result is only assessable by public assessment, requiring standard data sets and standard results. I gave the example of “(total) polar surface area”, which should, in principle, be computable reproducibly by all manufacturers. But only if it is defined in a manner that all agree upon. Otherwise we have as many different values as there are manufacturers. And I would contend that – unless each has a clear definition of the algorithm and the property calculated – this is a lack of quality.
As an example from another field there is a standard way for all organizations to calculate their carbon footprint – AMEE. The same should be true for polar surface area.
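To make that concrete, here is a sketch of two “polar surface area” calculators that differ only in their per-atom-type contributions. The atom types and numbers are invented for illustration; they are not Ertl’s TPSA parameters or any vendor’s values. The point is simply that, without an agreed published definition, the two results carry the same name but are not interchangeable.

import java.util.List;
import java.util.Map;

// Two hypothetical "polar surface area" parameterisations applied to the same
// molecule. All atom-type labels and contribution values are invented for
// illustration; real implementations would also differ in atom typing itself.
public class PolarSurfaceAreaSketch {

    static double psa(List<String> polarAtomTypes, Map<String, Double> contributions) {
        return polarAtomTypes.stream()
                .mapToDouble(type -> contributions.getOrDefault(type, 0.0))
                .sum();
    }

    public static void main(String[] args) {
        // A toy molecule described only by its polar atom environments.
        List<String> molecule = List.of("N.amine", "O.carbonyl", "O.hydroxyl");

        Map<String, Double> vendorA = Map.of("N.amine", 26.0, "O.carbonyl", 17.0, "O.hydroxyl", 20.0);
        Map<String, Double> vendorB = Map.of("N.amine", 24.5, "O.carbonyl", 18.5, "O.hydroxyl", 21.0);

        System.out.printf("Vendor A PSA: %.1f   Vendor B PSA: %.1f%n",
                psa(molecule, vendorA), psa(molecule, vendorB));
    }
}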

PMR:
Which manufacturers publish the source code of their algorithms? Without this the user depends completely on trust in the manufacturer.
Hmmm, very good point. Let me see, does Microsoft publish their source code? No. Then why do they have over 95% market share? They must be very trustworthy, right? Then why are they facing anti-trust trials in the US, Europe and Japan? Perhaps my example is off-topic and off-target, since PMR advocates open source over closed proprietary software and standards, like OpenOffice over MS Office and ODF over OOXML? Nope, those links prove the exact opposite, with statements like:
PMR:
The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use.

My worry about Open Office (which emits ODT) is that I don’t yet believe that it has reached a state where I could evangelize it without it falling over or being too difficult to install.

PMR: This example is out of scope, if only because we are talking about computational software. But I can and will answer the other points – briefly here, and in more detail later. We work with Word because that is the only useful source of chemical documents. If we find we can reliably convert OOXML to ODT we certainly will – we have some funding in that area. And, since we started, MS have produced a plugin which is described as emitting ODF. We shall certainly see how it behaves.

So, let’s just agree that if something is open source that does not automatically guarantee good quality, and, on the other hand, it is also possible to have good quality software that is proprietary. I definitely see and acknowledge the quality values in open source, but in my opinion the open source model requires a critical mass (in terms of the number of developers and users) to achieve the “any bug is shallow for many eyes” state of Linux. Whether the user and developer base has reached that level for chemistry software is an interesting question worthy of its own debate. Let’s continue with our current debate:

PMR: I have not claimed – and will not claim – that the Open Source movement in chemistry is of higher quality than closed source. I said there was no tradition of quality. As a result of your post I will moderate this statement slightly.

PMR:
Many communities have annual software and data competitions. They use standard data sets and different groups have to predict observables. Examples are protein structure and crystal structures. In text-mining and information retrieval there are major competitions. They rely on standard data sets (”gold standards”) against which everyone can test their software.
But in chemical software these types of standards are rare. If companies feel strongly about quality they should be doing something publicly. Developing test cases. Collaborating on the publication of Open Standard data. Creating Gold Standards. Developing Ontologies – if we don’t agree what quantity we are calculating then we are likely to get different answers.
Yes, indeed many communities have annual software competitions, including the docking community: for example, the SAMPL competition by OpenEye, which Bio-IT World has reported on, or the CASP docking competition as published by Lang et al., J Biomol Screen. 2005; 10: 649-652. As for standard benchmarking data, how about the GOLD validation set, or the more recent Astex diverse validation set specifically designed to be a high quality benchmark set for docking, published as:

    Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance.
    M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. M. Mooij, P. N. Mortenson, C. W. Murray
    J. Med. Chem., 50, 726-741, 2007.
    [DOI:10.1021/jm061277y]

For binding energy estimation we have the PDB-bind database, and for enrichment studies the DUD data set at docking.org. As for community based collaboration I have personally participated (among many others from the industry and academia) in the eChemInfo “Virtual screening and docking – comparative methodology and best practice” workshop last year at Bryn Mawr College, Philadelphia. A recent special issue of the Journal of Computer-Aided Molecular Design (Vol 22, Num 3-4 March/April 2008 131-266) has been devoted to “Recommendations for Evaluation of Computational Methods for Docking and Ligand-based Modeling”. As demonstrated by these links, it is unfair to say that standards, public data and collaboration do not exist in this area.

PMR: I agree with this, but note that many of these are very recent. So I would be prepared to say that in certain fields a tradition of quality metrics is starting to emerge. Almost all of these relate to docking into proteins and are driven, at least in part, by the tradition of competitions in protein science such as CASP, which has for many years been involved in predicting protein structure.
So I wish them well and will now exclude docking (but not QSAR) from my remarks. When there is a competition in QSAR, with open datasets, open descriptors and open algorithms (at least to the extent that in principle it is possible for a third party to implement them), then I will happily accept that quality has been addressed.


Posted in Uncategorized | 1 Comment

Quality in chemical software – a debate

The SimBioSys Blog has replied to my post about unit testing in a long and thoughtful post. I don’t know who the individual is but the company sells a number of chemical software packages, a lot of which I recognize from Peter Johnson’s research group at Leeds. I’ve copied nearly all of the post because the writer has gone to some trouble (and indeed more trouble than is normally seen, which is the point I am making). Comments at the end.

Research and software testing


And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write? Not when you start, but they get easier.
Of course, writing unit tests for chemistry software is not chemistry research and so you do not get to write chemistry publications about it. However, it is an active topic in computer science. If you hop over to the ACM digital library and enter the search “unit test”, you get 19,314 hits all in peer reviewed journals, just to show you a few example hits:
[…]

When you read further into Peter’s blog entry you see these statements:
The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics.
Now, this is a bold statement if I have ever seen one. I am sure most commercial vendors who produce chemical software employ computer science or software engineering graduates, who during their training have been taught the standard unit testing and regression practices of the industry at school as part of the standard curriculum. How do I know that? Because not only do I have a BSc and an MSc myself in computer science (my PhD is in computational chemistry so that does not fall under CS), but I also spent 3 years as a teaching assistant at ELTE Budapest teaching programming methodology courses to CS undergraduates — including these techniques.

PMR: The first statement is subjective, but it comes from 15 years in the pharma industry buying software. Admittedly I have not worked in that industry for over 10 years, but I haven’t seen much to challenge that view. I would certainly argue that chemical software and data has no public face of quality – part of the problem arises from the lack of openly published metrics.

Of course, I can only speak about my own chemical software company with authority, so let me elaborate on how we do software testing. Our system consists of several compact software modules with well-defined input and output data objects. These modules can be linked into a pipeline to perform complex tasks like docking or retrosynthetic analysis. Each of the modules has a unit test bed, which consists of a test engine, a set of test scripts, and some input/output data files and expected error report files. The test engine reads the test script, extracts the input data from the script, executes functions of the module and checks the responses and results returned, comparing them to expected data from the script or data files. There are four distinct types of tests:
Func – functionality test; valid calls and parameters; checking certain scenarios to see if the module functions properly based on the script
Speed – performance test; valid calls and parameters; should be run with optimised compilation, debug turned off; measures speed
Error – testing of the exception handling; valid calls, parameters simulating extreme scenarios (e.g. file does not exist or incorrect file format used) that may happen in valid usage scenario due to wrong data being passed to the program by the user
Robust – robustness test; invalid call sequences and/or parameters to see whether the sanity checks (asserts) are thorough and complete. These test for programming errors in the integration pipeline, e.g. NIL pointers passed for required data input or calls made to uninitialized objects.
The last two categories have associated expected-error files, where the error messages are listed that are expected to be in the response from the module being tested. An example functional test script is here, from the MolFragGraph module. As you can see, it uses a simple language: one command per line, starting with a keyword followed by optional parameters and a data block. Of course, writing such scripts is boring, so we typically write only a few of them when a new module is developed. Then we add code like this to the program:
DBGMESSNLF(DEB_SCRIPT, "SCRIPT: MarkGridHead ClientID=0 NumLines=" << … /* operand lost in the blog feed */
    << " NumLineItems=" << … /* operand lost in the blog feed */
    << " Low=" << _p_info->unit_min
    << " Dim=" << _p_info->unit_dims
    << " CellSize=" << _p_info->cell_size << "\n");
This is a macro call that is controlled by a debug flag (DEB_SCRIPT). If that flag is turned on at run-time, the code will output a line into the log file identified by the “SCRIPT:” header and containing one complete line for the test script along with parameters and data. When we run an integrated software pipe, we can generate a log file containing the actual data being passed in and out of any given module in the format required by the test bed scripts. This allows us to automatically generate test scripts for any of the modules by running an integrated software pipe on a practical input case. If we find a bug, when we reproduce it with a debug version of the code we can immediately generate a test script for each module involved and test them separately to identify the root of the problem. Once the bug is fixed, we can generate the correct expected output for each module for the test case. This comes in very handy for generating regression tests, so that if later changes to the code break any of the previously fixed functionality we notice, because the corresponding test script fails.

Of course, the running of all these tests is automated in a nightly build-and-test script. Each module is assigned to a developer who is responsible for it. When a test script fails during the automated nightly test, the developer gets an email notification so he can fix it during the next day. For quality metrics we produce similar tables each night to the VTK dashboard (I cannot show you our own for confidentiality reasons). We have been doing development with quality control at SimBioSys since the start of the company in 1996. I have also worked in a larger software company for medical imaging where software development was carried out under an ISO 9001 certified methodology, and I have implemented the same principles (with some more automation) at SimBioSys even though we have not applied for the certification — which is a long bureaucratic process with a significant cost.
So what is the take-home message from this post? That software unit and regression testing is a very important, serious — although boring — part of chemistry software development, and it is not limited to (nor was it invented by) open source groups like the Blue Obelisk, which is NOT the only place for software and data quality, contrary to what PMR would like you to believe.

PMR:  I am prepared to believe that a company is able to reproduce its own results internally and I suspect that the quality is better than it was 10 years ago. So is Open Source.
I’m confining my remarks to “chemoinformatics” software. I exclude quantum mechanics programs (which take considerable care to publish results and test against competitors) and instrumental software (such as for crystal structure determination and NMR). Any software which comes up against reality has to make sure it’s got the right answers as far as possible. But chemoinformatics largely computes non-observables.
Reproducibility of results and robustness is not the whole story of quality. There are tens of thousands of docking and QSAR studies done each year and many of them are published. Are they reproducible? I expect that if a different researcher in a different institution with different software ran the “same” calculation they would get different results. Many calculations predict molecular properties – a simple one being molecular mass. What algorithm and what quantity is used for molecular mass? What atomic masses are used? I would be pleasantly surprised if all chemical software companies use the same atomic masses. If they do, they don’t show it. I’ve not seen evidence of two companies collaborating to show that their software gives the same results.
And molecular mass is one of the simpler properties. Can you interchange “total polar surface area” from one manufacturer with another? Which manufacturers publish the source code of their algorithms? Without this the user depends completely on trust in the manufacturer.
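The molecular mass point fits in a few lines. The same formula gives two perfectly defensible “molecular masses” depending on whether average atomic weights or monoisotopic masses are used; the numbers below are rounded, illustrative values, but the gap between the two answers is real.

import java.util.Map;

// Same formula, two answers: average atomic weights versus monoisotopic masses.
// The values are rounded for illustration; the point is that "molecular mass"
// has to be defined (which quantity, which mass table) before results from
// different programs can be compared.
public class MolecularMassSketch {

    static double mass(Map<String, Integer> formula, Map<String, Double> massTable) {
        return formula.entrySet().stream()
                .mapToDouble(e -> massTable.get(e.getKey()) * e.getValue())
                .sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> ethanol = Map.of("C", 2, "H", 6, "O", 1);

        Map<String, Double> averageWeights = Map.of("C", 12.011, "H", 1.008,   "O", 15.999);
        Map<String, Double> monoisotopic   = Map.of("C", 12.000, "H", 1.00783, "O", 15.99491);

        System.out.printf("average: %.4f   monoisotopic: %.4f%n",
                mass(ethanol, averageWeights), mass(ethanol, monoisotopic));
    }
}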
Many communities have annual software and data competitions. They use standard data sets and different groups have to predict observables. Examples are protein structure and crystal structures. In text-mining and information retrieval there are major competitions. They rely on standard data sets (“gold standards”) against which everyone can test their software.
But in chemical software these types of standards are rare. If companies feel strongly about quality they should be doing something publicly. Developing test cases. Collaborating on the publication of Open Standard data. Creating Gold Standards. Developing Ontologies – if we don’t agree what quantity we are calculating then we are likely to get different answers.
So I welcome this debate. I’m quite prepared to take flak from other companies or groups that feel they have been slighted. But they have to show a public face of quality. And it’s difficult to do this without collaborating on the creation of Open Standards.
Open Data, Open Standards, Open Source – the Blue Obelisk mantra. At least you can tell where you are and how far you have to go.

Posted in Uncategorized | 3 Comments

What is the value of a paper? a citation?

At the RSC meeting I asked “what is the value of a paper in journal X?”, where the metasyntactic variable X represents a prestigious organ. Not the cost. This is hard to determine, as publishers, despite their name, do not often publish costs. Nor the price. Not surprisingly, this may be difficult to relate to the cost. The value: the amount by which an individual or an organization estimates that they would benefit. “I am Y USD better off because I have n papers in X”. Career progression involves money. Funding involves money. Universities need money to run. This money is increasingly steered by which journals are used for publication.
I was reading a paper pointed to by Peter Suber’s blog: Michael Taylor, Pandelis Perakakis, and Varvara Trachana, The siege of science. It’s “Open Access” so I’m going to quote from it. The abstract (skip it if you’ve read PeterS’s blog) sets the scene:

Science is in a state of siege. The traditional stage for scientific ideas through peer-reviewed academic journals has been hijacked by an overpriced journal monopoly. After a wave of mergers and take-overs, big business publishing houses now exercise economic control over access to knowledge and free scientific discourse. Their ‘all is number’ rationale, made possible and perpetuated by single-parameter bibliometric indices like the Impact Factor and the h-index has led to a measurement of scientists, science and science communication with quality being reduced to quantity and with careers hanging in the balance of column totals. Other multi-parameter indices like the subscription-based Index Copernicus have not helped to resolve the situation. The patented and undisclosed black box algorithm of the Index Copernicus has just replaced one yardstick by another even less accessible one. Moreover, the academic as author, editor and/or reviewer, under intense competitive pressure, is forced to play the publishing game where such numbers rule, leading to frequent abuses of power. However, there are also deep paradoxes at the heart of this siege. Electronic software for producing camera-ready-copy, LaTeX style files, the internet and technology mean that it has never been easier or cheaper to publish than it is today. Despite this, top journals are charging exorbitant prices for authors to publish and for readers to access their articles. Academic libraries are feeling the pinch the most and are being forced to cut journal subscriptions. Not surprisingly, scholars in droves are declaring their independence from commercial publishers and are moving to open access journals or are self-archiving their articles in public domain pre-print servers. That this movement is starting to hurt the big publishing houses is evidenced by their use of counter-tactics such as proprietary pre-print servers and pure propaganda in their attempts to guard against profit loss. Whether or not bibliometry will be an artefact in the future depends on the outcome of this battle. Here, we review the current status of this siege, how it arose and how it is likely to evolve.

PMR: It’s not a rant, as there is a lot of supporting material, though it’s arguing the case for OA. It’s got useful coverage of some of the historic battles between publishers and individuals or organisations, and those should be required reading for those who want to criticize publishers – know your enemy. But among the Zipf’s law analysis and other supporting data I found:

The dollar value of a citation In today’s very highly competitive academic environment, it is clear that not only do you have to publish in journals with high JIFs to get funding, but your articles also need to be highly cited to have an impact. Citations then, clearly translate into cash. In 1986, before the big wave of self-archiving and the adoption of the green road to OA, it was shown that the marginal dollar value of a citation to articles printed by commercial journals in the USA was estimated to be between US$ 50 and 1300 depending on the field (Diamond 1986). Scaling this up by 170% to account for inflation in the period 1986 to 2005, then the dollar value of a citation has risen to approximately US$ 90 to 2200. Although these figures may be surprising in themselves, their cumulative effect when taken in the context of world-wide or national research is astounding. For example, the UK research councils spend £ 3.5 billion annually, which results in an average of 130 000 journal articles being published per year (ISI figures) with an average citation rate of 5.6. This corresponds to 761 000 citations. Self-archiving increases citation impact by 50 to 250% but, so far, only 15% of researchers are spontaneously self-archiving their articles (Harnad et al. 2004). Taking the lower estimate, we find that the loss of potential research impact due to the 85% of authors not self-archiving in the UK is 50 × 80% × £ 3.5 billion = £ 1.5 billion. Hence, it is possible to argue that the green road to OA is a source of wealth creation.

PMR: I think this means value to the journals. IOW if an article is cited then the journal benefits by 90 USD to 2200 USD. Please tell me if I have got this wrong. But that’s just the value to the journal, large though it may be. The value to the institution may be even larger. There is a gearing between publication and research income which could be as much as 50. If the Wellcome Trust values publications at 1-2% of a research grant then the cost of the research is – very crudely – 50 times more than the cost of the publication. Now it’s difficult to know what the absolute value of the research is – it might be more or less than the cost – but for an institution that’s real money, however valuable (or not) the research turns out to be.
So I took a snap poll of 2-3 colleagues about what the value of a research paper was. A well-known metasyntactic journal asked me if I would publish this figure (“PMR thinks a paper in X is worth Y”). Given the enormous amounts of money riding on it, I think I’d better not, and leave it to an economist.
Unless, of course, metasyntactic journal X wishes to fund me to do a – completely objective – study.

Posted in Uncategorized | 1 Comment

Peter Suber puts us through the Mill

In his latest monthly newsletter Peter Suber deviates from his normal summarising and instead shows how the principles of John Stuart Mill apply to Open Access. This is a must-read. As most of you know, PeterS is a philosopher, and here he puts this to great use. If anyone asks “what is the use of philosophy?” here is an excellent example. It’s practical – it’s utilitarian. Some snippets:


The thesis in a nutshell is that OA facilitates the testing and validation of knowledge claims.  OA enhances the process by which science is self-correcting.  OA improves the reliability of inquiry.
Science is fallible, but clearly that’s not what makes it special.  Science is special because it’s self-correcting.  It isn’t self-correcting because individual scientists acknowledge their mistakes, accept correction, and change their minds.  Sometimes they do and sometimes they don’t.  Science is self-correcting because scientists eventually correct the errors of other scientists, and find the evidence to persuade their colleagues to accept the correction, even if the new professional consensus takes more than a generation.  In fact, it’s precisely because individuals find it difficult to correct themselves, or precisely because they benefit from the perspectives of others, that we should employ means of correction that harness public scrutiny and open access.


Mill, _On Liberty_ (Hackett Pub. Co., 1978) at p. 19:

[T]he source of everything respectable in man either as an intellectual or as a moral being…[is] that his errors are corrigible.  He is capable of rectifying his mistakes by discussion and experience….The whole strength and value, then, of human judgment depending on the one property, that it can be set right when it is wrong, reliance can be placed on it only when the means of setting it right are kept constantly at hand.

Mill at p. 20:

The beliefs which we have most warrant for, have no safeguard to rest on, but a standing invitation to the whole world to prove them unfounded. If the challenge is not accepted, or is accepted and the attempt fails, we are far enough from certainty still; but we have done the best that the existing state of human reason admits of; we have neglected nothing that could give the truth a chance of reaching us: if the lists are kept open, we may hope that if there be a better truth, it will be found when the human mind is capable of receiving it; and in the meantime we may rely on having attained such approach to truth, as is possible in our own day. This is the amount of certainty attainable by a fallible being, and this the sole way of attaining it.

PeterS: Here’s a quick paraphrase:  To err is human, but we can always correct our errors.  We needn’t distrust human judgment just because it errs.  But to trust human judgment, we must keep the means for correcting it “constantly at hand”.  The most important means of correction is “a standing invitation to the whole world” to find defects in our theories.  The only kind of certainty possible for human judgment is to face and survive that kind of public scrutiny.

PMR: There is no doubt that Open Access (and Open Data) increase the number of eyeballs and hence the chance of error-correction. To argue against that, the opponents of OA have to show that there is a greater good in being closed.

This is difficult to argue, though I imagine cases can be made. I think it would form a good basis for philosophy and economics classes.

But coherent arguments against Open Access simply are not being made. We’d welcome them.

But they will have to be of the same quality as Peter Suber’s writing to convince me. And that’s hard.

I hope to write more later on the utilitarian aspects of Open Data.

Posted in Uncategorized | Leave a comment

The Blue Obelisk – Egon's diff is boring

Egon blogged the following yesterday. I have removed the geek-stuff but there’s a serious message so read on…

Finding differences between IChemObjects

CDK trunk is getting into shape, thanx to the many people who contribute to this, and special thanx to Miguel for cleaning up his code related to charge, resonance, and ionization potential calculations!
[…]
So, I started a new module called diff. If two objects are identical, it returns a zero-length String. If not, it lists the changes between the two classes, in a way much like that of the IChemObjects toString() methods.
[…]
Now, the output will likely change a bit over time. But at least I now have an easier-to-use approach for debugging and writing unit tests. Don’t be surprised to see test-* modules start depending on the new diff module.

PMR: This is what good software is based on: quality and tools. What Egon has developed is a tool for measuring the quality of the CDK code. It’s not a tool which does something useful for the average user. It’s a tool to help the CDK users build high-quality tools. And even those tools won’t be used by the “end-user” – they will be used in applications that the “end-users” (actual people) will use.
Does this sound boring? Yes. Because it is. Does it sound unimportant? I hope not. Has Egon done something important? Yes. Do most people realise it? No.
Modern software is built from toolkits just as computers are built from components. If those components fail, then the whole system fails. All tools sometimes fail. 100% success only exists in fairyland.
But we can make them better. And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write? Not when you start, but they get easier.
I’ve spent the weekend (and before) writing a workflow framework for JUMBO. It’s not a tool – it’s a framework into which other tools can fit. Is it boring? Excruciatingly. Can I do it in front of the test match cricket? Sometimes, especially when that is also boring.
So not many people realise what Egon has had to go through. A system which tells you whether two things are the same? Sounds trivial. It isn’t. These are the sorts of things you don’t think of testing for (a small sketch follows the list):

  • is one of the objects null?
  • is one of the objects of zero size?
  • does one of the objects contain characters with unusual Unicode code points? Can we compare characters?
  • are there floating point problems (in FP, 10.0/5.0 may not be 2.0)?
  • does the order of subobjects matter? If not can we canonicalise the objects?
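Here is a minimal sketch of that kind of diff for the simplest possible case, plain arrays of doubles rather than IChemObjects. It is my own illustration, not Egon’s CDK module, and it only shows the null, size and floating-point checks; characters and canonical ordering are left out.

// Sketch of an object "diff" in the spirit Egon describes: an empty string means
// "effectively the same", anything else says what differs. Illustration only,
// not the CDK diff module.
public final class ToyDiff {

    private static final double EPS = 1e-9;   // tolerance so harmless FP drift is not reported

    public static String diff(double[] a, double[] b) {
        if (a == null || b == null) {
            return (a == b) ? "" : "one object is null";
        }
        if (a.length != b.length) {
            return "sizes differ: " + a.length + " vs " + b.length;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > EPS) {
                out.append("element ").append(i).append(": ").append(a[i]).append(" != ").append(b[i]).append("\n");
            }
        }
        return out.toString();   // zero-length: no differences found
    }

    public static void main(String[] args) {
        String d = diff(new double[] {1.0, 2.0}, new double[] {1.0, 2.0 + 1e-12});
        System.out.println(d.isEmpty() ? "identical (within tolerance)" : d);
    }
}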

I’ve had to do this myself in CML. Are two molecules equal? Are two spectra? I’ve had to write a diff tool for every important CML class. I haven’t finished. Because it’s boring. And it bores my colleagues. And many people who pay or might provide funding.
But it’s essential for modern knowledge-driven science. The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics. I have never seen a commercial company publish results of roundtripping. That’s another really boring and apparently pointless operation where you take a file A, convert it to B and then convert it back to A’. What’s the point? Well A and A’ should be the same. With most commercial software you get loss. If you are lucky it’s only whitespace. But it’s more likely to be hydrogens or charges or whatever.
But the Blue Obelisk cares about quality. Openbabel does roundtripping. JUMBO does roundtripping. CDK does roundtripping. Not necessarily for everything because it depends on volunteers. But we get there.
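A roundtrip check itself is only a few lines once the converters exist. In this sketch convertToB and convertBackToA are hypothetical stand-ins for whatever pair of format converters is being exercised; they are not OpenBabel, JUMBO or CDK calls.

// Roundtrip test: A -> B -> A', then compare A with A'. The two converters below
// are trivial placeholders so the sketch runs; in practice they would be the
// real format readers and writers under test.
public class RoundTripSketch {

    static String convertToB(String a)     { return "B:" + a; }         // stand-in converter
    static String convertBackToA(String b) { return b.substring(2); }   // stand-in inverse

    public static void main(String[] args) {
        String a = "C2H6O charge=0";
        String aPrime = convertBackToA(convertToB(a));
        if (!a.equals(aPrime)) {
            throw new AssertionError("roundtrip lost information: " + a + " -> " + aPrime);
        }
        System.out.println("roundtrip OK");
    }
}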
So the Blue Obelisk is emerging as the main area which takes quality in chemical software and chemical data seriously. More organisations are taking Open Source seriously. I met a chemical software company last week – no names – who is seriously looking at Open Source and thinking of integrating its competitors’ products. Perhaps not RSN, but they are looking at it.
And when they do they will find the Blue Obelisk is the only place for software and data quality. They’ll need it.
But at the moment there’s very little public encouragement for us. The pharma industry uses Blue Obelisk products but they don’t tell us, don’t give us feedback, don’t encourage us.
Well nothing in Open Source says the users have to contribute and we don’t expect it. But it’s nice when it happens. And when you save millions of dollars by using our products it would be nice to say “thank you”.
Because writing Blue Obelisk code is so mind-bogglingly excruciatingly boring.
I don’t know why we do it.
But on my wall I have a mantra from Alma Swan’s Open Access calligraphic calendar (“hardly any rights reserved”)

  1. First they ignore you
  2. Then they laugh at you
  3. Then they fight you
  4. Then you win

It was written by Gandhi for something even more important than Open Source – human rights. But it’s applicable in other domains. Open Access has reached #3. The Blue Obelisk is somewhere about 1.3.
But we started later…

Posted in Uncategorized | 2 Comments

OASPA – it's about giving up power

Some very welcome news from Peter Suber’s blog. The committed Open Access publishers have got their act together and are systematising their practices, their terminology, their community. Read Peter’s summary – as I shall omit much of it.

19:32 01/06/2008, Peter Suber,
The incipient Open Access Scholarly Publishers Association (OASPA) has released a draft of its bylaws. Excerpt:

…Section 1.02. Purpose & Mission. The mission of the Open Access Scholarly Publishers Association is to support and represent the interests of Open Access (OA) publishers globally. This mission will be accomplished through six main areas of activity:
[…]

  • Set Standards – Promote a uniform definition of OA publishing, set of best practices for maintaining and disseminating OA scholarly communications and a set of ethical standards.
  • […]

The founding members include: BioMed Central, Copernicus, Co-Action Publishing, Hindawi, Medical Education Online, Journal of Medical Internet Research, PLoS, {Others?}….
To be considered an OA scholarly publisher and eligible for full membership, the journals published by the Publisher must:
[… comply with various statements …]
Appendix II. Statement on Open Access
Full members of the OASPA shall adhere to a common interpretation of Open Access scholarly publishing inspired by the Budapest, Bethesda and Berlin Declarations on Open Access. This interpretation includes the following components:
[….]

b. Requirement that copyright holders allow users to “copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship….” [PMR’s emphasis]

PeterS Comments

  • [… PeterS is (naturally) very supportive of almost everything the organization proposes, offering only a few quibbles and clarifications. Here’s his last comment…]
  • Here’s my [PeterS] one question on an important issue. Will OASPA welcome members who differ in their access policies? Subpoint (b) in the Statement on OA effectively requires members to use CC-BY licenses or the equivalent, permitting all uses (or all “responsible” uses) that carry an attribution. While I support the CC-BY license as the best choice for an OA journal, I wonder whether OASPA will refuse to admit members who want to block commercial reuse, for example, or who want to remove price barriers without removing permission barriers.

PMR: I’d like to congratulate the OASPA because what they have done is clear and consistent, easy to interpret, and straightforward to implement. It makes it clear that nothing less than CC-BY (or equivalent) is acceptable to the organization – as exemplified by the emphasized passage. So far (at least as far as I know) all the publishers have been clear that what they offer is CC-BY and that they are enthusiastic about it. They take the view that it’s simple: CC-BY or non-CC-BY. There’s no suggestion that they are unhappy about using CC-BY; they have developed a philosophy and business models that they believe are right and work.
PeterS suggests that there is a place for OA publishers who have a not-quite-CC-BY philosophy. We’ve been over this on this blog and elsewhere endlessly and I don’t want to resurrect too much. My own position is clear: for the sort of science I want to do there are only two types of access, CC-BY (BBB) and non-CC-BY. Free access is no use to me. Now I may be an oddity, wanting my machines to read the literature rather than me, so I don’t shout too loud. But I reiterate that for data-rich science with machine extraction the decision is between CC-BY and the rest. Lesser forms of OA are of no value.
Why would a publisher argue for less than CC-BY? I can see that a closed-access publisher might reasonably say: no fee, no see. That’s logical. But why restrict the use of a document that an author has paid to make available? I can only think of the following reasons (remember that the author does not take these decisions – they are set by the publisher):

  • it allows the publisher to charge differential rates (i.e. increase the fee even further for full CC-BY access)
  • the publisher feels that any open access will damage their business model and want to give away as little as possible
  • the publisher wishes to retain control

I think the last is the main reason. The publishers have got used to pushing authors, readers, reviewers and librarians around. Many of them don’t care about some of these constituencies. Making material available is a sign that they are giving up power. So they limit it as much as possible.
Whereas the OASPA have given up power completely. Wisely, and we thank them. They would dilute their message if members were less than 100% BBB-compliant.

Posted in Uncategorized | 1 Comment