SimBioSys Blog has replied to my post about unit testing in a long and thoughtful post. I don’t know who the individual is, but the company sells a number of chemical software packages, many of which I recognize from Peter Johnson’s research group at Leeds. I’ve copied nearly all of the post because the writer has gone to some trouble (and indeed more trouble than is normally seen, which is the point I am making). Comments at end.
And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write? Not when you start, but they get easier.
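For readers who have not seen one, here is a minimal sketch of what a unit test for a piece of chemistry software might look like. Everything here is illustrative and hypothetical (the `molecular_mass` function and its mass table are invented for the example, not taken from any vendor's code), using Python's standard `unittest` framework:

```python
import unittest

# Hypothetical function under test: computes a molecular mass from a
# composition dict like {"H": 2, "O": 1}. The mass table is illustrative.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999}

def molecular_mass(composition):
    """Return the molecular mass for an {element: count} composition."""
    return sum(ATOMIC_MASS[el] * n for el, n in composition.items())

class TestMolecularMass(unittest.TestCase):
    def test_water(self):
        # A known-good case: H2O should be close to 18.015.
        self.assertAlmostEqual(molecular_mass({"H": 2, "O": 1}), 18.015, places=3)

    def test_unknown_element_raises(self):
        # An error-handling case: an unknown element must fail loudly.
        with self.assertRaises(KeyError):
            molecular_mass({"Xx": 1})
```

Run with `python -m unittest`, such tests document the expected behaviour and catch regressions automatically whenever the code changes.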
Of course, writing unit tests for chemistry software is not chemistry research, so you do not get to write chemistry publications about it. However, it is an active topic in computer science. If you hop over to the ACM digital library and enter the search “unit test”, you get 19,314 hits, all in peer-reviewed journals. Just to show you a few example hits:
When you read further in Peter’s blog entry, you see these statements:
The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics.
Now, this is a bold statement if I have ever seen one. I am sure most commercial vendors who produce chemical software employ computer science or software engineering graduates who, during their training, were taught the standard unit testing and regression practices of the industry as part of the standard curriculum. How do I know that? Because not only do I have a BSc and an MSc in computer science myself (my PhD is in computational chemistry, so that does not fall under CS), but I also spent 3 years as a teaching assistant at ELTE Budapest teaching programming methodology courses to CS undergraduates, including these techniques.
PMR: The first statement is subjective, but it comes from 15 years in the pharma industry buying software. Admittedly I have not worked in that industry for over 10 years, but I haven’t seen much to challenge that view. I would certainly argue that chemical software and data has no public face of quality – part of the problem arises from the lack of openly published metrics.
Of course, I can only speak with authority about my own chemical software company, so let me elaborate on how we do software testing. Our system consists of several compact software modules with well-defined input and output data objects. These modules can be linked into a pipeline to perform complex tasks like docking or retrosynthetic analysis. Each module has a unit test bed, which consists of a test engine, a set of test scripts, and some input and output data files plus expected error report files. The test engine reads the test script, extracts the input data from it, executes functions of the module, and tests the responses and results returned by comparing them to the expected data from the script or data files. There are four distinct types of tests:
Func – functionality test; valid calls and parameters; checking certain scenarios to see if the module functions properly based on the script
Speed – performance test; valid calls and parameters; should be run with optimised compilation, debug turned off; measures speed
Error – testing of the exception handling; valid calls and parameters simulating extreme scenarios (e.g. a file does not exist or an incorrect file format is used) that may happen in a valid usage scenario due to wrong data being passed to the program by the user
Robust – robustness test; invalid call sequences and/or parameters, to see whether the sanity checks (asserts) are thorough and complete. These test for programming errors in the integration pipeline, e.g. NIL pointers passed for required data input or calls made to uninitialized objects.
The last two categories have associated expected error files, which list the error messages expected in the response from the module being tested. An example functional test script from the MolFragGraph module is here. As you can see, it uses a simple language: one command per line, starting with a keyword followed by optional parameters and a data block. Of course, writing such scripts is boring, so we typically write only a few of them when a new module is developed. Then we add code like this to the program:
DBGMESSNLF(DEB_SCRIPT, "SCRIPT: MarkGridHead ClientID=0 NumLines=" << numLines);
This is a macro call controlled by a debug flag (DEB_SCRIPT). If that flag is turned on at run-time, the code will output a line into the log file, identified by the “SCRIPT:” header and containing one complete line for the test script along with parameters and data. When we run an integrated software pipe, we can generate a log file containing the actual data being passed into and output from any given module, in the format required by the test bed scripts. This allows us to automatically generate test scripts for any of the modules by running an integrated software pipe on a practical input case.

If we find a bug, then when we reproduce it with a debug version of the code we can immediately generate a test script for each module involved and test them separately to identify where the root of the problem lies. Once the bug is fixed, we can generate the correct expected output for each module for the test case. This comes in very handy for generating regression tests: if later changes to the code break any of the previously fixed functionality, we notice because the corresponding test script fails.

Of course, the running of all these tests is automated in a nightly build and test script. Each module is assigned to a developer who is responsible for it. When a test script fails during the automated nightly test, the developer gets an email notification so he can fix it the next day. For quality metrics we produce similar tables each night to the VTK dashboard (I cannot show you our own for confidentiality reasons). We have been doing development with quality control at SimBioSys since the start of the company in 1996.
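The workflow described above, filtering “SCRIPT:” lines out of a debug log to auto-generate a test script, replaying it against a module, and comparing the results to stored expected output, can be sketched as follows. This is a minimal illustration in Python under my own assumptions; all names (including the toy `MarkGridHead` command) are invented for the example and are not SimBioSys’s actual API:

```python
# Hypothetical sketch of a script-driven test bed of the kind described:
# a debug log is filtered into a test script, the script is replayed
# against a module, and the output is compared to the expected output.

def extract_script(log_lines):
    """Auto-generate a test script from a debug log: keep SCRIPT: lines."""
    return [ln.split("SCRIPT: ", 1)[1] for ln in log_lines if "SCRIPT: " in ln]

def run_script(script, module):
    """Replay a script: each line is 'Command key=value ...'; collect outputs."""
    outputs = []
    for line in script:
        command, *params = line.split()
        args = dict(p.split("=", 1) for p in params)
        outputs.append(module[command](**args))
    return outputs

def regression_check(outputs, expected):
    """Return the (got, want) pairs that differ; empty list means pass."""
    return [(got, want) for got, want in zip(outputs, expected) if got != want]

# A toy "module" with one command, standing in for something like MolFragGraph.
module = {"MarkGridHead": lambda ClientID, NumLines: f"grid {ClientID}/{NumLines}"}

log = ["some unrelated log line", "SCRIPT: MarkGridHead ClientID=0 NumLines=4"]
script = extract_script(log)
result = run_script(script, module)
failures = regression_check(result, ["grid 0/4"])  # expected output from a known-good run
```

The appeal of the design is that the expected-output files are themselves generated from a run known to be correct, so regression coverage grows as a side effect of fixing bugs.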
I have also worked in a larger software company in medical imaging, where software development was carried out under an ISO 9001 certified methodology, and I have implemented the same principles (with some more automation) at SimBioSys, even though we have not applied for the certification, which is a long bureaucratic process with a significant cost.
So what is the take-home message from this post? That software unit and regression testing is a very important and serious (although boring) part of chemistry software development, and it is not limited to (nor was it invented by) open source groups like the Blue Obelisk, which is NOT the only place for software and data quality, contrary to what PMR would like you to believe.
PMR: I am prepared to believe that a company is able to reproduce its own results internally and I suspect that the quality is better than it was 10 years ago. So is Open Source.
I’m confining my remarks to “chemoinformatics” software. I exclude quantum mechanics programs (which take considerable care to publish results and test against competitors) and instrumental software (such as for crystal structure determination and NMR). Any software which comes up against reality has to make sure it’s got the right answers as far as possible. But chemoinformatics largely computes non-observables.
Reproducibility of results and robustness are not the whole story of quality. There are tens of thousands of docking and QSAR studies done each year, and many of them are published. Are they reproducible? I expect that if a different researcher in a different institution with different software ran the “same” calculation they would get different results. Many calculations predict molecular properties, a simple one being molecular mass. What algorithm and what quantity is used for molecular mass? What atomic masses are used? I would be pleasantly surprised if all chemical software companies used the same atomic masses. If they do, they don’t show it. I’ve not seen evidence of two companies collaborating to show that their software gives the same results.
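The atomic-mass point can be made concrete with a small sketch. The two tables below are illustrative (their values resemble those in different published compilations of standard atomic weights, but I am not attributing either to any particular vendor); the same molecule gets two slightly different “correct” masses depending on which table is used:

```python
# Illustrative only: two slightly different atomic-mass tables, of the kind
# found in different revisions of published atomic-weight compilations.
TABLE_A = {"C": 12.011, "H": 1.008, "Cl": 35.453}
TABLE_B = {"C": 12.0107, "H": 1.00794, "Cl": 35.4527}  # older-style values

def molecular_mass(composition, table):
    """Sum atomic masses for an {element: count} composition using a table."""
    return sum(table[el] * n for el, n in composition.items())

chloroform = {"C": 1, "H": 1, "Cl": 3}
mass_a = molecular_mass(chloroform, TABLE_A)  # about 119.378
mass_b = molecular_mass(chloroform, TABLE_B)  # about 119.3767
# Both answers are internally "correct", yet they differ, so results are not
# interchangeable unless vendors state which table and algorithm they use.
```

The discrepancy here is small, but without published tables and algorithms a user cannot even tell whether two programs should agree.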
And molecular mass is one of the simpler properties. Can you interchange “total polar surface area” from one manufacturer with another? Which manufacturers publish the source code of their algorithms? Without this, the user depends completely on trust in the manufacturer.
Many communities have annual software and data competitions. They use standard data sets and different groups have to predict observables. Examples are protein structure and crystal structures. In text-mining and information retrieval there are major competitions. They rely on standard data sets (“gold standards”) against which everyone can test their software.
But in chemical software these types of standards are rare. If companies feel strongly about quality they should be doing something publicly. Developing test cases. Collaborating on the publication of Open Standard data. Creating Gold Standards. Developing Ontologies – if we don’t agree on what quantity we are calculating then we are likely to get different answers.
So I welcome this debate. I’m quite prepared to take flak from other companies or groups that feel they have been slighted. But they have to show a public face of quality. And it’s difficult to do this without collaborating on the creation of Open Standards.
Open Data, Open Standards, Open Source – the Blue Obelisk mantra. At least you can tell where you are and how far you have to go.