The Blue Obelisk – Egon's diff is boring

Egon blogged the following yesterday. I have removed the geek-stuff but there’s a serious message so read on…

Finding differences between IChemObjects

CDK trunk is getting into shape, thanx to the many people who contribute to this, and special thanx to Miguel for cleaning up his code related to charge, resonance, and ionization potential calculations!
So, I started a new module called diff. If two objects are identical, it returns a zero-length String. If not, it lists the changes between the two classes, in a way much like that of the IChemObjects toString() methods.
Now, output will likely change a bit over time. But at least, I now have an easier to use approach for debugging and writing unit tests. Don’t be surprised to see test-* modules start depending on the new diff module.

PMR: This is what good software is based on. Quality and tools. What Egon has developed is a tool for measuring the quality of the CDK code. It’s not a tool which does something useful for the average user. It’s a tool to help the CDK users build high quality tools. And even those tools won’t be used by the “end-user” – they will be used in applications that the “end-users” (actually people) will use.
Does this sound boring? Yes. Because it is. Does it sound unimportant? I hope not. Has Egon done something important? Yes. Do most people realise it? No.
Modern software is built from toolkits just as computers are built from components. If those components fail, then the whole system fails. All tools sometimes fail. 100% success only exists in fairyland.
But we can make them better. And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write? Not when you start, but they get easier.
I’ve spent the weekend (and before) writing a workflow framework for JUMBO. It’s not a tool – it’s a framework into which other tools can fit. Is it boring? Excruciatingly. Can I do it in front of the test match cricket? Sometimes, especially when that is also boring.
So not many people realise what Egon has had to go through. A system which tells you whether two things are the same? Sounds trivial. It isn’t. The sorts of things you don’t think of testing for:

  • is one of the objects null?
  • is one of the objects of zero size?
  • does one of the objects contain characters with unusual Unicode code points? Can we compare characters?
  • are there floating point problems? (in FP, 0.1 + 0.2 is not exactly 0.3)
  • does the order of subobjects matter? If not can we canonicalise the objects?
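To make those checks concrete, here is a minimal Java sketch. It is not the CDK diff module and not JUMBO code: the class name, method names and the tolerance value are all made up for illustration. It only follows the convention described above, where a zero-length String means "no difference".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only: none of these names are CDK or JUMBO API.
final class DiffSketch {

    // Tolerance for comparing doubles; the value is an arbitrary assumption.
    static final double EPS = 1e-9;

    // Normalise to Unicode NFC so that equivalent code point sequences compare equal.
    private static String nfc(String s) {
        return s == null ? null : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Null-safe text comparison: returns "" when equal, else a description. */
    static String diffText(String name, String first, String second) {
        if (Objects.equals(nfc(first), nfc(second))) {
            return "";
        }
        return name + ": " + first + " -> " + second;
    }

    /** Compares doubles within EPS instead of using ==. */
    static String diffNumber(String name, double first, double second) {
        if (Math.abs(first - second) <= EPS) {
            return "";
        }
        return name + ": " + first + " -> " + second;
    }

    /** Order-independent comparison: sort copies (a crude canonical form), then compare. */
    static String diffUnordered(String name, List<String> first, List<String> second) {
        if (first == null || second == null) {
            return first == second ? "" : name + ": " + first + " -> " + second;
        }
        List<String> a = new ArrayList<>(first);
        List<String> b = new ArrayList<>(second);
        Collections.sort(a); // assumes the lists hold no null elements
        Collections.sort(b);
        return a.equals(b) ? "" : name + ": " + a + " -> " + b;
    }
}
```

A unit test can then simply assert that the returned String is empty, and print it when it is not. Sorting is only one possible canonicalisation; real molecules would need something much smarter.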

I’ve had to do this myself in CML. Are two molecules equal? Are two spectra? I’ve had to write a diff tool for every important CML class. I haven’t finished. Because it’s boring. And it bores my colleagues. And many people who pay or might provide funding.
But it’s essential for modern knowledge-driven science. The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics. I have never seen a commercial company publish results of roundtripping. That’s another really boring and apparently pointless operation where you take a file A, convert it to B and then convert it back to A’. What’s the point? Well A and A’ should be the same. With most commercial software you get loss. If you are lucky it’s only whitespace. But it’s more likely to be hydrogens or charges or whatever.
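For anyone who hasn't seen a roundtrip test, a minimal Java sketch follows. Everything in it is invented for illustration: the two "formats" are just toy strings, the converters are stand-ins, and none of it is Openbabel, JUMBO or CDK code. It converts A to B, converts back to get A’, and checks that A and A’ are equal.

```java
import java.util.function.Function;

// Hypothetical illustration only: toy formats and converters, not a real toolkit.
final class RoundTripSketch {

    // Toy "format A": formula, a space, then charge, e.g. "NH4 +1".
    // Toy "format B": the same fields separated by '|'.
    static final Function<String, String> TO_B = s -> s.replace(' ', '|');
    static final Function<String, String> FROM_B = s -> s.replace('|', ' ');

    // A lossy converter that silently drops the charge field.
    static final Function<String, String> LOSSY_TO_B = s -> s.split(" ")[0];

    /** Converts A to B and back again, returning A'. */
    static <A, B> A roundTrip(A original, Function<A, B> forward, Function<B, A> backward) {
        return backward.apply(forward.apply(original));
    }

    /** True when A' equals A, i.e. nothing was lost on the way. */
    static <A, B> boolean survives(A original, Function<A, B> forward, Function<B, A> backward) {
        return original.equals(roundTrip(original, forward, backward));
    }

    public static void main(String[] args) {
        String a = "NH4 +1";
        System.out.println(survives(a, TO_B, FROM_B));       // lossless pair
        System.out.println(survives(a, LOSSY_TO_B, FROM_B)); // charge is lost
    }
}
```

With the lossless pair the check passes; with the lossy converter A’ comes back as "NH4" and the charge has gone, which is exactly the kind of loss described above. Real comparisons would use a proper diff of the two objects rather than equals().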
But the Blue Obelisk cares about quality. Openbabel does roundtripping. JUMBO does roundtripping. CDK does roundtripping. Not necessarily for everything because it depends on volunteers. But we get there.
So the Blue Obelisk is emerging as the main area which takes quality in chemical software and chemical data seriously. More organisations are taking Open Source seriously. I met a chemical software company last week – no names – who is seriously looking at Open Source and thinking of integrating its competitors’ products. Perhaps not RSN, but they are looking at it.
And when they do they will find the Blue Obelisk is the only place for software and data quality. They’ll need it.
But at the moment there’s very little public encouragement for us. The pharma industry uses Blue Obelisk products but they don’t tell us, don’t give us feedback, don’t encourage us.
Well nothing in Open Source says the users have to contribute and we don’t expect it. But it’s nice when it happens. And when you save millions of dollars by using our products it would be nice to say “thank you”.
Because writing Blue Obelisk code is so mind-bogglingly excruciatingly boring.
I don’t know why we do it.
But on my wall I have a mantra from Alma Swan’s Open Access calligraphic calendar (“hardly any rights reserved”)

  1. First they ignore you
  2. Then they laugh at you
  3. Then they fight you
  4. Then you win

It was written by Gandhi for something even more important than Open Source – human rights. But it’s applicable in other domains. Open Access has reached #3. The Blue Obelisk is somewhere about 1.3.
But we started later…
