Category Archives: programming for scientists

Please send us your Vistas

I recently got an invitation to speak (anonymized as I don’t want to fall out) which included:

“I would very much appreciate a copy of your presentation in advance of the event in Windows XP format as the venue is not migrated to Vista.”

This is yet another example of technology driving scholarly communication. Increasingly we are asked “please send us your Powerpoints”.

I shall use modulated sound waves for the body of my message and I am tempted not to bring any visual material. But I shall, because I need to communicate what we are actually doing by showing it. Scientists, of course, frequently need to show images of molecules, animals, stars, clouds, etc. and this will and should be the mainstream.

But there is little need to echo the words that I speak by showing them on the screen. When speaking to an international audience it can be very useful to have text on the screen, but this is a UK event and, whatever else, I speak clearly and loudly and should be comprehensible.

What is inexcusable is how often conference organizers fail to provide any connection to the Internet, and some of these meetings are *about* the Internet and digital age. Even more inexcusable are those places which charge 100USD d^-1 for connections.

(BTW I am still fighting WordPress which loses paragraph breaks regularly…)

Semantic Chemical Computing

Several threads come together to confirm we are seeing a change in the external face of scientific computing. Not what goes on inside a program, but what can be seen from the outside. Within simple limits what goes on inside need not affect what is visible. The natural way now for a program to interface with other programs and with humans is to use a mixture of XML and RDF. XML provides a vocabulary and a simple grammar; RDF provides the logic of the data and application.

The COST D37 group has just met in Berlin (I blogged the last meeting – COST D37 Meeting in Rome). COST is about interoperability in Comp Chem and it’s proceeding by collaborative work to fit XML/CML into FORTRAN programs – at present Dalton and Vamp. We do this by exchange visits paid for by COST, so we are looking forward to having visitors in Cambridge shortly.

It coincided roughly with Toby White’s session at NeSC in Edinburgh on how to fit XML/CML into FORTRAN using his FoX library. I look forward to hearing how he got on.

And then, on Friday, we had a group meeting including outside visitors where the theme was RDF. I was very impressed by what the various members of the group had got up to – five or six mini-presentations. Molecular repositories, chemical synthesis, polymers, ontologies, natural language and term extraction. Andrew Walkingshaw showed the power of Golem which combines XPath with RDF to make a very powerful search tool. We are grateful to Talis for making their RDF engine available and when I have some hard URLs I’ll blog how this works.

The main message is that the new technologies work. Certainly well enough to support collections in the order of 100,000 objects with many triples (Andrew had ca 10 megatriples). We are also making great progress in extracting chemistry out of free text (PDF is still awful, so please let’s have Word, or even better XHTML and XML). Or LaTeX. But in any case most of the toolset is now well prototyped. More later…

CMLBlog: Sourceforge resources

[This is the first of a continuing series of posts destined for the revitalised CMLBlog.]

The major developers’ resource for CML is at SourceForge. This is the traditional page that each project has, and it has several useful features:

[Image: cml.PNG]

There has been a fairly steady set of releases, with relatively little drift in APIs.

[Image: cml1.PNG]

There is one mailing list which has a fairly low traffic – mainly requests for technical information. I hope that the blog will be a better format for general discussion.

The SVN repository is the primary resource besides the downloads. For those who are familiar (use Tortoise for Windows) the natural way is to SVNUpdate from time to time. For casual browsing the SVNRepository posts HTML pages which are very well set out and a good way to find things if you know what you want.

There are about 10-15 active developers – a very few commit large amounts, most others offer patches, bug fixes and unit tests.

Learning RDF and RDFS – help!

I’m getting myself up to speed on RDF (and RDFS) and building molecular repositories as an example. I’m using the Jena Semantic Web Framework (Open Source, Java, HP-inspired) and so far like it. But I have only done a little bit (subject-predicate-object). Jim tells me that what I have produced so far needs cleaning up. As a minimum I have to use RDF types (rdf:type).

I like learning by example – give me a few examples of RDF-XML and the corresponding Jena code and that will go a long way. But although I could do this easily for the simple stuff the Jena tutorial runs out before RDFS. And the Javadoc is enormous. I’m impressed, but I don’t know where to start. There are no obvious package or class names. Everything uses abstract language. How do I learn about ranges and domains unless I can see some working examples and create some?
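To make the question concrete, here is a library-free sketch of what I understand domains and ranges to mean. It is NOT Jena code and does not use any Jena class; the class name RdfsSketch, the ex: prefix and the ex:hasAtom property are all made up for illustration. A triple is just a three-element list, and the method applies the two standard RDFS entailments in a single pass: if a property has rdfs:domain C, the subject of every statement using that property is of rdf:type C; if it has rdfs:range C, the object is.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy triple set, not the Jena API.
class RdfsSketch {
    static final String RDF_TYPE = "rdf:type";
    static final String RDFS_DOMAIN = "rdfs:domain";
    static final String RDFS_RANGE = "rdfs:range";

    /** A triple is represented as a fixed list: subject, predicate, object. */
    static List<String> triple(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }

    /** One pass of the domain and range rules; returns only the newly inferred rdf:type triples. */
    static Set<List<String>> inferTypes(Set<List<String>> graph) {
        Set<List<String>> inferred = new LinkedHashSet<>();
        for (List<String> schema : graph) {
            String rule = schema.get(1);
            if (!rule.equals(RDFS_DOMAIN) && !rule.equals(RDFS_RANGE)) {
                continue; // not a schema statement
            }
            String property = schema.get(0);
            String cls = schema.get(2);
            for (List<String> t : graph) {
                if (!t.get(1).equals(property)) {
                    continue;
                }
                // domain types the subject, range types the object
                String typed = rule.equals(RDFS_DOMAIN) ? t.get(0) : t.get(2);
                inferred.add(triple(typed, RDF_TYPE, cls));
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        Set<List<String>> graph = new LinkedHashSet<>();
        graph.add(triple("ex:hasAtom", RDFS_DOMAIN, "ex:Molecule")); // hypothetical schema
        graph.add(triple("ex:hasAtom", RDFS_RANGE, "ex:Atom"));
        graph.add(triple("ex:benzene", "ex:hasAtom", "ex:c1"));      // hypothetical data
        for (List<String> t : inferTypes(graph)) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

A real reasoner would repeat this until nothing new appears and would handle subclasses and subproperties as well; the sketch only shows the shape of the idea.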

So rather than exposing this on an RDF-specific list (which I may do later) I’m wondering if there are any kind readers who can point to some examples of RDFS-XML and even better if they can suggest how to hack them in Jena.

TIA

Update on Open crystallography

There’s now a growing movement to publishing crystallography directly into the Open. Several threads include:

… so it was no great surprise when Jean Claude blogged:

X-Ray Crystallography Collaborator

20:41 20/12/2007, Jean-Claude Bradley, Useful Chemistry

We have another collaborator who is comfortable with working openly: Matthias Zeller from Youngstown State University.

With the fastest turnaround for any crystal structure analysis I’ve ever submitted, we now have the structure for the Ugi product UC-150D. For a nice picture of the crystals see here.

PMR: J-C also mailed us and asked how w/he could archive and disseminate the crystallography. So here’s a rough overview.

Crystallography is a microcosm of chemistry and we encounter many different challenges:

  • not all structures are Open (some not initially, some never). Managing the differential access is harder than it looks. It has to be owned by the Department or Institution. So you probably need access control, and probably an embargo system.
  • Institutional repositories are not generally oriented towards data. Some may, indeed, only accept “fulltext”. So there may be nowhere obvious to go.
  • The raw data (CIF) contains metadata, but not in a form where search engines can find it. That’s an important part of what SPECTRa does – extracts metadata and repurposes it.
  • The CIF can, but almost universally does not, contain chemical metadata. So part of JUMBO is devoted to trying to extract chemistry out of atomic positions.  Needs a fair amount of heuristic code.
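As a rough illustration of the metadata point in the list above, here is a minimal Java sketch that pulls simple tag/value items out of CIF text. It is not the SPECTRa or JUMBO code; the class and method names are made up, and it deliberately ignores loop_ tables and multi-line text fields, which a real CIF parser must handle. The tags in the sample (_cell_length_a, _chemical_formula_sum) are ordinary core CIF data names; the values are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: handles single-line "_tag value" items and nothing else.
class CifMetadataSketch {

    /** Collects "_tag value" pairs that sit on one line; tags without a value on the same line are skipped. */
    static Map<String, String> extract(String cif) {
        Map<String, String> items = new LinkedHashMap<>();
        for (String raw : cif.split("\\R")) {
            String line = raw.trim();
            if (!line.startsWith("_")) {
                continue; // data_ headers, loop_ keywords, loop rows, comments
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // e.g. a column name inside a loop_
            }
            String value = parts[1].trim();
            // strip one pair of matching single or double quotes
            if (value.length() >= 2) {
                char first = value.charAt(0);
                char last = value.charAt(value.length() - 1);
                if ((first == '\'' || first == '"') && first == last) {
                    value = value.substring(1, value.length() - 1);
                }
            }
            items.put(parts[0], value);
        }
        return items;
    }

    /** Drops a trailing standard uncertainty such as "(2)" before parsing the number. */
    static double numericValue(String value) {
        int paren = value.indexOf('(');
        return Double.parseDouble(paren < 0 ? value : value.substring(0, paren));
    }

    public static void main(String[] args) {
        String cif = "data_example\n"
                + "_cell_length_a 10.234(2)\n"
                + "_chemical_formula_sum 'C12 H10 N2'\n"
                + "loop_\n"
                + "_atom_site_label\n"
                + "C1\n";
        Map<String, String> items = extract(cif);
        System.out.println(items);
        System.out.println(numericValue(items.get("_cell_length_a")));
    }
}
```

Once the items are in a map they can be republished in whatever form a search engine or repository understands, which is the repurposing step.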

So in conjunction with eChemistry and eCrystals and in the momentum of SPECTRa we are continuing to develop software for crystallographic repositories. There are several reasons why people want such repositories:

  • as a high-quality lab companion – somewhere to put your data and get it back later.
  • as somewhere to provide knowledge for data-driven science (e.g. CrystalEye)
  • as somewhere to save your data for publication and dissemination
  • as somewhere to archive your data for posterity (e.g. an IR)

These put different stresses on the software, so Jim and I are developing context-independent tools that can be used in any. I’m hacking the JUMBO software (CrystalTool) and he is hacking CrystalEye so it becomes a true repository.

This is our relaxation over the holiday.

FoX marches on

Toby White joined us – Jim Downing, Peter Corbett and me – in the pub yesterday to unwind and explore the challenges of tomorrow’s information. Toby has been one of the pillars of supporting CML – there was no requirement to do so but he and colleagues (mainly in Earth Sciences) saw the value and used it anyway. The added challenge is FORTRAN. FORTRAN is a great language – my first encounter was ca 1970. It’s oriented towards rectangular data – of variable dimensionality. It is extremely good at scientific computing with large numbers of numbers and it understands – as much as most – how real numbers work.

But it’s not easy to interface with XML unless your data model is also rectangular. Historically molecular data was – atoms vertically, coordinates and other properties across. Bit of a problem if data are missing – hacks include magic numbers (e.g. 1.0e-bignumber, or zero-and-hope, or a row of stars (great fun when reading back in)).
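A small sketch of why the magic-number hack bites. It is in Java rather than FORTRAN, and the sentinel value -9999.0 and every name in it are made up for illustration. The first method treats a rectangular column as if every cell held data; the second works on a column where a missing value is explicitly absent, which is roughly what you get when an XML element is simply not there.

```java
import java.util.Arrays;
import java.util.OptionalDouble;
import java.util.stream.DoubleStream;

// Illustrative only: contrasts a sentinel value with explicit absence.
class MissingValueSketch {
    // Hypothetical magic number standing in for "no value recorded".
    static final double MISSING = -9999.0;

    /** Mean of a rectangular column when the reader forgets the sentinel convention. */
    static double naiveMean(double[] column) {
        return DoubleStream.of(column).average().orElse(Double.NaN);
    }

    /** Mean over only those entries that are actually present. */
    static double meanOfPresent(OptionalDouble[] column) {
        return Arrays.stream(column)
                .filter(OptionalDouble::isPresent)
                .mapToDouble(OptionalDouble::getAsDouble)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] withSentinel = {0.5, MISSING, -0.5};
        OptionalDouble[] explicit = {
            OptionalDouble.of(0.5), OptionalDouble.empty(), OptionalDouble.of(-0.5)
        };
        System.out.println("sentinel treated as data: " + naiveMean(withSentinel));
        System.out.println("missing value skipped:    " + meanOfPresent(explicit));
    }
}
```

The sentinel version is only correct if every reader of the file knows the convention and remembers to apply it; the explicit version cannot be misread in that way.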

So Toby has written FoX – a real labour of love. If you develop ANY FORTRAN code, please use FoX for the data i/o. It’s easy and it saves a huge amount of messy glueware. There’s now no technical reason why all comp.chem software shouldn’t emit XML/CML. It’s not just “another file format” – it’s a new way of thinking about information.

Anyway

From: Toby White

To: FoX@lists.uszla.me.uk

Subject: [FoX] Release of version 3.1

This is to announce the release of version 3.1 of the FoX library.
(download from <http://source.uszla.me.uk/FoX/FoX-3.1.tgz>)

This new release features
* extended portability across compilers
  (see <http://uszla.me.uk/space/software/FoX/compat/>)
* a “dummy library” capability
  (see <http://www.uszla.me.uk/FoX/DoX/Compilation.html#dummy_library>)
* extended DOM functionality, including several more Level 3 functions,
  and additional Fortran utility wrappers
  (see <http://www.uszla.me.uk/FoX/DoX/FoX_dom.html#dataExtraction>)

Merry Christmas,

Toby

PMR:  Enjoy!

Java: labelled break considered harmful

Readers of my last post may have thought that Eclipse makes refactoring easy. It does – up to a point. I had started to refactor an 800-line module with deeply nested loops – just a matter of extracting the inner loops as methods…

… NO!

When I tried this I got:

“Selection contains branch statement but corresponding branch target is not selected”

???

On closer examination I discovered that the code contained a construct like:

foo:
plunge();
for (int i = 0; i < 1; i++) {
    boggle();
    if (bar) {
        break foo;
    }
}

[Added later: PUBLIC GROVEL. Jim has pointed out that I have misunderstood the break syntax, so the code above is WRONG. At least this shows that I never use labelled break. It should read:

plunge();
foo: for (int i = 0; i < 1; i++) {
    boggle();
    if (bar) {
        break foo;
    }
}

Strikethroughs indicate my earlier misconceptions.]

What’s happening here? The code contains a labelled break. If the break foo is encountered, then control jumps out of the labelled loop. [Struck out: to the label foo. This can be almost anywhere in the module – and in this case it was often before the start of the loop.]

Jumping to arbitrary parts of a module is considered harmful (Go To Statement Considered Harmful). Sun/Java announces:

2.2.6 No More Goto Statements

Java has no goto statement. Studies illustrated that goto is (mis)used more often than not simply “because it’s there”. Eliminating goto led to a simplification of the language–there are no rules about the effects of a goto into the middle of a for statement, for example. Studies on approximately 100,000 lines of C code determined that roughly 90 percent of the goto statements were used purely to obtain the effect of breaking out of nested loops. As mentioned above, multi-level break and continue remove most of the need for goto statements.

but surely the code below is a direct replacement for goto.

while (true) {

break foo;
}

continue is useful. break out of single level (unlabelled) is useful. break out of multiple loops might just be OK if it was always downwards and always to the point immediately after a loop.

But it isn’t.

so – and I am surprised that I can’t easily find it on Google:

“labelled break considered harmful”

However as it is still extremely easy to write code that cannot be easily refactored I still hold that labelled breaks should be used only when essential.
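A minimal sketch of the refactoring problem and one way round it. The class, method and variable names are made up and have nothing to do with the module I was working on. In the first method the inner loop cannot be extracted on its own, because the break targets a label outside the selection, which is exactly what the Eclipse message complains about. In the second, the whole nested search lives in its own method and a plain return does the job of the labelled break.

```java
// Illustrative only: the same nested search written two ways.
class LabelledBreakSketch {

    /** Uses a labelled break: control leaves BOTH loops and carries on after the outer one. */
    static int[] findWithLabel(int[][] grid, int target) {
        int[] found = null;
        search:
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                if (grid[i][j] == target) {
                    found = new int[] {i, j};
                    break search; // exits the loop labelled "search"; it does not jump back to the label
                }
            }
        }
        return found;
    }

    /** Same behaviour without a label: the loops are the whole method, so return ends the search. */
    static int[] findWithReturn(int[][] grid, int target) {
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                if (grid[i][j] == target) {
                    return new int[] {i, j};
                }
            }
        }
        return null; // not found
    }

    public static void main(String[] args) {
        int[][] grid = {{1, 2, 3}, {4, 5, 6}};
        int[] a = findWithLabel(grid, 5);
        int[] b = findWithReturn(grid, 5);
        System.out.println(a[0] + "," + a[1] + " and " + b[0] + "," + b[1]);
    }
}
```

The second form keeps each method small enough that Extract Method has nothing to trip over, at the cost of one extra method per nested search.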

Refactoring large modules using Eclipse

I have recently had to consider refactoring a piece of Java which had got slightly out of hand – the module was 800 lines long and the if statements so deeply nested that they ran well off the right-hand edge of the page. I will NOT identify where it came from or criticize it – I have written much worse in my past (you can do really fun things with computed GOTOs in FORTRAN). But it was and is unmaintainable and we care about that in the Centre.

So I thought that I would sit down with Eclipse in front of the football and refactor it. Eclipse has this really neat Refactor that allows you to select a chunk of code and turn it into a method. For example:

public void add3DStereo() {
    // StereochemistryTool stereochemistryTool = new
    // StereochemistryTool(molecule);
    ConnectionTableTool ct = new ConnectionTableTool(molecule);
    List<CMLBond> cyclicBonds = ct.getCyclicBonds();
    List<CMLBond> doubleBonds = molecule.getDoubleBonds();
    for (CMLBond bond : doubleBonds) {
        if (!cyclicBonds.contains(bond)) {
            CMLBondStereo bondStereo3 = create3DBondStereo(bond);
            if (bondStereo3 != null) {
                bond.addBondStereo(bondStereo3);
            }
        }
    }
    List<CMLAtom> chiralAtoms = new StereochemistryTool(molecule).getChiralAtoms();
    for (CMLAtom chiralAtom : chiralAtoms) {
        CMLAtomParity atomParity3 = null;
        atomParity3 = calculateAtomParity(chiralAtom);
        if (atomParity3 != null) {
            chiralAtom.addAtomParity(atomParity3);
        }
    }
}

I now select the first for loop and turn it into a method; and repeat for the second and get:

public void add3DStereo() {
    // StereochemistryTool stereochemistryTool = new
    // StereochemistryTool(molecule);
    ConnectionTableTool ct = new ConnectionTableTool(molecule);
    List<CMLBond> cyclicBonds = ct.getCyclicBonds();
    List<CMLBond> doubleBonds = molecule.getDoubleBonds();
    addBondStereo(cyclicBonds, doubleBonds);
    List<CMLAtom> chiralAtoms = new StereochemistryTool(molecule).getChiralAtoms();
    addAtomParity(chiralAtoms);
}

/**
 * @param chiralAtoms
 */
private void addAtomParity(List<CMLAtom> chiralAtoms) {
    for (CMLAtom chiralAtom : chiralAtoms) {
        CMLAtomParity atomParity3 = null;
        atomParity3 = calculateAtomParity(chiralAtom);
        if (atomParity3 != null) {
            chiralAtom.addAtomParity(atomParity3);
        }
    }
}

/**
 * @param cyclicBonds
 * @param doubleBonds
 */
private void addBondStereo(List<CMLBond> cyclicBonds, List<CMLBond> doubleBonds) {
    for (CMLBond bond : doubleBonds) {
        if (!cyclicBonds.contains(bond)) {
            CMLBondStereo bondStereo3 = create3DBondStereo(bond);
            if (bondStereo3 != null) {
                bond.addBondStereo(bondStereo3);
            }
        }
    }
}

The whole thing took 30 seconds, including choosing the method names. Eclipse did all the params, documentation, return values – everything.

Try it – it will really fix up many sorts of grotty code…

Bioclipse awarded [prize] at Trophees du Libre

Ola Spjuth reports that Bioclipse – the collaborative bio/chem client based on Eclipse – has won another prize. Bioclipse awarded at Trophees du Libre

I [Ola] just arrived home from the international contest for free software, Trophees du Libre 2007, which was held in Soissons, France. Bioclipse was awarded the Special Prize of the jury, and the prize was handed over by the president of the Free Software Foundation Europe (FSFE), Georg Greve, who also was the chairman of the jury. It was a great event; great to meet other open source developers and people representing organizations and companies who actively support free software. Apparently we received the Special Prize because we were too famous already :-) .

Why is it so difficult to develop systems?

Dorothea Salo (who runs Caveat Lector blog) is concerned (Permalink) that developers and users (an ugly word) don’t understand each other:

(I posted a lengthy polemic to the DSpace-Tech mailing list in response to a gentle question about projected DSpace support for electronic theses and dissertations. I think the content is relevant to more than just the DSpace community, so I reproduce it here, with an added link or two.)

and

My sense is that DSpace development has only vaguely and loosely been guided by real-world use cases not arising from its inner circle of contributing institutions. E.g., repeated emails to the tech and dev lists concerning metadata-only deposits (the use case there generally being institutional-bibliography development), ETD management, true dark archiving, etc. etc. have not been answered by development initiatives, or often by anything but “why would you even want that?” incomprehension or “just hack it in like everybody else!” condescension.

PMR: This has been a perennial problem for many years and will continue to be so. I’m also not commenting on DSpace (although it is clearly acquiring a large code base). But my impression of the last 10-15 years (especially W3C and Grid/eScience projects) is that they rapidly become overcomplicated, overextended and fail to get people using them.

On the other hand there are the piles of spaghetti bash, C, Python and so on which adorn academic projects and cause just as much heartache. Typical “head-down” or throwaway code.

The basic fact is that most systems are complicated. And there isn’t a lot that can be done easily. It’s summed up by the well-known Conservation Of Complexity

This is a hypothesis that software complexity cannot be created or destroyed, it can only be shifted (or translated) from one place (or form) to another.

If, of course, you are familiar with the place that the complexity has shifted to it’s much easier. So if someone has spent time learning how to run spreadsheets, or workflows, or Python, and if the system has been adapted to those it may be easier. But if those systems are new then they will have serious rough edges. We found this with the Taverna workflow which works for bioscience but isn’t well suited (yet) for chemistry. We spent months on it, but those involved have reverted to using Java code for much of our glueware. We understand it, our libraries work, and since it allows very good test-driven development and project management it’s ultimately cost-effective.

We went through something like the process Dorothea mentions when we started to create a submission tool for crystallography in the SPECTRa : JISC project. We thought we could transfer the (proven) business process that Southampton had developed for the National Crystallographic Centre. And that the crystallographers would appreciate it. It would automate the management of the process from receiving the crystal to repositing the results in DSpace.

It doesn’t work like that in the real world.

The crystallographers were happy to have a reposition tool, but they didn’t want to change their process and wouldn’t thank us for providing a bright shiny new one that was “better”. They wanted to stick with their paper copies, the way they disseminated their data. So we realised, and backtracked. It cost us three months, but that’s what we have to factor into these projects. It’s a lot better than wasting a year producing something people don’t want.

Ultimately much of the database and repository technology is too complicated for what we need at the start of the process. I am involved in one project where the database requires an expert to spend six months tooling it up. I thought DSpace was the right way to go to reposit my data but it wasn’t. I (or rather Jim) put 150,000+ molecules into it but they aren’t indexed by Google and we can’t get them out en masse. Next time we’ll simply use web pages.

By contrast we find that individual scientists, if given the choice, revert to two or three simple, well-proven systems:

  • the hierarchical filesystem
  • the spreadsheet

A major reason these hide complexity is that they have no learning curve, and have literally millions of users or years’ experience. We take the filesystem for granted, but it’s actually a brilliant invention. The credit goes to Dennis Ritchie in ca. 1969. (I well remember my backing store being composed of punched tape and cards).

If you want differential access to resources, and record locking and audit trails and rollback and integrity of committal and you are building it from scratch, it will be a lot of work. And you lose sight of your users.

So we’re looking seriously at systems based on simpler technology than databases – such as RDF triple stores coupled to the filesystem and XML.

And the main rule is that both the users and the developers have to eat the same dogfood. It’s slow and not always tasty. And you can’t avoid Fred Brooks:

Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in one step. An intermediate step called the pilot plant is necessary….In most [software] projects, the first system is barely usable. It may be too slow, too big, awkward to use, or all three. There is no alternative but to start again, smarting but smarter, and build a redesigned version in which these problems are solved…. Delivering the throwaway to customers buys time, but it does so only at the cost of agony for the user, distraction for the builders while they do the redesign, and a bad reputation for the product that the best redesign will find hard to live down. Hence, plan to throw one away; you will, anyhow.

Very simply, TTT: Things Take Time.