petermr's blog

FortranFormat, iChemLabs and Quixote (and a Test)

Posted on October 29, 2010 by pm286

#jiscxyz

In the Quixote project (http://quixote.wikispot.org/Front_Page) we are systematically developing Java (and other) OpenSource tools for managing the input, output, transformation, storage, searching and publication of computational chemistry. We intend to parse complete logfiles (not just the juicy bits) and so come across a lot of really grotty FORTRAN output. If you are thinking of writing a tool to parse FORTRAN output:

Stop
Don’t

Because there are so many beartraps that you will fall into unless you have actively worked with FORTRAN.

Here’s a simple question:

How many INTEGERs have been output in this 7-character string output by a FORTRAN program?

123**75

There is an answer. (No, Unicode is not involved – these are honest to goodness EBCDIC characters punched into a standard Hollerith card…) But the answer is not trivial.
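One reason the question is not trivial: Fortran writes integers into fixed-width fields, and fills a field with asterisks when the value does not fit. The field widths live in the FORMAT statement, not in the output. Here is a minimal sketch of cutting a record into fields. The class name `FixedWidthFields` and both sets of widths are made up for illustration; they are not the answer to the question above, just two different readings of the same seven characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits one fixed-width Fortran output record into its fields (illustrative sketch). */
class FixedWidthFields {

    /**
     * Cuts the record into consecutive fields of the given widths.
     * The widths have to come from the FORMAT statement (or from knowing
     * the program); they cannot be recovered from the record alone.
     */
    static List<String> split(String record, int... widths) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int width : widths) {
            if (pos + width > record.length()) {
                throw new IllegalArgumentException("record shorter than format");
            }
            fields.add(record.substring(pos, pos + width));
            pos += width;
        }
        return fields;
    }

    /** A field made up entirely of '*' means the value did not fit its width. */
    static boolean isOverflow(String field) {
        return !field.isEmpty() && field.chars().allMatch(c -> c == '*');
    }

    public static void main(String[] args) {
        // Two hypothetical formats applied to the same seven characters.
        System.out.println(split("123**75", 3, 2, 2));       // [123, **, 75]
        System.out.println(split("123**75", 1, 1, 1, 2, 2)); // [1, 2, 3, **, 75]
    }
}
```

The same string yields three fields under one assumed format and five under another, which is why a parser that splits on whitespace or guesses at digits will go wrong.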

So I am starting with GamessUS punch files and running a tutorial on how to write a parser. Among other things we have to parse chunks of Fortran output. So before writing my own I went to StackOverflow and asked:

http://stackoverflow.com/questions/4051559/parsing-fortran-output-files-using-java/4051615#4051615

Within about 7 minutes I had an answer. I accepted this as the best answer. That gives me two points. And I have since got 2 votes! So I get 2*5+2 =12 points. Not bad.

So public thanks to iChemLabs and Kevin J. Theisen. It’s great to see a chemical software company producing Open Source. It means one less set of wasted days duplicating what other people have done. How many other companies have Open Source that would save everyone labour and allow us to do innovative work? Because writing a FORTRAN parser isn’t fun and isn’t innovative.

FWIW it looks to have been done properly – with a BNF and JavaCC. That means that the code is written by a Compiler Compiler. All that is required is the FORTRAN language spec and, of course, code to parse the different elements.

I’m planning to mavenize this and put it in a repository. I’m allowed to do that as it’s Open Source. All I have to do is say thank you and keep the licence attached to the code distro.

Posted in Uncategorized | Leave a comment

Span, Span, Span, Span, Span, loverly Span.

Posted on October 29, 2010 by pm286

Here is some code. Why won’t it compile?

private
void transformAtomArraysIntoFrequencies() {

Nodes atomArrayNodes = xmlInput.query(

“./*[local-name()=’atomArray’ and @dictRef=’gamessuk:normal_coordinates’]”);

}

UPDATE

Gillean (see comments) has cracked it.

I cut-and-pasted it out of Eclipse into Word (where I think the quotes were honoured as U+0022, http://www.fileformat.info/info/unicode/char/22/index.htm). It was then published DIRECTLY by Word to WordPress. One of them corrupted it.

The underlying source in WordPress looks like:

void transformAtomArraysIntoFrequencies() {

Nodes atomArrayNodes = xmlInput.query(

”./*[local-name()=’atomArray’ and @dictRef=’gamessuk:normal_coordinates’]”);

(The quote is still U+0022 at this stage.) Note the emulation of the Vikings in span,span,span,span, span loverly span. What garbage. WordPress is one of the most awful tools I have ever used for editing. It corrupts everything. (Ok it’s probably an old version – I am not in control).

Then if we look at the page source of the post we find

“./*[local-name()=’atomArray’ and @dictRef=’gamessuk:normal_coordinates’]“<span…

ARGHH! ARGHH! WordPress has changed the quotes into smart quotes, U+201C and U+201D (decimal 8220 and 8221). ARGHHH. I can’t stop it. Blink, and it reverts. You cannot cut the heads off the hydra. It regrows. Sometimes it multiplies. And you end up with rows of empty Spans.

It has beaten me. Which is why I don’t post code any more.

(Yes I have tried code formatters and other things in WordPress. No use).
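For anyone who copies code out of a page like this one, here is a minimal sketch of undoing the damage on the reader's side. It maps the four common typographic quote characters back to the ASCII quotes a Java compiler expects. The class name `QuoteRepair` is hypothetical, and this only repairs pasted text; it does nothing about WordPress itself.

```java
/** Turns typographic ("smart") quotes back into plain ASCII quotes (illustrative sketch). */
class QuoteRepair {

    /** Replaces the four common smart-quote characters with their ASCII equivalents. */
    static String straighten(String source) {
        return source
                .replace('\u201C', '"')    // left double quotation mark
                .replace('\u201D', '"')    // right double quotation mark
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'');  // right single quotation mark
    }

    /** True if straightening would change the text, i.e. it contains smart quotes. */
    static boolean hasSmartQuotes(String source) {
        return !straighten(source).equals(source);
    }

    public static void main(String[] args) {
        String pasted = "\u201C./*[local-name()=\u2019atomArray\u2019]\u201D";
        System.out.println(hasSmartQuotes(pasted)); // true
        System.out.println(straighten(pasted));     // "./*[local-name()='atomArray']"
    }
}
```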

Posted in Uncategorized | 3 Comments

A common problem in informatics (UPDATE)

Posted on October 28, 2010 by pm286

#jiscxyz

Amusement (hopefully). You don’t have to understand theoretical chemistry to take part.

The following ASCII text represents the output of a theoretical chemistry program. It contains an error. This error was not created by the program (and I have removed its name (XXXXXXXXXXX) so as not to cast any aspersions on it). The error is typical of a general problem in scientific information.

=-=-=-=-=-=-=-=-

N-N= 1.312300318470D+03 E-N=-4.587475340442D+03 KE= 8.364019136890D+02

AllDun Frequency-dependent properties on file 20721 Mask= 2 NFrqRd= 1 NDeriv= 1 LenFil= 12:

Frequencies= 0.077357

Property number 2 — FD Optical Rotation Tensor frequency 1 0.077357:

1 2 3

1 -0.139106D+02 0.342269D+02 0.258275D+01

2 0.381206D+02 0.255976D+02 0.312180D+02

3 0.157839D+02 0.161937D+03 -0.135731D+02

Job cpu time: 0 days 1 hours 53 minutes 7.4 seconds.

File lengths (MBytes): RWF= 419 Int= 0 D2E= 0 Chk= 6 Scr= 1

Normal termination of XXXXXXXXXXX at Fri Oct 03 15:18:55 2010.

=-=-=-=-=-=-=-=-

The error can be detected by computer more easily than by humans.

Please indicate:

The error
How you think the error might have arisen in practice
Whether there might be other undetected errors in the document

UPDATE:

Two people have found the error. Well done. They have not hypothesised how it might have occurred. This requires a flash of inspiration and/or common exposure to this very common problem.

UPDATE:

The error is in:

Normal termination of XXXXXXXXXXX at Fri Oct 03 15:18:55 2010.

This is an impossible date (in any current Chronology). So how could it have happened? And I promise you that this type of problem occurs zillions of times every day. And no, it’s not human mistyping (though that destroys and corrupts science very effectively).

UPDATE: The date was actually Fri Oct 08 15:18:55 2010 . Does that give any clues as to what could have happened?
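The claim that a computer finds this more easily than a human can be shown in a few lines: parse the date part of the stamp, ask the calendar which weekday that was, and compare it with the weekday the file claims. This is a minimal sketch using `java.time`; the class name `TimestampCheck` is hypothetical, and it assumes the stamp always has the shape shown in the log above. It detects the inconsistency but says nothing about how it arose.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.TextStyle;
import java.util.Locale;

/** Checks that the weekday in a "Fri Oct 03 15:18:55 2010" stamp matches its date (sketch). */
class TimestampCheck {

    // Everything after the weekday name: month, day, time, year.
    private static final DateTimeFormatter REST =
            DateTimeFormatter.ofPattern("MMM dd HH:mm:ss uuuu", Locale.ENGLISH);

    /** Returns the weekday the calendar assigns to the date part of the stamp. */
    static DayOfWeek actualDay(String stamp) {
        String rest = stamp.substring(stamp.indexOf(' ') + 1);
        return LocalDateTime.parse(rest, REST).getDayOfWeek();
    }

    /** True if the weekday written in the stamp agrees with the calendar. */
    static boolean isConsistent(String stamp) {
        String claimed = stamp.substring(0, stamp.indexOf(' '));
        String actual = actualDay(stamp).getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
        return claimed.equals(actual);
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("Fri Oct 03 15:18:55 2010")); // false
        System.out.println(actualDay("Fri Oct 03 15:18:55 2010"));    // SUNDAY
        System.out.println(isConsistent("Fri Oct 08 15:18:55 2010")); // true
    }
}
```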

Posted in Uncategorized | Leave a comment

Chemical MIME and the role of the IETF

Posted on October 28, 2010 by pm286

I’ve just described Chemical MIME – not in great detail, more to illustrate a highly virulent meme. Chemical MIME is what Fowler would call a “sturdy indefensible”. It breaks the rules, but it is used, it works and it upsets few except pedants.

Until now. Read on, even if you are not a chemist, because it’s a general problem in modern informatics. And we need your help. I don’t know how to solve it.

Egon has pointed out a problem. It’s hit the KDE bug list (https://bugs.kde.org/show_bug.cgi?id=235563). It’s a software bug, not a chemical bug:


Bug 235563 – invalid MIME type in /usr/share/applications/kde4/kalzium.desktop

Summary:

invalid MIME type in /usr/share/applications/kde4/kalzium.desktop

Product:	kalzium
Component:	general
Status:	RESOLVED
Resolution:	FIXED
Target:	—

Version:	unspecified
Priority:	NOR
Severity:	normal

Votes:

Version Fixed In:

Description From Laurent Bonnaud 2010-04-27 19:19:39

Version: 2.3.80 (using 4.4.2 (KDE 4.4.2), Kubuntu packages)

Compiler: cc

OS: Linux (i686) release 2.6.32-21-generic-pae

Here is the problem:

# update-desktop-database

[…]

Error in file “/usr/share/applications/kde4/kalzium.desktop”: “chemical/x-cml”

is an invalid MIME type (“chemical” is an unregistered media type)

What does this mean?

It means that a server has labelled a file with the MIME type (Content-Type) as chemical/x-cml

And that the application software has said that’s invalid.

And the application software is pedantically right.

So, best beloved…

In the early days of the Internet when ordinary people hacked servers and small furry penguins were small furry penguins, there was a brilliant idea to label content with its type. It was a brilliant idea and it still is a brilliant idea. It means that anyone in the world, on whatever platform, getting documents from whatever server could determine their type. All you had to do was add a simple text-string and the machines would recognise it.

So if you were transmitting a piece of text, you could label it “text/plain”. And an image might be labelled “image/png”. If you didn’t do this then you couldn’t know whether the bit stream was meant to be displayed as text (e.g. in the body of a mail message) or as an image accompanying the mail.

Mail? I thought we were on browsers?

No. This far predates the browser. Read http://en.wikipedia.org/wiki/MIME. This will give you an idea of the enormous contribution made to the Internet and the modern world by the great body of those dedicated to interoperability. The Internet is based on RFCs.

RFCs? Read http://en.wikipedia.org/wiki/Request_for_Comments. Without RFCs there would be no HTML. There would be no Google. No Facebook. No HTTP. No Wikipedia. No online pornography. There would be a bickering mass of companies fighting in a sludge of non-interoperability. Everyone would have their own server spec. Everyone would have their own client spec. I remember that time. It was awful. A Holy Roman Empire of isolated barons.

One of the greatest achievements of the twentieth century was the Internet. And it succeeded because of the IETF. http://en.wikipedia.org/wiki/IETF. The IETF?

Their goal: “The goal of the IETF is to make the Internet work better.”

Their motto: “Rough consensus and running code”. This is a great step towards the democratisation of the world through technology. It has led not only to a working system of physics and software but has also become a touchstone for this century’s democracy. It’s exemplified in Wikipedia. It means listening to the other person’s point of view. And agreeing to come away with something that works.

In the IETF system, anyone can put forward a proposal. It’s called a draft. Here it is (https://datatracker.ietf.org/doc/draft-rzepa-chemical-mime-type/):

Document type:	Old Internet-Draft (Individual document)
Last updated:	1995-03-21
State:	Expired
Intended status:	–
Submission:	Individual
Responsible AD:	–


Document history

Date	Version	By	Text
1995-11-13		(System)	Draft expired
1995-03-21	01	(System)	New version available: draft-rzepa-chemical-mime-type-01 (diff from -00)

This Internet-Draft is no longer active. Unofficial copies of old Internet-Drafts can be found here:
http://tools.ietf.org/id/draft-rzepa-chemical-mime-type.

Abstract:
The purpose of this Internet Draft is to propose an update to Internet RFC 1521 to include a new primary content-type to be known as chemical. RFC 1521[1] describes mechanisms for specifying and describing the format of Internet Message Bodies via content-type/subtype pairs. We believe that chemical defines a fundamental type of content with unique presentational and processing aspects. We outline the typical expected uses of such a content type and propose a number of chemical sub-types. This document updates IETF Internet Draft draft-rzepa-chemical-mime-type-00.txt in which this specific proposal was made, incorporates suggestions received during the initial discussion period and indicates scientific support for and uptake of this proposal[2-7].

Authors:
Henry Rzepa <rzepa@ic.ac.uk>

P. Murray-Rust <pmr1716@ggr.co.uk>

B. J. Whitaker <benw@chemistry.leeds.ac.uk>

(Note: The e-mail addresses provided for the authors of this Internet-Draft may no longer be valid)

We put the idea into the IETF framework. It was a Draft, not an RFC. We had 6 months to convince the IETF. Henry went to a meeting. There was lots of discussion. One suggestion was that it could be used to send recreational drugs over the network. (Since I was working for Glaxo I wasn’t wild about being associated with this idea and it was not pursued in the body of the draft!).

The draft had a lot of supporters but it failed to get critical mass. It lapsed. Not enough rough consensus.

MIME is an excellent idea but its implementation does not allow easy extensibility. There’s a hardcoded set of types of the form foo/bar – seven major types and many secondary ones. Everyone knows that hierarchical classification systems break down sooner or sooner.

MIME’s extensibility was through “x-“. So suppose you had a new image format called penguin (designed to transmit pictures of penguins) , you might write “image/x-penguin”. At some stage in the future it might become accepted as a standard part of MIME.

So we started creating chemical/x-pdb, chemical/x-cml, etc. They are listed at http://www.ch.ic.ac.uk/chemime/. The idea took off. There are probably hundreds of millions of documents labelled with chemical MIME. OK, the IETF didn’t want to know about them but they worked. The MIME system did not appear to require known MIME types.

And they have worked for 15 years.

Until, apparently, now. The software above checks primary MIME types. “chemical” isn’t one of them. So it throws an exception. It’s “right”.
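The kind of check that trips over chemical/x-cml can be sketched in a few lines. This is NOT the code of update-desktop-database or KDE, which I have not read; it is a guess at the logic. The class name `MediaTypeCheck` is hypothetical, and the set of primary types is an assumption: it holds only the seven defined in RFC 2046, and top-level types registered later are left out.

```java
import java.util.Locale;
import java.util.Set;

/** A guess at how a strict validator treats "type/subtype" strings (illustrative sketch). */
class MediaTypeCheck {

    // Assumption: only the seven top-level types defined in RFC 2046.
    static final Set<String> TOP_LEVEL = Set.of(
            "text", "image", "audio", "video", "application", "multipart", "message");

    /** Returns the part before the slash, lower-cased. */
    static String topLevel(String mediaType) {
        int slash = mediaType.indexOf('/');
        if (slash <= 0 || slash == mediaType.length() - 1) {
            throw new IllegalArgumentException("not of the form type/subtype: " + mediaType);
        }
        return mediaType.substring(0, slash).toLowerCase(Locale.ROOT);
    }

    /** True if the primary type is in the hardcoded set. */
    static boolean hasKnownTopLevel(String mediaType) {
        return TOP_LEVEL.contains(topLevel(mediaType));
    }

    /** True if the subtype uses the "x-" extension prefix. */
    static boolean hasExperimentalSubtype(String mediaType) {
        String subtype = mediaType.substring(mediaType.indexOf('/') + 1);
        return subtype.toLowerCase(Locale.ROOT).startsWith("x-");
    }

    public static void main(String[] args) {
        System.out.println(hasKnownTopLevel("image/x-penguin")); // true
        System.out.println(hasKnownTopLevel("chemical/x-cml"));  // false
    }
}
```

Under this reading, "image/x-penguin" passes because the "x-" sits in the subtype, while "chemical/x-cml" fails because the invention is in the primary type, where a strict checker allows no extension.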

But it’s not helpful.

What to do? I really don’t know. I can think of the following:

Go back to the IETF. Chance of success? 0.00000001

Get the chemical world to change to another MIME type (it’s possible that “x-chemical/pdb” would be allowed. But it might not). It would destroy hundreds of millions of working documents.
Fix the behaviour of KDE. Chance of success 0.00001
Ignore the problem
Try some awful kludgy workaround

How important is this problem? I don’t know. Is MIME becoming stricter? I doubt it. Are more systems validating it? ??

In so far as it is a problem it reflects the lack of community approach in chemistry. The chemical software industry is based largely on non-interoperability and lockin. All the approaches – and there aren’t many – come from outside either the software vendors or the pharma industry. Chemical MIME; CML; The Blue Obelisk; InChI. None of these have been industry-led. They succeed to the extent that they fill an essential need. Pharma ought to care – it doesn’t publicly show it. Software industry ought to care. It doesn’t until it’s forced to. I am not surprised by this – standards come when the industry is in a mess and they are essential, and we are at that stage now.

There will be a considerable number of new MIME types registered as a result of the Quixote project. We need to know the precise types of computational input and output. We do this without the active help of the companies producing the tools that create these files.

For Chemical MIME we will keep buggering on.

UPDATE:

Read the comments…


Posted in Uncategorized | 1 Comment

What makes an Internet meme? Chemical MIME and CML… “We must just KBO.”

Posted on October 27, 2010 by pm286

#quixotechem #jiscxyz

I am still intrigued (?amazed) by the unpredictability of new ideas and technology on the Internet. All I do is fire off memes and see what happens. I’m reasonably experienced in creating memes. And even more experienced at non-memes. I’m no better at predicting success than anyone else.

For example in 1994 Henry Rzepa and I developed Chemical MIME (a way of supplying chemical typing in mail and server headers). It didn’t take off in the IETF but when we released a package of free software and specs it raced through the Internet in weeks. Admittedly at that time the scientific Internet (the number of sites) was smallish, but configuring clients and distributing software was also difficult. I surmise that the meme (like a virus) has to have 3 essential features:

The ability to infect someone, simply by its own power. For Chemical MIME this was the ability to display and rotate beautiful coloured molecules and to ask scientific questions. There was hardly a chemist in 1994 who would not go “wow!”. So infection was facile.
A near-zero or zero cost of replication. If the host has to expend a lot of energy to create replicas of the memes then the process slows. (For example if the host has to manufacture “stuff” – as in the RepRaps – then most people will not replicate.) In the case of Chemical MIME all the host has to do is to clone the incoming material – this involves obtaining and mounting copies of the displayed molecules and free software.
The desire to replicate and mutate. In this case the host wants to show the world what *they* can do. They create displays of their own molecules, which may be even more striking than the ones they saw. They also help to improve the process of replication – better tutorials, slicker web pages, improved software. So the process accelerates and prospers.

When we created Chemical Markup Language (in ca. 1994) we thought it might be rapidly copied – that in a year or two everyone would be using it. It needed 1997 to create XML but after that the world seemed to explode. We heard of financial consortia for XML which closed doors after 2 weeks. Tremendous hype. So clearly chemistry would be sucked up in the rush? Not quite.

In fact the same is true of MathML. It’s also taken its time. But both CML and MathML make steady progress. We keep hearing of new users and applications and we keep developing our toolkit. There isn’t an alternative to XML or similar language to represent the complexity of chemical documents (most of the legacy approaches deal only with molecules, or possibly simply reactions). CML can represent whole computations, crystal structures, preparations, etc.

“To everything there is a season” – I learnt at school (Ecclesiastes 3:1-8 NKJV) and timing is critical on the Internet. Because the Internet doesn’t care WHO wins, it just cares that someone does. So there was bound to be an Internet Encyclopedia, but Jimmy Wales’s wasn’t the first. Google was nowhere near the first search engine, but it hit an optimum of timing, design and performance. In rapidly changing fields – such as social networking – you have to hit everything at the right time. But some things have a different timescale. Launched too early, they fail to take off (or take root). Launched too late, and the early birds beat the latecomers. Like biological ecosystems we should expect great wastage. Ultimately if a meme has a potential place in an ecosystem its progenitors must keep launching it until it succeeds.

In WWII, Winston Churchill came up with the simple formula “We must KBO” (Keep Buggering On). It’s a simple, heartening formula. It evens out the highs and the many lows. It’s based on absolute faith that one will succeed. During the low periods you have to keep going, however boring, hopeless, however many setbacks. Rather than the heady explosion of Chemical MIME, CML has been a long long slog. There’s never been any Gartner curve (Hype cycle – Wikipedia). No peak of inflated expectations; no trough of disillusionment. Mainly long slog, with general apathy, sometimes hostility, and occasional positive moments.

There have been many friends on the journey: The Blue Obelisk; the Earth Science community; Microsoft Research; the bioscience community; and others. The time is now coming. We’re continually finding people who are starting to use CML. We’ve got a million lines of Open code written. There are clear applications – semantic publishing, computational chemistry, crystallography which simply cannot be done by other technologies.

And there’s the Open semantic revolution. Conventional tools cannot support either of these – it needs a rich structured language technology with vocabulary support.

It’s the ultra-rapid takeoff of Quixote which has finally convinced me that the technology is mature enough for success. We’re on JUMBO5. Stable. Schema 2.4. Stable. Tools for much of the computational work. 200,000 structures in Crystaleye. 200,000 downloads of Chem4Word.

There’s a lot more “buggering on” required. “Blood on the floor at midnight” it feels like to me. Rewriting the code for the 5th time isn’t fun. The Blue Obelisk has been a lifesaver. The breathing space in Chem4Word has given us a stable, viable, robust validating toolkit. The Earth science experience has given us belief in a whole sector – compchem. Quixote has been a delight.

We have tunnelled through to the Slope of Enlightenment (in Gartner’s terms). If you want to join in, now is the start of a highly productive time.

Posted in Uncategorized | 7 Comments

REST in peace in the Matrix

Posted on October 26, 2010 by pm286

#quixote #jiscxyz

Many of the ideas that we’ve had on the World Wide Molecular Matrix are now starting to become possible. In my innocence in 2002 I thought that imaginings were one step from reality. The bits in between were so easy to conceive that they wouldn’t take much time.

They’ve taken 10 years

However I can look back and find that many of the ideas are pretty much unchanged. Here’s a picture crafted at least 5 years ago which describes different types of site (server) in the WWMM. The details aren’t important.

It’s the technology that has risen up to meet them. And the general acceptability within the community. Back then it was WSDL and UDDI (imagine that!), oh and SOAP. And Portals. It’s taken courage to strip all that away and go back to the simple ideas. REST, schema-less designs, flexible vocabularies.

But most of all that the whole system is Open. I hadn’t realised back then how much of a drag AAA was. Authentication, Authorization, Accounting. They kill projects. Much of the eScience program was struggling with these monsters.

Of course if you want to transfer money between banks you need this. That’s why we pay bankers enormous salaries and bonuses. But for Open science we don’t need anything. A few social controls. Keep the spammers and wreckers out. Make sure people don’t DOS the system.

So Sam Adams has planned all this and I’m convinced his design is what modern eScience should be. (We also owe a lot of this to Jim Downing). All the bits are there. We need the following sorts of server (they are not quite what is in the picture, but share the general idea):

One that allows scientists to publish their data to the Open web. Pablo Echenique has already done this in the Quixote project. But not everyone is allowed to run a server.
A server to which anyone can upload. Anyone who is not a spammer. Sam will tell us how that can be done easily.
A server that scrapes the exposed web. This can be Pablo-type sites, or journals, or anything. Even Institutional repositories if they expose an iterator over the data (which most do not). Its results are exposed and read-only. It offers search and indexing.
A customisable repository with embargo. Chem#, pronounced Chempound. It’s the result of several JISC projects – SPECTRa, CLARION, JISCXYZ – and it’s coming together now. A few bits to come but RSN. It will allow people to store their data responsibly while they need it and archive it later.

The WWMM is not restricted to molecules. The architecture will handle anything that’s semantic. It hates PDF. It hates Powerpoint (unless in XML). It likes anything in text and in XML. It’s not wild about images yet. The day will come shortly when images are semantic.

And if you want to find out more, just join the Quixote list (quixote-qcdb@googlegroups.com). You don’t have to be a chemist. You just have to enjoy seeing Open scientific data.

Posted in Uncategorized | Leave a comment

“books” On the Cambridge Train

Posted on October 26, 2010 by pm286

I’m coming back from JISC and again sitting on the floor among the Bromptons. Alice and Bob are in their regular seats. They must get out earlier than me or rush along platform Zero faster than the average punter. (The 1645 is not a good train to arrive just-in-time for unless you like bicycles). Anyway I catch part of their conversation.

A: So what are you reading?

B: “Jane Austen’s Pride and Prejudice”

A: You mean “Pride and Prejudice”.

B: No it’s called “Jane Austen’s Pride and Prejudice”.

A: Who’s the author, then. I thought it was Jane Austen.

B: It’s by “eyePoodle”.

A: ?????

B: Yes, it’s the name of the company that makes this e-reader.

A: But the book is by Jane Austen, right?

B: Well sort of. She wrote most of the words, but eyePoodle actually wrote the book.

A: You mean they copied her words.

B: No, they’ve actually changed them to give a better user-experience.

A: What the hell is a “user experience”?

B: It’s what you get when you buy an eyePoodle.

A: OK – well how does it start? I know this by heart from first-year English Literature. It should be “It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.”

B: Well more or less. It says “It’s generally true that a wealthy man needs a wife”.

A: !!!!! That’s appalling. Why have they edited it?

B: The long words don’t fit the screen well. So they’ve shortened some of them. And they’ve made the sentences easier to read.

A: They’ve ruined it. Let me see.

B: You won’t be able to read it.

A: Yes I can – I don’t need glasses.

B: Only I can read it.

A: Bullshit. Give it to me…

B: OK …

A: [Stares at blank screen]. How do I switch it on?

B: It is on. I told you only I can read it.

A: What do you mean?

B: It’s DRM’ed.

A: ?????

B: Digital Rights Management. Only I can read it. I have to hold my thumb over this fingerprint reader.

A: OK, pass it over and pass your thumb over.

B: It’ll be the wrong way up.

A: No it won’t. Look

B: No my THUMB will be the wrong way up. You’ll have to sit on my lap…

A: Easy tiger…

B: Anyway I’ll have to get back to reading.

A: Why the rush?

B: I’ve only got 4 hours left.

A: ?????

B: You only get the book for 24 hours. Then you have to pay more.

A: So who owns the book?

B: eyePoodle. They’ve started buying up books for the eyePoodle. That’s why the title is slightly different. Then they can copyright it.

A: You mean that because they’ve rewritten it they can copyright it?

B: Yes, and every 60 years or so they’ll alter a few words and recopyright it. Great business. I’ve got shares in eyePoodle.

A: Well I’ll go to the library and get the real book. And I’ll make my own copy as Jane Austen’s been dead for years.

B: You can’t – they got rid of the books and replaced them with eyePoodles.

A: I’ll go straight to my Reader Services and DEMAND a copy.

B: Sorry it’s now called “Vendor Services”…

… Royston …Next stop Cambridge …

UPDATE:

Posted on OKF open-bibliography:

http://www.guardian.co.uk/books/2010/oct/26/libraries-ebook-restrictions

No – I hadn’t seen this before writing the blog.

Posted in Uncategorized | 11 Comments

Miscellanea

Posted on October 25, 2010 by pm286

I asked a question about stereochemistry http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2643 and owe the community an answer. The answer is my answer. It may or may not be “right”. I don’t like the word “right” in science. But I hope it’s acceptable to those who think about the problem. I showed two pictures

And asked what the relationship of these molecules was. The “right” answer was to be that it was impossible to tell as there is a stereocentre in the middle of the molecule that is undefined. But it was suggested that because the two molecules were drawn in the same way then we might (not “should” but “could”) assume that the stereocentre was consistent. In which case we could say that although we didn’t know what the molecules actually were we could say they were geometrical isomers, not enantiomers.

If I had drawn all the centres explicitly

Then we could say “definitely” that the two were geometrical isomers (cis/trans, configuration, … I am using http://en.wikipedia.org/wiki/Cis%E2%80%93trans_isomerism terminology).

There’s an assumption that conformation doesn’t play a role – that the cyclobutane ring flips rapidly enough to “average” the structure. Here’s Wikipedia’s example of cyclohexanes:

Alicyclic compounds can also display cis-trans isomerism. As an example of a geometric isomer due to a ring structure, consider 1,2-dichlorocyclohexane:


trans-1,2-dichlorocyclohexane	cis-1,2-dichlorocyclohexane

Note that the cis- compound as drawn can have enantiomers. What we all “know” is that at room temperature they interconvert so that the molecule is not optically active. But if we cool it down or look at a very short time scale then it would indeed have enantiomers. So we have to be very careful in how we phrase the questions because people make assumptions. And my assumption is often not your assumption.

==== Next divertissement ====

I am now writing parsers for compchem log files. This engages some of the pleasure/hate centres of my brain in the same way as Sudoku does – and it’s slightly more productive. Here’s a typical bit of a log file. It doesn’t matter what the numbers are or what they mean

…. . . . omitted . . .

28 H 2.700772 3.229400 5.731856 4.467482 7.448261

29 H 1.099958 2.072781 5.129949 3.144928 5.344895

30 H 1.920507 0.965845 5.216951 4.614282 6.231286

21 22 23 24 25

21 H 0.000000

22 H 1.777203 0.000000

23 H 5.002806 5.524624 0.000000

24 H 6.096661 6.315358 1.774278 0.000000

25 H 7.214158 8.252435 4.711394 4.561542 0.000000

26 H 8.228138 9.136704 6.730950 6.277461 2.501917

27 H 7.915186 8.417597 7.182659 6.464504 4.302447

28 H 6.508693 6.586577 5.931001 5.094904 4.950885

29 H 4.817286 4.590808 4.296008 3.705276 5.455187

30 H 6.292404 5.863642 3.978192 2.711028 6.081788

. . .Snipped.. .

Question: (a) what would be the next line – precisely – and (b) what would the line after that start with?

And how would you write a parser for it?
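One possible shape for such a parser, as a minimal sketch rather than the Quixote code: treat a line of bare integers as a column header, treat a line of "serial, element symbol, decimals" as a row, and assign each value in a row to the header column in the same position. The class name `TriangleParser`, the two regular expressions and the decision to skip everything else are my assumptions for illustration. Rows seen before any header are skipped because their column numbers are unknown.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Reads a lower-triangular matrix printed in blocks of columns (illustrative sketch). */
class TriangleParser {

    // A header line holds only column serial numbers, e.g. "21 22 23 24 25".
    private static final Pattern HEADER = Pattern.compile("\\d+(\\s+\\d+)*");
    // A row line holds a serial, an element symbol and one or more decimals.
    private static final Pattern ROW =
            Pattern.compile("\\d+\\s+[A-Za-z]{1,2}(\\s+-?\\d+\\.\\d+)+");

    /** Returns row serial -> (column serial -> value) for every row that follows a header. */
    static Map<Integer, Map<Integer, Double>> parse(List<String> lines) {
        Map<Integer, Map<Integer, Double>> matrix = new TreeMap<>();
        String[] columns = new String[0];
        for (String raw : lines) {
            String line = raw.trim();
            if (HEADER.matcher(line).matches()) {
                columns = line.split("\\s+");          // start of a new column block
            } else if (ROW.matcher(line).matches() && columns.length > 0) {
                String[] tok = line.split("\\s+");
                int count = tok.length - 2;            // values after serial and symbol
                if (count > columns.length) {
                    throw new IllegalStateException("more values than columns: " + line);
                }
                Map<Integer, Double> cells =
                        matrix.computeIfAbsent(Integer.parseInt(tok[0]), k -> new TreeMap<>());
                for (int i = 0; i < count; i++) {
                    cells.put(Integer.parseInt(columns[i]), Double.parseDouble(tok[i + 2]));
                }
            }
            // Blank lines, "snipped" markers and rows before any header are skipped.
        }
        return matrix;
    }
}
```

A real log file will contain other lines that happen to look like headers or rows, so a production parser would also need to know where the matrix section starts and ends.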

Posted in Uncategorized | 5 Comments

The Quixote knowledgebase for compchem continues- and Open Bibliography??

Posted on October 25, 2010 by pm286

#quixotechem #blueobelisk

It’s a few days since we announced that we had prototyped a distributed knowledgebase for computational chemistry – Quixote. We’ve already had useful and positive feedback. Here’s Anna:

Anna says:

October 25, 2010 at 8:12 am

Finally! I’ve been bemoaning the lack of and obviousness of a global compchem database for quite a while without the resources to do it. Where can I alpha-test, please? Well done guys.

Anna – you can become part of this; just go to http://quixote.wikispot.org/Front_Page and join the list and tell us what you would like and can offer. You don’t say whether you want to use an existing knowledgebase (e.g. for reference, starting geometries, teaching, etc.) or whether you want to store and publish your own work. We’d be particularly interested in collections of legacy log files that you think would be of general interest. (At present we’d see the files being saved in specific collections so that people have a natural way of browsing them, but of course the whole resource will be searchable).

I’ve also had an offer from Henry Rzepa. He has reposited several thousand files in DSpace. Trouble is that once a collection is in DSpace you can’t get it out. That happened to me. So my challenge to the DSpace repositarians is:

“How do I download 5500 entries in DSpace?” [https://spectradspace.lib.imperial.ac.uk:8443/dspace/handle/10042/28 ]

I do not, obviously, have time to click through all of them. I would be prepared to spend 30 minutes of my time. No, since it’s Henry, I am happy to spend 60 minutes. If it’s easier than I thought I will gently revise my opinion of Institutional Repositories for data. (But this is only one of the reasons why IRs don’t service scientific data). If it’s effectively impossible without writing my own scraper then I shall continue to look elsewhere. (After all this is partly why we built Quixote).

Then I have had a wonderful correspondence from another correspondent. This is a mainstream organic chemist. They write:

I am just starting out as a lecturer in … and I am increasingly aware of the problems of how I (and other members of the compchem community) handle and report our calculations. With my new principal investigator hat on, I am particularly concerned that:

(a) when students/post-docs come and go, their compchem data is not lost forever on their laptops/desktops

(b) rather than reading a doc/xls/pdf file summary of computations I will want to check the input/output files, particularly with new students to make sure they are doing things correctly

(c) when we publish work other groups are able to re-use our data easily (how much time have I wasted reconstructing input files from weird PDF formatting??!)

(d) that my work is reproducible.

…

I currently have output from Gaussian, NWChem, Orca, Gamess-US, Macromodel, Jaguar, Tinker on various machines/servers/external disks, which is clearly sub-optimal. The idea of some nice QM data in a searchable repository would be pretty cool too for the force-field community, as all the ab initio parameterization was done using HF and MP2 with small basis sets for small molecules – I would imagine in principle many of the calculations required to parameterize/reparameterize a forcefield with more exotic molecules and more sophisticated levels of theory have been done by various people already….

I was recently tasked by the journal[…] to see if authors were following their editorial guidelines on compchem. Supporting Info – from a sample of ten recent papers using DFT calculations we found that the requisite Cartesians [coordinates of atoms], absolute energy and imaginary frequencies (for [transition states]) were usually present although not always (in spite of a check list that authors have to use). Deposition of the original files in the same way as we do with cifs [crystallographic data files, required by journals] already would solve this problem. Another little anecdote – I wanted to visualize the MOs of retinal the other day, so I had to calculate them myself – hundreds of people must have done this already at some point, so if I could search for this data rather than waste hours of time/electricity that would be much better.

…if I can get involved/test or supply certain file formats/ raise awareness then please let me know!

The work with the Journal is extremely valuable. As my correspondent says, all we have to do is persuade the Journal to publish the supporting info. We (in the JISCXYZ project) can validate it. Compchem is the best of all fields (even better than crystallography) for an almost complete data validation before refereeing. It’s even possible to do it oneself and sign the result.

And another correspondent, this time from the Blue Obelisk. [I asked about REST and some of the problems with firewalls.]

I am forwarding this message to […] developers list since it is very similar to what […]project is about.

Perhaps we can collaborate with reusing/extending existing […] REST services with similar functionality and sharing experience with development of web services based on RDF .

…
One of my colleagues was managing certificate authority for EGEE grids and running a certificate authority, so there is such experience.

And just for the record, distributed security in REST is not trivial nor simple at all … there are no well accepted solutions currently.
This might be the major obstacle in front of any REST approach aiming at distributed services, rather than single REST site , as most major commercial REST services these days.

We are currently adopting not optimal (centralized) solution in […] and GEANT-wise [A European GRID project and infrastructure] I am involved in a small group preparing an RFC for a protocol for REST security .

…
Just sharing what we are struggling with for two years already, having tens of distributed REST services over Europe , 5 independent implementations in two languages, covering at least half of the functionality listed in your email.

This is all really exciting. You start to see how volunteers start to make major contributions to the project.

I am sure the redacted names will become public soon …

Now – if only we could get the same sort of excitement for bibliography. I’ve posted that we now have millions of Open Bibliographic records. Currently I have interest from 2 chemists, a zoologist, a mathematician, an economist, and several hackers. I met 4 librarians at JISC on Friday and bounced up to each and said “we’ve now got the British Library bibliography! Open! 3 million records. What can we do together?”

The answer was that they weren’t interested at all. “why should we be interested in some other library’s catalogue?” “bibliography isn’t interesting”. I was gobsmacked. Bibliography is the soul of scholarship. I thought that by collecting bibliography and turning it into an intelligent semantic resource then we would start a new era in the library.

I really don’t know now what I am going to say to the Research librarians in Edinburgh.


The Absolute Minimum Every Scientist with Data Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Posted on October 24, 2010 by pm286

I wrote a rant about PDF destroying scientific information (it keeps me sane). PDF is a hamburger. Power corrupts and Powerpoint corrupts absolutely. But they aren’t the only ways.

The third most common method of textual destruction of science is to use wordprocessors.

This is a long post. It’s longer because you HAVE to read Joel Spolsky’s post (from which I pinched the title). But you have to read it.

Word, Open Office, LaTeX. They all destroy science. Word is the worst because it’s the most commonly used and the smartest. Which means the most evil. (Yes, I’m sponsored by Microsoft to develop chemistry in Word. And those MS folks I collaborate with will understand and sympathize). Open Office will try to emulate the destruction. And LaTeX creates beautiful typeset output which also destroys science.

You probably think I’m mad. It’s catalysed by a reply from Henry Rzepa to my post: I’m going to point to Henry’s reply rather than quote as I daren’t cut and paste. You’ll see why (and it’s not copyright this time): Here’s his comment http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2660&cpage=1#comment-493508. Read it carefully and tell me what you see close to the last two “ROA” strings.

I see:

If I cut and paste this I get:

the (otherwise relatively unremarkable molecule) were rather larger than normal. The relevant field is identified in the output with the string ROA– and the numerical value is identified as -9999.9. Well, for our molecule, this ended up as ROA—10000.0 (the numbers are fictitious, to illustrate the problem). You can see how one missing space totally messed.

So everything looks OK? Yes?

If, however, I use the edit function on Henry’s post I get the raw text:

intensities for the (otherwise relatively unremarkable molecule) were rather larger than normal. The relevant field is identified in the output with the string ROA– and the numerical value is identified as -9999.9. Well, for our molecule, this ended up as ROA—10000.0 (the numbers are fictitious, to illustrate the problem). You can see how one

which renders on my machine as

What has happened?

WordPress (and this would also be true of Word and Open Office) is smart. It knows that Henry “wanted” to type the em-dash character. Now it’s not on Henry’s keyboard, but WordPress (and the other corrupting processors) know what is best for Henry. It’s not what Henry actually wants, it’s what they are going to give Henry whether he likes it or not. So he types three minuses in succession ‘-’ ‘-’ ‘-’ and gets:

ROA—10000.0

(rendered on my machines as )

He wishes us to interpret this as “R” “O” “A” “-” “-” followed by “minus ten thousand”.

[Note incidentally that as I type this, Word has CHANGED, yes CHANGED, the quote character (ASCII 34) in my text to the evil smart quotes – the sloping things. Word knows what’s best for me as well as for Henry.]

So whereas Henry wrote something perfectly sensible and important, the tools have changed his text to gibberish.
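Here is a minimal Python 3 sketch of what this does to a parser. The strings are the fictitious values from Henry's comment; the helper name `describe_non_ascii` is mine, purely for illustration. It shows that the typed text still contains a parseable number, while the "smartened" text has fused the field label and the minus sign into a single character.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

typed = "ROA---10000.0"          # what was typed: three ASCII hyphen-minus characters
rendered = "ROA\u201410000.0"    # what the tool produced: one EM DASH (U+2014)

# In the typed text the label "ROA--" can be stripped, leaving a signed number.
value = float(typed[len("ROA--"):])
print(value)                          # -10000.0

# In the rendered text the label's hyphens and the minus sign are one character.
print(describe_non_ascii(rendered))   # [(3, 'U+2014', 'EM DASH')]
print(describe_non_ascii(typed))      # []
```

A check like this does not repair anything. It only tells you that something other than ASCII has crept into a numeric field, which is the moment to stop and look.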

THE ONLY FORMAT THAT SHOULD BE USED FOR TRANSMITTING SCIENCE IS ASCII. Code points 32-126 (the printable characters).

Help! What’s a code point?

A code point is the platonic identification of the character. Not what it looks like (its glyph). Not what it is transmitted as (its encoding). The character itself.

So if we can only use code points 32-126, how do we do important things like the degree sign? And the world-conquering copyright symbol? They aren’t in ASCII.
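To make the distinction concrete for those two symbols, here is a small Python 3 sketch (the function name `describe` is my own). For each character it reports the code point, the official Unicode name, and the bytes produced by two different encodings. The glyph is whatever your font decides to draw, so code cannot show it.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes as hex, ISO 8859-1 bytes as hex)."""
    return (
        f"U+{ord(ch):04X}",          # the code point: the character's identity
        unicodedata.name(ch),        # its official Unicode name
        ch.encode("utf-8").hex(),    # one encoding: two bytes in UTF-8
        ch.encode("latin-1").hex(),  # another encoding: one byte in ISO 8859-1
    )

print(describe("\u00B0"))   # ('U+00B0', 'DEGREE SIGN', 'c2b0', 'b0')
print(describe("\u00A9"))   # ('U+00A9', 'COPYRIGHT SIGN', 'c2a9', 'a9')

# Neither character can be encoded in ASCII at all.
try:
    "\u00B0".encode("ascii")
except UnicodeEncodeError:
    print("the degree sign has no ASCII encoding")
```

One character, one code point, but different bytes depending on the encoding. A reader who is not told which encoding was used has to guess.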

YOU MUST NOW READ JOEL SPOLSKY. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

And Wikipedia: http://en.wikipedia.org/wiki/Unicode .

If you do not understand the difference between code points, encodings and glyphs, and if you are creating any form of data-rich scientific document then you are probably committing a crime against science.

You are randomly multiplying your numbers by -1.

You are randomly destroying symbols denoting units.

You are randomly losing digits.

You are transforming “mu” to “m”.

I’m sorry, but there is no escape. You HAVE to understand lab safety, right? It’s part of science.

So is character encoding.

So here is the good news:

When every scientist and scientific publisher uses Unicode, and when all the toolsets support Unicode and when documents are transmitted with a clear encoding (such as UTF-8), then the problem goes away.

And the bad news is that we are not there yet:

Publishers produce gibberish on their web pages. (Not all pages and not all publishers, but there is still gibberish). Here’s one of the good (the Royal Society of Chemistry):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<?xml version="1.0" encoding="utf-8"?>

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

It specifies the XHTML namespace, the human language (EN), the DTD (for validation) and the ENCODING (UTF-8). Everything that should be there.

And here is the ACS:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

Missing the language and the encoding. But at least they got the COPYRIGHT properly encoded. It wouldn’t do to fail to display that:

            <div><a href="/action/clickThrough?id=317672&url=http%3A%2F%2Fportal.acs.org%2Fportal%2FPublicWebSite%2Fcopyright%2Findex.htm&loc=%2Fjournal%2Fjoceah&pubId=40026046" title="ACS Copyright Information" class="underline">Copyright &copy; 2010 American Chemical Society</a></div>

Where's the copyright symbol? It's the "&copy;". See Wikipedia: http://en.wikipedia.org/wiki/Copyright_symbol which says:

The character is mapped in Unicode as U+00A9 © copyright sign (HTML: © ©).^[4] On Windows systems, it may be entered by means of Alt codes, by holding the Alt key while typing the numbers 0169 on the numeric keypad. On Macintosh systems, it may be entered with ⌥G. The HTML entity is ©, and it can also be referenced as © or ©.

Unicode has also mapped U+24B8 Ⓒ circled latin capital letter c and U+24D2 ⓒ circled latin small letter c.^[5] They are sometimes used as a substitute copyright symbol where the actual copyright symbol is not available in the font or in the character set, for example, in some Korean code pages.

But hang on! I've just copied those paragraphs into TextPad and what I see is:

The character is mapped in Unicode as U+00A9 ©? copyright sign (HTML: © &copy;).[4] On Windows systems, it may be entered by means of Alt codes, by holding the Alt key while typing the numbers 0169 on the numeric keypad. On Macintosh systems, it may be entered with ?G. The HTML entity is &copy;, and it can also be referenced as © or ©.

Unicode has also mapped U+24B8 ?? circled latin capital letter c and U+24D2 ?? circled latin small letter c.[5] They are sometimes used as a substitute copyright symbol where the actual copyright symbol is not available in the font or in the character set, for example, in some Korean code pages.

It’s completely trashed. AND THIS IS WHAT HAPPENS WHENEVER YOU CUT AND PASTE INFORMATION WITH TOOLS THAT ARE NOT UNICODE COMPLIANT AND WHERE THE ENCODING IS NOT SET. Notice that many of the symbols are replaced by question marks (?). And the small capitals have been turned into lower case. And goodness knows what else.

Is it Wikipedia’s fault? Possibly. Their page shows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">

<head>

<title>Copyright symbol - Wikipedia, the free encyclopedia</title>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

<meta http-equiv="Content-Style-Type" content="text/css" />

They have identified the charset to the HTML agent (browser) but this isn’t the same as actually adding the encoding. Let’s edit the Wikipedia page in TextPad. And we get a message:

As soon as you see this you can be CERTAIN THAT YOUR INFORMATION IS BEING CORRUPTED. What it means is “You haven’t given TextPad the encoding so TextPad (which is a good program) has to guess what the characters are”.

What’s 1252? See http://en.wikipedia.org/wiki/Windows-1252 which explains:

Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as ansinew. The encoding is a superset of ISO 8859-1, but differs from the IANA’s ISO-8859-1 by using displayable characters rather than control characters in the 0x80 to 0x9F range. It is known to Windows by the code page number 1252, and by the IANA-approved name “windows-1252”. This code page also contains all the printable characters that are in ISO 8859-15 (though some are mapped to different code points).

It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling but it is not standard behaviour and care should be taken to avoid generating these characters in ISO-8859-1 labeled content. However, the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.^[1]

If you don’t understand this, it means “Windows does it differently from everyone else”. But, to be fair, everyone else (nearly) is also making a pig’s ear of it.
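A short Python 3 sketch of what a wrong guess does. I am assuming the mechanism is roughly "bytes written in one encoding, read in another"; I do not know exactly what TextPad does internally. Python calls Windows-1252 `cp1252`.

```python
copyright_sign = "\u00A9"
utf8_bytes = copyright_sign.encode("utf-8")        # the two bytes C2 A9

# Wrong guess: UTF-8 bytes read as Windows-1252 give two characters, not one.
wrong_guess = utf8_bytes.decode("cp1252")
print(len(wrong_guess), ascii(wrong_guess))        # 2 '\xc2\xa9'

# Right guess: the same bytes read as UTF-8 give back the original character.
print(utf8_bytes.decode("utf-8") == copyright_sign)   # True

# A character with no slot in Windows-1252 cannot be stored at all;
# with errors="replace" it is silently turned into a question mark.
circled_c = "\u24B8"                               # CIRCLED LATIN CAPITAL LETTER C
print(circled_c.encode("cp1252", errors="replace"))   # b'?'
```

The first case is the familiar stray accented capital A in front of a symbol. The last case is where the question marks come from: the information is gone and no later tool can get it back.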

So the simple rule is:

DON’T CUT AND PASTE NON-ASCII CHARACTERS WITHOUT AN EXPLICIT ENCODING.

Which simplifies to

DON’T CUT AND PASTE UNLESS YOU KNOW WHAT YOU ARE DOING AND YOU KNOW YOUR TOOLS ARE COMPLIANT

Which normally simplifies to

DON’T CUT AND PASTE

So this looks pretty bleak. Many tools are non-compliant. Almost all documents are non-compliant. Almost no scientist understands the problem. People are used to seeing gibberish in their browser. In Word. In LaTeX.

But it doesn’t apply to science, does it?

Here’s a simple true sentence:

1 μg = 0.001 mg

And cut and paste reduces it to

1 mg = 0.001 mg

Well, if your experiments don’t worry about a factor of 1000, then don’t worry. This is happening every day – micrograms are transformed to milligrams. A clever tool somewhere “guesses”. And of course cutting and pasting from Powerpoint almost certainly guarantees this.
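It is worse than it looks, because Unicode has two characters that are normally drawn with the same "mu" glyph. The Python 3 sketch below shows them, and then a deliberately crude guard (the function `ascii_safe_units` and the convention of spelling the micro prefix as "u" are my assumptions, not a standard) that refuses to guess about anything else.

```python
import unicodedata

GREEK_MU = "\u03BC"     # GREEK SMALL LETTER MU
MICRO_SIGN = "\u00B5"   # MICRO SIGN

# Two different code points, usually the same glyph on screen.
print(GREEK_MU == MICRO_SIGN)                                  # False
print(unicodedata.normalize("NFKC", MICRO_SIGN) == GREEK_MU)   # True

def ascii_safe_units(text):
    """Spell every mu as 'u' (hypothetical convention); fail on other non-ASCII."""
    text = unicodedata.normalize("NFKC", text).replace(GREEK_MU, "u")
    text.encode("ascii")   # raises UnicodeEncodeError rather than guessing
    return text

print(ascii_safe_units("1 \u03BCg = 0.001 mg"))   # 1 ug = 0.001 mg
print(ascii_safe_units("1 \u00B5g = 0.001 mg"))   # 1 ug = 0.001 mg
```

The point of the guard is the failure mode. A tool that raises an error on a character it does not understand costs you a minute; a tool that quietly substitutes "m" costs you a factor of 1000.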

Yes, if you think LaTeX is never evil, try the Hufflepuff test.

How many F’s are there in this sentence? Put your cursor here and search for “f”.

Huﬄepuﬀ, my Hufflepuff!

Well it depends on what you mean by “f”. A human may see 8. A machine sees only 4. Because the code points in the first word represent ligatures. Let’s look in a preformatted (monospace) font which shows the two words are different.

Huﬄepuﬀ, my Hufflepuff!

There are 7 code points in the first, 10 in the second. The first will NOT be found by natural language processors (unless they expand out ligatures). This is what I created in HTML:

<html>

Huﬄepuﬀ, my Hufflepuff!

<pre>

Huﬄepuﬀ, my Hufflepuff!

</pre>

</html>

(Yes, I omitted the encoding). The first word defines two Unicode code points which represent the ligatures. I doubt there is any chemical text processing in the world which treats ligatures properly. We put a lot of effort into trying to recover from em-dash instead of minus. All because the tool sets and the publishers create underspecified information.

So what happens when we paste our sentences into TextPad? You should be able to guess:

Hu?epu?, my Hufflepuff!

The ligatures are replaced by “?”.
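The same test can be run in a few lines of Python 3. The ligatures are written as escapes so that this snippet itself survives cut and paste. The last step uses Unicode compatibility normalization (NFKC), which is one way a text processor could expand ligatures before searching; whether any given chemistry toolkit does so is exactly the question.

```python
import unicodedata

ligatured = "Hu\uFB04epu\uFB00"   # U+FB04 is the ffl ligature, U+FB00 the ff ligature
plain = "Hufflepuff"

# Code point counts, and what a search for "f" actually finds.
print(len(ligatured), len(plain))               # 7 10
print(ligatured.count("f"), plain.count("f"))   # 0 4

# A tool without Unicode support: each ligature collapses to a single "?".
print(ligatured.encode("ascii", errors="replace").decode("ascii"))   # Hu?epu?

# NFKC normalization expands the ligatures back into ordinary letters.
expanded = unicodedata.normalize("NFKC", ligatured)
print(expanded == plain)                        # True
```

Note the order of operations: normalization only helps if it is applied while the ligature code points are still there. Once they have become question marks there is nothing left to expand.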

So this is a long post. There is no absolute solution as the toolset and the publication process are almost universally broken. Science progresses one funeral at a time, so it will take some years. But here are some rules:

USE XHTML with UTF-8. It costs nothing. It’s free. You don’t have to pay for it. You won’t break copyright.
USE A UTF-8 COMPLIANT TOOL. Most good Open tools will do this.
TURN OFF ALL SMART TYPESETTING IN YOUR TOOLS. QUOTES, MINUSES, LIGATURES.
THINK ABOUT THE PROBLEM.
USE MONOSPACE (COURIER) FOR IMPORTANT SCIENTIFIC INFORMATION, DATA, CODE. ARGUE WITH ANY PUBLISHER WHO TRIES TO RESET MONOSPACE INTO GARBAGE.
WRITE TO ANY PUBLISHER WHOSE PAGES FAIL TO DISPLAY FUNDAMENTAL SCIENTIFIC INFORMATION CORRECTLY in all browsers.
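As a sketch of the UTF-8 rules in practice, assuming Python 3: always name the encoding when writing and reading, and test that the text survives the round trip. The function `round_trips` and the file name are hypothetical, chosen only for this illustration.

```python
import tempfile
from pathlib import Path

def round_trips(text, write_encoding="utf-8", read_encoding="utf-8"):
    """Write text with one explicit encoding, read it back with another, compare."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.txt"   # hypothetical file name
        path.write_text(text, encoding=write_encoding)
        return path.read_text(encoding=read_encoding) == text

sample = "1 \u03BCg = 0.001 mg at 25 \u00B0C"

# Explicit UTF-8 on both sides: the text comes back unchanged.
print(round_trips(sample))                           # True

# Reader guesses Windows-1252: no error is raised, the text is just wrong.
print(round_trips(sample, read_encoding="cp1252"))   # False
```

The second call is the dangerous one. Nothing fails, nothing warns, and the data is different from what was written.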

