There are other evils than PDF: what’s the problem here?

Sunday, January 30th, 2011

I am writing a parser for the #quixotechem project – in this case NWChem output. The output is generated AFAIK by FORTRAN. I am having difficulty parsing it. Why?

Here is some text I can parse (deliberately a snapshot from a text processor):

And here is the evil output:

What unexpected horror (or semi-unexpected, as I’ve had it before) has caused me to waste a lot of time?


Here’s a very strong hint. This is what I get when I load it into my text editor. Whatever has happened? And what could I do to make it at least human readable?

And what piece of code did I have to write last night to solve the problem in future?


The Blue Obelisk 5+ years on

Wednesday, January 26th, 2011


Not only was the blue obelisk a theme at #pmrsymp, but Brian Brook had MADE a blue obelisk! It’s made of wood and nearly 2 metres high. Many of the attendees signed it.

And then I went to San Diego where 5+ years ago we first met under the Blue Obelisk. Yes, there is a real Obelisk. In fact there are two – the Greater Blue Obelisk and the Lesser Blue Obelisk. We met under the “blue obelisk” – most of us found this one:

But Geoff Hutchison found this one!

Geoff waited for half-an-hour and we waited for half-an-hour and each of us thought “how could you possibly not find the obelisk in Horton Plaza”. Luckily we went to look for each other – without that the Blue Obelisk might never have happened.

Anyway I was there because Aaron Culich drove me. Aaron’s a compsci, very interested in how we interact with molecules, and he has some great ideas for “strongly typed parser combinators” which will create a domain-specific-language for us.

And he took a picture of me

So the Blue Obelisk is flourishing. More obelisks have found their way to people – I am very flattered by mine. Much more software has been written. Our sphere of influence increases. Our code is taken seriously.

J. Cheminformatics (Open Source) is running a special issue from the PMR Symposium and if anyone has an idea for a Blue-Obelisk-related paper let me or Christoph or David know. We’ve already got a lot of people who say they will contribute.

Until the next time


Beyond the PDF – the good the bad and the future?

Monday, January 24th, 2011

I’ve finished a hectic 9 days with #pmrhack, #pmrsymp and #beyondthepdf. I’ll need weeks to do justice to all of these. So a few immediate reactions to BtPDF.

A mixed community, mainly scientists, some IT/Library people, some ICT companies and some publishers. Perhaps 50 people. Wonderful hospitality. Great discussions and presentations (with a strong bio-flavour – this was anticipated).

Surprisingly the most exciting stuff happened with the group of people staying on till today with no formal business. We’ve started a “Writing” project which now takes Peter Sefton’s “Fascinator” as the framework. I’d seen it before in USQ / Toowoomba but it’s come on a lot. It wowed everyone who saw it (many didn’t as they were in other sessions). I am sure it is one way that we manage the results of Quixote.

There is also a large groundswell of those who want to develop new types of scholarly communication. So, for example, can we publish our 2 days of #pmrhack? We think so and I’ll be tackling this when I get back. The bio- community is into ontologies, annotations, etc. (There was relatively little non-bio – we didn’t see any quantities with units, any chemical diagrams, any maps, any maths, etc.). But lots of excitement.

On the other hand I am increasingly depressed by the position of publishers. There is an assumption (by them) that they “own” the scholarly content and that they give us permission to use it. If you think I am being over-reactive, read Richard Poynder interviewing Springer. Springer is simply out for what it can milk the academic community for – there is nothing about how Springer brings benefit to academia. Scientists are living in occupied territory – we need permission to use our own creation. We know that there needs to be money in the system – and we discussed business models in the meeting. But unless we scientists design these business models for our benefit we will simply be manipulated by the commercial world.

Plane is leaving – thx to San Diego airport for free wifi


Beyond The PDF Presentation

Thursday, January 20th, 2011


The presentation was a technical success – virtually everything worked. Not sure we got speech back from the skype. But otherwise great communication.

Several vocal contributions over skype, including Daniel Mietchen from the audience!

No idea how and where it has gone down. There is already a lively debate about PDF vs the rest and closed source vs Open source.

Scholarly communication: let’s eat our own Dogfood

Thursday, January 20th, 2011


THIS REPRESENTS AN INTERACTIVE PRESENTATION TO ON 2011-01-20. The large letters are for projection


Just go to this page and give your name. Don’t go in unless you want to ask a question – I will be showing the site


This post tries to practise (Wikipedia)

rather than using Powerpoint (Power corrupts; Powerpoint corrupts absolutely. this kills kittens)


(used with thanks and probably violating copyright – not fair use but funny).

In case you are still reading I shall use this blog to present my message. Not “slides” but communication by “writing” – a form of linear communication. I’ll try to make this a “conversation” in places

“All Science is hypothesis-driven”.

No. Lots of it is data-driven. Part measurement and observation; part computed. The result is data which can be (and often is) published for its own sake. The fourth paradigm. Problem is:

  • Many scientists are arrogant about data publication
  • Many publishers can’t be bothered. Many that are bothered sell our data back to us (ACS, Elsevier are examples in chemistry).

OK – but data is hard, so it’s too expensive to do. All that collecting, standardising etc.

No it’s not. Vannevar Bush proposed the Memex (1945) to remember everything we do.

But that’s impossible

No – it’s possible if we wanted to do it. We make it a way of life and it becomes tractable


We’re already doing it. I’ll show 3 projects. Based on communities. Not how can I do this, but who wants to work with us.

But that means you work with your competitors!!!

I don’t have competitors. Just apathetics. Anyway here’s the projects. They are all ZERO cash.


They aren’t funded – they work with volunteer resources and marginal cpu+bandwidth+storage. Like Galaxy Zoo. OpenStreetMap. All results are O-P-E-N. Licensed with PDDL or CC0.

  • crystallography (Crystaleye). Goes out every night and crawls the scientific literature for crystallography. 200,000 separate structures. Pulls back data files (CIFs), indexes, validates and provides domain search. Sustainability with IUCr.
  • chemical reactions (Greenchain reaction and results). We trawled the patent literature for chemical reactions. 100,000. Hypothesis: reactions are getting greener (extract the solvents)
  • and compchem (Como (example) and Quixote).

But Crystaleye only has half the literature…

…Yes, because Wiley, Elsevier and Springer don’t publish Open supplementary info.

Well, just ask them…

…I have – several times. They never reply. Readers don’t matter.

Stop ranting…

… sorry.


So were the reactions getting greener …

… not that I could tell. But we’ll do another run.

But patents are grotty. Designed to conceal information. Why don’t you use the published literature? There’s 10,000,000 reactions reported…

… answer it yourself

Oh, I see. Only 1,000 are Libre Access…

… you’re getting the message


And Quixote. Why the name?…

… because PMR has mad expectations. And because it started in Spain…

So what makes it different?

… well NO-ONE publishes comp chem data. And it’s the easiest and best data of any discipline. The laws of physics work. So we will scrape scientists’ disks and convert them to semantic

You’ll have to give them the program…

… Yes we’ll let them use Avogadro. It’ll front up all the tough stuff. It’ll act as a computational Memex.

…Here’s Quixote. Also

I don’t like the black background…

… it isn’t black. It just looks black. Browsers have a little way to go – see



Wednesday, January 19th, 2011


The symposium was a great effort and I personally am very pleased. I believe it was a success – about 100 people came and an unknown number attended virtually (probably about 20). I’m hoping we can capture pictures, reports, etc. But here are some main points:

  • Invited speakers. All excellent and fitted together well. There was NO coordination between speakers – this was deliberate – and delegates had to make their own sense.
  • Media. We had all sorts. Powerpoint. Html. Local. Web-based. Recorded slides. With/out audio. Skype conference. Silent and spoken. Slides and demos.
  • We had to communicate and record this
  • The timescale was ultra-tight. So when we presented the results of our group we split it into 2 parts. (a) 15 mins, with 10 mini-presentations. Timed. To the second. (b) 30 mins, 7 demos, with people doing them but me speaking over them. It worked!
  • John Wilbanks contributed a recorded session. Unfortunately we didn’t have a clear way of getting questions asked at the end. Sorry, John.

We had several demos; existing ones

  • Crystaleye
  • ChemicalTagger/OSCAR
  • IsItOpenData

And those hacked over the hackfest

  • Locative molecular art (David Murray-Rust). Several people were able to “see” local molecules in their phones so we are developing this and I’ll tell you more soon.
  • Spoken OPSIN. We are able to speak “Caffeine” into an Android phone. The phone interprets the speech (sending it back to Google for interpretation), then sends the text to OPSIN, and the structure is then sent back to the phone. In seconds or less. Awesome (as they now say). Sam Adams, Daniel Lowe, etc.
  • Kinect molecules. Several people (DSM-R, Ben O’Steen, Dan Hagon, …). You can now dance in front of a Kinect and rotate a molecule or change its geometry. Fantastic.
  • Ami. Showing computer vision and intelligent molecules

And then the rest

  • Tom M-R painted some molecules specially for the meeting
  • Brian made a huge blue obelisk

Sue and Emma put on a wonderful day and evening and looked after speakers etc.

And then Julian and Steve. They did a really marvellous job on streaming. Picked up on mobiles and PCs throughout the world. They spent 3 days preparing. Tried several approaches. Streaming is NOT trivial. For example if you have too many streams the machine may overheat (I never thought of that). How do you route a skype conversation to remote audio? How do you manage questions? Etc.

So – all in all a great occasion. Thanks to everyone

(Oh, I am not retiring)

Jim Downing Blue Obelisk

Wednesday, January 19th, 2011


I am catching up on lots of blog posts that I should have done – and this is the oldest. It’s Jim Downing being presented with a Blue Obelisk last summer.

Jim has made an outstanding contribution to our group and to the JISC community in general. Without his vision and persistence we would not have SPECTRaT, CLARION, #jiscxyz etc. and we would therefore not have Quixote or OSCAR4. Jim has also set up a software development infrastructure including .

And he has created the Lensfield vision which has inspired GreenChain reaction and Quixote.

Jim has gone into IT-driven wealth-creating industry and I am expecting this to be a glittering success and one that we can all take satisfaction from.

PMR Symposium Hackfest and Blue Obelisks

Monday, January 17th, 2011

#pmrsymp #blueobelisk #pmrhack

Wonderful three days. Lots on twitterfall – just follow the hashtags. Hackfest achieved a breakthrough with:

  • Dave M-R’s Spook Molecules in Cambridge (many people in the audience were able to see them)
  • Ami – intelligent fumecupboard
  • Kinect molecules – sensational – again ca. 6 people involved in this hack. Gesturing to molecules to rotate them and change bonds!

PMRSymp was run literally to the minute – as planned. Real-life talks, recording, Skype, Twitter – everything. Ca. 15 people involved in rapid presentations. Splendid main talks. More later

Three blue Obelisks. Henry Rzepa, Dan Zaharevitz, Sam Adams. Also Jim Downing who got one last summer and I forgot to blog it.


#pmrsymp Vision of a Semantic Molecular Future WILL BE STREAMED

Monday, January 17th, 2011

Just about got everything ready. Get latest details from

#pmrhack great success – will demo results in evening

Please follow and tweet. We may take questions from the twittersphere if we can manage it

Must rush

Defending the Public Domain: Open Bibliography

Saturday, January 8th, 2011


A comment on my latest post on Open Bibliography deserves a full reply

Marius Kempe says:

January 8, 2011 at 7:42 pm

This is very heartening to read. Might I ask for one clarification, which I think you’ve addressed elsewhere but I can’t find: is it actually possible to copyright bibliographic data? Are they not un-copyrightable facts, automatically in the public domain? Or is that in fact true, but we need this effort anyway to combat publisher FUD?

The reason why so much of my effort goes into creating Open tools is exactly this – a mixture of unclarity and default or deliberate FUD.

I believe that many things “are” in the public domain but that many other people think they are not. The problem is that there is generally no simple way of determining the answer. The issue arises mainly from the automatic nature of copyright. If I create a work then the copyright automatically attaches to me. This blog is my copyright. Even if I do nothing it’s my copyright. I don’t have to register it, I don’t have to defend it. Until seventy years after my death (in UK, it varies between jurisdictions) it’s my copyright or my estate’s. Even little bits of it are copyright. If I create a song called “Defending the Public Domain” then that phrase is copyright.

Copyright is generally a civil matter though again this varies between jurisdictions. That means that a breach is not prosecuted by the state, but by an aggrieved individual or organization. If someone violates my copyright then my recourse is to the law. The ultimate decision is in courts with highly paid lawyers – there is not normally a copyright tribunal or office which gives objective judgments.

The copyright symbol does not determine whether or not something is copyright, but it’s a very powerful indication that the person adding the symbol believes they own or control the copyright. Since copyright is a matter of law, violating copyright can be seen as violating law. Most people – like me – believe in the power of the law and have an aversion to breaking it. Therefore if someone claims copyright ownership of something most people will accept that unless they have a direct commercial interest and have the financial and legal muscle to fight it.

Note that if something is “in the public domain” then no-one owns it. Therefore there is no-one to fight for it if someone else claims it is their copyright. It requires a defender of the public domain and this is not easy to get support for. To some extent the EFF and FSF do this for code, but no one does it for bibliography.

So here’s a typical problem. I quote from Wikipedia on

The Dewey Decimal Classification (DDC, also called the Dewey Decimal System) is a proprietary system of library classification developed by Melvil Dewey in 1876; it has been greatly modified and expanded through 22 major revisions, the most recent in 2003.[1]

Administration and publication

While he lived, Melvil Dewey edited each edition himself: he was followed by other editors who had been very much influenced by him. The earlier editions were printed in the peculiar spelling that Dewey had devised: the number of volumes in each edition increased to two, then three and now four.

The Online Computer Library Center of Dublin, Ohio, United States, acquired the trademark and copyrights associated with the DDC when it bought Forest Press in 1988. OCLC maintains the classification system and publishes new editions of the system. The editorial staff responsible for updates is based partly at the Library of Congress and partly at OCLC. Their work is reviewed by the Decimal Classification Editorial Policy Committee (EPC), which is a ten-member international board that meets twice each year. The four-volume unabridged edition is published approximately every seven years, the most recent edition (DDC 22) in mid 2003.[4] The web edition is updated on an ongoing basis, with changes announced each month.[5]

The work of assigning a DDC number to each newly published book is performed by a division of the Library of Congress, whose recommended assignments are either accepted or rejected by the OCLC after review by an advisory board; to date all have been accepted.

In September 2003, the OCLC sued the Library Hotel for trademark infringement. The settlement was that the OCLC would allow the Library Hotel to use the system in its hotel and marketing. In exchange, the Hotel would acknowledge the Center’s ownership of the trademark and make a donation to a nonprofit organization promoting reading and literacy among children.

Melville Louis Kossuth (Melvil) Dewey (December 10, 1851 – December 26, 1931) was an American librarian and educator, inventor of the Dewey Decimal System of library classification, … Dewey copyrighted the system in 1876.

Here is my amateur analysis of the situation. If you take away one fact it should be that nothing is simple, and much is not algorithmic. Let’s assume that the work was created in the US and that Dewey is dead and has been since 1931. That’s 2010-1931 = 79 years dead. Here’s the US copyright law

How long does a copyright last?
The term of copyright for a particular work depends on several factors, including whether it has been published, and, if so, the date of first publication. As a general rule, for works created after January 1, 1978, copyright protection lasts for the life of the author plus an additional 70 years. For an anonymous work, a pseudonymous work, or a work made for hire, the copyright endures for a term of 95 years from the year of its first publication or a term of 120 years from the year of its creation, whichever expires first. For works first published prior to 1978, the term will vary depending on several factors. To determine the length of copyright protection for a particular work, consult chapter 3 of the Copyright Act (title 17 of the United States Code). More information on the term of copyright can be found in Circular 15a, Duration of Copyright, and Circular 1, Copyright Basics.

So although Dewey copyrighted the system he’s now been dead for over 70 years so the original copyright has expired. The fact that someone bought the copyright doesn’t affect its duration. (BTW distinguish copyright from trademarks). So the original DDC is in the public domain.
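The arithmetic behind that paragraph can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumption as the amateur analysis (a flat "life plus 70 years" term); the function name, the constant and the choice of this post's date as the reference point are all made up for the example, and it is not a legal test of anything.

```python
from datetime import date


def full_years_between(start: date, end: date) -> int:
    """Return the number of whole years elapsed from start to end."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


# Dates taken from the text: Dewey's death (Wikipedia quote) and this post.
DEWEY_DIED = date(1931, 12, 26)
POST_DATE = date(2011, 1, 8)

# Assumption for illustration only: a flat life-plus-70 term, as in the
# simplified reading above. Real terms depend on publication date etc.
TERM_AFTER_DEATH_YEARS = 70

years_dead = full_years_between(DEWEY_DIED, POST_DATE)
term_elapsed = years_dead >= TERM_AFTER_DEATH_YEARS

print(f"Years since death: {years_dead}; life+70 elapsed: {term_elapsed}")
```

Even this toy version shows why "nothing is simple": the single constant hides all the factors (publication date, work for hire, jurisdiction) that the Copyright Office text says can change the answer.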

Note that in the US a “work of the United States Government” is a work prepared by an officer or employee of the United States Government as part of that person’s official duties. And …

§ 105. Subject matter of copyright: United States Government works

Copyright protection under this title is not available for any work of the United States Government, but the United States Government is not precluded from receiving and holding copyrights transferred to it by assignment, bequest, or otherwise.

So, assuming the Library of Congress staff produced their DDC work as part of their official duties (and I’m guessing they did) then their work is in the Public Domain in the US (and by extension elsewhere).

So, at a first reading the DDC is not copyrighted. However my guess is that every new version is freshly copyrighted and that the copyright subsumes the public domain material so that the copyright will be extended indefinitely. You may believe this is a good idea, or you may feel that it is unjustifiable. If the latter, you’ll have to hire a US lawyer, show you have a case (e.g. that you have suffered financial loss) and spend a lot of time and money.

So my analysis is that it’s unclear whether DDC is copyright. In practice OCLC says that it holds the copyright and most people and organizations go along with that whether they want to or not.

It’s trivial to add copyright symbols to a document. I can write © Peter Murray-Rust on this document. Do I have the right?

  • Yes, I wrote it
  • Hang on, you didn’t – you pinched some of it from Wikipedia.
  • Well, yes – but it’s very tedious to acknowledge it. They’re not going to sue me
  • You’ve also pinched stuff from the US government
  • That’s OK it’s in the Public domain
  • I suppose you can do that – but only in the US


By this time everything has become subsumed under my blog.

It’s absolutely universal for content providers to spray copyright symbols on everything – marking their territory. If I ask anyone in academia whether I can re-use it without permission they are all so hexed-out by the magic symbol © that they will automatically say “no, you can’t use it without permission”. Many of them run in awe or terror of the large content providers, who occasionally sue people. Of course the music industry and the film industry are best known but it also happens in academia and scholarly publishing.

So, Marius, back to your question.

Is it possible to copyright bibliographic data?

Yes – just add your copyright symbol

Is that legal?

It’s not against the criminal law. Find out by hiring a lawyer.

So this is why we are identifying the problem. Pointing out to the community that there is a problem. That the problem costs us hundreds of millions of dollars a year. That as academia we have to start asserting our rights.

And the first step in asserting our rights is to define them.

Now I am hoping that libraries and their bosses will see that this is in their interests. And support Open Bibliography. And start asserting humanity’s right to it.