Monthly Archives: June 2009


Puntcon was a great success: wonderful weather and probably more than 40 people of all ages. Many travelled up from London and there was a good representation of geeks (e.g. people working, often in startup companies, on software or webby things). Also many people from the culture of openness and digital democracy: #mySociety, #OpenKnowledgeFoundation, #OpenRightsGroup, etc.

I met Cory Doctorow and Alice and next generation (I am terrible with names) as we were punted smoothly up the river by Citizen Pollock. Cory and I are sharing a platform in #ILI2009 in October. He's a science fiction writer and we talked about the values of self-publishing; it's becoming an interesting option, though whether this scales across the field is not so clear. He will be talking very authoritatively on the new aspects of publishing at #ILI2009 and I'm feeling I have very little positive to say. Maybe by then I will have got some input from the library community; I haven't so far.


The return. AdamAmyl (red shirt) is involved in running WriteToThem. Other annotations in comments are welcome.


Bill Thompson (organizer) with Becky Hogge


Ross Anderson (Cambridge) talking with BillT and others. Andrew Walkingshaw (centre, black shirt), who worked on MaterialsGrid in our group and is now a founder member of Timetrics.


Half of the group (Rufus Pollock, centre, green shirt).

See also (chaileyf)


from joannejacobs

So if you missed it, some of the people will be at OpenTech on Saturday at ULU London, so come anyway.

Effective digital preservation is (almost) impossible; so Disseminate instead

I was just about to go back to refactoring Chem4Word, when I saw this pingback on my blog and just have to comment. It's really important. More of my comments at the bottom...

Which blogs should be preserved?

Richard M. Davis on 26th June, 2009 at 12:00 pm

You'd think it obvious that my blog should be preserved, though I'm not so sure about yours! According to the poster summarising the fascinating 2007 survey by Carolyn Hank et al: "The majority of bloggers agreed (36%) or strongly agreed (34.9%) that their own blogs should be preserved." Five per cent don't want their blogs preserved at all; nearly a quarter aren't fussed either way.

Here's one of the data tables (which I had to retype as HTML; Peter Murray Rust is right about PDFs and data):

Table 4. Preservation perceptions: general

Columns: Strongly agree or agree | Neither agree or (sic) disagree | Strongly disagree or disagree

Should preserve: Personal blog; Every blog; Every comment; All online content

Should not preserve: Some blogs; Some comments; Some online content

The overall pattern seems a good vindication of our own project approach, which will progressively move from capturing blog content (posts), to addressing comments and content, reflecting the scale of the bloggers' own priorities.

It also seems a useful juncture in our project to throw open the question: which blogs should we preserve?

With over 5 million active blogs noted by Technorati, it seems daft to even start to enumerate them, but in our field (libraries, archives, information science) several stand out, and it's the very nature and importance of these that bolster the case for keeping them. I have in mind in particular Peter Suber's Open Access News blog, but also blogs such as those of Peter Murray Rust, Brian Kelly, Lorcan Dempsey, Dorothea Salo and Jill Walker Rettberg, all ripe with contemporary accounts and robust views on matters of scholarly communication. But in every case, we have cause to wonder: will that information survive, will that link still work tomorrow?

What blogs (or types of blogs) do you think should be preserved, and why?

PMR: This is really important. Blogs are evolving and being used for many valuable activities (here we highlight scholarship). Some bloggers spend hours or more on a post. Bill Hooker has an incredible set of statistics about the cost of Open Access and Toll Access publications, page charges, etc. Normally that would get published in a journal no-one reads (I have even published in such; it was a huge effort and it's got one citation. Not that I care about citations). So I tend to work out my half-baked ideas in public. Some people do their early science in the Open. Some are activists. Some review the current landscape, etc.

But preservation is really really difficult. I don't know how to tackle it. Since 1993 I have been determined to preserve my digital record.

And I've failed.

I've created courses, forums, data sets, teaching-learning objects, blogs, preprints, etc.

And I've lost most of them.

There are many reasons. First it's extremely hard to preserve complex digital objects. The problems include:

  • compound documents (and only after 15 years is the web coming round to realising this is important)

  • hyperlinks

  • moving URLs/URIs

  • formats

  • semantic behaviour

  • disorganised humans (me)

  • moving institution (4 times)

  • moving computer (about 10 times)

Henry Rzepa and I have worked hard on this and he is more organized than I am. We put early versions of JUMBO on CD-ROMs and got the RSC to distribute them with an issue of the journal. I have saved things on DAT tapes from the SGI. DAT??? SGI??? I don't have a machine which will read 3.5-inch floppies at home. I have trashed my much beloved BBC Micro.

Every time I change machine I lose large amounts of data.

At some stage someone will invent a true Memex for my digital activities. Until then:

Preservation is effectively impossible.

So what's the answer? The only one I can think of at the moment is to disseminate as widely as possible. If people want to read your material they will take copies (if that is technically possible). I would urge University Repositories:

Stop agonizing about preservation and start disseminating.

If it's worth preserving, the web will have a reasonable chance of containing it somewhere. If it's not, well, history will judge whether our current dross are the jewels of the future. We can't tell.





Peter Corbett and the OSCAR3 award

Today was a sad and happy event in that we said goodbye to Peter Corbett. Peter has been the chemistry lead in the SciBorg project and has made major contributions to understanding chemical documents and chemical language. He has developed the OSCAR3 program which many people (citation needed...) regard as the leading tool for chemical entity identification and extraction. In simpler terms OSCAR3 can analyse a document (as long as it's not some awful bitmap or grunged PDF) and identify the chemical words and phrases. Peter has also written on the linguistic science of this: it's fairly easy to identify the word "pyridine" but this isn't enough. Peter identifies at least 3 uses of the term: the bulk substance ("a bottle of pyridine"), a part of a molecule ("pyridine rings are aromatic") and a molecule itself ("the pyridine molecule has C2v symmetry"). He's written at length about this in his latest blog post.
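To make the "finding the word is not enough" point concrete, here is a minimal sketch of the easy part only: spotting chemical names from a fixed word list. This is not how OSCAR3 works; the lexicon and the function name `find_chemical_names` are assumptions made up for illustration.

```python
import re

# Tiny, hypothetical lexicon for illustration only.
CHEMICAL_NAMES = ["pyridine", "benzene", "ethanol"]

# One case-insensitive pattern that matches any lexicon entry as a whole word.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in CHEMICAL_NAMES) + r")\b",
    re.IGNORECASE,
)


def find_chemical_names(text):
    """Return (start, end, matched_text) for each lexicon hit in text."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]


sentences = [
    "a bottle of pyridine",                    # bulk substance
    "pyridine rings are aromatic",             # part of a molecule
    "the pyridine molecule has C2v symmetry",  # the molecule itself
]

for sentence in sentences:
    print(sentence, "->", find_chemical_names(sentence))
```

All three sentences produce a hit on the same word, and nothing in the output says which of the three senses is meant. That disambiguation is the harder linguistic problem.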

OSCAR wasn't the primary scientific reason for the SciBorg project but Peter found time to develop a major tool. This is now being refactored by the OMII group so that it can be run standalone, as a service, as a component in a pipeline, as a chemistry checker in a word processor, etc. So it was natural to honour this as Peter leaves us.

So here's Peter's OSCAR. It's labelled:

Peter Corbett


Unilever Centre, 2005-2009

Peter is taking up a position at Linguamatics, a Cambridge-based company with activities in text-mining and other things. I am always proud when people leave us with positive motivation, and it's important for the future of the UK that this type of work flourishes because it will generate a lot of wealth in the coming decades. (And the UK could do with some wealth.)

Peter loved linguistic challenges, especially with ambiguity ("time flies like an arrow" can be parsed as "fruit flies like a banana"). Another is the conjunction of (nounal) adjectives ("pretty little girls school"). So I described him as

A pretty large Unilever Centre language processor domain expert

which will keep most tree-banks busy for a bit.



The Guardian highlights the eScience pollution project

Well! My appeal for volunteers for the pollution project came in a most unexpected way: it was picked up by the Guardian. The Guardian are champions of Open data (Free Our Data) and as you read their report of my blog you'll see the considerable and valuable amplification about the role of OpenStreetMap. (I didn't highlight this aspect although I have praised OSM before.) I'll quote almost all of it (without permission, but as fair use):

A chance encounter in the coffee lounge of the Cambridge chemistry department could lead to real-time maps of pollution in the city, as an offshoot of an EU project that is nearing completion in the city.

A team in Cambridge which has been running the Cambridge Mobile Urban Sensing project (CamMobSens) will begin equipping volunteers on bicycles and on foot with mobile phones and pollution sensors linked by Bluetooth.

The sensors will monitor the levels of carbon monoxide and NOx (nitrogen oxides) in the city air and relay them to satellites, which will pass them directly to openly accessible databases being run by the project.

The government does provide an overview of air quality at its own site, but the data is not real-time, and is not mapped in detail, although it is possible to get a Google Earth download which will show the air quality as measured by roadside monitors.

But Professor Peter Murray-Rust, who happened to meet Mark Calleja, the head of the CamMobSens project during a coffee break, has now suggested that the results could be mapped in real time onto the free open-source maps provided by the OpenStreetMap project, a British-inspired project which uses volunteers using GPS locators to create maps of cities and, in time, countries.

Murray-Rust is now appealing to the OpenStreetMap team to get in touch with Calleja, so that by the time the project begins later this summer it will be possible to add the pollution information immediately to maps from OpenStreetMap.

The advantage of using OpenStreetMap rather than online maps from the UK's official mapping agency, the Ordnance Survey, is that there are no copyright implications in the addition of data to the maps - and no limits or charges on viewing of the maps. Ordnance Survey has recently eased its restrictions on non-commercial organisations using custom online maps through a specialised web interface, but issues remain over its licensing of unrestricted access to those maps.

If it succeeds, it won't be the first time that Cambridge's university coffee has had a dramatic effect. In 1991 a group at the Cambridge Computer laboratory aimed a webcam at a coffee pot downstairs from their laboratory so that they wouldn't have to walk downstairs to find out if it was empty or full. The webcam was later put onto the world wide web - and engineers at Microsoft showed it to Bill Gates in 1994 to persuade him that the web would be important by making it feasible for people anywhere to stay in touch - even with their coffee pots.

PMR: thanks Charles and the Guardian. A responsible and readable account. [I should make it clear that Mark Calleja is not the project leader.]

I asked Mark for pictures of the sensors and he replied:

Here are two pictures of a sensor (note coin for scale [ca. 1.8 cm diameter]), and one of the sort of phone we use (Nokia N80, and O2 give us free SIM cards to use).



I gather that the Guardian article has got people interested in collaborating; please don't approach me but the project, as posted yesterday.

Geek Puntcon Cambridge June 26 2009

Every year Bill (no further metadata required) runs a puntcon for geeks. "Geeks" covers a wide range of affiliations and ideals. There will certainly be a good representation from those who want to Open up the way we do things.

I shall, of course, claim this against expenses as it's clearly part of my work. We'll probably pass some ducks.


Cambridge, Sunday June 28, 2009

After the undoubted success of our earlier ventures, we're going to head off up river again on Sunday June 28.

The Invite

You are invited to PuntCon V, or the seventh great geek punt picnic, to take place on or about the River Cam on the afternoon of Sunday July 13th 2008. We will be heading upriver rather than along the backs: more picnic places, fewer tourists.

As before, turn up outside the Mill public house on Mill Lane between 1200 and 1230. We will head off between 1230 and 1300; if you are late you can walk up river and catch us as we don't punt very fast!

Bring something to drink and something to eat.

I will provide bread, plates, cutlery, glasses and more food/drink

We will take as many punts as we need [one for every six people] and head up river to a convenient picnic place [eg Grantchester Meadows] where we will eat/drink/carouse.

We normally get back around 1800. Those heading back to the station can be dropped at a bridge within walking distance.

Post punting we have the option of retiring to the pub and letting Sunday evening happen around us.

How Much

There is no registration fee or indeed any other cost. Bring food and drink and entertainment. Punt cost will be split between all comers; it works out around a tenner per person. Infants are not expected to contribute.

Why PuntCon?

Conferences are fun, but don't have ducklings. Or champagne. So the legendary Geek Punt Picnic has morphed into PuntCon, the Cambridge leg of the alternative conference circuit.

In keeping with tradition there will be no talks, no presentations, no agenda and nothing to disturb the quiet delights of the river on a Sunday afternoon. But apart from that, it's a conference and therefore probably tax-deductible.

What happened in previous years?

This Flickr photoset should give you all the information needed.


Let me know if you're up for it, but come along even if you didn't. Bring friends as the event is scalable; let me know approx numbers if you can be bothered. Email me for more. If you use Facebook then there is also an event page.

The Invitees

Please feel free to invite other people; it's a big river and there are lots of punts. It would be nice (but is by no means essential) if I had a rough idea of numbers in advance so I know how much bread to get, but there's a Sainsbury's five minutes' walk away anyway.

How to Get There

The event takes place at Scudamore's Boatyard, at the corner of Mill Lane and Granta Place, Cambridge.

Details on the Scudamore's website.

Can we mashup pollution data onto OpenStreetMap in realtime?

Sometimes a fantastic idea hits you in a millisecond and that's just happened to me at Coffee in the Chemistry Department. I happened to bump into Mark Calleja (who is part of our eScience (eMinerals) collaboration) and he told me about their latest project (Cambridge Mobile Urban Sensing)

CamMobSens is the Cambridge end of the MESSAGE project, a collaboration between Cambridge University, Imperial College London, Leeds University, Newcastle University and Southampton University. In Cambridge we mount sensors on pedestrians and cyclists to monitor pollution and send back the information to a website as soon as it is gathered.

Carbon Monoxide (CO)


Nitric Oxide (NO)

Team members:

Prof. Jean Bacon, Department of Computer Science         
Dr. Mark Calleja, Cambridge eScience Centre
Mark Hayes, Cambridge eScience Centre
Prof. Rod Jones, Department of Chemistry

Prof. Peter Landshoff, Department of Applied Mathematics
Dr. Iq Mead, Department of Chemistry
Michael Simmons, Cambridge eScience Centre
Dr. Eiman Kanjo, Department of Applied Mathematics

Now anyone who knows anything about OpenStreetMap (OSM) will immediately make the connection as I did. OSM has been built by the voluntary efforts of zillions of pedestrians and cyclists who have used GPS to map the world. They've now built the best map of Cambridge and at least one has cycled every street.

Mark and colleagues need volunteers to go out and monitor pollution on a regular basis. The technical aspects are solved: a mobile phone in one pocket and a sensor in the other emitting Bluetooth. The signal is routed to satellites and then to an Openly accessible database run by the project. All you have to do is follow the simple instructions of the project.
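As a rough illustration of what "mashing up" might involve on the data side, here is a minimal sketch that turns located sensor readings into GeoJSON, a format that common web-mapping tools can draw as an overlay on OpenStreetMap tiles. I don't know the project's actual data format: the `Reading` fields, the function name `to_geojson` and the sample values are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Reading:
    """One hypothetical sensor reading; the field names are assumptions."""
    lat: float       # latitude in decimal degrees (from the phone's GPS)
    lon: float       # longitude in decimal degrees
    timestamp: str   # ISO 8601 time of the measurement
    co_ppm: float    # carbon monoxide level, units assumed to be ppm


def to_geojson(readings):
    """Convert readings into a GeoJSON FeatureCollection of points."""
    features = []
    for r in readings:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [r.lon, r.lat]},
            "properties": {"timestamp": r.timestamp, "co_ppm": r.co_ppm},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample values, roughly located in Cambridge.
readings = [
    Reading(52.2053, 0.1372, "2009-06-25T08:30:00Z", 1.4),
    Reading(52.2061, 0.1385, "2009-06-25T08:30:10Z", 2.1),
]

geojson_text = json.dumps(to_geojson(readings), indent=2)
print(geojson_text)
```

Each point carries its measurement as a property, so a map layer could colour the points by pollution level as new readings arrive.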

So I've asked Mark if I can be first on the list for the kit; volunteer activity starts in late summer. I cycle every day down East Road (at the top left of the CO picture). It must be one of the most polluted roads (although the bus station is worst).

So I am appealing to OSM volunteers IN CAMBRIDGE to contact Mark. (The idea is clearly adaptable to other cities but we shouldn't overwhelm the project at this stage.) If you know how to spread the word in the OSM and similar communities, please do so. There is no technical reason why this couldn't rapidly spread just as OSM has done.

More on Avogadro

More on Avogadro and the Blue Obelisk.

First, many apologies to Marcus Hanwell who is the real Doctor Who of Avogadro (but who has been temporarily transported through time and space due to his first child). It's always great to see new people joining.

Then to acknowledge the great synergy shown by Jan Jensen from Copenhagen, who has adopted and promoted Avogadro and got some stunning movies. These are both fun to watch and also show the very nice interface. Here are some posts; enjoy:

Nicking transition states from Nick Greeves

Vote early, vote often

A useful equation

The force is strong in this one

Just one of those links

It takes a village to solve a Jmol puzzle

Symmetry Prozac

An Atkins diet of Molecular Workbench

Some Jmol basics

Do I have to draw you a phase diagram?

Cool new build option in Avogadro 0.9.5

Building a Transition State


Tools of the trade

Getting started

Vote for Avogadro

Here is a really exciting message from Geoff Hutchison, a founder member of the Blue Obelisk:


Avogadro has been nominated as a finalist for the SourceForge
Community Choice Award for "Best Project for Academia":

This is a real honor for us, and we appreciate everyone who nominated
us for the award. I certainly didn't stuff the ballot box, so many of
you must have voted for us.

We haven't yet released version 1.0, but we're working hard on it. So
far, we've had over 40,000 downloads, been translated into ~14
languages on Launchpad.Net and are amazed and humbled by everyone
who's contributed in different ways.

PMR: Geoff has been working on a new OPEN SOURCE molecular editor. When I visited Geoff in Pittsburgh he showed it to me in the cafe at lunch.



So why is this important? It's because the Blue Obelisk has now reached critical mass and is able to build on what it already has. So Avogadro uses the library in OpenBabel to minimise molecules. The almost haptic-like feel of the build depends on the system optimising the molecule in real time. It is the first time I have ever seen a system which can easily convert a chair to a boat. And, of course, it is currently limited to a mouse. When we go touchy-feely then this will be way in front.

Geoff is the Doctor Who of Avogadro. There have been a lot of contributions and he told me of one person who had contributed a complete library. And Avogadro has happened within about 2 years. Most Blue Obelisk projects have taken 5 years or more to reach critical mass. That's of course a tribute to Geoff but it also represents maturity in the libraries (OpenBabel has taken ca 10 years) and the better collaborative and engineering tools. And the fact that an increasing number of people believe in the Blue Obelisk.

Remember that most BO software is not directly funded (many of the competing software projects like OpenOffice have contributors who at least in part are expected to donate code as part of their day-job). It's probably fair to say that some like OSCAR are funded on the margins of grants and OPSIN now has a full-time graduate student Doctor Who (Daniel Lowe). But most are found in the recesses of the early mornings and weekends. And they are often not approved of by the establishment: "what is X doing when they should be doing science?"

The Blue Obelisk software is like a series of telescopes. They will shortly reach the power of many commercial offerings and then they will go beyond them. That's because there is a great drive for innovation, for Open methods to ensure quality, for re-use of existing code. We've got a few problems to iron out (different libraries and OSs) but there is now enough redundancy.

So when you vote for Avogadro (as you will) you are not just voting for a piece of software but you are voting to add Openness to a major scientific domain which has been suffering from the darkness of closed source and hidden data for far too long. Just as mySociety is liberating our democracy, the Blue Obelisk is liberating chemistry.

OpenTech on July 4th (and FOI in London)

From mySociety Blog. This looks like a must attend day. My only problem is do I go as a wonky geek or a geeky wonk? I think the former.

But it's stuffed with great contributors: Heather Brooke, who tried for years to bring MPs' expenses into the light; Ben Goldacre, who I heard last year but who almost certainly has a different Bad Science.

And when I WroteToHim (using WriteToThem) on Net Neutrality I have only got acknowledgements. It's a race between getting a reply and being cut off by HADOPI-UK.

It is this sort of thing, not conventional politics, that keeps my faith in democracy alive.


mySociety blog » Share tips with 6 brilliant Freedom of Information experts on 4th July

By Francis Irving on Monday, June 22nd, 2009

Is there something part of the government is doing that you'd like to investigate? Find out everything from MPs' expenses, to the length of allotment waiting lists, to whether your council's Guy Fawkes bonfire is properly checked for hedgehogs.

mySociety are running a practical workshop on Freedom of Information at OpenTech on 4th July.

The workshop will help you make your first Freedom of Information request, including working out what to request, where to request it from and what exactly to write.

If you're an old hand, you can get and give tips on how to take requests further.

We've got a fantastic team of Freedom of Information (FOI) experts to kick things off and answer hard questions.

Heather Brooke used FOI to cause the furore over MPs' expenses.

Francis Davey is a lawyer with a specialist knowledge in FOI.

Elena Egawhary is a freelance journalist, currently working and using FOI for Panorama.

John Cross, Alex Skene and Richard Taylor are volunteers who run and improve WhatDoTheyKnow, and all use it for their own activism.

Bring a laptop if you have one. Internet will be provided for the workshop only, so we can scour Government websites, and make requests on mySociety's website.

As usual, the rest of OpenTech is brimming with great talks, and will be full of interesting geeky wonks and wonky geeks. Book your place here so you can go to them and to the workshop. Hurry, it's nearly sold out.

This entry was posted on Monday, June 22nd, 2009 at 8:21pm and is filed under Events, WhatDoTheyKnow. Follow responses to this entry (RSS2 feed).

Peters S and MR on NPG and fair-use

Recently Nature Publishing Group released a policy allowing users to text- and data-mine some of their content (specifically that which was in some way Open Access). This policy was negotiated with the Wellcome Trust who applauded it. Peter Suber attached some words of approbation. In contrast I feel it's a serious step backwards. I have set the scene in the previous article by addressing free, gratis and fair-use.

The first question, which no-one seems to have addressed, is why do we need permission at all. I believe that we actually have a right to mine both text and data without permission. I'm not reproducing significant amounts of the original work in its original form so I don't believe I'm actually infringing fair-use. I invite any publisher to explain why content created by the community cannot be mined by machine in ways that have been done for centuries by hand.

If, however, the community agrees that we need specific permission to do mining then (to me at least) it is logical that it is a greater libre-dom than fair-use. If, after all, it is simply fair-use then the publisher should say so. And remember that fair-use applies to all the publisher content, not just the OA stuff. (Note, however, that librarians and other purchasing officers normally sign publisher contracts which are more restrictive and limit subscribers' use to less than fair-use. That affects the closed access articles that I have access to in my institution.)

So what really worried me was that Wellcome Trust (for whom I have a high regard) seem to have implicitly agreed that NPG's free-doms for mining need special negotiated permissions. Whereas I regard them as fair-use. So the agreement formally limits fair-use to less than I thought we had. When it comes to other publishers and author/funder-pays articles, funders may be paying large amounts for what was our right anyway.

That's why the precise interpretation really matters. And where I finally get round to PeterS's comments:

Peter Suber says:

June 21, 2009 at 3:30 pm

Hi Peter. There are several threads here that I'd like to separate.

Weak/strong was an early, regrettable proposal for the distinction now captured by gratis/libre. Don't think of it as an additional pair of terms but as a superseded or deprecated pair of terms.

PMR: agreed and I am very glad to see the new pair.

When I introduced the terms gratis/libre into the OA context (borrowing them from the FOSS context), I tried to be clear, careful, and detailed about what I meant by them. I don't legislate usage, of course. But if the question is about how I use the terms, then my original article should answer it. I also think that my article will answer your questions about what the terms mean in practice, or how they can make our discussions less confusing rather than more confusing.

PMR: Yes. Peter writes a great deal of extremely clear explanations. For him libre represents the granting of at least one freedom. (For me it represents all in BBB but I'll go with Peter here).

I don't know whether the new NPG policy goes beyond fair use either. This depends on whether fair use already covers text-mining, a question on which informed people continue to disagree. We may not know whether fair use allows the downloading of full-text copies for processing, but at least we now know that NPG does allow it.

PMR: This is the central issue. But if it is fair-use, then let's recognise it as such and announce it as a clarification, not new added value.

Whether the NPG policy is libre OA in my sense depends on whether it exceeds fair use, and I'm admitting that that's unclear. If the policy exceeds fair use, then it's libre OA (barely). If it doesn't exceed fair use, then it isn't.

PMR: we agree on the central issue

Remember that libre OA is not a synonym for BBB OA. Libre OA covers all the different ways of exceeding fair use or removing permission barriers. It covers a *range* of positions, not just one position. If the NPG policy is libre at all, it's at the lower or minimal end of the range; the BBB OA is a position at the higher or maximal end of the range.

PMR: we agree completely. I personally find it difficult to write libre-OA as meaning other than BBB-OA (as it gives an air of legitimization) but I am happy to write enhanced-permission-OA or some-libre-dom-OA.

I agree with everything you say about the limitations built into the NPG policy. Removing many more permission barriers would greatly facilitate text-mining and (I'm convinced) cause no harm to NPG.

PMR: one of the worst aspects is that this is a complete waste of my time and yours. I should have been writing unit tests to manage chemistry today and I've been sidetracked onto this. I don't mind day-after-day trying to fight nature when trying to crystallise a glue, or distill an oil or whatever. But it is incredibly dispiriting day-after-day trying to win back what is ours by right and with which we could actually do science.

But you inspire me so I shall struggle on.