Monthly Archives: September 2012

#okfest #openscience being streamed today. Updates.


OKFest is being streamed today (6 viewers so far). Jenny Molloy is introducing. Now Puneet Kishor from CC-science.

See for details. Yesterday we hacked PyBossa for crowdsourcing of spintronics.

I will update this blog over the next hour or so (I hope).

Now Joss Winn (JISC Orbital project)


#okfest #openscience: the hacking begins: OpenBib to analyse patterns of science funding

The great thing about a good hackathon is that you don't know what you will be doing until you get together and pool ideas and expertise. Today we have our Open culture and science hack-day. About 20 "streams" ranging from Wikimedia-editing to #openbiblio. And we are working on a project that comes out of two ideas:

  • Tom Olijhoek is one of the driving forces behind @ccess – true Open Access to the scientific literature. Tom runs OpenMalaria. His current idea is that science can be influenced through the way it's funded – especially in the areas of medicine and food. Can we discover this from what appears in the public literature? Although most published science is CLOSED, we can pick up a lot from the metadata and from the 5% of Open Access science.
  • In parallel we have the #openbiblio project which is building the technology to make Open Bibliography fully re-usable. Examples can be found on (technology) and (collections of bibliography)

We're combining these in a project to scrape the bibliography from BioMedCentral (one of the few sources of Open Access publishing). Here's the team starting to make Open Access BioMedCentral BibSoup (note Gulliver the Open Access turtle):

And here's the actual team:

L2R: Jenny Molloy, Laura Newman, Michael Bauer (and some still at lunch) and Daniel Lombraña González (who's running the PyBossa project) and me with the iPhone. (And Chuff the OK-API).

Michael is building the scraper and already has a chunk of bibliography. He's got a background in hacking and biomedicine and is now with OKFN.
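For anyone wondering what a scraped record might look like on its way into BibSoup, here is a minimal Python sketch. It is not Michael's scraper: the function name `to_bibjson`, the input field names and the sample record are all hypothetical, and the output only loosely follows the BibJSON habit of nesting authors, journal and identifiers.

```python
def to_bibjson(raw):
    """Turn a flat dict of scraped fields into a BibJSON-style record.

    The input keys (title, authors, journal, year, doi) are assumptions
    made for this sketch, not the fields of any real scraper.
    """
    record = {
        "type": "article",
        "title": raw["title"].strip(),
        # Each author becomes a small object so more fields can be added later.
        "author": [{"name": name.strip()} for name in raw.get("authors", [])],
    }
    if raw.get("journal"):
        record["journal"] = {"name": raw["journal"].strip()}
    if raw.get("year"):
        record["year"] = str(raw["year"])
    if raw.get("doi"):
        record["identifier"] = [{"type": "doi", "id": raw["doi"].strip()}]
    return record


# A made-up record, purely for illustration.
example = {
    "title": "  An example Open Access article ",
    "authors": ["A. Author", "B. Author "],
    "journal": "Example Journal",
    "year": 2012,
    "doi": "10.0000/example",
}
bib = to_bibjson(example)
```

Optional fields are simply left out when the scrape did not find them, so a partial record is still usable as bibliography.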

If you are reading this remotely and want to be involved we have an Etherpad – tweet @petermurrayrust


#okfest: The preparation

Today is when people start arriving en masse for OKFest – the main sessions start tomorrow. I'm hanging out with the organizers, volunteers, etc. The staff and volunteers are doing the last-minute stuff, which is always more than you think. The programmes and bags have to be packed. The name labels have to be checked. The venues have to be got ready, and so on. I'm fortunate – I have relatively little to do – other than running a panel on Wednesday. So here are some photos.

Immediately after arriving – now the work starts:


This is a maker project. I think it's a machine to make things. Very bravely the instructions were sent from Chicago and cut in Helsinki. Are all the bits there? More or less, yes! I've helped by trimming off some scarf.



And here the real work is happening. Staff and volunteers hacking the final details



Sam Leon doing the analytics:


#okfest name badges

OKFest is about fun, and making things (and a lot more). So here are examples of our laser-cut badges.

Chuff went to Massimo, smiled and asked could he have a name badge. "#OKFEST, CHUFF, OKAPI". He is now the smartest OKAPI in the whole world:

Kat has done a huge amount of work and deserves an even more special badge. So Stuart Childs (@sc_r) has made special light-up badges. They are ultra cool and shine until the battery gives out. Here's the whole crew displaying:

Clockwise: Juha, Stuart (with 2 badges), Massimo, Kat (hand only)


Open Content Mining: Authors and Readers should control the process; act before it’s too late

Scientists write articles so people can read them. But now we've realised that machines can read them as well. And we have argued that if you subscribe to read an article you also have the right to use machines to read it and extract and re-publish facts.

It's critically important that we unite behind this view. Because at present:

  • The STM publishers prevent content mining
  • There are signs that they wish to develop this as a new activity (for which they will undoubtedly charge, build walled gardens, and otherwise restrict access to their content).

"their content"? We wrote it. And while they might tweak the odd bit of text they cannot and must not alter the facts. So the facts are created directly by scientific research.

The pharmaceutical industry is desperate for these facts. So the publishers got the pharma industry to meet them (I think in Bruges) a few months ago and have hammered out an agreement.

I don't know the details, because they are almost certainly secret. (I'll write to STM and ALPSP and they can save me the bother by replying to this blog). If the details aren't secret then one-cheer. (No more cheers till we see the details).

We've already seen on this blog that Springer are reselling their image content – double or even triple dipping into freely given content. So Springer, do you intend to charge for content mining in a fourth dipping process?

Why do I trust the worst rather than the best in this deal? Here's the meat of what has been reported (I know no more and would be grateful for insight). Sorry it's an image, but the original is a PDF:

Nothing has been said about cost/prices so I won't speculate. The implication here is that the results can be mounted within the pharma company but not published.

I have spent three years trying to get permission to text-mine, e.g. from Elsevier's Directorate of Universal Access, without any progress. Universal Access is extremely helpful (because they say so). Heather Piwowar spent months getting an agreement for one group in one university (UBC). She tweeted (and I have her brave permission to re-quote):

"I am LOATH negotiating with publishers. It gives me hives" (Hives is a disease)

Elsevier have negotiated 20 text-mining contracts over the last 5 years – that's an average of about 0.33 a month. There is no way they can and will scale to this demand (even if they wanted to). Then there are 100 other publishers, all currently with restrictive licences.

So the danger is that whatever is negotiated here will be put in front of Universities/Librarians, whose track record is that they don't publicly challenge any contracts.

Please do not sign any content-mining contracts without alerting the world.

It is critical that we, the scientific and machine-readership, argue for our rights, not the commercial benefit of publishers.

This is where YOU have to make a stand.





#okfest is and will be amazing

I am at #okfest in Helsinki – 13 (THIRTEEN) separate tracks on Openness. I am already gutted that I shall miss most of them because they are in parallel. I'm helping run Open Research and Education and obviously that will be much of my time. (BTW we are full up so I won't urge you to come, sorry).

As an example the OKF is running all the conference management itself. Name labels? Make them! Here's Massimo – a doctoral student in the media lab in Aalto and he's running the lab's laser printer to make the labels from plywood. The laser printer is on the left:


And here is what he creates:

(This one is acrylic; you can see the label is for a "CREW" member.) My job (I think) is to help punch out the little chads in the middle.

Chuff the OKAPI has already been tweeted, and there was a comment (I think from IBM) that it should be a RESTful API. So here's the restful Chuff:


The humans are:

  • Joris Pekel (Open heritage GPP / Amsterdam)
  • Juha Huuskonen (OKFestival coordinator / Helsinki)
  • Kat Braybrooke (OKFestival coordinator / London)

It's just so exciting to be at a meeting where people discuss how to share and change the world. I've met Stuart who runs an Open laser workshop in Leeds and works with DaveMR (Mo-seph) and Matt Venn.


Our manifesto: “The Right to Read is the Right to Mine”; Universities: you must fight for Open Content Mining before it’s too late

Over the last ten years University (Libraries) have signed or re-signed one million contracts with scholarly publishers (e.g. Elsevier) which forbid re-use of the subscribed material. Thus, for example, if your university rents a scientific journal for 5000 USD a year (a not uncommon figure) you are not allowed to extract factual information from this and republish. If you buy a BOOK (a fast disappearing thing) you can extract the facts by hand and re-publish them in Times or Trafalgar Square. But not if it's electronic, even though doing this electronically is of course much easier, much saner, and what this century is all about.

But publishers add restrictive clauses to their licences and librarians just sign them.

One million times my rights have been signed away without my even knowing. The very least that should have happened is that the libraries should have alerted the rest of the world and refused to sign. But no – the only thing they are worried about is price (and they aren't very good at keeping that down – Elsevier make 30% profit). Everything else in the information world has gone down in price, but scholarship costs more each year. And, of course, it's actually written by you and me and given to the publishers. They don't even produce it in a modern efficient manner – in a non-protected market #scholpub would go out of business in a year. (stop ranting, PMR and get to the point).

Jenny Molloy has collected a range of publisher restrictions that libraries sign (see full paper). Here's one (from Elsevier):

Schedule 1.2(a) General Terms and Conditions, "RESTRICTIONS ON USAGE OF THE LICENSED PRODUCTS/ INTELLECTUAL PROPERTY RIGHTS" [GTC1]: "Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement."

Fairly clear. Readers cannot do ANYTHING with machines. The others are just as restrictive. I cannot imagine how anyone could sign this without alerting the world to the problem.

And it gets worse. Elsevier will "allow" text-mining, but only if the individual scientist and their librarians negotiate a secret deal with Elsevier (as Heather Piwowar and UBC were required to do). This is completely unacceptable and doesn't scale.

So we (Diane Cabell, Jenny and I), using the OKF lists, have created a manifesto. The only things you need to remember are in bold type:

Principle 1: Right of Legitimate Accessors to Mine

We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes. The right to read is the right to mine.


Principle 2: Lightweight Processing Terms and Conditions

Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required. Users and providers should encourage machine processing.


Principle 3: Use

Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts and context text as fair use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited. Facts don't belong to anyone.


And Diane wrote a superb supporting text (see paper) which explains the rationale, the law, and what we should do. Jenny and I stitched it together in a slightly frantic rush, added pictures, tables, references, etc. I have been elected to the Fellowship of the OpenForum Academy, who are meeting on Sept 24. I can't go, so I have offered this paper.

The publishers have woken up to the fact that text-mining matters. They are starting to do secret deals with subscribers (I'll write about their deal with pharma next blog). They'll start to create walled gardens, special extra terms and who knows what.

Whereas it's actually our RIGHT to do this.

So universities and librarians – are you going to watch while yet another set of rights disappears uncontested?

Or are you going to fight for my (and everyone else's) rights?

#animalgarden at Digital Research 2012 (#openbiblio and BibSoup) and OKFest

It's a very busy time for #animalgarden – the group of animals committed to Openness. Last month they made the allegorical movie of weak chemical AI ("Magic Chemical Panda"). Now they've been busy on #openbiblio and #okfest.

[PMR: I presented #openbiblio among a series of Open tools I call "Liberation Software" designed to create Open information, especially in #scholpub and Open Scholarship.]

Mahendra Mahey ran a great evening session for new ideas / software along the lines of Dragon's Den. We all had to pitch. I showed the #animalgarden video explaining how BibSoup works.

Gulliver is the Open Access turtle from BioMed Central. He's very keen on making things Open.

We're VERY busy. We've got a new member of #animalgarden who is helping us with #okfest next week

It's Chuff the OKAPI

Chuff wears the OK logo.

OKFest is growing rapidly. We've got sessions on science and open-access. It's impossible to take in everything.

It's a must-attend event.

Data Liberation and the Long Tail: (and a puzzle)

Next Tuesday I am giving an invited talk at Oxford on Open Data, and am also involved with a session run by the OKF immediately afterwards. As always I don't know what I am going to say until 0200 on the morning of the talk – this gives a chance to talk with delegates and get a feel for what is valuable.

I'll touch on at least the following:

  • The Long Tail. Scientific disciplines which have little formal information infrastructure but huge amounts of science. Disciplines such as bioscience (outside mainstream bioinformatics-support, such as phylogenetics), chemistry, materials science, observational sciences (other than astronomy), much computational and simulation research. Much of the data is valuable but thrown away. I estimate billions (sic) of dollars are wasted through non-existent infrastructure.
  • Graduate Students. A seriously misused resource. Much of the innovation comes from third-year postgraduates and we need to give them expression.
  • Software/informatics as a first-class activity. Builders of scientific software are often denigrated as not "doing proper science", but they are every bit as important as the scientists who build telescopes and other instruments.
  • Bottom-up communities. There is a huge cognitive/informatics surplus if we treat the citizen community as equals and not inferiors. (Much of the software we work with is developed outside "research universities".) We should be helping this grow.
  • Liberation software. I and others are building software which will free data in dark silos, repositories, theses, journals, etc. I'll present some of this in the afternoon briefly. The main battle we face is closed minds and vested interests; liberation software will leapfrog many of these.

I'll be showing some of this in action, but here's a taster. It comes from the supplemental information in a paper behind a publisher's firewall. I don't know if I am allowed to show it, but I'll take the chance. It's a mass spectrum – in simple terms it measures the mass of a molecule (here to 4 sig figures). Here are some questions. (Please add answers as comments because then I know people are interested and also I might learn something). [BTW this is how it appears in the paper – I assume the journal prints text upside down to make it easy for Australians, but I have to hang from the ceiling to read this.]


UPDATE: Walter Blackstock has given some answers and I reply

Questions (in order of difficulty):

  • What's the constitutional formula of the compound? (relatively easy for chemists)
  • How many peaks are there? (harder than it looks)
  • How would you find where this diagram was published? (very hard)
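One way to see why the second question is harder than it looks: the answer depends on what you are prepared to call a peak. Here is a minimal Python sketch, assuming the spectrum has already been reduced to a list of intensities. This is not AMI2, and the function name `count_peaks` and the numbers are made up for illustration.

```python
def count_peaks(intensities, threshold):
    """Count local maxima that are at or above a chosen intensity threshold.

    A point counts as a peak if it is strictly higher than both of its
    neighbours. The threshold is a judgment call, which is the whole problem.
    """
    peaks = 0
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= threshold and y > intensities[i - 1] and y > intensities[i + 1]:
            peaks += 1
    return peaks


# Made-up intensities: two obvious peaks plus some small bumps.
spectrum = [0, 2, 1, 40, 3, 5, 4, 100, 6, 1, 0]

all_bumps = count_peaks(spectrum, threshold=0)     # every local maximum
clear_peaks = count_peaks(spectrum, threshold=10)  # ignores the small bumps
base_peak = count_peaks(spectrum, threshold=50)    # only the tallest
```

The same data gives three different counts depending on the threshold, and a real spectrum adds noise, isotope patterns and overlapping signals on top of that.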

On Tuesday I will show how Liberation Software AMI2 can be used to answer Q 3.