Monthly Archives: May 2012

The right to read is the right to mine

UPDATE. I got feedback suggesting that part of principle 2 was inappropriate at this stage and I agree. So I have struck through parts in this post. There is merit in changing emphasis at such an early stage in the process. This document is subject to revision – that's part of the point of open discussion.

We – in the OKFN – have been spending some time on Etherpads and Skype putting together the principles of Open Content Mining. Yesterday we met on Skype and decided that we'd done sufficient to take this to the world and get feedback and enhancement. Naomi Lillie (OKFN) will post the full version later and Peter Suber will link to it. This blogpost is an introduction and I'll quote the central points.

Kinder Scout (from Wikipedia) Fuaigh Mòr (Wikipedia)

Let's start with another historic area of rights – the right to roam. This is a 20th C movement in many countries to assert that everyone has access to land, whether or not it is privately owned. It's a good analogy. The fundamental ownership of land is critical, political and often poorly defined. In the 18/19th Century Scotland suffered the - where the residents of the land were thrown out – killed, emigrated, - the lands "improved" with sheep and the lands now "belong" to landlords. But there is a traditional right of access to these lands regardless of actual "ownership". Wikipedia ( ) says:

The freedom to roam, or everyman's right is the general public's right to access certain public or privately owned land for recreation and exercise. The right is sometimes called the right of public access to the wilderness or the right to roam.

Not everyone shares the same view as to what these rights are or even whether they exist. I have been thrown off Scottish land by a gamekeeper with a shotgun, even where there was a legal right. But just because not everyone agrees on the rights doesn't mean they don't exist.

So we believe that there is a right to mine the scientific literature and we have expressed this as:

The right to read is the right to mine.

That's our assertion of the fundamental rights. In the 20th Century the people asserted their right to roam. We are asserting the people's right to mine. This is a simple political statement – like "everyone has a right to a fair trial". Because the publishers[*] – like the 19th C landowners – dispute this right, we have to fight for it. The UK has had a series of fights for rights including freedom of speech, trial by jury, freedom from slavery, etc. Sometimes people went to jail, sometimes they died for these.

But we must fight. An extremely relevant example is the mass trespass at Kinder Scout (from Wikipedia):

The mass trespass of Kinder Scout was a notable act of willful trespass by ramblers. It was undertaken at Kinder Scout, in the Peak District of Derbyshire, England, on 24 April 1932, to highlight that walkers in England and Wales were denied access to areas of open country. Political and conservation activist Benny Rothman was one of the principal leaders.

The trespass proceeded via William Clough to the plateau of Kinder Scout, where there were violent scuffles with gamekeepers. The ramblers were able to reach their destination and meet with another group. On the return, five ramblers were arrested, with another detained earlier. Trespass was not, and still is not, a criminal offence in any part of Britain, but some would receive jail sentences of two to six months for offences relating to violence against the keepers.

The mass trespass marked the beginning of a media campaign by The Ramblers Association, culminating in the Countryside and Rights of Way Act 2000, which legislates rights to walk on mapped access land. The introduction of this Act was a key promise in the manifesto which brought New Labour to power in 1997.

So it's a long struggle. Am I suggesting a Mass Trespass of publishers? That may depend on readers. But the same tensions are there as 80 years ago – an unjust control of access and the need to change the system by breaking the law. And we have a long tradition of noble lawbreaking – often it is the only way that we change minds and therefore laws. There is usually a debate as to whether change should come by legal means or – in today's language – "occupying" and "pirate" action.

So I repeat:

The right to read is the right to mine.

This isn't a negotiated position. It's not a summary of current practice. It's a statement of a fundamental right that we must fight for.

Yesterday we agreed that we could not at this stage list the "how" of Open Content Mining (OCM). That comes later. It will probably be filled with subjunctive clauses – this is a difficult and complex area. The right to roam has to yield to national security and rare species. It may or may not have to yield to personal privacy – a difficult area. So the right to mine will have to take account of the current law and decide what can be done within it or what needs changing (e.g. Hargreaves). It may require a definition of "fact". It may require cases. It could take some time. But that does not mean we cannot NOW assert the right.

So here's the core of the principles. We'd welcome others being involved. But I repeat, this is not a negotiation – it's drafting something we expect to stand for decades or longer. Much of it needs commentary and redrafting – particularly IMO section 2. We don't want to rush these principles, but we do wish to kickstart the process.



Principle 1: Right of Legitimate Accessors to Mine


We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.


  • The right to read is the right to mine.


Principle 2: Lightweight Processing Terms and Conditions    


Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required. The right to crawl is not the right to use a publisher's API for free; however, when access is through publisher-supplied programmatic interfaces, the fees should be transparent and per-API-call. Processing by subscribers should be conducted within community norms of responsible behaviour in the electronic age.


  • Users and providers should encourage machine processing.

Immediate feedback suggested deleting part of this section and I agree.


Principle 3: Use


Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate aggregate statistical results as facts and context text as fair use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited.


  • Facts don't belong to anyone.

Guardian article on Content-mining (thanks Alok Jha) makes it mainstream

[I meant to blog this earlier but I have been spending time on writing content-mining software rather than the continued depressing struggle with reactionary commercial publishers of #scholpub.]

On 2012-05-24 the Guardian (a/the mainstream liberal daily newspaper in the UK) published an article in its main news pages on content-mining from scientific #scholpub articles. In the paper it ran as "It's a useful research tool so why forbid it?" (p14); online:

Text mining: what do publishers have against this hi-tech research tool?

Researchers push for end to publishers' default ban on computer scanning of tens of thousands of papers to find links between genes and diseases

Byline: Alok Jha, Science Correspondent

Alok was the person who promoted "Academic Spring" on the front page of the Guardian last month. He contacted me and others, especially Robert Kiley from Wellcome Trust. Robert and his Wellcome colleagues have made a massive contribution to free scientific information – without Wellcome we would have much poorer involvement. And as sponsors of UKPMC Robert is at the frontline of content-mining – he knows firsthand how hard it is to get any help from the publishing industry[*].

The coverage included stories from Casey Bergman + Max Haeussler, Heather Piwowar and myself – detailing carefully and accurately our major ongoing difficulties. Some snippets:

All of them [above] needed access to tens of thousands of research papers at once, so they could use computers to look for unseen patterns and associations across the millions of words in the articles. This technique, called text mining, is a vital 21st-century research method. It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.

It is a technique with big potential. A report published by McKinsey Global Institute last year said that "big data" technologies such as text and data mining had the potential to create €250bn (£200bn) of annual value to Europe's economy, if researchers were allowed to make full use of it.

Unfortunately, in most cases, text mining is forbidden. Bergman, Murray-Rust, Piwowar and countless other academics are prevented from using the most modern research techniques because the big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution of most of the world's academic literature, by default do not allow text mining of the content that sits behind their expensive paywalls.

Absolutely correct.

Any such project requires special dispensation from – and time-consuming individual negotiations with – the scores of publishers that may be involved.

"That's the key fact which is halting progress in this field," said Robert Kiley, head of digital services at the Wellcome Trust. "For a lot of people, though there is promise there, the activation effort is just too great."

Exactly. My research has been set back 2-3 years by fruitless "discussions" with publishers.

Asking for permission from publishers is an option, though time-consuming. The University of British Columbia (UBC) researcher, Heather Piwowar, was trying to map the ways scientists use and share papers.

She was eventually contacted by Alicia Wise, Elsevier's director of universal access, who convened a conference call with Piwowar, a UBC librarian and five Elsevier colleagues. That conversation led to permission for UBC researchers to text mine the Elsevier journals to which they already had access.

Piwowar said: "It takes a lot of time and a lot of energy and doesn't scale at all. To me it's a good result because now I have access to things I didn't have access to before and also it will also hopefully drive change by people saying, 'This is not an OK way to build on our scholarly literature.'"

The colossal waste of time is clear. Elsevier want me to negotiate with them and the Cambridge University Library. I have to tell Elsevier what research I want to do. The library has better things to do with its time. So do I.

And it's technically completely unnecessary. I can access the articles I want by standard means. It's a pinprick in the daily Elsevier downloads. It's sheer FUD to suggest I will crash their servers. I don't want ZIP files from them through a special API. I already have what I want. All I need is Elsevier to say they won't sue me.

Wise said that, in principle, her company was happy to enable text mining for its content. "We want to help researchers deepen their insight and understanding, we want to help them to advance science and healthcare and we want to be able to do that in ways that help realise the maximum benefit from the content we publish. Text mining is clearly a part of this landscape and it will continue to be and we're keen to support it."

"In principle" means nothing. In the comments AW described

Elsevier is leading the research information industry to enable text mining.

NO! BMC and PLoS are leading it. I can mine them – as much as I like and I can't mine Elsevier at all.

We provide text mining solutions to an array of customers, and we also enable researchers to text mine our content for themselves. This is all done through licensing, which is highly efficient and easily scalable.

So efficient and scalable that I have got nowhere in ca. 3 years. So efficient that we need 5 Elsevier staff for one researcher.

We began partnering with the University of Southern California in 2007 to enable researchers in its Neuroscience Research Institute to content mine and we now have agreements with about 20 universities around the world.

Wow! 20/1500 universities in 5 years. Just over 1%.

We also serve researchers in a broad array of commercial organisations. Earlier this year we announced our acquisition of Ariadne Genomics and QUOSA, companies that both provide state-of-the-art text mining services to improve researcher productivity. We continue to invest to develop an array of text mining ourselves, and we offer other tools through collaboration with partners such as the UK's National Centre for Text Mining. We are also working with other publishers to ensure that text mining is possible regardless of who has published it or where it is located.

These are all words. I am still not allowed to text-mine. And it is Elsevier who makes the rules - in most science it's God who makes the rules, but here it's Mammon. I will write a blog on Elsevier and Helpfulness. "Elsevier is a helpful publisher" is similar to a British bank which advertises "helpful banking". Think of "helpful banking" whenever you think of Elsevier.

Back to the positive.

So what Alok has done is massive! To get national coverage at this level is a huge boost to the legitimacy of our effort. It means the issue is now clear to everyone and cannot be ignored as a minor fringe activity. The UCSF declaration for Open Access (still not mandatory and therefore of very limited practical effect) mentioned mining. Funders are starting to promote mining. UKPMC is fully aware of its huge potential – the dam is only maintained by publisher lawyers and publisher lobbyists on Capitol Hill (US).

So I have been aggressively tooling up for when I am allowed to mine the scientific content. The Guardian article acted as a trial in the court of public opinion and I think the publishers have very little support there.

But I am starting with BMC. Who knows, maybe there is enough hidden science in just 5% of the scholarly literature?

Today we continue developing our Manifesto on Content Mining



[*]Yes, I exempt PLoS, BMC, and lots of worthy society publishers

We meet in Berlin to prepare the #schoolofdata

I'm spending an exciting two days in Berlin helping the OKFN/P2PU prepare their School Of Data (SoD) course/s. I'm sure this will turn out to be a seminal event in both Internet education and advancement in "data wrangling". Here's the initial announcement: "The School will be a joint venture between the Open Knowledge Foundation and Peer 2 Peer University (P2PU)."

There's a huge need for skilled and inventive data wrangling. This is a mixture of technical knowledge and knowhow and the "course" will cover both. We are working out the granularity of the "course" – almost certainly a collection of smaller units, generally self-paced but with some clear timelines. P2PU has had considerable experience in this – for example partnering with Mozilla on web skills.

Here's Laura Newman – the course coordinator – getting our thoughts organized and photographed, and here's Rufus Pollock and Stiivi Urbanek hard at work planning the details.

Stiivi has put together a great "architecture" for the technical side of the course which goes from acquiring data, to cleaning, filtering, repurposing and presentation. We have a strong sense of pipeline, where course participants take a problem from start to finish, using the appropriate skills at each stage. We are presenting this around "challenges" – we take a theme which everyone can relate to and go all the way from finding the data to drawing conclusions.

The course structure and participation is flexible and controlled – there is no hierarchical distinction between teachers and learners – we are all a bit of both. We expect information to flow from and to the course.

The overall components (stages) – which have largely crystallized in our planning - are

  • Data sources
  • Discovery and acquisition
  • Extraction
  • Cleansing, transformation, and integration
  • Analytical modelling
  • Data mining
  • Presentation, Analysis, publishing and packaging

And an overarching subject of "data governance"

To analyse a particular subject a participant needs to go through the processes above, although not all will be needed for a given problem/challenge. We call this process a "journey", where we visit the different stages on a planned itinerary. Many courses will be organized like this – and the first we have designed is "What is unique about my country?"

In this, participants (perhaps working in teams) will find and extract information about their country, clean, filter and integrate it and finally present answers to this very general question (which requires comparison with other countries).
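As a rough illustration of what such a "journey" could look like in code, here is a minimal sketch in Python. It is not the School of Data's actual course material: the function names, the `score` indicator and the numbers are all invented for illustration, and the data lives in memory rather than being crawled from a real source. It visits only four of the stages (acquisition, cleansing, analysis, presentation), since not every stage is needed for every challenge.

```python
from statistics import mean

# Hypothetical raw records standing in for scraped data.
# The values are invented for illustration, not real statistics.
RAW_ROWS = [
    {"country": " Iceland ", "indicator": "score", "value": "85.0"},
    {"country": "Norway", "indicator": "score", "value": "69.0"},
    {"country": "Poland", "indicator": "score", "value": "11.0"},
    {"country": "Poland", "indicator": "score", "value": "n/a"},
]


def acquire():
    """Discovery and acquisition: here we simply copy the in-memory rows."""
    return list(RAW_ROWS)


def clean(rows):
    """Cleansing: trim country names and drop rows whose value is not numeric."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["value"])
        except ValueError:
            continue  # skip unusable entries such as "n/a"
        cleaned.append(
            {
                "country": row["country"].strip(),
                "indicator": row["indicator"],
                "value": value,
            }
        )
    return cleaned


def analyse(rows, country):
    """Analysis: compare one country's value with the mean of all the others."""
    own = [r["value"] for r in rows if r["country"] == country]
    others = [r["value"] for r in rows if r["country"] != country]
    return {"country": country, "value": own[0], "others_mean": mean(others)}


def present(result):
    """Presentation: turn the comparison into one readable line."""
    diff = result["value"] - result["others_mean"]
    return f"{result['country']}: {result['value']:.1f} ({diff:+.1f} vs. mean of others)"


def journey(country):
    """Run the stages in order, as a planned itinerary."""
    return present(analyse(clean(acquire()), country))


if __name__ == "__main__":
    print(journey("Iceland"))
```

Keeping each stage as its own small function mirrors the orthogonal view of the course: a participant can swap in a better `clean` or a real `acquire` without touching the rest of the itinerary.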

In an orthogonal fashion, participants will also study a particular stage in depth. In the journey metaphor, this is like spending your time in one place, finding the different ways of tackling it. So one early topic will be "Crawling and scraping" – there are several different tools, approaches and problems.

There's a real buzz! Over 300 people have signed up and we had an IRC meeting yesterday with 30 – who are very keen to be involved and contribute. Lots of great skills and ideas.

Much more later – on a regular basis - as this is an important part of my life.



#scholpub should be regulated

I recently asked "What's the difference between Elsevier and British Gas?" I didn't get many answers (it would be nice to have a greater response so I could highlight ideas other than mine). The question could also have replaced "British Gas" by "Virgin Trains", "Scottish Power", "East Anglian Water" or even "Lloyds Bank".

The answer is that the others are all, to a greater or lesser extent bound by regulation. They have a legal duty to:

  • Ensure the quality of service
  • Limit prices

Scholarly publishing is in a bizarre and completely unhealthy market where there is no effective market regulation of price and no quality control (the quality of #scholpub is awful compared to other e-products on the web and hasn't changed in 20 years). We have NO IDEA what the true costs of publishing a paper are, or what they could be if the market operated.

Acta Crystallographica E publishes the highest quality papers in science. It's a data-only journal and doesn't completely scale to other journals. It charges 150 GBP for Gold Open Access and makes a margin. They have built their own authoring system which every crystallographer uses and the papers are full of checked, semantic data and there is high-quality peer review. It's difficult to extrapolate but I think a figure of 500 GBP would be the MAXIMUM cost of an efficient scholarly publisher. I'd like to see the high price publishers challenge this.

Yesterday I was asked by a journalist (I won't spoil their story) to comment on the UK Finch report. This hasn't formally reported but there are some open readable minutes, and I was asked to comment on what I thought of the pricing, market, etc. I don't have a strong view on Finch, but it says:

The Working Group first considered the tables in the annex, which were founded on modelling undertaken for the Heading for the Open Road report. It was noted that the 'central case' was a starting point under which APCs were set at a cost-neutral level for the HE sector in the UK of c£1,450 per article, with an assumed take-up rate of 23.3% for OA publications. All the tables therefore use that as a starting point, and vary the costs according to a series of different assumptions – some of which are obviously more realistic than others. The variability is determined by four factors: (i) the level of APCs, (ii) the level of take-up of the gold option, (iii) the difference between levels of take up in UK and rest of the world, and (iv) the proportion of APCs to be met by authors outside and within the UK for jointly-authored papers. The Group observed also that the £18.7m saving from subscription charges does not take account of 'stickiness' in a transitional shift from subscription to APCs – which is liable to take a significant amount of time. Such a transition implies additional costs.

I haven't read the annexe and I cannot see how they can actually assess the costs since almost no publishers analyse and publish them. Some publishers have argued that costs can approach 20,000 USD because of high rejection rates. This is a typical example of an unregulated market. It's like saying "we don't have enough capacity on our buses so we are going to throw most passengers off and charge the others a huge amount to make our profits". It's a sign of a broken market.

A typical example of how inefficient the industry is, and how unresponsive to costs, is that most publishers send the manuscripts off to be retyped – this is an appalling admission of lack of reaction to the 21st century. It's like having to send Amazon a snail mail to order something. It's because Amazon broke the model that we have an efficient, price-competitive market for goods. If the academic sector wished to reduce costs of Gold OA they should create a system with author-side cost reduction. If I was given the option of paying 1450 GBP for APC or 500 GBP if I created it in NLM DTD XML I'd go for the latter. The NLM (which publishes PubMed) is a world authority on publishing and far more efficient than publishers. It has been highly innovative and the only brake on progress has been the relentless destructive legalisation against it and restrictive practices imposed by major toll-access publishers. That's why we cannot get access to content-based search, not because they can't do it.

Anyway I wrote the following for the journalist. It echoes what I have written here:

"What I am concerned about [and what I intend to blog about as soon as I have time] is the lack of regulation in this market.  In almost all transactions, whether author->publisher or publisher->reader there is no price-sensitive market. There is little market pressure on publishers to bring down costs, nor to produce better products. (Scholarly publishing is one of the very few sectors to be completely unaffected by the web - the product is an electronic copy of what was done 20 years ago). There is even less market force in the hybrid Gold model where publishers can charge what they like with no regulation - it is simply up to the funders or authors to pay what is demanded. Moreover the products offered are often not significantly different from Green - there are no rights of re-use and in some cases not even of copying.

In areas such as transport, energy, banks, public services and many others the government regulates the market. Providers have to work within negotiated margins and provide an agreed level of service. None of this pressure is put on publishers. The market often resembles personal vanity products where only the brand matters and cost of production is irrelevant.

My view is that any Green/Gold model will be a seriously suboptimal model until all the current cost (10 billion USD/yr) can be brought funder/author-side. This desperately needs regulation and strong leadership from bodies - probably governments and major funders. I don't think Finch has addressed this at all - you cannot be convincing unless you demand a change of control and do the budgeting properly.

I believe that even at 1500 GBP per paper this represents a seriously overpriced market. I think it might be brought down by bringing in public contractors / purchasers as is done in Brazil, I believe. Nothing could be more inefficient than leaving market forces to libraries in 10,000 scattered uncoordinated universities.

So I am not getting excited about Finch unless the government (Willetts) does. AFAICS Finch says "we want a mixed Green/Gold model with the emphasis on Gold. We aren't putting money in. We aren't imposing regulation. We are not controlling prices related to costs." And of course it's only one country.

#scholpub is now, at its worst, a vanity market such as fragrance or mineral water. The price is vastly higher than the cost. You ask what you can get, not what it costs. There is large, wasteful marketing, there is large and wasteful investment in technology and lawyers to prevent access.

So what's the difference between Elsevier and Chanel? Not much. They are both unregulated.

Oh, and stop thinking of publishers as collaborating partners. Alicia Wise on the GOAL Open Access mailing list asks "what can publishers do to help". She asserts publicly that I don't trust her. Actually I trust her completely. I trust her to behave like a middle manager public relations officer in "Customer Relations" for British Gas, or Scotrail or whomever. She is there to maximize profits for the company. And part of that is preserving the current pseudo-monopolies. I trust her to continue to try to defend that. And offering help is a well-used strategy.

And she can trust me to challenge almost everything that Elsevier does, says, and more importantly doesn't do.

Stevan Harnad is dismayed that Elsevier has introduced a catch-22 in their Green regulations. It's convoluted (well-designed Catch-22s are) and says something like "you can deposit Green, but if your institution mandates it then you cannot". Stevan feels this is a breach of trust and that Elsevier should change it. I say that until this is regulated by a body with teeth we shall continue to have these games played by the publishers. If I travel to somewhere via London on British trains the price is higher. The cost is not higher.

Think of Elsevier, Nature, Wiley, Springer, etc as gas, transport, telecoms, etc. They have no more reason to be loved or hated than those.

The sick part is that the trains have to pay for their fuel (and a lot else). In #scholpub we GIVE the publishing industry the content.




Today we mobilize the forces of academic Freedom

I'm attaching a mail that's going round the academic twittersphere – mobilizing everyone to sign a WhiteHouse (US) petition requiring that all federally funded research be made publicly available.

It's a no-brainer:

  • Find the site below
  • Sign
  • Mail this message to your contacts

Will it do any good?


Every bit of publicity is good and every indication of support helps. This isn't asking you to occupy the streets. It's simply, democratically, asking the US government to act.

The US government has been flooded with contrary bills from vested interests (SOPA, ACTA, RWA, etc.) and public opinion has killed some of them and is chopping off the hydra heads as they emerge. It is unsustainable.

Meanwhile we are now taking positive measures. It makes sense to everyone except those with a narrow view of corporate power over material they haven't produced and have little moral right to control.

So, in simple words:


The funders want this to happen.

The authors want this to happen.

The readers (that's YOU) want this to happen.

The #scholarlyPoor want this to happen.

It's not revolution. It's our right and our responsibility.



On *Monday, May 21*, we lodge a petition on the White House's "We the People" page asking the Obama administration to require that all federally funded research be posted on the Web – extending the principle of the NIH policy to all federal agencies.

1. What We're Asking

· Publicity/ Call for Participation.  Please help line up publicity for the petition before Monday.  Specifically, can you help get it on the front pages of Reddit, Tumblr, Wikipedia, Boing Boing, and send out an all-hands-on-deck request through your own blogs/twitter feeds, etc?

· 25,000 signatures in 30 days gets an official Administration response.  We want to hit that number fast to escalate this issue inside the White House.  We believe the policy has support but is stuck.  This could well be the event that gets it through.

· Please sign the petition on Monday.

2. Social Media links/handles

There is an official campaign website, and there are already Facebook pages and Twitter handles (@access2research) in place.

3. Petition Text (800 character limit)

WE PETITION THE OBAMA ADMINISTRATION TO: [This doesn't count toward the character count]

Require free access over the Internet to journal articles arising from taxpayer-funded research.

We believe in the power of the Internet to foster innovation, research, and education. Requiring the published results of taxpayer-funded research to be posted on the Internet would give access to entrepreneurs, researchers, patients, caregivers, and students, who currently are blocked by high costs. We know this works without disturbing the process of scientific publishing because the National Institutes of Health is already doing it through its highly successful Public Access Policy. All other federal agencies that fund research should have similar policies.

President Obama, please act now to make federally-funded research freely available to taxpayers on the Internet.

4. The Ask to Others

To sign the petition:

-   Have to be 13 years or older
-   Have to create an account on the petition site, which requires giving a name and an email address and then clicking the validation link sent to that address
-   Click to sign

5. Further Context


After years of work on promoting policy change to make federally-funded research available on the Internet, and after winning the battle to implement a public access policy at NIH, it has become clear that being on the right side of the issue is necessary but not sufficient. We've had the meetings, done the hearings, replied to the requests for information.

But we're opposed in our work by a small set of publishers who profit enormously from the existing system, even though there is no evidence that the NIH policy has had any measurable impact on their business models. They can - and do - outspend those of us who have chosen to make a huge part of our daily work the expansion of access to knowledge. This puts the idea of access at a disadvantage. We know there is a serious debate about the extension of public access to taxpayer funded research going on right now in the White House, but we also know that we need more than our current  approaches to get that extension made into federal policy.

The best approach that we have yet to try is to make a broad public appeal for support, straight to the people. The Obama Administration has created a web platform to petition the White House directly called We The People. Any petition receiving more than 25,000 digital signatures is placed on the desk of the President's Chief of Staff and must be integrated into policy and political discussions. But there's a catch - a petition only has 30 days to gather the required number of signatures to qualify.

We can get 25,000 signatures. And if we not only get 25,000, but an order of magnitude more, we can change the debate happening right now.

Next week we will publish our petition and the 30 day cycle begins. What we're asking you to do is to leverage your personal and professional networks to get the word out.

You can do this in any way that makes you feel comfortable. A blog post, an email to constituencies, a tweet, a facebook share, you name it - something that tells thousands of people "I support this petition, I'm signing this petition, and I thought you should know about it too." Because this isn't just slacktivism with a "like" or a retweet - people need to go to the White House website, enter their name and email address, and hit the button.

Qualified signers must be 13 years old or more, and have a valid email address. That's all.

The goal is not just to get 25,000, but to get far more to show the White House that this issue matters to people, not just a few publishers.

We are launching the campaign on Monday May 21. The petition will go live late Sunday night May 20, so that the waves can start in the EU and sweep west with the sunrise. We're asking you to turn on your networks on Monday morning.

Thanks for considering this. If we can all come together to get the word out at once, and stay behind it for 30 days, we have a real chance to get access to taxpayer funded research across the entire government, and send a signal that the people have a voice in this debate, not just publishers and activists.


#scholpub, Maxwell and the Laws of Acadynamics

For many days we have been discussing #scholpub on the GOAL mailing list, run by Richard Poynder. Some important issues are coming up and there is now a healthy divergence of views which RP runs well. I'll talk more later, I hope.

In the time between trying to content-mine PDF (yes, more later), I thought about the tragedy of the academic commons. We have 10,000,000,000 USD (count the zeros) of mainly public money and student fees to "buy" the #scholpub we produce. That's a sizable market. It's not as large as many, but quite enough to run competently and for the benefit of everyone.

Including the #scholarlypoor

But we don't. #scholpub is the most inefficient "market" in the world. (No, perhaps arms procurement is worse.) I'll analyse more in a later post. Hint, here's the answer to my question:

"What's the difference between Elsevier and British Gas (or Central Trains, or Scottish Power or umpteen more)?"

Answer: There is no regulator for #scholpub.

I wondered why. Basically because academia is 10,000 institutions all going in different directions.

In molecular sciences these particles obey a Maxwellian distribution. Some fast, some slow, some east, some west, some north, some south, some up, some down. Occasionally they bump into each other, but they are basically uncoordinated.

And they give rise to the laws of thermodynamics. The analogy that follows has some merit – I am still working it out – feel free to contribute. The laws in their formal form are not easily accessible but there's a witty synopsis:

  0. You have to play the game.

  1. You can't win; you can only break even.
  2. You can only break even at absolute zero.
  3. You can't reach absolute zero.

Law 1 says you can move resources (heat and work) around and that you conserve energy

Law 2 says that there are inefficiencies in the system (loss of useful energy) which only disappear at absolute zero (the lowest possible temperature)

Law 3 is obvious


So I thought – there is ten billion dollars in the system. It can be moved around. There are inefficiencies in the system, but if we work together we can achieve high efficiency. And then? The sad truth. So I proposed 3 laws. They are raw, you are welcome to tune the wording. But they are roughly based on the three laws of Thermodynamics and perhaps there is a zeroth here:

0. There is a lot of money in the academic #scholpub system

1. We can change the system by moving money around

2. To do this academics must collaborate

3. Academics will never collaborate



And when I published them Jan Velterop came up with the lovely "Laws of Acadynamics". Thanks Jan.

Now there is a way to get round the Second Law: Maxwell's Demon. A superintelligent being that bats individual molecules around. One that organizes Universities to point in the same direction. Yes, we need a Maxwell demon.

But haven't we had a Maxwell Demon already in #scholpub?



What's the difference between Elsevier and British Gas?

This is a serious question and I have a serious answer. See if you can guess it. If so add a comment.

You can substitute "FooPub" for "Elsevier" where FooPub is any #scholpub such as ACS, PLoS, Wiley, BMC, etc.

You can substitute Eastern Water, Scottish Power, First Capital Connect (a train operator) and many others for "British Gas".

I shall continue to turn my attention to content-mining in the next few posts.



Data are part of the future; the OKFN’s contribution

I am really excited about the OKF's commitment to data.

Most data is lost, badly produced, unclear, etc. The OKFN-P2PU School of Data intends to create a new approach to education for the data-age. I'm very excited to be part of this.

Don't have time to do more than advertise:

The Open Knowledge Foundation are currently recruiting for a Data Wrangler and a Data Visualisation Developer. If you'd like to join our team, please visit our jobs page.

At the Open Knowledge Foundation, we build tools and communities to create, use and share open knowledge – and to help others to do the same. In recent months, we have become involved in a growing number of open data projects, and two new positions have now been created within our team.

We are seeking two data experts to join us as a Data Wrangler and a Data Visualisation Developer. Read on to find out more about what the roles involve.

Data Wrangler

We're looking for a data wrangler who is excited to tell stories through data. You will work on various datasets, to understand them and to tell their story to a broader audience. You will also be involved in training efforts, creating and teaching courses in data analysis to technical and non-technical audiences.

Your role will be exciting and varied, and will include:

  • Work on the School of Data, building learning challenges and course content (see our previous post for more information on the School)
  • Research for our new data blog, coming soon.
  • Collaborations with our Working Groups, for example the Working Group on Open Economics
  • Work on OpenSpending, one of our flagship projects.


We are open to people from a wide variety of backgrounds; whether coding, visualisation, journalistic, statistical or otherwise. We are seeking someone who has:

  • Experience in data analysis and statistical methods
  • Experience with data cleansing, ETL patterns
  • Good written communication skills
  • Experience with R/Stata/SPSS
  • Coding skill in a modern script language, e.g. Python, Javascript.
  • Basic skills in information/data visualization

If that sounds like you, please visit our jobs page to find out more.

Data Visualisation Developer

As a Data Visualisation Developer, much of your time will be spent on our flagship OpenSpending project.

OpenSpending is about mapping the money. We want to make government finances accessible to advocates, journalists and citizens. Our goal is to collect budgeting information from across the world and to present it in a form that promotes understanding, analysis and participation. Some of the questions we ask are:

  • How much is government spending on health? Is expenditure growing or shrinking? How does this translate into results?
  • What are the proportions of different government programmes? What is spending on prisons compared to schools? How much is Ghana spending on education compared to Nigeria?
  • How much tax do I pay into which area of government?

Our day-to-day work has many facets. We work on the core platform, undertake journalistic projects as part of "Spending Stories", which won the Knight News Challenge in 2011, and work with organizations and civic activists world-wide to set up local budget transparency projects.

Your role with us

You'll help us to create new visualizations to answer spending questions through meaningful, visual narration.

Skills we're looking for:

  • Strong visual design skills
  • HTML5/Javascript visualisation experience
  • Familiarity with several visualization toolkits (e.g. D3, Raphael)
  • Experience with cross-browser compatibility
  • Plus (but optional): Knowledge of Python

Basically: send us some demos of good stuff you've done.

Towards a manifesto on Open Mining of scholarship

Tomorrow a small group of people interested in "textmining" will have a Skype meeting under the auspices of the OKFN. We have sort-of-pushed this agenda for some years and now it's come to fruition – there is clear public awareness of the value of textmining and the barriers that prevent it being used. Indeed my blog has even got mentioned in a financial analyst's review of Elsevier (the implication being that if Elsevier continues to drag their feet their market will react against them). Of course it's not just Elsevier, but they are the ones that have had most prominence. So this post is to prepare my mind and hopefully come out with some useful ideas.

There is no doubt that the lack of positive approaches to textmining is having huge costs:

  • Opportunity. We cannot do the things that we want to. Moreover this stifles the imagination of the rest of the community – without exciting examples of what can be done – and they *are* exciting – people do not realise what they are missing. And that's all of us, not just subscribers to journals.
  • Wasted time. Anyone wishing to do textmining has to spend huge amounts of time trying to get permissions, worrying about being taken to court, and simply waiting for null responses.
  • Bad science. Much published scientific data is flawed. Not necessarily deliberately, but by the outdated methods of publication. Almost no scientific data are reviewed (a few publishers like Int. Union of Crystallography are shining exceptions). And their tools have unearthed bad and fraudulent science. There is no reason to believe it is different elsewhere – in fact I suspect it's worse – the chance of getting caught is often near zero. Textmining is a major tool in data review.
  • Unexploited information and products. Google et al. have shown that there are huge new markets. There is undoubtedly a large market in downstream information and information products from scientific research. I estimate it at low billions for chemistry alone.
  • Bad policy decisions. If the scientific literature is not used fully then decisions are flawed. These range from new drugs, to climate, to the effects of chemicals to… Machines can provide decision support that complements humans.
  • Bad scholarship and bad scholarly relations. When a new technology emerges of benefit to scholarship then its wilful prevention for non-scholarly reasons has harmful effects on the whole community. It's fair to say that many textminers see publishers as a major problem who are solely bent on making money by restrictive practices.

There are more – but that should be more than enough to build an overwhelming case.

Now what is "textmining". The word is very unfortunate for several reasons:

  • There are specific legal aspects of text which may differ from other forms of information.
  • There is a confusion with "fulltext".
  • It suggests that only the words in scholarship are involved. This is particularly damaging since much information is conveyed in images, diagrams, audio and video (in fact all of the major MIME-types!). For example commercial publishers often forbid the re-use of diagrams or charge large amounts because artistic images have special protection under copyright.

I would like to see a more general term – perhaps "information mining" (IM) which covers all the types above and also "data". Or possibly "publication mining". It would be a disaster if we only agreed how to manage "text" and left the rest unchallenged.

Some technical background. (I actually suspect that most of the people who make the rules about IM (libraries, publishers) haven't a clue how it's done). Simply:

  • You write (or borrow) a program that retrieves the things you want to mine. A simple F/OSS one is called wget. Ours (Nick Day, Sam Adams) is called "PubCrawler" and has been specially built for crawling scholarly publications. You point it at a website and it systematically retrieves files/pages one-by-one. The only problem is that if you do this too quickly then it may overload the website, so responsible crawlers have a delay (perhaps 5 seconds) – POINT 1. The argument that textmining will destroy servers is a smokescreen. (There are many ways of avoiding technical problems). Note that if you already have the papers on a local machine this step is unnecessary. Universities create caches to avoid repeated downloads but publishers want the downloads so they can count-the-clicks. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You have another program that mines information from each paper. This is hard and tedious to write but once done is automatic to run. How well it performs depends on many factors (the format of the paper, the language/style of the journal/authors, the use of dumb (GIF/PNG) or semi-semantic (SVG) diagrams, etc.). For text you could use Lucene – an Apache project. Daniel Lowe has shown that it's possible to mine 500,000 chemical reactions from US patents using our F/OSS OSCAR/OPSIN/ChemicalTagger and the NIH's OSRA for chemical diagrams. Things are better than they were 5 years ago and I am fairly hopeful about the technical mass-mining of chemistry. This process does NOT violate copyright though it may technically violate the restrictive publisher contracts that Universities have signed.
  • You publish your results. Here there is a potential problem with copyright although I suspect it has never been tested. I suspect anything less than bulk republishing of verbatim full-text would be allowable in many courts. In particular republishing "factual" information would incur no legal penalties, whether or not for commercial purposes.
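The first two steps above can be sketched in a few lines of Python. This is only an illustration and is not PubCrawler or OSCAR: the function names (fetch_url, polite_crawl, mine_terms) are hypothetical, the 5-second default simply echoes the delay suggested above, and the "mining" is a naive whole-word count standing in for real chemistry-aware tools.

```python
import re
import time
from urllib.request import urlopen


def fetch_url(url):
    """Download one page and return it as text (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_crawl(urls, delay_seconds=5.0, fetch=fetch_url, sleep=time.sleep):
    """Retrieve pages one-by-one, pausing between requests.

    The pause is what keeps a responsible crawler from overloading the
    server. `fetch` and `sleep` can be swapped out, e.g. to read papers
    that are already on a local machine or to run without waiting.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages


def mine_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


if __name__ == "__main__":
    # Hypothetical local "papers" so the example runs without a network.
    local_papers = {
        "paper-1": "Benzene was refluxed with toluene.",
        "paper-2": "The benzene ring is planar.",
    }
    pages = polite_crawl(local_papers, delay_seconds=0.0,
                         fetch=local_papers.__getitem__)
    for name, text in pages.items():
        print(name, mine_terms(text, ["benzene", "toluene"]))
```

Passing the fetch and sleep functions in as arguments is a design choice for the sketch: the same loop then works against a live site (with the delay) or against a local cache of papers (where, as noted above, the crawl step is unnecessary).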

The miner's problem.

Simply stated:

  • IM MIGHT fall foul of copyright law. Because of the risk-averseness of libraries and the pressure from some publishers to limit activities such as UK/PMC no authorities are prepared to challenge or test this. Individual researchers are left to make their own judgments, with little hope that they will get support from institutions. This canopy of fear is a dampener for research.
  • There are NO explicit rules. Because of this researchers do not know what they can and cannot do. Logic does NOT work in courts of law – only laws and precedent. People who make facile assertions that you can/not do something only muddy the waters.
  • It MIGHT fall foul of database laws such as sui generis in Europe. Again, in our risk-averse culture no-one offers support to challenge this.
  • It probably WILL fall foul of the Publisher-imposed extensions to University contracts. These are basically unethical and imposed solely (IMO) for protecting the market.

Simply stated: Miners need clear, simple, permanent, automatic answers so they know what they can and cannot do.

Researchers are responsible people. There are many places where research has to take account of law and there are very few public breaches. The same should be assumed for IM.

The publishers' problem.

The primary problem is that publishers now have a market (not necessarily of their own making) which is profitable and where change may bring problems. The flip-side, that IM may bring benefits, is never mentioned! Thus Richard Kidd of the Royal Soc. Chemistry on this blog has voiced the fear that my textmining may undermine the RSC's viability and he wants an assurance that I won't do anything to harm their income. I think of all publishers in the world the RSC is best placed to benefit massively from IM instead of preventing it happening.

This is a typical problem with monopolies (which the publishers have). They want to see their income continue indefinitely in the same way rather than changing their models. It's natural, and history shows it's ultimately doomed. Only the conservatism of academia (see Michael Eisen's blog) keeps them in business. Whether or not we take the publishers' interests into account depends on the worth that society gives to their services – and that is changing rapidly.

There is no natural law that says we do or don't have to accommodate the publishers, whether or not they are learned socs. They no longer have the moral right to control unilaterally how scientific knowledge is published and used. There has been no constructive debate in this area and publishers should think about their source of material and its volatility.

The libraries' problem.

This is a completely new technology which is opaque to many libraries. There are, of course, some world-leaders in information management, especially the NLM and national libraries, but the average University has no experience of either the technology or the law. This makes it problematic when publishers suggest that text-miners should go through their libraries and have joint discussions with publishers. This is counterproductive as it drastically slows the process and means that many of the decisions are made by non-practitioners. [I have so far written several times to my librarian and am waiting for a reply]. The rigmarole that Elsevier put Heather Piwowar through with UBC librarians is out of order and in any case doesn't scale across publishers, libraries or researchers.

Current concerns and why we need principles

There is a high probability that some well-intentioned academics will "negotiate" terms with publishers which then are used as a precedent to constrain everyone else. I, for example, am unwilling to accept the terms that UBC have. For that reason we are setting out principles, which we believe are absolute and which will inform the practices and their adoption. In the spirit of the excellently crafted BOAI and other declarations we are working towards words which will last for decades.

Bases of the principles:

  • The scholarly literature is created to inform and enlighten humankind. Authors expect that their material will be as widely used in as many ways as possible and by as many people as possible.
  • Information mining is a natural and major advance in the use of the scholarly literature and brings very large benefits.
  • The only inexorable laws relating to IM are copyright and database rights. These were not designed to restrict the flow of scholarship and should not be used for this purpose.
  • Subscribers to the scholarly literature are responsible people and will not deliberately break the law. They need a globally published set of principles by which they can determine what they may do.
  • Technology and human attitudes are changing rapidly and we should be positively and proactively responsive to them. We cannot and should not try to guess the future and we should not jeopardise it by short-term considerations.

And perhaps a single definition. I suggest the term "Open Mining" as inclusive. Note that these principles are statements of what we wish to be the case, not a negotiation. BBB are statements of aspiration.

  • "By Open-mining we mean the unrestricted use of machines to extract, process and republish content in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions other than community norms of responsible behaviour in the electronic age."

"Responsible behaviour" and "community norms" cover stuff like server overloading, personal data, deliberate corruption, and adherence to generally accepted Internet practice.

That's the aspiration. BBB are aspirations. Some scholars and some publishers have adopted them enthusiastically. They have helped enormously.