ContentMining: My Video to Shuttleworth about our proposed next year

I have had two very generous years of funding from the Shuttleworth Foundation to develop TheContentMine. Funding is in yearly chunks and each Fellow must reapply if s/he wants another year (up to 3 years in total). The mission is simple: change the world. As with fresh applicants, we write a two-page account of where the world is at, and what and how we want to change things.

TL;DR I have reapplied and submitted a 7-minute video.

These two years have been a roller-coaster – they have seriously changed my life. I can honestly say that the Fellowship is one of the most wonderful organizations I know. We meet twice a year with about 20 fellows/alumni/team committed to making sure the world is more just, more harmonious, and that humanity and the planet have a better chance of prospering.

There's no set domain of interest for applying, but Fellows have a clear sense of something new that could be done or something that badly needs mending. Almost everyone uses technology, but as a means, not as an end. And almost everyone is in some way building or enhancing a community. I can truly say that my fellow Fellows have achieved amazing things. Since we naturally live our lives openly you'll find our digital footprints all over the Internet.

I'm not going to describe all the projects – you can read the web site and you may know several Fellows anyway.

  • Some are trying to fill a vacuum – do something exciting that is truly visionary – and I'll highlight Dan Whaley's annotation project. This project (and ContentMine is proud to be an associate) will bring annotation to documents on the Web. That sounds boring – but it's as exciting as what TimBL brought with HTML and HTTP (which changed the world). Annotation can create a read-write web where the client (that's YOU!) can alter/enhance our existing knowledge, and it's so exciting it's impossible to see where it will go. The web has evolved to a server-centric model where organizations pump information at dumb clients and build walled gardens where you are trapped in their model of the world. Annotation gives you the freedom to escape, either individually or in subcommunities.
  • Others are challenging injustice – I'll highlight two. Jesse von Doom is changing the way music is distributed – giving artists control over their careers. Johnny West is bringing transparency to the extractive industries. Did you know “BP” consists of over 1000 companies? Or where the fracking contracts in the UK are?

So when I launched TheContentMine as a project in 2014 we were in the first category. Few people were really interested in ContentMining and fewer were doing it. We saw our challenge as training people, creating tools, and running workshops, and that was the theme of my first application. Our vision was to create a series of workshops which would train trainers and expand the knowledge and practice of mining. And the world would see how wonderful it was and everyone would adopt it.


In the first year we searched around for likely early adopters, and found a few. We built a great team – where everyone can develop their own approaches and tools – and where we don't know precisely what we want for the future. And gradually we got known. So for the second year our application centred on tools and mining the (Open) literature. It's based on the idea that we'd work with Open publishers, show the value, and systematically extend the range of publishers and documents that we can mine. And that's now also part of our strategy.

But then in 2014 politics...

The UK had already pushed for and won a useful victory for mining. We are allowed to mine any documents we have legal access to for “non-commercial research”. There was a lot of opposition from the “rights-holders” (i.e. mainstream TollAccess publishers to whom authors have transferred the commercial rights of their scientific papers). They had also been fighting in Europe under “Licences for Europe” to stop the Freedom to mine. Indeed I coined the phrase “The Right to Read is the Right to Mine” and the term “Content Mining”. So perhaps, when the UK passed the “Hargreaves” exception for mining, the publishers would agree that it was time to move on.

Sadly no.

2015 has seen the eruption of a full-scale conflict in the EU over the right to mine. In 2014 Julia Reda MEP was asked to create a proposal for reform of copyright in Europe's Digital Single Market. (The current system is basically unworkable – laws are different in every country and arcanely bizarre [1]). Julia's proposal was very balanced – it did not ask for copyright to be destroyed – and preserved rights for “rights-holders” as well as for re-users.

ContentMining (aka Text and Data Mining, TDM) has emerged as a totemic issue. There was massive publisher pushback against Julia's proposal, epitomised in the requirement for licences [2]. There were over 500 amendments, many of them simply visceral attacks on any reform. And there has been huge lobbying, with millions of Euros. Julia could get a free dinner several times over every night!

There is no dialogue and no prospect of reconciliation. There is simply a battle. (I am very sad to have to write this sentence)

So ContentMine is now an important resource for Freedom. We are invited to work with reforming groups (such as LIBER, who have asked us to be part of FutureTDM, an H2020 project to research the need for mining). And we accept this challenge by:

  • Advocacy. This includes working with politicians, legal experts, reformers, etc.
  • Software. Our software is unique, Open, and designed to help people discover and use ContentMining either with our support or independently.
  • Science. We are tackling real problems such as endangered species and clinical trials.
  • Hands-on. We've developed training modules and also run hands-on workshops to explore scientific and technical challenges.
  • Partners. We're working with university and national libraries, open publishers, and others.

So I've put this and more into the video. [3] This tells you what we are going to do and with whom. And I'll explain the detail of what we are going to do in a future post.


[1] Read and laugh, then weep. You cannot publish photos of the Eiffel Tower taken at night....

[2] Licensing effectively means that the publishers have complete control over who, when, where, and how anyone is allowed to mine content (and we have seen Elsevier forbidding Chris Hartgerink to do research without their permission; see earlier blog posts).

[3] It's a non-trivial amount of work. Approximately 1 PMR-day per minute of final video. It took time for the narrative to evolve (thanks to Jenny Molloy and Richard Smith-Unna for the polar bear theme). And it's CC-BY.


Posted in Uncategorized | Leave a comment

Content-mining; Why do Universities agree to restrictive publisher contracts?

[I published a general blog about the impasse between digital scholars and the Toll-Access publishers. This is followed by a series of detailed posts which look at the details and consequences.

This is the second.]

If you have read these earlier posts you will know that the issue is whether I and others are allowed to use machines to read publications we have legal access to read with our eyes.

The (simplified) paradigm for Content-mining scholarly articles consists of:

  • finding links to papers (articles) we may be interested in (“crawling”). The papers may be on publishers' web sites (visible or behind a paywall) or in repositories (visible). Most of this relates to paywalled articles.

  • downloading these papers from (publisher) servers onto local machines (clients). (“scraping”). If paywalled this requires paid access (subscription) which is only available to members of the subscribing institution. Thus I can read thousands of articles to which Cambridge University has a subscription.

  • running software to extract useful information from the papers (“mining”). This information can be chunks of the original or reworked material.

  • (for responsible scientists – including me) publishing the results in full.

This is technically possible. Messy, if you start from scratch, but we and others have created Open Source tools and services to help.
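The four steps above can be sketched as a pipeline. This is an illustrative sketch only – the function names, URLs, and return shapes are invented for the example and are not the actual ContentMine tools:

```python
# Hypothetical sketch of the four-step mining pipeline: crawl -> scrape -> mine -> publish.
# All names and data here are illustrative, not a real publisher or API.

def crawl(query):
    """Step 1 ('crawling'): return links to candidate papers."""
    return ["https://example.org/journal/article-1",
            "https://example.org/journal/article-2"]

def scrape(url):
    """Step 2 ('scraping'): download a paper onto the local machine."""
    return {"url": url, "html": "<html>...</html>"}

def mine(paper):
    """Step 3 ('mining'): extract useful information from the paper."""
    return {"url": paper["url"], "facts": ["species: Litoria rheocola"]}

def publish(results):
    """Step 4: publish the results in full."""
    for result in results:
        print(result["url"], result["facts"])

publish([mine(scrape(url)) for url in crawl("mistfrog")])
```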

The problem is that Toll-Access publishers don't want us to do it (or only under unworkable restrictions). So what stops us?


What follows is simplistic and IANAL (I am not a lawyer) though I talk with people who are. I am happy to be corrected by people more knowledgeable than me.

There are two main types of law relevant here:

  • Copyright law. TL;DR: any copying may infringe copyright and allow the “rights-holder” to sue. The burden of proof is lower: “However, in a civil case, the plaintiff must simply convince the court or tribunal that their claim is valid, and that on balance of probability it is likely that the defendant is guilty”. Copyright law varies between countries and can be extraordinarily complex, making it difficult to get clear answers. The simple, and sad, default assumed by many people and promoted by many vendors is that readers have no rights. (The primary method of removing these restrictions is to add a licence (such as CC-BY) which is compatible with copyright law and explicitly gives rights to the reader/user.)

  • Contract law.
    Here the purchasers of goods and services (e.g. Universities) may agree a contract with the vendors (publishers) that gives rights and responsibilities to both. In general these contracts are not publicised to users like me and may even be secret. Therefore some of what follows is guesswork. There are also hundreds of vendors and a wide variation in practice. However we believe that the main STMPublishers have roughly similar contracts.

    In general these contracts are heavily weighted in favour of the publisher. They are written by the publisher and offered to the purchaser to sign. If the University doesn't like the conditions they have to “negotiate” with the publisher. Because there is no substitutability of goods (you can't swap Nature with J. Amer. Chem. Soc.) the publisher often seems to have an advantage.

    The contracts contain phrases such as “you may not crawl our site, index it, spider it, mine it, etc.” These are introduced by the publisher to stop mining. (There is already copyright law to prevent the republishing of material without permission, so the new clauses are not required.) I queried a number of UK Universities as to what they had signed – some were constructive in their replies but many were – unfortunately – unhelpful.

    However there is no legal reason why a University has to sign the contract put in front of them. But they do, and they have signed clauses which restrict what I and Chris Hartgerink and other scientists can do. And they do it without apparent internal or external consultation.

    And this was understood by the Hargreaves reform, which specifically says that text-miners can ignore any contracts which stop them doing it. Presumably they reasoned that vendors pressure Universities into signing our rights away, and this law protects us. And, indeed, it's critically important for letting us proceed.

But this law doesn't (yet) apply to NL and so can't help Chris (except when he comes to the UK). We want it changed, and library organizations such as LIBER, RLUK, and the BL want it changed.

So this mail is to ask Universities – and I expect their libraries will answer:




And then we'll work out how to help.


Content-mining; Why do Publishers insist on APIs and forbid screen scraping?

[I published a general blog about the impasse between digital scholars and the Toll-Access publishers. This is the first of a number of posts which look at the details and consequences.]

Chris Hartgerink described how Elsevier have stopped him doing content-mining:


There is a lot of comment on both of these, to which I may refer but will not reproduce in detail. It informs my comments. The key issue is “APIs”, commented on by Elsevier's Director of Access & Policy (EDAP):

Dear Chris,

We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping. You can get access to an API key via our developer’s portal. If you have any questions or problems, do please let me know. If helpful, I am also happy to engage with the librarian who is helping you.

With kind wishes,

Dr Alicia Wise
Director of Access & Policy

The TAPublishers wish contentmining to be done through their APIs and forbid (not merely discourage) screenscraping. On the surface this may look like a reasonable request – and many of us use APIs – but there are critically important and unacceptable aspects.

What is screen scraping and what is an API?

Screen scraping simulates the action of a human reading web pages via a browser. You feed the program (ours is “quickscrape”) a URL and it will retrieve the HTML “landing page”. Then it finds links in the landing page which refer to additional documents and downloads them. If this is done responsibly (as quickscrape does) it causes no more load on the server than a human. Any publisher who anticipates large numbers of human readers has to implement software which must be robust. (I run a server, and the only time it's had problems is when I have attracted the interest of Slashdot or Reddit, which are multi-human sites.) A well-designed, polite screen scraper like “quickscrape” will not cause problems for modern sites.
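A minimal sketch of the polite-scraping idea (not quickscrape itself, which is a Node tool): collect the links from a landing page, then rate-limit the downloads so the server sees no more load than a human reader would cause. The example page and the 5-second delay are assumptions:

```python
# Sketch of a "polite" screen scraper: parse links from a landing page
# and sleep between requests. Illustrative only - not quickscrape's code.

import time
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from an HTML landing page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def collect_links(landing_page_html):
    parser = LinkCollector()
    parser.feed(landing_page_html)
    return parser.links

def polite_fetch_all(urls, fetch, delay_seconds=5):
    """Fetch each URL via the supplied `fetch` function, pausing between requests."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)  # be no heavier on the server than a human
    return results

page = '<a href="/fulltext.pdf">PDF</a> <a href="/supp.csv">Data</a>'
print(collect_links(page))  # ['/fulltext.pdf', '/supp.csv']
```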

Screen-scraping can collect a number of components from the web page. These differ for every publisher or journal, and for science this MAY include:

  • the landing page
  • article metadata (often in the landing page)
  • abstract (often in the landing page)
  • fulltext HTML
  • fulltext PDF
  • fulltext XML (often only on Open Access publishers' websites, otherwise behind paywall)
  • references (citations),
  • required files (e.g. lists of contributors, protocols)
  • supporting scientific information / data (often very large). A mixture of TXT, PDF, CSV, etc.
  • images
  • interactive data, e.g. 3D molecules

An excellent set of such files is in Acta Crystallographica journals, where the buttons represent such files.

I and colleagues at Cambridge have been screen-scraping many journals in this way for about 10 years to get crystallographic data for research and have never been told we have caused a problem. We have contributed our output to the excellent Free/Open Crystallography Open Database.

So I reject the idea that screenscraping is a problem, and regard the EDAP's argument as FUD. I say that because despite the EDAP's assertion that they are trying to help us, the reverse is true. I have spent 5 years of my life batting emails back and forth and got nowhere, and you should prepare for the same.

An API allows a browser or program to request specific information or services from a server. It's a precise software specification which should be well documented and on which the client can rely for what the server provides. At EuropePMC there is such an API, and we use it frequently in our “getpapers” tool. Richard Smith-Unna in Plant Sciences (Univ Cambridge) and in ContentMine has written a “wrapper” which issues queries to the API and stores the results.
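For comparison, here is roughly how a client like the “getpapers” wrapper forms a search against the EuropePMC REST API. This sketch only builds the query URL locally rather than performing the request; the parameter names (`query`, `pageSize`, `format`) follow EuropePMC's public REST search interface:

```python
# Build a EuropePMC REST search URL, in the manner of a "getpapers"-style client.
# Only URL construction is shown; no network request is made.

from urllib.parse import urlencode

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def europepmc_query_url(query, page_size=25, fmt="json"):
    """Return a search URL for the EuropePMC REST API."""
    params = {"query": query, "pageSize": page_size, "format": fmt}
    return BASE + "?" + urlencode(params)

url = europepmc_query_url('TITLE:"Litoria rheocola"')
print(url)
```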

When well written, and where there is a common agreement on rights, then APIs are often, but not always, a useful way to go. Where there is no common agreement they are unacceptable.

Why Elsevier and other TAPublishers' APIs are unacceptable.

There are several independent reasons why I, Chris Hartgerink, and others will not use TAPublisher APIs. This is unlikely to change unless the publishers change the way they work with researchers and accept that researchers have fundamental rights.

  • An API gives total control to the server (the publisher) and no control to the client (reader/user).

That's the simple, single feature that ultimately decides whether an API is acceptable. The only way I would use one – and the question I would urge you to consider – is:

  • is there a mutually acceptable public contract between the publisher and the researcher?

In this case, and the case of all STMPublishers, NO. Elsevier has written its own TandC. It has done this without even the involvement of the purchasing customer. I doubt that any library, any library organization, any university, or any university organization has publicly met with Elsevier or the STMPublishers' association and agreed mutually satisfactory terms.

All the rest is secondary – very important secondary, which I'll discuss. But none of this can be mended without giving researchers their rights.

Some of the consequences (which have already happened) include:

  • It is very heavily biased towards Elsevier's interests and reflects virtually nothing of the user's interests.
  • The TandC can change at any time (and do so) without negotiation
  • The API can change at any time.
  • There is no guaranteed level of software design or service. When (not if) it breaks we are expected to find and report Elsevier bugs. There is no commitment to mend them.
  • The API is designed and built by the publisher without the involvement of the researcher. Quite apart from the contractual issues, this is a known way of producing bad software.
  • The researcher has no indication of how complete or correct the process is. The server can give whatever view of the data they wish.
  • The researcher has no privacy.
  • (The researcher probably has no legal right to sign the TandC for the API – it is the University that contracts with the publisher.)
  • The researcher contracts only to publish results as CC-NC, which debars them from publishing in Open Access journals.
  • The researcher contracts not to publish anything that will harm Elsevier's marketplace. This immediately rules me out as publishing chemistry will compete with Elsevier database products.

So the Elsevier API is an instrument for control.

To summarise, an API:

  • Allows the server to control what, when, how, how much, and in what format the user can access the resource. It is almost certain that this will not fit how researchers work. For example, the Elsevier API does not serve images. That already makes it unusable for me. I doubt it serves supplemental data such as CIFs either. If I find problems with the EuropePMC API I discuss them with the European Bioinformatics Institute. If I have problems with the Elsevier API I ...
  • Can monitor all the traffic. I trust EuropePMC to behave responsibly as it has a board of governance (one that I have served on). It allows anonymity. With Elsevier I … In general no large corporate can be trusted with my data, which here includes what I did, when, and what I was looking at, and allows a complete history of everything I have done. From that, machines can work out a great deal more, and sell it to people I don't even know exist.


  • APIs can be well written or badly written. Do you, the user, have an involvement?
  • Their use can be voluntary or mandatory. Is the latter a problem?
  • Is there a guarantee of privacy and non-use of data?
  • Do you know whether the API gives the same information as the screen-scraper? (Almost certainly not – but how would you tell?)
  • What do you have to sign up to? Was it agreed by a body you trust?


APIs are being touted by Elsevier and other STMPublishers as the obvious, friendly answer to mining. In their present form, and with their present Terms and Conditions, they are completely unacceptable and very dangerous.

They should be absolutely rejected. Ask your library/university to cancel all clauses in contracts which forbid mining by scraping. They have the legal right to do so.



Content-mining; Rights versus Licences

[I intend to follow with several more detailed posts.]

Last week was a critical point for those who regard the scholarly literature as a public good, rather than a business. Those who care must now speak out, for if they do not, we shall see a cloud descend over the digital century where we are disenfranchised and living in enclosures and walled gardens run by commercial mega-corporations.

Chris Hartgerink, a statistician at the University of Tilburg NL, was using machines to read scholarly literature to do research (“content-mining”). Elsevier, the mega-publisher, contacted the University and required them to stop Chris. The University complied with the publisher and Chris is now forbidden to do research using mining without Elsevier's permission.

Some reports include:

The issues are simple:

  • Chris has rightful access to the literature and can read it with his eyes.

  • His research is serious, valuable and competent.

  • Machines can save Chris weeks of time and prevent many types of error.

What Chris has been doing has been massively resisted by mainstream “TAPublishers” [1]. This includes:

  • lobbying to reject proposed legislation (often by making it more restrictive).

  • producing FUD (“Fear Uncertainty and Doubt”) aimed at politicians, libraries and researchers such as Chris. Note that “stealing” is now commonly used in TAPublisher-FUD.

  • physically preventing mining (e.g. through CAPTCHAs).

  • Preventing mining though contractual or legal means (as with Chris).

Many of us met in The Hague last year to promote this type of new and valuable research, and wrote The Hague Declaration. A wide range of organisations and individuals, including universities, libraries, and liberal publishers, have signed. This is often represented by my phrase “The Right to Read is the Right to Mine”.

Many reformers, led initially by Neelie Kroes (European Commissioner till 2014) and now by Julia Reda (MEP) have pushed for reforms of copyright to allow and promote mining. The European Parliament and the Commission have produced in-depth proposals for liberalising European law.

The reality is that reformers and the Publishers have little common ground on mining. Reformers are campaigning for their Rights; TAPublishers are trying to prevent this. This is often encapsulated in additional mining “Licences” proposed by TAPublishers, epitomised by the STMPublisher-lobbied “Licences for Europe” proposed in 2013 in Commission discussions, which broke down completely because the reformers were not prepared to accept licences in place of rights.

The TAPublishers are trying to coerce the scholarly and wider community into accepting Licences; we are challenging this by asserting our Rights.

Unfettered Access to Knowledge is as important in the Digital Century as food, land, water, and slavery have been over the millennia.

The issue for Chris and others is:

  • Can I read the literature I have access to

    1. in the way I want,

    2. for any legal purpose

    3. using machines when appropriate

    4. without asking for further permission

    5. or telling corporations who I am and what I am doing

    6. and publishing the results in the open literature without constraints

Chris has the moral right to do 1-6, but not the legal right, because the TAPublishers have added restrictions to the subscription contracts, and his University has signed them. He is therefore (probably) bound by NL contract law.

In the UK the situation is somewhat better. Last year a copyright Exception was enacted which allows me to do much of this. Point (2) has to be for “non-commercial research”, and (6) is only permissible if I don't break copyright in publishing the results. So I can do something useful (although not nearly as much as I want to do, and as responsible science requires). I know also that I will have constant opposition from publishers, probably including lobbying of my institution.

European reformers are pushing for a similar legal right in Europe and many propose removing the “non-commercial” clause. There is MASSIVE opposition from publishers, primarily through lobbying, where key politicians and staff are constantly fed the publishers' story. There is no public forum (such as a UK Select Committee) where we can show the fallaciousness of TAPublisher arguments. (This is a major failing of European democracy – much of it happens in personal encounters with unelected representatives who have no formal responsibility to the people of Europe.) The fight – and it is a fight – is therefore hugely asymmetric. If we want to represent our views we have to travel to Brussels at our own expense – TAPublishers have literally billions.

The issue is RIGHTS (not APIs, not bandwidth, not cost, not FUD about server-load, not convenience)


I hope you feel that this is the time to take a stand.

What can we do?

Some immediate and cost-free tasks:

  • Sign the Hague Declaration. Very few European universities / libraries have so far done so

  • Write to your MEP. Feel free to take this mail as a basis, but personalise it

  • Write to Commissioner Oettinger (“Digital Single Market”)

  • Write to your University and your University Library. Use Freedom of Information to require that they reply. Challenge the current practice

  • Alert your learned society to the muzzling of science and scholarship.

  • Alert organizations who are campaigning for Rights in the Digital age.

  • Tweet this post, and push for retweets

And think about what ContentMining could do for you. And explore with us.

And what are PMR and colleagues going to do?

Because I have the legal right to mine the Cambridge subscription literature for non-commercial purposes, I and colleagues are going to do that. Ross Mounce and I have already shown that totally new insights are possible. We've developed a wide range of tools and we'll be working on our own research and also with the wider research community in areas that we can contribute to.

[1] There is a spectrum of publishing ranging from large conventional, often highly profitable, publishers through learned societies to new liberal startups of the last 10 years. I shall use “TAPublisher” (TollAccess publisher) to refer to publishers such as (but not limited to) Elsevier, Wiley, Springer, Macmillan, and Nature. They are represented by an association (the STM Publishers Association) which effectively represents their interests and has been active in developing and promoting licences.


Extracting 100 million facts from the Scientific literature -1

In TheContentMine, funded by the Shuttleworth Foundation, we aim to extract 100 million facts from the scientific literature. That's a large number so here's our thinking....

What is a fact?

Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA


"Fact" means slightly different things to philosophers, lawyers, scientists and advertisers. Wikipedia describes facts in science,

a fact is a repeatable careful observation or measurement (by experimentation or other means), also called empirical evidence. Facts are central to building scientific theories.


a scientific fact is an objective and verifiable observation, in contrast with a hypothesis or theory, which is intended to explain or interpret facts.

In ContentMine we highlight the "objective", i.e. people will agree that they are talking about the same things in the same way. We concentrate on facts reported in scientific papers, and my colleague Ross Mounce showed (in "Daily updates on IUCN Red List species") some excellent examples about a mistfrog [1]. Here are some examples he quoted:

  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • we tracked frogs using harmonic direction finding [32,33].
  • individuals move along and at right angles to the stream
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C). [Figure]

All of the above, including the components of the graph, are FACTS. They have the following features:

  • they are objective. They may or may not be "true" - another author might dispute the sizes of the frogs or where they live - but the authors have stated them as facts.
  • they can be represented in formal representations without losing meaning or precision. There are normally very few different ways of representing such facts. "Alauda arvensis sing on the wing" is a fact. "Hail to thee blithe spirit, bird thou never wert" is not a fact.
  • they are uncopyrightable. We contend that all the facts we extract are uncopyrightable statements and therefore release them as CC0.

How do we represent facts? Generally they are a mixture of simple natural language statements and formal specifications. "A brown frog that lives in Queensland" is adequate; "L. rheocola. colour: brown; habitat: Queensland" says the same, slightly more formally.  Formal language is useful for us as it's easier to extract. The form:

object: name;    property1: value1 ; property2: value2

is very common and very useful. Often it's put in a table, graph or diagram. Transforming between these is one of the strengths of ContentMine software. The box plots could be put in words: "In winter in Windin Creek between 0 and 12% of the frogs had body temperatures below 15 Celsius", but the plot may be more useful to some scientists (note the redundancy!).
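A minimal sketch of how the `object: name; property1: value1` form can be parsed into a structured record (illustrative code, not our actual extraction software):

```python
# Parse the semi-formal "key: value; key: value" fact form into a dictionary.
# Illustrative sketch only.

def parse_fact(text):
    """Parse 'key: value' pairs separated by semicolons into a dict."""
    fact = {}
    for pair in text.split(";"):
        if ":" in pair:
            key, value = pair.split(":", 1)  # split on the first colon only
            fact[key.strip()] = value.strip()
    return fact

fact = parse_fact("object: L. rheocola; colour: brown; habitat: Queensland")
print(fact)  # {'object': 'L. rheocola', 'colour': 'brown', 'habitat': 'Queensland'}
```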

So the scientific observations – temperatures, locations, dates – are all Facts. The sentence contains seven factual concepts: winter, Windin Creek, 0%, 12%, L. rheocola, body temperature, and <15°C. In ContentMine we refer to all of these as "Facts". Perhaps more formally we might say:
section-i/para-j/sentence-k in doi:10.1371/journal.pone.0127851 contains Windin Creek

section-i/para-j/sentence-k in doi:10.1371/journal.pone.0127851 contains L. rheocola

Those who like RDF (we sometimes use it) may regard these as triples (document-component contains entity). In a similar manner, linked data as in Wikidata should be regarded as Facts (which is why we are working with Wikidata to export extracted facts there).

How many facts does a scientific paper contain?

Every entity in a document is a Fact: each author, each species, each temperature, date, colour. A graph may have 100 facts, or 1000. Perhaps 100 facts per page? A 10-page paper might have 1000 facts. Some chemistry papers have 300 pages of supporting information. So if we read 1 million papers we might get a billion facts – our estimate of 100 million is not hyperbole.
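The arithmetic behind that estimate, using the assumed figures above (~100 facts per page, ~10 pages per paper, 1 million papers):

```python
# Back-of-envelope estimate of total extractable facts,
# using the assumed figures from the text.

FACTS_PER_PAGE = 100
PAGES_PER_PAPER = 10
PAPERS = 1_000_000

facts_per_paper = FACTS_PER_PAGE * PAGES_PER_PAPER  # 1,000 facts per paper
total_facts = facts_per_paper * PAPERS              # 1,000,000,000 in total

print(f"{total_facts:,} facts")  # 1,000,000,000 facts
```

On these assumptions the 100-million target is only a tenth of what a million papers could yield.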


[1] reported in PloSONE (Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851).


Should Wikipedia work with Elsevier?

[This story has erupted in the last 2 days - if it had been earlier I would have covered it in my talk to Wikipedia Science.]

[TL;DR. Elsevier has granted accounts to 45 top editors at Wikipedia so they can read closed-access publications as part of their editing. I strongly oppose this and say why. BTW I consider myself a committed Wikipedian.]

Glyn Moody in Ars Technica has headlined:

“WikiGate” raises questions about Wikipedia’s commitment to open access

Glyn mailed me for my opinion and the piece, which is accurate, also highlights Michael Eisen's opposition to the new move. I'll cut and paste large chunks and then add additional comment.

Scientific publisher Elsevier has donated 45 free ScienceDirect accounts to "top Wikipedia editors" to aid them in their work. Michael Eisen, one of the founders of the open access movement, which seeks to make research publications freely available online, tweeted that he was "shocked to see @wikipedia working hand-in-hand with Elsevier to populate encylopedia w/links people cannot access," and dubbed it "WikiGate." Over the last few days, a row has broken out between Eisen and other academics over whether a free and open service such as Wikipedia should be partnering with a closed, non-free company such as Elsevier.

Eisen's fear is that the free accounts to ScienceDirect will encourage Wikipedia editors to add references to articles that are behind Elsevier's paywall. When members of the public seek to follow such links, they will be unable to see the article in question unless they have a suitable subscription to Elsevier's journals, or they make a one-time payment, usually tens of pounds for limited access.

Eisen went on to tweet: "@Wikipedia is providing free advertising for Elsevier and getting nothing in return," and that, rather than making it easy to access materials behind paywalls, "it SHOULD be difficult for @wikipedia editors to use #paywalled sources as, in long run, it will encourage openness." He called on Wikipedia's co-founder, Jimmy Wales, to "reconsider accommodating Elsevier's cynical use of @Wikipedia to advertise paywalled journals." His own suggestion was that Wikipedia should provide citations, but not active links to paywalled articles.

Agreed. It is not only providing free advertising but, worse, it implicitly legitimizes Elsevier's control of the scientific literature. Rather than making it MORE accessible to the citizens of the world, it makes it LESS.

Eisen is not alone in considering the Elsevier donation a poisoned chalice. Peter Murray-Rust is Reader Emeritus in Molecular Informatics at the University Of Cambridge, and another leading campaigner for open access. In an email to Ars, he called the free Elsevier accounts "crumbs from the rich man's table. It encourages a priesthood. Only the best editors can have this. It's patronising, ineffectual. And I wouldn't go near it."

This arbitrary distinction between the 45 top editors and everyone else is seriously divisive. Even if this were a useful approach (it isn't), why should Elsevier decide who can, and who can't, be a top Wikipedia editor? Wikipedia has rightful concerns about who edits and how editors are "appointed" - it's meritocratic and, though imperfect, any other solution (cf. Churchill on democracy) is worse.

You may think I am overreacting - that Elsevier will behave decently and collaboratively. I've spent 6 years trying to "negotiate" with Elsevier about Content Mining - and it's one smokescreen after another. They want to develop and retain control over scholarship.

And I have additional knowledge. I've been campaigning for reform in Europe (including the UK) and everywhere the publishers are fighting us. Elsevier wants me and my collaborators to "licence" the right to mine - these licences are designed to make Elsevier the central control. I would strongly urge any Wikipedian to read the small print and then run a mile.

This isn't the first time that Wikipedia has worked closely with a publisher in this way. The Wikipedia Library "helps editors access reliable sources to improve Wikipedia." It says that it supports "the broader move towards open access," but it also arranges Access Partnerships with publishers: "You would provide a set number of qualified and prolific Wikipedia editors free access to your resources for typically 1 year." As Wikipedia Library writes: "We also love to collaborate on social media, press releases, and blog posts highlighting our partnerships."

It is that cosy relationship with publishers and their paywalled articles that Eisen is concerned about, especially the latest one with Elsevier, whom he described in a tweet as "#openaccess's biggest enemy." Eisen wrote: "it is a corruption of @Wikipedia's principles to get in bed with Elsevier, and it will ultimately corrupt @Wikipedia." But in a reply to Wikipedia Library on Twitter, Eisen also emphasised: "don't get me wrong, i love @wikipedia and i totally understand everything you are doing."

Murray-Rust was one of the keynote speakers at the recent Wikipedia Science Conference, held in London, which was "prompted by the growing interest in Wikipedia, Wikidata, Commons, and other Wikimedia projects as platforms for opening up the scientific process." The central question raised by WikiGate is whether the Wikipedia Library project's arrangements with publishers like Elsevier that might encourage Wikipedia editors to include more links to paywalled articles really help to bring that about.

Elsevier and other mainstream publishers have no intention of major collaboration, nor of releasing the bulk of their material to the world. Witness the 35-year-old paper that predicted that Ebola could break out in Liberia. It's still behind an Elsevier paywall.

[These problems aren't confined to Elsevier; many of the major publishers do similar things to restrict the flow of knowledge. When it appeared that ContentMining might become a reality, Wiley added CAPTCHAs to its site to prevent it. But Elsevier is the largest and most unyielding publisher, often taking the lead in devising restrictions, and so it gets most coverage.]

Wikimedian Martin Poulter, who is the organiser of the Wikipedia Science Conference, has no doubts. In an email, he told Ars: "Personally, I think the Wikipedia Library project (which gives Wikipedia editors free access to pay-walled or restricted resources like Science Direct) is wonderful. As a university staff member, I don't use it myself, but I'm glad Wikipedians outside the ivory towers get to use academic sources. Wikipedia aims to be an open-access summary of reliable knowledge—not a summary of open-access knowledge. The best scholarly sources are often not open-access: Wikipedia has to operate in this real world, not the world we ideally want."

The debate will continue publicly in Wikip/media. That's good.

The STM publishers, Rightslink, and similar organisations are lobbying politicians and librarians to prevent the liberation of knowledge. That must be fought every day.



Wikipedia and Wikidata. Massive Open resources for Science.

I think Wikipedia is a wonderful creation of the XXIst Century, the Digital Enlightenment. It has arisen out of the massive cultural change enabled by digital freedom - the technical ability for over half the world (and hopefully soon almost all) to read and write what they want.

I was invited to give the plenary lecture at Wikipedia Science - a new venture, and one which was wonderfully successful. Here's me, promoting "The Right to Read is The Right to Mine".


I'm not sure whether there is a recording or transcript - I'd certainly value them as I don't read prepared speeches.

My theme was that Wikidata - one of the dozen major sections of Wikimedia - should be the first stopping place for people who want to find and re-use scientific data. That doesn't mean that WD necessarily contains all the data itself, but it will have structured, validated links to where the data can be found.

Here are my slides, which contain praise for Wikim/pedia, the problems of closed information, and the technology of liberating it through ContentMining. In ContentMine we are mining the daily literature for science, and Wikidata will be one of the places where we shall record the results.

One of the great aspects of Wikipedia is that it has an Open approach to governance. Last year at Wikimania I was impressed by the self-analysis of Wikipedia - how can we run a distributed, vibrant, multicultural, multidisciplinary organisation? If anyone can find the answer it's Wikimedia.

But running societies has never been and never will be easy. People will always disagree about what is right and what is wrong; what will work and what won't.

And that's what the next post is about. Wikipedia has embarked on a collaboration with Elsevier to read the closed literature. Many people think it's a good way forward. Others like Michael Eisen and I think it's a dereliction of our fundamental values.

It's healthy that we debate this loudly in public. During that process we may lose friends and make new ones, but we advance our communal processes.

What's supremely unhealthy is that large closed monopolistic capitalist organisations make decisions in private, colluding with governments, to constrain and control the Digital Enlightenment.





Announce: Microbial Supertree through ContentMining

I haven't blogged for some time as I have been writing Liberation Software (software to make knowledge and people free). Now we (Ross Mounce, Matt Wills and I) have got our first significant scientific result - a supertree:


I am going to leave Ross the opportunity to blog this in detail - he was hacking on this late last night - so here's a brief overview:

For every new microorganism it's obligatory to compare it with other organisms in an evolutionary (phylogenetic) tree. Here's a typical one (don't be frightened - everyone can understand this if they are familiar with evolutionary ideas). The image was published in International Journal of Systematic and Evolutionary Microbiology (2009), 59, 972–980, DOI 10.1099/ijs.0.000364-0.



[I have added "root" and magnified some of it].

There are 31 microorganisms (mainly bacteria) listed in the middle. Each has a binomial (scientific) name (Pyramidobacter piscolens), a strain identifier (W5455T), and an identifier in an RNA database (e.g. EU379932). The lines represent a "tree" with its root (not shown) at the left-hand side and the presumed divergence of the species. It's certainly a useful classification; you can debate whether it's a useful historical model of the actual evolution over many millions of years. Thus it says that Pyramidobacter piscolens is closely related to Jonquetella anthropi and much more distantly related to Escherichia coli, a bacterium in everybody's gut.
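The three-part structure of each leaf label (binomial name, strain, RNA accession) is what makes the extracted trees checkable later. A minimal sketch of parsing such a label - the regex is illustrative only, not ContentMine's actual code:

```python
import re

# Hypothetical sketch: split a tree-leaf label of the form seen in the
# figure into binomial name, strain identifier, and RNA-database accession.
LABEL = re.compile(
    r"(?P<genus>[A-Z][a-z]+)\s+"    # genus, e.g. Pyramidobacter
    r"(?P<species>[a-z]+)\s+"       # species epithet, e.g. piscolens
    r"(?P<strain>\S+)\s+"           # strain identifier, e.g. W5455T
    r"(?P<accession>[A-Z]{2}\d+)"   # RNA accession, e.g. EU379932
)

m = LABEL.match("Pyramidobacter piscolens W5455T EU379932")
print(m.group("genus"), m.group("strain"), m.group("accession"))
# Pyramidobacter W5455T EU379932
```

Having both the name and the accession for each leaf is what allows the cross-checks described below.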

Each paper provides one such tree - which could take significant amounts of computation (often hours, depending on the strictness). What we have done - and this is a strength of Matt's group - is to bring thousands of such trees together. They weren't calculated with this in mind, so we are adding value by extracting them from the literature and making comparisons and aggregations.

Ross downloaded about 4,300 papers and the trees in them. I wrote the software to extract trees from the images. This is not trivial - the images are made of pixels, with no explicit lines or characters or words, and this research is full of heuristics. We can't always distinguish "O" (the letter) from "0" (zero), so there will be an unavoidable percentage of garbles.

BUT we have ways of detecting and correcting these ("cleaning") and the most valuable are:

  • comparing the scientific name with the RNA ID
  • looking up the name in the NCBI's Taxdump (a list of all biomedical species)

Ross has developed several methods of cleaning and we are reasonably confident that the error rate in species is no worse than 1 in 1000. (Note, by the way, that in a sibling image the authors have made a misprint: "Optiutus" should be "Opitutus". So the primary literature also contains errors.)
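One such cleaning heuristic can be sketched as follows: try the common OCR confusions (O/0, l/1) and accept any variant that appears in a reference name list. This is a minimal illustration, with a tiny set standing in for NCBI's Taxdump; the function names are hypothetical, not our actual code:

```python
from itertools import product

# Common OCR character confusions, in both directions.
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1"}

def ocr_variants(name):
    """Yield every spelling with each confusable character optionally swapped."""
    options = [(c, CONFUSIONS[c]) if c in CONFUSIONS else (c,) for c in name]
    for combo in product(*options):
        yield "".join(combo)

def clean(name, known_species):
    """Return a variant of name found in known_species, else None."""
    for variant in ocr_variants(name):
        if variant in known_species:
            return variant
    return None  # unresolved - flag for manual checking

taxdump = {"Opitutus terrae"}           # stand-in for the real Taxdump list
print(clean("0pitutus terrae", taxdump))  # Opitutus terrae
```

In practice one would also cross-check the corrected name against the RNA accession on the same leaf, the second heuristic in the list above.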

Everything we do is Open Notebook. We post stuff as soon as it is ready. We store it on Github (see above link, which has all 4300 trees), discuss it on ContentMine's Discourse ( - you can see that every detail is made open), software in ( and many other repos, fully Open and often updated several times a day), and where Ross will be blogging it.

I hope to write much more frequently now.







Postal voting UK style


Whenever there is a national, international or local vote I feel it's my duty to vote. My grandmother was a suffragette and their actions won the right for today's British women to vote. (She didn't go to jail because she had small children and the suffragettes therefore wouldn't allow her to be put in the front line.). Universal suffrage represents one of the main struggles of the C20 - and it's still going on today.

So even if you think your vote makes no difference, the very fact of voting in the UK supports the disadvantaged elsewhere.

I take the act of voting seriously and dress up for the occasion - see my voting suit above. The suit came about because it was unclear (and still is) whether a voter has to show their face. In fact every time I have voted as a bear I have been asked to remove my head and have done so.

This year I can't vote in person so I applied for a postal vote. I have found the process archaic and arcane. I am sure that many voters don't use a postal vote because of the hassle.

  • Step 1. Find out how to do it.
  • Step 2. Send a postal request for a voting form. This cost me a first-class stamp, which seems absurd - why can we not ask for a form online? Anyway, I sent the form off and heard nothing. I tweeted my MP (Julian Huppert) and he agreed that I should have heard by then. When I rang up this Monday I was told the forms had been sent out last Friday; they finally arrived on Wednesday. So it took 3 (or 4) working days for the forms to travel 2 miles (3 km). Relying on the Royal Mail to provide "next day" delivery for first-class local post seems broken.
  • Step 3. I filled in the forms for the National and Local elections. How did I vote? [see below]. I then dressed up to vote today (2015-05-01), went to my local post-box, and put the envelope in it. It has 4 working days to get to the Guildhall (3 km distant). I will never know whether it arrived in time; it's possible I may be disenfranchised by slow postage.

So the system must be changed. It should be possible to get a voting form instantaneously. I am not arguing for electronic voting, but it should be possible to know that your vote has arrived and will go into the counting process.

So who did I vote for?

The party system is so broken in the UK that it's impossible to vote for a party. I used to know what Labour, the Conservatives and the Liberals stood for. Now I don't. Blair betrayed the system for ever - with the implied slogan "Trust me, I know what's best". The current leaders are pathetic: they are trying to find small gaps in their opponents' policies and add sweeties for the electorate. "No new income tax for 5 years" - if that was so important it should have been in the manifesto. They are simply gibbering, and most people who think about the issues have long ago given up. Clegg has destroyed the Lib Dems - left them with no solid bedrock.

So I don't vote for parties. I vote for people. That's on the basis that a responsible representative will recognize when policy is so far adrift that it has to be challenged. (Yes, you have spotted that I am an idealist). It makes it very difficult in Europe because you can only vote for parties.

Politics is carried out 365 days a year. Many issues are not party-political but require independent, committed analysis and also hard work. There is good politics in the UK, in Europe, and in the US (I don't have first-hand knowledge of most countries). There is also awful politics in all of them, and it's that that I am fighting.

So you will have to guess who I voted for.



Is Figshare Open? "it is not just about open or closed, it is about control"

[Quote in title is from Mark Hahnel, see below]

I have been meaning to write on this theme for some time, and more generally on DigitalScience's growing influence in parts of the academic infrastructure. This post is sparked by a twitter exchange (follow backwards from ) in the last few hours, which addresses the question of whether "Figshare is Open".

This is not an easy question and I will try to be objective. First let me say - as I have said in public - that I have huge respect and admiration for how Mark Hahnel created Figshare while a PhD student. It's a great idea and I am delighted - in the abstract - that it gained so much traction so rapidly.

Mark and I have discussed issues of Figshare on more than one occasion and he's done me the honour of creating a "Peter Murray-Rust" slide ( ) where he addresses some (but not all) of my concerns about Figshare after its "acquisition" by Macmillan Digital Science (I use this term, although there are rumours of a demerger or merger). I use "acquisition" because I have no knowledge of the formal position of Figshare as a legal entity (I assume it *is* one? Figshare FAQs ) and that's one of the questions to be addressed here.

From the FAQs:

figshare is an independent body that receives support from Digital Science. "Digital Science's relationship with figshare represents the first of its kind in the company's history: a community based, open science project that will retain its autonomy whilst receiving support from the division."

However, the Digital Science website lists Figshare among "our products" and brands it as if it were a DigitalScience division or company. Figshare appears to have no corporate address other than Macmillan's and I assume trades through them.

So this post has been catalysed by a tweet of a report from a DS employee(?), Dan Valen.

John Hammersley @DrHammersley tweeted:
Such a key message: "APIs are essential (for #opendata and #openscience)" - Dan Valen of @figshare at #shakingitup15

This generated a twitter exchange about why APIs were (or were not) essential. I shan't explore that in detail, but my primary point is that:

If the only access to data is through a controlled API, then the data as a whole cannot be open, regardless of the openness of individual components.

There is no doubt that some traditional publishers see APIs as a way of enforcing control over the user community. Readers will remember that I had a robust discussion with Gemma Hersh of Elsevier, who stated that I could not legally mine Elsevier's data without going through their API. She was wrong, categorically wrong, but it was clear that she and Elsevier saw, and probably still see, APIs as a control mechanism. Note that Elsevier's Mendeley never exposed their whole data - only an API.

An API is the software contract with a webserver offering a defined service. It is often accompanied by a legal contract for the user (with some reciprocity). The definition of that service is completely in the hands of the provider, and so is its control. This leads to the following technical possibilities:

  • control: The provider can decide what to offer, when, to whom, and on what basis. They can vary this by date, geography or IP of user, and I have no doubt that many publishers do exactly this. In particular, there is no guarantee that the user is able to see the whole data and no guarantee that it is not modified in some way from the "original". This is not, per se, reprehensible but it is a strong technical likelihood.
  • monitoring: ("snooping") The provider can monitor all traffic coming in from IP addresses, dwell times, number of revisits, quite apart from any cached information. I believe that a smart webserver, when coupled to other data about individuals, can deduce who the user is, where they are calling from and, with the sale of information between companies, what they have been doing elsewhere.

By default companies will do both of these. They could lead to increased revenue (e.g. Figshare could sell user data to other organizations) and increased lockin of users. Because Figshare is one of several Digital Science products (DS words, not mine) they could know about a user's publication record, their altmetric activity, what manuscripts they are writing, what they have submitted to the REF, what they are reading in their browser, etc. I am not asserting this is happening but I have no evidence it is not.

Mark says, in his slides,

"it is not just about open or closed, it is about control"

and I agree. But for me the question is: who controls Figshare? And is Figshare controlling us?

Figshare appears to be one of the less transparent organizations I have encountered. I cannot find a corporate structure, and the company's address is:

C/o Macmillan Publishers Limited, Brunel Road, Basingstoke, Hampshire, RG21 6XS

I can't find a board of directors or any advisory or governing board. So in practice Figshare is legally responsible to no-one other than UK corporate law.

You may think I am being unfair to an excellent (and I agree it's excellent) service. But history inexorably shows that such beginnings become closed, mutating into commercial control and confidentiality. What if Mark moves on - who runs Figshare then? What if Springer buys Digital Science? What contract has Mark signed with DS? Maybe it binds Figshare to being completely run by the purchaser?

I have additional concerns about the growing influence of DigitalScience products, especially ReadCube, which amplifies the potential for "snoop and control" - I'll leave those to another blogpost.

Mark has been good enough to answer some of my original concerns, so here are some others to which I think an "open" ("community-based") organization should be able to provide answers.

  • Who owns Figshare?
  • Who runs Figshare?
  • Is there any governance process from outside Macmillan/DS? An advisory board?
  • How tightly bound is Figshare into Macmillan/DS? Could Figshare walk away tomorrow?
  • What could and what would happen to Figshare if Mark Hahnel left?
  • What could and what would happen to Figshare if either or both of Macmillan and DS were acquired?
  • Where are the company accounts for the last trading year?
  • How, in practice, is Figshare "a community based, open science project that will retain its autonomy whilst receiving support from the (DS) division"?

I very much hope that the answers will allay any concerns I may have had.


