Content-mining; Why do Universities agree to restrictive publisher contracts?

[I published a general blog post about the impasse between digital scholars and the Toll-Access publishers: /pmr/2015/11/22/content-mining-rights-versus-licences/ . This is followed by a series of detailed posts which look at the details and consequences.

This is the second.]

If you have read these earlier posts you will know that the issue is whether I and others are allowed to use machines to read publications we have legal access to read with our eyes.

The (simplified) paradigm for Content-mining scholarly articles consists of:

  • finding links to papers (articles) we may be interested in (“crawling”). The papers may be on publishers’ web sites (visible or behind a paywall) or in repositories (visible). Most of this discussion relates to paywalled articles.

  • downloading these papers from (publisher) servers onto local machines (clients) (“scraping”). If the papers are paywalled, this requires paid access (a subscription), which is only available to members of the subscribing institution. Thus I can read thousands of articles to which Cambridge University has a subscription.

  • running software to extract useful information from the papers (“mining”). This information can be chunks of the original or reworked material.

  • (for responsible scientists – including me) publishing the results in full.

This is technically possible. Messy, if you start from scratch, but we and others have created Open Source tools and services to help.
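
As a rough illustration of the four steps listed above, here is a minimal Python sketch. It is not the Open Source tooling referred to in this post, and every name in it (`crawl`, `scrape`, `mine`, `publish`, `FAKE_SITE`, the toy pattern) is a hypothetical assumption made for the example. To keep it self-contained it never touches a real server: the "publisher site" is a dictionary, and the download step takes whatever fetch function it is given.

```python
import json
import re
from html.parser import HTMLParser
from typing import Callable, Dict, List


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on one index page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(index_html: str) -> List[str]:
    """Step 1 ("crawling"): find links to papers on an index page."""
    collector = LinkCollector()
    collector.feed(index_html)
    return collector.links


def scrape(links: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Step 2 ("scraping"): download each paper with the supplied fetcher."""
    return {link: fetch(link) for link in links}


def mine(papers: Dict[str, str], pattern: str) -> Dict[str, List[str]]:
    """Step 3 ("mining"): extract every match of a pattern from each paper."""
    regex = re.compile(pattern)
    return {link: regex.findall(text) for link, text in papers.items()}


def publish(results: Dict[str, List[str]]) -> str:
    """Step 4: publish the results in full, here as a JSON string."""
    return json.dumps(results, indent=2, sort_keys=True)


# Hypothetical stand-in for a publisher site (no network access involved).
FAKE_SITE = {
    "/paper/1": "We measured melting points of 120 C and 95 C.",
    "/paper/2": "No temperatures are reported here.",
}
INDEX = '<a href="/paper/1">One</a> <a href="/paper/2">Two</a>'

links = crawl(INDEX)
papers = scrape(links, FAKE_SITE.__getitem__)
results = mine(papers, r"\d+ C\b")  # toy pattern: Celsius-like values
report = publish(results)
print(report)
```

In real use the `fetch` argument would be a function that retrieves a page over an institution's subscription access, which is exactly the step that the contracts discussed below try to restrict.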

The problem is that Toll-Access publishers don’t want us to do it (or only under unworkable restrictions). So what stops us?


What follows is simplistic and IANAL (I am not a lawyer) though I talk with people who are. I am happy to be corrected by people more knowledgeable than me.

There are two main types of law relevant here:

  • Copyright law. TL;DR: any copying may infringe copyright and allow the “rights-holder” to sue. The burden of proof is lower than in a criminal case: “However, in a civil case, the plaintiff must simply convince the court or tribunal that their claim is valid, and that on balance of probability it is likely that the defendant is guilty”. Copyright law varies between countries and can be extraordinarily complex, so it is difficult to get clear answers. The simple, and sad, default assumed by many people and promoted by many vendors is that readers have no rights. (The primary method of removing these restrictions is to add a licence (such as CC-BY) which is compatible with copyright law and explicitly gives rights to the reader/user.)

  • Contract law.
    Here the purchasers of goods and services (e.g. Universities) may agree a contract with the vendors (Publishers) that gives rights and responsibilities to both. In general these contracts are not publicised to users like me and may even be secret. Therefore some of what follows is guesswork. There are also hundreds of vendors and a wide variation in practice. However, we believe that the main STM publishers have roughly similar contracts.

    In general these contracts are heavily weighted in favour of the publisher. They are written by the publisher and offered to the purchaser to sign. If the University doesn’t like the conditions they have to “negotiate” with the publisher. Because there is no substitutability of goods (you can’t swap Nature with J. Amer. Chem. Soc.) the publisher often seems to have an advantage.

    The contracts contain phrases such as “you may not crawl our site, index it, spider it, mine it, etc.” These are introduced by the publisher to stop mining. (There is already copyright law to prevent the republishing of material without permission, so the new clauses are not required.) I queried a number of UK Universities as to what they had signed – some were constructive in their replies but many were, unfortunately, unhelpful.
    However there is no legal reason why a University has to sign the contract put in front of them. But they do, and they have signed clauses which restrict what I and Chris Hartgerink and other scientists can do. And they do it without apparent internal or external consultation.

    And this was understood by the Hargreaves reform, which specifically says that text-miners can ignore any contracts which stop them from mining. Presumably they reasoned that vendors pressure Universities into signing our rights away, and this law protects us. And indeed it’s critically important for letting us proceed.

But this law doesn’t (yet) apply to the Netherlands (NL) and so can’t help Chris (except when he comes to the UK). We want it changed, and library organizations such as LIBER, RLUK, BL etc. want it changed.

So this mail is to ask Universities – and I expect their libraries will answer:




And then we’ll work out how to help.


4 Responses to Content-mining; Why do Universities agree to restrictive publisher contracts?

  1. Henry Rzepa says:

    For many of our licences, these are negotiated by Jisc and we use the Jisc Model Licence:
    So perhaps JISC should answer these questions as well as librarians?

    • pm286 says:

      Thanks very much Henry,
      I’ve read this and there seems to be nothing that stops content-mining. Indeed there is a clause that allows text- and data-mining and the posting of the output as long as it doesn’t infringe copyright. I think I would be largely satisfied (from a mining point of view) if JISC licences were applied universally.

  2. I’d like to offer a theory of why libraries are signing away content-mining rights
    Let’s say that the library wants to subscribe to the Journal of X Studies because 80% of the library’s users want to read the content. Let’s say that an additional 15% of the users want to be able to mine the content and are willing to forgo access entirely if the publisher doesn’t agree to grant content-mining rights. (The remaining 5% have no interest in this particular journal.) If you’re the librarian, you are choosing between serving the 80% and the 15%.
    The situation is even more complicated if access to that journal only comes through a package deal for which the publisher is not willing to compromise on content-mining rights. Will the library hold up access to all of that content for reading just to serve a smaller population of users interested in content mining?

