Content-mining; Why do Publishers insist on APIs and forbid screen scraping?

[I published a general blog post about the impasse between digital scholars and the Toll-Access publishers /pmr/2015/11/22/content-mining-rights-versus-licences/ . This is the first of a number of posts which look at the details and consequences]

Chris Hartgerink described how Elsevier have stopped him doing content-mining: http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/

and

http://onsnetwork.org/chartgerink/2015/11/20/why-elseviers-solution-is-the-problem/

There is a lot of comment on both of these, to which I may refer but will not reproduce in detail; it informs my comments. The key issue is “APIs”, addressed in this comment from Elsevier’s Director of Access & Policy (EDAP):

Dear Chris,

We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping. You can get access to an API key via our developer’s portal (http://dev.elsevier.com/myapikey.html). If you have any questions or problems, do please let me know. If helpful, I am also happy to engage with the librarian who is helping you.
With kind wishes,
Alicia
Dr Alicia Wise
Director of Access & Policy
Elsevier
a.wise@elsevier.com
@wisealic

The TAPublishers wish content-mining to be done through their APIs and forbid (not merely discourage) screen scraping. On the surface this may look like a reasonable request – and many of us use APIs – but there are critically important and unacceptable aspects.

What is screen scraping and what is an API?

Screen scraping simulates the action of a human reading web pages via a browser. You feed the program (ours is “quickscrape”) a URL and it retrieves the HTML “landing page”. Then it finds links in the landing page which refer to additional documents and downloads them. If this is done responsibly (as quickscrape does) it causes no more load on the server than a human reader. Any publisher who anticipates large numbers of human readers has to implement server software which must be robust. (I run a server, and the only time it has had problems is when I have attracted interest on Slashdot or Reddit, which are multi-human sites.) A well-designed, polite screen scraper like “quickscrape” will not cause problems for modern sites.
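
To make this concrete, here is a minimal sketch of the polite behaviour described above. It is an illustration only, not quickscrape itself; it assumes Python 3 with the requests and beautifulsoup4 libraries, and the delay value, User-Agent string and function names are my own choices:

    # Minimal sketch of a polite screen scraper (illustration, not quickscrape).
    # Assumes Python 3 with the `requests` and `beautifulsoup4` libraries.
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    HEADERS = {"User-Agent": "polite-scraper/0.1 (research; mailto:you@example.org)"}
    DELAY_SECONDS = 5  # pause between requests: no more load than a human reader

    def scrape_landing_page(url):
        """Fetch a landing page and return absolute URLs of the documents it links to."""
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        page = BeautifulSoup(response.text, "html.parser")
        return [urljoin(url, a["href"]) for a in page.find_all("a", href=True)]

    def download_all(urls):
        """Download each linked document, pausing politely between requests."""
        for u in urls:
            time.sleep(DELAY_SECONDS)
            r = requests.get(u, headers=HEADERS, timeout=30)
            r.raise_for_status()
            # ... write r.content to disk ...
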
Screen-scraping can extract a number of components from the web page. These differ for every publisher and journal, and for science they MAY include (a minimal extraction sketch follows the list):

  • the landing page
  • article metadata (often in the landing page)
  • abstract (often in the landing page)
  • fulltext HTML
  • fulltext PDF
  • fulltext XML (often only on Open Access publishers’ websites, otherwise behind paywall)
  • references (citations)
  • required files (e.g. lists of contributors, protocols)
  • supporting scientific information / data (often very large). A mixture of TXT, PDF, CSV, etc.
  • images
  • interactive data, e.g. 3D molecules
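
Much of this is exposed through standard meta tags in the landing-page HTML (the citation_* convention that Google Scholar and others index). As a sketch only (the tag names follow that common convention, but publishers vary, which is exactly why quickscrape uses per-journal scraper definitions):

    # Sketch: pull common components from a landing page via citation_* meta tags.
    # Tag names follow a widespread convention; treat as illustrative, not universal.
    from bs4 import BeautifulSoup

    def extract_components(html):
        page = BeautifulSoup(html, "html.parser")

        def meta(name):
            tag = page.find("meta", attrs={"name": name})
            return tag.get("content") if tag else None

        return {
            "title": meta("citation_title"),
            "doi": meta("citation_doi"),
            "authors": [t.get("content")
                        for t in page.find_all("meta", attrs={"name": "citation_author"})],
            "fulltext_pdf": meta("citation_pdf_url"),  # only if the publisher exposes it
        }
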

An excellent set of such files can be seen in Acta Crystallographica journals (e.g. http://scripts.iucr.org/cgi-bin/paper?S2056989015020885 ), where the buttons represent such files.
My colleagues at Cambridge and I have been screen-scraping many journals in this way for about 10 years to get crystallographic data for research, and we have never been told we have caused a problem. We have contributed our output to the excellent Free/Open Crystallography Open Database (www.crystallography.net).
So I reject the idea that screen scraping is a problem, and regard the EDAP’s argument as FUD. I say that because, despite the EDAP’s assertion that they are trying to help us, the reverse is true. I have spent 5 years of my life batting emails back and forth and got nowhere (/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ ), and you should prepare for the same.
An API (https://en.wikipedia.org/wiki/Application_programming_interface#Web_use_to_share_content) allows a browser or program to request specific information or services from a server. It’s a precise software specification which should be precisely documented, and on which the client can rely for what the server will provide. At EuropePMC there is such an API, and we use it frequently, including in our “getpapers” tool. Richard Smith-Unna in Plant Sciences (Univ Cambridge) and in ContentMine has written a “wrapper” which issues queries to the API and stores the results.
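
As an illustration of what such a wrapper does, here is a minimal sketch of a query against the EuropePMC REST search service (the endpoint, parameters and response fields below are as I understand the public documentation; verify them there before relying on this):

    # Sketch: query the EuropePMC REST search API, roughly what getpapers wraps.
    # Endpoint and field names are assumptions from the public documentation.
    import requests

    API = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

    def search_europepmc(query, page_size=25):
        params = {"query": query, "format": "json", "pageSize": page_size}
        response = requests.get(API, params=params, timeout=30)
        response.raise_for_status()
        return [(r.get("doi"), r.get("title"))
                for r in response.json()["resultList"]["result"]]

    # Example: openly licensed articles mentioning text mining.
    for doi, title in search_europepmc('"text mining" AND OPEN_ACCESS:y'):
        print(doi, title)
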
When well written, and where there is common agreement on rights, APIs are often, but not always, a useful way to go. Where there is no common agreement they are unacceptable.

Why Elsevier and other TAPublishers’ APIs are unacceptable.

There are several independent reasons why I, Chris Hartgerink and others will not use TAPublisher APIs. This is unlikely to change unless the publishers change the way they work with researchers and accept that researchers have fundamental rights.

  • An API gives total control to the server (the publisher) and no control to the client (reader/user).

That’s the simple, single feature that ultimately decides whether an API is acceptable. The only way I would use one turns on the question I urge you to consider:

  • is there a mutually acceptable public contract between the publisher and the researcher?

In this case, and in the case of all STMPublishers, NO. Elsevier has written its own TandC. It has done this without even the involvement of the purchasing customer. I doubt that any library, any library organization, any university, or any university organization has publicly met with Elsevier or the STMPublishers’ association and agreed mutually satisfactory terms.
All the rest is secondary. Very important secondary issues, which I’ll discuss. But none of this can be mended without giving researchers their rights.
Some of the consequences (which have already happened) include:

  • It is very heavily biased towards Elsevier’s interests, with virtually nothing about the user’s interests.
  • The TandC can change at any time (and do so) without negotiation.
  • The API can change at any time.
  • There is no guaranteed level of software design or service. When (not if) it breaks, we are expected to find and report Elsevier’s bugs. There is no commitment to mend them.
  • The API is designed and built by the publisher without the involvement of the researcher. Quite apart from the contractual issues, this is a known way of producing bad software.
  • The researcher has no indication of how complete or correct the process is. The server can give whatever view of the data they wish.
  • The researcher has no privacy.
  • (The researcher probably has no legal right to sign the TandC for the API – it is the University that contracts with the publisher.)
  • The researcher contracts to publish results only as CC-NC, which debars them from publishing in (CC-BY) Open Access journals.
  • The researcher contracts not to publish anything that will harm Elsevier’s marketplace. This immediately rules me out as publishing chemistry will compete with Elsevier database products.

So the Elsevier API is an instrument for control.
To summarise, an API:

  • Allows the server to control what, when, how, how much, and in what format the user can access the resource. It is almost certain that this will not fit how researchers work. For example, the Elsevier API does not serve images. That already makes it unusable for me. I doubt it serves supplemental data such as CIFs either. If I find problems with the EuropePMC API I discuss them with the European Bioinformatics Institute. If I have problems with the Elsevier API I …
  • Allows the server to monitor all the traffic. I trust EuropePMC to behave responsibly, as it has governance boards (including one I have served on). It allows anonymity. With Elsevier I … In general no large corporate can be trusted with my data, which here includes what I did, when, and what I was looking at, and which allows a complete history of everything I have done. From that, machines can work out a great deal more, and sell it to people I don’t even know exist.

And…

  • APIs can be well written or badly written. Do you, the user, have an involvement?
  • Their use can be voluntary or mandatory. Is the latter a problem?
  • Is there a guarantee of privacy and non-use of data?
  • Do you know whether the API gives the same information as the screen scraper? (Almost certainly not, but how would you find out?)
  • What do you have to sign up to? Was it agreed by a body you trust?

So…
APIs are being touted by Elsevier and other STMPublishers as the obvious, friendly answer to mining. In their present form, and with the present Terms and Conditions, they are completely unacceptable and very dangerous.
They should be absolutely rejected. Ask your library/university to cancel all clauses in contracts which forbid mining by scraping. They have the legal right to do so.
 


3 Responses to Content-mining; Why do Publishers insist on APIs and forbid screen scraping?

  1. Laurent Vial says:

    Dear Prof. Murray-Rust,
    I do understand that contracts between the institutions and publishers are highly restrictive, but do you know if it’s allowed to automatically collect free metadata (title, author, GA, abstract, …) from a private internet network (and without using APIs)?
    Best wishes,
    Laurent

    • pm286 says:

      It’s a slightly grey area. I believe that bibliographic metadata should be uncopyrightable and free of restrictions, and I work on this basis. However, collections of metadata may be protectable in some jurisdictions. Some publishers regard abstracts as copyrightable, and I therefore do not include them in the metadata.
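
      For title/author metadata, one route that avoids publisher sites entirely is the open Crossref REST API. A minimal sketch (the endpoint and field names below follow Crossref’s public documentation at https://api.crossref.org; treat them as assumptions to verify, not a guarantee):

        # Sketch: collect bibliographic metadata from the open Crossref REST API.
        # Field names follow Crossref's documented schema; verify before use.
        import requests

        def crossref_metadata(doi):
            r = requests.get("https://api.crossref.org/works/" + doi, timeout=30)
            r.raise_for_status()
            work = r.json()["message"]
            return {
                "title": work.get("title"),          # a list of title strings
                "authors": [(a.get("given"), a.get("family"))
                            for a in work.get("author", [])],
                "journal": work.get("container-title"),
                # deliberately no abstract, for the copyright reasons above
            }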

  2. Pingback: Content-mining; Why do Universities agree to restrictive publisher contracts? – ContentMine
