ContentMine at WOSP2014: Text and Data Mining: III What Elsevier's Chris Shillum thinks we can do; Responsible Mining

Posted on September 16, 2014 by pm286

My last post (/pmr/2014/09/15/wosp2014-text-and-data-mining-ii-elseviers-presentation-gemma-hersh/ ) described the unacceptable and intransigent attitude of Elsevier’s Gemma Hersh at WOSP2014 Text and Datamining of Scientific Documents.
But there was another face to Elsevier at the meeting: Chris Shillum (http://www.slideshare.net/cshillum ). Charles Oppenheim and I talked with him throughout lunch, and Richard Smith-Unna talked with him later. None of the following is on public record but it’s a reasonable approximation to what he told us.
Firstly he disagrees with Gemma Hersh that we HAVE to use Elsevier’s API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier’s content through their web pages that we have the right to read and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their system, but that’s all we have to worry about.
I *think* we can move on from that. If so, thanks, Chris. And if so, it’s a model for other publishers.
We told him what we were planning to do – read every paper as it is published and extract the facts.
CS: Elsevier publishes ca 1000 papers a day.
PMR : that’s one per minute; that won’t break your servers
CS: But it’s not how humans behave….
I think this means that if we appear to their servers as just another human then they don’t have a load problem. (For the record I sometimes download manually in sequence as many Elsevier papers as I can to (a) check the licence or (b) see whether the figures contain chemistry or sequences. Both of these are legitimate human activities for subscribers and non-subscribers alike. For (a) I can manage 3 per minute, roughly 200/hour – I have to scroll to the end of the paper because that’s often where the “all rights reserved, C Elsevier” is. For (b) it’s about 40/hour if there are an average of 5 diagrams (I can tell within 500 milliseconds whether a diagram has chemistry, sequences, phylogenetic trees, dose-response curves…). [Note: I don’t do this for fun – it isn’t – but because I am fighting for our digital rights.])
But it does get boring and error-prone which is why we use machines.
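To make the load argument concrete, here is a minimal Python sketch of the arithmetic and of a crawler that never requests faster than a fixed interval. It is an illustration only, not ContentMine's actual code: the names `seconds_between_requests` and `paced` are hypothetical, and the only figure taken from the conversation is the "ca 1000 papers a day".

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_between_requests(papers_per_day: int) -> float:
    """Average gap between requests needed to keep up with a daily output."""
    return SECONDS_PER_DAY / papers_per_day


def paced(
    items: Iterable[T],
    min_interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items one at a time, never less than `min_interval` seconds apart.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    last: Optional[float] = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


if __name__ == "__main__":
    # 1000 papers a day works out to one request every 86.4 seconds.
    print(seconds_between_requests(1000))
```

A caller would wrap its list of article URLs in `paced(urls, min_interval=...)` and fetch each one as it is yielded. At 1000 papers a day the required rate is one request roughly every minute and a half, slower than the manual download rates mentioned above.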
The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).
If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata, or because I am critiquing the text. Here’s an example (http://www.lablit.com/article/11) :

In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737-738 (1953)):
“It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”
Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement …

Lablit can reasonably assume that this is fair comment and quote C+W (Copyright who? They didn’t have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under “fair use”, but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.
So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, were larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like (Abstract, http://www.mdpi.com/2218-1989/2/1/39 ) showing ContentMine markup:
“…Generally the secondary metabolite capability of <a href="http://en.wikipedia.org/wiki/Aspergillus_oryzae"> A.oryzae</a> presents several novel end products likely to result from the domestication process…”
That’s 136 characters without the Named Entity “<a…/a>”
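A minimal Python sketch of how a miner might cut such a snippet automatically follows. It is an illustration under stated assumptions, not Elsevier's rule or ContentMine's implementation: the function name `context_snippet` and its `budget` parameter are hypothetical, and it assumes the 200 characters are a total shared between the text before and after the entity (the way the 136 is counted above), with the entity itself not counted.

```python
from typing import Tuple


def context_snippet(
    text: str, start: int, end: int, budget: int = 200
) -> Tuple[str, str, str]:
    """Return (left, entity, right) for the entity at text[start:end].

    left and right together hold at most `budget` characters of
    surrounding context; the entity itself is not counted.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("entity span out of range")

    left_available = start
    right_available = len(text) - end

    # Start with half the budget on the left, give the right side whatever
    # remains, then hand any allowance the right could not use back to the left.
    left_len = min(left_available, budget // 2)
    right_len = min(right_available, budget - left_len)
    left_len = min(left_available, budget - right_len)

    return text[start - left_len:start], text[start:end], text[end:end + right_len]


if __name__ == "__main__":
    sentence = (
        "Generally the secondary metabolite capability of A.oryzae presents "
        "several novel end products likely to result from the domestication process"
    )
    s = sentence.index("A.oryzae")
    left, entity, right = context_snippet(sentence, s, s + len("A.oryzae"))
    print(len(left) + len(right), repr(entity))
```

If the limit turned out to apply per side rather than in total, only the three lines computing `left_len` and `right_len` would need to change.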
Which means that we need guidelines for Responsible Content Mining.
That’s what JISC have asked Jenny Molloy and me to do (and we’ve now invited Charles Oppenheim to be a third author). It’s not easy as there are few agreed current practices. It’s possible for us to summarise what people currently do, what has been challenged by rights holders and then for us to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.
So I’d start with:

1. The right to read is the right to mine.
2. Researchers should take reasonable care not to violate the integrity of publishers’ servers.
3. Copyright law applies to all parts of the process although it is frequently unclear.
4. Researchers should take reasonable care not to violate publishers’ copyright.
5. Facts are uncopyrightable.
6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).

CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that’s why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers and there are checks to counter irresponsible crawling.
Unfortunately (4) cannot be solved by discussions with publishers as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation and it’s not asking for permission. It’s allowing a reasonable publisher to state any reasonable concerns.
And I hope we can see that as a reasonable way forward.


7 Responses to ContentMine at WOSP2014: Text and Data Mining: III What Elsevier's Chris Shillum thinks we can do; Responsible Mining

Catherine Pitt says:

September 16, 2014 at 4:21 pm

test comment

Peter Carroll says:

September 16, 2014 at 6:25 pm

Re: [PMR]”The second problem is what we can publish.”
From 1st October 2014 UK copyright law will change with the implementation of The Copyright and Rights in Performances (Quotation and Parody) Regulations 2014 (2014 No. 2356, Regulation 3).
Source: http://www.legislation.gov.uk/uksi/2014/2356/regulation/3/made
which reads as follows:
“Quotation: amendments to section 30
3. (1) Section 30(1) is amended as follows.
(2) In the heading, after “review” insert “, quotation”.
(3) In subsection (1), after “acknowledgement” insert “(unless this would be impossible for reasons of practicality or otherwise)”.
(4) After subsection (1) insert—
“(1ZA) Copyright in a work is not infringed by the use of a quotation from the work (whether for criticism or review or otherwise) provided that—
(a)the work has been made available to the public,
(b)the use of the quotation is fair dealing with the work,
(c)the extent of the quotation is no more than is required by the specific purpose for which it is used, and
(d)the quotation is accompanied by a sufficient acknowledgement (unless this would be impossible for reasons of practicality or otherwise).”
(5) In subsection (1A)—
(a)for “subsection (1)” substitute “subsections (1) and (1ZA)”, and
(b)for “that subsection” substitute “those subsections”.
(6) After subsection (3) insert—
“(4) To the extent that a term of a contract purports to prevent or restrict the doing of any act which, by virtue of subsection (1ZA), would not infringe copyright, that term is unenforceable.””
This will be the UK law that will apply IF you do not sign Elsevier’s TDM terms and conditions. Even IF you do sign them, new subsection (4) states that: “To the extent that a term of a contract purports to prevent or restrict the doing of any act which, by virtue of subsection (1ZA), would not infringe copyright, that term is unenforceable.” It seems to me (but then I am not a lawyer) that this would allow you to make quotations of more than Elsevier’s “200 characters of surrounding context”, provided that under subsection (1ZA)(c) “the extent of the quotation is no more than is required by the specific purpose for which it is used”.
I would just add one other thing. The new quotation exception does not apply only to textual quotation. The explanatory memorandum to the exception
http://www.legislation.gov.uk/uksi/2014/2356/memorandum/contents
states:
“7.9.5 As with the existing criticism and review exception, the exception for quotation applies to all types of copyright work including film, broadcasts, and sound recordings, as well as to traditional text quotation. It should be noted that while EU case law acknowledges that there may be circumstances in which it may be permissible to ‘quote’ a whole photograph, in practice it is likely to be more difficult to do this in a way that is considered “fair dealing” when compared to the use of shorter extracts of other works.”
There is further explanation in Section 3.5, which deals with full quotations of photographs. There is EU case law that “complete reproduction may be necessary in order to create the necessary material reference back to the work”. Whilst section 3.5 deals explicitly with photographs, it starts with section 3.5.1 that “Neither the Convention nor the Directive expressly exclude photographs or other artistic works from the scope of the quotation exception.” Again I am not a lawyer, but graphs, infographics and diagrams in scientific publications would seem to be encompassed under “other artistic works”. Therefore, their “quotation” (from October 1st 2014) seems allowable so long as it is fair dealing.

- pm286 says:
  
  September 16, 2014 at 6:55 pm
  
  Many thanks indeed.
  
Gemma Hersh and Chris Shillum says:

September 18, 2014 at 12:49 pm

We feel it is important to clarify the following:
1. We are fully aligned on the details of Elsevier’s text mining policy and feel that, despite your attempt at accuracy, this is a somewhat subjective and rather unproductive recap of our conversations last week.
2. Chris had the opportunity to speak with you informally and at greater length over lunch, and Gemma gave a short presentation at the invitation of colleagues from Mendeley, who helped co-organise the conference. It is clear from the workshop program at http://core-project.kmi.open.ac.uk/dl2014/ that this was neither presented as, nor intended to be, a peer-reviewed session nor a sales discussion.
3. If a single crawler behaves as you propose, then it alone would pose no stability risk to publisher systems. However it does not follow that this is sustainable when you consider that many researchers may wish to mine. We are designing services that scale globally.
4. We therefore continue to ask and recommend that you use our APIs for programmatic access to the content. You have not, to our knowledge, ever used the TDM service of which you are so critical and we invite you to try it.
5. We would welcome the opportunity to participate in a community-driven process to further refine your proposed Guidelines for Responsible Text Mining and acknowledge the more constructive tone of your later blog post. We look forward to more constructive dialogue in future.
Gemma Hersh and Chris Shillum

- David Roberts says:
  
  September 20, 2014 at 10:13 am
  
  If Peter wants to “try” the TDM service, can he do so without being locked in to contractually binding terms? If he has to agree to T&C, can he unagree later? I think the criticism of the API is not of the technology, but of what people are allowed to do with it.
  
  - pm286 says:
    
    September 20, 2014 at 10:50 am
    
    Thanks,
    I will blog on this and why I am unlikely to use Elsevier’s TDM API.
    