My last post (/pmr/2014/09/15/wosp2014-text-and-data-mining-ii-elseviers-presentation-gemma-hersh/ ) described the unacceptable and intransigent attitude of Elsevier’s Gemma Hersh at WOSP2014 Text and Datamining of Scientific Documents.
But there was another face to Elsevier at the meeting: Chris Shillum (http://www.slideshare.net/cshillum ). Charles Oppenheim and I talked with him throughout lunch, and Richard Smith-Unna also talked with him later. None of the following is on public record but it’s a reasonable approximation to what he told us.
Firstly he disagrees with Gemma Hersh that we HAVE to use Elsevier’s API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier’s content through their web pages that we have the right to read and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their system, but that’s all we have to worry about.
I *think* we can move on from that. If so, thanks, Chris. And if so, it’s a model for other publishers.
We told him what we were planning to do – read every paper as it is published and extract the facts.
CS: Elsevier publishes ca. 1000 papers a day.
PMR : that’s one per minute; that won’t break your servers
CS: But it’s not how humans behave….
I think this means that if we appear to their servers as just another human then they don’t have a load problem. (For the record I sometimes manually download, in sequence, as many Elsevier papers as I can to (a) check the licence or (b) see whether the figures contain chemistry or sequences. Both of these are legitimate human activities for either subscribers or non-subscribers. For (a) I can manage 3 per minute, 200/hour – I have to scroll to the end of the paper because that’s often where the “all rights reserved, © Elsevier” is. For (b) it’s about 40/hour if there is an average of 5 diagrams (I can tell within 500 milliseconds whether a diagram has chemistry, sequences, phylogenetic trees, dose-response curves…). [Note: I don’t do this for fun – it isn’t – but because I am fighting for our digital rights.])
But it does get boring and error-prone which is why we use machines.
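To make the load argument concrete, here is a minimal sketch of what "behaving like a polite human" could look like for a machine. It is an illustration only: the class name `PoliteThrottle` and the function `seconds_between_requests` are hypothetical names invented for this example, and nothing here is ContentMine's actual crawler or anything Elsevier has specified. Note that 1000 papers a day works out at one paper every 86.4 seconds, so slightly fewer than one per minute.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_requests(papers_per_day: int) -> float:
    """Average gap between downloads needed to keep up with a daily output."""
    return SECONDS_PER_DAY / papers_per_day


class PoliteThrottle:
    """Hypothetical helper: enforce a minimum gap between successive requests.

    The clock and sleep functions are injectable so the behaviour can be
    checked without actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


if __name__ == "__main__":
    # ca. 1000 papers a day (the figure quoted in the conversation)
    print(seconds_between_requests(1000))
```

A crawler would call `wait()` before each download. The interval is a choice for the miner to make responsibly; the 86.4 second figure is just what keeping pace with the quoted daily output would require.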
The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).
If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata, or because I am critiquing the text. Here’s an example (http://www.lablit.com/article/11) :
In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737-738 (1953)):
“It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”
Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement …
Lablit can reasonably assume that this is fair comment and quote C+W (Copyright who? they didn’t have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under “fair use”, but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.
So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, were larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like (Abstract, http://www.mdpi.com/2218-1989/2/1/39 ) showing ContentMine markup:
“…Generally the secondary metabolite capability of <a href="http://en.wikipedia.org/wiki/Aspergillus_oryzae">A.oryzae</a> presents several novel end products likely to result from the domestication process…”
That’s 136 characters without the Named Entity “<a…/a>”
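As an illustration of how such a limit might be applied mechanically, here is a minimal sketch that keeps a named entity whole and trims the text around it. The function name `context_snippet` is hypothetical, and the code assumes one particular reading of the rule: that the 200 characters are a total budget for the context on both sides, not counting the entity itself. That reading is my assumption for the example, not something Elsevier has stated.

```python
def context_snippet(text: str, start: int, end: int, max_context: int = 200) -> str:
    """Return the entity text[start:end] plus at most max_context characters
    of surrounding context in total (assumed interpretation of the limit).

    The budget is split evenly before and after the entity; if one side has
    less text available, the unused allowance goes to the other side.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("entity offsets must lie inside the text")

    half = max_context // 2
    before_avail = start
    after_avail = len(text) - end

    before = min(before_avail, half)
    after = min(after_avail, max_context - before)
    # Hand any allowance the right side could not use back to the left side.
    before = min(before_avail, max_context - after)

    left = start - before
    right = end + after
    # Mark truncation with an ellipsis, as in the example above.
    prefix = "…" if left > 0 else ""
    suffix = "…" if right < len(text) else ""
    return prefix + text[left:right] + suffix


if __name__ == "__main__":
    sentence = ("Generally the secondary metabolite capability of A.oryzae "
                "presents several novel end products likely to result from "
                "the domestication process")
    entity = "A.oryzae"
    pos = sentence.index(entity)
    print(context_snippet(sentence, pos, pos + len(entity)))
```

Even this small sketch has to guess at what the limit actually means (per side or in total, with or without markup).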
Which means that we need guidelines for Responsible Content Mining.
That’s what JISC have asked Jenny Molloy and me to do (and we’ve now invited Charles Oppenheim to be a third author). It’s not easy as there are few agreed current practices. It’s possible for us to summarise what people currently do, what has been challenged by rights holders and then for us to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.
So I’d start with:
1. The right to read is the right to mine.
2. Researchers should take reasonable care not to violate the integrity of publishers’ servers.
3. Copyright law applies to all parts of the process although it is frequently unclear.
4. Researchers should take reasonable care not to violate publishers’ copyright.
5. Facts are uncopyrightable.
6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).
CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that’s why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers and that there are checks to counter irresponsible crawling.
Unfortunately (4) cannot be solved by discussions with publishers as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation and it’s not asking for permission. It’s allowing a reasonable publisher to state any reasonable concerns.
And I hope we can see that as a reasonable way forward.