Wellcome welcomes NPG's textmining policy; I don't

Posted on June 18, 2009 by pm286

From Open Access News, interspersed with my comments:

NPG permits text-mining on green OA manuscripts

Nature Publishing Group (NPG) will explicitly permit academic reuse of archived author manuscripts. Head of Content Licensing David Hoole announced the development today at the OAI6 meeting in Geneva, Switzerland. Researchers can now data-mine and text-mine author manuscripts from NPG journals archived in PubMed Central and other academic repositories.

PMR: one of the really important aspects of PubMed Central is that it contains millions (sic) of abstracts, of which some have openly accessible full-text. However, readers cannot assume they can text-mine this without contravening publisher contracts.

“NPG supports reuse for academic purposes of the content we publish. We want the excellent research that we publish to help further discovery, and recognize that data-mining and text-mining are important aspects of that,” said David Hoole.

PMR: Nature is among the more innovative publishers and has realised the value of text-mining for some time. It originally came out with the (almost useless) OTMI which published all the sentences in some publications but (as Eric Morecambe observed) “not necessarily in the right order”. Here, however, they are starting (sic) to do the right thing.

Under NPG’s terms of reuse, users may view, print, copy, download and text and data-mine the content for the purposes of academic research. Re-use should only be for academic purposes, commercial reuse is not permitted. Full conditions are available [here].

PMR: Oh dear, “non-commercial” yet again. Apart from the questionable motivation, it’s almost impossible to define. If we include material in a text-book is that “commercial”? [Yes, I have read the conditions and we can’t]

The re-use permissions apply to author manuscripts, of articles published in NPG’s journals, which have been archived in PubMed Central, UK PubMed Central (UKPMC) and other institutional and subject repositories. The terms were developed in consultation with the Wellcome Trust, the leading biomedical research charity….

“The Wellcome Trust is supportive of NPG’s efforts to make archived content more reusable,” said Sir Mark Walport, Director of the Wellcome Trust. “This is an important development because it shows that reuse can be facilitated, independent of business model, for text-mining and academic research.” …

PMR: If Wellcome have actually OK’ed the terms pointed to, then this is not an important development.

NPG’s re-use terms will be included in the metadata of these archived manuscripts.

PMR: anything which helps make things clearer is useful. Where is the metadata kept?

Peter Suber’s Comment. OA supporters have disagreed on whether text-mining is covered by fair use (or fair dealing etc.) or whether it requires fresh permission. Regardless of where you came down on that, it’s good to have explicit permission. (On the other hand, if permission is unnecessary, then it wouldn’t be good if researchers and publishers began to believe that it was; but that’s a different issue.) I regard this as a small but welcome step beyond gratis green OA to libre green OA.

PMR: I rarely disagree with PeterS, but this is not a step towards “libre”. I do not welcome this step.

Here are Nature’s terms:

Academic research only
1. Archived content may only be used for academic research. Any content downloaded for text based experiments should be destroyed when the experiment is complete.

PMR: This is not libre. It’s incredibly restrictive. It forbids the use of this material in creating corpora (which are essential for building text-mining tools properly). In fact, by forbidding the creation of corpora, publishers as a whole are holding academic research back.

Wholesale re-publishing is prohibited
3. Archived content may not be published verbatim in whole or in part, whether or not this is done for Commercial Purposes, either in print or online.

PMR: If I do proper text-mining (as opposed to trivial lexical matching) and build tools such as OSCAR I need to be able to show the sources used to train and develop the tools. This is good science. Forbidding me to show my sources is bad science.

4. This restriction does not apply to reproducing normal quotations with an appropriate citation. In the case of text-mining, individual words, concepts and quotes up to 100 words per matching sentence may be reused, whereas longer paragraphs of text and images cannot (without specific permission from NPG).

PMR: This is no more than fair use. 100 words is far too small for creating text-mining tools. It is not “libre”; it is restrictive.
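To make clause 4 concrete, here is a minimal sketch of one possible reading of “up to 100 words per matching sentence”. It is only an illustration: the function name `matching_quotes`, the naive sentence splitter and the case-insensitive keyword match are my assumptions, not anything NPG specifies.

```python
import re

MAX_WORDS = 100  # the per-sentence limit quoted in clause 4


def matching_quotes(text, keyword, max_words=MAX_WORDS):
    """Return each sentence containing `keyword`, cut to `max_words` words.

    Hypothetical helper: one reading of the clause, not NPG's definition.
    """
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    quotes = []
    for sentence in sentences:
        # Case-insensitive substring match stands in for "matching".
        if keyword.lower() in sentence.lower():
            words = sentence.split()
            quotes.append(" ".join(words[:max_words]))
    return quotes
```

Under this reading a tool may report the matching sentences themselves, but nothing of the surrounding paragraph, which is the part a training corpus needs.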

It would be a disaster if other publishers copy Nature and if Wellcome adopt this appalling policy as the standard. Wellcome are a major guardian of scientists’ rights and in this case they are not acting as one.


12 Responses to Wellcome welcomes NPG's textmining policy; I don't

Peter Suber says:

June 19, 2009 at 6:24 pm

Hi Peter,
I agree with nearly everything you’ve written. On the one point where we disagree:
There are two issues: (1) whether the NPG policy goes far enough to support text-mining and (2) whether it moves past green gratis OA to green libre OA. I agree with you on the first, which is why I called it a small step. On the second, remember that libre OA is not just one thing but a range of things. We enter the range of libre OA as soon as we permit more uses than fair use (or fair dealing) alone would allow. Clearly the NPG policy does that.

- pm286 says:
  
  June 19, 2009 at 6:44 pm
  
  @Peter thanks. I’m glad to see the use of the term libre OA as long as we can agree on libre. The NC restriction IMO makes it non-libre.
  
Klaus Graf says:

June 19, 2009 at 6:57 pm

It’s libre in the Suber/Harnad-World but not in ours (and the BBB-World), PMR!

- pm286 says:
  
  June 19, 2009 at 7:17 pm
  
  @Klaus yes. Libre = free as in speech. For everyone in all circumstances. At least BMC gets it.
  
Peter Suber says:

June 20, 2009 at 5:21 pm

Just don’t make this verbal dispute appear to be more than it is. We’re talking about whether a certain word (“libre”) applies to a certain policy, not whether the policy would be better if it were BBB OA (it would) or whether a person using that word supports BBB OA (I do and you know it).

- pm286 says:
  
  June 20, 2009 at 6:00 pm
  
  @Peter I know you do. What I really do not know is how the OA terminology translates into practice. We have “strong” and “weak” OA, “gratis” and “libre” OA and I don’t know in practice what any of them means. As far as the Wellcome/NPG policy goes it seems that it offers nothing more than fair use already. We are allowed to pass documents into machines – we already do that. We can index them locally – we already do that. We are not allowed to reproduce the results. We are not allowed to crawl the publisher’s site. We are not allowed to crawl PubMedCentral for text-mining. We can’t use the results for commercial use. I can’t see what the NPG policy has given us that we don’t effectively have already and by legitimizing it we are weakening our case. It also confuses everyone by having so many restrictions about everything – and there’s a limit to what we can assimilate.
  
Peter Suber says:

June 21, 2009 at 3:30 pm

Hi Peter. There are several threads here that I’d like to separate.
1. “Weak/strong” was an early, regrettable proposal for the distinction now captured by “gratis/libre”. Don’t think of it as an additional pair of terms but as a superseded or deprecated pair of terms.
2. When I introduced the terms gratis/libre into the OA context (borrowing them from the FOSS context), I tried to be clear, careful, and detailed about what I meant by them. I don’t legislate usage, of course. But if the question is about how I use the terms, then my original article should answer it. I also think that my article will answer your questions about what the terms mean in practice, or how they can make our discussions less confusing rather than more confusing.
3. I don’t know whether the new NPG policy goes beyond fair use either. This depends on whether fair use already covers text-mining, a question on which informed people continue to disagree. We may not know whether fair use allows the downloading of full-text copies for processing, but at least we now know that NPG does allow it.
4. Whether the NPG policy is “libre OA” in my sense depends on whether it exceeds fair use, and I’m admitting that that’s unclear. If the policy exceeds fair use, then it’s libre OA (barely). If it doesn’t exceed fair use, then it isn’t.
5. Remember that libre OA is not a synonym for BBB OA. Libre OA covers all the different ways of exceeding fair use or removing permission barriers. It covers a *range* of positions, not just one position. If the NPG policy is libre at all, it’s at the lower or minimal end of the range; the BBB OA is a position at the higher or maximal end of the range.
6. I agree with everything you say about the limitations built into the NPG policy. Removing many more permission barriers would greatly facilitate text-mining and (I’m convinced) cause no harm to NPG.

- pm286 says:
  
  June 21, 2009 at 4:45 pm
  
  @PeterS – thanks, I will reply in a full post.
  
DietrichRS says:

July 1, 2009 at 4:39 pm

Hi PMR and PeterS,
after reading your blog, I looked up the terms that you are using to make up my mind on the NPG agreement.
I believe that the NPG/WCT agreement goes beyond fair use, since “fair use” allows exchange of documents (only) between permission holders and interested “users” on a request basis. Now, the NPG agreement allows the “user” to get the documents and process them WITHOUT asking permission. [This is progress: legal clarity and scientific processing.]
The NPG agreement allows academic use of the full content in experiments, where the content can only be kept for a limited period of time. Conclusion: any further use requires permission => this is a weak OA agreement.
The other restriction addresses the reuse of the data: only a limited stretch of text and only with correct attribution, which is meant to protect NPG’s business model. You can propose that publishers should deliver all content for reuse without any restrictions, but not all publishers can easily follow this business model.
I know that you have high ambitions for OA to literature. Nonetheless, do you agree to the interpretation that I have given?
Cheers,
Dietrich

- pm286 says:
  
  July 1, 2009 at 7:32 pm
  
  @DietrichRS Many thanks. I think the simple answer is that no-one knows. “Fair-use” is a US-centric term (http://en.wikipedia.org/wiki/Fair_use) so it’s already blurred. I am assuming that “user” here means subscriber-only – i.e. if I, as a member of an institution which pays subscriptions to Nature for its employees to use, send a Nature document to a non-subscriber then I assume I violate the permissions. So the permission is only for subscribers. (If I am wrong about this I will be delighted and re-post). Now I believe that a subscriber has the right to use machines to extract information from the material just as if they had a pencil and scissors. There is nothing (I think) in law which forbids it – it is simply that publishers have maintained that subscribers do not have this right. There is also nothing to stop subscribers publishing “facts” from copyright material, although it’s debatable whether a journal is a “database” under the EU directive. So I believe I have the right to do what is described and I don’t see clear new permission in this.
  
Chris Bird says:

July 3, 2009 at 3:20 pm

Peter (MR), you state that “the permission is only for subscribers [to Nature]”. This is not the case. The new NPG policy means that the author manuscript (post peer review) will be deposited, by NPG, into PMC or UKPMC, to be available to any user, after a 6-month embargo. Crucially, the manuscript will also be available to download in XML from PMC’s open access subset and, as such, should facilitate text-mining initiatives both in terms of the permissions attached, and in terms of the availability of the content. Though it is too early to see an NPG article as XML (this policy only became effective in June), this example from Elsevier shows how an article is made available through PMC in XML:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2671587
Specifically regarding the permissions issue, you also state “This is no more than fair use. 100 words is far too small for creating text-mining tools” but I don’t think this is the case. This is not 100 words for the whole article, but 100 words *per matching sentence* (i.e. keyword match). This hardly seems restrictive, as very few sentences are over 100 words in length.
Incidentally, my understanding of the fair dealing versus fair use issue is that fair dealing (UK) is more restrictive than fair use (US) in the context of text mining. As I understand it, in order to carry out a text-mining study, it is first necessary to download (i.e. take a copy of) an article. I think this goes beyond what fair dealing allows, but my understanding from US legal colleagues is that this may well be allowed in that jurisdiction. Either way, the certainty provided by the NPG terms, and the availability of the content in XML from the OA subset, is surely a good thing?
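The Elsevier example above is an OAI-PMH `GetRecord` request, so the same URL can be assembled for any record identifier. A minimal sketch in Python, assuming the endpoint is still the one shown in the example (it may have moved since); the helper name `get_record_url` is hypothetical, and nothing is actually fetched:

```python
from urllib.parse import urlencode

# Base endpoint copied from the example URL above; assumed, not verified.
OAI_BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def get_record_url(identifier, metadata_prefix="pmc", base=OAI_BASE):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # safe=":" leaves the colons in the OAI identifier unescaped,
    # so the result matches the form of the example URL.
    return base + "?" + urlencode(params, safe=":")


url = get_record_url("oai:pubmedcentral.nih.gov:2671587")
print(url)
```

Only the identifier changes from one article to the next; whether a given record may be mined is still governed by the permissions attached to it.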

- pm286 says:
  
  July 3, 2009 at 6:55 pm
  
  @Chris thanks for this very long and considered reply. I will address it in detail. My first impression is that it may give some permissions explicitly which may be claimed by the publisher but they are relatively marginal. But I will attempt to be objective.
  