There have been some very useful responses (see the comments to Can I data- and text-mine PubMed Central? and the followup) to my assertion that we may not text-mine the major part of the material to be deposited under the NIH mandate. Peter Suber, as always, makes the position precisely clear in No data- or text-mining at PMC:
- Peter MR is right. PMC removes price barriers and leaves permission barriers in place. Users may not exceed fair use, which is not enough for redistribution or most kinds of text- and data-mining. For detail – and official confirmation – see Question F2 in the NIH FAQ:
What is the difference between the NIH Public Access Policy and Open Access?
The Public Access Policy ensures that the public has access to the peer reviewed and published results of all NIH funded research through PubMed Central (PMC). United States and/or foreign copyright laws protect most of the articles in PMC; PMC provides access to them at no cost, much like a library does, under the principles of Fair Use.
Generally, Open Access involves the use of a copyrighted document under a Creative Commons or similar license-type agreement that allows more liberal use (including redistribution) than the traditional principles of Fair Use. Only a subset of the articles in PMC are available under such Open Access provisions. See the PMC Copyright page for more information.
PMR: I was aware of this paragraph, which, like so many, expounds general principles without giving precise indications of what can be done. It is now clear that PMC does not, by default, permit text- or data-mining.
[Additional points: Some correspondents suggest the only NIH/PMC barrier is robots.txt – a guide to how and when robots can download. This is not the primary problem: the papers in PMC still carry restrictive copyright, and PMC restricts downloading of everything except the Open Access subset. Others have commented that PMC has an Open Access subset; yes, I am aware of this and have been working on how to mine it. But if voluntary Open Access, from authors and publishers, were delivering what we want, there would have been no need for the mandate. The mandate forces authors to deposit copies of Closed Access publications, thereby removing price barriers.]
PeterS continues:
- Removing price barriers from NIH-funded research was a major victory, and one we couldn’t have achieved if we demanded the removal of permission barriers at the same time. But Peter is right that researchers need more and that we have to keep working for further goals. In time, I hope we can shorten the permissible 12-month embargo and remove permission barriers from the copies covered by the NIH policy.
PMR: Yes. So it is critical that we make a submission to the NIH on this point. I know there are individuals within the NIH who appreciate the value of full-text mining, but that value is not obvious to many.
I will deal with the question “is text-mining useful?” [a question which surprised me] in a separate post.
“Some correspondents suggest the only NIH/PMC barrier is robots.txt – a guide to how and when robots can download. This is not the primary problem”
My opinion is that this depends on what you intend to do with the content. Almost all the sites Google indexes are under “All rights reserved” copyright, but their robots.txt files permit indexing and linking back (which is what Google does). So robots.txt does permit search engines to index; that is one of the functions of having a robots.txt file on your site in the first place.
If, however, you intend to redistribute the mined content (not just build a searchable index that links back), that is outside the scope of the robots.txt protocol and, as you say, it would not be a justification.
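To illustrate how narrow the robots.txt protocol is, here is a minimal sketch in Python using the standard-library urllib.robotparser module; the crawler name “MyTextMiner” and the article URL are hypothetical placeholders, not PMC’s actual policy machinery. The only question the protocol can answer is whether a given crawler may fetch a given page.

```python
# Minimal sketch: ask robots.txt the only question it can answer,
# namely "may this crawler fetch this page?"
# The user-agent "MyTextMiner" and the article URL are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.ncbi.nlm.nih.gov/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

article_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/"
if robots.can_fetch("MyTextMiner", article_url):
    print("robots.txt allows this crawler to download the page.")
else:
    print("robots.txt asks this crawler not to download the page.")

# Note: an "allow" answer is only a crawling courtesy. It says nothing about
# whether the article's copyright licence permits text- or data-mining, and
# nothing at all about redistributing the mined content.
```

In other words, an “allow” rule in robots.txt clears the way for Google-style indexing and linking back, but it is the licence on the article, not the crawl policy, that determines whether mining and redistribution are permitted.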
Pingback: Non-OA Full-text for text mining « Research Remix
Pingback: Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Access Week - Green is not enough