NO-ONE MAY DATA- OR TEXT-MINE PUBMED CENTRAL

I realised with considerable disappointment (Can I data- and Text-mine Pubmed Central?) that I might not be able to text- and data-mine the material that the NIH mandate requires to be deposited in Pubmed Central. I have now had confirmation by email from an authoritative source (who asks not to be named in case the information is not quite precise). But in general terms the answer is simple:
NO-ONE MAY DATA- OR TEXT-MINE PUBMED CENTRAL
In short, Pubmed Central is “free access” (no price barriers), not “open access” (no permission barriers). You may not download material from it (except to expose it to your own eyeballs), and you certainly may not redistribute it. You may not data-mine it.
I am aware of the struggle that was required to get George Bush to sign the mandate and it certainly wasn’t the time to break ranks. But now that the mandate is passed (and starts tomorrow) we must press ahead immediately to campaign for full access to the text.
We have the right and the duty to submit our views to NIH. For example Stevan Harnad has argued (recommendations to the NIH) that it is better to reposit in institutional repositories (“green”). Whether or not this is a good idea (and I personally don’t think so, as it makes datamining almost impossible), it is clearly outside the current approach from Pubmed Central. For example, I gather, the mirrors of PMC have to agree to the same absolute permission barriers that PMC imposes – it would be impossible to ensure that thousands of libraries enforced this – almost draconian – contractual system.
So we have to argue to the NIH that bioscience is desperately impoverished by the unreasonable permission barriers that are now in place. I’m not a (US) politician and I think the NIH and advocates have done well to win the first battle. But at present the policy is seriously hindering modern science.
So the whole area is incredibly complex. The goal is simple – use scientific publications to further our understanding of science and – hopefully – make progress in enhancing human health. For this we MUST have robots. We cannot do it with humans alone – every week we get thousands of new papers.
I’d be grateful to know what the position is with Wellcome. I thought they had removed permission barriers.

6 Responses to NO-ONE MAY DATA- OR TEXT-MINE PUBMED CENTRAL

will says:

April 6, 2008 at 6:10 pm

There is a subset of PubMed Central called the Open Archives Service. You are allowed to index this since it consists of articles where publishers have deposited specifically with open access in mind. This subset is OK to index/datamine and link back to (provided you use the PubMed e-utilities – not a web spider directly to the URLs which is forbidden in robots.txt – and respect request delays – one per 3 secs I think)
If you want to extract and host (as opposed to link back to / datamine for personal use) I am less sure.
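The request-delay point above can be sketched in code. This is only a minimal illustration, not NCBI's official client: the class name `PoliteThrottle`, the helper `efetch_url`, and the 3-second default (taken from the "one per 3 secs I think" recollection above) are all assumptions, and the sketch builds a URL without making any network request. Check the current E-utilities usage guidelines before relying on any of these values.

```python
import time
from urllib.parse import urlencode

# Base address of the NCBI E-utilities; efetch is one of its endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


class PoliteThrottle:
    """Hypothetical helper: enforce a minimum gap between requests."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        # min_interval of 3.0 s is an assumption based on the comment above.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed; return seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


def efetch_url(pmc_id):
    """Build (but do not fetch) an efetch URL for one PMC record."""
    query = urlencode({"db": "pmc", "id": pmc_id})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"
```

A robot would call `throttle.wait()` before every request it sends, so that even a tight loop over many identifiers never exceeds the agreed rate.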

pm286 says:

April 6, 2008 at 6:56 pm

(1) thanks Will – I should have made this clearer.
Yes – I am aware of the Pubmed Open Archives. This effectively reflects the material which would be available without the mandate. (FWIW I always try to create robots which are friendly.)
Unfortunately relatively few authors either publish in Open Access journals or buy the full Open Access rights from the publishers. The mandate requires the publication of the material with “free” or “public” access. My understanding is that this material cannot be spidered.

Egon Willighagen says:

April 6, 2008 at 7:28 pm

Did your private correspondent say anything about license issues? I’d say that this NIH policy violates the CC-licenses used by some of the more liberal publishers.

Egon Willighagen says:

April 7, 2008 at 11:54 am

I have discussed my questions on the legal aspects of PMC versus the CC-BY 2.0 license requirements. The latter forbids imposing technical obstacles that control access, which is exactly what PMC seems to be doing.
See http://chem-bla-ics.blogspot.com/2008/04/legal-advice-needed-nih-restricting.html

David Rothman says:

April 7, 2008 at 6:14 pm

So what does BioSearch actually search?
http://biosearch.berkeley.edu/

pm286 says:

April 7, 2008 at 9:32 pm

(3, 4). Egon, the articles do not carry a CC licence – they retain the publisher licence AFAIK.
(5) It says Open Access journals, so I assume that…
