petermr's blog

A Scientist and the Web


Can I data- and Text-mine Pubmed Central?

Until last week I had assumed that the NIH policy on access to publicly funded research grants full Open Access rights to anyone in the world. The works will be deposited in Pubmed Central. Pubmed Central has its own definition of “open access” and generally uses the phrase “public access”, which is operationally unclear.

Last week I learned at Dagstuhl that data- and text-mining of Pubmed Central was blocked by the site itself: delegates had found that a maximum of two papers can be downloaded before the IP address is blocked.
I’d very much like clarification (as I have found the NIH sites and elsewhere extremely difficult to navigate on a consistent basis). There is no explicit mention of the right to download material for data-mining, and a lot of verbiage about “consistency with publishers’ policies”, which is no help to scientists like me.

So – simply – when the flood of public depositions comes on stream after April 7 (obviously with some delay) can I text-mine them?

This is important. Biology is in critical need of machine help in reading papers. The bioscience community spends tens of millions of dollars (a figure mentioned at Dagstuhl) on annotating genomes, including the ontologies and lexicons. Without this we simply do not understand much of the science being published. It is hugely costly to use humans for this.

When George Bush signed the mandate he clearly envisaged that the information should be used for the benefit of human health…

…and this means text-mining.

So – simply – can I run my robots over the material deposited by mandate?

  1. Yes – without question or fear of reprisal.
  2. No – not at all.
  3. Well – um – err – it depends on each individual paper and each individual publisher and nobody can give a clear answer.

The current answer appears to be 2 (I will be cut off mechanically). I suspect the real answer is 3. Note that although our group has been able to write robots that can understand chemistry, we are a long way from understanding publishers’ policies on access (mainly because many are designed to be unhelpful). So it is impossible to do bulk mining, as we cannot differentiate publisher policies.

Please tell me I am wrong and that it’s really 1. If not, should we not prepare a case to the NIH (they have asked for submissions), asking them to assert that the policy is 1 and to make it clear? Perhaps the Open Knowledge Foundation should create a submission.

If the NIH aren’t prepared to do this then the “victory” is only the first step in a long struggle for liberating data.

