Can I data- and Text-mine Pubmed Central?

Until last week I had assumed that the NIH policy on access to publicly funded research grants full Open Access rights to anyone in the world. The works will be deposited in Pubmed Central. Pubmed Central has its own definition of “open access” and generally uses the phrase “public access” – which is operationally unclear.

Last week I learned at Dagstuhl that data- and text-mining of Pubmed Central was blocked by the site itself – delegates had found that there is a maximum of two papers that can be downloaded before the IP address is blocked.
I’d very much like clarification (as I have found the NIH sites and others extremely difficult to navigate consistently). There is no explicit mention of the right to download material for data-mining, and there is a lot of verbiage about “consistency with publishers’ policies” which is no help to scientists like me.

So – simply – when the flood of public depositions comes on stream after April 7 (obviously with some delay) can I text-mine them?

This is important. Biology is in critical need of machine help in reading papers. The bioscience community spends tens of millions of dollars (a figure mentioned at Dagstuhl) on annotating genomes, including the ontologies and lexicons. Without this we simply do not understand much of the science being published. It is hugely costly to use humans for this.

When George Bush signed the mandate he clearly envisaged that the information should be used for the benefit of human health…

…and this means text-mining.

So – simply – can I run my robots over the material deposited by mandate?

  1. Yes – without question or fear of reprisal.
  2. No – not at all.
  3. Well – um – err – it depends on each individual paper and each individual publisher and nobody can give a clear answer

The current answer appears to be 2 (I will be cut off mechanically). I suspect the real answer is 3. Note that although our group has been able to write robots that can understand chemistry we are a long way from understanding publishers’ policies on access (mainly because many are designed to be unhelpful). So it is impossible to do bulk mining as we cannot differentiate publisher policies.

Please tell me I am wrong and that it’s really 1. If not, should we not prepare a case to the NIH – they have asked for submissions – asking them to assert that the policy is 1 and to make it clear? Perhaps the Open Knowledge Foundation should create a submission.

If the NIH aren’t prepared to do this then the “victory” is only the first step in a long struggle for liberating data.


15 Responses to “Can I data- and Text-mine Pubmed Central?”

  2. Anonymous says:

    To be perfectly honest, this does not surprise me. I’ve hit this issue whilst trying to crawl NIH sites before (NCBI in particular). I wonder if this is in fact an accidental hangover from the mechanisms in place to stop irresponsible spidering of the site for data?

  3. Rich Apodaca says:

    The Open Access fight is just the beginning of a larger process that could take some time to play out:

    I would guess that NIH would be a best-case scenario for opening services to robots/spiders. Other publishers may take a dimmer view…

    PMC’s robots.txt file makes it pretty clear what the answers to your questions are likely to be, if asked today:

  4. pm286 says:

    (3) Thanks Rich.
    My take is that robots.txt is simply a guide – it certainly has no legal force. Note that the file appears to be 10 years old:

    Version: 20 February 1998, Rand S. Huntzinger

    so it is reasonable to expect a revised version sometime.


  5. The robots.txt was last modified on: “09/19/2007 09:57:32 PM”

  7. will says:

    It’s critical that you use the OAI e-utilities, respect the time delays and not break robots.txt – then they WILL let you spider whatever is within this. PubMed Central IS open to indexers.
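
The advice in comment 7 (honour robots.txt and the time delays) can be checked mechanically before a robot fetches anything. Below is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt rules, the URLs, and the agent name `example-miner` are invented for illustration; they are not PMC's actual file or policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only (NOT the real PMC file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /articles/
Allow: /about/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been fetched (or is held in memory)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, agent, urls, default_delay=1.0):
    """Return the URLs the agent may fetch and the delay (seconds) to wait between them."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        # No Crawl-delay given for this agent: fall back to our own assumed default.
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, float(delay)


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    candidates = [
        "https://example.org/articles/123",
        "https://example.org/about/index.html",
    ]
    allowed, delay = polite_plan(parser, "example-miner", candidates)
    print(allowed, delay)
```

A real robot would then sleep for `delay` seconds between requests to the allowed URLs. As comment 4 notes, robots.txt is only a guide, so passing this check says nothing about the legal right to mine the content.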

  9. This is an excellent point, Peter. The NIH has a period of commenting on its policy, open until May 31. We need scientists to point out that we need much more than the current policy; please submit a comment, and encourage others to submit such comments, too.

  10. I think that there are two important points to be made here:

    1) Some people seem to assume that PubMed Central is only about Open Access. This is not the case – PubMed Central also contains a lot of papers that are not Open Access.

    2) Regarding the blocking of access, this is simply a matter of NCBI blocking people who use spiders. However, you can download the entire Open Access subset of PubMed Central from their FTP service, so they are certainly not trying to prevent you from mining the Open Access articles.

  12. Euan says:

    See Will’s comment – how are you getting the papers, Peter? robots.txt implies that you’re scraping, what happens if you go through OAI?

    arXiv do exactly the same thing.
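
For readers who have not used OAI: a harvest is an ordinary HTTP request carrying a `verb` and a few standard arguments, repeated with a `resumptionToken` for as long as the server keeps issuing one. Here is a rough sketch of building such request URLs, assuming the usual OAI-PMH argument names (`verb`, `metadataPrefix`, `from`, `set`, `resumptionToken`). The base URL is a placeholder, not PMC's actual endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; substitute the repository's documented base URL.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH a resumptionToken is sent alone with the verb, without the
        # arguments of the original request.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # e.g. only records changed since this date
        if set_spec:
            params["set"] = set_spec    # repository-defined subset name
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # First page of a harvest, then a follow-up using a token from the response.
    print(list_records_url(BASE_URL, from_date="2008-04-07"))
    print(list_records_url(BASE_URL, resumption_token="token-from-previous-response"))
```

Whether a given repository exposes full text or only metadata through this route, and what reuse is permitted, depends on that repository's documentation.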

  13. Deepak says:

    Using FTP is extremely limiting, and basically kills the idea of an open data web. Regardless of your stand on open access, the fact remains that we should be able to mine the open papers on pubmed central, via the web, otherwise, everything becomes a silo … again.

  14. pm286 says:

    (13) I think Deepak is replying to Lars (10), not me.

  15. S. Honcho says:

    Is there a potential link between this and the sweeping budget cuts the NIH has made at the NCBI? I read that a number of the support staff, including the NCBI service desk, was removed. Maybe this is potentially part of the ripple effect of those cuts?

