There have been a number of useful replies to my concern over text-mining the NIH. To resolve some of the confusion:
- NIH have ca 1,000,000 journal articles. These are NOT permissionFree Open Access. There is a limit on what you may legally do with them. They retain publisher copyright. In each case there may have been intensive negotiations with publishers as to what conditions apply. You may not bulk download these, either through robots or OAI-PMH. robots.txt is irrelevant to these articles. If you try to mine you will probably be cut off. If you republish for whatever purpose and if you go beyond fair use (in the publisher’s judgment) you may be pursued by the publisher. This is NOT good enough for any text-mining.
- The NIH have ca 50,000 articles in “OA journals” or otherwise known to be “OA” – which are permissionFree. You may mine these and do whatever (though I am unclear whether there is a trap on CC-By vs CC-NC vs CC-ND). In addition there are about 10,000 author-deposited articles.
Only 6% of the NIH material is therefore Open/permissionFree. You can mine this, etc. as several correspondents have pointed out. But as we know in the OA struggle, 5% is about all you get from requests to authors and publishers. It required a legal mandate from George Bush to ensure that authors HAVE to deposit. This is the real concern of my postings – the 94% that cannot be text-mined.
We need to show WHY textmining is critical. Here is a splendid post from Glen Newton. PLEASE let us collect other examples to send to the NIH…
FREE THE ARTICLES! (Full-text for researchers & scientists and their machines)
At a recent plenary I gave [earlier post] at the Colorado Association of Research Libraries Next Gen Library Interfaces conference, I went a little off-script and was educating (/haranguing) the mostly librarian audience about the present-and-near-future importance of the accessibility of full-text research articles to their researchers and scientists.
By accessibility of full-text I didn’t mean the ability of a human to access the PDF or HTML of an article via a web browser: I was referring to the machine-accessibility of the text contained in the article (and the metadata and the citation information).
I was concerned because of the increasing number of discipline-specific tools that use full-text (& metadata & citations) to allow users (via text mining, semantic analysis, etc.) to navigate, analyze and discover new ideas and relationships from the research literature. The general label for this kind of research is ‘literature-based discovery’, where new knowledge hidden in the literature is exposed using text mining and other tools.
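To make ‘literature-based discovery’ concrete, here is a minimal sketch of the classic pattern: two terms that never appear in the same article, but which both co-occur with the same intermediate terms, are flagged as a candidate hidden link. The four "abstracts", the term list and the function names are all invented for illustration; a real tool would need machine access to the full text of many articles, plus a proper vocabulary or entity recogniser rather than plain substring matching.

```python
from collections import defaultdict
from itertools import combinations

# Toy "abstracts", invented for illustration only (not real article text).
DOCS = [
    "fish oil reduces blood viscosity",
    "fish oil lowers platelet aggregation",
    "raynaud syndrome is associated with high blood viscosity",
    "raynaud syndrome patients show raised platelet aggregation",
]

# Hypothetical term list; a real system would use a curated vocabulary.
TERMS = ["fish oil", "blood viscosity", "platelet aggregation", "raynaud syndrome"]


def cooccurrence(docs, terms):
    """Map each term to the set of terms it shares at least one document with."""
    links = defaultdict(set)
    for doc in docs:
        present = [t for t in terms if t in doc]
        for a, b in combinations(present, 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_links(docs, terms):
    """Return {(a, c): bridging terms} for term pairs never seen together."""
    links = cooccurrence(docs, terms)
    found = {}
    for a, c in combinations(sorted(terms), 2):
        if c in links[a]:
            continue  # already directly connected in some document
        bridges = sorted(links[a] & links[c])
        if bridges:
            found[(a, c)] = bridges
    return found


if __name__ == "__main__":
    for (a, c), bridges in hidden_links(DOCS, TERMS).items():
        print(f"{a} <-> {c} via {', '.join(bridges)}")
```

Even this toy version only works if the program can read every document, which is exactly the kind of access most licenses do not grant.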
Most publisher licenses do not allow for the sort of access to the full-text that many of these discovery and exploration tools need.
When I asked for a show of hands of how many were aware of this issue, of the ~200 in the audience, no one raised their hand.
I went on to suggest/rant that librarians should expect more of their researcher/scientist patrons to be needing/demanding this sort of access to the full-text of (licensed) journal articles. They need to anticipate this response, and I suggested the following non-mutually-exclusive strategies:
- demanding licenses from publishers and aggregators that allow them to offer access to full-text for analysis by arbitrary patron tools
- asking publishers to publish their full-text in the Open Text Mining Interface (OTMI)
- supporting Open Access journals, which allow for much of this out of the box
Recently I retro-discovered an article [1] in The Economist, which explains to the lay-person some of the kinds of things that can be done with access to the literature. This study [2] shows how researchers discovered the biochemical pathway involved in drug addiction from the literature alone. They did no experiments. This discovery was derived from an analysis and extraction of information from more than 1000 articles! This is not the first time this sort of thing has happened [3]. Clearly, this sort of analysis can save time and money in discovering important and relevant scientific knowledge.
[1] Drug Addiction: Going by the book (2008). The Economist, January 10 print issue.
[2] Li, C., Mao, X., Wei, L. (2008). Genes and (Common) Pathways Underlying Drug Addiction. PLoS Computational Biology, 4(1), e2. DOI: 10.1371/journal.pcbi.0040002
[3] Swanson, D. (1986). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1), 7-18.
PMR: I am now off to talk to the UK Serials Group in Torquay. I shall highlight this example of why it’s so important.
Peter, the NIH is conducting a request for information on its new public access policy. I would encourage you to prepare a comment about this for the NIH as a suggestion for how to improve the policy.