The NIH has asked for public comments on its access policy. 150+ people and organisations have responded. Almost no-one has said anything about text/data mining.
So I have:
1. Do you have recommendations for alternative implementation approaches to those already reflected in the NIH Public Access Policy?
There is an urgent requirement in bioscience to use machines to extract information from the full text of papers (“text-mining” and “data-mining”). Examples of this use are the machine-assisted annotation of genomes, the extraction of concepts from text and the linking of information from many different disciplines. In my own field of molecular informatics it is possible to scan a million PubMed abstracts a day and extract mentions of new chemical compounds of biological interest. It is now well known that abstracts alone do not give sufficient information and that access to the full text is required.
Many publications are accompanied by data, and indeed for many of these (e.g. about sequences and structures of biomolecules) the data are often more important than the full text. Although the STM publishers have urged their members to regard data as facts and therefore free of copyright, several publishers label data as copyrighted, thus effectively barring the legitimate re-use of data. It is important that the NIH challenges this and forbids it on PMC.
Many data are embedded in the full text and can be extracted by machines (“text-mining”). This process is made more tractable if the text is available in XML form (including XHTML) and I support the use of these formats.
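To illustrate why XML/XHTML makes this tractable, here is a minimal sketch in Python. The XHTML fragment, the `extract_quantities` function and the tiny list of units are all hypothetical, invented for illustration; real text-mining tools use far richer vocabularies and parsers. The point is only that well-formed markup lets a program go straight to the paragraphs and pull values out of them.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical fragment standing in for the full text of an article.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Results</h1>
    <p>The yield of <i>compound 1</i> was 12.5 mg after 3 h.</p>
    <p>Melting point: 142 C.</p>
  </body>
</html>"""

# Illustrative pattern: a number followed by a unit from a small, assumed list.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|h|C)\b")


def extract_quantities(xhtml):
    """Return (value, unit) pairs found in the <p> elements of an XHTML string."""
    root = ET.fromstring(xhtml)
    found = []
    for paragraph in root.iter(XHTML_NS + "p"):
        # itertext() flattens inline markup such as <i> into plain text.
        text = "".join(paragraph.itertext())
        found.extend(QUANTITY.findall(text))
    return found


if __name__ == "__main__":
    for value, unit in extract_quantities(SAMPLE):
        print(value, unit)
```

The same task on a PDF or a scanned page would first require recovering the paragraph structure, which is where most of the errors creep in.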
“Text-mining” and “data-mining” are hardly mentioned, if at all, in the NIH’s description and requirements. I would therefore wish to see a positive indication that the NIH supports the re-use of the material, in high-throughput mode.
3. In addition to the information already posted at http://publicaccess.nih.gov/communications.htm, what additional information, training or communications related to the NIH Public Access Policy would be helpful to you?
The information provided gives users very little positive indication that they can legitimately re-use the material published on PMC. I write a blog on Open Access and Open Data (http://wwmm.ch.cam.ac.uk/blogs/murrayrust) and the informed opinion there was that PMC does not allow data- or text-mining and that attempts to do this will result in the NIH server cutting off access to the given IP. The words “fair use” are useless. In practice no scientist has enough knowledge of case law to know what is and is not fair use, and the term effectively frightens many into “no use”.
I would urge that the NIH make clear what their policy on data- and text-mining is, using those terms. I would also suggest that the NIH add machine-readable versions of licences or similar documents so that robots are aware of what they may and may not do.
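As a rough analogy for what “machine-readable” could mean here, the sketch below uses the existing robots.txt convention, which Python's standard library can parse. The rules shown are hypothetical and are not the NIH's actual robots.txt; `may_fetch` and the agent name `example-miner` are likewise invented for illustration. Note that robots.txt only expresses crawling permissions, not licence terms, so this shows the kind of check a robot could make rather than a full answer to the licensing question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT the NIH's actual robots.txt.
RULES = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()


def may_fetch(url, agent="example-miner"):
    """Return True if the hypothetical rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(RULES)  # parse in-memory lines instead of fetching over the network
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/articles/1", "https://example.org/private/x"):
        print(url, may_fetch(url))
```

A well-behaved mining robot would run a check like this before every request and stop when the answer is no, which is only possible if the permissions are published in a form it can read.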
Do you have other comments related to the NIH Public Access Policy?
I am a user of the material available on the NIH sites, including PubChem and PubMed. The volume of information is now so great that machines are essential to use it properly. I believe it is essential for the NIH to enable text/data-mining of its information if it is to recoup the maximum value of its research investment.