I am using the ACS session on Open data as an opportunity to create principles that allow textmining in science. With Jenny Molloy and Graham Steel we are creating a draft Panton Paper, on an Etherpad at:
Please feel free to contribute (it’s trivial to edit, but please leave your identity). I’d also like to show the power of the Etherpad …
Panton Paper for Text and Data Mining
Co-authors: Graham Steel and Peter Murray-Rust
I am using this as a basis for my talk at the ACS on “Open Data and the Panton Principles”. I think text-mining is one of the biggest problems in scientific data, so this is a good excuse to air it and to argue for Open Data.
Scientific articles (papers) are the commonest and most highly valued ways of transmitting science. It has been accepted for at least 130 years (Beilstein, organic chemistry) that it is valuable to extract data from scientific articles and republish it without permission of the original author. This has led to countless review articles where factual data in primary sources are summarised and critiqued in secondary articles (reviews).
Factual data in articles occurs as:
- graphs (e.g. plots of X against Y)
- numbers embedded in running text (“the melting point of benzene is 5 Celsius”)
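As an illustration of the second case, here is a minimal sketch of pulling a number out of running text with a regular expression. The pattern, the function name `extract_facts` and the short lists of properties and units are assumptions made for this example; real text-mining tools handle far more varied phrasing.

```python
import re

# Hypothetical pattern for sentences of the form
# "the <property> of <substance> is <number> <unit>".
PATTERN = re.compile(
    r"the (?P<property>melting point|boiling point) of (?P<substance>\w+) "
    r"is (?P<value>-?\d+(?:\.\d+)?) (?P<unit>Celsius|Kelvin|Fahrenheit)",
    re.IGNORECASE,
)


def extract_facts(text):
    """Return (substance, property, value, unit) tuples found in text."""
    facts = []
    for match in PATTERN.finditer(text):
        facts.append(
            (
                match.group("substance"),
                match.group("property").lower(),
                float(match.group("value")),
                match.group("unit"),
            )
        )
    return facts


sentence = "the melting point of benzene is 5 Celsius"
print(extract_facts(sentence))
```

The result is a fact (substance, property, value, unit) and contains none of the original wording beyond the terms themselves.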
Articles are governed by copyright and this restricts their re-use. The exact details are not – and never will be – precisely specified but we can generally assume that copyright holders could (not necessarily would, but still could) take action on:
- copying the whole article even for colleagues and collaborators
- copying any diagram for re-publication (graphics are “creative works”). Copyright holders have objected to the re-publication of a graph, even to make a valid scientific point for which the graph was necessary
- republishing paragraphs of text, even for scholarly purposes.
A comment on “fair use” (fair dealing; see http://en.wikipedia.org/wiki/Fair_use): this is NOT applicable outside the US, so it is of little global value.
Scientific authors do not expect (or receive) payment for their articles and almost all reviewers give their services freely (although there may be costs involved). There is no ethical or utilitarian reason for restricting the re-publication of science except for the need to provide publishers with income. This paper does not debate the ethics of this (“Open Access”) but is confined to the right to extract data from non-Open material.
We take as agreed that a human may extract data from an article to which they have legitimate access, and that they may republish this data without further permission.
Text-mining is the use of machines to extract data and other information from articles (as opposed to extracting it from databases). The technology can allow high precision/recall rates (> 90%) making it very useful as a way of reading and systematising the primary literature. There are various aspects to TM:
- information retrieval. Classification of documents either supervised (into predetermined categories) or unsupervised (e.g. cluster analysis)
- information extraction. Extraction of information from subcomponents of the text of a document. A common approach is Named Entity Recognition where words and phrases may be identified as people, places, species, chemicals, etc.
- sentiment and argumentation. More general (and harder) interpretations of the role of the article (or parts of it) – “we believe that”, “this is incompatible with”, “this article has been discredited”
Note that if this is done automatically the machines often have to be “trained” by giving them examples of material which has been classified by humans (annotators) as positive or negative. Machines are not perfect (but nor are humans – our work shows interannotator agreement of 90% with machines not much behind).
We argue that any analysis of a document that can be freely published by a human can also be freely published by a machine, and that copyright refers to the precise wording and formatting of the document, not to the abstract ideas or facts published in it. Copyright can only be violated by quoting or reproducing chunks of the original verbatim.
The extraction of information does not normally require verbatim quotation or the reproduction of diagrams and so does not per se violate copyright. It is permitted for human readers of all sorts of material – books, films, as well as scientific articles. If, however, machines are used then the process is “forbidden”.
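To make the terms precision, recall and interannotator agreement concrete, here is a minimal sketch of how they might be computed. The function names and the toy data are invented for illustration. Agreement is taken here as the simple fraction of items on which two annotators give the same label, which is an assumption; other measures exist and the figures quoted above may have been computed differently.

```python
def precision_recall(machine, human):
    """Compare the set of items a machine marked against a human gold standard.

    precision = fraction of machine-marked items that the human also marked
    recall    = fraction of human-marked items that the machine found
    """
    true_positives = len(machine & human)
    precision = true_positives / len(machine) if machine else 0.0
    recall = true_positives / len(human) if human else 0.0
    return precision, recall


def agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy data: chemical names marked in one paragraph (hypothetical).
human_marked = {"benzene", "toluene", "ethanol", "water"}
machine_marked = {"benzene", "toluene", "ethanol", "lead"}
print(precision_recall(machine_marked, human_marked))

# Toy data: two annotators labelling ten phrases as chemical (1) or not (0).
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(agreement(annotator_1, annotator_2))
```

In this toy example the machine finds three of the four human-marked names and marks one spurious name, and the two annotators disagree on one phrase out of ten.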
There are many reasons why machine information extraction is valuable to science:
- humans cannot keep up with the volume of literature
- humans cannot always keep up with correct terminology and usage
- the information is complex and specialised and there are not enough human experts
- the information requires many different resources (e.g. geo-locations, gazetteers, online algorithms, etc). As an example, interpreting the geo-location of chemicals requires an expert in both fields
- algorithms and processing are required (a simple example is conversion of units – Celsius and Fahrenheit for weather)
- machines can monitor trends with time and place (and many other variables)
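The unit conversion mentioned in the list above is the kind of processing that is trivial for a machine to apply consistently across thousands of extracted values. A minimal sketch follows; the function names and the set of supported unit strings are assumptions for this example.

```python
def fahrenheit_to_celsius(value):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def celsius_to_fahrenheit(value):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return value * 9.0 / 5.0 + 32.0


def normalise_to_celsius(value, unit):
    """Return the temperature in Celsius, given a value and its unit name."""
    unit = unit.strip().lower()
    if unit == "celsius":
        return float(value)
    if unit == "fahrenheit":
        return fahrenheit_to_celsius(value)
    if unit == "kelvin":
        return value - 273.15
    raise ValueError(f"unsupported unit: {unit!r}")


# Extracted (value, unit) pairs, e.g. from weather reports (toy data).
readings = [(41, "Fahrenheit"), (5, "Celsius"), (212, "Fahrenheit")]
print([normalise_to_celsius(value, unit) for value, unit in readings])
```

Once values share a unit they can be compared, aggregated or tracked over time and place.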
Aspects of forbidding of TM:
The provision of journal articles is controlled not only by copyright but also (for most scientists) by the contracts signed by the institution. These contracts are usually not public. We believe (from anecdotal evidence) that there are clauses forbidding systematic machine crawling of articles, even for legitimate scientific purposes.
There are many serious consequences of forbidding text-mining:
- new scientific relationships are not discovered. This is particularly common in biomedical science, where searching the literature is as important as doing new research
- “wrong” science (for whatever reason) is not detected (machine analysis of data is very powerful here)
- the text-mining tools cannot be adequately published (acceptable practice is that the training corpus must be freely available – not just DOI references)
- products of text-mining (e.g. classifications and lexicons) are themselves valuable for the next cycle of research and these derived works cannot be published
- innovation in textmining itself and in TM for science is held back
The primary resource on which textmining relies is published science.
We assert that there is no legal, ethical or moral reason to refuse to allow scientists to use machines to analyse the published output of their community.