Stevan Harnad, Peter Suber and I have been discussing whether Green Open Access (author self-archiving in an Institutional Repository) is sufficient to allow indexing and mining. Stevan comments:
Individual re-use capabilities: If a document’s full-text is freely accessible online (OA), that means any individual can (1) access it, (2) read it, (3) download it, (4) store it (for personal use), (5) print it off (for personal use), (6) “data-mine” it and (7) re-use the results of the data-mining in further research publications (but they may not re-publish or re-sell the full-text itself: “derivative works” must instead link to its URL).
Stevan Harnad Says:
October 15th, 2007 at 11:51 pm
The example you gave of robot blockage was the publisher (Gold? or something else?) giving “free access” with strings and constraints attached. That is not what I am talking about. I am talking about Green OA: That is when an author self-archives his own final, peer-reviewed, accepted draft (“postprint”) in his own Institutional Repository and sets access as “Open Access.” No strings attached, and the spiders can spider away.
And the essence of both my logical and methodological point is that paid Gold OA is always also Green OA. So don’t rely on your publisher providing proper access: self-archive the postprint! Then all the capabilities you seek will come with the territory. Further rights retention or licensing is superfluous (and a retardant, if insisted upon, gratuitously, as a precondition for providing OA!).
And, for the record, I am always talking about published, peer-reviewed journal articles. I am not for a moment contesting that authors can and should license rights to their data as part of making them OA.
PMR: It will help if we understand what responsible and publishable text-mining involves. If anyone on the SciBorg project (e.g. Peter Corbett) publishes a paper on natural language processing in chemistry, it has to be reproducible. This is fundamental to science – and NLP is a science. If you make a claim but do not allow someone to falsify your claim, you are not publishing science. (Unfortunately this lack of repeatability is almost universal in “chemoinformatics” publications, where raw data is never required by the journals, but that’s another article).
So the first thing to do is to gather a corpus of documents. This corpus is part of the experimental toolkit – any other scientist should be able to have access to it. It therefore has to be freely distributable. Since we are interested in machines understanding science, we are concentrating on chemistry articles. This isn’t easy since almost all articles are copyrighted and non-distributable. Publisher Copyright is a major barrier to progress in Chemical Natural Language Processing – you can’t just go out and compile a wordlist or whatever as you may infringe copyright or invisible publisher contracts (we found that out the hard way).
When SciBorg started there were no Open Access chemistry journals. Even now the open-access Beilstein Journal of Organic Chemistry has only ca. 50 articles. Our corpus comes from the Royal Society of Chemistry, Nature, and the International Union of Crystallography, and we are working out which parts of it we can legally redistribute.
The corpus doesn’t stay as PDFs – PDFs are so awful they are not just useless, but actually destroy information. (Diana Stewart, who works on SPECTRa-T, is trying to find out why theses from Caltech emit non-printing ASCII control characters in their PDFs.) So we have to repurpose them by converting to HTML, XML and so on. It’s not a convenience, it’s a necessity. This conversion almost certainly loses information and almost certainly loses any copyright statement (which may even be in an image).
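The control-character problem is easy to demonstrate. Here is a minimal sketch (the sample string is invented for illustration, not taken from the Caltech theses) that scans text extracted from a PDF and reports any non-printing ASCII control characters:

```python
import re

# Bytes below 0x20 (other than tab/newline/carriage return) plus DEL
# are the "non-printing ASCII control characters" that sometimes leak
# out of PDF-to-text conversion.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def find_control_chars(text):
    """Return (offset, codepoint) pairs for each control character found."""
    return [(m.start(), ord(m.group())) for m in CONTROL.finditer(text)]

# Hypothetical extractor output with embedded STX and NUL characters:
sample = "2-chloro\x02benzoic acid\x00"
print(find_control_chars(sample))  # → [(8, 2), (21, 0)]
```

Characters like these silently corrupt chemical names, which is one reason the conversion to HTML/XML has to be checked rather than trusted.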
Now the corpus is annotated. Expert humans go through line by line, word by word and character by character, identifying the role of each. Often several do this independently to see how well they agree (it’s never 100%). Then everyone can test their software on the same corpus and make meaningful comparisons. It is this annotated corpus which is of most use to the scientific community.
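Inter-annotator agreement is usually reported with a chance-corrected statistic rather than raw percent agreement. A minimal sketch of Cohen’s kappa for two annotators’ token labels (the label names and example sequences are invented for illustration):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same tokens."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators marking which tokens are chemical names:
a = ["CHEM", "O", "CHEM", "O", "O", "CHEM"]
b = ["CHEM", "O", "O",    "O", "O", "CHEM"]
print(round(cohen_kappa(a, b), 3))  # → 0.667
```

Agreement of 1.0 never happens in practice, which is exactly the point of the paragraph above: the annotated corpus, disagreements and all, is the shared benchmark.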
So suppose I find 50 articles in 50 different repositories, all of which claim to be Green Open Access. I now download them, aggregate them and repurpose them. What is the likelihood that some publisher will complain? I would guess very high. The context of the papers is lost – the publisher simply sees “their papers” being packaged and redistributed. They may claim that we have violated database rights, etc. The example I gave showed not that Green Open Access per se was being violated (it wasn’t) but that publishers act in restrictive ways that make no logical sense, and hence logic is of little value.
Only a rights statement actually on each document would allow us to create a corpus for NLP without fear of being asked to take it down.
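If each document carried its rights statement in the text itself, corpus-building could be automated. A minimal sketch, assuming the crude heuristic of matching known permissive-licence markers (the marker strings and the `documents` structure are assumptions for illustration, not an established standard):

```python
# Marker strings that signal an explicit permissive rights statement.
# Real deployments would want machine-readable metadata, not string matching.
PERMISSIVE_MARKERS = [
    "creativecommons.org/licenses/by/",  # CC-BY licence URL, any version
    "licensed under a creative commons attribution",
]

def redistributable(doc_text):
    """True if the document text carries a recognised permissive rights marker."""
    text = doc_text.lower()
    return any(marker in text for marker in PERMISSIVE_MARKERS)

# Hypothetical downloaded documents:
documents = {
    "paper1.html": "... http://creativecommons.org/licenses/by/2.0/ ...",
    "paper2.html": "... All rights reserved. ...",
}
corpus = [name for name, text in documents.items() if redistributable(text)]
print(corpus)  # → ['paper1.html']
```

Without such an embedded statement, every document in the aggregate is a potential take-down request.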
Data is similar but left as an exercise for the reader.