Why Green Open Access does not support text- and data-mining

Stevan Harnad, Peter Suber and I have been discussing whether Green Open Access (author self-archiving in an Institutional Repository) is sufficient to allow indexing and mining. Stevan comments:

Individual re-use capabilities: If a document’s full-text is freely accessible online (OA), that means any individual can (1) access it, (2) read it, (3) download it, (4) store it (for personal use), (5) print it off (for personal use), (6) “data-mine” it and (7) re-use the results of the data-mining in further research publications (but they may not re-publish or re-sell the full-text itself: “derivative works” must instead link to its URL).

and later:

Stevan Harnad Says:
October 15th, 2007 at 11:51 pm
Dear Peter,
The example you gave of robot blockage was the publisher (Gold? or something else?) giving “free access” with strings and constraints attached. That is not what I am talking about. I am talking about Green OA: That is when an author self-archives his own final, peer-reviewed, accepted draft (“postprint”) in his own Institutional Repository and sets access as “Open Access.” No strings attached, and the spiders can spider away.
And the essence of both my logical and methodological point is that paid Gold OA is always also Green OA. So don’t rely on your publisher providing proper access: self-archive the postprint! Then all the capabilities you seek will come with the territory. Further rights retention or licensing is superfluous (and a retardant, if insisted upon, gratuitously, as a precondition for providing OA!).
And, for the record, I am always talking about published, peer-reviewed journal articles. I am not for a moment contesting that authors can and should license rights to their data as part of making them OA.

PMR: It will help if we understand what responsible and publishable text-mining involves. If any member of the SciBorg project (e.g. Peter Corbett) publishes a paper on natural language processing in chemistry, the work has to be reproducible. This is fundamental to science – and NLP is a science. If you make a claim but do not allow anyone to try to falsify it, you are not publishing science. (Unfortunately this lack of repeatability is almost universal in “chemoinformatics” publications, where the journals never require raw data, but that’s another article.)
So the first thing to do is to gather a corpus of documents. This corpus is part of the experimental toolkit – any other scientist should be able to have access to it. It therefore has to be freely distributable. Since we are interested in machines understanding science, we are concentrating on chemistry articles. This isn’t easy, since almost all articles are copyrighted and non-distributable. Publisher Copyright is a major barrier to progress in Chemical Natural Language Processing – you can’t just go out and compile a wordlist or whatever, as you may infringe copyright or invisible publisher contracts (we found that out the hard way).
When SciBorg started there were no Open Access chemistry journals. Even now the Open Beilstein Journal of Organic Chemistry only has ca. 50 articles. Our corpus comes from the Royal Society of Chemistry, Nature, and the International Union of Crystallography, and we are working out which parts of it we can legally redistribute.
The corpus doesn’t stay as PDFs – PDFs are so awful they are not just useless, but actually destroy information. (Diana Stewart, who works on SPECTRa-T, is trying to find out why theses from Caltech emit non-printing ASCII control characters in their PDF.) So we have to repurpose them by converting to HTML, XML and so on. It’s not a convenience, it’s a necessity. This conversion almost certainly loses information and almost certainly loses any copyright statement (which may even be in an image).
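
To make the control-character problem concrete, here is a minimal Python sketch of the kind of check one might run on text extracted from a PDF. It is an illustration only, not SPECTRa-T's actual code: the function names are hypothetical, and it assumes the text has already been extracted into a string by some other tool.

```python
# Hypothetical helpers for spotting non-printing ASCII control characters
# in text that has already been extracted from a PDF.

# Whitespace controls that are normal in extracted text and should be kept.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_unwanted_control(ch):
    """True for ASCII control characters (0-31 and 127) other than tab/LF/CR."""
    code = ord(ch)
    return (code < 32 or code == 127) and ch not in ALLOWED_CONTROLS


def find_control_chars(text):
    """Return a list of (index, code point) for each unwanted control character."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_unwanted_control(ch)]


def strip_control_chars(text):
    """Return the text with unwanted control characters removed."""
    return "".join(ch for ch in text if not is_unwanted_control(ch))


if __name__ == "__main__":
    sample = "NaCl\x0c melts\x07\n"  # form feed and bell embedded in the text
    print(find_control_chars(sample))
    print(repr(strip_control_chars(sample)))
```

Reporting the positions before stripping matters: silently deleting characters is exactly the sort of lossy step that should be recorded alongside the corpus.
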
Now the corpus is annotated. Expert humans go through line by line, word by word and character by character, identifying the role of each. Often several do this independently to see how well they agree (it’s never 100%). Then everyone can test their software on the same corpus and make meaningful comparisons. It is this annotated corpus which is of most use to the scientific community.
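
As a rough illustration of what "how well they agree" can mean in practice, the sketch below computes raw agreement and Cohen's kappa for two annotators who have each labelled the same sequence of tokens. This is a generic textbook measure, not necessarily the one SciBorg uses; the label names ("CHEM" for a chemical name, "O" for anything else) and the function names are assumptions made for the example.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of tokens on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled at random according to their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical token-level annotations of the same six tokens.
    annotator_1 = ["CHEM", "O", "O", "CHEM", "O", "O"]
    annotator_2 = ["CHEM", "O", "CHEM", "CHEM", "O", "O"]
    print(observed_agreement(annotator_1, annotator_2))
    print(cohen_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on five of the six tokens; kappa comes out lower than the raw agreement because some of that agreement would be expected by chance alone.
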
So suppose I find 50 articles in 50 different repositories, all of which claim to be Green Open Access. I now download them, aggregate them and repurpose them. What is the likelihood that some publisher will complain? I would guess very high. The context of the papers is lost – they simply see “their papers” being packaged and redistributed. They may claim that we have violated database rights, etc. The example I gave showed not that Green Open Access per se was being violated (it wasn’t) but that publishers act in restrictive ways that make no logical sense, and hence logic is of little value.
Only a rights statement actually on each document would allow us to create a corpus for NLP without fear of being asked to take it down.
Data is similar but left as an exercise for the reader.


2 Responses to Why Green Open Access does not support text- and data-mining

  1. In “Why Green Open Access does not support text- and data-mining”, Peter Murray-Rust wrote:

    PM-R: …the first thing to do is to gather a corpus of documents… any other scientist should be able to have access to it. It therefore has to be freely distributable…

    Agreed. So far this is just bog-standard OA. If the original documents are self-archived as Green OA postprints in their authors’ Institutional Repositories (IRs), your SciBorg robot can harvest them and data-mine them, and make the results freely accessible (but linking back to the postprint in the author’s IR whenever the full-text needs to be downloaded).

    PM-R: [At SciBorg] we are interested in machines understanding science…

    Fine. Let your SciBorg machines harvest the Green OA full-texts and “repurpose” them as they see fit.

    PM-R: almost all articles are copyrighted and non-distributable. Publisher Copyright is a major barrier… you can’t just go out and compile a wordlist or whatever as you may infringe copyright or invisible publisher contracts (we found that out the hard way)…

    You can’t do that if you are harvesting the publisher’s proprietary text, but you can certainly do that if you are harvesting the author’s Green OA postprints.

    PM-R: PDFs are so awful… we have to repurpose them by converting to HTML, XML and so on…

    Fine.

    PM-R: Now the corpus is annotated. Expert humans go through line by line…It is this annotated corpus which is of most use to the scientific community…

    Fine.

    PM-R: So suppose I find 50 articles in 50 different repositories, all of which claim to be Green Open Access. I now download them, aggregate them and [SciBorg] repurpose[s] them. What is the likelihood that some publisher will complain? I would guess very high…

    Complain about what, and to whom? A Green publisher has endorsed the author’s posting of his Green OA postprint in his IR, free for all. The postprint is the author’s own refereed, revised final draft. Now follow me: Having endorsed the posting of that draft, does anyone imagine that the publisher would have any grounds for objection if the author revised it further, making additional corrections and enhancements? Of course not. It’s exactly the same thing: the author’s Green OA postprint.
    So what if the author decides to mark it up as XML and add comments? Any grounds for objections? Again, no. Corrections, updates and enhancements of the author’s postprint are in complete conformity with posting his postprint.
    Suppose the author did not do those corrections with his own hands, but had a graduate student, a secretary, or a hired hand do them for him, and then posted the corrected postprint? Still perfectly fine.
    Now suppose the author had your SciBorg “repurpose” his postprint: Any difference? None — except a trivial condition, easily filled, which is that the locus of the enhanced postprint, the URL from which users can download it, should again be the author’s IR, not a 3rd-party website (that the publisher could then legitimately regard as a rival publisher — especially if they were selling access to the “repurposed” text).
    So the solution is quite obvious and quite trivial: It is fine for the SciBorg harvester to be the locus of the data-mining and enhancement of each Green OA postprint. It can also be the means by which users search and navigate the corpus. But SciBorg must not be the locus from which the user accesses the full-text: The “repurposed” full-text must be parked in the author’s own IR, and retrieved from there whenever a user wants to read and download it, rather than just to search and surf the entire corpus via SciBorg.
    Not only does this all sound silly: it really is silly. In the online age, it makes no functional difference at all where a document is actually physically located, especially if the document is OA! But we are still at the interface between the paper age and the OA era. So we have to be prepared to go through a few silly rituals, to forestall any needless fits of apoplexy, which always mean delay (for OA).
    So the ritual is this: It would be highly inimical to the progress of Green OA mandates to insist that the publisher’s endorsement to self-archive the postprint in the author’s IR is not enough, and that the author must also successfully negotiate with the publisher the retention of the right to assign to 3rd-party harvesters like SciBorg the right to publish a “derivative work” derived from the author’s postprint. That would definitely be the tail wagging the dog, insofar as OA is concerned, and it would put authors off providing Green OA (and hence their institutions off mandating it) for a long time to come.
    Instead, when SciBorg harvests a document from a Green OA IR, SciBorg must make an arrangement with the author that the resultant “repurposed” draft will be deposited by the author in the author’s IR as an update of the postprint. Then when a user of SciBorg wishes to retrieve the “repurposed” draft, the downloading site must always be the author’s IR, not a draft hosted by and retrieved directly from SciBorg.
    This ritual is ridiculous, and of course it is functionally unnecessary, but it is pseudo-juridically necessary, during this imbecilic interregnum, to keep all parties (publishers, lawyers, IP specialists, institutions, authors) calm and happy (or at least mutely resigned) about the transition to the optimal and inevitable that is currently taking place. Once it’s over, and we have 100% Green OA, all this papyrophrenic nonsense can be dropped.
    Please, Peter, be prepared to adapt SciBorg to the exigencies of this all-important (and all too slow-footed) transitional phase, rather than trying to adapt the status quo to SciBorg, at the cost of still more delays to OA.

    PM-R: Only a rights statement actually on each document would allow us to create a corpus for NLP without fear of being asked to take it down…

    No. Green OA authors with standard copyright agreements are not in a position to license republication rights to SciBorg or any other 3rd party. Let us be happy that they have provided Green OA at all, and let SciBorg be the one to adapt to it for now, rather than vice versa.
    Stevan Harnad
    American Scientist Open Access Forum
