I have just discovered Elsevier’s content mining document.
For those who don’t know, I have been trying to get permission to text-mine Elsevier content for two years and have been treated as a second-class citizen and ultimately come away with nothing. See /pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ . The analysis in this post will centre round Elsevier but also applies to Nature. And I suspect it applies to a large proportion of the rest of the publishing community. I’ll reproduce most of the document. (I don’t have the sacred copyright permission to reproduce it of course, but…). BTW the Elsevier staff in Oxford a year ago promised that they would update me when this document came out, but of course they didn’t.
Read http://www.elsevier.com/wps/find/intro.cws_home/contentmining before you read my critique. Consider the implications. Then I’ll indicate why we have been so badly let down by academic libraries or their purchasing agents, who have given away more of our crown jewels without a fight.
If you want to know why I am so angry with University Libraries read the bottom of the post as well.
OK, have you read it? – it’s not very long. I’ll go through and annotate it – like a peer-reviewer. Because after all, that’s why we pay Elsevier, isn’t it? – because without them we’d be incapable of organising peer-review. (Elsevier is in italics.)
ELSEVIER CONTENT MINING POLICY
Overview of content mining
• Content Mining concerns the automatic processing of large collections of various forms of data and information to identify, organise and perform analysis in order to determine possible links within the content that may not be obvious on initial inspection.
PMR: This is an extraordinarily simplistic view. It probably arises from Elsevier’s limited vision. From Wikipedia:
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and concept extraction out of images/audio/video could be seen as information extraction.
• There are various methods to perform this processing, but there are elements common to all methods, including an automated way to process all sizes and types of content in which to identify relevant information, facilitate its extraction and its analysis.
PMR: This is a woolly sentence – the only relevant concept is automation. This is the key to our struggle for Free/Open information to mine.
• Content mining has links to semantic technology as it focuses on the interlinks and contextual commonalities to enhance the understanding of the content.
PMR: I have no idea what a “contextual commonality” is. The only meaningful concept here is semantic technology.
• The development of these mining approaches are of particular importance within the scientific community to drive the interdisciplinary nature of research and support new areas of discovery.
PMR: A safe generalization adding little new insight.
Elsevier’s principles on content mining
- Elsevier wants to support our customers to advance science and health.
PMR: This is so vapid that it can only be classified as marketing froth. What Elsevier “wants” and what Elsevier provides have no correspondence in reality.
- We want to help them realise the maximum benefit from our content and enhance insight and understanding through content mining.
PMR: And in practice they do everything possible to retard the independent development of textmining.
- Our journals and books have added value – we invest in quality content and enrich content to maximise discoverability and usability.
PMR: “maximise usability”??? Double-column (or even single column) PDF is a major destruction of information. Scientists have spent hundreds of person-years (probably thousands) trying to get information out of PDF. Whereas simply providing us with the original author manuscript in Word or LaTeX is all we need. We can add the document semantics. But no, we need Elsevier to provide the content.
- We believe a transparent content mining policy framework is essential, which needs efficient implementation and flexibility to cover multiple scenarios.
PMR: Devoid of meaning. “transparent”?? Efficient and flexible? Weasel words (a Wikipedia term) that imply only Elsevier is clever enough to do this.
- The framework of open innovation enables and facilitates application development within our content.
PMR: “open” means controlled by Elsevier. The rest of the sentence is unproven, unimplemented marketing speak.
- Elsevier will continue to manage its content in modern digital formats that facilitate the easy access, use, and re-use of content.
PMR: Do I have to explain the implications of this?
I am the author of the statement from the OSTP website. (Sidenote: I misrepresented payments of the University of California to NPG, which are on the order of several $100k, not millions. I should have checked the facts first; I definitely learnt something from this.)
The statement mentioned that we’re talking only about a quote.
I want to make sure that readers understand that the amount mentioned was for a first quote. This quote was for XML fulltext files delivered on discs. Later on, they told us that as long as we crawl the PDFs ourselves, we can do that without any charges.
There is so much scientific literature that it is in everyone’s interest that content is indexed as well as possible. If NPG can be crawled, then this is indeed a big step forward for all indexing projects of NPG content!
Thank you very much, Maximilian,
I support your letter – and this episode has been valuable. It is much better that this discussion is in the public domain.
We need to know the details of what NPG allow – are there any conditions about republishing the results?
It would help very considerably if publishers took a constructive pro-active approach and anticipated this type of request.
BTW I have been asked to give evidence to the UK Intellectual Property Office and it is very useful to have good evidence.