Boycott Elsevier: Does your institution invest in them?

Posted on February 16, 2012 by pm286

I am supporting the Boycott against Elsevier (http://thecostofknowledge.com/ ) not only for the reasons given there (exorbitant prices, bundles of unwanted journals, support for SOPA/PIPA/RWA) but also because they exercise monopoly control. This monopoly is supported through restrictive contracts and cripples innovation in scholarship such as text-mining and data-mining, re-use of factual scientific information and many other necessary actions.

Elsevier’s market is based on reputation and is fragile. The CostOfKnowledge boycott was sufficiently prominent that it caused the share price to dip. Investors are clearly watching the current concern about our protests and we should be able to transmit our concerns to them. In the UK we are able to ask public bodies questions through Freedom Of Information and I am now suggesting we do this on a wide basis.

It would be very disquieting to find that any University or any public body responsible for libraries actually invested in Elsevier. That would imply a conflict between trying to reduce journal prices and benefitting from having higher ones. I am therefore asking my current University to confirm that they do not invest in Elsevier.

This is easy to do. Visit http://whatdotheyknow.com and type a brief letter (such as the one below). The University is required to respond within 20 working days. Anyone can do this (you don’t have to be a UK citizen AFAIK and you don’t have to have any connection with the institution).

To: University of Cambridge
Subject: Freedom of Information request – Investments in Elsevier

Dear University of Cambridge,

I would like to know if the University or any of its subsidiary companies have any investment in Elsevier (Reed Elsevier PLC/N.V.), and if so how much.

Yours faithfully,

Peter Murray-Rust

I would urge readers of this blog to copy my action and ask other Universities to confirm that they do not invest in this way.

This type of action will also help to keep the momentum of the boycott.

Posted in Uncategorized | Leave a comment

URGENT: US Citizens MUST Sign RWA petition

Posted on February 16, 2012 by pm286

I’m amazed and saddened that the community has not massively signed the petition against the RWA. (I haven’t because I’m not a US citizen.) The petition wanted 25,000 signatures and has so far got far fewer. It gives the impression that academia doesn’t care. If the petition isn’t signed, then publishers will say “What a wonderful job we’re doing. Academia loves us – they approve of the RWA – the NIH is against the vibrant market economy”, etc.

It’s simple – sign the petition.

Subject: HR3699, Research Works Act

Rep. Carolyn Maloney has not backed off in her attempt to put forward the interests of Elsevier and other academic publishers.

If you oppose this measure, please sign this petition on the official ‘we the people’ White House web site. It needs 23,000 signatures before February 22nd and only 1100 so far. Please forward far and wide.

Oppose HR3699, the Research Works Act

HR 3699, the Research Works Act will be detrimental to the free flow of scientific information that was created using Federal funds. It is an attempt to put federally funded scientific information behind pay-walls, and confer the ownership of the information to a private entity. This is an affront to open government and open access to information created using public funds.

This link gets you to the petition:

https://wwws.whitehouse.gov/petitions#!/petition/oppose-hr3699-research-works-act/vKMhCX9k

—
Raji Edayathumangalam
Instructor in Neurology, Harvard Medical School
Research Associate, Brigham and Women’s Hospital
Visiting Research Scholar, Brandeis University

Do NOT assume that RWA will fail. A failure to fill the petition will set us back. A late rally will have huge impact.

PeterMR and PeterMR’s avatar oppose RWA (and so do many publishers)

Posted in Uncategorized | 3 Comments

101 reasons we need @ccess to BOAI-compliant material: Translation

Posted on February 14, 2012 by pm286

We’ve started the @ccess resource and community to make more and hopefully all scholarly material fully BOAI- and OKD-compliant. Anyone can use it for any legal purpose and do anything with it without permission or fear of being sued by publishers. There are probably 101 reasons why @ccess is valuable – and most of them I haven’t even dreamed of. So one thing @ccess will do is collect examples of why @ccess-compliance is essential. (Note that I shall never use the words Open and Free in a meaningful sense because they aren’t precise.) So here’s an example from the list http://lists.okfn.org/pipermail/open-access/

On Mon, Feb 13, 2012 at 6:40 PM, Douglas Carnall <dougie.carnall@gmail.com> wrote:

>> Especially for scientists access to complete articles and data
>> is compulsory, but I guess that for “laymen” illustrative pictures and
>> abstracts would be sufficient.

>I always get nervous when I see this sort of scientist/layman
>distinction, and I think we should work to eradicate such a boundary
>as much as possible. (I was a layman myself until a few years ago,
>and would have hated to be fed a watered-down version of research
>while an elite priesthood of scientists got the Real Stuff.)

I’d like to reinforce this point. As a translator and editor I very
often deal with unfamiliar topics and need to get up to speed quickly
with the language and jargon typical in a field. It is a major
frustration in my work that the most authoritative work is locked up
behind paywalls. Typically I need to briefly access one key term in a
handful of articles to understand how it is used in the field. As the
prevailing rate for technical translation is around $0.12-0.20/word,
accessing 3 or 4 articles at $30 each to check a single term is
completely unfeasible. But that would be the best way to ensure high
quality. I find paywalls vexing precisely because dumbed down
popularizations are useless to me.

PMR: This is a brilliant example of how people don’t realise the different uses to which articles can be put. What percentage of a domain do translators need? For example, if we got 10% of all papers, is that likely to be enough?

Another similar requirement comes from my own field of computational linguistics. To train machines to interpret text you need a marked-up corpus. For that you absolutely have to have BOAI material – reading free through a paywall is useless. It needs to be redistributable.

DC: The point more generally is that neither the author nor the publisher
can possibly conceive of all the potential ways that a scholarly work
might be useful when it is freely available. If the scholarly
literature could be treated as one vast linguistic corpus, I am sure
that interesting developments in scientific communication,
terminology, and translation would follow, for example.

PMR: So let’s collect more examples on the list. What have people wanted to do with scholarly publications and not been able to?


Avian Malaria. Can Bibsoup and @ccess help? Do penguins get malaria?

Posted on February 12, 2012 by pm286

We’re taking MALARIA as our lead project in @ccess. If you haven’t read about @ccess, read the previous post. Many people are incredibly frustrated by lack of access to the scholarly literature. I call them the “Scholarly Poor”. If you want to read the literature it often costs 35 USD per paper. PER PAPER for ONE DAY. If you work in a University you usually get this “for free”. Of course it’s not free – it comes out of research grants, student fees (yes, student fees go to support the library), government grants (if applicable), charitable donations. It feels free to the researcher but it costs a lot.

And if you’re not in a University it’s anything but free. So we thought we’d have a look at what you can get. Although this is literally deadly serious, I’m illustrating this with our #animalgarden Bibsoup team. What’s Bibsoup? It’s an idea that lets ordinary beings manage their bibliography and grow new functionality (http://bibserver.org/about/bibsoup/ ). So we have built a Bibsoup for MALARIA. Pubmed showed us how to download their bibliography. This bibliography is OKD-Open, regardless of whether the content referenced by it is or is not Open. (http://openbiblio.net/principles/ ).

Jim Pitman developed Bibserver software and over the last year Mark MacGillivray has developed it into a major resource. Mark’s ingested all the records. Tom Olijhoek and Bart Knols of MalariaWorld have given us keywords to search with (“malaria”, “plasmodium”, etc.) – that gives 73560 records (see http://malaria.bibsoup.net).

The animals are now very worried about malaria – It seems to be common in Owls. Do penguins get it? #animalgarden is going to use BibSoup to explore the literature. They don’t have any money so what will they be able to read? They can read the titles, and they can usually read the abstracts.

But abstracts are acknowledged to lack critical information. They don’t have things like:

Maps
Methodology
Caveats
Tables
Pictures of animals and parasites
Graphs

For that you have to have the full text. And even if you have the full text you can’t reproduce it unless it’s OPEN. OKD-OPEN (free to read is not good enough). So how many articles have free full text and how many have Open content?

I sat down to watch the football while Owl and Penguin examined the bibliography. They limited the search to birds by typing “AVIAN”. Maybe they have missed a few (“false negatives”), but it won’t affect the conclusions. And only one paper was a “false positive” (nothing to do with malaria – feeding garlic oil to starlings) [made owl feel sick just to read it]. They’ve got 70 papers in the period 2000-2010. Here’s their OPENness classification:

2 OKD-OPEN, BOAI (CC-BY)
8 “Free to read” gratis (the mechanism for being free is not given)
5 “author manuscripts” gratis (maybe the version that the author submitted and not the final paper)

That’s 15 out of 70. Just over 20%. In 2009-2010 only 1 of 13 papers was readable without paying.
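For anyone who wants to check the arithmetic, here is a minimal Python sketch of the tally. The three counts and the total of 70 are the ones Owl and Penguin reported above; the variable names and the `share` helper are mine, purely for illustration.

```python
# Counts reported above for the 70 avian-malaria papers (2000-2010).
counts = {
    "OKD-OPEN, BOAI (CC-BY)": 2,
    "free to read (gratis)": 8,
    "author manuscript (gratis)": 5,
}
TOTAL_PAPERS = 70


def share(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Papers that can be read at all without paying.
readable = sum(counts.values())
readable_pct = share(readable, TOTAL_PAPERS)

# Papers that can also be re-used (OKD-OPEN), a much smaller share.
okd_open_pct = share(counts["OKD-OPEN, BOAI (CC-BY)"], TOTAL_PAPERS)

for label, n in counts.items():
    print(f"{label}: {n} papers, {share(n, TOTAL_PAPERS)}%")
print(f"readable without paying: {readable} papers, {readable_pct}%")
print(f"OKD-OPEN: {okd_open_pct}%")
```

The point of separating the two percentages is that "readable" and "re-usable" are very different numbers.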

What did they find? The first thing is that Bibsoup made it incredibly easy to browse this literature (of course Pubmed has provided the base functionality). It’s not easy to find whether you can read a paper. Often it says “Full text online” but it means “Full text IF YOU PAY”. It usually depends on the journal, and that’s where Bibsoup makes the contribution. Bibsoup will allow us in @ccess to identify the publisher and therefore – to a first approximation – whether the paper is OKD-OPEN.

“My journals are BOAI-Open” shouts Gulliver, the Open-Access Turtle (Gulliver is the green one). That makes it easy – we can immediately label all BMC journals as OKD-OPEN. Unfortunately there are only 2 BMC papers in these 70 papers. The other 13 papers are free-to-read. That’s a lot better than nothing, but you can’t use them in books, lectures, magazines, etc. You can’t use them for text-mining. Actually you can’t use anything on Pubmed or UKPMC for text-mining even if it’s OKD-OPEN. That’s because the closed-access publishers have required Pubmed to forbid it, even though it’s Open. Using Pubmed for anything automated is almost impossible – the publishers have made sure of that (http://www.ncbi.nlm.nih.gov/pmc/about/copyright/):

Restrictions on Systematic Downloading of Articles

Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the PMC web site. Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.

And

Articles that are available through the PMC OAI and FTP services are still protected by copyright but are distributed under a Creative Commons or similar license that generally allows more liberal use than a traditional copyrighted work. Please refer to the license statement in each article for specific terms of use. The license terms are not identical for all the articles.

“What does that mean?” said Gulliver. “It means that in the Open Access subset you STILL cannot use automated methods because the licences might forbid you” said Owl.

“But” said Penguin, “that’s what @ccess will do. We only have to read each article once, annotate it, and then EVERYONE will know what the licence is. If we each do a bit, then the work becomes easy. We’ll see if Mark can create a button we can click on each record.”

And I think Mark can :-)

The sad news is that it looks like Penguins get malaria:

H J W Sturrock and D M Tompkins (2007)
Avian malaria (Plasmodium spp) in yellow-eyed penguins: investigating the cause of high seroprevalence but low observed infection.
in New Zealand veterinary journal

And the even sadder news is that Penguin cannot read the article.

But at least we can reproduce a picture from Gulliver’s journal (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3152766/figure/F2/)


Figure 2

Mosquito trapping methods used in this study. A – CDC Light trap hung from dead tree in grassland along Nyong River, Ndibi; B – Net trap placed in grassland along Nyong River; C – Collecting mosquitoes resting in grass and on tree branches by sweep net. Mosquitoes were aspirated out from the sweep net and then placed into holding cages for identification and preservation. D – Ehrenberg bird trap hung in branches of dead tree in along edge of Nyong River grassland.

Owl is not sure that SHE would like to be placed in a cage to be bitten by mosquitoes…

Posted in Uncategorized | 1 Comment

What is the use of @ccess? Do owls get malaria? Is Wikipedia believable? Who’s Alice Hibbert-Ware?

Posted on February 12, 2012 by pm286

Yesterday I blogged about our new project in Opening scholarship: @ccess. Several people retweeted it, and one asked “What’s @ccess for?” – a good prompt for some more information. @ccess is to discover OPEN scholarly information, to label it, and promote it. After that we believe that anything is possible. So I’ll use an example.

We’re lucky to get interesting birds in our garden and I idly wondered whether birds get malaria. They get influenza, of course, and they are a major host and therefore hazard to human health (the human viewpoint). But malaria? Do birds get bitten by mosquitos? I had no idea. So I went to Wikipedia and in 10 seconds discovered http://en.wikipedia.org/wiki/Avian_malaria . Yes birds get malaria. From a bird point of view it’s very serious:

Hawaii has more extinct birds than anywhere else in the world; just since the 1980s, 10 unique birds have disappeared. Virtually every individual of endemic species below 4000 feet in elevation has been eliminated by the disease [malaria].

And I read on:

since 1995, the percent of malaria-infected Great Tits has risen from 3 percent to 15 percent. In 1999, some 4 percent of Blackcaps — a species once unaffected by avian malaria —were infected. For Tawny Owls in the UK, the incidence had risen from two or three percent to 60%.^[1]

And I was gobsmacked. Blackcaps used to be summer visitors only – but now they winter in UK (in our garden). And Owls. I have a special relation with owls in Cambridgeshire as my great-aunt, Alice Hibbert-Ware (who lived in Girton – 5 km from Cambridge), was seminal in persuading the country that little owls should be protected. Here’s Girton Bird News (http://www.girton-cambs.org.uk/nature/birdwatch0607.html ):

Once introduced, it spread rapidly and as it spread it fell foul of ever greater numbers of gamekeepers. They accused the Little Owl of every crime in their calendar, […] It was against this near hysterical background that Alice Hibbert-Ware, after an extensive publicity campaign in the press and on BBC radio, was appointed in 1935 by the BTO as principal investigator into the Little Owl’s diet. Over the next two years, assisted by 75 helpers in 34 counties, she assembled a mass of data, primarily derived from pains-taking dissection of 2460 Little Owl pellets (the indigestible fur and bones ‘sicked up’ by birds of prey), from just one of which she extracted the remains of 343 earwigs, and from another 2000 crane-fly (‘daddy-long-legs’) eggs. This forensic detail both demolished the myths of larders and beetle-luring charnel houses, and swept the ground from under the feet of those who stigmatised the Little Owl as a wholesale destroyer of game-bird chicks. Over the years the bird’s black reputation has withered away, due in no small measure to the initial efforts of Alice Hibbert-Ware, and it is now a welcome addition to the fauna of these islands. So, remember Alice when next you rest in the shade under ‘her’ trees!

I remember her through a photograph given to my father by Eric Hosking, the great bird photographer. It’s a gorgeous photograph, with the owl at the entrance to the burrow. Here’s a detail showing clearly that the owl is eating a cockroach, not a partridge chick.

So maybe Little owls also get malaria? And that’s where the problem starts. Wikipedia gives references:

^ Garamszegi, László Z (2011). “Climate change increases the risk of malaria in birds”. Global Change Biology 17 (5): 1751–1759. doi:10.1111/j.1365-2486.2010.02346.x.

It’s from Wiley. So I have to pay. I don’t know how much but probably 30-40 USD. And I have to read it by midnight because I only have ONE day. So of course I don’t read it.

So I don’t know that owls get malaria. And I don’t know whether it’s restricted to Tawny Owls. I imagine not. So the Girton little owls probably have malaria.

And @ccess? When I read Wikipedia I’d like to know whether the references are worth following. It’s a waste of my time to click on links behind Wiley’s paywall. I have a legitimate need to follow up this information – it’s nothing to do with my day-job in the University of Cambridge, it’s because I am a concerned member of the human race.

Birdwatchers are part of the scholarly poor. @ccess aims to collect OPEN information in subdomains – doesn’t have to be science, but that’s my speciality. It has to be OPEN. The info can then be used for anything. Here’s some ideas:

Collections of images
Guides for health workers and patients
Mapping information onto Open maps
Tutorials

And hundreds more ideas

Here’s a typical example of a paper on avian malaria http://www.ncbi.nlm.nih.gov/pubmed/18442920

J Struct Biol. 2008 Jun;162(3):460-7. Epub 2008 Mar 21.

The avian malaria parasite Plasmodium gallinaceum causes marked structural changes on the surface of its host erythrocyte.

Nagao E, Arie T, Dorward DW, Fairhurst RM, Dvorak JA.

Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA.

It’s got some lovely images of how malaria infects cells, using atomic force, scanning and transmission electron microscopy (an area I used to be involved with). I’d like to put them on this blog. But I can’t. The paper is published by Elsevier and costs 31 USD to read. If I take images from that paper Elsevier might sue me. (Not fanciful: Wiley threatened a graduate student for daring to put a scientific image on her blog, /pmr/2007/05/24/sued-for-10-data-points/ ). So science is impoverished.

But hey! At the bottom of the paper it says:

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

This is almost gobbledegook to normal humans, but for those of us accustomed to doing battle (sorry, but that’s how I feel) with publishers I interpret this to mean:

This is what the authors sent to the journal. The copyright in this does NOT belong to the publisher and they have no rights over it. It’s technically the author’s pre-publication pre-review manuscript. So-called “Green” Open Access (not a self-evident term to non-specialists).

But that means the authors still hold the copyright? And I would have to ask them for permission?

Normally yes. But the authors here are from the US NIH. And works of the US government are in the public domain. So the images are in the public domain! And here they are – how malaria gets into a cell:

Fig. 1

Typical SEM images contrasting the surface topography of noninfected (a) and P. gallinaceum-infected erythrocytes (b, c). Noninfected erythrocytes have a smooth surface. In contrast, the furrow-like surface structures are seen on infected erythrocytes. Bars in (a) and (b) represent 1 μm, in (c) 200 nm.

If I’m wrong my quarrel is with the NIH, not Elsevier. If the NIH have handed the total copyright of these images to Elsevier then I’ll scrub this blog post.

If I’m not wrong, then these images can be aggregated into @ccess and made available to anyone who wants them, for example:

Writing a lecture
Writing a textbook
Educating people infected with malaria to show the science going into the problem
Re-used as components in artistic works

And so on.

Now it’s possible that I have run foul of Pubmed rules. That I can’t even re-use public domain works in Pubmed. If so, Pubmed will tell me. And they’ll tell me that THEY don’t make the rules – the publishers do.

Let’s see.

But in any case there is masses of stuff we can all put into @ccess, that will enhance the information available to the human race. And we all want that, don’t we?

NOTE: I took the photograph of the photograph of the little owl. I might have broken copyright as Eric Hosking (http://en.wikipedia.org/wiki/Eric_Hosking) died 20 years ago. But somehow I think he and his heirs will approve of what I have done.

NOTE: I can’t reproduce Alice H-W’s report on the Little Owl as she died in 1944 and Wiley wants 30-40 USD for me to read it for ONE day. (Except I have it in my bedroom.)


@ccess for everyone. A new initiative in open Scholarship

Posted on February 11, 2012 by pm286

We have started a really new exciting venture in making scholarship available to everyone. We’re starting from scratch. We’re still working out details. And “we” means “you”.

About 3 weeks ago things came to a head. Many people are frustrated with the lack of real, 21st-century access to scholarship. To the outputs of funded scholarship (somewhere between 300 BILLION USD and 1000 BILLION USD). And with the feeling of exclusion that everyone who isn’t a powerful academic feels. The inability to contribute. The feeling that scientific research is a spectator sport for the lucky few who are in rich universities. Some of us swapped emails, initially from a sense of frustration.

But what emerged after about a week was the sense that we had exciting new opportunities to change the world in a bottom-up manner. We are increasingly empowered by the public technology of the web, and we are building on top of it. So we are creating a project, a philosophy, a toolkit, and collections of content. It’s happening under the aegis of the Open Knowledge Foundation (OKFN) (mailing lists at: http://lists.okfn.org/mailman/listinfo ) as a new project (open-access) and overlaps with open-bibliography and open-science.

When we use “open” we are committed to the Open Knowledge Definition: (http://opendefinition.org/)

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

This applies to several aspects of Open Scholarship – Open Access, Open Bibliography (BOAI-compliant), Open Citations and Open Data (OABCD) as a start. If we have OKD-conformant information then, for the first time, we can start to see the power of machines and humans working together. We can use automation without having to seek permission.

We’re currently calling it @ccess.

@ccess for all.

Why this strange formation? Because it’s simple, memorable, and searchable on the web. It avoids the overloading of “Open”. “Open Access” and more generally “Open Foo” is not an effective label for OKD-compliant information. By using a clearly defined string we can label information as OKD-Open, as BOAI-open. (Unfortunately even after 10 years of BOAI there is no simple automatic way of telling that a piece of information is re-usable without asking permission).

So what information can we AUTOMATICALLY label as @ccess – OKD-open? Not very much yet, but we expect that this will grow rapidly. Here’s what we can do:

Anything specifically labelled with CC-BY, CC0, PDDL licences.
Datasets in CKAN labelled as Open
Articles from BOAI-compliant publishers (the main ones being BMC, PLoS, EGU, and a few others)
Data from bioscience databases (e.g. genomes, protein structures) (Bioscientists don’t normally use licences but adhere to the Bermuda principles)

Here are some things that, by default, are not AUTOMATICALLY recognisable as OKD-compliant

Depositions in Institutional repositories. Almost no content is labelled usefully for machines
Self-archived manuscripts including arXiv
Bibliographic collections
Contents of Pubmed (except as above)
Hybrid publications (95% is NOT OKD-compliant)
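The two lists above amount to a simple rule: label automatically only when there is an explicit, machine-readable signal, and leave everything else for a human to read and annotate. Here is a minimal Python sketch of that rule. The record format (a dict with `licence`, `publisher`, `source` and `ckan_open` keys) is hypothetical and is not Bibserver’s actual schema; the publisher set contains only the three named above, and the bioscience-database case is left out because it is not signalled by a licence.

```python
# Hypothetical record format: a dict with optional "licence", "publisher",
# "source" and "ckan_open" keys. This is NOT Bibserver's real schema.
OPEN_LICENCES = {"CC-BY", "CC0", "PDDL"}
BOAI_PUBLISHERS = {"bmc", "plos", "egu"}  # the post also mentions "a few others"


def auto_label(record):
    """Return "@ccess" only when an explicit signal says the item is OKD-open.

    Anything without such a signal is "unknown": it may well be open, but a
    human has to read the licence and annotate it.
    """
    licence = (record.get("licence") or "").strip().upper()
    publisher = (record.get("publisher") or "").strip().lower()

    if licence in OPEN_LICENCES:
        return "@ccess"
    if publisher in BOAI_PUBLISHERS:
        return "@ccess"
    if record.get("source") == "CKAN" and record.get("ckan_open") is True:
        return "@ccess"
    return "unknown"


examples = [
    {"title": "a CC-BY article", "licence": "cc-by"},
    {"title": "a PLoS article", "publisher": "PLoS"},
    {"title": "a repository deposit", "source": "institutional repository"},
    {"title": "a non-commercial licence", "licence": "CC-BY-NC"},
]
for rec in examples:
    print(rec["title"], "->", auto_label(rec))
```

Note that the default is "unknown", not "closed": the sketch only records what a machine can safely assert.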

So a major problem is that we don’t know what is actually OKD-Open and what we can use for modern automated scholarship. @ccess aims to change that.

We’re going to build collections of OKD-Open material and label it as such. To show that it’s useful and a new approach to scholarship. Open to everyone, not just academics. Because @ccess is bidirectional – it’s about building our principles and community so that we have a say in modern scholarship.

We’re using our new bibliographic tools – Bibserver and Bibsoup – as an efficient means of collecting the information and labelling it. We’re starting with disease as there are already active communities who want to start using the tools. Our first project is based on MALARIA. This idea has been brewing for a year or more – Open Research Reports – but we’ve had to wait while we developed the technology. We’ve now got this, and we can collect and, very soon, label and annotate the information.

We are very grateful to Tom Olijhoek, Bart Knols and MalariaWorld for acting as the centre of this. Our first task is to find what Open information there is. This will require considerable human effort. Even Pubmed isn’t able to label the documents which are OKD-Open.

We’ll be posting more about this – so far we have 70,000 references for Malaria keywords (“Malaria”, “Plasmodium”, etc.). Mark MacGillivray has ingested them into a Bibsoup and we’ll be working on them. The first activity will be to find out how many are OKD-Open. I’m guessing about 3%. That’s the sad face of access to scholarly information. That’s the amount that we can legitimately text-mine, re-index, use graphics from, etc.

But as people see the value of this they’ll want more. And that’s an important driver for making more information OKD-Open and labelling it.

Posted in Uncategorized | 3 Comments

Elsevier, Nature and Content-mining – yet another Digital Land Grab – wake up academia and fight. Or surrender for ever

Posted on February 10, 2012 by pm286

I have just discovered Elsevier’s content mining document.

For those who don’t know, I have been trying to get permission to text-mine Elsevier content for two years and have been treated as a second-class citizen and ultimately come away with nothing. See /pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ . The analysis in this post will centre round Elsevier but also applies to Nature. And I suspect it applies to a large proportion of the rest of the publishing community. I’ll reproduce most of the document. (I don’t have the sacred copyright permission to reproduce it of course, but…). BTW the Elsevier staff in Oxford a year ago promised that they would update me when this document came out but of course they didn’t.

Read http://www.elsevier.com/wps/find/intro.cws_home/contentmining before you read my critique. Consider the implications. Then I’ll indicate why we have been so badly let down by academic libraries or their purchasing agents, who have given away more of our crown jewels without a fight.

If you want to know why I am so angry with University Libraries read the bottom of the post as well.

OK, have you read it? – it’s not very long. I’ll go through and annotate it – Like a peer-reviewer. Because after all that’s why we pay Elsevier isn’t it? – because without them we’d be incapable of organising peer-review: (Elsevier is in italics).

ELSEVIER CONTENT MINING POLICY

Overview of content mining

• Content Mining concerns the automatic processing of large collections of various forms of data and information to identify, organise and perform analysis in order to determine possible links within the content that may not be obvious on initial inspection.

PMR: This is an extraordinarily simplistic view. It probably arises from Elsevier’s limited vision. From Wikipedia:

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and concept extraction out of images/audio/video could be seen as information extraction.
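PMR: To make those definitions concrete, here is a toy Python sketch of the steps Wikipedia describes: structure the text, derive patterns, extract entities. It uses only two article titles quoted elsewhere on this blog. The regular expressions and function names are my own illustration, not anyone’s production pipeline, and real text-mining needs far more than this.

```python
import re
from collections import Counter

# Two article titles quoted in earlier posts on this blog.
TITLES = [
    "The avian malaria parasite Plasmodium gallinaceum causes marked "
    "structural changes on the surface of its host erythrocyte.",
    "Avian malaria (Plasmodium spp) in yellow-eyed penguins: investigating "
    "the cause of high seroprevalence but low observed infection.",
]


def tokenise(text):
    """Structure the input: lower-case it and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_species(text):
    """Entity extraction: find 'Plasmodium' followed by a lower-case epithet."""
    return re.findall(r"\bPlasmodium\s+[a-z]+\b", text)


# Derive a simple pattern: how often each term occurs across the titles.
term_counts = Counter(tok for title in TITLES for tok in tokenise(title))

# Collect the distinct parasite names mentioned.
species = sorted({name for title in TITLES for name in extract_species(title)})

print(term_counts["malaria"], term_counts["avian"])
print(species)
```

Even this trivial example needs the text itself, in a form a machine is allowed to read in bulk.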

• There are various methods to perform this processing, but there are elements common to all methods, including an automated way to process all sizes and types of content in which to identify relevant information, facilitate its extraction and its analysis.

PMR: This is a woolly sentence – the only relevant concept is automation. This is the key to our struggle for Free/Open information to mine.

• Content mining has links to semantic technology as it focuses on the interlinks and contextual commonalities to enhance the understanding of the content.

PMR: I have no idea what a “contextual commonality” is. The only meaningful concept here is semantic technology.

• The development of these mining approaches are of particular importance within the scientific community to drive the interdisciplinary nature of research and support new areas of discovery.

PMR: A safe generalization adding little new insight

Elsevier’s principles on content mining

Elsevier wants to support our customers to advance science and health.

PMR: This is so vapid that it can only be classified as marketing froth. What Elsevier “wants” and what Elsevier provides have no correspondence in reality.

We want to help them realise the maximum benefit from our content and enhance insight and understanding through content mining.

PMR: And in practice they do everything possible to retard the independent development of text-mining.

Our journals and books have added value – we invest in quality content and enrich content to maximise discoverability and usability.

PMR: “maximise usability”??? Double-column (or even single-column) PDF is a major destruction of information. Scientists have spent hundreds of person-years (probably thousands) trying to get information out of PDF, whereas simply providing us with the original author manuscript in Word or LaTeX is all we need. We can add the document semantics. But no, we need Elsevier to provide the content.

We believe a transparent content mining policy framework is essential, which needs efficient implementation and flexibility to cover multiple scenarios.

PMR: Devoid of meaning. “Transparent”?? Efficient and flexible? Weasel words (a Wikipedia term) that imply only Elsevier is clever enough to do this.

The framework of open innovation enables and facilitates application development within our content.

PMR: “open” means controlled by Elsevier. The rest of the sentence is unproven, unimplemented marketing speak.

Elsevier will continue to manage its content in modern digital formats that facilitate the easy access, use, and re-use of content.

Our approach to providing content mining

Elsevier is receiving an increasing number of content mining requests and we are developing solutions to meet customer needs. We are doing this because we realise that researchers and organisations wish to derive even more value from our content, but in a way that they choose. Consequently we have adapted our policies to this primary goal.

PMR: “Researchers choose”??? No, Elsevier chooses. Or have I missed a public consultation process??

We wish to understand our customers’ text mining requirements and as practically every content mining request has a different goal and there is not a common solution to provide this. Consequently we request that customers looking to mine our content should speak to their Elsevier Account Manager or should contact us directly at universal.access@elsevier.com

PMR: Maybe they do “wish” but they aren’t trying – as this document shows. Yes, every content mining project has a different goal. So, before doing research on OUR own output we have to speak to “our Elsevier account manager”.

PMR: a separate comment for “universal access”. Newspeak. This is so Orwellian it’s unbelievable.

We will then discuss the mining request, access to the content (see below), licensing and (where applicable) pricing for the project.

PMR: HERE WE HAVE THE CRUX!!! This is the first meaningful sentence in the whole document. I have to get permission from Elsevier to do research on “their” content. If they don’t like what I want to do they can just block it – or, better, fail to respond. “Licensing”: I won’t be able to publish the results Openly. I’ve already seen their contract (see my blog post). We carry out the mining so that Elsevier can possess the results, creating enhanced content that they can sell to the community at higher prices and with more justification for their “added value”.

Mining requests are often content specific. Customers can choose to mine our full-text content, abstracts, data and other materials. A charge may be applicable dependent on the request.

PMR: “A charge”. I’ll discuss that below. This is the second meaningful sentence.

Common requests for Content Mining include:
- Running extensive searches and using locally loaded content for text mining purposes for research.
- Extraction of semantic entities from Elsevier content for the purpose of recognition and classification of the relations between them.
- Performing extensive mining operations on subscribed content, including structuring input text, deriving patterns within this text and evaluation and interpretation of the output.
- Customers can integrate results on a server used for the subscriber’s own mining system for access and use by its researchers through the subscriber’s internal secure network.

PMR: “the subscriber’s internal secure network”. Again we see the control. Nobody can extract value and publish it to the world.

All commercial usage of content mining results arising from Elsevier content will be subject to licensing and will be chargeable. We will discuss the utilisation of results in accordance to each request.

PMR: Do I have to explain the implications of this?

Facilitating access & technology to empower content mining

Elsevier have developed several different methods to allow customers to mine our content. This provides maximum flexibility and multiple options to access the required content. Examples of this include methods to deliver high amounts of content on demand, API access and other solutions associated with specific content types. For example:

ScienceDirect and Scopus licence agreements – subscribers to these products may have options to search, download, email and extract content to allow them to perform their requisite analyses
Application Marketplace – Enabling developers who wish to design and implement applications to analyse our content, or who may wish to test applications as part of their research within Elsevier content. For further information on SciVerse Applications, please visit http://www.info.sciverse.com/sciverse-applications

PMR: In other words I can only have access through Elsevier’s walled gardens.

Now I think we are at a really dangerous place in the history of modern digital scholarship.

The simple position is that we have given the publishers our content. Up till now they have simply replayed it back to us (at vast cost to us and profit to them). But the cost is irrelevant.

Now they want to control it. And get us to pay even more. Lots more.

And the first library to agree to pay for text-mining access has sold the whole academic community down the river.

It is our RIGHT to text-mine scientific content. We created it and we can use modern tools to mine it. Without any help from publishers.

But when University libraries “purchase subscriptions” they only consider the pricing. They come back and tell us “we got a great deal – we beat the publishers down!” (I think there has been a recent Russell Group “victory”).

But they flabbily sign the ultra restrictive clauses in the contracts. This is not about copyright, it’s actually signing a much much more restrictive contract. That forbids scientists like me any possibility of doing any meaningful chemical linguistic research. So here are two questions for libraries:

Has your organization ever challenged the restrictive contracts on text-mining? And won the freedom to text-mine?
Have you ever negotiated with a publisher about additional charges for textmining?

Only if you can answer YES and then NO can you hold your head up.

And Nature?

85,000 USD for ONE research group to do text-mining.

http://www.whitehouse.gov/sites/default/files/microsites/ostp/scholarly-pubs-%28%23226%29%20hauessler.pdf

So if you have any information about publishers wanting fees to allow text-mining please add it as a comment.


Open Access and Eric Raymond

Posted on February 5, 2012 by pm286

This blog has been tackling the problem of Open Access, what its vision is and how to get a coherent movement. I’ve been excited to get a comment from Eric Raymond (http://en.wikipedia.org/wiki/Eric_S._Raymond ). Eric (with whom I shared a platform in ca 1996) is one of the pioneers of the F/OSS movement:

Raymond became a prominent voice in the open source movement and co-founded the Open Source Initiative in 1998, taking on the self-appointed role of ambassador of open source to the press, business and public.

Here’s his comment in full – I then comment. /pmr/2011/12/20/the-open-access-movement-is-disorganized-this-must-not-continue/#comment-102743

Eric S. Raymond says:

February 3, 2012 at 12:06 am

I was one of the Open Source Initiative’s co-founders and its first president. The confusion you guys are experiencing and the issues you’re debating are startlingly reminiscent of where we were in late 1997, early 1998. As the original poster noted, we got our act together. You can too.

I endorse P-MR’s analysis and his conclusions. You need a parallel to our Open Source Definition – an Open Access Definition. And, yes, it cannot allow no-commercial-use restrictions.

The reason for this stance in the OSD wasn’t actually philosophical, it was brutally practical. The problem is that there is no bright-line definition of “commercial use”. Licenses with a no-commercial-use provision make it too difficult to reason about what your rights are. Such uncertainty exerts a chilling effect on reuses which must be permitted if “openness” is to have any meaning.

Some of you in this discussion seem ready to constitute yourselves as an Open Access Initiative and write an Open Access Definition. To which I say; do it! Audacity is required in these situations.

I’m willing to assist; I can help with drafting the definition, and I can explain lessons of experience from our community that I think will apply directly to the problems you face.

I’m delighted to get this additional confirmation we are on the right track. We are audacious, we have our own definition (http://opendefinition.org/ ). We are also brutally practical – if people are paying 5000 USD for “Open Access” then they should get a much better deal than what most publishers offer. So we have much of this.

What we need is a revitalised Open Access Initiative, one that insists on BOAI and OKD compliance. If Eric can bring new insights from the F/OSS experience, great.

Discussion and creativity continue on http://lists.okfn.org/mailman/listinfo/open-access . If you share the view that we need clear adherence to the BOAI principles and practice, join the list.


Update: Open Access, SemanticPhysicalScience, Open Bibliography and #animalgarden in the snow

Posted on February 5, 2012 by pm286

I’ve been off air for a bit as:

- my hard disk crashed – it emitted messages (“Your disk is about to crash”, then “Ctrl-Alt-Del to access dying disk”, then – the “rest is silence”). Since I live my life in the open I can access most of what I need – about the only thing lost was the originals of the videos.
- the UCC server (wwmm) has been down for 2 days
- #animalgarden were going to make a video about bibliography but they got distracted (see below)

Charlotte Bolton and I have been pulling together the outputs of the #semphyssci meeting. We have planned that about 13-17 articles will come out of it and be published in J.Cheminform. Open Access costs money and we’re grateful to EPSRC/PathwaysToImpact for funding. Unlike last year (http://www.jcheminf.com/series/semantic_mol_future) more of the articles come from outside our group so it won’t be as hairy. We’ve posted all the videos and I will start blogging the content soon.

Open bibliography has been zooming ahead. We have an increasing number of pots of BibSoup. Every week Adrian Pohl tells us of libraries who have released Open bibliographic data. The Bibserver software is becoming very easy to deploy and very useful. The animals were hoping to make a video but they got distracted telling Gulliver about snow:

Open Access. There’s a critical mass of people who care about BOAI-compliant access and making that formal – and exciting. Doing rather than talking. See http://lists.okfn.org/mailman/listinfo/open-access . We’ve come up with a clear and exciting position. If you want to return to the roots of BOAI this is the place. Everyone is, of course, welcome.

What have the Publishers ever done for us? And do we need them?

Posted on January 29, 2012 by pm286

Tim Gowers has used Spike Milligan as an inspiration for challenging Elsevier: http://gowers.wordpress.com/2012/01/21/elsevier-my-part-in-its-downfall/ . British satire is one of the things that keeps us going. I’ll use the equally irreverent Pythons in “Life of Brian” (http://en.wikipedia.org/wiki/Monty_Python%27s_Life_of_Brian ). From Wikipedia:

There is also a famous scene in which Reg gives a revolutionary speech asking, “What have the Romans ever done for us?” at which point the listeners outline all forms of positive aspects of the Roman occupation such as sanitation, medicine, education, wine, public order, irrigation, roads, a fresh water system, public health and peace, followed by “what have the Romans ever done for us except sanitation, medicine, education…”.

Many industries have generated criticism in the wider community, ranging from anger to outright hate and revolution. Microsoft was (justifiably in my opinion) a major robber baron of the late 20th C. It was brought to heel by public/governmental anger and regulation and also by forces of innovation. Microsoft was effectively a monopoly but could not continue as such. Yet if you ask

“What has Microsoft ever done for us?”

even the most anti-Microsoft people would admit that they have brought new products and culture to the marketplace, and that huge numbers of people use these. If Microsoft products were suddenly taken off the market businesses would fold and kids would be crying. That’s true of most robber barons – steel, railways, cotton, etc. They brought new products and opportunities (albeit at great social and moral cost to many). Word is used by hundreds of millions (?billions) as is Excel.

I reiterate – I am not condoning Microsoft’s history – quite the reverse. I am simply saying they innovated. And some of that innovation is valued by many people.

The same can be said of most other entrepreneurs in ICT – Google, Facebook, etc. Whatever their sins they have innovated.

But when it comes to scholarly publishers it’s a different story. [I have acknowledged a few publishers such as IUCr, and some Open Access publishers – BMC, PLoS, EGU – who you should mentally exclude. But for the rest – including many society publishers – they have to stand up and be counted.]

Mike Taylor is a dinosaur expert who has got so angry with the publishing industry that he not only blogs about it but also wrote an article in the Guardian (http://www.guardian.co.uk/science/2012/jan/16/academic-publishers-enemies-science ), where he asserted that “Academic publishers have become the enemies of science”. I agree with this phrase. I have blogged for some years about the restriction, the intransigence, the arrogance of the scholarly publishing industry and I shall continue to do so. (I should be writing semantic code, but I am so upset that I have to write this blogpost first). Read his post, I won’t quote from it.

There has been a reply from Graham Taylor – director of academic, educational and professional publishing at the UK Publishers Association: http://www.guardian.co.uk/science/2012/jan/27/academic-publishers-enemies-science-wrong . It makes the case for why the publishing industry creates value. Some of it is reaction to MikeT’s article, but in a few places it attempts to show why the publishing industry is essential and justifies the 10 billion USD it takes in every year. I have extracted the paragraphs that bear on this:

when the reality is that their investments have made more research available to more readers at a lower unit cost than ever before. [and] Worldwide, around 3m research papers are submitted every year to scholarly journals – rising by around 3% per year in line with research budgets – of which around 1.5m are eventually published, including over 120,000 from UK researchers. Such journals are on the whole by their very nature tailored and adapted to the needs and interests of specific research communities. This is a complex and nuanced system that needs time to adapt to new methodologies.

The scholarly world is not yet fully open access, nor even approaching it, but that is not the fault of the publishers. [and] Publishers are certainly not opposed to open access. [and] Publishers pursue the goal of universal access through whatever means are practically available.

This is all I can find on the value that publishers contribute. My analysis.

“publishers are trying as hard as possible to create Open Access”. This is simply false. Remember PRISM? A publisher consortium that paid 500,000 USD to create the phrase “Open Access means junk science”. “Open Access is ethically flawed” [RSC. Yes, they then got rid of the person who said it. If you look at the RSC licence for “Open Science” which is NOT BOAI compliant it is not the sign of a publisher trying as hard as possible to create OA.] And that’s typical of the industry.

“we’re publishing more each year so we’re putting our charges up”. This argument may work in some industries where there is an innate limitation on the supply of goods. But in digital industries we see costs plummeting every year. We expect disks, bandwidth, cpu, to get massively cheaper each year. And the software that creates digital objects improves. So any INNOVATIVE industry would be reducing its costs.

So back to my question: “What have the publishers ever done for us?” Here’s my list – and they are all negative.

- Double-column PDF. About the most senseless way of providing information in the current age. [Oh, they’ll tell us that they are creating stuff for new formats. But it “takes time”].
- Restrictive and impenetrable licences. The industry has been excellent at this. It’s almost impossible to find out what you are forbidden to do – the easy answer is “everything except read the PDF”.
- Branding. Readers do not want a different interface for each journal. It’s usually impossible to find the current issue – hidden among the glossy Flash adverts for how wonderful the publisher is.
- The rent-for-one-day-for-40-dollar article.
- DRM.

I can’t think of any positive innovation in the industry. I mean innovation. Any 10 billion USD industry will slowly track what everyone else did years ago. Wow! We have hyperlinks!!!! Crossref? DOI? These weren’t developed by the industry. There is NO industry research and innovation. [I’ll note the efforts of Nature to develop new ideas – Connotea, etc. – but these were often short-lived because they were experiments, not commitments]. And what have they stubbornly missed and even fought against?

- Taking authors seriously. The industry sees authors as cattle. The interfaces used for submitting papers are AWFUL.
- Taking readers seriously. Readers don’t exist. The industry’s end-users are purchasing officers.
- Semantics.
- Interactive publication.
- The social revolution.

So the industry can be seen to be stagnant, self-serving, introverted, arrogant and either relying on its lawyers or branding.

And that’s a VERY dangerous place to be. “Be afraid, be very afraid”.

Because the publishing industry relies on a dam built on sand. Reed Elsevier used to be active in the arms trade: http://www.idiolect.org.uk/elsevier/

Reed Elsevier have been forced to drop their links with the arms trade – and the reasons are clear: individual and collective action by members of the academic and medical community, combined with disquiet from the public, investors and employees of Reed Elsevier. Thanks to everyone who signed the petition and who lent support in every way.

Yes, petitions. Petitions can grow very quickly in the Internet age. And that’s what Tim Gowers and Tyler Neylon have started http://thecostofknowledge.com/. “If you would like to declare publicly that you will not support any Elsevier journal unless they radically change how they operate, then you can do so by filling in your details in the box below.”

I’ve signed it, and I’m proud to have done so. So have Mike Taylor and Mike Nielsen. Perhaps no surprises there.

BUT:

We all have blogs and they reach different communities.

And those communities will reach others. Out beyond the rotten walls of academia. To the scholarly poor, whose tax dollars go to prop up the industry. An industry dedicated in practice to denying them the results of research.

Yes. Because any innovative industry would have picked up the discontent beyond academia and thought:

Wake up – we’re in the 21st C – the cost of distribution is zero. Academia is 0.1% of the world’s population [a guess, but it’s less than 1%]. We have a potential market 100 times bigger than our current market. WOW! People like Tim O’Reilly (one of the most innovative publishers) think this way. He’s dismissed the puny protestations of the industry on SOPA and PIPA: “In short, SOPA and PIPA not only harm the internet, they support existing content companies in their attempt to hold back innovative business models that will actually grow the market and deliver new value to consumers.” https://plus.google.com/107033731246200681024/posts/LZs8TekXK2T

So we need a revitalised scholarly publishing industry.

But it will not come from the current one. They have shown themselves incapable of change, and arrogant towards their feeders – the academics. We have it in our power – to kill any or all of them and start again. It is a question of getting our act together.

Because almost all monopolist empires have the seeds of their destruction.
