Why ReadCube is DRMed and unacceptable for science.

My great collaborator Henry Rzepa has read the last post and delved into Macmillan's ReadCube and found it to be totally flawed and unacceptable (http://blogs.ch.cam.ac.uk/pmr/2014/12/03/natures-fauxpen-access-leaves-me-very-sad-and-very-angry/#comment-471649). Here's Henry:

I had a good look at ReadCube, the walled-garden container for articles that is part of the "free access" announcement. It is indeed DRM-armour. If you download the software, it immediately becomes apparent you cannot launch it without either creating an account or using e.g. Google+ to enter the garden. So, no true anonymity there then? Once in, you can get it to scoop up all your drive-resident PDFs, and it proceeds to harvest the metadata for each of them. Title, authors, keywords, and probably more. This then appears to be sent to the cloud, since you can access it from any device (for which ReadCube is supported). Much like Mendeley. And we presume that Nature/Macmillan has access to this data (the privacy statement, which is available in very small print if you hunt for it, asserts that the metadata is anonymised before being aggregated). What happens to this (meta)data after the individual has (possibly unknowingly) contributed to this aggregation? We might presume that acquiring this data is core to the business plan for launching this project.

"Nature makes the rules for the scientific community." Well certainly, the user appears to have little control over what a tool such as ReadCube does, silently, when running.

I am certainly with Peter here in wondering quite where the ethos and practice of doing science is going.

Henry says it all - I've nothing to add: "walled garden", "no anonymity", "harvesting your metadata".

I'm very grateful to Henry for doing this. I don't investigate these systems myself (e.g. also Elsevier's TDM API) because I don't want to have signed my rights away or be accused of compromising the systems.

I've previously raised the possibility that Elsevier can use Mendeley to snoop on scientists and control them (by limiting the options they can follow). Macmillan could be even more all-embracing because they have a large share in Figshare, which everyone is now using to manage their data and Universities are buying into.  What actual guarantees are given?

Without public, independent, transparent auditing of these systems, how can I trust them? Even Elsevier has "bumpy roads" - how do we know Macmillan doesn't as well?

I look forward to enlightenment.

Nature's fauxpen access leaves me very sad and very angry.

Two days ago Nature/Macmillan (hereafter "Nature") announced a new form of "access" (or better, "barrier") to scientific scholarship - "SciShare". It's utterly unacceptable in several ways and Michael Eisen and Ross Mounce ("beggars-access") have castigated it; Glyn Moody gathers these and adds his own, always spot-on, analysis (http://www.computerworlduk.com/blogs/open-enterprise/open-access-3589444/).

Please read Glyn (and Michael and Ross). I won't repeat their analyses.  TL;DR Macmillan have unilaterally "made all articles free to view". The scholarly poor have to find a rich academic and beg them for a DRM'ed copy of a paper. This copy:

  • cannot be printed
  • cannot be downloaded
  • cannot be cut and pasted
  • CANNOT BE READ BY MACHINES (i.e. all semantics are destroyed)
  • Cannot be read on mobile devices (which are common in the Global South)

There are so many reasons why this is odious - here are some more beyond Michael, Ross and Glyn...

  • It announces, arrogantly, that Nature makes the rules for the scientific community. Publishers have a role (possibly) in the digital age in promoting communication, but Nature now has no role for me. It's now an analogue of Apple in telecoms - unanswerable to anyone (Macmillan is a private company).
  • Like Apple, Nature now intends to sell its "products" as a branded empire. (Recall that the authors write the papers and the reviewers review them, all for free; the taxpayer and the students pay Macmillan for the privilege of having this published.) Earlier Nature said it costs about 20,000 USD to publish a paper - it costs arXiv 7 USD, and Cameron Neylon's estimate is that it should be about 400 USD ± 50%. This huge figure is simply for branding - in the same way that people pay huge amounts for branded H2O+CO2. Private empires are very very bad for a just society.
  • It is deeply, wrenchingly divisive. Some of us are struggling to take science out to citizens - to create systems where there is joint ownership of the scientific knowledge of the planet - a planet which badly needs it. Nature has created an underclass who are expected to grovel to the academics, some of whom are arrogant enough already. It perpetuates the idea that science is only done in rich Western universities of the global North. This is not a mistake; Timo Hannay of Digital Science (part of Macmillan) wrote recently that there was too much science to publish and the way forward was to have an elite set of journals for the "best" science and an underclass (my word) for the rest. In fact Nature does not publish better science than other journals and actually has a higher retraction rate (Bjorn Brembs' work).
  • It is highly likely to create incompatible platform-empires of the sort we see in mobile phones. ReadCube is a Digital Science company. Will Elsevier or the ACS use it, handing over control of publication to the monopoly of a competitor? Increasingly those who control the means of communication influence the way we work, think and act. ReadCube destroys our freedom. So maybe we'll shortly return to the browser-wars: "this paper only viewable on ReadCube". If readers are brainwashed into compliance by technology restrictions our future is grim.
  • It destroys semantics. Simply, it says that if a sighted human can read it, that's good enough. This takes us back 30 years. It's been difficult enough to convince scientists that semantics are important - that we should publish our data and text so that machines can read and understand them. At a time when scientific output is increasing too fast for single humans to understand, we desperately need machines to help us. And Nature says - "get lost".
  • Of course there will be machines to "help" scientists (but not citizens) read science. They'll be controlled by Nature (through Digital Science) or Elsevier (through Mendeley and whoever else it buys up). Their machines will tell us how to think. Or cut us out completely...

So I'm very angry - angry to see corporates who don't care destroying the basis of modern scientific information. I'm used to being angry.

But in this case I'm also sad. Sad, because I used to work with Nature; because I respected Timo Hannay's vision. We had joint projects; they financed summer students and were industrial partners in an EPSRC project. I used to praise them. And I was honoured to be invited 3 times to SciFoo, run by Google, Timo and Tim O'Reilly.

And I was proud that two of my group went to work with Digital Science.

But now it's clear that Digital Science doesn't care about people - only about technology to control and generate income. Nature's New Technology Group used to produce experiments - Connotea, Urchin, ... - that were useful to the community, and they were good experiments.

But no longer. I'm considering whether Nature are in the same position as Elsevier - where we boycott them - refuse to review, to author. It's close, and the decision may depend on whether they take notice of the current criticism.

But what makes me even sadder is that Nigel Shadbolt - who I also know - has praised Macmillan's venture: "Sir @Nigel_Shadbolt endorses our #SciShare initiative... @npgnews @digitalsci" https://twitter.com/MacmillanSandE/status/539774902526824448/photo/1 . I can't cut and paste what Nigel has said because, like ReadCube, it's an image. It's non-semantic. It's useless for blind people. And Nigel has masses of gushing praise for how this advances scientific communication.

Which makes me very very sad.




Informa Health Care charges 54 USD for a 2-page Open Access article and 3 USD per page for photocopies

Informa have just published a 2-page article.


The author tells us it's Open Access but Informa charge 54 USD for 1 day's read.

That's right, 27 USD per page (and probably taxes).

How can ANYONE justify this?

Does this not make you very angry?

No doubt Informa will tell us it's "a bump on the road" (Elsevier), "a glitch" (Springer). It's a bump that means they take money they are not entitled to. In my view that's unacceptable trading.

Publishers have a duty to serve authors and readers. Simply saying "oh we made a mistake, please be sorry for us" is unacceptable.

The author is a member of Patients Like Me, a charity devoted to patients. He cares.

Informa doesn't seem to care about patients.

And that's shown by the HUGE charges for reproducing papers.

If you wanted to photocopy it, as a non-profit, and distribute 50 copies, it would cost you 3 USD per photocopied page. Try it on Rightslink. Charging huge amounts to non-profits and similar bodies to reproduce articles makes me very very angry.


Apparently this was an invited editorial and it was promised as some form of free access (almost certainly not CC-BY). Therefore the implied contract is informal and the publishers have retained the right to charge whatever they like. It's clear that they have no problems calculating a charge of 27 USD per page.

The point at issue is not the details but the total lack of concern by the publisher. The extortionate rates announce "We don't care" - to authors - to readers - more effectively than I can.


OpenCon2014 was the Best and Most Important Meeting of My Life; the Revolution is launched

I'm serious.

From start to finish this was a superb three-day meeting of young people who know that current scholarship/publishing/university_practice has so many injustices and so much waste that we cannot go on this way.

Must be brief - airport. Search for OpenCon2014 for reports and tweets and pictures. In brief:

  • 175 young passionate people from round the globe (only 5% of applicants could be selected for 80 scholarships)
  • sensational vision and planning from SPARC, RightToResearch, OA Button, etc.
  • Brilliant and inspiring speakers. A+ to everyone.
  • Many people already suffering under the present system.
  • Culminating in a superbly organized day of Advocacy actually on Capitol Hill, Washington. Huge insights into how to get political support and change.

I use the word "Revolution". It needn't be bloody. I'll write more...


Elsevier's French TDM licence conditions

There's a very useful blog post http://scoms.hypotheses.org/276 on Elsevier's content-mining conditions in France. I assume this is layered on the recent French five-year mega-deal with Elsevier. My school French isn't good enough for technico-legal terms, so I cannot comment authoritatively (and I'm not sure whether G**gl* translate is better).


This is only one clause and there are others that matter. It looks fairly similar to what Gemma Hersh and Chris Shillum pushed at us last month. Reading the rest of the blog post (which doesn't contain the whole contract, but only snippets - maybe that's all you are allowed to read) it appears:

  • That mining has to take place through the Elsevier API.
  • That non-API crawling/scraping of the website is forbidden.
  • That there are significant restrictions on the re-use of mined material.

Since I and others have highlighted the unacceptability of Elsevier contracts (they change every month, they have unacceptable restrictions (must not disadvantage Elsevier business), are internally inconsistent and unclear) I hope very much that the French authorities signing this were aware of all the problems.

I'd be grateful for an expert view on what is contained and it would be very useful to have a reasonably precise legal translation....

If anyone or any country is about to sign an agreement with any publisher that contains any mention of mining crawling spidering extraction APIs then:




Because otherwise you are likely to betray the trust of 5 years of researchers.



from Fric_Adèle.

Pardon my French, but yours is indeed not subtle enough ;-)
My pleasure to help.

You missed the most important word of the document: “notamment”, meaning “notably” / “in particular”.
It means that the API is one of the ways to perform TDM, not the only way.
Fairly interesting, isn’t it?

Moreover, in the next section about forbidden uses you can read :
“A l’exception de ce qui est expressément prévu dans le présent Contrat ou autorisé par écrit par Elsevier, l’Abonné et ses Utilisateurs Autorisés ne peuvent pas :
- utiliser des robots ou programmes de téléchargement automatisé destinés à, de façon continue et automatique, extraire l’intégralité ou une partie substantielle des Produits Souscrits (sauf exception autorisée pour le TDM) ou destinés à perturber le fonctionnement des Produits Souscrits”
[Translation: Except as expressly provided in this Contract or authorized in writing by Elsevier, the Subscriber and its Authorized Users may not: use robots or automated download programs intended, continuously and automatically, to extract all or a substantial part of the Subscribed Products (except as permitted for TDM) or intended to disrupt the operation of the Subscribed Products.]
So, another tweak in Elsevier practices : crawling/scraping is forbidden in general EXCEPT for TDM purposes.
TDM is considered as an exception to what is generally forbidden.

To be more precise : this document is the license between Elsevier and each institution. It is an appendix of the general contract, in which the TDM is allowed as follows – you can again read the word “notamment” :
“6.6 Data et text mining
Tous les contenus accessibles et souscrits sur la plateforme du Titulaire dans le cadre de cet accord seront utilisables à des fins de data et text mining notamment via une interrogation des données par une API connectée à la plateforme ScienceDirect®, conformément aux stipulations du Contrat de Licence.”

which you can translate by :
“All content accessible and subscribed to through the agreement can be used for TDM purposes, notably via an API connected to the ScienceDirect platform, in compliance with the Licence Agreement.”

All the best,



Thank you so much.

This reinforces the idea that Elsevier contracts change with the phases of the moon... Now any authorised user can carry out TDM either with Elsevier's API or without it. And that robots can only be used for TDM.

TDM is the common phrase in the UK for "data analytics" (Hargreaves legislation). I can't think of many reasons for using robots that wouldn't be classified as data analytics. Indexing, classification, usage - these are all data analytics in my understanding. Most non-mining activities would relate to storage and transformation, and here the restrictions come from copyright and agreements, not whether robots are used to collect the material.

But maybe I'll be enlightened?



Institute of Physics (IOP) Charges for Open

ContentMine has a new collaborator - an astrophysicist. We're going to work together on mining data from journals. Their two leading journals include "The Astrophysical Journal" (TAJ), published by the Institute of Physics (IOP). Now s/he's one of the #scholarlypoor - left academia, but still fiercely interested in their discipline. Now TAJ is not Open, but I am allowed to read it. I always practise being one of the scholarlypoor, so I went to the website and saw


WOW! an "Open Issue" - just what s/he needs to use to develop a scraper. So I have a look...


I understand the words, so I am competent to do ContentMining on this - I'm an astrophysical miner... Let's look at a paper...


WHAT!??? Open = Pay us SIX QUID (and that's just rental)

I give publishers the "cock-up" benefit before the "conspiracy". I can hypothesize that "Open" was a typo for "Current", or that Openness is a new concept for IOP, who haven't switched the links and so describe this as a "Bumpy Road" (a phrase coined by Elsevier). Or... let's hear from IOP.



Update for last month: Shuttleworth, BL, MySociety and more

I have been silent for the last month because I have been very busy. I hope to blog more in the next few days.

The main activities have been

  • The twice-yearly Gathering of Shuttleworth fellows (this time in Malta). My first time (March) was great, but this was fantastic. I have so much in common with so many of the past and current Fellows. I've a huge todo list to see how we can work together.
  • An application to the Shuttleworth Foundation for re-funding for a further year. I'll take you through this in detail in the next posts.
  • Running ContentMining workshops (at EBI on 2014-10-06) and (Jenny Molloy and Puneet Kishor) in Delhi last weekend (2014-11-02). We have now ironed out most of the problems and feel confident about delivering a range of workshops.
  • Preparing for the launch of our mining activities (RRSN, promise!). Will blog in next day or two
  • Preparing for trip to US and doing workshops in Chicago, Washington (OpenCon) and visiting Penn State.
  • Attending Open Access Button launch 2014-10-21. OA_Button is massive because at least there is some true energy and anger. I gave a 30 sec tribute where I said it was massive and that we hadn't seen anything yet. I think OA_Button will change the world. Young people are sick of the broken values of academia. (I shall expand on this at OpenCon).
  • Cobi Smith (who is currently with Francois Grey in CERN) gave an excellent talk at Open Research Cambridge in the Panton Arms on crowdsourcing/crafting for disasters.
  • Met with all the Cambridge University Librarians to say farewell to Peter Morgan, retiring, who has helped me get off the ground in Library and Informatics research. Peter's been a huge contributor to new ideas and practices and we'll miss him. Good opportunity to talk with the library and know they are supportive of my ContentMining research.
  • Interesting meeting on Big Data in science in Cambridge. Great talk by Florian Markowetz (Cancer Genomics) debunking the hype (Big Data is sliding down the curve into trough of disillusion). He demolished the "lets throw all our data into Machine Learning and wonderful things will automatically emerge" syndrome. I completely agree - Machine Learning has its place - e.g. OCR - but leads to no understanding and no transferability. I am largely renouncing it.
  • British Library Labs on "Big Data" in Arts and Humanities. Massively wonderful - wonderful speakers and projects. The BL runs its labs on a shoestring (mainly Mahendra and Ben) and IMO is a world leader in innovative use of library resources. I'm certainly planning to see if they'll host a workshop on ContentMining.
  • MySociety - ran a great meeting yesterday evening in Cambridge and asked me to present TheContentMine. I had some help - see http://instagram.com/p/u_NrqxIFTz/

More later...


ContentMine at WOSP2014: Text and Data Mining: III What Elsevier's Chris Shillum thinks we can do; Responsible Mining

My last post (http://blogs.ch.cam.ac.uk/pmr/2014/09/15/wosp2014-text-and-data-mining-ii-elseviers-presentation-gemma-hersh/ ) described the unacceptable and intransigent attitude of Elsevier's Gemma Hersh at WOSP2014 Text and Datamining of Scientific Documents.

But there was another face to Elsevier at the meeting, Chris Shillum (http://www.slideshare.net/cshillum ). Charles Oppenheim and I talked with him throughout lunch, and later Richard Smith-Unna talked with him as well. None of the following is on public record but it's a reasonable approximation to what he told us.

Firstly he disagrees with Gemma Hersh that we HAVE to use Elsevier's API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier's content through the web pages that we have the right to read, and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their system, but that's all we have to worry about.

I *think* we can move on from that. If so, thanks, Chris. And if so, it's a model for other publishers.

We told him what we were planning to do - read every paper as it is published and extract the facts.

CS: Elsevier publishes ca. 1000 papers a day.

PMR : that's one per minute;  that won't break your servers

CS: But it's not how humans behave....

I think this means that if we appear to their servers as just another human then they don't have a load problem. (For the record I sometimes download manually in sequence as many Elsevier papers as I can to (a) check the licence or (b) see whether the figures contain chemistry, or sequences. Both of these are legitimate human activities, for subscribers or non-subscribers. For (a) I can manage 3 per minute, 200/hour - I have to scroll to the end of the paper because that's often where the "all rights reserved, © Elsevier" is. For (b) it's about 40/hour if there are an average of 5 diagrams (I can tell within 500 milliseconds whether a diagram has chemistry, sequences, phylogenetic trees, dose-response curves...). [Note: I don't do this for fun - it isn't - but because I am fighting for our digital rights.]

But it does get boring and error-prone which is why we use machines.
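When machines take over that tedium, the "behave like a human" pacing can be made explicit. Here is a minimal sketch of such a throttle - the fetch function and URLs are hypothetical placeholders, and the one-request-per-minute default is just the pace discussed above, not anything any publisher has endorsed:

```python
import time

def polite_fetch(urls, fetch, min_interval=60.0):
    """Call fetch(url) for each URL, but never faster than one request per
    min_interval seconds - roughly the pace of a careful human reader."""
    results = []
    last = 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # hold back until the interval has elapsed
        last = time.monotonic()
        results.append(fetch(url))
    return results
```

Throttling like this is one way for a miner to honour the duty of care to the technical integrity of a publisher's servers without signing any extra terms and conditions.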

The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).

If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata, or because I am critiquing the text. Here's an example (http://www.lablit.com/article/11) :

In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737-738 (1953)):

"It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."

Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement ...

Lablit can reasonably assume that this is fair comment and quote C+W (Copyright who? they didn't have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under "fair use", but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.

So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, were larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like (Abstract, http://www.mdpi.com/2218-1989/2/1/39 ) showing ContentMine markup:

"...Generally the secondary metabolite capability of <a href="http://en.wikipedia.org/wiki/Aspergillus_oryzae"> A.oryzae</a> presents several novel end products likely to result from the domestication process..."

That's 136 characters without the Named Entity "<a.../a>"
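As a sketch of that arithmetic, the context count with the whole marked-up entity excluded might be computed like this (the exact counting convention - dropping the entire `<a ...>...</a>` element - is my assumption for illustration, not Elsevier's specification):

```python
import re

def context_length(snippet):
    """Length of a quoted snippet with the marked-up named entity
    (the whole <a ...>...</a> element) excluded from the count."""
    return len(re.sub(r"<a\b[^>]*>.*?</a>", "", snippet))
```

Running this over the markup above would tell a miner whether a quotation stays within the 200-character surrounding-context allowance.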

Which means that we need guidelines for Responsible Content Mining.

That's what JISC have asked Jenny Molloy and me to do (and we've now invited Charles Oppenheim to be a third author). It's not easy as there are few agreed current practices. It's possible for us to summarise what people currently do, what has been challenged by rights holders and then for us to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.

So I'd start with:

  1. The right to read is the right to mine.
  2. Researchers should take reasonable care not to violate the integrity of publishers' servers.
  3. Copyright law applies to all parts of the process, although it is frequently unclear.
  4. Researchers should take reasonable care not to violate publishers' copyright.
  5. Facts are uncopyrightable.
  6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).


CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that's why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers and that there are checks to counter irresponsible crawling.

Unfortunately (4) cannot be solved by discussions with publishers as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation and it's not asking for permission. It's allowing a reasonable publisher to state any reasonable concerns.

And I hope we can see that as a reasonable way forward.









WOSP2014: Text and Data Mining: II Elsevier's Presentation (Gemma Hersh)

WOSP2014 - http://core-project.kmi.open.ac.uk/dl2014/ - is a scholarly, peer-reviewed workshop. It consists of submitted, peer-reviewed talks and demos and invited talks from well-known people in the field (Lee Giles, Birger Larsen). At ContentMine we submitted three papers/demos which were peer-reviewed and accepted (and which I'll blog later).

But there was also one Presentation which was, as I understand, neither invited nor peer-reviewed.

"Elsevier's Text and Data Mining Policy" by Gemma Hersh

It is usually inappropriate for a manufacturer to present at a scholarly conference where the audience are effectively customers. It ends up as a product pitch which is offensive to attendees who have paid to attend, and offensive to those who have submitted papers which were rejected while product pitches are allowed.

This was one of the most unacceptable presentations I have ever seen at a scholarly workshop and I said so.

Before I add my own comments I simply record the facts. Professor Charles Oppenheim agrees that this is a factual record.

GH = Gemma Hersh (Elsevier)
CO = Prof Charles Oppenheim
GH arrived 10 mins before her presentation and left immediately afterwards. She did not stop to talk. She later tweeted that she had a meeting.
GH Presentation

(1) Elsevier's presentation was neither an invited talk nor a peer-reviewed submission but appeared to be the result of pressuring the organizers.

(2) it was a manufacturer-specific product pitch not a scholarly presentation

(3) It made no attempt to be balanced but presented only Elsevier's product. In particular:

  * no mention was made of Hargreaves

  * no mention that it had been rejected by LIBER and associates

  * no mention that the library community had walked out of Licences for Europe

(4) Elsevier said that their studies showed researchers preferred APIs. No mention was made that researchers had to sign an additional agreement.

Public Discussion (no record, but ca 20 witnesses)

PMR challenged the presentation on the basis of bias and inaccuracy.

GH criticized PMR for being aggressive. She stated that it was the libraries' fault that L4E had broken down.

PMR asked GH to confirm that if he had the right to read Elsevier material he could mine it without using Elsevier's API

GH replied that he couldn't.

CO told GH that PMR had a legal right to do so

GH said he didn't and that CO was wrong

Discussion continued with no resolution. GH showed no intention of listening to PMR and CO

Later tweets from GH
@petermurrayrust check the explanatory notes that accompany the legislation.
@petermurrayrust would prefer a constructive chat rather than an attack though....
@petermurrayrust very happy to keep talking but I've just come straight from one appointment and now have another.


PMR: I believe that any neutral observer would agree that this was roughly factually correct.


PMR: now my comments...

It is completely unacceptable for a product manager to push their way into a scholarly workshop, arrive 10 minutes before their presentation, give a product pitch and leave immediately without deigning to talk to anyone.

The pitch itself was utterly one-sided, presenting Elsevier as the text-miner's friend and failing to give a balanced view of the last several years. Those of us in the UK Hargreaves process and Licences4Europe know that STM publishers in general and Elsevier in particular have thrown money and people in trying to control the mining effort through licences. To give a blatantly biased presentation at a scholarly meeting rules them out as trustable partners.

Worse, the product pitch was false. I called her on this - I was forthright - and asked whether I could mine without Elsevier's permission. She categorically denied this. When challenged by Professor Oppenheim she told him curtly he was wrong and Elsevier could do what they liked.

The law explicitly states that publishers cannot use terms and conditions or other contractual processes to override the right to mine for non-commercial research processes.

So it's a question of who do you believe:

Gemma Hersh, Elsevier or Professor Charles Oppenheim, Loughborough, Northampton, City?

(and PMR, Nottingham and Cambridge Universities)

If GH is right, then the law is pointless.

But she isn't and it isn't.

It gets worse. In later discussions with Chris Shillum, who takes a more constructive view, he made it clear that we had the right to mine without Elsevier's permission as long as we didn't sign their terms and conditions. The discussion - which I'll cover in the next post - was useful.

He also said that Elsevier had changed their TaC several times since January, much of this as a result of my challenging them. This means:

  1. Elsevier themselves do not agree on the interpretation of the law
  2. Elsevier's terms and conditions are so mutable and frequently changed that they cannot be regarded as having any force.




ContentMine at WOSP2014: Text and Data Mining; I. General Impressions

On Friday 2014-09-12 4 of us from The ContentMine presented 3 papers at WOSP2014 (http://core-project.kmi.open.ac.uk/dl2014/). The meeting was well run by Petr Knoth and colleagues from the Open University and CORE (the JISC- and funder-supported project for aggregation of repositories). The meeting gave a useful overview of TextAndDataMining (TDM). From the program:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text-and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

This was an important meeting in several ways and I want to comment on:

  1. the general state of Content-mining
  2. Our own presentations
  3. Elsevier's uninvited presentation on "Elsevier's Text and Data Mining Policy". I'll split this into two parts in two separate posts.

PMR's impressions

Content mining and extraction has a long history, so it's now an incremental rather than a revolutionary technology. Much of the basis is "machine learning", where statistical measures are used for classification and identification (essentially how spam detectors work). So several papers dealt with classification, based on the words in the main part ("full text") of the document. [NOTE: full-text is far better than abstracts for classifying documents. Always demand full-text.]
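To make "classification based on the words in the full text" concrete, here is a minimal bag-of-words classifier with add-one smoothing - a toy sketch of the general statistical approach (and of how spam detectors work), not any particular paper's method; the labels and tokens are invented:

```python
from collections import Counter
import math

def train(docs_by_label):
    """docs_by_label: {label: [token lists]}. Store per-label word counts."""
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(tok for doc in docs for tok in doc)
        model[label] = (counts, sum(counts.values()), len(counts))
    return model

def classify(model, doc):
    """Pick the label whose smoothed word distribution best fits the document."""
    def score(label):
        counts, total, vocab = model[label]
        # add-one (Laplace) smoothing so unseen words don't zero the score
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in doc)
    return max(model, key=score)
```

Trained on full text rather than abstracts, such a classifier simply has far more words to discriminate with - which is why full-text matters for this kind of work.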

Then there's the problem of analysing semistructured information. A PDF is weakly structured - a machine doesn't know what order the characters are, whether there are words, etc. Tables are particularly difficult and there were two papers on this.

And then aggregation, repositories, crawling etc. The most important takeaway for me is CORE (http://core-project.kmi.open.ac.uk/ ), which aggregates several hundred (UK and other) repositories. This is necessary because UK repositories are an uncontrolled mess. Each university does its own thing, has a different philosophy, uses different indexing and access standards. The universities can't decide whether the repo is for authors' benefit, readers' benefit, the university's benefit, HEFCE's benefit, or whether they "just have to have one because everyone else does". (By contrast the French have HAL http://en.wikipedia.org/wiki/Hyper_Articles_en_Ligne ). So UK repositories remain uncrawlable and unindexed until CORE (and even then there is much internal inconsistency and uncertain philosophy).

It's important to have metrics because otherwise we don't know whether something works. But there was too much emphasis on metrics (often to 4 (insignificant) figures). One paper reported 0.38% recall with strict method and 94% with a more sloppy one. Is this really useful?
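For readers who want the definitions behind numbers like those, precision and recall are just set ratios over retrieved vs. relevant items - a minimal sketch (the example sets are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved items that are relevant;
    recall = fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

Reporting such ratios to four figures on small test sets is exactly the kind of spurious precision complained about above.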

But virtually no one (I'll omit the keynotes) gave any indication of whether they were doing something useful to others outside their group. I talked to 2-3 groups - why were they working on (sentiment analysis | table extraction | classification)? Did they have users? Was their code available? Was anyone else using their code? Take-up seems very small. Couple this with the fact that many projects have a 2-3 year lifespan and that the basis is competition rather than collaboration, and we see endless reinvention. (I've done table extraction but I'd much rather someone else did it, so we are working with tabulaPDF.) The output in academia is a publication, not a running reusable chunk of code.

So it's not surprising that there isn't much public acknowledgement of TDM. The tools are tied up in a myriad of university labs, often without code or continuity.

One shining exception is Lee Giles, whose group has built CiteSeer. Lee and I have known each other for many years and we worked together on a MicrosoftResearch project, OREChem. So when we got talking we found we had two bits of the jigsaw.

Readers of this blog will know that Ross Mounce and I are analysing diagrams. To do that we have to identify the diagrams, and this is best done from the captions (captions are the most important part of a scientific document for understanding what it's about). And we are hacking the contents of the images. So these two fit together perfectly. His colleague Sagnik Ray Choudhury is working on extracting and classifying the images; that saves us huge time and effort in knowing what to process. I'm therefore planning to visit later this year.

For me that was probably the most important positive outcome of the meeting.

The next post will deal with Elsevier's Gemma Hersh who gave an uninvited "Presentation", and the one after with Elsevier's Chris Shillum's comment on Gemma's views and also on Elsevier's TaC.