Text and Data Mining - fighting for our Digital Future ("Peter Murray-Rust is the problem")

Last week there was an important meeting run by LIBER, the association of Research Libraries in Europe (http://www.libereurope.eu/news/the-perfect-swell-a-workshop-on-text-and-data-mining-for-data-driven-innovation). To be quite clear, the meeting was held because (legacy, scholarly) publishers are spending large amounts of time, effort and money to stop people doing Text and Data Mining unless it is controlled by the publishers. We have tried to work towards a resolvable position, but have been forced to walk out of talks in Brussels because the industry wants to license our activities.

We have asserted that "the Right to Read is the Right to Mine". I'm very happy that I came up with this phrase as it expresses precisely what we believe: that there is no difference between a human's access to a document and a machine's access. Another useful phrase (John McNaught's, not mine) was that "Text and Data Mining saves Lives".

Licensing destroys Text and Data Mining. It's designed to do precisely that. Imagine as few as 1000 researchers negotiating licences with 1000 publishers. That's 1 million licence negotiations. The legacy publishers have shown themselves to be regressive in their use of licences, and in many cases incompetent. Do we really believe that they have the interests of researchers at heart?

So licences lead to the conclusion "NOT(Text and Data Mining saves Lives)" (I daren't write the obvious phrase as I'll be attacked for being emotional).

There'll be a full summary of the meeting shortly, I gather. I've put my tweets below, with some comments. I'm going to comment on a few, but one in particular. The publishers create a huge amount of FUD around TDM. It's a terrible threat to them. (It isn't actually and any realistic publisher would welcome it - but the last year has shown that the legacy publishers are now the determined antagonists of change - they have done nothing other than fight against anything new - and they are spending millions doing it).

@CameronNeylon  addressed the FUD [my tweets]:

Myth1: "researchers don't want TDM". This is rubbish. The public activity is smallish because:

  • All but the most determined get stopped by their publisher or by fear of being prosecuted or getting take-downs. I have spent 3 years trying to get any agreement out of Elsevier - the best was "You must discuss your proposed work with Elsevier, you must use our gateway and your results may belong to Elsevier". Not surprising that this deters people.
  • There are few communal resources (because people are deterred). There are few corpora (because publishers forbid their creation), and few tools because there isn't much to work on [that will change].

Myth2: "TDM will crush our servers". This is so hilarious it's worth getting Cameron to give you a replay of his talk. It amounts to about 1 millionth of their traffic. If they can't accommodate that they shouldn't be publishing. Textmining on PLoS is a TRIVIAL problem. Crawl-delay 30 seconds. TDMers are well behaved and obey this!
Myth3: "this ain't core business" (i.e. why should publishers spend time talking with me). "Publishing is not PLoS' USP. Nor filtering. It's annotation. PLoS is about dissemination."
Myth4: "If we allow TDM Peter Murray-Rust will distribute all our work". More than one publisher has said this to Cameron. Wow! I'm famous! Cameron: "There's something wrong if publishers are frightened of a retired academic with a laptop".

To be quite clear. I want to carry out TDM in a way that will create minimal technical problems. I am happy with a 30 second delay - this gives me time to process everything before the next request. I'm happy to acknowledge where the material was published. This is responsible science. If I was a pirate wanting to sell content to China I'd have managed it already without the publishers even knowing. I don't want to break the law.
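
What a 30 second delay means in practice can be sketched in a few lines of Python. This is only an illustration, not my actual TDM software: the name `polite_fetch` and the injected `fetch`, `sleep` and `clock` parameters are assumptions made so that the sketch is self-contained and can be tried without touching anyone's server.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

# The 30 second delay discussed above; what a given site actually asks for is an assumption.
CRAWL_DELAY_SECONDS = 30.0


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = CRAWL_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) pairs, never starting two requests less than `delay` seconds apart.

    `fetch` is a hypothetical download function supplied by the caller.
    """
    last_start = None
    for url in urls:
        if last_start is not None:
            # Time spent downloading and processing the previous document counts towards the delay.
            remaining = delay - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield url, fetch(url)
```

Because the function is a generator, whatever processing the caller does between documents is subtracted from the wait, which is exactly why the delay costs the miner very little.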

My right - and it may have to be confirmed in court - is to use machines to mine the content I have legitimate access to. Anything less than this is irresponsible. I could actually add value to the publishers if they were reasonable. Here are some examples:

  • detecting errors in content (do publishers want high quality?)
  • indexing their material for scientific search engines (do they want people to find their material?)
  • finding ways of making their material easier to read and use (do they want that?)

Anyway I'm going to be starting TDM in earnest now that our BBSRC project has started at Bath. It will be massive and I'll blog this very frequently. If you are a publisher you should be very afraid very excited.


Here are some unedited tweets ... Chris Yiu and John Boswell gave a chilling example of why the US has 82% of the new digital industries. The US welcomes it. Europe wrings its hands and destroys it with uncertainty. Lucie Guibault gave a good overview of the law, effectively confirming that in Europe there is no certainty and no support for the brave entrepreneur. Nilu Satharasinghe is setting up a startup in Cambridge UK, but if it hits legal difficulties he will immediately relocate to the US (same story). Caroline Dynes (Royal Society) stated that the RS will allow TDM without licence. Wow! Perhaps other publishers will follow.

Chris Yiu Digital Policy Unit, Policy Exchange UK
Sharing data saves time paper money
data opportunities: realtime opportunities; supermarkets know instantly every tin on shelves
personalization opportunities better user experience;
solving problems we couldn't solve before; data quantities can be handled, ebay uses semantics to stop gaming (w computervision)
can often make good predictions about future -> policymakers.
huge opportunity for new industries in Europe.
healthcare data analytics analysing use of statins in UK. prescribing guidelines + GP prescribing. would save 200 M GBP/year
future of census UK (a) use internet or (b) fuse together smaller surveys.
with right investments Willetts UK is well placed for big data revolution
Willetts big data is one of 8 areas for Britain to invest in
UK has more digital businesses than they thought. (most active in Cambridge to Bournemouth corridor - none in North Eng or Sco)
82% of world's digital businesses are US and 43% are Cal. 25 NY, [4% in UK ]
we need: skills, ambition (changing the world), finance, mentors, agility, creativity. [great list]
lack of certainty about copyright is major drawback.
what we need is certainty. [PMR absolutely - one of the key questions].
is fair use an exception to copyright or a fundamental user right? [PMR absolutely key]

now up John Boswell SAS institute  (global industry perspective) . Most data is unstructured
will give examples and outline why changes to copyright are misguided
UN look at social network data from IE and US sentiment analysis. words like "mad" predict unemployment
if u can mine data and predict unemployment can you do something. Tax rev drops and expenditure rises
mining soc security US 15% benefit claims are granted after denial. Gov could predict which so save time money add benefit
US FDA analyse records from doctors for marketed drugs post-release of drugs
Text-mining is NOT copying expression of author, you are analysing
if analyse millions of documents you are not copying single expressions (JB simple example of word frequency)
JB advises us that TDM should not be regarded as violation of copyright. We are debating at wrong point in continuum
in purest form TDM does not violate copyright - i.e. we should not have asked for extension. Are we making it worse? [PMR agree]
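
[PMR: JB's word-frequency point is easy to make concrete. The sketch below is my illustration under stated assumptions, not JB's code: the function name `word_frequencies` and the crude tokenising regex are made up for the example. Note that what comes out is a table of counts, with none of any author's expression left in it.]

```python
import re
from collections import Counter
from typing import Iterable


def word_frequencies(documents: Iterable[str]) -> Counter:
    """Count how often each word occurs across a whole collection of documents."""
    counts: Counter = Counter()
    for text in documents:
        # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


corpus = ["The drug was safe.", "The drug was not safe in the trial."]
frequencies = word_frequencies(corpus)
```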

Staffan Truve. Recorded Future [company]
RF analysing time on the web e.g. "this week" is grounded
250,000 realtime sources 8 languages 10 Billion facts 25 entity types
RF predicted unrest in egypt initially didn't know how long but later predicted continued unrest
without textmining the web is useless publishers know this. Analysis evolution, new value, aggregate
threats: vertical silos; deep web+darkweb; ip protectionism
no borderline between reading and analysing, cannot differentiate humans from machines, robots must have same rights as humans

@CameronNeylon ...
Jean-Fred Fontaine MDC Berlin biomedical literature PubMed has 18 M , 50% have visible literature 0.2 M full texts are TDMable
J-FF showing how we navigate biomedical literature. Gene disease associations
showing entity recog - genes chemicals cooccurrence shows interactions
[PMR my software (#ami2, svg2xml) to be released next week - watch blogs.ch.cam.ac.uk/pmr/ . 10 laptops can mine whole literature]
JFF showing creation of networks to add to data thereby enriching data
JFF shows that TDM is as accurate as a human (but needs FULL-text)
JFF example of disease correlations from Elec Patient Records
uses side effects on drug labels (packet) to deduce targets
JFF whole literature will fit in 250GB. Need full text, figure
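
[PMR: the co-occurrence idea JFF showed can be sketched very simply. This is a toy under loud assumptions, not JFF's method: the entity names are invented placeholders, the "entity recognition" is just matching against a given list, and `cooccurrence_counts` is a name I made up for the example.]

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable


def cooccurrence_counts(sentences: Iterable[str], entities: Iterable[str]) -> Counter:
    """Count how often each pair of known entities is mentioned in the same sentence."""
    wanted = {entity.lower() for entity in entities}
    pairs: Counter = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
        # Sort so each pair is always stored in the same order.
        present = sorted(tokens & wanted)
        pairs.update(combinations(present, 2))
    return pairs


# Placeholder sentences and entity names, invented purely for illustration.
toy_sentences = [
    "geneA is associated with diseaseY.",
    "drugX targets geneA.",
    "drugX and geneA appear together again.",
]
toy_pairs = cooccurrence_counts(toy_sentences, ["geneA", "drugX", "diseaseY"])
```

Pairs that turn up together far more often than chance are the candidates for an interaction, and this is why snippets or abstracts are not enough: the counts only mean something over full text.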

next up Dieter Van Uytvanck CLARIN
DVU Research Infrastructure perspective
CLARIN provides access to digital language data (mainly humanities and socsci) and tools
DVU many channels, text, sign language , gestures, neuroimaging,fMRI
CLARIN is global (e.g. Amazon, Bali), Time , rockcarvings, smartphones, experiments contrasted with "in the wild"
examples of data mining in CLARIN. Must have access to whole work. Snippets are not enough
CLARIN replicating experiments is utterly important. Licences do not scale to 500,000 texts collected from websites
CLARIN shows that times of wars generate lots of new words
language analysis with phylogenetic trees for  evolution @rmounce
research infrastr: longterm preserv, citable, federated login, web frontends , know building and support
categories: Public (e.g. gov) Academic (spoken Dutch) Restricted (doctor patient conversations)
recomm: CC licences, older material as free as can be negotiated; probs w. personal data and ethical (ask ppl if OK with TDM)

Lucie Guibault Legal aspects of text and database: will cover Copyright, Sui generis database right, IPR+TDM
LG Copyright Compilations protected and individual works. Facts, ideas are not protected
compilations protected if selection and arrangements must be author's own compilation
LG (throwaway rmk) "life of copyright is far too long"
LG unlikely that TDM will fall under allowed Educational and Research use
LG now telling us of sui generis data base rights (only applies to Europe). [PMR I am now getting really depressed.]
PMR v. depressed : Every talk I've heard on copyright and other legal things effectively says "you aren't allowed to do anything".
"TDM not allowed without authorisation from rights holder". Question, should it?
PMR Why has no one even mentioned Hargreaves at this meeting??? I came for some answers - not getting any

panel starting 5 * 5minutes Natalia Manola Athens OpenAIRE, using TDM to cluster documents in repo. how funding relates 2 outcomes
OpenAIRE makes agreements with publishers to do TDM
need to work with policyholders at all levels
NM can Liber bring everyone together to make better policy and education about rights?
NM funders have power
NM have to show that many services come from output of mining e.g. Google (and OpenAIRE - I wasn't sure)
NM too many barriers in Europe for TDM

Nilu Satharasinghe startup in Cambridge
NS reroutes links to publications on basis of content
NS worried that additional restrictions make it difficult for him to conduct business. Gives examples of mashups (e.g. films)

John McNaught Manchester NaCTeM which supports researchers. must be able to share with many types of researchers. Only Full-text
JMcN text-mining saves lives. [PMR great to hear this]
JMcN highly limited at moment. Funders (e.g. EC) want access to research but what is gov (EC) doing to free up info?

Caroline Dynes (Royal Soc). Opening up data sets. Funder, but also publisher. Pays authors APCs (no details).
CD mentions Hargreaves (first mention in this meeting).

Ellen Broad (IFLA). Most TDM is presumptively illegal without legal exception.
Ellen Broad raises good questions and shows that current position is untenable.
EB must be Certainty. Cannot be openended. Currently no defence against infringement. Investors will forsake EU vs US

JMcN showing how crazy the licensing to universities is
CD announces that Royal Soc agrees "the right to read is the right to mine". PMR I'll start tomorrow!
JMcN if Europe wants evidence-based material they should look to TDM and so this must be made legal!
"if there is no european precedent then it isn't illegal" Hey! let's get going.
@OKFN very pleased to see that "The Right to Read is the Right to mine"

Kurt Deketelaere at "we must challenge the law" . PMR absolutely agree. That's how bad laws are obsoleted
Google (?Simon Morrison) you need to create a more flexible approach in Europe ("fair use")
Paul Ayris this is a major issue for universities. Database directives must be revised.
PA TDM should be parts of pilots in Horizon2020 has to test limits of current arrangements

