How ContentMine can help you! Our example looks for "tummy bug" for Natalie

Yesterday Tom, Natalie and I had coffee together. Natalie’s a vet student at the Royal Veterinary College, and we got talking about her project: 8 weeks doing practical research on Isospora. I’d never heard of it. No idea what it is.

But ContentMine will know, so we’ll ask it…

We’ll be showing you in later posts how it all works, but just accept that we type:

getpapers -q isospora -x

Wait a minute for ca. 207 open access papers to be downloaded, and then

cmine isospora

And wait another minute for ami to crunch through the data. Ami has already created summary files and we’ll look at full.dataTables.html, which gives an overall view of all the “plugins” we have used (species, genes, words, etc.). Here are the first few papers:

[Screenshot: the first few rows of full.dataTables.html]

No need to squint: we’ll describe them in more detail. (Note: some of the links are broken and there are a few false positives; both are being cleaned up.)

The first column of the results gives links to the papers (PMC2758902 is a PubMedCentral id and clicking it will link to the EuropePubMedCentral repository of full text papers). Yes, YOU can read them. 200 free papers. If you are interested in Isospora, they are all yours! So here’s the first paper of the 200:

PMC2758902 local

We still don’t know what Isospora is, so let’s click on Isospora belli. It’s linked to Wikipedia, which says:

Cystoisospora belli, previously known as Isospora belli, is a parasite that causes an intestinal disease known as cystoisosporiasis.[1] This protozoan parasite is opportunistic in immune suppressed human hosts.[2] It primarily exists in the epithelial cells of the small intestine, and develops in the cell cytoplasm.[2] The distribution of this coccidian parasite is cosmopolitan, but is mainly found in tropical and subtropical areas of the world such as the Caribbean, Central and S. America, India, Africa, & S.E. Asia. In the U.S., it is usually associated with HIV infection and institutional living.[3]

So, to paraphrase,

“Isospora is the old name of a nasty tummy bug, found mainly, but not exclusively, in the sub/tropical world that can infect HIV-sufferers”

Biological science is often hard to read for newcomers, but with practice you learn how to translate. Here’s a sentence from one paper:

Coprological examination of fresh stool specimens revealed coccidian oocysts of the genus Isospora in 36% of the birds

Translated:

We examined birdshit and found parasite eggs in 36%.

The long words are useful - they aren’t there just to put you off or be pompous. They help translate between human languages, and they increase precision. If we search for “parasite eggs in birds” we might end up with bird eggs, whereas “oocysts” is more precise. ContentMine loves precise words because they reduce false positives (results that aren’t relevant to what you want).
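To make the point about precise words concrete, here is a minimal sketch in Python. It is not how ContentMine searches: apart from the sentence quoted above, the example sentences are made up, and the function name `matching` is just an illustrative choice.

```python
import re

# The first sentence is the one quoted above; the other three are invented
# purely for illustration.
SENTENCES = [
    "Coprological examination of fresh stool specimens revealed coccidian "
    "oocysts of the genus Isospora in 36% of the birds",
    "The birds laid between three and five eggs per clutch.",
    "Parasite eggs were counted in faecal samples from the birds.",
    "Oocysts were shed for several days after infection.",
]


def matching(term, sentences):
    """Return the sentences containing `term` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


for term in ("eggs", "oocysts"):
    print(term)
    for sentence in matching(term, SENTENCES):
        print("   ", sentence)
```

In this toy corpus the everyday word "eggs" also picks up a sentence about real bird eggs (a false positive), while the technical term "oocysts" only finds sentences about the parasite.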

Column “words” is a list of the commonest word tokens. In this case it’s just “patients”. That confirms that the paper is probably about human infection (though Natalie and other Vets call animals “patients”). So were we right? Click on PMC2758902 and we’ll see:

[Screenshot: the opening of paper PMC2758902]

So it’s about HIV, and drug treatment. Where’s the Isospora? Search down the full text and we find:

The reasons for hospitalization were: disseminated tuberculosis (month 5), reactivation of oropharyngeal Kaposi's sarcoma (month 3), and Isospora belli diarrhea with severe dehydration

So if you are interested in finding all papers where Isospora has infected HIV patients, ContentMine can immediately help you.

Natalie’s main interest is veterinary, so we’ll look at the next few papers. But that shows how much there is in just ONE paper. And why we need machines to help us. Natalie probably mainly wants papers about animals and we can address that as well…

… in the next blog post!

 

Posted in Uncategorized | 1 Comment

TDM at European Parliament - tweet-like report

Great meeting at Brussels EP yesterday. Would have liked to tweet but didn't have the password. There *were* tweets by the MEPs. So I wrote my notes like the tweets I would have made. May be useful to some, mystifying to others...

https://twitter.com/ComodiniCachia/status/725302288886169600/photo/1

Also Julia Reda MEP was there at the start!

Here's the panel (7-8) run by Catherine Stihler MEP (who chaired well and let everyone else speak)

Marco Giorello Head copyright Unit DG Connect

Problem: data analytics techniques involve making copies
These copies are relevant to copyright
Legal situation unclear; some exceptions for temporary copying, and copying for research purposes
(a) contractual conditions and policies
(b) legislation - UK exception - because there was already research exception (but leads to Euro fragmentation).
Other states have "research exception". Other states e.g. France, and ?Germany. We don't want 15 different legislations
Dec 2015 - EC trying to find balance - PIRO [Public Interest Research Organization, yes I don't know what that is either, so asked later...] - to address Univs and research insts.
But aware that Univs have private partners
UK "non-commercial" has caused problems.
Not only about copyright - but also technology, standards ...

John Boswell SAS (software company) - analysis of data.
TDM is just one form of data analysis. Copyright wider, bcos movies, images, voice all covered by copyright
analysis of 1 million docs to extract sentiment and time series, does not implicate (C).
(C) is protection of expression of an idea. Analysing this does not copy the expression or create a derivative work. (C) must not prevent TDM. Issue much bigger than Universities. World has so much (C) - ca 300,000 every minute FB, Tweets, Instagram, etc. Much covered by copyright
Analysis of social media is major good. Govs can use social media to predict economics
Debate must realise that TDM does not implicate (C)

Therese Comodini Cachia (MEP and meeting convener)

Don't wish to have debate on copyright vs TDM
Startups need protection from copyright and also need to use TDM
Startup innovation are EU priority - social and economic development
TDM will lead to new economic development
Reda report focussed on academic research.
innovation not just economic but also health and social
would give good push to innovation

Jakub Czakon (Stermedia) - (data analyst Physics + finance + chess)
loves data
TDM = data -> information -> knowledge
example s/w that matches CVs onto job offers
extract important info from data
try to match qualifications- find connections and distances between documents
health care - diagnosis of tumour - used machine learning and public data - found public competition training set.
looks for cells and local structure. Created diagnostic indicators.
facial recognition
these skills and startups are critical for Europe

Adriana Homolova - data journalist and visualisation
dataScience >> data analysis (insight into data) >> data analytics (analysing large amounts of data) >> data mining
uses AI.
NeuralNets, RandomForests, NearestNeighbours
Data mining is starting in journalism
journalism qualitative vs quantitative - "Interview data"
makes journalism stronger
data analysis used to filter professors for side jobs for "interesting people"
e.g. 3 side jobs per prof
BBC analysed tennis for match-fixing for repeated underperforming
published on github
revolutionary in journalism
Panama papers had 400 (competing) journalists to abandon secrecy "newsroom collaboration"
data are the raw material of our age.
copyright can do much harm.
data analytics are extension of our thought processes
we must look how to open up - e.g. copyleft

Jean-Francois Dechamp DG Research and Innovation
both policy creation and funding agency
FutureTDM and OpenMinTeD
objective - best conditions to do their job
researchers are both producers and consumers
researchers often don't own copyright of their research
competition fierce - merger of Springer and Nature
data journals
publishers => service providers

Sergey Filippov Lisbon Council (Brussels Innovation Think tank)
Report 2 years ago on TDM in Academic and Research Communities in Europe
Academic pubs 1.5 million / year, 60 million in total
"Publish or perish" leads to distraction from teaching and poor research
Traditional k/w search; TDM can recognise concepts, facts, relations, preparatory
idea -> lit rev (TDM) -> hypothesis (TDM) -> data methodology -> analysis conclusions
what's problem? copyright ...
researched this...
scientific publications: 1200 pubs, 47% from US, EU 26%; EU cited less than US
applicable to all subjects, not just hard sciences
10-fold increase in Data mining, TDM papers in last 5 years
US 21%, EU 28, CN 10, IN 13%
Patents in data mining huge growth in China
Then he interviewed 20 researchers
most people don't know about TDM or aren't tech-savvy
many worried about copyright
leads to results of lower quality
academics want exceptions
R2RR2M
growth in CN and IN and US
Europeans concerned but worried about clarity
if we don't manage to get TDM used, then far-reaching negative implications for EU

Questions:

Christoph Bruch: Open Science Coordination Office of the Helmholtz Association

lot of researchers want assurance
Must not be universities only
(to Marco EC) must not limit how society can use information
limit will do very much damage

Marco - commercial vs nc. Current draft is not final.
Why not business activities. Exception would also be (C) but certain classes of beneficiaries.
must look at (C) with care
cause friction
Pharma already use licences
Existing lucrative market for re-use so EC can't easily sweep it away
attempt to give full legal certainty
will be positive for academia and neutral for others

Boswell SAS - there is broad exception for TDM as "fair use" if not used for other purpose
interim step - new work is not copy of expression
in EC temporary copy should be covered by 5.1 of InfoSoc directive
PPIs with universities - lines are blurred
Should not make lines between univs and others

PM-R gave TDMer point of view and asked about PIRO - more later

@TheContentMine preparing for largescale high-throughput Mining (TDM)

The ContentMine (contentmine.org) has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I'll try to post frequent information.

The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:

  • Mark MacGillivray
  • Anusha Ranganathan
  • Richard Smith-Unna
  • Tom Arrow
  • Peter Murray-Rust
  • Chris Kittel
  • and voluntary contributions

The daily operation (as opposed to user-driven getpapers) consists of:

  • DOIs and URLs provided by CrossRef
  • downloading software
  • indexing of fulltext documents (closed as well as open, legal under the UK "Hargreaves" exception)
  • fact extraction
  • display

We'll detail this later.
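Until then, here is a very rough sketch, in Python, of how the last three stages (indexing, fact extraction, display) could fit together. It is not the ContentMine code: the document identifiers, the short texts standing in for downloaded papers, and every function name are assumptions made for illustration, and the species pattern is deliberately crude.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for downloaded full text, keyed by made-up identifiers.
DOCUMENTS = {
    "doc-1": "Isospora belli diarrhea with severe dehydration was reported.",
    "doc-2": "Coccidian oocysts were found in 36% of the birds.",
}


def build_index(documents):
    """Map each lower-cased word token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return index


# Crude pattern for a binomial name: capitalised word followed by a
# lower-case word of three or more letters. Real text needs far more care,
# because ordinary phrases also fit this shape (false positives).
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def extract_species(text, genus):
    """Return sorted, de-duplicated 'Genus epithet' strings for one genus."""
    return sorted({f"{g} {s}" for g, s in BINOMIAL.findall(text) if g == genus})


def display(documents, genus):
    """Print one line per document with the species names found in it."""
    for doc_id, text in documents.items():
        found = extract_species(text, genus)
        print(doc_id, "->", ", ".join(found) if found else "(none)")


index = build_index(DOCUMENTS)
print(sorted(index["isospora"]))
display(DOCUMENTS, "Isospora")
```

The point of the sketch is only the shape of the flow: text goes in, an index and a list of extracted "facts" come out, and a final step shows them per document.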

The sources include:

  • open repositories such as EuropePubMedCentral
  • arxiv and other repositories
  • closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.

All closed work will be carried out on closed machines run by the University's computer officers, primarily in Chemistry, and again public thanks to this wonderful group. We take great care to limit access so that no unauthorised access is possible and that there is also an audit trail of what we do and have done.

It is difficult to predict the daily volume. MarkMacG has found it to vary between 300 and 80,000 documents a day. My guess is about 2000-7000 on average.

This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small - perhaps 1000 documents a minute on our system. A whole day's literature can be done within a long coffee break.
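A quick back-of-envelope check of those numbers, as a sketch in Python. The 1000 documents a minute and the daily volumes are the rough guesses quoted above; the function name is just illustrative.

```python
DOCS_PER_MINUTE = 1000  # rough throughput quoted above, not a benchmark


def processing_minutes(documents, docs_per_minute=DOCS_PER_MINUTE):
    """Minutes needed to process a batch at a constant rate."""
    return documents / docs_per_minute


# Lowest day seen, the guessed average range, and the highest day seen.
for daily_volume in (300, 2000, 7000, 80000):
    minutes = processing_minutes(daily_volume)
    print(f"{daily_volume:>6} documents: {minutes:g} minutes")
```

So a typical day (2000-7000 documents) needs a few minutes, and even the busiest day mentioned (80,000) needs about 80 minutes at that rate.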

The impact on publisher servers is minimal. At, say, 5000 articles/day even the largest publisher would only get 1 request per minute. The others would be trivial (1 request every 5-10 minutes). There is no case that our responsible TDM would cause any problems at all.
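The same kind of arithmetic for the load on publishers, again as a Python sketch. The 5000 articles/day is the figure above; the publisher shares (25% and 4%) are assumptions picked for illustration, not measured market shares, and requests are assumed to be spread evenly over the day.

```python
MINUTES_PER_DAY = 24 * 60


def minutes_between_requests(articles_per_day, share):
    """Average gap in minutes between requests to one publisher.

    `share` is the assumed fraction of the day's articles that the
    publisher hosts.
    """
    return MINUTES_PER_DAY / (articles_per_day * share)


ARTICLES_PER_DAY = 5000  # figure used in the text above

# Assumed shares, for illustration only.
for label, share in (("largest publisher", 0.25), ("smaller publisher", 0.04)):
    gap = minutes_between_requests(ARTICLES_PER_DAY, share)
    print(f"{label}: one request every {gap:.1f} minutes")
```

With these assumed shares the largest publisher sees a request roughly every 1.2 minutes and a smaller one about every 7 minutes, in line with the estimate above.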

And, just to reassure everyone, I and colleagues are working hard to stay completely within the law as we see it. We are not stealing content.

Off to Brussels for ContentMining (TDM) meeting.

I'm spending a (long) day going to Brussels to a meeting run by MEPs and the European Parliament on Text and Data Mining. Here's the metadata:

“Demystifying Text and Data Mining in a copyright context”

When: Wednesday 27 April 2016, 13.00 – 15.00

Where: European Parliament, ASP, Room A5E2

Event co-hosted by Miapetra Kumpula-Natri & Therese Comodini Cachia & Catherine Stihler

First - I am a great supporter of the MEPs who propose reform - we can add Julia Reda (@senficon) to this.

The blurb is only present as a woolly GIF. Why??? I can't even cut-and-paste? We are in the digital century?

The UK has one of the few Exceptions to Copyright allowing TDM (for very limited purposes - personal non-commercial research for those who have legal access to the material). I am one of the very few people - perhaps one of two - who is actually using this legal permission.

Europe has been fighting for similar rights - and so have individual jurisdictions such as France:

Declaration pro-exception in #copyright for #TDM in France (and in French) by group of entrepreneurs and leaders: http://www.lesechos.fr/idees-debats/editos-analyses/021875211332-fouille-de-donnees-la-loi-ne-doit-pas-enterrer-la-recherche-francaise-1217258.php#
(PMR summary - the great-and-good of France are fighting for rights to carry out TDM).

However I am deeply worried about the European initiative. Every time there is to be a draft, the time slips. The current wording is so vague as to be almost useless. We are all fighting massive opposition from publishers and lobbyists and reform gets watered down month by month...

Simply - I (PMR) am allowed to mine in UK because ANYONE has "The right to read is the right to mine". By contrast in Europe only "Public (Interest) Research Organisations" can mine.

  • Is a journalist a PIRO? No.
  • Is a teacher a PIRO? No.
  • Is PMR a PIRO? No.

Who is?

My guess is that this will turn out to require either/or

  • a regulator
  • a court case

If we rely on the EC then maybe I would have to register as an approved TDM'er and only carry out TDM at approved institutions.

Please tell me that I am overreacting.

Please...

I shall certainly ask this tomorrow if I am allowed to speak.

oh - and here is the awful GIF that accompanied the event. I hope against hope that it was a mistake. It sends out every wrong message...
[Screenshot, 2016-04-26: the GIF that accompanied the event]

TDM Copyright reform is about LICENSING?? NO, NO, NO

ContentMine at Force2016; notes for my session

I have a 30-minute session at Force2016 on Semantic Publishing. I'll concentrate on ContentMine. I shall not powerpoint people, but do some experiments.

Here are some useful links:

 

NOTE: You can do all this yourself. You don't need to be in a University or get publisher permission. I shall explore this with my taxi-driver.

Nature also charges to read about Zika

Nature also charges to read about Zika. The British Dental Journal is owned by them.

[Screenshot, 2016-04-16]

Elsevier's paywalling of Zika papers is systematic

[Screenshot, 2016-04-16]

Wolters-Kluwer charge to read Zika papers

Wolters-Kluwer charge to read Zika papers...

[Two screenshots, 2016-04-16]

Mary Ann Liebert Publishers charge 51 USD to read a paper on Zika virus

[Screenshot, 2016-04-16]

I thought all responsible publishers had agreed to make Zika articles Open Access.

I was clearly wrong.

Open Access saves lives

Closed Access ? ? ?

Elsevier still charge for papers on Zika

Elsevier still hold Zika papers behind paywall

I thought that all major publishers had agreed to make papers on Zika available for free as a public service.
But I was wrong. Elsevier are charging people. Admittedly I haven't read this paper as I'd have to pay. But I think it's about Zika.

Prediction: we'll have a mail from Elsevier saying "this is part of our bumpy road. We're still not fully competent to manage Open Access, so please forgive us and it will take some more years to get it right. Meanwhile we'll continue to charge you."

NOTE: This paper wasn't difficult to discover - I found it on Europe PubMed Central (http://europepmc.org/). I just typed "Zika".
