petermr's blog

Tummy bug 2: The scientific literature teaches us about Isospora

Posted on April 29, 2016 by pm286

In the previous post we showed how ContentMine could give immediate knowledge about a scientific topic – we analysed “Isospora”, which is a nasty tummy bug. Let’s just read Wikipedia to get some idea of the language we’ll need:

Life Cycle


An oocyst with one sporoblast is released in stool of infected person
After the oocyst has been released, the sporoblast matures further and divides into two
After the sporoblasts divide they create a cyst wall and become sporocysts
The sporocysts each divide twice, resulting in four sporozoites
Transmission occurs when these mature oocysts are ingested
The sporocysts excyst in the small intestine where sporozoites are released
The sporozoites then invade epithelial cells and schizogony is initiated
When the schizonts rupture, mereozoites are released and continue to invade more epithelial cells
Trophozoites develop into schizonts, containing many mereozoites
After about one week, development of male and female gametocytes begin in the mereozoites
Fertilization results in the development of oocysts, which are released in the stool [1][6]

The sporulation time of this parasite’s egg is usually 1–4 days, and the entire life cycle takes about 9–10 days.[7]

Wow! That’s complicated! But that’s because Life is complicated! These parasites have complex life cycles. You have to learn the terms – but it’s no harder than learning the terms in a new game, or a law case, or soccer strategy. You just need to want to do it! And Wikipedia will help. Wikipedia is always there. These parasites are all Apicomplexans and here’s their language https://en.wikipedia.org/wiki/Apicomplexan_life_cycle#oocyst

So if you are interested in more than just Isospora, use ContentMine to search for “Apicomplexan”.

Most of the papers have well-defined messages. The first was about opportunistic infections in HIV patients. Read the word cloudlet for each paper here and see if you can guess the subject of papers 2, 3, 4, 5 and 6. If you know the species behind the Latin names, that helps. If you don’t, use your friend Wikipedia.

Here’s my thinking:

Already done
Caninum, Parasitology, Vets – probably about Dogs. Toxoplasma I’ve heard of – it’s a parasite and https://en.wikipedia.org/wiki/Toxoplasma_gondii confirms it. Never heard of Neospora or Hammondia but I wouldn’t eat them. Check – https://en.wikipedia.org/wiki/Neospora, https://en.wikipedia.org/wiki/Hammondia_hammondi – yes, they are both Apicomplexa, the latter of cats. Did we get it right?

Canine faecal contamination and parasitic risk in the city of Naples (southern Italy).

Seems to be about ferrets, and mink (Mustela), getting influenza. Ferrets develop fatal influenza after inhaling small particle aerosols of highly pathogenic avian influenza virus A/Vietnam/1203/2004 (H5N1).

It is. But why are people worried about ferrets getting sick?? Because influenza uses non-human hosts such as birds and ferrets so we might get it from them. And when I was in the pharma industry they used ferrets as a model of human disease.
Where’s the Isospora?
The animals lacked signs of epizootic catarrhal enteritis, and were negative by microscopy for enteric protozoans such as Eimeria and Isospora species using fecasol, a sodium nitrate fecal flotation solution (EVSCO Pharmaceuticals, Buena, NJ).

Translation: we made sure the test animals didn’t have other infections that could distort our research (and we told you how we did it).

I know Gallus is a hen. And we’re going to add an icon and a mouseover on the table so you don’t need to look it up. Eimeria is an apicomplexan, and because it occurs 6 times in the paper it’s pretty important. I’m guessing it’s about parasites of hens. But what’s the rest? There are lots of genes and my guess is that they are being used for comparative genetics or possibly modes of action.
I don’t know what “QTL” is. I probably should, but why bother when we have Wikipedia?

A quantitative trait locus (QTL) is a section of DNA (the locus) that correlates with variation in a phenotype (the quantitative trait).[1] The QTL typically is linked to, or contains, the genes that control that phenotype.

Rough translation: The phenotype is what we feel, touch, smell and observe in an organism, and the QTL is the part of the genes that affects it.
So the paper is probably about genomic studies on parasites and chickens. Let’s look: QTL detection for coccidiosis (Eimeria tenella) resistance in a Fayoumi × Leghorn F₂ cross, using a medium-density SNP panel.
Rough translation: analysing the genome of chickens for regions that confer resistance to the most serious parasite. Eimeria is an apicomplexan, so I expect the paper mentions a range of them, including Isospora. (Yes: “Coccidia are sub-classified into several genera, including Eimeria, Isospora, Cryptosporidium, Toxoplasma and Sarcocystis.”) So we’re becoming experts on Apicomplexan names!

Turdus, Coccothraustes … Thrushes and Hawfinch. Also the cloudlet shows “birds” and “iron”. “Deadly Outbreak of Iron Storage Disease (ISD) in Italian Birds of the Family Turdidae”. This is the paper where they examine the birdshit for parasites…

So that seems a lot of work – and we are only 5 papers through. But some of those are relevant to Natalie and some aren’t – her false positives. So can we get ContentMine to select just the ones she needs?
We hope so. If the paper has a lot about apicomplexans it’s probably relevant. If it’s about other diseases such as HIV or flu it’s probably not. So we could remove those automatically.
And that would save a lot of time. And hopefully help us learn bioscience in an efficient manner.
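
Here is a rough sketch of what such an automatic filter might look like. It is not ContentMine's actual code: the two term lists, the function names and the threshold are all made up for illustration. It simply counts apicomplexan words against words for other diseases and keeps a paper only when the apicomplexan words dominate.

```python
import re
from collections import Counter

# Hypothetical term lists, chosen only for illustration.
APICOMPLEXAN_TERMS = {
    "isospora", "eimeria", "toxoplasma", "neospora", "hammondia",
    "cryptosporidium", "sarcocystis", "coccidia", "oocyst", "oocysts",
}
OTHER_DISEASE_TERMS = {"hiv", "influenza", "h5n1", "tuberculosis"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text):
    """Return (apicomplexan hits, other-disease hits) for one paper."""
    counts = Counter(tokenize(text))
    on_topic = sum(counts[term] for term in APICOMPLEXAN_TERMS)
    off_topic = sum(counts[term] for term in OTHER_DISEASE_TERMS)
    return on_topic, off_topic


def is_probably_relevant(text, min_hits=2):
    """Keep a paper if it has enough apicomplexan words and they
    outnumber the words that point to other diseases."""
    on_topic, off_topic = relevance(text)
    return on_topic >= min_hits and on_topic > off_topic
```

Fed the ferret paper's title plus its one sentence about Eimeria and Isospora, a filter like this would see more influenza words than parasite words and drop it, while the chicken paper would be kept. A real filter would need better dictionaries and would look at the full text, not a couple of sentences.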

Posted in Uncategorized | 1 Comment

How ContentMine can help you! Our example looks for "tummy bug" for Natalie

Posted on April 29, 2016 by pm286

Yesterday Tom, Natalie and I had coffee together. Natalie’s a Vet student – at Royal Veterinary College – and we got talking about her project – 8 weeks doing practical research on Isospora. I’ve never heard of it. No idea what it is.
But ContentMine will know, so we’ll ask it…
We’ll be showing you in later posts how it all works, but just accept that we type:
getpapers -q isospora -x
Wait a minute for ca 207 open access papers to be downloaded, and then
cmine isospora
And wait another minute for ami to crunch through the data. Ami has already created summary files and we’ll look at full.dataTables.html which gives an overall view of all the “plugins” we have used (species, genes, words, etc.). Here’s the first few papers:

No need to squint – We’ll describe them in larger detail. (Note: some of the links are broken and there are a few false positives, both are being cleaned up).

The first column of results gives links to the papers (PMC2758902 is a PubMedCentral id and clicking it will link to the EuropePubMedCentral repository of full text papers). Yes, YOU can read them. 200 free papers. If you are interested in Isospora, they are all yours! So here’s the first paper of the 200...

PMC2758902 local	SPSS	Isospora belli		patients x 69 (%) x 28 < x 21

We still don’t know what Isospora is, so let’s click on Isospora belli . It’s linked to Wikipedia which says:

Cystoisospora belli, previously known as Isospora belli, is a parasite that causes an intestinal disease known as cystoisosporiasis.[1] This protozoan parasite is opportunistic in immune suppressed human hosts.[2] It primarily exists in the epithelial cells of the small intestine, and develops in the cell cytoplasm.[2] The distribution of this coccidian parasite is cosmopolitan, but is mainly found in tropical and subtropical areas of the world such as the Caribbean, Central and S. America, India, Africa, & S.E. Asia. In the U.S., it is usually associated with HIV infection and institutional living.[3]

So, to paraphrase,

“Isospora is the old name of a nasty tummy bug, found mainly, but not exclusively, in the sub/tropical world that can infect HIV-sufferers”

Biological science is often hard to read for newcomers, but with practice you learn how to translate. Here’s a sentence from one paper:

Coprological examination of fresh stool specimens revealed coccidian oocysts of the genus Isospora in 36% of the birds

Translated:

We examined birdshit and found parasite eggs in 36%.

The long words are useful – they aren’t there just to put you off or be pompous. They help translate between human languages, and they increase precision. If we search for “parasite eggs in birds” we might end up with bird eggs, whereas “oocysts” is more precise. ContentMine loves precise words because they reduce false positives (results that aren’t relevant to what you want).
Column “words” is a list of the commonest word tokens. In this case it’s just “patients”. That confirms that the paper is probably about human infection (though Natalie and other Vets call animals “patients”). So were we right? Click on PMC2758902 and we’ll see:

So it’s about HIV, and drug treatment. Where’s the Isospora? Search down the full text and we find:

The reasons for hospitalization were: disseminated tuberculosis (month 5), reactivation of oropharyngeal Kaposi’s sarcoma (month 3), and Isospora belli diarrhea with severe dehydration

So if you are interested in finding all papers where Isospora has infected HIV patients, ContentMine can immediately help you.
Natalie’s main interest is veterinary, so we’ll look at the next few papers. But that shows how much there is in just ONE paper. And why we need machines to help us. Natalie probably mainly wants papers about animals and we can address that as well…

… in the next blog post!


TDM at European Parliament – tweet-like report

Posted on April 28, 2016 by pm286

Great meeting at Brussels EP yesterday. Would have liked to tweet but didn’t have password. – There *were* tweets by the MEPs. So I wrote my notes like tweets I would have made. May be useful to some, mystifying to others…
https://twitter.com/ComodiniCachia/status/725302288886169600/photo/1

Also Julia Reda MEP was there at the start!
Here’s the panel (7-8) run by Catherine Stihler MEP (who chaired well and let everyone else speak)
Marco Giorello Head copyright Unit DG Connect
Problem: data analytics techniques involve making copies
These copies are relevant to copyright
Legal situation unclear; some exceptions temporal copying, and copying for research purposes
(a) contractual conditions and policies
(b) legislation – UK exception – because there was already research exception (but leads to Euro fragmentation).
Other states have “research exception”. Other states e.g. France, and ?Germany we don’t want 15 different legislations
Dec 2015 – EC trying to find balance – PIRO [Public Interest Research Organization, yes I don’t know what that is either, so asked later…] – to address Univs and research insts.
But aware that Univs have private partners
UK “non-commercial” has caused problems.
Not only about copyright – but also technology, standards …
John Boswell SAS (software company) – analysis of data.
TDM is just one form of data analysis. Copyright wider, bcos movies, images, voice all covered by copyright
analysis of 1 million docs to extract sentiment and time series, does not implicate (C).
(C) is protection of expression of an idea. Analysing this does not copy the expression or create a derivative work. (C) must not prevent TDM. Issue much bigger than Universities. World has so much (C) – ca 300,000 every minute FB, Tweets, Instagram, etc. Much covered by copyright
Analysis of social media is major good. Govs can use social media to predict economics
Debate must realise that TDM does not implicate (C)
Therese Comodini Cachia (MEP and meeting convener)
Don’t wish to have debate on copyright vs TDM
Startups need protection from copyright and also need to use TDM
Startup innovation are EU priority – social and economic development
TDM will lead to new economic development
Reda report focussed on academic research.
innovation not just economic but also health and social
would give good push to innovation
Jakub Czakon (Stermedia) – (data analyst Physics + finance + chess)
loves data
TDM = data -> information -> knowledge
example s/w that matches CVs onto job offers
extract important info from data
try to match qualifications- find connections and distances between documents
health care – diagnosis of tumour – used machine learning and public data – found public competition training set.
looks for cells and local structure. Created diagnostic indicators.
facial recognition
these skills and startups are critical for Europe
Adriana Homolova – data journalist and visualisation
dataScience >> data analysis (insight into data) >> data analytics (analysing large amounts of data) >> data mining
uses AI.
NeuralNets, RandomForests, NearestNeighbours
Data mining is starting in journalism
journalism qualitative vs quantitative – “Interview data”
makes journalism stronger
data analysis used to filter professors for side jobs for “interesting people”
e.g. 3 side jobs per prof
BBC analysed tennis for match-fixing for repeated underperforming
published on github
revolutionary in journalism
Panama papers had 400 (competing) journalists to abandon secrecy “newsroom collaboration”
data are the raw material of our age.
copyright can do much harm.
data analytics are extension of our thought processes
we must look how to open up – e.g. copyleft
Jean-Francois Dechamp DG Research and Innovation
both policy creation and funding agency
FutureTDM and OpenMinTeD
objective – best conditions to do their job
researchers are both producers and consumers
researchers often don’t own copyright of their research
competition fierce – merger of Springer and Nature
data journals
publishers => service providers
Sergey Filippov Lisbon Council (Brussels Innovation Think tank)
Report 2 years ago on TDM in Academic and Research Communities in Europe
Academic pubs 1.5 / year , 60 million in total
“Publish or perish” leads to distraction from teaching and poor research
Traditional k/w search, TDM can recognise concepts, facts, relations, preparatory
idea -> lit rev (TDM) -> hypothesis (TDM) -> data methodology -> analysis conclusions
what’s problem? copyright …
researched this…
scientific publications 1200 pubs 47% from US EU 26% EU cited less than US
applicable to all subjects, not just hard sciences
10-fold increase in Data mining, TDM papers in last 5 years
US 21%, EU 28, CN 10, IN 13%
Patents in data mining huge growth in China
Then he interviewed 20 researchers
most people don’t know about TDM or tech-savvy
many worried about copyright
leads to results of lower quality
academic want exceptions
R2RR2M (“The right to read is the right to mine”)
growth in CN and IN and US
Europeans concerned but worried about clarity
if we don’t manage to get TDM used, then far-reaching negative implications for EU
Questions:
Christoph Bruch: Open Science Coordination Office of the Helmholtz association,
lot of researchers want assurance
Must not be universities only
(to Marco EC) must not limit how society can use information
limit will do very much damage
Marco – commercial vs nc. Current draft is not final.
Why not business activities. Exception would also be (C) but certain classes of beneficiaries.
must look at (C) with care
cause friction
Pharma already use licences
Existing lucrative Market for re-use so EC can’t easily sweep it away
attempt to give full legal certainty
will be positive for academia and neutral for others
Boswell SAS – there is broad exception for TDM as “fair use” if not used for other purpose
interim step – new work is not copy of expression
in EC temporary copy should be covered by 5.1 of InfoSoc directive
PPIs with universities – lines are blurred
Should not make lines between univs and others
PM-R gave TDMer point of view and asked about PIRO – more later


@TheContentMine preparing for largescale high-throughput Mining (TDM)

Posted on April 27, 2016 by pm286

The ContentMine (contentmine.org) has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I’ll try to post frequent information.
The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:

Mark MacGillivray
Anusha Ranganathan
Richard Smith-Unna
Tom Arrow
Peter Murray-Rust
Chris Kittel
and voluntary contributions

The daily operation (as opposed to user-driven getpapers) consists of:

DOIs and URLs provided by CrossRef
downloading software
indexing of fulltext documents (closed as well as open, legal under the UK “Hargreaves” exception)
fact extraction
display

We’ll detail this later.
The sources include:

open repositories such as EuropePubMedCentral
arxiv and other repositories
closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.

All closed work will be carried out on closed machines run by the University’s computer officers, primarily in Chemistry, and again public thanks to this wonderful group. We take great care to limit access so that no unauthorised access is possible and that there is also an audit trail of what we do and have done.
It is difficult to predict the daily volume. MarkMacG has found it to vary between 300 and 80,000 documents a day. My guess is about 2000-7000 on average.
This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small – perhaps 1000 documents a minute on our system. The whole literature can be done within a long coffee break.
The impact on publisher servers is minimal. At, say, 5000 articles/day even the largest publisher would only get 1 request per minute. The others would be trivial (1 request every 5-10 minutes). There is no case that our responsible TDM would cause any problems at all.
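
To make that arithmetic easy to check, here is a small back-of-envelope sketch. The 5000 articles/day and 1000 documents/minute figures come from the text above; the publisher shares (25% for a large publisher, 3% for a small one) are my own illustrative assumptions, not measured values, and the processing estimate reads "the whole literature" as one day's intake.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def requests_per_minute(articles_per_day, publisher_share):
    """Average requests per minute reaching one publisher's servers,
    if downloads are spread evenly over the day."""
    return articles_per_day * publisher_share / MINUTES_PER_DAY


def minutes_between_requests(articles_per_day, publisher_share):
    """Average gap, in minutes, between two requests to one publisher."""
    return 1 / requests_per_minute(articles_per_day, publisher_share)


def processing_minutes(documents, docs_per_minute=1000):
    """Time to process a batch at the quoted rate of about 1000 docs/minute."""
    return documents / docs_per_minute


if __name__ == "__main__":
    daily = 5000  # articles/day, the figure used in the text
    # The shares below are assumptions for illustration only.
    print(requests_per_minute(daily, 0.25))       # a large publisher
    print(minutes_between_requests(daily, 0.03))  # a small publisher
    print(processing_minutes(7000))               # a busy-ish day
    print(processing_minutes(80000))              # the largest day seen
```

Under these assumptions a large publisher sees a little under one request a minute, a small one sees a request roughly every ten minutes, and even an 80,000-document day is processed in under an hour and a half.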
And, just to reassure everyone, I and colleagues are working hard to stay completely within the law as we see it. We are not stealing content.


Off to Brussels for ContentMining (TDM) meeting.

Posted on April 26, 2016 by pm286

I’m spending a (long) day going to Brussels to a meeting run by MEPs and the European Parliament on Text and Data Mining. Here’s the metadata:
“Demystifying Text and Data Mining in a copyright context”
When: Wednesday 27 April 2016, 13.00 – 15.00
Where: European Parliament, ASP, Room A5E2
Event co-hosted by Miapetra Kumpula-Natri & Therese Comodini Cachia & Catherine Stihler
First – I am a great supporter of the MEPs who propose reform – we can add Julia Reda (@senficon) to this.
The blurb is only present as a woolly GIF: Why??? I can’t even cut-and-paste? we are in the digital century?
The UK has one of the few Exceptions to Copyright allowing TDM (for very limited purposes – personal non-commercial research for those who have legal access to the material). I am one of the very few people – perhaps one of two – who is actually using this legal permission.
Europe has been fighting for similar rights – and so have individual jurisdictions such as France:

Open Access EC @OpenAccessEC

Declaration pro-exception in #copyright for #TDM in France (and in French) by group of entrepreneurs and leaders: http://www.lesechos.fr/idees-debats/editos-analyses/021875211332-fouille-de-donnees-la-loi-ne-doit-pas-enterrer-la-recherche-francaise-1217258.php#

(PMR summary – the great-and-good of France are fighting for rights to carry out TDM).

However I am deeply worried about the European initiative. Every time there is to be a draft, the time slips. The current wording is so vague as to be almost useless. We are all fighting massive opposition from publishers and lobbyists and reform gets watered down month by month…
Simply – I (PMR) am allowed to mine in UK because ANYONE has “The right to read is the right to mine”. By contrast in Europe only “Public (Interest) Research Organisations” can mine.

Is a journalist a PIRO? No.
Is a teacher a PIRO? No.
Is PMR a PIRO? No.

Who is?
My guess is that this will turn out to require either/or

a regulator
a court case

If we rely on the EC then maybe I would have to register as an approved TDM’er and only carry out TDM at approved institutions.
Please tell me that I am overreacting.
Please…
I shall certainly ask this tomorrow if I am allowed to speak.
oh – and here is the awful GIF that accompanied the event. I hope against hope that it was a mistake. It sends out every wrong message…

TDM Copyright reform is about LICENSING?? NO, NO, NO


ContentMine at Force2016; notes for my session

Posted on April 17, 2016 by pm286

I have a 30 mins session at Force2016 on Semantic Publishing. I’ll concentrate on ContentMine. I shall not powerpoint people, but do some experiments.
Here are some useful links:

Content mine website contentmine.org.
Install contentmine software yourself https://contentmine.github.io/. (Julia Reda MEP did so, so can you!)
Extracting information from biomedical papers (example clinical trials). More slides at http://www.slideshare.net/petermurrayrust/
Architecture of the system http://www.slideshare.net/petermurrayrust/architecture-of.
Wikidata. Example
Full semantic markup in chemistry. http://chemicaltagger.ch.cam.ac.uk/. Try it yourself.
Text mining – discussions with Royal Society.
typical dictionary Mouse genes extracted from Jackson laboratory.
http://blog.riojournal.com/2016/03/11/openly-published-open-science-prize-grant-proposal-builds-on-contentmine-and-hypothes-is-to-bridge-scientists-and-facts/ Proposal with Hypothes.is