We’ve now built a version of ContentMine that we feel happy to promote for you to use. There’s been no secret: all the code (https://github.com/ContentMine/) and most of the discussion have been public. But now we are actively asking people to try it and give feedback.
Here’s a quote we got today:
“Hi, first congratulations. I have to say that I was skeptical at the beginning but I have tested the software and it is fantastic.”
That is the sort of thanks that keeps you going at midnight when the build doesn’t build and the tests start failing and …
… but ContentMine is a large system and distributing and maintaining large systems is hard, tedious, endlessly frustrating. We’re very conscious of when it fails for you. The good thing is that we’ve had a number of volunteers try it out and they deserve y/our thanks. It’s very hard being the first and getting fails.
A great difference has been made by Tom Arrow. Tom graduated last year from Imperial, London, and won the first Bradley-Mason prize. We are delighted that he chose to join ContentMine as a developer and moved to Cambridge two months ago. The role is turning out to be both critical and exciting. Tom has been helping people set up the system.
CM consists of 6 modules of which 2 (cat, canary) are primarily server-side and 4 (getpapers, quickscrape, norma, ami) are downloadable by anyone. The use has evolved over the last year – in April 2015 we ran a workshop where the first day was showing how to use the system. Now we suggest you can get up to speed on your own – and in maybe 15 minutes. That depends critically on your experience of installation on your system – if you know about the commandline and things like apt-get, Node.js, npm, JRE, and PATHs, it should be reasonably straightforward. So we’ll call it “alpha”, which means it works but you benefit from knowing what you are doing and how to avoid mistakes.
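If you are not sure whether the prerequisites are in place, a quick check of your PATH can save some frustration. The following is only a minimal sketch, not part of ContentMine: the executable names `node`, `npm` and `java` are assumptions (on some systems Node.js is installed as `nodejs`), and the helper names `find_tools` and `report` are made up for illustration.

```python
import shutil

# Assumed executable names; on some systems Node.js is installed as "nodejs".
REQUIRED_TOOLS = ["node", "npm", "java"]


def find_tools(names):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def report(found):
    """Return one human-readable status line per tool."""
    return [
        f"{name}: {path if path else 'NOT FOUND on PATH'}"
        for name, path in found.items()
    ]


if __name__ == "__main__":
    for line in report(find_tools(REQUIRED_TOOLS)):
        print(line)
```

Anything reported as not found probably needs installing, or its install directory adding to your PATH, before the Node.js or Java modules will run.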
There are 4 downloadable modules:
- getpapers (Node.js) which queries a repository (by default Europe PubMed Central). It returns anywhere from 10 to more than 1000 papers (mainly Open Access). This can take 15-200 s depending on line speeds.
- quickscrape (Node.js) which uses known URLs to retrieve the components of a paper (HTML, PDF, supplemental data, etc.). Normally you’ll use one of these two.
- norma (Java) which turns XML, PDF, HTML into scholarlyHtml. In most cases norma will do what she needs to without you having to worry.
- ami (Java) a set of modules for searching and filtering on scientific criteria (species, genes, chemistry, disease, countries, etc.).
We use the commandline to launch processes, and the file system to communicate and store results. This general approach is tried and tested over 50 years! In essence we have built a toolbox for knowledge.
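To make the "processes plus file system" idea concrete, here is a small sketch of the pattern. It does not call the real getpapers, norma or ami commands: the three stages are hypothetical stand-ins (tiny Python scripts run as separate processes), and the file names `fulltext.xml`, `normalized.txt` and `results.txt` are chosen purely for illustration. Each stage only knows the paper's directory; it reads what the previous stage left there and writes a new file alongside.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in stages, NOT the real ContentMine tools.
# Each one is run as its own process and receives the paper directory as argv[1].
FETCH = """
import pathlib, sys
paper = pathlib.Path(sys.argv[1])
(paper / "fulltext.xml").write_text("<p>Homo sapiens and Mus musculus</p>")
"""

NORMALIZE = """
import pathlib, re, sys
paper = pathlib.Path(sys.argv[1])
xml = (paper / "fulltext.xml").read_text()
(paper / "normalized.txt").write_text(re.sub("<[^>]+>", "", xml))
"""

SEARCH = """
import pathlib, sys
paper = pathlib.Path(sys.argv[1])
text = (paper / "normalized.txt").read_text()
count = text.count("Homo sapiens")
(paper / "results.txt").write_text("Homo sapiens: " + str(count))
"""


def run_stage(code, paper_dir):
    """Launch one stage as a separate process; fail loudly if it exits non-zero."""
    subprocess.run([sys.executable, "-c", code, str(paper_dir)], check=True)


def run_pipeline(paper_dir):
    """Run the stages in order; the directory is the only shared state."""
    for stage in (FETCH, NORMALIZE, SEARCH):
        run_stage(stage, paper_dir)
    return sorted(p.name for p in paper_dir.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paper = Path(tmp)
        print(run_pipeline(paper))
        print((paper / "results.txt").read_text())
```

Because every intermediate file stays on disk, you can inspect the output of any stage, rerun a single stage, or swap in a different tool without touching the others.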
We’re showing this in Brussels on Thursday with a hackday on Friday. Do come! We are delighted that Julia Reda, MEP is coming in the afternoon. Julia has been a huge supporter of mining the literature and the copyright reform required to support it. Julia has volunteered to install the system on her Ubuntu machine – so we are on public show!
It’s much easier installing programs on multiple operating systems than it used to be. Here we have two languages (Node, Java) and 3 systems (UNIX, MacOS, Windows) so that’s 6 combinations. Tom has done a great job of making all these work but there are still alpha bugs we don’t know of!
And we know that different countries have variants – diacritics, keyboards, even file names. We can’t guarantee anything other than UK/US-EN, but we’ll certainly try!
In Brussels we hope to be able to communicate with anyone via Etherpad, possibly Skype or Uber. More details on Twitter – follow @TheContentMine.
I’ll talk about the science we are going to do in the next blog…