Postal voting UK style


Whenever there is a national, international or local vote I feel it's my duty to vote. My grandmother was a suffragette and their actions won the right for today's British women to vote. (She didn't go to jail because she had small children, so the suffragettes wouldn't allow her to be put in the front line.) Universal suffrage represents one of the main struggles of the C20 - and it's still going on today.

So even if you think your vote makes no difference, the very fact of voting in the UK supports the disadvantaged elsewhere.

I take the act of voting seriously and dress up for the occasion - see my voting suit above. The suit came about because it was unclear (and still is) whether a voter has to show their face. In fact every time I have voted as a bear I have been asked to remove my head and have done so.

This year I can't vote in person so I applied for a postal vote. I have found the process archaic and arcane. I am sure that many voters don't use a postal vote because of the hassle.

  • Step 1. Find out how to do it.
  • Step 2. Send a postal request for a voting form. This cost me a first-class stamp, which seems absurd - why can we not ask for a form online? Anyway I sent the form off and heard nothing. I tweeted my MP (Julian Huppert) and he agreed that I should have heard. I rang up this Monday and was told the forms had been sent out the previous Friday. Still nothing. The forms finally arrived on Wednesday, so it took 3 (or 4) working days for them to travel 2 miles (3 km). This is absurd. Relying on the Royal Mail to provide "next day" delivery for first-class local post seems broken.
  • Step 3. I have filled in the forms for the national and local elections. How did I vote? [see below]. I then dressed up to vote today, 2015-05-01, went to my local post-box, and put the envelope in it. It has 4 working days to get to the Guildhall (3 km distant). I will never know whether it arrived in time. It's possible I may be disenfranchised by slow postage.

So the system must be changed. It should be possible to get a voting form instantaneously. I am not arguing for electronic voting, but it should be possible to know that your vote has arrived and will go into the counting process.

So who did I vote for?

The party system is so broken in the UK that it's impossible to vote for a party. I used to know what Labour, the Conservatives and the Liberals stood for. Now I don't. Blair betrayed the system for ever - with the implied slogan "Trust me, I know what's best". The current leaders are pathetic. They are currently trying to find small gaps in their opponents' policies and add sweeties for the electorate. "No new income tax for 5 years": if that was so important it should have been in the manifesto. They are simply gibbering, and most people who think about the issues gave up long ago. Clegg has destroyed the Lib Dems - left them with no solid bedrock.

So I don't vote for parties. I vote for people. That's on the basis that a responsible representative will recognize when policy is so far adrift that it has to be challenged. (Yes, you have spotted that I am an idealist). It makes it very difficult in Europe because you can only vote for parties.

Politics is carried out 365 days a year. Many issues are not party-political but require independent, committed analysis and hard work. There is good politics in the UK, in Europe, and in the US (I don't have first-hand knowledge of most countries). There is also awful politics in all of them, and it is that which I am fighting.

So you will have to guess who I voted for.


Posted in Uncategorized | Leave a comment

Is Figshare Open? "it is not just about open or closed, it is about control"

[Quote in title is from Mark Hahnel, see below]

I have been meaning to write on this theme for some time, and more generally on DigitalScience's growing influence in parts of the academic infrastructure. This post is sparked by a twitter exchange (follow backwards from ) in the last few hours, which addresses the question of whether "Figshare is Open".

This is not an easy question and I will try to be objective. First let me say - as I have said in public - that I have huge respect and admiration for how Mark Hahnel created Figshare while a PhD student. It's a great idea and I am delighted - in the abstract - that it gained so much traction so rapidly.

Mark and I have discussed issues of Figshare on more than one occasion and he's done me the honour of creating a "Peter Murray-Rust" slide ( ) where he addresses some (but not all) of my concerns about Figshare after its "acquisition" by Macmillan Digital Science (I use this term, although there are rumours of a demerger or merger). I use "acquisition" because I have no knowledge of the formal position of Figshare as a legal entity (I assume it *is* one? Figshare FAQs ) and that's one of the questions to be addressed here.

From the FAQs:

figshare is an independent body that receives support from Digital Science. "Digital Science's relationship with figshare represents the first of its kind in the company's history: a community based, open science project that will retain its autonomy whilst receiving support from the division."

However Digital Science lists Figshare among "our products" and brands it as if it were a DigitalScience division or company. Figshare appears to have no corporate address other than Macmillan's and I assume trades through them.

So this post has been catalysed by a tweet reporting a talk by Dan Valen, a DS employee(?):

John Hammersley @DrHammersley tweeted:
Such a key message: "APIs are essential (for #opendata and #openscience)" - Dan Valen of @figshare at #shakingitup15

This generated a twitter exchange about why APIs were/not essential. I shan't explore that in detail, but my primary point is that:

If the only access to data is through a controlled API, then the data as a whole cannot be open, regardless of the openness of individual components.

There is no doubt that some traditional publishers see APIs as a way of enforcing control over the user community. Readers will remember that I had a robust discussion with Gemma Hirsh of Elsevier, who stated that I could not legally mine Elsevier's data without going through their API. She was wrong, categorically wrong, but it was clear that she and Elsevier saw, and probably still see, APIs as a control mechanism. Note that Elsevier's Mendeley never exposed their whole data - only an API.

An API is the software contract with a webserver offering a defined service. It is often accompanied with a legal contract for the user (with some reciprocity). The definition of that service is completely in the hands of the provider. The control of that service is entirely in the hands of the provider. This leads to the following technical possibilities:

  • control: The provider can decide what to offer, when, to whom, on what basis. They can vary this by date, geography or IP of user, and I have no doubt that many publishers do exactly this. In particular, there is no guarantee that the user is able to see the whole data and no guarantee that it is not modified in some way from the "original". This is not, per se, reprehensible but it is a strong technical likelihood.
  • monitoring: ("snooping") The provider can monitor all traffic coming in from IP addresses, dwell times, number of revisits, quite apart from any cached information. I believe that a smart webserver, when coupled to other data about individuals, can deduce who the user is, where they are calling from and, with the sale of information between companies, what they have been doing elsewhere.
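A minimal sketch of what control and monitoring can look like from the provider's side. Everything here is hypothetical - invented function and field names, not any real Figshare or publisher API:

```python
import datetime

# Hypothetical provider-side handler: the provider alone decides what each
# caller may see, and logs every request for later analysis (or resale).
ACCESS_LOG = []

def serve_record(record, api_key, country):
    # monitoring: every call is recorded, whoever the user is
    ACCESS_LOG.append({"key": api_key, "country": country,
                       "time": datetime.datetime.utcnow().isoformat()})
    visible = dict(record)
    if api_key != "premium":        # control: partial view for most users
        visible.pop("full_text", None)
    if country == "XX":             # control: vary the response by geography
        return None
    return visible

record = {"doi": "10.0000/example", "title": "A paper", "full_text": "..."}
partial = serve_record(record, api_key="free", country="GB")
# The "free" caller never sees full_text, and cannot tell what was withheld.
```

The point of the sketch is the asymmetry: the caller sees only what the handler chooses to return, while the provider accumulates a complete record of who asked for what.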

By default companies will do both of these. They could lead to increased revenue (e.g. Figshare could sell user data to other organizations) and increased lockin of users. Because Figshare is one of several Digital Science products (DS words, not mine) they could know about a user's publication record, their altmetric activity, what manuscripts they are writing, what they have submitted to the REF, what they are reading in their browser, etc. I am not asserting this is happening but I have no evidence it is not.

Mark says, in his slides,

"it is not just about open or closed, it is about control"

and I agree. But for me the questions are: who controls Figshare? And is Figshare controlling us?

Figshare appears to be one of the less transparent organizations I have encountered. I cannot find a corporate structure, and the company's address is:

C/o Macmillan Publishers Limited, Brunel Road, Basingstoke, Hampshire, RG21 6XS

I can't find a board of directors or any advisory or governing board. So in practice Figshare is accountable to no-one beyond UK corporate law.

You may think I am being unfair to an excellent (and I agree it's excellent) service. But history inexorably shows that such beginnings become closed, mutating into commercial control and confidentiality. What if Mark moves on? Who runs Figshare then? What if Springer buys Digital Science? What contract has Mark signed with DS? Maybe it binds Figshare to being completely run by the purchaser?

I have additional concerns about the growing influence of DigitalScience products, especially such as ReadCube, which amplify the potential for "snoop and control" - I'll leave those to another blogpost.

Mark has been good enough to answer some of my original concerns, so here are some others to which I think an "open" ("community-based") organization should be able to provide answers.

  • Who owns Figshare?
  • Who runs Figshare?
  • Is there any governance process from outside Macmillan/DS? An advisory board?
  • How tightly bound is Figshare into Macmillan/DS? Could Figshare walk away tomorrow?
  • What could and what would happen to Figshare if Mark Hahnel left?
  • What could and what would happen to Figshare if either or both of Macmillan/DS were acquired?
  • Where are the company accounts for the last trading year?
  • How, in practice, is Figshare "a community based, open science project that will retain its autonomy whilst receiving support from the (DS) division"?

I very much hope that the answers will allay any concerns I may have had.




The power of Digital Theses to change the world

I am speaking tomorrow at Lille to a group of Digital Humanists:

Séminaire DRTD-SHS

"Research data in the digital humanities"

Session of 21 April 2015: "Mastering technologies to add value to data"

Venue: MESHS (room 2), 2 rue des Canonniers, 59000 Lille


I always wait to meet the audience before deciding what to say precisely but here are some themes:

  • Most research and scholarship is badly published, not reaching the people who need it and not creating a dialogue or community of practice. This is a moral crime and leads to impoverishment of the human spirit and the health of the citizens of the world.
  • The paradox of this century is that we have the potential for a new Digital Enlightenment, but in the Universities we are collaborating with those who, for their own personal gain, wish to restrict the distribution of knowledge. The large publishing corporations, with support from media corporations, are building an infrastructure which they monitor and control.
  • We have the technical means to break out of this. In ContentMine we can scrape the whole of the literature published every day, create a semantic index for searching, and extract facts in far greater numbers than humans ever could.
  • We are held back by the lack of vision, and our solution lies not in science, but in humanities. We lack a communal goal, communal values.

How can we harness the vision of Diderot and the Enlightenment and the radicalism of Mai 1968? How can we create the true culture of the digital century?

I shall show some of the tools we have developed in ContentMine, which can scrape and "understand" the whole of scholarly publication. In the UK, after an intense battle against the mainstream publishing community, we have won the right for machines to read and analyze electronic documents without fear of copyright. I express this as: THE RIGHT TO READ IS THE RIGHT TO MINE.
We need this in the rest of Europe - Julia Reda MEP has recently proposed this (and much more). There is again an intense backlash - so we need philosophers, political scientists, historians, literary scholars and economists to show why this freedom has to triumph.
All our tools are Open (Apache2, CC BY, CC0) and we have shown that "anyone" can learn to use them within a morning. They are part of the technical weaponry of digital liberation.
Theses are the major resource over which publishers have no control. Much of our scholarship is published in theses as well as in journals; and much is only published in theses. My single laptop can process 5000 theses per day - or 1 million per year - which should suffice.
The solution will come through human-machine symbionts - communities of practice who understand what machines can and cannot do.

TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.

It's a year since I started TheContentMine, a project funded by the Shuttleworth Foundation. In ContentMine we intend to extract all the world's scientific and medical facts from the scholarly literature and make them available to everyone under permissive Open licences. We have been so busy - writing code, lobbying politically, building the team, designing the system, giving workshops, creating content, writing tutorials, etc. - that I haven't had time to blog.

This week we launched, without fanfare, at a workshop sponsored by Robert Kiley of the Wellcome Trust:


[RK presented with an AMI, the mascot of TheContentMine]

Robert (and WT) have been magnificent in supporting ContentMining. He has advocated, organised, corralled, pushed, challenged over many years. The success of the workshop owes a great deal to him.

On Monday and Tuesday (2015-04-13/14) we ran a 2-day workshop - training, hacking and advocacy/policy. We advertised the workshop, primarily for Early Career Researchers, and were overwhelmed - FOUR TIMES oversubscribed [1]. Jenny Molloy organised the days, roughly as follows:

  • Day 1
    • tutorials and simple hands-on about the technology
    • aspects of policy and protocols
    • planning projects
  • Day 2
    • hacking projects for 6 hours
    • 2-hour policy/advocacy session with key UK and EU attendees

It worked very well and showed that ContentMine is now viable in many areas:

  • We have unique software that takes a completely new approach to searching the scientific and medical literature.
  • We have an infrastructure that allows automatic processing of the literature through CRAWLing, SCRAPE-ing, NORMAlising and MINING (AMI).
  • We have a back-end/server CATalogue (contracted through CottageLabs) which has ingested and analysed a million articles.
  • We have novel search interfaces and display of results.
  • We have established in the UK that THE RIGHT TO READ IS THE RIGHT TO MINE.
  • We have built a team, and shown how to build communities.
  • We have tested training sessions that can be used to train trainers and spread the word.
  • And we are credible at the policy level.


[Part of the policy session]

We are delighted that a dozen funders, policy makers, etc. came. They included JISC, IPO, LIBER, RLUK, RCUK, HEFCE, CUL, WT, BIS, UbiquityPress, NatureNews. The discussion took for granted that ContentMining is critically important and addressed how it could be supported and encouraged.

My slides for the policy session are at

I will blog more details and show more pictures later, and so will Graham "McDawg" Steel. But the highlight for me was the speed and efficiency of the Early Career Researchers in adopting, using, modifying and promoting the system. They came mainly from bioscience/medicine and ranged from UNIX geeks to those who had never seen a commandline. In their projects they were able to make the CM software work for them and extract facts from the literature. One group wrote additional processing software; another created a novel display with D3.

Best of all they said they'd be happy to learn how to run a workshop and take the ideas and software (which is completely Open Apache2/CC BY/CC0) to their communities.

NOTE: Hargreaves allows UK researchers to mine ANYTHING (that they have legal right to read) for non-commercial use. The publishers cannot stop them, either by technical means or contracts with libraries.

This should make the UK the content-mining capital of the world. Please join us!



32-year old Elsevier paper could have averted Ebola but Liberians would have had to pay to read it

I am very angry with the publishing industry.

Last week the NY Times reported that the Ministry of Health in Liberia had discovered a 32-year-old paper that, had they known about it, might have alerted Liberians to the possibility of Ebola. See a report in TechDirt ( ) and also the article in the NY Times itself ( ). The paper itself ( ) is in Science Direct and paywalled (31 USD for ca. 1000 words / 3.5 pages). I'll write more on what the Liberians had to say and how they feel about the publishing industry and Western academia (they are incredibly restrained). But I'm not restrained, and this makes me very angry.

This paper contains the words:

“The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,”

The Liberians argue that if they had known about this risk some of the effects of Ebola could have been prevented.

Suppose I'm a medical educational organization in Liberia and I want to distribute this paper to 50 centers in Liberia. I am forbidden to do this by Elsevier unless I pay 12 USD per 3-page reprint (from


I adamantly maintain "Closed access means people die".

This is self-evidently true to me, though I am still criticized for not doing a scientific study (which would necessarily be unethical). But the Liberian Ministry is not impressed with academia and:

There is an adage in public health: “The road to inaction is paved with research papers.”

We've paid 100 BILLION USD over the last 10 years to "publish" science and medicine. Ebola is a massive systems failure which I'll analyze shortly.




Content Mining Hackday in Cambridge this Friday 20150123 all welcome

We are having a ContentMine hackday - open to all - this Friday in Cambridge.

We are VERY grateful to Laura James, from our Advisory Board, who also set up the Cambridge Makespace where the event will be held. This event will cover everything - technical, scientific, sociolegal, etc. We are delighted that Professor Charles Oppenheim, another member of our Advisory Board, will be present. Charles is a world expert on scholarship, including the policy and legality of mining. For example, he flagged up today that the EU and its citizens are pushing for reform...

We're also expecting colleagues from Cambridge University Library so we can have a lively political stream. And we've got scientific publishers in Cambridge - love to see you.

There'll be a technical stream - integrating the components of quickscrape, Norma, AMI and our API created by Mark MacGillivray and colleagues at CottageLabs. All the technology is brand new and everything is offered Openly (including commercial use).

And there'll be a group of subprojects based on scientific disciplines. They include:

  • clinical trials
  • farming and agronomy
  • crystallography

If you have an area you'd like to mine, come along. You'll need a good idea of your sources (journals, theses, etc.) and some idea of what you'd like to extract. And, ideally, you'll need energy and stamina and friends...

Oh, and in the unlikely event you get bored we are 15 metres from the Cambridge Winter Beer Festival.


This month's typographical horror: Researchers PAY typesetters to corrupt information

One of the "benefits" we get from paying publishers to publish our work is that they "typeset" it. Actually they don't. They pay typesetters to mutilate it. I don't know how much they pay but it's probably > 10 USD per page. This means that when you pay APCs (Article Processing Charges) YOU are paying typesetters - maybe 200 USD.

Maybe you or your funder is happy with this?

I'm not. Typesetters destroy information. Badly. Often enough to blur or change the science. ALL journals do this. I happen to be hacking PLoSONE today, but this is unlikely to be specific to them:


So what's the typographical symbol (or symbols) in the last line? Hint: it's NOT what it SHOULD be:

Unicode Character 'PLUS-MINUS SIGN' (U+00B1)


So what's happened? Try cutting and pasting the last line into a text editor. Mine gives:

(TY/SVL = 0.05+0.01 in males, 0.06+0.01 in females versus 0.08+0.01 in both sexes in L.

This is a DESTRUCTION of information.
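The loss is easy to demonstrate: once the typesetting pipeline has flattened U+00B1 to a plain "+", no downstream processing can tell the two apart. A minimal illustration (strings based on the pasted line above):

```python
# What the science says (PLUS-MINUS SIGN, U+00B1) versus what copy-paste gives:
intended  = "TY/SVL = 0.05\u00b10.01 in males"
extracted = "TY/SVL = 0.05+0.01 in males"

# The corruption is a lossy, irreversible mapping: every ± collapses onto +.
assert intended.replace("\u00b1", "+") == extracted

# Reading only the corrupted text, "0.05+0.01" could mean a sum (0.06) or a
# mean-with-uncertainty (0.04 to 0.06); the distinction has been destroyed.
```

Because the mapping is many-to-one, no tool can recover the original with certainty; a miner can only flag the line as suspect.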

So authors should be able to refuse charges for typesetting, save over 100 USD, and thereby improve science.

BTW the same horror appears in the XML. So when the publishers tell you how wonderful XML is, make your own judgment.

There are other horrors of the same sort (besides plus-minus) in the document. Can you spot them?

The only good news is that ContentMine sets out to normalize and remove such junk. It will be a long slog, but if you are committed to proper communication of science, lend a hand.




FORCE2015 ContentMine Workshop/hack - we are going to index the scientific literature and clinical trials...

TL;DR We had a great session at FORCE2015 yesterday in Oxford - people liked it, understood it, and want to join us.

We ran a pre-conference workshop for 3 hours, followed by an extra hack session. This was open to all, and all sorts of people came, including:

  • library
  • publisher
  • academic
  • hacker
  • scholarly poor
  • legal
  • policy
  • campaigner

So we deliberately didn't have a set program but we promised that anyone could learn about many of the things that ContentMine does and get their hands dirty. Our team presented the current state of play and then we broke into subgroups looking at legal/policy, science, and techie.

ContentMining is at a very early stage and the community, including ContentMine, is still developing tools and protocols. There's a lot to know and a certain amount of misunderstanding and disinformation. So very simply:

  • facts are uncopyrightable
  • large chunks of scientific publications are facts
  • in the UK we have the legal right to mine these documents for facts for non-commercial activity / research
  • the ContentMine welcomes collaborators who want to carry out this activity - it's inclusive - YOU are part of US. ContentMine is not built centrally but by volunteers.
  • Our technology is part alpha, part beta. "alpha" means that it works for us, and so yesterday was about the community finding out whether it worked for them.

And it did. The two aspects yesterday were (a) scraping and (b) regexes in AMI. The point is that YOU can learn how to do these in about 30 minutes. That means that YOU can build your bit of the Macroscope ("information telescope") that is ContentMine. Rory's interested in farms, so he, not we, is building regexes for agriculture. (A week ago he didn't know what a regex was.) Yesterday the community built a scraper for PeerJ - so if you want anything from that journal, it's now added to the repertoire (and available to anyone). We've identified clinical trials as one of the areas that we can mine - and we'd love volunteers here.

What can we mine? Anything factual from anywhere. What are facts (asked by one publisher yesterday)? There's the legal answer ("what the UK judge decides when the publisher takes a miner to court") and I hope we can move beyond that - that publishers will recognize the value of mining and want to promote a community approach. Operationally it's anything which can be reliably parsed by machine into a formal language and regenerated without loss. So here are some facts: "DOI 123456 contains..."

  • this molecule
  • this species
  • this star, galaxy
  • this elementary particle.

and relationships ("triples" in RDF-speak)

  • [salicylic acid] [was dissolved in] [methanol]
  • [23] [fairy penguins] [breed] [in St Kilda, VA]

Everything in [...] is precisely definable in ontologies and can be precisely annotated by current ContentMine technologies.
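As a sketch of how such triples might be held and queried in code - plain tuples here for illustration, where a real system would use RDF with ontology URIs for each term:

```python
# Each fact is a (subject, predicate, object) triple, as in the examples above.
triples = [
    ("DOI 123456", "contains", "salicylic acid"),
    ("salicylic acid", "was dissolved in", "methanol"),
    ("fairy penguins", "breed in", "St Kilda, VA"),
]

def query(triples, predicate):
    """Return all (subject, object) pairs linked by a given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]

solvents = query(triples, "was dissolved in")
```

The same structure works for any of the fact types listed above (molecules, species, stars); only the vocabulary of predicates changes.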

We can do chemistry (in depth), phylogenetics, agriculture, etc. but what about clinical trials? So we need to build:

  • a series of scrapers for appropriate journals
  • a series of regexes for terms in clinical trials. "23 adult females between the ages of ...".
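For example, a first-cut cohort regex - my own illustrative pattern, not one of the actual ContentMine/AMI regex sets - might capture phrases like the one above:

```python
import re

# Hypothetical pattern for trial-cohort phrases such as
# "23 adult females between the ages of 18 and 45".
COHORT = re.compile(
    r"(?P<count>\d+)\s+adult\s+(?P<sex>males?|females?)"
    r"\s+between\s+the\s+ages\s+of\s+(?P<lo>\d+)\s+and\s+(?P<hi>\d+)"
)

m = COHORT.search("We recruited 23 adult females between the ages of 18 and 45.")
cohort = m.groupdict()
# cohort -> {"count": "23", "sex": "females", "lo": "18", "hi": "45"}
```

A usable plugin would need many such patterns (age ranges phrased differently, units, exclusion criteria), but each one is small enough for a newcomer to write after a short tutorial - which is exactly the point.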

For the really committed and excited we will also be able to analyze tables, figures and phrases in text using Natural Language Processing. So if this is you, and you are committed, then it will be very exciting.






FORCE2015 Workshop: How ContentMine works for you and what you can bring

TL;DR: We outline the tools and pipeline which ContentMine will show on Sunday at FORCE2015. They are very general and accessible to everyone...

ContentMine's technology and community are maturing quickly. We've just had a wonderful three days in Berlin with Johnny West, a co-Shuttleworth Fellow. Johnny runs OpenOil - a project to find public information about the extractive industries (oil/gas, mining). Technically his tasks and ours are very similar - the information is there but hard to find and locked in legacy formats. So at the last Shuttleworth gathering we suggested we should have a hack/workshop to see how we could help each other.

I thought this would initially be about OCR, but it's actually turned out that our architecture for text analysis and searching is exactly what Openoil needs. By using regexes on HTML (or PDF-converted-to-HTML) we can find company names and relations, aspects of contracts etc. The immediate point is that ContentMine can be used out-of-the-box for a wider range of information tasks.


  1. We start with a collection of documents. Our mainstream activity will be all papers published in a day - somewhere between 2000 and 3000 (no one quite knows). We need a list of those, and there are several sources such as CrossRef or JournalTOCs. We may also use publishers' feeds. The list is usually a list of references - DOIs or URLs - which we use in the scraping. But we can also use other sources such as repositories. (We'd love to find people at Force2015 who would like their repositories searched and indexed - including for theses, which are currently very badly indexed indeed.) ContentMine can also be used on personal collections such as hard drives.
  2. The links are then fed to Richard Smith-Unna's quickscrape, which can determine all the documents associated with a publication (PDF text, HTML text, XML, supplementary files, images, DOCX, etc.). This needs volunteers to write scrapers, but quite a lot of this has already been done. A scraper for a journal can often be written in 30 minutes and there's no special programming required. This introduces the idea of community. Because ContentMine is Open and kept so, the contributions will remain part of the community. We're looking for community - this is "us", not "we"-and-"you". And the community has already started, with Rory Aaronson (also a Shuttleworth Fellow) starting a sub-project on agriculture. We're going to find all papers that contain farming terms and extract the FACTs.
    The result of scraping is a collection of files. They're messy and irregular - some articles have only a PDF, others have tens of figures and tables. Many are difficult to read. The following stages make them usable.
  3. The next stage is normalization (Norma). The result of Norma's processing is tagged, structured HTML - "scholarly HTML" ( ), which a group of us designed 3 years ago. At that time we were thinking of authoring, but because proper scholarship closes the information loop, it's also an output.
    Yes, Scholarly HTML is a universal approach to publishing. Because HTML can carry any general structure, because it can host foreign namespaces (MathML, CML), because it has semantic graphics (SVG), and because it has tables and lists and manages figures and links, it has everything. So Norma will turn everything into sHTML/SVG/PNG.
    That's a massive step forward. It means we have a single, simple, tested, supported format for everything scholarly.
    Norma has to do some tricky stuff. PDF has no structure and much raw HTML is badly structured. So we use sections for the different parts and roles in the document (abstract, introduction, ... references, licence...). That means we can restrict the analysis to just one or a few parts of the article ("tagging"). That's a huge win for precision, speed and usability. A paper about E. coli infection ("Introduction" or "Discussion") is very different from one that uses E. coli as a tool for cloning ("Materials").
  4. So we now have normalized sHTML. AMI provides a wide, community-fuelled set of services to analyze and process this. There are at least three distinct tasks: (a) indexing (information retrieval, or classification), where we want to know what sort of a paper it is and find it later; (b) information extraction, where we pull out chunks of Facts from the paper (e.g. all the chemical reactions); and (c) transformation, where we create something new out of one or more papers - for example calculating the physical properties of materials from the chemical composition.
    AMI does this through a plugin architecture. Plugins can be very sophisticated, such as OSCAR and Chemicaltagger, which recognise and interpret chemical names and phrases and are large Java programs in their own right, or Phylotree, which interprets pixel diagrams and turns them into semantic NexML trees. These took years. But at the other end we can search text for concepts using regular expressions, and our Berlin and OpenFarm experience shows that people can learn these in 30 minutes!
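The stages above can be sketched end-to-end as a miniature pipeline. Everything here - function names, fields, the sample text - is illustrative pseudo-structure, not the actual quickscrape/Norma/AMI interfaces:

```python
# Hypothetical miniature of the pipeline: crawl -> scrape -> normalise -> analyse.

def crawl():
    """Stage 1: collect the day's article links (DOIs/URLs)."""
    return ["https://doi.org/10.0000/example.1"]

def scrape(url):
    """Stage 2 (quickscrape's role): fetch the files for one article."""
    return {"url": url,
            "html": "Materials: E. coli was used as a tool for cloning."}

def normalise(doc):
    """Stage 3 (Norma's role): section and tag the text."""
    doc["sections"] = {"materials": doc["html"].split("Materials: ")[1]}
    return doc

def analyse(doc, term, section):
    """Stage 4 (AMI's role): search only the named tagged section."""
    return term in doc["sections"].get(section, "")

docs = [normalise(scrape(u)) for u in crawl()]
hits = [d["url"] for d in docs if analyse(d, "E. coli", "materials")]
```

Restricting the search to the "materials" section is what gives the precision win described above: a paper merely discussing E. coli infection in its Introduction would not match.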

In summary, then, we are changing the way that people search for scientific information, and changing the power balance. Currently people wait passively for large organizations to create push-button technology. If it does what you want, fine (perhaps, if you don't mind snoop-and-control); if it doesn't, you're hosed. With ContentMine YOU == WE decide what we want to do, and then just do it.

We/you need y/our involvement in autonomous communities.

Join us on Sunday. It's all free  and Open.

It will range from very straightforward (using our website) to running your own applications in a downloadable virtual machine which you control. No programming experience required, but bring a lively mind. It will help if we know how many are coming, and if you download the virtual machine beforehand, just to check it works. It's easy, but it takes a bit of time to download 1.8 GB.






ContentMine Update and FORCE2015; we read and index the daily scholarly literature

We've been very busy and I haven't blogged as much as I'd liked. Here's an update and news about immediate events.

Firstly, a welcome to Graham Steel (McDawg), who is joining us as community manager. Graham is a massive figure in the UK and the world in fighting for Open. We've known each other for several years. Graham is a tireless, fearless fighter for access to scholarly information. He's one of the #scholarlypoor (i.e. not employed by a rich university) so he doesn't have access to the literature. Nonetheless he fights for justice and access.

Here's a blog post from 4 years ago where I introduced him. He'll be with us this weekend at FORCE2015 - more later.

We have made large advances in the ContentMine technology. I'm really happy with the architecture which CottageLabs, Richard Smith-Unna and I have been hacking. Essentially we automate the process of reading the daily scientific literature - between 1000 and 4000 articles, depending on what you count. Each is perhaps 5-20 pages, many with figures. Our tools (quickscrape, Norma, and AMI) carry out the process of:

  • scraping: downloading all the components of a paper (XML, HTML, PDF, CSV, DOC, TXT, etc.)
  • normalising and tagging the papers: we convert PDF and XML to HTML5, which is essentially Scholarly HTML. We extract the figures and interpret them where possible. We also identify the sections and tag them, so - for example - we can look at just the Materials and Methods section, or just the Licence.
  • indexing and transformation (AMI): AMI now has several well-tested plugins - chemistry, species, sequences, phylogenetic trees - and, more generally, Regular Expressions designed for community creation.

Mark MacGillivray and colleagues have created a lovely faceted search index so it's possible to ask scientific questions with a facility and precision that we think is completely novel.

We're doing a workshop on this at FORCE2015 next Sunday (Jan 11) for 3 hours, with hacking thereafter. The software is now easily used on our website or distributed in virtual machines. Everything is Open, so there is no control by third parties. The workshop will start by searching for species, and then move on to building custom searches and browsing. For those who can't be there, Graham/McDawg is hoping to create a livestream - but no promises.

I've spent a wonderful 3 days in Berlin with fellow Shuttleworth Fellow Johnny West. Johnny's OpenOil project is about creating Open information about the extractive industries. It turns out that the technology we are using in ContentMine is extremely useful for understanding corporate reports. So I've been hacking corporate structure diagrams, which are extremely similar to metabolic networks or software flowcharts.

More later, as we have to wrap up the hack....

