
Content Mining Hackday in Cambridge this Friday 20150123 all welcome

We are having a ContentMine hackday - open to all - this Friday in Cambridge.

We are VERY grateful to Laura James, from our Advisory Board, who also set up the Cambridge Makespace where the event will be held. This event will cover everything - technical, science, sociolegal, etc. We are delighted that Professor Charles Oppenheim, another of our Advisory Board, will be present. Charles is a world expert on scholarship including the policy and legality of mining. For example he flagged up today that the EU and its citizens are pushing for reform...

We're also expecting colleagues from Cambridge University Library so we can have a lively political stream. And we've got scientific publishers in Cambridge - love to see you.

There'll be a technical stream - integrating the components of quickscrape, Norma, AMI and our API created by Mark MacGillivray and colleagues at CottageLabs. All the technology is brand new and everything is offered Openly (including commercial use).

And there'll be a group of subprojects based on scientific disciplines. They include:

  • clinical trials
  • farming and agronomy
  • crystallography

If you have an area you'd like to mine, come along. You'll need to have a good idea of your sources (journals, theses, etc.), and some idea of what you'd like to extract. And, ideally, you'll need energy and stamina and friends...

Oh, and in the unlikely event you get bored we are 15 metres from the Cambridge Winter Beer Festival.

This month's typographical horror: Researchers PAY typesetters to corrupt information

One of the "benefits" we get from paying publishers to publish our work is that they "typeset" it. Actually they don't. They pay typesetters to mutilate it. I don't know how much they pay but it's probably > 10 USD per page. This means that when you pay APCs (Article Processing Charges) YOU are paying typesetters - maybe 200 USD.

Maybe you or your funder is happy with this?

I'm not. Typesetters destroy information. Badly. Often enough to blur or change the science. ALL journals do this. I happen to be hacking PLoSONE today, but this is unlikely to be specific to them:


So what's the typographical symbol/s in the last line? Hint. It's NOT what it SHOULD be

Unicode Character 'PLUS-MINUS SIGN' (U+00B1)


So what's happened? Try cutting and pasting the last line into a text editor. Mine gives:

(TY/SVL = 0.05+0.01 in males, 0.06+0.01 in females versus 0.08+0.01 in both sexes in L.

This is a DESTRUCTION of information.

So authors should be able to refuse charges for typesetting and save over 100 USD, and thereby improve science.

BTW the same horror appears in the XML. So when the publishers tell you how wonderful XML is, make your own judgment.

There are other horrors of the same sort (besides plus-minus) in the document. Can you spot them?

The only good news is that ContentMine sets out to normalize and remove such junk. It will be a long slog, but if you are committed to proper communication of science, lend a hand.
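If you want to check what you actually got when you cut and paste, a few lines of Python will tell you. This is only a minimal sketch and not part of ContentMine's own tooling: the function name `describe_non_ascii` is mine, and the two strings are shortened versions of the line quoted above. It lists every non-ASCII character with its code point and Unicode name, so a missing PLUS-MINUS SIGN shows up at once.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    report = []
    for ch in text:
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            report.append((ch, f"U+{ord(ch):04X}", name))
    return report


# What the line SHOULD contain, according to the post: U+00B1.
intended = "TY/SVL = 0.05\u00b10.01 in males"
# What cut-and-paste gave: a plain ASCII plus.
pasted = "TY/SVL = 0.05+0.01 in males"

print(describe_non_ascii(intended))  # one entry, the PLUS-MINUS SIGN
print(describe_non_ascii(pasted))    # empty list: the sign has gone
```

An empty report for the pasted line means the plus-minus has been replaced by something that is plain ASCII, which is exactly the loss of information described above.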



FORCE2015 ContentMine Workshop/hack - we are going to index the scientific literature and clinical trials...

TL;DR We had a great session at FORCE2015 yesterday in Oxford - people liked it, understood it, and are wanting to join us.

We ran a pre-conference workshop for 3 hours followed by an extra hack. This was open to all and all sorts of people came including:

  • library
  • publisher
  • academic
  • hacker
  • scholarly poor
  • legal
  • policy
  • campaigner

So we deliberately didn't have a set program but we promised that anyone could learn about many of the things that ContentMine does and get their hands dirty. Our team presented the current state of play and then we broke into subgroups looking at legal/policy, science, and techie.

ContentMining is at a very early stage and the community, including ContentMine, is still developing tools and protocols. There's a lot to know and a certain amount of misunderstanding and disinformation. So very simply:

  • facts are uncopyrightable
  • large chunks of scientific publications are facts
  • in the UK we have the legal right to mine these documents for facts for non-commercial activity / research
  • the ContentMine welcomes collaborators who want to carry out this activity - it's inclusive - YOU are part of US. ContentMine is not built centrally but by volunteers.
  • Our technology is part alpha, part beta. "alpha" means that it works for us, and so yesterday was about the community finding out whether it worked for them.

And it did. The two aspects yesterday were (a) scraping and (b) regexes in AMI. The point is that YOU can learn how to do these in about 30 mins. That means that YOU can build your bit of the Macroscope ("information telescope") that is ContentMine. Rory's interested in farms, so he, not us, is building regexes for agriculture. (A week ago he didn't know what a regex was.) Yesterday the community built a scraper for PeerJ - so if you want anything from that, it's now added to the repertoire (and available to anyone). We've identified clinical trials as one of the areas that we can mine - and we'd love volunteers here.

What can we mine? Anything factual from anywhere. What are facts (asked by one publisher yesterday)? There's the legal answer ("what the UK judge decides when the publisher takes a miner to court") and I hope we can move beyond that - that publishers will recognize the value of mining and want to promote a community approach. Operationally it's anything which can be reliably parsed by machine into a formal language and regenerated without loss. So here are some facts: "DOI 123456 contains..."

  • this molecule
  • this species
  • this star, galaxy
  • this elementary particle.

and relationships ("triples" in RDF-speak)

  • [salicylic acid] [was dissolved in] [methanol]
  • [23] [fairy penguins] [breed] [in St Kilda, VA]

Everything in [...] is precisely definable in ontologies and can be precisely annotated by current ContentMine technologies.
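For those who like to see it written down, here is one way such triples might be held in a program. It's a rough sketch only: the `Triple` class and the `facts_about` helper are my own illustration, not ContentMine's data model, and I have folded the count "23" into the subject to keep to three slots.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


# The two example relationships from the text above.
facts = [
    Triple("salicylic acid", "was dissolved in", "methanol"),
    Triple("23 fairy penguins", "breed", "in St Kilda, VA"),
]


def facts_about(term, triples):
    """Return the triples whose subject mentions the given term."""
    return [t for t in triples if term in t.subject]


for t in facts_about("penguins", facts):
    print(t.subject, "|", t.predicate, "|", t.obj)
```

In real use each slot would point to an ontology term rather than a bare string, but the shape is the same.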

We can do chemistry (in depth), phylogenetics, agriculture, etc. but what about clinical trials? So we need to build:

  • a series of scrapers for appropriate journals
  • a series of regexes for terms in clinical trials. "23 adult females between the ages of ...".
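To give a flavour of the second item, here is a minimal sketch of a regex for a phrase like the one quoted. The pattern, the group names and the test sentence are all invented for illustration; they are not taken from AMI, and a real clinical-trials vocabulary would need far more variants.

```python
import re

# Hypothetical pattern: <count> <age group> <sex> between the ages of <low> and <high>
PARTICIPANTS = re.compile(
    r"(?P<count>\d+)\s+"
    r"(?P<group>adult|adolescent|elderly)\s+"
    r"(?P<sex>females|males|women|men)\s+"
    r"between the ages of\s+(?P<low>\d+)\s+and\s+(?P<high>\d+)",
    re.IGNORECASE,
)

# Invented sentence, in the style of a trial report.
sentence = "We recruited 23 adult females between the ages of 18 and 45."

match = PARTICIPANTS.search(sentence)
if match:
    # Each named group becomes one extracted value.
    print(match.groupdict())
```

The named groups are the point: each one is a candidate fact (how many, who, what age range) that can be indexed separately.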

For the really committed and excited we will also be able to analyze tables, figures and phrases in text using Natural Language Processing. So if this is you, and you are committed, then it will be very exciting.





FORCE2015 Workshop: How ContentMine works for you and what you can bring

TL;DR. We outline the tools and pipeline which ContentMine will show on Sunday at Force2015. They are very general and accessible to everyone....

ContentMine technology and community is maturing quickly. We've just had a wonderful three days in Berlin with Johnny West, a co-Shuttleworth Fellow. Johnny runs OpenOil - a project to find public information about the extractive industries (oil/gas, mining). Technically his tasks and ours are very similar - the information is there but hard to find and locked in legacy formats. So at the last Shuttleworth gathering we suggested we should have a hack/workshop to see how we could help each other.

I thought this would initially be about OCR, but it's actually turned out that our architecture for text analysis and searching is exactly what Openoil needs. By using regexes on HTML (or PDF-converted-to-HTML) we can find company names and relations, aspects of contracts etc. The immediate point is that ContentMine can be used out-of-the-box for a wider range of information tasks.


  1. We start with a collection of documents. Our mainstream activity will be all papers published in a day - somewhere between 2000 and 3000 (no one quite knows). We need a list of those and there are several sources such as CrossRef or JournalToCs. We may also use publishers' feeds. The list is usually a list of references - DOIs or URLs which we use in the scraping. But we can also use other sources such as Repositories. (We'd love to find people at Force2015 who would like their repositories searched and indexed - including for theses (which are currently very badly indexed indeed)). But ContentMine can also be used on personal collections such as hard drives.
  2. The links are then fed to Richard Smith-Unna's quickscrape which can determine all the documents associated with a publication (PDF text, HTML text, XML, supplementary files, images, DOCX, etc.). This needs volunteers to write scrapers but quite a lot of this has already been done. A scraper for a journal can often be written in 30 minutes and there's no special programming required. This introduces the idea of community. Because ContentMine is Open, and kept so, the contributions will remain part of the community. We're looking for community and this is "us", not "we"-and-"you". And the community has already started with Rory Aaronson (also Shuttleworth Fellow) starting a sub-project on agriculture. We're going to find all papers that contain farming terms and extract the FACTs.
    The result of scraping is a collection of files. They're messy and irregular - some articles have only a PDF, others have tens of figures and tables. Many are difficult to read. We are going to scrape these and make them usable.
  3. The next stage is normalization (Norma). The result of Norma's processing is tagged, structured HTML - or "scholarly HTML", which a group of us designed 3 years ago. At that time we were thinking of authoring, but because proper scholarship closes the information loop, it's also an output.
    Yes, Scholarly HTML is a universal approach to publishing. Because HTML can carry any general structure, and because it can host foreign namespaces (MathML, CML), and because it has semantic graphics (SVG), and because it has tables and lists, and because it manages figures and links, it has everything. So Norma will turn everything into sHTML/SVG/PNG.
    That's a massive step forward. It means we have a single simple tested supported format for everything scholarly.
    Norma has to do some tricky stuff. PDF has no structure and much raw HTML is badly structured. So we have to use sections for different parts and roles in the document (abstract, introduction, ... references, licence...). That means we can restrict the analysis to just one or a few parts of the article ("tagging"). That's a huge win for precision, speed and usability. A paper about E. coli infection ("Introduction" or "Discussion") is very different from one that uses E. coli as a tool for cloning ("Materials").
  4. So we now have normalized sHTML. AMI provides a wide and community-fuelled set of services to analyze and process this. There are at least three distinct tasks: (a) indexing (information retrieval, or classification), where we want to know what sort of a paper it is and find it later; (b) information extraction, where we pull out chunks of Facts from the paper (e.g. all the chemical reactions); and (c) transformation, where we create something new out of one or more papers - for example calculating the physical properties of materials from the chemical composition.
    AMI does this through a plugin architecture. These can be very sophisticated, such as OSCAR and Chemicaltagger which recognise and interpret chemical names and phrases and are large Java programs in their own right. Or Phylotree which interprets pixel diagrams and turns them into semantic NexML trees. These took years. But at the other end we can search text for concepts using regular expressions and our Berlin and OpenFarm experience shows that people can learn these in 30 minutes!
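The idea of restricting a regex search to tagged sections can be sketched in a few lines. Everything here is assumed for illustration: the dictionary is a stand-in for whatever Norma actually produces, the section names and texts are invented, and `sections_mentioning` is not an AMI function.

```python
import re

# Hypothetical stand-in for a normalized paper: section name -> plain text.
paper = {
    "introduction": "E. coli infection remains a major cause of disease.",
    "materials": "Plasmids were cloned in E. coli DH5-alpha.",
    "discussion": "Our results suggest a new route to treatment.",
}

# A very simple regex for one species name.
SPECIES = re.compile(r"\bE\. coli\b")


def sections_mentioning(pattern, sections, only=None):
    """Return names of sections whose text matches, optionally restricted to `only`."""
    if only is None:
        wanted = sections
    else:
        wanted = {name: text for name, text in sections.items() if name in only}
    return [name for name, text in wanted.items() if pattern.search(text)]


print(sections_mentioning(SPECIES, paper))                      # every matching section
print(sections_mentioning(SPECIES, paper, only={"materials"}))  # the "tool for cloning" sense
```

Searching only "materials" picks out papers that use E. coli as a tool, while searching "introduction" or "discussion" would favour papers that are about it.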

In summary, then, we are changing the way that people will search for scientific information, and changing the power balance. Currently people wait passively for large organizations to create push-button technology. If it does what you want, fine (perhaps, if you don't mind snoop-and-control); if it doesn't, you're hosed. With ContentMine YOU == WE decide what we want to do, and then just do it.

We/you need y/our involvement in autonomous communities.

Join us on Sunday. It's all free and Open.

It will range from very straightforward (using our website) to running your own applications in a downloadable virtual machine which you control. No programming experience required but bring a lively mind. It will help if we know how many are coming and if you download the virtual machine beforehand, just to check it works. It's easy, but it takes a bit of time to download 1.8 GB.





ContentMine Update and FORCE2015; we read and index the daily scholarly literature

We've been very busy and I haven't blogged as much as I'd liked. Here's an update and news about immediate events.

Firstly to welcome Graham Steel (McDawg) who is joining us as community manager. Graham is a massive figure in the UK and the world in fighting for Open. We've known each other for several years. Graham is a tireless, fearless fighter for access to scholarly information. He's one of the #scholarlypoor (i.e. not employed by a rich university) so he doesn't have access to the literature. Nonetheless he fights for justice and access.

Here's a blog post from 4 years ago where I introduce him and McDawg. He'll be with us this weekend at FORCE2015, more later.

We have made large advances in the ContentMine technology. I'm really happy with the architecture which Cottagelabs, Richard Smith-Unna and I have been hacking. Essentially we automate the process of reading the daily scientific literature - this is between 1000 and 4000 articles depending on what you count. Each is perhaps 5-20 pages, many with figures. Our tools (quickscrape, Norma, and AMI) carry out the process of

  • scraping (downloading all the components of a paper: XML, HTML, PDF, CSV, DOC, TXT, etc.)
  • Normalising and tagging the papers. We convert PDF and XML to HTML5, which is essentially Scholarly HTML. We extract the figures and interpret them where possible. We also identify the sections and tag them, so - for example - we can look at just the Materials and Methods section, or just the Licence.
  • indexing and transformation (AMI). AMI now has several well tested plugins: chemistry, species, sequences, phylogenetic trees, and more generally Regular expressions designed for community creation.

Mark MacGillivray and colleagues have created a lovely faceted search index so it's possible to ask scientific questions with a facility and precision that we think is completely novel.

We're doing a workshop on this at FORCE2015 next Sunday (Jan 11) for 3 hours and hacking thereafter. The software is now easily used on or distributable in virtual machines. Everything is Open, so there is no control by third parties. The workshop will start by searching for species, and then move on to building custom searches and browsing. For those who can't be there, Graham/McDawg is hoping to create a livestream - but no promises.

I've spent a wonderful 3 days in Berlin with fellow Shuttleworth fellow Johnny West. Johnny's OpenOil project is about creating Open information about the extractive industries. It turns out that the technology we are using in ContentMine is extremely useful for understanding corporate reports. So I've been hacking corporate structure diagrams which are extremely similar to metabolic networks or software flowcharts.

More later, as we have to wrap up the hack....


Wiley's "Free to read" actually means "pay 35 USD"


I got the above unwanted Twitter from Wiley (I have checked as far as possible that it's genuine). It seems to be Wiley advertising a free to read article. I have pasted the message so you can try this at home:

Progress in #nanotechnology within the last several decades review from @unifr is #freetoread!

I check the poster and it seems to be a genuine site. So off I go to get my free copy (sorry, my free set of photons for sighted readers)...



I click the "View Full Article (HTML)" and get...


So Wiley equate "35 USD" with "free to read".

I don't.

I'm sure it's a BUMP-ON-THE-ROAD (Elsevier excuse).

But this is the independent fourth publisher foul-up I have got in the last four days. We pay them 20 Billion USD and they can't get it right.


How publishers destroy science: Elsevier's XML, API and the disappearing chemical bond. DO NOT BUY XML

TL;DR Elsevier typesetting turns double bonds into garbage.

Those of you who follow this blog will know that I contend that publishers corrupt manuscripts and thereby destroy science.

Those of you who follow this blog will know that Elsevier publicly stated that I could not use the new "Hargreaves" law to mine articles on their web page and I must do this through their API. Originally there were zillions of conditions, which - under our constant criticism - have gradually (but nowhere completely) disappeared. They now allow me to mine from the web page, but insist that their XML-API gives better content.

I have consistently refused to use Elsevier's API for legal, political and social reasons (I don't want to sign my rights away, be monitored, have to ask permission, etc.). But I also know from at least 5 years of trying to interpret publishers' PDFs and HTML that information is corrupted. By this I mean that what the author submits is turned into something different lexically, typographically and often semantically. (Yes, that means that by changing the way something looks, you can change its meaning).

Anyway yesterday Chris Shillum, who was part of the team I challenged, tweeted that he would let me have a paper - in XML format - from the Elsevier API. For those who don't know, XML is designed to hold information in a style-free form. It can be rendered by a stylesheet or program (e.g. FOP) into whatever font you like. I'm very familiar with XML having run the developers' list with Henry Rzepa in 1997 and been co-author of the universal SAX protocol. Henry and I have developed Chemical Markup Language (CML) precisely for the purpose of chemical publishing (among many other things).


But Elsevier don't use CML, they use typographers who know nothing about chemistry. At school you may have heard of a "double bond". It's normally represented by two lines between the atoms. We used to draw these with rapidographs, but now we type them. So every chemist in the world will type Carbon Dioxide as


capital-O equals capital-C equals capital-O

You can do it - nothing terrible happens. You can even search chemical databases using this. They all understand "equals".

But that's not good enough for Elsevier (and most of the others). It has to look "pretty". It's more important that a publication looks pretty than that it's correct. And that's one of the major ways they corrupt information. So here's the paper that Chris Shillum sent me.

First as a PDF.


Can you see the C=O double bond in the middle? "(C=O stretching)". It's no longer an equals, but a special publisher-only symbol they think looks prettier. Among other things if I search for "C=O" I won't find the double bond in the text. That's bad enough. But what's far worse is that this symbol has been included in their XML. And this gets transmitted to the HTML - which looks like (you can try this yourself).



What's happened??? Do you also see a square? The double bond has disappeared.

The square is Firefox saying "I have been given a character I don't understand and the best I can do is draw a square" - sorry. Safari does the same. Do ANY of you get anything useful? I doubt it.

Because Elsevier has created a special Elsevier-only method of displaying chemistry. It probably only works inside Elsevier's back room - it won't work in any normal browser. Here's what has happened.

Elsevier wanted a symbol to display a double bond. "Equals" - which all the rest of the world uses - isn't good enough. So they created their own special Elsevier-double-bond. It's not a standard Unicode codepoint - it's in a Private Use Area. This is reserved for a single organisation to use. It is not intended for unrestricted public use. In certain cases groups, with mutual agreement, have developed communities of practice. But I know of no community outside Elsevier that uses this. (BTW the XML uses 6 Elsevier-only DTDs and can only be understood by reading a 500-page manual - the chemistry is hidden somewhere at the end. This is the monstrosity that Elsevier wishes to force us to use.)
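You can test for this sort of thing yourself. The sketch below is my own illustration, not anything from Elsevier or ContentMine: `private_use_chars` is a made-up helper, and U+E000 is only a stand-in because I don't state here which code point is actually used. It relies on Python's `unicodedata.category`, which reports "Co" for Private Use characters.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, code point) for each character in a Unicode Private Use Area."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# What a chemist types.
typed = "(C=O stretching)"
# What the typeset text might contain; U+E000 is an ASSUMED stand-in code point.
typeset = "(C\ue000O stretching)"

print(private_use_chars(typed))    # nothing to report
print(private_use_chars(typeset))  # position and code point of the private character
print("C=O" in typed, "C=O" in typeset)  # the search finds one and misses the other
```

The last line is the practical damage: a plain text search for "C=O" succeeds on what the chemist typed and fails on what the typesetter produced.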

It's highly dangerous. If you change a double bond to a triple bond (ethylene => acetylene) it can explode and blow you up. But double and triple bonds are both represented by a hollow square if you try to view Elsevier-HTML. And goodness knows what else:

So Elsevier destroys information.

Chris Shillum tells me on Twitter that it's not a problem. But it is. Using the Private Use Area without the agreement of the community is utterly irresponsible. No one even knew that Elsevier was doing it.

Why's it irresponsible? Because many languages use it for other purposes. See Wikipedia above. Estonian, Tibetan, Chinese ... If an Elsevier-double-bond is used in these documents (e.g. an Estonian chemistry department) there will be certain corruption of both the chemistry and the Estonian. There are probably 10 million chemical compounds with double bonds and all will be corrupted.

But it's also arrogant. "We're Elsevier. We're not going to work with existing DTDs (XML specifications) - we're going to invent our own." Who uses it outside Elsevier? "And we are going to force text-miners to use this monstrosity."

And it's the combined arrogance and incompetence of publishers that destroys science during the manuscript processing. I've been through it. I know.



Publishers' typesetting destroys science: They are all as bad as each other. Can you spot the error?

I've just been trying to mine publicly visible scientific publications from scholarly publishers. (That's right - "publicly visible" - Hargreaves comes later).


They destroy the text. They destroy the images and diagrams. And we pay them money - usually more than a thousand dollars for this. Sometimes many thousands. And when I talk to them - which is regularly - they all say something like:

"Oh, we can't change our workflow - it would take years" (or something similar). As if this was a law of the universe.

Unfortunately it's a law of publishing arrogance. They don't give a stuff about the reader. There's no market forces - the only thing that the PublisherAcademic complex worries about is the shh-don't-mention-the-Impact-Factor.

And it's not just the TollAccess ones but also the OpenAccess ones. So today's destruction of quality comes from BMC. (I shall be even handed in my criticism).

I'm trying to get my machines to read HTML from BMC's site. Why HTML? Well, publisher's PDF is awful (I'll come to that tomorrow or sometime), whereas HTML is a standard of many years and so it's straightforward to parse. Yes,

unless it comes from a Scholarly publisher...

PUZZLE TODAY. What's (seriously) wrong with the following? [Kaveh, you will spot it, but give the others a chance to puzzle!] It's verbatim (I have added some CRs to make it readable):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
<html id="nojs" xmlns="" 
    xmlns:og="" xml:lang="en-GB" 
    lang="en-GB" xmlns:wb=“”>

<head> ... [rest of document snipped]

When you see it you'll be as horrified as I was. There is no excuse for this rubbish. Why do we put up with this?

Elsevier's Bumpy Road; Unacceptable licence metadata on "Open Access"

I am looking for Open Access articles to mine and since I have recently become an astrophysicist I started with

Can I mine it?


"Open Access" means virtually nothing. Let's try RightsLink, the tax-collector for the toll-access scholarly publishers. Normally a depressing experience, but here's a surprise.


Rightslink can't work out the licence.

(PMR can't work out the licence either). Sometimes it hides at the bottom of the document...


PMR still can't work it out. Is it Open?

The only certain thing is that even after years of mislabelling documents Elsevier is still incapable of reliably attaching licence information to documents.






Content Mining; thoughts from den Haag - can we aspire to universal knowledge?

I'm in den Haag (The Hague) for a meeting run by LIBER - the association of European Research Libraries - about Content Mining. Content Mining is often called TDM - Text and Data Mining - but it also applies to images and other media which contain uncopyrightable facts. This meeting is not primarily technical - it's about the socio-politico-legal issues in doing mining.

Mining can create Universal knowledge. I've just read a wonderful post from Pierre Estienne.

We share the same vision and the same deep-seated concern that publishers are destroying this vision. Just a few quotes - you should read it.

A Universal Library is a representation of science. Gathering all human knowledge in one place creates a monolithic artefact I call the Universal Library. It contains all of what Popper called the third world or world three: all of humankind’s literature.
As Popper said, “instead of growing better memories and brains, we grow paper, pens, pencils, typewriters, dictaphones, the printing press, and libraries.”, yet today brain-enhancing tools like libraries are scattered around the globe, and are (academic libraries especially) inaccessible for most of us. The Universal Library is the ultimate tool we can create in order to store and retrieve all of our knowledge easily.


“The internet is Gutenberg on steroids, a printing press without ink, overhead or delivery costs”. [Michael Scherer]... [PE continues]  Yet the internet isn’t seen this way by publishers. They still behave like books are a “scarce” commodity, while the internet allows unlimited distribution of books for free. If the publishers really embraced the internet, they would publish their books/journals for free, instead of charging exorbitant amounts of money for pdfs.


Google is a great tool, but it doesn’t have access to everything – scholarly publications especially are locked inside publishers’ databases and are behind paywalls – if you want to really get a good look at most of the literature, you have to switch between multiple tools: Google, Elsevier, Wiley, Springer’s databases, etc… It’s a very time consuming process the Universal Library should make fast and simple.


So, publishers. The “big three” (Wiley, Springer, Elsevier) and a few others retain a monopoly on scientific publications, and behave like a cartel, making deals to not compete with one another (just look at their prices, which are kept very high and are the same for all the different publishers). As they refuse to compete, they are very unlikely to change their business model. I’m surprised they haven’t been under investigation for antitrust… As they have the copyrights of most of the scientific publications in circulation, they can charge sky-high prices for simple pdfs, and they are quick to call “pirate” anyone who tries to make these papers more available.

[and lots more wonderfully clear, historically grounded stuff...]

Picking up immediate threads...

I am challenging Nature / Macmillan over their new "experiment" in releasing dumbed down (read-only) versions of the scholarly literature that "they own". They think it's a step forward. I think it's an assertion that they believe they control the scholarly literature. I'll blog more , but here's something to think about:

It costs Nature 30-45 THOUSAND dollars to process a single accepted article (usually 2-6 pages).

That's Nature's figures, this week. I'll rephrase that:

Nature take 30-45,000 USD out of the community (taxpayers and students-paying-fees) to create a single published article which, in most cases they control ("own")

There is no way this is moving towards a Universal Library. There are many dystopias which we can imagine - 1984, Fahrenheit 451, The Lives Of Others, and the life and death of Aaron Swartz.

So in contentmine.org we are developing a part of the universal library. We are starting with Open Access articles and then moving to legally-minable-facts-in-UK. This is not Universal. It's severely restricted by the publishing industry. If I step outside the lines they have drawn we shall be challenged. Diderot was challenged by the establishment - they, including the printers, destroyed his work.

But his soul shines brightly today and Pierre and I and many others honour him.

And ContentMine is starting to catch on. We've had a great contribution from Magnus Manske + Wikidata and it deserves a complete post on its own.

Off to LIBER, via Bezuidenhout...