This is the first post in (hopefully) a regular series on the development of Open Content Mining in scholarly articles (mainly STM = Science Technical Medical). It's also a call for anyone interested to join up as a community. This post describes the background – later ones will cover the technology, the philosophy and the contractual and legal issues. I shall use #opencontentmining as a running hashtag.
I'm using the term "content mining" as it's broader than "text-mining". It starts from a number of premises:
- The STM literature is expanding so quickly that no single human can keep up, even in their own field. There are perhaps 2 million articles a year, which is about 5,500 per day. Even reading just the titles at 1 per second would take about an hour and a half every day. Many of them might be formally "outside" your speciality but actually contain valuable information.
- A large part of scientific publication and communication is data. Everyone is becoming more aware of how important data is. It is essential for validating the science, and it can be combined with other data to create new discoveries. Yet most data is never published, and of the rest much ends up in the "fulltext" of the articles. (Note that "fulltext" is a poor term as there are lots of pictures and other non-text content. "Full content" would be logical, although misleading in that papers only report a small percentage of the work done.)
- The technology is now able to do some exciting and powerful things. Content-mining is made up of a large number of discrete processes, and as each one is solved (even partially) we get more value. This is combined with the increasing technical quality of articles (e.g. native PDF rather than camera-ready photographed text).
I used to regard PDF as an abomination. See my post from 6 years ago: http://blogs.ch.cam.ac.uk/pmr/2006/09/10/hamburgers-and-cows-the-cognitive-style-of-pdf/. I quoted the maxim "turning a PDF into XML is like turning a hamburger into a cow" (not mine, but I am sometimes credited with it). XML is structured semantic text. PDF is a (random) collection of "inkmarks on paper". Converting semantic text to PDF destroys huge amounts of information.
I still regard PDF as an abomination. I used to think that force of argument would persuade authors and publishers to change to semantic authoring. I still think that has to happen before we have modern scientific communication through articles.
But in the interim I and others have developed hamburger2cow technology. It's based on the idea that if a human can understand a "printed page" then a machine might be able to. It's really a question of encoding a large number of rules. The good thing is that machines don't forget rules and they have no limit to the size of their memory for them. So I have come to regard PDF as a fact of life and a technical problem to be tackled. I've spent the last 5 months hacking at it (hence few blog posts) and I think it's reached an alpha stage.
And also it is parallelisable at the human level. I and others have developed technology for understanding chemical diagrams in PDF. You can use that technology. If you create a tool that recognizes sequence alignments, then I can use it. (Of course I am talking Open Source – we share our code rather than restricting its use). I have created a tool that interprets phylogenetic trees – you don't have to. Maybe you are interested in hacking dose-response curves?
So little by little we build a system that is smarter than any individual scientist. We can ask the machine "what sort of science is in this paper?" and the machine will apply all the zillions of rules to every bit of information in an article. And the machine will be able to answer: "it's got 3 phylogenetic trees, 1 sequence alignment, 2 maps of the North Atlantic, and the species are all sea-birds".
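To make the "lots of small rules, contributed by different people" idea concrete, here is a minimal sketch in Python. It is not AMI2's code: the fragment dictionaries, the `recognizer` registry and both toy rules are hypothetical, invented only to show how independently written recognizers could be pooled and then asked "what sort of science is in this paper?".

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Optional

# Hypothetical representation: one extracted piece of a paper
# (a diagram, a block of text rows, ...) as a plain dictionary.
Fragment = Dict[str, Any]
Recognizer = Callable[[Fragment], Optional[str]]

# Shared registry: anyone can add a rule, everyone can use all of them.
RECOGNIZERS: List[Recognizer] = []


def recognizer(func: Recognizer) -> Recognizer:
    """Decorator that adds a rule to the shared registry."""
    RECOGNIZERS.append(func)
    return func


@recognizer
def phylogenetic_tree(fragment: Fragment) -> Optional[str]:
    # Toy rule (assumption): a branching diagram with labelled tips.
    if (fragment.get("kind") == "diagram"
            and fragment.get("branching")
            and fragment.get("tip_labels")):
        return "phylogenetic tree"
    return None


@recognizer
def sequence_alignment(fragment: Fragment) -> Optional[str]:
    # Toy rule (assumption): two or more equal-length rows of A/C/G/T/gap.
    rows = fragment.get("rows", [])
    same_length = len({len(row) for row in rows}) == 1
    if len(rows) >= 2 and same_length and all(set(row) <= set("ACGT-") for row in rows):
        return "sequence alignment"
    return None


def describe(fragments: List[Fragment]) -> Counter:
    """Apply every registered rule to every fragment and count the hits."""
    counts: Counter = Counter()
    for fragment in fragments:
        for rule in RECOGNIZERS:
            label = rule(fragment)
            if label is not None:
                counts[label] += 1
    return counts


# A made-up "paper": three tree diagrams and one alignment.
paper = [
    {"kind": "diagram", "branching": True, "tip_labels": ["Fulmarus", "Puffinus"]},
    {"kind": "diagram", "branching": True, "tip_labels": ["Morus", "Sula"]},
    {"kind": "diagram", "branching": True, "tip_labels": ["Alca", "Uria"]},
    {"kind": "text", "rows": ["ACGT-ACG", "ACGTTACG", "AC-TTACG"]},
]
summary = describe(paper)
for label, count in summary.items():
    print(count, label)
```

The point of the registry is the sharing model rather than the rules themselves: real recognizers would be far more elaborate, but a new one only has to follow the same small contract to become available to everybody.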
The machine is called AMI2. Some time ago we had a JISC project to create a virtual research environment, and we called the software "AMI". That was short for "the scientists' amanuensis". An amanuensis is a scholarly companion; Eric Fenby, for example, assisted the blind composer Frederick Delius by writing down the notes that Delius dictated. So AMI2 is the next step – a scientifically artificially intelligent program. (That's not as scary as it sounds – we are surrounded by weak AI everywhere, and it's mainly a question of glueing together a lot of mature technologies.)
AMI2 starts with two main technologies – text-mining and diagram-mining. Text-mining is very mature and could be deployed on the scientific literature tomorrow.
Except that the subscription-based publishers will send lawyers after us if we do. And that is 99% of the problem. They aren't doing text-mining themselves but they won't let subscribers do it either.
But there is 5% of the literature that can be text-mined – the part with a CC-BY licence. The best examples are BioMed Central and PLoS. Will 5% be useful? No-one knows, but I believe it will. And in any case it will get the technology developed. And there is a lot of interest from funders – they want their outputs to be mined.
So this post launches a community approach to content-mining. Anyone can take part as long as they make content and code Open (CC-BY-equiv and OSI F/OSS). Of course the technology can be deployed on closed content and we are delighted for that, but the examples we use in this project must be Open.
Open communities are springing up everywhere. I have helped to launch one – the Blue Obelisk – in chemistry. It's got 20 different groups creating interoperable code. It's run for 6 years and its code is gradually replacing closed code. A project in content-mining will be even more dynamic as it addresses unmet needs.
So here are some starting points. As with all bottom-up projects, expect them to change:
- We shall identify some key problems and people keen to solve them
- We'll use existing Open technology where possible
- We'll educate ourselves as we go
- We'll develop new technologies where they are needed
- Everything will be made Open on the web as soon as it is produced. Blogs, wikis, repositories – whatever works
If you are interested in contributing to #opencontentmining in STM please let us know. We are at alpha stage (i.e. you need to be prepared to work with and test development systems; there are no gold-plated packages, and there probably never will be). There's lots of scope for biologists, chemists, materials scientists, hackers, machine-learning experts, document-hackers (esp. PDF), lawyers, publishers, etc.
Content-mining is set to take off. You will need to know about it. So if you are interested in (soluble) technical challenges and contributing to an Open community let's start.
[More details in next post – maybe about phylogenetic trees.]