On Friday 2014-09-12 four of us from The ContentMine presented three papers at WOSP2014 (http://core-project.kmi.open.ac.uk/dl2014/). The meeting was well run by Petr Knoth and colleagues from the Open University and CORE (the JISC- and funder-supported project for aggregation of repositories). The meeting gave a useful overview of TextAndDataMining (TDM). From the program:
- The whole ecosystem of infrastructures including repositories, aggregators, text-and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
- Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
- Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.
This was an important meeting in several ways and I want to comment on:
- the general state of Content-mining
- Our own presentations
- Elsevier's uninvited presentation on "Elsevier's Text and Data Mining Policy". I'll split this into two parts in two separate posts.
Content mining and extraction have a long history, so this is now an incremental rather than a revolutionary technology. Much of the basis is "machine learning", where statistical measures are used for classification and identification (essentially how spam detectors work). So several papers dealt with classification based on the words in the main part ("full text") of the document. [NOTE: full-text is far better than abstracts for classifying documents. Always demand full-text.]
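To make the spam-detector analogy concrete, here is a minimal sketch of a word-based statistical classifier (naive Bayes with add-one smoothing). It is my own illustration, not the method of any paper at the meeting; the function names `train` and `classify` and the toy "chemistry"/"biology" documents are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per label. labelled_docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Prior: how common this label is among the training documents.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (invented for illustration only).
docs = [
    ("crystal structure of the protein ligand complex", "chemistry"),
    ("synthesis and crystal structure of a copper complex", "chemistry"),
    ("phylogenetic tree of bird species", "biology"),
    ("species diversity and phylogenetic analysis", "biology"),
]
model = train(docs)
print(classify(model, "crystal structure of a zinc complex"))
print(classify(model, "phylogenetic analysis of species"))
```

The more words the classifier sees per document, the better its counts, which is one reason full text beats abstracts.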
Then there's the problem of analysing semistructured information. A PDF is weakly structured - a machine doesn't know what order the characters are in, whether they form words, etc. Tables are particularly difficult and there were two papers on this.
And then aggregation, repositories, crawling etc. The most important takeaway for me is CORE (http://core-project.kmi.open.ac.uk/), which aggregates several hundred (UK and other) repositories. This is necessary because UK repositories are an uncontrolled mess. Each university does its own thing, has a different philosophy, and uses different indexing and access standards. The universities can't decide whether the repo is for authors' benefit, readers' benefit, the university's benefit, HEFCE's benefit or whether they "just have to have one because everyone else does". (By contrast the French have HAL http://en.wikipedia.org/wiki/Hyper_Articles_en_Ligne ). So UK repositories remained uncrawlable and unindexed until CORE (and even then there is much internal inconsistency and uncertain philosophy).
It's important to have metrics because otherwise we don't know whether something works. But there was too much emphasis on metrics (often quoted to 4 figures, most of them insignificant). One paper reported 0.38% recall with a strict method and 94% with a sloppier one. Is this really useful?
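A small sketch shows how much a recall figure depends on what counts as a "match". The matching rules and the toy caption labels below are my own assumptions for illustration; they are not taken from the paper in question.

```python
def recall(gold, predicted, match):
    """Fraction of gold items that are matched by at least one prediction."""
    if not gold:
        return 0.0
    found = sum(1 for g in gold if any(match(g, p) for p in predicted))
    return found / len(gold)

def strict(g, p):
    """Strict rule: the strings must be identical."""
    return g == p

def lenient(g, p):
    """Sloppier rule: ignore case and accept one string containing the other."""
    g, p = g.lower(), p.lower()
    return g in p or p in g

# Toy data (invented): what a human marked up versus what a tool extracted.
gold = ["Table 1", "Figure 2", "Table 3", "Figure 4"]
predicted = ["Table 1", "figure 2.", "table 3:", "Fig 5"]

print(recall(gold, predicted, strict))   # 0.25
print(recall(gold, predicted, lenient))  # 0.75
```

Same tool, same data, and the number triples just by changing the matching rule. With four gold items, quoting either figure to more than one or two digits would be meaningless.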
But virtually no one (I'll omit the keynotes) gave any indication of whether they were doing something useful to others outside their group. I talked to 2-3 groups and asked why they were working on (sentiment analysis | table extraction | classification). Did they have users? Was their code available? Was anyone else using their code? Take-up seems very small. Couple this with the fact that many projects have a 2-3 year lifespan and that the basis is competition rather than collaboration, and we see endless reinvention. (I've done table extraction but I'd much rather someone else did it, so we are working with tabulaPDF.) The output in academia is a publication, not a running reusable chunk of code.
So it's not surprising that there isn't much public acknowledgement of TDM. The tools are tied up in a myriad of university labs, often without code or continuity.
One shining exception is Lee Giles, whose group has built CiteSeer. Lee and I have known each other for many years and we worked together on a MicrosoftResearch project, OREChem. So when we got talking we found we had two bits of the jigsaw.
Readers of this blog will know that Ross Mounce and I are analysing diagrams. To do that we have to identify the diagrams, and this is best done from the captions (captions are the most important part of a scientific document for understanding what it's about). And we are hacking the contents of the images. Lee's colleague Sagnik Ray Choudhury is working on extracting and classifying the images, so these two fit together perfectly; that saves us huge time and effort in knowing what to process. I'm therefore planning to visit later this year.
For me that was probably the most important positive outcome of the meeting.
The next post will deal with Elsevier's Gemma Hersh who gave an uninvited "Presentation", and the one after with Elsevier's Chris Shillum's comment on Gemma's views and also on Elsevier's TaC.