It's often said by detractors and obfuscators that "there is no demand for content mining". It's difficult to show demand for something that isn't widely available and which people have been scared to use publicly. So this is an occasional post to show the very varied things that content mining can do.
It wouldn't be difficult to make a list of 101 things that a book can be used for. Or television. Or a computer (remember when IBM told the world that it only needed 10 computers?) Content mining of the public Internet is no different.
I'm listing them in the order they come into my head, and varying them. The primary target will be scientific publications (open or closed - FACTs cannot be copyrighted) but the technology can be applied to government documents, catalogues, newspapers, etc. Since most people probably limit "content" to words in the text (e.g. in a search engine) I'll try to enlarge the vision. I'll put in brackets the scale of the problem.
- Which universities in SE Asia do scientists from Cambridge work with? (We get asked this sort of thing regularly by Vice-Chancellors). By examining the list of authors of papers from Cambridge and the affiliations of their co-authors we can get a very good approximation. (Feasible now).
- Which papers contain grayscale images which could be interpreted as gels? A polyacrylamide gel (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules. [Image: a typical gel, Wikipedia CC-BY-SA.] Literally millions of such gels are published each year and they are highly diagnostic for molecular biology. They are always grayscale and have vertical tracks, so very characteristic. (Feasibility - good summer student project in simple computer vision using histograms).
- Find me papers in subjects which are (not) editorials, news, corrections, retractions, reviews, etc. Slightly journal/publisher-dependent but otherwise very simple.
- Find papers about chemistry in the German language. Highly tractable. Typical approach would be to find the 50 commonest words (e.g. "ein", "das",...) in a paper and show the frequency is very different from English ("one", "the" ...)
- Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial to extract references and authors. More difficult, of course, to disambiguate authors who share a name.
- Find uses of the term "Open Data" before 2006. Remarkably the term was almost unknown before 2006 when I started a Wikipedia article on it.
- Find papers where authors come from chemistry department(s) and a linguistics department. Easyish (assuming the departments have reasonable names and you have some aliases ("Molecular Sciences", "Biochemistry")...)
- Find papers acknowledging support from the Wellcome Trust. (So we can check for OA compliance...).
- Find papers with supplemental data files. Journal-specific but easily scalable.
- Find papers with embedded mathematics. Lots of possible approaches. Equations are often set off by whitespace, and the text contains non-ASCII characters (e.g. Greek letters, script letters, aleph, etc.) and makes heavy use of sub- and superscripts. A fun project for an enthusiast.
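To make the German-language item above concrete, here is a minimal sketch in Python of the common-word approach. The two word lists are short illustrative samples chosen by me (a real version would use the 50 or so commonest words per language), and the names `stopword_share` and `guess_language` are hypothetical, not part of any existing tool.

```python
import re
from collections import Counter

# Illustrative samples only (assumption): a real system would use ~50 words each.
GERMAN = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht",
          "mit", "von", "zu", "den", "auf", "für"}
ENGLISH = {"the", "and", "of", "to", "a", "is", "that", "with",
           "for", "one", "on", "as", "by", "not"}

def stopword_share(text, stopwords):
    """Fraction of the words in `text` that belong to `stopwords`."""
    # [^\W\d_]+ matches runs of letters, including umlauts and ß.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in stopwords) / len(words)

def guess_language(text):
    """Return 'de', 'en', or 'unknown' by comparing common-word frequencies."""
    de = stopword_share(text, GERMAN)
    en = stopword_share(text, ENGLISH)
    if de == en:
        return "unknown"
    return "de" if de > en else "en"
```

Run over the full text of each chemistry paper, something like `guess_language(paper_text) == "de"` would pick out the German ones. It will be unreliable on very short texts and on papers that mix languages.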
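One of the "lots of possible approaches" to the embedded-mathematics item can be sketched in the same way: measure how much of a text consists of characters from the Unicode blocks where Greek letters, sub- and superscripts, letterlike symbols (including aleph) and mathematical operators live. The choice of blocks, the 2% threshold, and the function names are all my assumptions for illustration and would need tuning on real papers.

```python
# Unicode blocks treated as "mathematical" (an assumption; extend as needed).
MATH_RANGES = [
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x2070, 0x209F),  # Superscripts and Subscripts
    (0x2100, 0x214F),  # Letterlike Symbols (includes the alef symbol)
    (0x2200, 0x22FF),  # Mathematical Operators
]

def is_math_char(ch):
    """True if the character falls in one of the blocks listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MATH_RANGES)

def math_density(text):
    """Fraction of non-whitespace characters that are 'mathematical'."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_math_char(c) for c in chars) / len(chars)

def looks_mathematical(text, threshold=0.02):
    """Crude flag: the threshold is a guess, not a measured value."""
    return math_density(text) >= threshold
```

This only looks at characters, so it will miss equations embedded as images and will wrongly flag papers written in Greek. Whitespace layout and sub/superscript markup would be separate signals to add.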
So that's just a start. I can probably get to 50 fairly easily but I'd love to have ideas from...
[The title may or may not allude to http://en.wikipedia.org/wiki/101_Uses_for_a_Dead_Cat ]