As regular readers will know we are applying text-mining to chemistry in Open theses. The problem is finding fully Open theses - so far we have got Alicia's. Alicia has captured all here molecules in semantic form so text-mining isn't required - and I'm hoping to do some fun stuff with XML on it.
I've searched for large collections of theses. MIT has a promising collection which would be ideal but they are only TollFree, not OpenAccess. I'm still appealing for readers to help. But in one of those quirks of Googling I ended up at the digital repository of the University of Stirling.
I am delighted about this since I spent 15 very happy years on the staff at Stirling in the Chemistry department. It doesn't have one now - that's why I left - although we're having our 40th anniversary later. But perhaps there are theses with chemical concepts Now the repository announces:
The copyright in theses in this collection remains with the author, unless it is stated to have been assigned to the University of Stirling. The University of Stirling reserves the right to keep electronic copies for consultation in both cases.
so I wasn't very hopeful. But I thought I'd have a look and found one in aquaculture - one of the successful disciplines in Stirling:
which carried the licence:
This item is protected by original copyright
Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
still not hopeful, until I read the license:
License granted by Kriengkrai Satapornvanit (firstname.lastname@example.org) on 2007-03-26T06:34:59Z (GMT):[...]END USER LICENCE This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 1.0 Licence. YOU ARE FREE: - to copy, distribute, display, and perform the work - to make derivative works Under the following conditions: ATTRIBUTION You must give the original author credit. NON COMMERCIAL You may not use this work for commercial purposes. SHARE ALIKE If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one. For any reuse or distribution, you must make clear to others the licence terms of this work. Any of these conditions can be waived if you receive permission from the author. Your fair dealings and other rights are in no way affected by the above.
Many public thanks, Kriengkrai Satapornvanit and I hope your future work prospers. Now I am completely free to see if chemicals can be mined from the thesis:
- I download the PDF.
- I convert it to ASCII using pdftotxt. This destroys the formatting, diacritics, tables, subscripts, superscripts - in fact almost everything except the words. And - for most theses - these are still in the right order. (Unlike Eric Morecambe, this is not a joke - PDF often has the words in an arbitrary order).
- I start OSCAR (http://oscar3.sf.net) as a Server and paste in the text. It takes about 30 seconds for OSCAR to read the whole thesis and interpret the chemistry. (This portable version of OSCAR does not have a complete lexicon - full versions need to be run on a server).
- OSCAR (originally dubbed the "journal eating-robot") eats the text. Here's a typical section:
You can see that OSCAR has recognised many words as likely chemical terms (in yellow) and knows the structure of the underlined ones (the full version would know all of them). It's not 100% accurate - you can see it thinks "P" is the element phosphorus - but Peter Corbett has addressed this in later versions.
So this allows us to collect metadata from theses automatically. OSCAR can tell us in a few seconds that this thesis is concerned with specific pesticides. That's part of the basis of the SPECTRa-T project. Since we've benefited from Open Source theses, maybe we should do the whole project on an Open Wiki...