As I have blogged (Electronic Theses (ETD2007) – June 8th, 2007), I shall be demonstrating the power of the eThesis next week at Uppsala. We now have technology that will identify the chemistry in a thesis and automatically re-use it in many ways. These include:
- machine extraction of metadata/terminology
- identification of named entities (especially chemical names)
- validation of the contents of the thesis
- validation of the structure of the thesis
- conversion of the thesis into different formats (e.g. to use SI units)
- comparison of similarity between theses
- linking theses to existing ontologies and other resources
For example, we could treat a thesis repository as a SPARQL endpoint.
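To make the SPARQL idea concrete, here is a minimal sketch of what asking such a repository a question might look like. The endpoint URL and the `ex:` vocabulary (including `ex:mentionsCompound`) are invented for illustration and do not refer to any existing repository; only the `query` request parameter and the Dublin Core `dc:title` property are standard. The code builds the request URL but does not send it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint and vocabulary: assumptions for illustration only.
ENDPOINT = "https://theses.example.org/sparql"

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ex: <https://theses.example.org/vocab#>
SELECT ?thesis ?title
WHERE {
  ?thesis dc:title ?title ;
          ex:mentionsCompound ex:benzene .
}
LIMIT 10
"""

def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_query_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY.strip())
```

A repository that exposed the chemistry extracted from its theses in this way could answer questions such as "which theses mention benzene?" without anyone having to read the PDFs.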
However, most existing theses, even if publicly visible, are not automatically re-usable without explicit permission. I’d very much like to have a few exemplars which I can show at the Uppsala meeting next week.
I’d be very grateful to hear from any reader who has a thesis (possibly their own) or a collection of theses (ideally already posted on the web) which:
- we can use for text-mining without further permission
- has an attribution for author and institution
- is likely to contain a significant amount of chemistry [1]
- is electronic and machine-readable (Word, PDF, NOT TIFF)
- can be made immediately available
[1] Many scientific areas which are not themselves chemistry (bioscience, materials, geoscience, environment) may contain chemical terminology, e.g. in methods and materials sections.
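As a rough illustration of why a non-chemistry thesis can still be useful, the sketch below spots chemical names in a made-up methods sentence using a tiny dictionary. This is not our actual software: the term list, the function name `find_chemical_terms` and the sample sentence are all assumptions, and a real system would need a far larger lexicon or a trained recogniser.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
CHEMICAL_TERMS = ["ethanol", "sodium chloride", "acetic acid", "dichloromethane"]

def find_chemical_terms(text, terms=CHEMICAL_TERMS):
    """Return (term, start, end) for each case-insensitive whole-word match."""
    # Try longer terms first so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(1).lower(), m.start(), m.end()) for m in pattern.finditer(text)]

# Invented sentence in the style of an environmental-science methods section.
methods = ("Soil samples were extracted with dichloromethane, rinsed with "
           "Ethanol and stored in sodium chloride solution.")

hits = find_chemical_terms(methods)
print([term for term, _, _ in hits])
# → ['dichloromethane', 'ethanol', 'sodium chloride']
```

The character offsets returned with each hit are what would let the names be marked up in place or linked to other resources.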
The purpose of this request is to develop ways of enhancing the value of theses, as described above. In general we would not expect our software to “discover new science” or to explicitly criticize the thesis; that would be unkind. (Although as eTheses become more widespread, you can expect this kind of automated scrutiny to become more common!)
(Please see WWMM for email address)