I am giving a talk at ETD2009 (Electronic Theses and Dissertations) in Pittsburgh, PA tomorrow on “the Semantic Scientific Thesis”. I would now like to blog the key points of my presentation, since I shall show a number of demos which aren’t easy to capture.
My argument will be that almost all theses are created with the “book” or “journal” metaphor and are based on flat text and flat images, with little if any linking or semantics. This is an increasingly outdated way of communicating with humans in today’s pervasive web, where we learn to interact with information from the cradle upwards.
It’s even worse for machines. Most scientific theses contain a wealth of data which can be used for data-driven science. When we combine observations and conclusions from different sources or different times, we can often get radically new insights.
What can we do with an electronic thesis in PDF? Print it. Because PDF has been designed to talk to printers, not humans. It has no semantics (in its usual form). If we are lucky we can search it for concepts in the text, but we cannot search the diagrams or the tables.
We also cannot use it as input into new science. It’s now quite common for a chemist to compute the properties of a molecule using quantum mechanics. But this normally means she has to type up all the data (either from a journal or a thesis) to do the calculation. And yet, if the theses were machine friendly, we could do this for thousands of chemistry theses a year.
So how do we create semantic theses? It’ll take us 10 years or more to work through the process, and it needs the creation of ontologies as we go. But there are simple first steps:
- preserve the actual document that the student authored. This is normally either Word/OpenOffice or LaTeX.
- preserve the key data files in the work.
These are easy to state and easy to do technically. Let’s not worry at this stage about exactly what the semantics of the data files are – we shouldn’t worry about migration or semantic preservation for civilisations who dig up our digital artefacts. Let’s just get everyone into the habit of saving their data. In many subjects there are de facto standards and even if there are many it’s a lot better than nothing.
So, Graduate Offices and Repository Rats, just allow data to be deposited alongside the theses. The tools are coming (I shall show TheOREm, where we have used the ORE RDF technology to manage components of a thesis).
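As a minimal sketch of what depositing data alongside a thesis could look like, a few lines of Python can list the authored document and its data files in one manifest with checksums. This is not TheOREm and it does not use ORE or RDF. The function names, the role labels and the JSON layout are all assumptions made up for illustration.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def describe_file(path: Path, role: str) -> dict:
    """Record name, role, size and SHA-256 checksum for one deposited file."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "role": role,  # hypothetical labels: "authored-document" or "data"
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_manifest(thesis_source: Path, data_files: list[Path]) -> dict:
    """List the authored document first, then the data files in sorted order."""
    components = [describe_file(thesis_source, "authored-document")]
    components += [describe_file(p, "data") for p in sorted(data_files)]
    return {"components": components}


def write_manifest(manifest: dict, target: Path) -> None:
    """Save the manifest as JSON next to the deposited files."""
    target.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A student might call `build_manifest(Path("thesis.tex"), [Path("spectrum.csv")])` (hypothetical file names) and deposit the resulting JSON with the files. Nothing here says what the data mean; it only records that they exist and have not changed, which is the habit being argued for.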
Let each student answer the question:
“If someone in my discipline or in my lab wanted to build on my work next year, have I made it easy for her to use my data?”
And let every examiner and every board make that a prerequisite of graduating.