I am honoured to be asked to speak at the meeting next week in Uppsala on electronic theses (The Power of the Electronic Scientific Thesis). (This resonates with the JISC meeting on repositories (Digital repositories: Dealing with the digital deluge) which I haven't yet been able to blog as our server is only just back up.) Some snippets:
Yet our own work in the SPECTRa project has shown that 80% (or more) of scientific data is never published.... Electronic theses have the power to change all this. The thesis has several major advantages over current methods of publication:
- the author and/or their institution retain complete control over the copyright of the work and are not forced to hand it over to the publisher
- there is a strict quality control system of internal and external examiners. The candidate has to convince them that the data are fit for purpose.
- the student cannot be "lazy" about the means of authoring. If a university insists on XML then the student will have to do it.
- an electronic thesis can (and I argue must) be openly available in an institutional repository.
- an unlimited amount of supporting data can be copublished.
There are technical and socio-political barriers.
- the thesis is often produced in some form of e-paper (TIFFs or PDF) which completely destroys all semantics
- XML tools are not yet universal
- there is no metadata for the scientific data
- the authors and their supervisors are afraid that someone might read the thesis and (a) show there are errors (b) re-use it in clever ways thus "scooping" the authors. (This is sometimes contaminated with the problems of patents and confidential human information - but there are well accepted mechanisms for this). There are no moral reasons why the average thesis should not be fully visible to the world and re-usable under the BOAI declaration.
- some universities have medieval rules of ownership and copyright, but enlightened ones now routinely post their theses.
My utopian vision is that students prepare their thesis in XML. This solves all the technical problems. It also will help the students to prepare better theses faster. For example students are often criticised for not having scientific units, omitting scales and labels on diagrams, missing out critical information, etc.
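As a minimal sketch of the kind of automatic check that XML authoring makes possible, the snippet below flags quantities that carry no units. The `<quantity>` element and its `name` and `units` attributes are invented here purely for illustration; they are not taken from CML or any other real schema, and a real thesis would be checked against the community language for its domain.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesis fragment: <quantity> and its "units" attribute are
# invented names for illustration, not taken from CML or any real schema.
THESIS_XML = """
<thesis>
  <section title="Results">
    <quantity name="melting point" units="K">351.2</quantity>
    <quantity name="yield">0.83</quantity>
  </section>
</thesis>
"""


def quantities_missing_units(xml_text):
    """Return the names of <quantity> elements lacking a non-empty units attribute."""
    root = ET.fromstring(xml_text)
    return [
        q.get("name", "(unnamed)")
        for q in root.iter("quantity")
        if not (q.get("units") or "").strip()
    ]


# List every quantity the examiners would otherwise have to catch by eye.
print(quantities_missing_units(THESIS_XML))
```

A check like this could run every time the student saves a draft, which is not possible once the thesis has been flattened to PDF.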
I suggest the following simple rules:
- invest in XML authoring technology for theses (it is then automatic to create PDFs)
- invest in communal XML languages (MathML, CML, SVG...) for the major scientific domains and to check the quality of material
- develop departmental awareness and practices for capturing data at source. Our SPECTRa project has done this for crystallography, computational chemistry and spectroscopy.
- until then ALWAYS co-deposit a Word or LaTeX document, never just the PDF
- add a copyright notice such as Science/Creative Commons to protect the data from being appropriated by publishers
I also prepared a "manifesto" for the JISC meeting - it overlaps with the rules but adds
- Theses must be born-digital (i.e. NOT PDF)
- Domain ontologies must be used
- All data must be included in theses
- Data must be validated before submission
- Theses must be openly exposed to data and metadata crawlers
One critical point from the JISC meeting was that in most institutions the copyright of the thesis is vested in the author (student) (although sometimes it is the institution). For born-nondigital theses this makes it VERY difficult to re-use without explicit permission from the author. A human can read, but not re-use.
This is compounded by the use of the term "Open Access" to describe theses. My interpretation of Open Access is strict BOAI (Budapest Open Access Initiative):
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
Unfortunately it is common practice for many at the JISC meeting to talk of "Open Access" when they mean "Toll Free". I asked several organizers of thesis repositories specifically whether my robots could download these "Open Access" theses, text-mine them, and publish the results. In all cases I was told that for existing theses this was not allowed. However most agreed that born-digital theses had the opportunity for authors to make their theses fully Open.
The single most important rule, therefore, is that authors should be very strongly encouraged to make their theses fully Open under the BOAI and given the technical and legal tools to do so. Although in many disciplines this is complex (the thesis could contain third-party material, creative works of the author that they hold valuable (e.g. music, poetry, art...)) in most sciences it is negligible. I would be surprised if many current chemistry PhD students wished anything other than full re-use of their material. (Yes - it's frightening - there will be errors - inevitably. I am anything but proud of my own thesis presentation and know there are errors, but I might go back and scan it in all the same when I have time.)
Now I'm going to appeal to the chemistry community to see if there are any Open theses I can use.
BTW I am tagging this and future relevant posts as etd2007