petermr's blog

A Scientist and the Web


Archive for the ‘etd2007’ Category

“open access” is not good enough

Sunday, June 10th, 2007

I have ranted at regular intervals about the use of “Open Access” or often “open access” as a term implying more than it delivers. My current concern is that although there are tens of thousands of theses described as “open access” I have only discovered 3 (and possibly another 15 today) which actually comply with the BOAI definition of Open Access.

The key point is that unless a thesis (or any publication) explicitly carries a license (or possibly a site meta-license) actually stating that it is BOAI compliant, then I cannot re-use it. I shall use “OpenAccess” to denote BOAI-compliant in this post and “open access” to mean some undefined access which may only allow humans to read but not re-use the information.
I do not wish to disparage the important efforts to make scholarly information more widely available, and I applaud the general direction and achievement of the groups below. I appreciate that the copyright of historical content normally is held by the student author and it’s certainly very valuable to have “access” to it. But it is not OpenAccess. And unless specific policies are put in place to add specific BOAI-compliant licenses then future theses will also be non-compliant.
Here are typical statements:

  • “EThOS will make UK theses available on open access for global use”. Having spoken to EThOS colleagues last week it is clear that “open access” does not automatically mean OpenAccess.  (Electronic theses in the UK: the open access future : JISC).
  • MIT theses: “Regardless of whether copyright is held by the student or the Institute, the MIT Libraries publish the thesis electronically allowing open access viewing and limited downloading/printing. See” The term “open access viewing” might suggest the theses are BOAI-compliant and is therefore potentially misleading. I found that the “public” thesis had been mounted with “printing disabled”, which means that it cannot be technically re-used (as well as being legally non-reusable).
  • ECS Soton: A well-known thesis is in e-form: (ECS EPrints Service – Evaluating Research Impact through Open Access to Scholarly Communication) (Brody, T). It offers two sorts of download: “(i) PDF – and (ii) Other (Latex) Access restricted to members of ECS [i.e. Soton only]”. This is a differential distribution of scholarship. I could also not find any license or copyright statement.

By contrast let’s look at “Open Source” which applies to software and has been highly successful in liberating the field. It’s very widely used in academia and elsewhere. The Open Source Definition states:

Open source doesn’t just mean access to the source code. [PMR's emphasis] The
distribution terms of open-source software must comply with
the following criteria [PMR's elisions]:

1. Free Redistribution

The license shall not restrict any party from selling or
giving away the software as a component of an aggregate
software distribution containing programs from several
different sources. The license shall not require a
royalty or other fee for such sale.
2. Source Code

The program must include source code [...]

3. Derived Works

The license must allow modifications and derived works, and must
allow them to be distributed under the same terms as the license
of the original software.

4. Integrity of The Author’s Source Code

The license may restrict source-code from being distributed in
modified form only if the license allows [...]

7. Distribution of License

The rights attached to the program must apply to all to whom
the program is redistributed without the need for execution of
an additional license by those parties.


10. License Must Be Technology-Neutral

No provision of the license may be predicated on any individual
technology or style of interface.

In general the term “Open Source” is completely self-explanatory within a large community. I can describe my software as OS and everyone understands what I mean. There are some licenses (e.g. GPL) which require additional freedoms but they don’t invalidate the above.

By contrast if someone describes something as “open access” it simply means that I may – as a human – and at some arbitrary time in human history – read the document. It does not guarantee that

  • I can save my own copy (the MIT example suggests you cannot print it and may not be allowed to save it)
  • that it will be available next week
  • that it will be unaltered in the future or that versions will be tracked
  • that I can create derivative works
  • that I can use machines to text- or data-mine it

So I believe that “open access” should be recast as “toll-free” – i.e. you do not have to pay for it but there are no other guarantees. We should restrict the use of “Open Access” to documents which explicitly carry licenses compliant with BOAI. [A weaker (and much more fragile) approach is that a site license applies to all content. The problem here is that documents then get decoupled from the site and their OpenAccess position is unknown.]

If the community wishes to continue to use “open access” to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:

  • “open access (CC-BY)” – explicitly carries CC-BY license
  • “open access (BOAI)” – author/site wishes to assert BOAI-nature of document(s) without specific license
  • “open access (FUZZY)” – fuzzy licence (or more commonly absence of licence) for document or site without any guarantee of anything other than human visibility at current time. Note that “Green” open access falls into this category. It might even be that we replace the word FUZZY by GREEN, though the first is more descriptive.

However there is no value in “Green open access” for theses. Let’s make sure they are all BOAI compliant.
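
The qualifier scheme suggested above could be applied mechanically to the rights statements found in a repository. The following is only a minimal sketch: the function name `access_label` and the matching rules are assumptions made for illustration, not an agreed standard, and a real tool would need far more careful licence detection.

```python
import re


def access_label(statement):
    """Map a free-text rights statement to one of the three suggested labels.

    The matching rules below are illustrative assumptions only.
    """
    text = (statement or "").lower()

    # Plain CC-BY only: a following hyphen means a variant such as
    # CC-BY-NC or Attribution-NonCommercial, which is not plain CC-BY.
    if re.search(r"\bcc[- ]by\b(?!-)", text) or re.search(
        r"creative commons attribution\b(?!-)", text
    ):
        return "open access (CC-BY)"

    # The author or site asserts BOAI compliance without naming a licence.
    if "boai" in text or "budapest open access" in text:
        return "open access (BOAI)"

    # Everything else, including no statement at all, is treated as fuzzy.
    # Other explicit licences (e.g. CC BY-NC-SA) are not covered by the
    # three example labels and also end up here in this sketch.
    return "open access (FUZZY)"


print(access_label("This work is licensed under CC-BY 3.0"))
print(access_label("Items are protected by copyright, with all rights reserved"))
```

A statement that says nothing explicit falls through to FUZZY, which mirrors the argument of this post: without an explicit licence a reader (or a robot) has no way of knowing what re-use is permitted.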

OSCAR eats an Open thesis

Saturday, June 9th, 2007

As regular readers will know we are applying text-mining to chemistry in Open theses. The problem is finding fully Open theses – so far we have got Alicia’s. Alicia has captured all her molecules in semantic form so text-mining isn’t required – and I’m hoping to do some fun stuff with XML on it.

I’ve searched for large collections of theses. MIT has a promising collection which would be ideal but they are only TollFree, not OpenAccess. I’m still appealing for readers to help. But in one of those quirks of Googling I ended up at the digital repository of the University of Stirling.

I am delighted about this since I spent 15 very happy years on the staff at Stirling in the Chemistry department. It doesn’t have one now – that’s why I left :-( – although we’re having our 40th anniversary later. But perhaps there are theses with chemical concepts. Now the repository announces:

The copyright in theses in this collection remains with the author, unless it is stated to have been assigned to the University of Stirling. The University of Stirling reserves the right to keep electronic copies for consultation in both cases.

so I wasn’t very hopeful. But I thought I’d have a look and found one in aquaculture – one of the successful disciplines in Stirling:

Feeding behaviour of the prawn Macrobrachium rosenbergii as an indicator of pesticide contamination in tropical freshwater.

which carried the licence:

This item is protected by original copyright

Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.

still not hopeful, until I read the license:

License granted by Kriengkrai Satapornvanit on 2007-03-26T06:34:59Z (GMT):

This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 1.0 Licence.

You are free:

- to copy, distribute, display, and perform the work
- to make derivative works

Under the following conditions:

You must give the original author credit.

You may not use this work for commercial purposes.

If you alter, transform, or build upon this work, you may distribute
the resulting work only under a licence identical to this one.

For any reuse or distribution, you must make clear to others the
licence terms of this work. Any of these conditions can be waived
if you receive permission from the author. Your fair dealings and
other rights are in no way affected by the above.

Many public thanks, Kriengkrai Satapornvanit and I hope your future work prospers. Now I am completely free to see if chemicals can be mined from the thesis:

  • I download the PDF.
  • I convert it to ASCII using pdftotext. This destroys the formatting, diacritics, tables, subscripts, superscripts – in fact almost everything except the words. And – for most theses – these are still in the right order. (Unlike Eric Morecambe, this is not a joke – PDF often has the words in an arbitrary order).
  • I start OSCAR as a Server and paste in the text. It takes about 30 seconds for OSCAR to read the whole thesis and interpret the chemistry. (This portable version of OSCAR does not have a complete lexicon – full versions need to be run on a server).
  • OSCAR (originally dubbed the “journal eating-robot”) eats the text. Here’s a typical section:


You can see that OSCAR has recognised many words as likely chemical terms (in yellow) and knows the structure of the underlined ones (the full version would know all of them). It’s not 100% accurate – you can see it thinks “P” is the element phosphorus – but Peter Corbett has addressed this in later versions.
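
To make the steps above concrete, here is a rough sketch of the same pipeline shape in Python. It is emphatically not OSCAR: the tiny lexicon, the function names and the sample sentence are all invented for illustration, and real chemical entity recognition is far more sophisticated. The first function assumes the `pdftotext` command-line tool is installed and on the path.

```python
import re
import subprocess

# Toy lexicon (an assumption for illustration; nothing like OSCAR's real one).
TOY_LEXICON = {
    "chlorpyrifos": "pesticide",
    "dimethoate": "pesticide",
    "ethanol": "solvent",
    "P": "element (phosphorus)",  # element symbols are case-sensitive
}


def pdf_to_text(pdf_path):
    """Extract plain text from a PDF using the pdftotext command-line tool.

    Passing "-" as the output file sends the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def tag_terms(text, lexicon=TOY_LEXICON):
    """Return (token, category) pairs for every token found in the lexicon."""
    hits = []
    for token in re.findall(r"[A-Za-z][A-Za-z0-9-]*", text):
        if token in lexicon:  # exact match, e.g. element symbols
            hits.append((token, lexicon[token]))
        elif token.lower() in lexicon:  # case-insensitive match for names
            hits.append((token, lexicon[token.lower()]))
    return hits


sample = "Prawns exposed to chlorpyrifos fed less (P < 0.05)."
print(tag_terms(sample))
```

Note that this naive lookup tags the statistical “P” as phosphorus, which is exactly the kind of false positive described above; avoiding it needs context, not just a word list.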

So this allows us to collect metadata from theses automatically. OSCAR can tell us in a few seconds that this thesis is concerned with specific pesticides. That’s part of the basis of the SPECTRa-T project. Since we’ve benefited from Open Source theses, maybe we should do the whole project on an Open Wiki…

Chemistry/science theses urgently wanted!

Friday, June 8th, 2007

As I have blogged (Electronic Theses (ETD2007) – June 8th, 2007) I shall be demonstrating the power of the eThesis next week at Uppsala. We now have technology that will identify the chemistry in a thesis and automatically re-use it in many ways. These include:

  • machine extraction of metadata/terminology
  • identification of named entities (especially chemical names)
  • validation of the contents of the thesis
  • validation of the structure of the thesis
  • conversion of the thesis into different formats (e.g. to use SI units)
  • comparison of similarity between theses
  • linking theses to existing ontologies and other resources

For example we could see a thesis repository as a SPARQL endpoint.
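
As a hedged illustration of what that might look like, the sketch below builds a SPARQL query asking for theses that mention a given chemical. The `dc:` prefix is the Dublin Core element set; the `ex:` namespace and the `ex:mentionsChemical` property are hypothetical, since no actual vocabulary or endpoint is specified here.

```python
def chemical_query(name):
    """Build a SPARQL query for theses that mention a given chemical name.

    The ex: namespace and ex:mentionsChemical are hypothetical placeholders.
    """
    # Escape backslashes and double quotes so the name is a valid
    # SPARQL string literal (newlines are not handled in this sketch).
    literal = name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ex: <http://example.org/thesis#>

SELECT ?thesis ?title
WHERE {{
  ?thesis dc:title ?title ;
          ex:mentionsChemical "{literal}" .
}}
""".strip()


print(chemical_query("chlorpyrifos"))
```

The query string could then be sent to whatever endpoint a repository chose to expose.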

However most existing theses, even if publicly visible, are not automatically re-usable without explicit permission. I’d very much like to have a few exemplars which I can show at the Uppsala meeting next week.

I’d be very grateful if any reader(s) have a thesis (possibly their own) or a collection of theses (ideally already posted on the web) which:

  • we can use for text-mining without further permission
  • has an attribution for author and institution
  • is likely to contain a significant amount of chemistry [1]
  • is electronic and machine-readable (Word, PDF, NOT TIFF)
  • can be made immediately available

[1] many scientific areas which are not themselves chemistry (bioscience, materials, geoscience, environment) may contain chemical terminology – e.g. in methods and materials.
The purpose of this request is to develop ways of enhancing the value of theses as mentioned above and in general we would not expect our software to “discover new science” or to explicitly criticize the thesis – that would be unkind. (Although when eTheses become more common you can expect this to become more common!)

(Please see WWMM for email address)

Electronic Theses (ETD2007)

Friday, June 8th, 2007

I am honoured to be asked to speak at the meeting next week in Uppsala on electronic theses (The Power of the Electronic Scientific Thesis). (This resonates with the JISC meeting on repositories (Digital repositories: Dealing with the digital deluge) which I haven’t yet been able to blog as our server is only just back up.) Some snippets:

Yet our own work in the SPECTRa project has shown that 80% (or more) of scientific data is never published….Electronic theses have the power to change all this. The thesis has several major advantages over current methods of publication

  • the author and/or their institution retain complete control over the copyright of the work and are not forced to hand it over to the publisher
  • there is a strict quality control system of internal and external examiners. The candidate has to convince them that the data are fit for purpose.
  • the student cannot be “lazy” about the means of authoring. If a university insists on XML then the student will have to do it.
  • an electronic thesis can (and I argue must) be openly available in an institutional repository.
  • an unlimited amount of supporting data can be copublished.

There are technical and socio-political barriers.

  • the thesis is often produced in some form of e-paper (TIFFs or PDF) which completely destroys all semantics
  • XML tools are not yet universal
  • there is no metadata for the scientific data
  • the authors and their supervisors are afraid that someone might read the thesis and (a) show there are errors (b) re-use it in clever ways thus “scooping” the authors. (This is sometimes contaminated with the problems of patents and confidential human information – but there are well accepted mechanisms for this). There are no moral reasons why the average thesis should not be fully visible to the world and re-usable under the BOAI declaration.
  • the university has medieval rules of ownership and copyright but enlightened ones now routinely post their theses.

My utopian vision is that students prepare their thesis in XML. This solves all the technical problems. It also will help the students to prepare better theses faster. For example students are often criticised for not having scientific units, omitting scales and labels on diagrams, missing out critical information, etc.
I suggest the following simple rules:

  • invest in XML authoring technology for theses (it is then automatic to create PDFs)
  • invest in communal XML languages (MathML, CML, SVG…) for the major scientific domains and to check the quality of material
  • develop departmental awareness and practices for capturing data at source. Our SPECTRa project has done this for crystallography, computational chemistry and spectroscopy.
  • until then ALWAYS co-deposit a Word or LaTeX document, never just the PDF
  • add a copyright notice such as Science/Creative Commons to protect the data from being appropriated by publishers
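
One reason XML helps is that the common criticisms (missing units and so on) become machine-checkable. The sketch below shows the idea with Python's standard library. The `<thesis>`, `<section>` and `<quantity>` elements and the `units` attribute are made up for this example; they are not CML or any other real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesis fragment; the element names are invented for illustration.
SAMPLE = """
<thesis>
  <section title="Methods">
    <quantity id="q1" units="mg/L">0.5</quantity>
    <quantity id="q2">25</quantity>
  </section>
</thesis>
"""


def missing_units(xml_text):
    """Return the ids of <quantity> elements lacking a non-empty units attribute."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    for quantity in root.iter("quantity"):
        if not (quantity.get("units") or "").strip():
            problems.append(quantity.get("id", "(no id)"))
    return problems


print(missing_units(SAMPLE))
```

A check like this could run every time the student saves a chapter, long before an examiner sees it. Nothing comparable is possible once the thesis exists only as PDF.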

I also prepared a “manifesto” for the JISC meeting – it overlaps with the rules but adds

  • Theses must be born-digital (i.e. NOT PDF)
  • Domain ontologies must be used
  • All data must be included in theses
  • Data must be validated before submission
  • Theses must be openly exposed to data and metadata crawlers

One critical point from the JISC meeting was that in most institutions the copyright of the thesis is vested in the author (student) (although sometimes it is the institution). For born-nondigital theses this makes it VERY difficult to re-use without explicit permission from the author. A human can read, but not re-use.

This is compounded by the use of the term “Open Access” to describe theses. My interpretation of Open Access is strict BOAI (Budapest Open Access Initiative):

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Unfortunately it is common practice for many at the JISC meeting to talk of “Open Access” when they mean “Toll Free”. I asked several organizers of thesis repositories specifically whether my robots could download these “Open Access” theses, text-mine them, and publish the results. In all cases I was told that for existing theses this was not allowed. However most agreed that born-digital theses had the opportunity for authors to make their theses fully Open.
The single most important rule, therefore, is that authors should be very strongly encouraged to make their theses fully Open under the BOAI and given the technical and legal tools to do so. Although in many disciplines this is complex (the thesis could contain third-party material, creative works of the author that they hold valuable (e.g. music, poetry, art…)) in most sciences it is negligible. I would be surprised if many current chemistry PhD students wished anything other than full re-use of their material. (Yes – it’s frightening – there will be errors – inevitably. I am anything but proud of my own thesis presentation and know there are errors, but I might go back and scan it in all the same when I have time).
Now I’m going to appeal to the chemistry community to see if there are any Open theses I can use.

BTW I am tagging this and future relevant posts as etd2007