Category Archives: etd2007

Webcast: the power of the eThesis

I am very grateful to Caltech, especially Eric van der Velde, for organising and recording my presentation on eTheses at Caltech last month. See The power of the Scientific eThesis, a combined audio, video and screenshow. Caltech have done a very good job of stitching it together. Many of the "slides" were in scrolling HTML so the slide-count is artificially high - each scroll generates a new "slide". Total time about 67 minutes.

The themes include:

  • homage to Caltech: Jack Dunitz, Linus Pauling, Verner Schomaker and Ken Trueblood.
  • data-driven science in crystallography - examples from 1973 to present day.
  • semantic web and chemistry, including DBPedia
  • Open Access
  • eTheses
  • crystalEye

and questions at the end.

Since my presentations are taken from many thousand slides it gives an accurate impression of a typical talk, where I do not know in advance exactly what components I shall touch on. In a few places my machine ran slowly so there are minor hiatuses.

Data and Institutional Repositories

One of the themes of ETD2007 was a strong emphasis on IRs. This is not surprising since they are topical and a natural place to put theses and dissertations. Almost everyone there - many from the Library and Information Services (LIS) community - had built, or was building, IRs. I asked a lot of people why. Many were doing it because everyone else was, or there was funding, or similar pragmatic motives. But beyond that the motives varied considerably. They included:

  • To promote the institution and its work
  • To make the work more visible
  • To manage business processes (e.g. thesis submission, or research assessment exercises).
  • To satisfy sponsors and funding bodies
  • To preserve and archive the work
  • To curate the work

and more. The point is that there is no single purpose and therefore IR software and systems have to be able to cope with a lot of different demands.

The first generation IRs (ePrints, DSpace, Fedora) addressed the reposition of single eManuscripts ("PDFs") with associated metadata. This now seems to work quite well technically, although there are few real metrics about whether they enhance exposure and there is poor compliance in many institutions. There is also a major problem with some Closed Access publishers (e.g. in chemistry) formally forbidding Open reposition. So the major problems are social.

Recently the LIS community has started to highlight the possibility of repositing data. This is welcome, but needs careful thought - here are a few comments.

Many scholars produce large amounts of valuable data. In some cases the data are far more important than the full text. For example Nick Day's crystallographic repository CrystalEye contains 100,000 structures from the published literature and, although it links back to the text, can be used without needing to. This is also true of crystallographic data collected from departmental services as in our SPECTRa system. With the right metadata, data can often stand alone.

All scientists - and especially those with sad experiences of data loss (i.e. almost all) - are keen for their data to be stored safely and indefinitely. And most would like other scientists to re-use their data. This needs a bit of courage. The main drawbacks are:

  • re-analysis could show the data or the conclusions were flawed
  • re-analysis could discover exciting science that the author had missed
  • journals could refuse to publish the work (isn't it tedious to have to mention this in every post :-( )

But many communities have faced up to it. The biosciences require deposition of many sorts of data when articles are published. A good example is the RCSB Protein Data Bank which has a very carefully thought-out and tested policy and process. If you are thinking of setting up a data repository this document (and many more like it) should be required reading. It works, but it's not trivial - 205 pages. It requires specialist staff and constant feedback to and from the community. Here's a nice, honest, chunk:

If it is so simple, why is this manual so long?

In the best of worlds all information would be exchanged using both syntactically and semantically precise protocols. The reality of the current state of information exchange is somewhat different. Most applications still rely on the poorly defined semantics of the PDB format. Although there has been significant effort to define and standardize the information exchange in the crystallographic community using mmCIF, the application of this method of exchange is just beginning to be employed in crystallographic software.
Currently, there is wide variation in specification of deposited coordinate and structure factor data. Owing to this uncertainty, it is not practical to attempt fully automated and unsupervised processing of coordinate data. This unfortunately limits the functionality that can be presented to the depositor community through Web tools like ADIT, as it would be undesirable to have depositors deal with the unanticipated software failures arising from data format ambiguities. Rather than provide for an automated pipeline for processing and validating coordinate data, coordinate data processing is performed under the supervision of an annotator. Typical anomalies in coordinate data can be identified and rectified by the annotator in a few steps permitting subsequent processing to be performed in an automated fashion.
The remainder of this manual describes in detail each of the data processing steps. The data assembly step in which incoming information is organized and encoded in a standard form is described in the next section.

So data reposition is both highly desirable and complex. I'm not offering easy solutions. But what is not useful is the facile idea that data can simply be reposited in current IRs. (I heard this suggested by a commercial repository supplier in the Digital Scholarship meeting in Glasgow last year. He showed superficial slides of genomes, star maps, etc. and implied that all this could be reposited in IRs. It can't and I felt compelled to say so.)

At ETD2007 we had a panel session on repositing data. Lars Jensen gave a very useful overview of bioscientific data - here are my own points from it:

  • There is a huge amount of bioscience data in repositories.
  • These are specialist sites, normally national or international
  • There is a commitment to the long-term
  • much bioscience is done from the data in these repositories
  • the data in them is complex
  • the community puts much effort into defining the semantics and ontologies
  • specialist staff are required to manage the reposition and maintenance
  • there are hundreds of different types of data - each requires a large amount of effort
  • the relationships between the data are both complex and exceedingly valuable

All bioscientists are aware of these repositories (they don't normally use this term - often "data bank", "gene bank", etc. are used.) They would always look to them to deposit their data. Moreover the community has convinced the journals to enforce the reposition of data by authors.

Some other disciplines have similar approaches - e.g. astronomers have the International Virtual Observatory Alliance. But most don't. So can IRs help?

I'd like to think they can, but I'm not sure. My current view is that data (and especially metadata) - at this stage in human scholarship - have to be managed by the domains, not the institution. So if we want chemical repositories the chemical community should take a lead. Data should firstly be captured in departments (e.g. by SPECTRa) because that is where the data are collected, analysed, and - in the first instance - re-used. For some other domains it's different - perhaps it might be at a particular large facility (synchrotron, telescope, outstation, etc.).

Some will argue that chemistry already operates this domain-specific model. Large abstracters aggregate our data (which is given for free) and then sell it back to us. In the 20th C this was the only model, but in the distributed web it breaks. It's too expensive and does not allow for community ontologies to be developed (the only Open ones in chemistry are developed by biologists). And it's selective and does not help the individual researcher and department.

Three years ago I thought it would be a great idea to archive our data in our DSpace repository. It wasn't trivial to put in 250,000 objects. It's proving even harder to get them out (OAI-PMH is not designed for complex and compound objects).
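
For readers who have not met OAI-PMH, here is a minimal Python sketch of what harvesting looks like at the protocol level. It is an illustration only, not our harvesting code: the base URL and the sample response are made up, and the function names are hypothetical. The `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace are taken from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) repository endpoint."""
    if resumption_token:
        # When resuming a partial list, the token is the only other argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# A made-up, heavily trimmed response, for illustration only.
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<record><header><identifier>oai:example:1</identifier></header></record>"
    "<resumptionToken>page2</resumptionToken>"
    "</ListRecords></OAI-PMH>"
)
```

A harvester repeats the request with each returned token until none comes back. What comes out is a stream of metadata records, one per item, and the sketch deliberately stops there: pulling out an object made of many related files needs something beyond this loop.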

Joe Townsend, who works with me, will submit his thesis very shortly. He wants to preserve his data - 20 GBytes. So do I. I think it could be very useful for other chemists and eScientists. But where to put it? If we put it in DSpace it may be preserved but it won't be re-usable. If he puts it on CD it requires zillions of actual CDs. And they will decay. We have to do something - and we are open to suggestions.

So we have to have a new model - and funding. Here are some constraints in chemistry - your mileage may vary:

  • there must be support for the development of Open ontologies and protocols. One model is to encourage groups which are already active and then transfer the maintenance to International Unions or Learned Societies (though this can be difficult when they are also income-generating Closed Access Publishers)
  • funders must make sure that the work they support is preserved. Hundreds of millions or more are spent in chemistry departments to create high-quality data and most of these are lost. It's short-sighted to argue that the data only live until the paper publication
  • departments must own the initial preservation of this data. This costs money. I think the simplest solution is for funders to mandate the Open preservation of data (cf. Wellcome Trust).
  • the institutions must support the generic preservation process. This requires the departments actually talking to the LIS staff. It also requires LIS staff to be able to converse with scientists on equal terms. This is hard, but essential.

Where the data finally end up is irrelevant as long as they are well managed. There may, indeed, be more than one copy. Some could be tuned for discoverability.

So the simple message is:

  • save your data
  • don't simply put it in your repository

I wish I could suggest better how to do this well.

CML on ICE - towards Open chemical/scientific authoring

Because WWMM had outages my blogging is behind and I'd written a post on Peter Sefton's ICE. Peter and I met at ETD2007 and immediately clicked. But WWMM went to sleep and I haven't reposted. Peter has beaten me to it.

ICE is a content authoring tool based on Open Office. It works natively with XML and Subversion. So it adds a dramatic aspect to document authoring - versioning with full community access and collaboration (if required). For example, if Peter and I write a paper about this we'd use the ICE server at University of Southern Queensland to store the versions. And of course as it's Open Source anyone can set one up - it would be ideal for the Blue Obelisk community to author papers with.

But what catalysed this was the possibility of authoring theses. Students are looking for imaginative approaches and many will be happy to be early adopters in this new technology. If the domain-specific components are in XML (or other standards) it becomes easy to integrate them into ICE. And it is fantastic to be able to revert to previous versions - I find Subversion easier than Word change management for example.

So some points from PeterS's post:

View this page as PDF


I mentioned before that at the ETD 2007 conference I met Prof Peter Murray-Rust. [1] We're going to collaborate on adding support for CML, the Chemical Markup Language, to ICE, so that people can write research publications that include 'live' data.

[1] I'm just Petermr or PMR :-)

Here's a quick demo of the possibilities.

I went to the amazing Crystaleye service.

PMR: This is Nick Day's site. We'd hoped to announce it formally a week or so ago but machine problems kept us back. But we'll get some posts out this coming week. [We thank Acta Cryst/IUCr for a summer studentship which helped greatly to get it off the ground.]

The aim of the CrystalEye project is to aggregate crystallography from web resources, and to provide methods to easily browse, search, and to keep up to date with the latest published information.

Crystaleye automatically finds descriptions of crystals in web-accessible literature, turns them into CML and builds pages like Acta Crystallographica Section B, 2007, issue 03-00.

From that page I grabbed this two dimensional image of (C6H15N4O2)2(C4H4O6-2),


PMR: Minor point: This is just the anion - there is a separate image for the cation (the 3D structure below displays the cations as well).

There's a Java applet on the page that lets you play with the crystal in 3d. Here's a screenshot of the 3d rendering.


There's lots more work to be done, but I thought I'd show how easy it is to make an ICE document that shows the 2d view for print, with the 3d view for the web, via the applet. Be warned, this may not work for you. The applet refuses to load in Firefox 2 for me, but it does work in Safari on Mac OS X. If you follow the 'view this page in PDF' link above you'll see just the picture.

PMR: image and applet deleted here ...

What's happening here?

My initial hack is really simple. I grab the image and paste it into ICE like any other image, but then I link it to the CML source. I wrote a tiny fragment of Python in my ICE site to go through every page, and if it finds a link to a CML file containing an image, it adds code to load the CML into the Jmol applet. This is a kind of integration-by-convention, AKA microformat.
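
PMR: For readers who want to picture the convention, here is a minimal Python sketch of that kind of fragment. It is not the actual ICE code. The regular expression, the function name `add_jmol_applets`, the applet size and the reading of "a link containing an image" as an `<img>` nested inside an `<a href="....cml">` are all assumptions made for illustration. The applet markup follows the usual Jmol applet embedding (a `load` parameter naming the file), but treat the details as a guess.

```python
import re
from html import escape

# Assumed convention: an <a> whose href ends in .cml and which wraps an <img>.
CML_IMAGE_LINK = re.compile(
    r'<a\s[^>]*href="(?P<href>[^"]+\.cml)"[^>]*>\s*<img\b[^>]*>\s*</a>',
    re.IGNORECASE,
)

# Assumed applet markup; the jar location and the size are placeholders.
APPLET_TEMPLATE = (
    '<applet code="JmolApplet" archive="{jar}" width="{size}" height="{size}">'
    '<param name="load" value="{href}"></applet>'
)


def add_jmol_applets(html, jar="JmolApplet.jar", size=300):
    """Append a Jmol applet after every image that links to a CML file."""

    def append_applet(match):
        applet = APPLET_TEMPLATE.format(
            jar=escape(jar), size=size, href=escape(match.group("href"))
        )
        # Keep the original linked image (used for print) and add the applet.
        return match.group(0) + applet

    return CML_IMAGE_LINK.sub(append_applet, html)
```

Pages without such a link come back unchanged, so the image on its own still serves the print and PDF rendition while the HTML rendition gains the 3d view.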

The main bit of programming only took a few minutes, but sorting out where to put the CML files and the Jmol applet, and integrating the changes into this blog took ages. I ended up putting the files here on my web site which meant putting a big chunk of stuff into subversion, something that should have been done ages ago, but the version of svn that runs on my other server refuses to do large commits over HTTPS 'cos of some SSL bug and I can't figure out how to update it which meant switching the repository to use plain HTTP, and so on. It wasn't made easier by me mucking around with the Airport Extreme router and our ADSL modem at the same time, halting internet access at home for a couple of hours.

To make this integration a bit more usable and robust we want to:

  • Work out a workflow that lets you keep CML files in ICE and easily drop images in to your documents, letting ICE render using the applet when it makes HTML.
  • Integrate forthcoming work from Peter & team that will provide high quality vector graphics instead of the PNG files I'm using now.

PMR: I have now hacked JUMBO so it generates SVG images of 2D (and soon 3D) molecules. Note that this then allows automatic generation of molecular images in PDF files (through FOP/SVG).

  • Investigate embedding CML in an image format such as EPS that word processors understand.
  • Generalize this approach for other e-scholarship applications. We're working with the Alive team at USQ on this.
  • Talk to the DART & ARCHER teams.

I am also extremely keen to talk to these teams - as they are doing very similar and complementary work to our SPECTRa and SPECTRa-T projects in capturing scientific data at source.

I am impressed by the Australian commitment to Open Access, Open Data and collaborative working. ICE is an excellent example of how we can split the load. ICE likes working with the technical aspects of documents (I don't really, though I have to). The Blue Obelisk likes working with XML in chemistry. The two components naturally come together.

This is something I have been waiting for for about 12 years. We haven't got there yet, but we are well on the way.

The power of the scientific eThesis

This is the summary of a presentation I am giving tomorrow at ETD2007 (run by the Networked Digital Library of Theses and Dissertations). I'm blogging this as the simplest way of (a) reminding me what I am going to say and (b) acting as a very rough record of what I might have presented. (My talks are chosen from a menu of 500+ possible slides and demos and I don't know which at the start of the presentation so it's very difficult to have a historical record. The blog carries the main arguments.)
Main themes (many of which have been blogged recently):

  • the thesis need not be a dull record of a final result but a creative work which lives and evolves until and beyond the "final submission"
  • theses should be semantic and interactive, supported by ontologies and go beyond "hamburger PDF". Theses are computable.
  • We must develop communal semantic authoring/creation environments and processes.
  • the process should move rapidly towards embracing open philosophies and methodology. Metadata and ontologies should be open.
  • young people should be actively involved in all parts of managing the thesis process.
    (Harvard Free Culture)
  • "Web 2.0" will transform society and therefore the academic process. We must be prepared for this.
  • It is not clear that current approaches to "repositories" will help rather than hinder innovation and dissemination of eTheses. They will only be useful for preservation if they are semantic.

In detail scientific theses need support for authoring and validating:

  • thesis structure (templating) - e.g. USQ's Integrated Content Environment ICE system which supports XML/"Word"
  • MathML
  • SVG (graphics)
  • CML (Chemistry)
  • GML (maps)
  • Numeric data (various, including CML)
  • graphs (various, including CML)
  • tables (various, including CML)
  • scientific units (various, including CML)
  • ontologies and dictionaries (various, including CML)

Some exciting thesis projects:

Why PDF is so awful: Organic Theses: Hamburger or Cow?

Subversion (CML project)

Wikipedia - caffeine - (info boxes)

GoogleInChI - semantic chemical search without Google knowing

The power of the semantic Web - Using Wikipedia as a Web Database.

Chemical blogspace - overview of exciting developments in chemistry

Local demos including analysis of theses:

What should institutions and NDLTD do to promote this vision?

  • involve young people in all parts of the process - understand Web 2.0 culture and democracy. Be brave
  • help promote their vision against the conservatism of institutions, learned societies and commercial interests
  • promote thesis creation as a complete part of the research process. Start on day 0 with tools, encouragement. Get students from year+1 to explain the vision
  • Harness the power of social computing (Google, Flickr, Wikipedia, etc.). You will have to anyway. Give credit for innovation in this area
  • Co-develop semantic authoring tools, including scientific languages. Use rich clients for display.
  • Promote the use of ontologies and similar resources as integral parts of the scholarly process. Insist on marked up information and entities
  • Use software to validate data in theses. Give these tools to examiners.
  • insist that data belongs to the scientific community. Use creative commons licenses from day 0.

and overall... Use the power of the scholarly community to show that they can communicate science far better than the absurd e-paper, unacceptable business models, and repression of innovation that is forced on us by the commercial and pseudo-commercial publishers. Destroy the pernicious pseudo-science of citation metrics. Reclaim our scholarship.

"open access" - some central questions

I am grateful for the recent correspondence from Peter Suber and Stevan Harnad as it helps me get my thoughts in order for ETD2007. In response to Stevan:

Open Access: What Comes With the Territory,

Peter has analysed the central question very clearly (as always)

I expect that all of us will agree with the analysis below. The position each of us takes may vary:

Summary [of Stevan's post]: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author's own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author's (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia -- all the wrong stuff and irrelevant to OA.


  • Stevan is responding to Peter Murray-Rust's blog post from June 10. But since I agreed with most of what Peter MR wrote, I'll jump in.
  • Stevan isn't saying that OA doesn't or shouldn't remove permission barriers. He's saying that removing price barriers (making work accessible online free of charge) already does most or all of the work of removing permission barriers and therefore that no extra steps are needed.
  • The chief problem with this view is the law. If a work is online without a special license or permission statement, then either it stands or appears to stand under an all-rights-reserved copyright. The only assured rights for users are those collected under fair use or fair dealing. These rights are far fewer and less adequate than OA contemplates, and in any case the boundaries of fair use and fair dealing are vague and contestable.
  • This legal problem leads to a practical problem: conscientious users will feel obliged to err on the side of asking permission and sometimes even paying permission fees (hurdles that OA is designed to remove) or to err on the side of non-use (further damaging research and scholarship). Either that, or conscientious users will feel pressure to become less conscientious. This may be happening, but it cannot be a strategy for a movement which claims that its central practices are lawful.
  • This doesn't mean that articles in OA repositories without special licenses or permission statements may not be read or used. It means that users have access free of charge (a significant breakthrough) but are limited to fair use.

PMR: "The chief problem with this view is the law". That puts it precisely, and that's where Stevan and I differ. At the moment I think we have to work within the law, and I think the law debars me from crunching. There may come a time where we feel that civil disobedience is unavoidable but it hasn't arrived yet - if it does I shall be there.

And some comments on other parts of Stevan's post:

Get the Institutional Repository Managers Out of the Decision Loop

The trouble with many Institutional Repositories (IRs) (besides the fact that they don’t have a deposit mandate) is that they are not run by researchers but by “permissions professionals,” accustomed to being mired in institutional author IP protection issues and institutional library 3rd-party usage rights rather than institutional author research give-aways.

PMR: I have had similar thoughts. I got the distinct impression that some IRs are run like Victorian museums - look but don't touch. The very word "repository" suggests a funereal process - it's no surprise that having put much of my stuff into DSpace I find it's an enormous effort to get it out. Why don't we build "disseminatories" instead?
[Stevan's analysis of how we should deposit papers omitted. I don't disagree - I'm just more interested in data at present.]

Now, Peter, I counsel patience! You will immediately reply: “But my robots cannot crunch Closed Access texts: I need to intervene manually!” True, but that problem will only be temporary, and you must not forget the far larger problem that precedes it, which is that 85% of papers are not yet being deposited at all, either as Open Access or Closed Access. That is the inertial practice that needs to be changed, globally, once and for all.

PMR: Here we differ. In many fields there has been little movement and no Green journals. We could wait another five years for no effect. But my main concern is the balance between Green access and copyrighted data. The longer we fail to address the copyrighting of data the worse the situation will become. Publishers are not stupid - they have revenue-oriented business people working out how to make money out of our data - Wiley told me so. Imagine, for example, that a publisher says "I will make all our journals green as long as we retain copyright. And we'll extend the paper to cover the whole of the scientific record". That would be wonderful for Stevan and a complete disaster for paper-crunchers. We can't afford to wait for that to happen.

Just as I have urged that Gold OA (publishing) advocates should not over-reach (“Gold Fever”) — by pushing directly for the conversion of all publishers and authors to Gold OA, and criticizing and even opposing Green OA and Green OA mandates as “not enough” — I urge the advocates of automatized robotic data-mining to be patient and help rather than hinder Green OA and Green OA (and ID/OA) mandates.

PMR: I am not - I hope - hindering Green access. I am not personally agitating for Green or Gold - my energies go into arguing that the experimental process must not be copyrighted by the publisher or anyone else. And that institutional repositories should start to be much much more proactive and actively support the digital research process.

Stevan Harnad on "open access"

Stevan Harnad - a tireless evangelist of OA - has replied to my points. He has been consistent in arguing the logic below and I agree with the logic. The problem is that few people believe that this allows us to act as he suggests.

Stevan argues that current Green Open Access allows us to do all we wish with the exposed material without permission. However when I spoke to several repository managers at the JISC meeting all were clear that I could not have permission to do this with their current content. I asked "can my robots download and mine the content in your current open access repository of theses?" - No. "Can you let me have some chemistry theses from your open access collection so I can data-mine them?" - No - you will have to ask the permission of each author individually. So Stevan's views on what I can do seem not to be - unfortunately - widely held.

  1. Stevan Harnad Says:
    June 12th, 2007 at 3:37 am

    Open Access: What Comes With the Territory

    Peter Murray-Rust’s worries about OA are groundless. Peter worries he can’t be sure that:

    “I can save my own copy (the MIT [site] suggests you cannot print it and may not be allowed to save it)”

    Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]

    “that it will be available next week”

    It will. The University OA IRs all see to that. That’s why they’re making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]

    “that it will be unaltered in the future or that versions will be tracked”

    Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.

    “that I can create derivative works”

    You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).

    “that I can use machines to text- or data-mine it”

    Yes, you can. Download and crunch away.

    This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia — all the wrong stuff and irrelevant to OA.

    Stevan Harnad
    American Scientist Open Access Forum

Specific issues:

My concern was not just with material in repositories but elsewhere. Some publishers allow green open access posting on web sites but debar it from repositories. So the concerns remain.

The MIT repository deliberately adds technical restrictions on printing their theses and this also technically prevents data and text mining. There are some hacks possible to get round this but it comes close to dishonesty and illegality.

"derivative works" is a phrase that doesn't work well in the data-rich subjects and we need something better. But it's what the licenses use at present.
In data-rich subjects linking to repositories is often of little use. I need thousands of texts on specialist machines accessed with high frequency and bandwidth.

My problem is not with Stevan's views but that few others give positive support to them, particularly not the repository managers. Maybe I'm too cautious...

More on "open access"

I recently posted my concern about the use of "open access" as a phrase which is sufficiently broad to be confusing and Peter Suber has created a thoughtful and useful reply. I agree in detail with all his analysis and any differences are probably in emphasis and strategy.

Peter Murray-Rust, “open access” is not good enough, A Scientist and the Web, June 10, 2007.  Excerpt:
Comments [PeterS]

  • I agree with much but not all of what Peter MR says.  I'm responding at length because I've often had many of the same thoughts.
  • I'm the principal author of the BOAI definition of OA, and I still support it in full.  Whenever the occasion arises, I emphasize that OA removes both price and permission barriers, not just price barriers.  I also emphasize that the other major public definitions of OA (from Bethesda and Berlin) have similar requirements.

PMR: Agreed. PeterS continually and consistently asserts this - I am arguing that the level of emphasis throughout the community should be higher.

  • I don't agree that the term "open access" on its own, or apart from its public definitions, highlights the removal of price barriers and neglects the removal of permission barriers.  There are many ways to make content more widely accessible, or many digital freedoms, and the term "open access" on its own doesn't favor or disfavor any of them.  Even at the BOAI meeting we realized that the term was not self-explanatory and would need to be accompanied by a clear definition and education campaign.
  • The same, BTW, is true for terms like "open content", "open source", and "free software".  If "open source" is better understood than "open access", it's because its precise definition has spread further, not because the term by itself is self-explanatory or because "open access" lacks a precise definition.

PMR: I accept this. In which case I think we have to look for additional tools of discourse. If "open access" serves an important current purpose in a broad sense it should continue to be used in that way but we should not expect it to deliver precision.

  • I do agree that many projects which remove price barriers alone, and not permission barriers, now call themselves OA.  I often call them OA myself.  This is only to say that the common use of the term has moved beyond the strict definitions.  But this is not always regrettable.  For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating.  By going beyond the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship.   This is useful, for we often want to refer to that whole category, not just to the upper end.  When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge.  But we don't always need this extra precision.

PMR: agreed. But "we often need the extra precision" is also valid.

  • In other words:  Yes, most of us are now using the term "OA" in at least two ways, one strict and one loose, and yes, this can be confusing.  But first, this is the case with most technical terms (compare "evolution" and "momentum").  Second, when it's confusing, there are ways to speak more precisely.  Third, it would be at least as confusing to speak with this extra level of precision --distinguishing different ways of removing permission barriers from content that was already free of charge-- in every context.  (I'm not saying that Peter MR thought we should do the latter.)
  • One good way to be precise without introducing terms that might baffle our audience is to use a license.  Each of the CC licenses, for example, is clear in its own right and each removes a different set of permission barriers.  The same is true for the other OA-friendly licenses.  Like Peter MR, I encourage providers to remove permission barriers and to formalize this freedom with a license.  Even if we multiplied our technical terms, it will usually be more effective to point to a license than to a technical term when someone wonders exactly what we mean by OA for a given piece of work.

This is the central and simple point on which we are agreed: we can solve some of our problems without extra tools if we put our minds and energy into it. We aren't yet doing that sufficiently.

Part of the problem arises because in the Green approach to "open access" there is often an implicit trade-off between price freedom and permission freedom. There is toll-free access at the expense of having no permissions other than human readability - all the permissions (other than "fair use") remain with the publisher. Many people may feel that this is a reasonable compromise in journal publishing at the present stage. Some may feel that 100% Green open access is an acceptable endpoint.
But I think it comes with a cost to those of us who wish to develop digital scholarship - the use of the information in scholarship by machines as well as humans. As an example the JISC meeting on institutional repositories  I have just been at was called "Digital Repositories - Dealing with the Digital Deluge".  This is an emotive phrase - but it's currently misleading. In many subjects there is a complete Digital Drought. And unless the permissions issue is dealt with there will continue to be. Permission freedom is essential for digital scholarship.

My concern is that unless we address the permission issue much more actively we shall slide into the acceptance that permission freedom is the exception or less important than price. The one area where we have the power to act unilaterally is those parts of our own scholarship over which we have effective control - theses, data in repositories, teaching/learning materials, technical reports, etc. Let us work to make these 100% permission free.

My immediate urgency is fueled by the ETD2007 meeting tomorrow. I hope that we can find consensus on this issue.

More Open Thesis heroes

I have continued to try to find full OpenAccess theses and encountered considerable difficulty. The main problem is that universities and their repositories do not help readers to find theses with OpenAccess licenses and in many cases they do not give any license information at all.

Anyway the story... I searched Google for "open access creative commons thesis" and found Mathias Klang's thesis on Disruptive Technology. Mathias claims this is the first thesis in Sweden to be issued under CC, so I mailed and asked whether he had information from other countries about earlier theses. He mailed back:

Oleg Evnin at Caltech (successfully defended May 26, 2006) [PMR: blogged by Peter Suber]
...a number of CC-licensed ETDs at the U of Edinburgh and that the earliest seems to be by Magnus Hagdorn, submitted on March 4, 2004.

Many thanks Mathias, and I shall enjoy reading your thesis - this whole area needs some disruptive technology - I am finding that approaches to repositories still look conservative and based on outdated models of thought.

I can't comment in detail on the science but the format of Magnus' thesis is an excellent example of what a modern thesis should contain - it's 400Mbyte zipped but contains splendid animations and data of glaciation - worth a look.

But the problem with the repositories is that there is no indication that the actual thesis is OpenAccess. The Edinburgh repository announces:

All items in ERA are protected by copyright, with all rights reserved.

Copyright for this page [1] belongs to The University of Edinburgh

[1] i.e. the metadata splash page

which discourages the visitor from looking for an Open License within the thesis.

I'm sure this isn't deliberate, but, repository managers, here is a very simple idea:

Add dc:rights to the splash page and metadata and proudly proclaim in large letters:
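On the machine-readable side, here is a minimal sketch of what this could look like. It assumes the common convention of embedding Dublin Core in HTML `<meta>` elements and marking the license link with `rel="license"`; the function names, the CSS class and the example wording are hypothetical and are not taken from any particular repository software.

```python
from html import escape

# Namespace of the Dublin Core element set, which includes dc:rights.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def rights_metadata(statement, license_url):
    """Return <head> elements that carry dc:rights for a splash page."""
    return "\n".join([
        f'<link rel="schema.DC" href="{DC_NAMESPACE}">',
        f'<meta name="DC.rights" content="{escape(statement, quote=True)}">',
        f'<meta name="DC.rights" content="{escape(license_url, quote=True)}">',
    ])


def rights_banner(statement, license_url):
    """Return a visible banner so human readers see the license as well."""
    return (
        f'<p class="rights-banner"><strong>{escape(statement)}</strong> '
        f'<a rel="license" href="{escape(license_url, quote=True)}">License terms</a></p>'
    )


# Hypothetical example wording; a repository would substitute its own.
STATEMENT = "This thesis is licensed under Creative Commons Attribution 2.5"
LICENSE_URL = "https://creativecommons.org/licenses/by/2.5/"

print(rights_metadata(STATEMENT, LICENSE_URL))
print(rights_banner(STATEMENT, LICENSE_URL))
```

The same statement goes to both audiences: the `<meta>` elements for harvesters and search engines, the banner for the visitor.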



Free Culture and Open Theses

As you know I am looking for real Open Access theses (not fuzzy open). Where have I found the most so far? Not in any of the highly supported repositories but in the Harvard College Thesis Repository, part of Harvard College Free Culture - here's their splash page...

Welcome to the Harvard College Thesis Repository

Welcome to the Harvard College Thesis Repository, a project of Harvard College Free Culture! Here Harvard students make their senior theses accessible to the world, for the advancement of scholarship and the widening of open access to academic research.

Too many academics still permit publishers to restrict access to their work, needlessly limiting (cutting in half, or worse) readership, research impact, and research productivity. For more background, check out our op-ed article in The Harvard Crimson.

If you've written a thesis in Harvard College, you're invited to take a step toward open access right here, by uploading your thesis for the world to read. (If you're heading for an academic career, this can even be a purely selfish move—a first taste of the greater readership and greater impact that comes with open access.)

If you're interested in what the students at (ahem) the finest university in the world have to say at the culmination of their undergraduate careers, look around.

There are 28 theses here and - unlike the green fuzzy repositories - all have been deposited under CC-BY (i.e. completely compliant with BOAI). The web page didn't make the license position clear but I got the following clarification today:

Yes--all users of our repository agreed to a CC-by license when they uploaded their
theses.  As part of the submission process, all users agreed to the
following terms:
"I am submitting this thesis, my original work, under the terms of
the Creative Commons Attribution License, version 2.5: roughly, I
grant everyone the freedom to share and adapt this work, so long as
they credit me accurately. I have read and understood this license."
We will work to make this more clear in the metadata for each thesis.

Well done Harvard College Free Culture - you have made an important step forward. Convince students in other institutions to follow your lead and the battle is won.

(Not surprisingly there are no chemistry theses but I am sure that can be fixed).

Useful chemistry thesis in RDF

I shall be using Alicia’s Open Science Thesis in Useful Chemistry as a technical demonstrator at ETD2007. I really want to show how a born digital thesis is a qualitative step forward. Completely new techniques can be used to structure, navigate and mine the information. Here's a taster:

A chemical reaction diagram ("scheme") is a graphic object which looks like this:


As you can see this is semantically useless. A lot of work has gone into this, but none of it is useful to a machine (look closely and you'll see it's a JPEG). Even in the native software which was used to draw it, it is unlikely that the semantics can be easily determined. However XML and RDF allow a complete representation. It took me about 1 hour to handcraft the topology - if we had decent tools it would be seconds. The complete set of reaction schemes (I counted 11 in the thesis) can be easily converted to a single RDF file which looks something like this:

uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .

uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .

(uc: refers to the usefulChemistry namespace, pmr: to mine).
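To show that a machine really can pick this apart, here is a minimal sketch in Python that reads the triples above and lists the reactants and products of each reaction. It assumes the simple one-triple-per-line layout shown (subject, predicate, object, terminating dot) and is not a general RDF parser; the function names are my own for illustration.

```python
# The triples from the listing above, one "subject predicate object ." per line.
TRIPLES_TEXT = """
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .

uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
"""


def parse_triples(text):
    """Split each non-blank line into a (subject, predicate, object) tuple."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith("."):
            raise ValueError(f"missing terminating dot: {line!r}")
        parts = line[:-1].split()
        if len(parts) != 3:
            raise ValueError(f"expected three terms: {line!r}")
        triples.append(tuple(parts))
    return triples


def summarise_reactions(triples):
    """Group reactants and products under the reaction they belong to."""
    reactions = {}
    for subject, predicate, obj in triples:
        if predicate == "pmr:hasReactant":
            key = "reactants"
        elif predicate == "pmr:hasProduct":
            key = "products"
        else:
            continue  # pmr:isA and pmr:hasA describe the scheme, not a reaction
        entry = reactions.setdefault(subject, {"reactants": [], "products": []})
        entry[key].append(obj)
    return reactions


triples = parse_triples(TRIPLES_TEXT)
reactions = summarise_reactions(triples)
for name, parts in reactions.items():
    print(name, parts["reactants"], "->", parts["products"])
```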

There are many Open Source tools for graphing this and here is part of the output of one from the W3C


Here you can see that reaction1.1a has four reactants (compound 1,2,3,4) and 1 product (comp 5). Comp5 is the reactant for another reaction (clipped to save blog problems). The complete picture for the whole thesis looks like this:

and (assuming you have a large screen) you can see immediately what reactions every compound is involved in.
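The same lookup does not need a large screen. As a rough sketch (my own illustration, using only the reactant and product triples from scheme 1.1, written out as Python tuples), the triples can be inverted into an index from each compound to the reactions it takes part in:

```python
from collections import defaultdict

# Reactant and product triples from scheme 1.1, as (reaction, predicate, compound).
TRIPLES = [
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp1"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp2"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp3"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp4"),
    ("uc:rxn1_1a", "pmr:hasProduct", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasReactant", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasProduct", "uc:comp6"),
]

# Map each predicate to the role the compound plays in the reaction.
ROLES = {"pmr:hasReactant": "reactant", "pmr:hasProduct": "product"}


def reactions_by_compound(triples):
    """Return {compound: [(reaction, role), ...]} in input order."""
    index = defaultdict(list)
    for reaction, predicate, compound in triples:
        role = ROLES.get(predicate)
        if role is not None:
            index[compound].append((reaction, role))
    return dict(index)


index = reactions_by_compound(TRIPLES)
for compound in sorted(index):
    print(compound, index[compound])
```

For the whole thesis the only change would be a longer list of triples.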

That's only the start as it is possible to ask sophisticated questions from a SPARQL endpoint - and that's where we are going next...

... IFF you make the theses true Open Access
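To give a flavour of the kind of question meant here: "which compounds are made in one reaction and used in another?" The sketch below is an illustration only. The SPARQL text is what such a query might look like (the PREFIX declarations are left out because the namespace URIs are not given in this post), and the Python function is a hand-written stand-in for the same join, not a SPARQL engine.

```python
# Illustrative query text only; it is not executed by this script.
SPARQL_SKETCH = """
SELECT ?compound ?maker ?user
WHERE {
  ?maker pmr:hasProduct  ?compound .
  ?user  pmr:hasReactant ?compound .
}
"""

# Reactant and product triples from scheme 1.1, as (reaction, predicate, compound).
TRIPLES = [
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp1"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp2"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp3"),
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp4"),
    ("uc:rxn1_1a", "pmr:hasProduct", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasReactant", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasProduct", "uc:comp6"),
]


def find_intermediates(triples):
    """Return (compound, maker, user) for compounds produced by one
    reaction and consumed by another, mirroring the two-pattern join."""
    made_by = {}  # compound -> reactions that produce it
    used_by = {}  # compound -> reactions that consume it
    for reaction, predicate, compound in triples:
        if predicate == "pmr:hasProduct":
            made_by.setdefault(compound, []).append(reaction)
        elif predicate == "pmr:hasReactant":
            used_by.setdefault(compound, []).append(reaction)
    return [
        (compound, maker, user)
        for compound, makers in made_by.items()
        for maker in makers
        for user in used_by.get(compound, [])
    ]


intermediates = find_intermediates(TRIPLES)
for compound, maker, user in intermediates:
    print(f"{compound}: made in {maker}, used in {user}")
```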