Stevan Harnad on "open access"

Stevan Harnad – a tireless evangelist of OA – has replied to my points. He has been consistent in arguing the logic below and I agree with the logic. The problem is that few people believe that this allows us to act as he suggests.
Stevan argues that current Green Open Access allows us to do all we wish with the exposed material without permission. However, when I spoke to several repository managers at the JISC meeting all were clear that I could not have permission to do this with their current content. I asked “can my robots download and mine the content in your current open access repository of theses?” – No. “Can you let me have some chemistry theses from your open access collection so I can data-mine them?” – No – you will have to ask the permission of each author individually. So Stevan’s views on what I can do seem not to be – unfortunately – widely held.

  1. Stevan Harnad Says:
    June 12th, 2007 at 3:37 am
    Open Access: What Comes With the Territory
    Peter Murray-Rust’s worries about OA are groundless. Peter worries he can’t be sure that:

    “I can save my own copy (the MIT [site] suggests you cannot print it and may not be allowed to save it)”

    Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]

    “that it will be available next week”

    It will. The University OA IRs all see to that. That’s why they’re making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]

    “that it will be unaltered in the future or that versions will be tracked”

    Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.

    “that I can create derivative works”

    You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).

    “that I can use machines to text- or data-mine it”

    Yes, you can. Download and crunch away.
    This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia (all the wrong stuff, and irrelevant to OA).
    Stevan Harnad
    American Scientist Open Access Forum

Specific issues:
My concern was not just with material in repositories but with material elsewhere. Some publishers allow green open access posting on web sites but debar it from repositories. So the concerns remain.
The MIT repository deliberately adds technical restrictions on printing its theses, and this also technically prevents data and text mining. There are some hacks possible to get round this but they come close to dishonesty and illegality.
“Derivative works” is a phrase that doesn’t work well in the data-rich subjects and we need something better. But it’s what the licenses use at present.
In data-rich subjects linking to repositories is often of little use. I need thousands of texts on specialist machines accessed with high frequency and bandwidth.
My problem is not with Stevan’s views but that few others give positive support to them, particularly not the repository managers. Maybe I’m too cautious…

Posted in etd2007, open issues | 1 Comment

More on "open access"

I recently posted my concern about the use of “open access” as a phrase which is sufficiently broad to be confusing, and Peter Suber has created a thoughtful and useful reply. I agree in detail with all his analysis and any differences are probably in emphasis and strategy.

Peter Murray-Rust, “open access” is not good enough, A Scientist and the Web, June 10, 2007.  Excerpt:
Comments [PeterS]

  • I agree with much but not all of what Peter MR says.  I’m responding at length because I’ve often had many of the same thoughts.
  • I’m the principal author of the BOAI definition of OA, and I still support it in full.  Whenever the occasion arises, I emphasize that OA removes both price and permission barriers, not just price barriers.  I also emphasize that the other major public definitions of OA (from Bethesda and Berlin) have similar requirements.

PMR: Agreed. PeterS continually and consistently asserts this – I am arguing that the level of emphasis throughout the community should be higher.

  • I don’t agree that the term “open access” on its own, or apart from its public definitions, highlights the removal of price barriers and neglects the removal of permission barriers.  There are many ways to make content more widely accessible, or many digital freedoms, and the term “open access” on its own doesn’t favor or disfavor any of them.  Even at the BOAI meeting we realized that the term was not self-explanatory and would need to be accompanied by a clear definition and education campaign.
  • The same, BTW, is true for terms like “open content”, “open source”, and “free software”.  If “open source” is better understood than “open access”, it’s because its precise definition has spread further, not because the term by itself is self-explanatory or because “open access” lacks a precise definition.

PMR: I accept this. In which case I think we have to look for additional tools of discourse. If “open access” serves an important current purpose in a broad sense it should continue to be used in that way but we should not expect it to deliver precision.

  • I do agree that many projects which remove price barriers alone, and not permission barriers, now call themselves OA.  I often call them OA myself.  This is only to say that the common use of the term has moved beyond the strict definitions.  But this is not always regrettable.  For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating.  By going beyond the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship.   This is useful, for we often want to refer to that whole category, not just to the upper end.  When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge.  But we don’t always need this extra precision.

PMR: agreed. But “we often need the extra precision” is also valid.

  • In other words:  Yes, most of us are now using the term “OA” in at least two ways, one strict and one loose, and yes, this can be confusing.  But first, this is the case with most technical terms (compare “evolution” and “momentum”).  Second, when it’s confusing, there are ways to speak more precisely.  Third, it would be at least as confusing to speak with this extra level of precision –distinguishing different ways of removing permission barriers from content that was already free of charge– in every context.  (I’m not saying that Peter MR thought we should do the latter.)
  • One good way to be precise without introducing terms that might baffle our audience is to use a license.  Each of the CC licenses, for example, is clear in its own right and each removes a different set of permission barriers.  The same is true for the other OA-friendly licenses.  Like Peter MR, I encourage providers to remove permission barriers and to formalize this freedom with a license.  Even if we multiplied our technical terms, it will usually be more effective to point to a license than to a technical term when someone wonders exactly what we mean by OA for a given piece of work.

This is the central and simple point on which we are agreed – we can solve some of our problems without extra tools if we put our minds and energy into it. We aren’t yet doing that sufficiently.
Part of the problem arises because in the Green approach to “open access” there is often an implicit trade-off between price freedom and permission freedom. There is toll-free access at the expense of having no permissions other than human readability – all the permissions (other than “fair use”) remain with the publisher. Many people may feel that this is a reasonable compromise in journal publishing at the present stage. Some may feel that 100% Green open access is an acceptable endpoint.
But I think it comes with a cost to those of us who wish to develop digital scholarship – the use of the information in scholarship by machines as well as humans. As an example, the JISC meeting on institutional repositories I have just been at was called “Digital Repositories – Dealing with the Digital Deluge”. This is an emotive phrase – but it’s currently misleading. In many subjects there is a complete Digital Drought. And unless the permissions issue is dealt with there will continue to be. Permission freedom is essential for digital scholarship.
My concern is that unless we address the permission issue much more actively we shall slide into the acceptance that permission freedom is the exception or less important than price. The one area where we have the power to act unilaterally is those parts of our own scholarship over which we have effective control – theses, data in repositories, teaching/learning materials, technical reports, etc. Let us work to make these 100% permission free.
My immediate urgency is fueled by the ETD2007 meeting tomorrow. I hope that we can find consensus on this issue.

Posted in etd2007, open issues | Leave a comment

More Open Thesis heroes

I have continued to try to find full OpenAccess theses and encountered considerable difficulty. The main problem is that universities and their repositories do not help readers to find theses with OpenAccess licenses and in many cases they do not give any license information at all.
Anyway the story… I searched Google for “open access creative commons thesis” and found Mathias Klang’s thesis on Disruptive Technology. Mathias claims this is the first thesis in Sweden to be issued under CC, so I mailed and asked whether he had information from other countries about earlier theses. He mailed back:

Oleg Evnin at Caltech (successfully defended May 26, 2006) [PMR: blogged by Peter Suber]
…a number of CC-licensed ETDs at the U of Edinburgh and that the earliest seems to be by Magnus Hagdorn, submitted on March 4, 2004.

Many thanks Mathias, and I shall enjoy reading your thesis – this whole area needs some disruptive technology – I am finding that approaches to repositories still look conservative and based on outdated models of thought.
I can’t comment in detail on the science but the format of Magnus’ thesis is an excellent example of what a modern thesis should contain – it’s 400Mbyte zipped but contains splendid animations and data of glaciation – worth a look.
But the problem with the repositories is that there is no indication that the actual thesis is OpenAccess. The Edinburgh repository announces:

All items in ERA are protected by copyright, with all rights reserved.
Copyright for this page [1] belongs to The University of Edinburgh
[1] i.e. the metadata splash page

which discourages the visitor from looking for an Open License within the thesis.
I’m sure this isn’t deliberate, but, repository managers, here is a very simple idea:
Add dc:rights to the splash page and metadata and proudly proclaim in large letters:
THIS THESIS CARRIES A CREATIVE COMMONS LICENCE – ENJOY!
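
To make the idea concrete, here is a minimal sketch of how a splash page generator might emit the rights statement. It assumes the common convention of expressing Dublin Core in HTML `meta` and `link` elements; the function names, the example author and the choice of CC-BY 2.5 are illustrative assumptions, not a description of what any particular repository software does.

```python
from html import escape

# Dublin Core element set namespace, declared once per page via a <link> element.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def rights_meta_tags(title, creator, licence_name, licence_url):
    """Return <head> lines that declare title, creator and rights for a splash page."""
    fields = [
        ("DC.title", title),
        ("DC.creator", creator),
        ("DC.rights", f"{licence_name} ({licence_url})"),
    ]
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for name, content in fields:
        # Escape the value so that quotes or ampersands cannot break the attribute.
        lines.append(f'<meta name="{name}" content="{escape(content, quote=True)}">')
    return "\n".join(lines)


def banner(licence_name):
    """Human-readable line to show prominently on the splash page."""
    return f"THIS THESIS CARRIES A {licence_name.upper()} LICENCE"


# Hypothetical example values, for illustration only.
print(rights_meta_tags(
    "Example thesis title",
    "A. N. Author",
    "Creative Commons Attribution 2.5",
    "http://creativecommons.org/licenses/by/2.5/",
))
print(banner("Creative Commons"))
```

The point of emitting both forms is that the banner tells the human reader and the `DC.rights` element tells the robot, so neither has to open the thesis to discover the licence.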

 
Posted in etd2007, open issues | Leave a comment

Free Culture and Open Theses

As you know I am looking for real Open Access theses (not fuzzy open). Where have I found the most so far? Not in any of the highly supported repositories but in the Harvard College Thesis Repository, part of Harvard College Free Culture – here’s their splash page…

Welcome to the Harvard College Thesis Repository

Welcome to the Harvard College Thesis Repository, a project of Harvard College Free Culture! Here Harvard students make their senior theses accessible to the world, for the advancement of scholarship and the widening of open access to academic research.
Too many academics still permit publishers to restrict access to their work, needlessly limiting—cutting in half, or worse—readership, research impact, and research productivity. For more background, check out our op-ed article in The Harvard Crimson.
If you’ve written a thesis in Harvard College, you’re invited to take a step toward open access right here, by uploading your thesis for the world to read. (If you’re heading for an academic career, this can even be a purely selfish move—a first taste of the greater readership and greater impact that comes with open access.)
If you’re interested in what the students at (ahem) the finest university in the world have to say at the culmination of their undergraduate careers, look around.

There are 28 theses here and – unlike the green fuzzy repositories – all have been deposited under CC-BY (i.e. completely compliant with BOAI). The web page didn’t make the license position clear but I got the following clarification today:

Yes–all users of our repository agreed to a CC-by license when they uploaded their
theses.  As part of the submission process, all users agreed to the
following terms:
“I am submitting this thesis, my original work, under the terms of
the Creative Commons Attribution License, version 2.5: roughly, I
grant everyone the freedom to share and adapt this work, so long as
they credit me accurately. I have read and understood this license.”
We will work to make this more clear in the metadata for each thesis.

Well done Harvard College Free Culture – you have made an important step forward. Convince students in other institutions to follow your lead and the battle is won.
(Not surprisingly there are no chemistry theses but I am sure that can be fixed).

Posted in etd2007, open issues | Leave a comment

Useful chemistry thesis in RDF

I shall be using Alicia’s Open Science Thesis in Useful Chemistry as a technical demonstrator at ETD2007. I really want to show how a born digital thesis is a qualitative step forward. Completely new techniques can be used to structure, navigate and mine the information. Here’s a taster:
A chemical reaction diagram (“scheme”) is a graphic object which looks like this:
udc_scheme2.JPG
As you can see this is semantically useless. A lot of work has gone into this, but none of it is useful to a machine (look closely and you’ll see it’s a JPEG). Even in the native software which was used to draw it, it is unlikely that the semantics can be easily determined. However XML and RDF allow a complete representation. It took me about 1 hour to handcraft the topology – if we had decent tools it would be seconds. The complete set of reaction schemes (I counted 11 in the thesis) can be easily converted to a single RDF file which looks something like this:
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
(uc: refers to the usefulChemistry namespace, pmr: to mine).
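
To show why this representation is useful to a machine, here is a minimal sketch in plain Python that reads the ten statements above and indexes them by compound. It is a toy line-by-line reader that assumes one "subject predicate object ." statement per line; it is not a real RDF parser, and the function names are my own invention for illustration.

```python
from collections import defaultdict

# The statements from the scheme above, copied verbatim.
TRIPLES = """\
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
"""


def parse_triples(text):
    """Split each non-empty line into a (subject, predicate, object) tuple."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the terminating " ." and split on whitespace.
        subject, predicate, obj = line.rstrip(" .").split()
        triples.append((subject, predicate, obj))
    return triples


def reactions_by_compound(triples):
    """Map each compound to the reactions it takes part in, with its role."""
    roles = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "pmr:hasReactant":
            roles[obj].append((subject, "reactant"))
        elif predicate == "pmr:hasProduct":
            roles[obj].append((subject, "product"))
    return dict(roles)


def intermediates(triples):
    """Compounds made by one reaction and consumed by another."""
    produced = {o for _, p, o in triples if p == "pmr:hasProduct"}
    consumed = {o for _, p, o in triples if p == "pmr:hasReactant"}
    return sorted(produced & consumed)


triples = parse_triples(TRIPLES)
print(reactions_by_compound(triples)["uc:comp5"])
print(intermediates(triples))
```

For the statements above, comp5 shows up as the product of rxn1_1a and a reactant of rxn1_1b, which is exactly the sort of fact that cannot be recovered from the JPEG.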
There are many Open Source tools for graphing this and here is part of the output of one from the W3C
scheme1.png
Here you can see that reaction1.1a has four reactants (compound 1,2,3,4) and 1 product (comp 5). Comp5 is the reactant for another reaction (clipped to save blog problems). The complete picture for the whole thesis looks like this:
reactions1.png
and (assuming you have a large screen) you can see immediately what reactions every compound is involved in.
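
A picture of this kind can be produced with very little code. The sketch below is not the W3C tool used for the images here; it is an assumption-laden illustration that turns (subject, predicate, object) tuples into Graphviz DOT text, which a program such as `dot` could then lay out. The function name and the short sample list are hypothetical.

```python
def triples_to_dot(triples, name="scheme"):
    """Render triples as a DOT digraph: one labelled arc per statement."""
    lines = [f"digraph {name} {{"]
    for subject, predicate, obj in triples:
        # Quote node names because they contain colons.
        lines.append(f'  "{subject}" -> "{obj}" [label="{predicate}"];')
    lines.append("}")
    return "\n".join(lines)


# A few of the statements from the scheme, as tuples.
SAMPLE = [
    ("uc:rxn1_1a", "pmr:hasReactant", "uc:comp1"),
    ("uc:rxn1_1a", "pmr:hasProduct", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasReactant", "uc:comp5"),
    ("uc:rxn1_1b", "pmr:hasProduct", "uc:comp6"),
]

dot_text = triples_to_dot(SAMPLE)
print(dot_text)
```

Saving the printed text to a file and running it through Graphviz would give a diagram in which comp5 sits between the two reactions.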
That’s only the start as it is possible to ask sophisticated questions from a SPARQL endpoint – and that’s where we are going next…
… IFF you make the theses true Open Access

Posted in chemistry, etd2007, open issues, XML | 3 Comments

"open access" is not good enough

I have ranted at regular intervals about the use of “Open Access” or often “open access” as a term implying more than it delivers. My current concern is that although there are tens of thousands of theses described as “open access” I have only discovered 3 (and possibly another 15 today) which actually comply with the BOAI definition of Open Access.
The key point is that unless a thesis (or any publication) explicitly carries a license (or possibly a site meta-license) actually stating that it is BOAI compliant, then I cannot re-use it. I shall use “OpenAccess” to denote BOAI-compliant in this post and “open access” to mean some undefined access which may only allow humans to read but not re-use the information.
I do not wish to disparage the important efforts to make scholarly information more widely available, and I applaud the general direction and achievement of the groups below. I appreciate that the copyright of historical content normally is held by the student author and it’s certainly very valuable to have “access” to it. But it is not OpenAccess. And unless specific policies are put in place to add specific BOAI-compliant licenses then future theses will also be non-compliant.
Here are typical statements:

  • “EThOS will make UK theses available on open access for global use”. Having spoken to EThOS colleagues last week it is clear that “open access” does not automatically mean OpenAccess.  (Electronic theses in the UK: the open access future : JISC).
  • MIT theses: “Regardless of whether copyright is held by the student or the Institute, the MIT Libraries publish the thesis electronically allowing open access viewing and limited downloading/printing. See http://dspace.mit.edu.” The term “open access viewing” might suggest the theses are BOAI-compliant and therefore is potentially misleading. I found that the “public” thesis had been mounted with “printing disabled” which means that it cannot be technically re-used (as well as being legally non-reusable)
  • ECS Soton: A well-known thesis is in e-form: (ECS EPrints Service – Evaluating Research Impact through Open Access to Scholarly Communication) (Brody, T). It offers two sorts of download: “(i) PDF – and (ii) Other (Latex) Access restricted to members of ECS [i.e. Soton only]”. This is a differential distribution of scholarship. I also could not find any license or copyright statement.

By contrast let’s look at “Open Source” which applies to software and has been highly successful in liberating the field. It’s very widely used in academia and elsewhere. The Open Source Definition states

Open source doesn’t just mean access to the source code. [PMR’s emphasis] The
distribution terms of open-source software must comply with
the following criteria [PMR’s elisions]:
1. Free Redistribution
The license shall not restrict any party from selling or
giving away the software as a component of an aggregate
software distribution containing programs from several
different sources. The license shall not require a
royalty or other fee for such sale.
2. Source Code
The program must include source code […]
3. Derived Works
The license must allow modifications and derived works, and must
allow them to be distributed under the same terms as the license
of the original software.
4. Integrity of The Author’s Source Code
The license may restrict source-code from being distributed in
modified form only if the license allows […]
[…]
7. Distribution of License
The rights attached to the program must apply to all to whom
the program is redistributed without the need for execution of
an additional license by those parties.
[…]
10. License Must Be Technology-Neutral
No provision of the license may be predicated on any individual
technology or style of interface.

In general the term “Open Source” is completely self-explanatory within a large community. I can describe my software as OS and everyone understands what I mean. There are some licenses (e.g. GPL) which require additional freedoms but they don’t invalidate the above.
By contrast if someone describes something as “open access” it simply means that I may – as a human – and at some arbitrary time in human history – read the document. It does not guarantee that

  • I can save my own copy (the MIT site suggests you cannot print it and may not be allowed to save it)
  • that it will be available next week
  • that it will be unaltered in the future or that versions will be tracked
  • that I can create derivative works
  • that I can use machines to text- or data-mine it

So I believe that “open access” should be recast as “toll-free” – i.e. you do not have to pay for it but there are no other guarantees. We should restrict the use of “Open Access” to documents which explicitly carry licenses compliant with BOAI. [A weaker (and much more fragile) approach is that a site license applies to all content. The problem here is that documents then get decoupled from the site and their OpenAccess position is unknown.]
If the community wishes to continue to use “open access” to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:

  • “open access (CC-BY)” – explicitly carries CC-BY license
  • “open access (BOAI)” – author/site wishes to assert BOAI-nature of document(s) without specific license
  • “open access (FUZZY)” – fuzzy licence (or more commonly absence of licence) for document or site without any guarantee of anything other than human visibility at current time. Note that “Green” open access falls into this category. It might even be that we replace the word FUZZY by GREEN, though the first is more descriptive.

However there is no value in “Green open access” for theses. Let’s make sure they are all BOAI compliant.

Posted in etd2007, open issues | 7 Comments

OSCAR eats an Open thesis

As regular readers will know we are applying text-mining to chemistry in Open theses. The problem is finding fully Open theses – so far we have got Alicia’s. Alicia has captured all her molecules in semantic form so text-mining isn’t required – and I’m hoping to do some fun stuff with XML on it.
I’ve searched for large collections of theses. MIT has a promising collection which would be ideal but they are only TollFree, not OpenAccess. I’m still appealing for readers to help. But in one of those quirks of Googling I ended up at the digital repository of the University of Stirling.

I am delighted about this since I spent 15 very happy years on the staff at Stirling in the Chemistry department. It doesn’t have one now – that’s why I left 🙁 – although we’re having our 40th anniversary later. But perhaps there are theses with chemical concepts. Now the repository announces:

The copyright in theses in this collection remains with the author, unless it is stated to have been assigned to the University of Stirling. The University of Stirling reserves the right to keep electronic copies for consultation in both cases.

so I wasn’t very hopeful. But I thought I’d have a look and found one in aquaculture – one of the successful disciplines in Stirling:

Feeding behaviour of the prawn Macrobrachium rosenbergii as an indicator of pesticide contamination in tropical freshwater.

which carried the licence:

This item is protected by original copyright

Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.

still not hopeful, until I read the license:

License granted by Kriengkrai Satapornvanit (ffiskks@ku.ac.th) on 2007-03-26T06:34:59Z (GMT):
[...]
END USER LICENCE
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 1.0 Licence.
YOU ARE FREE:
- to copy, distribute, display, and perform the work
- to make derivative works
Under the following conditions:
ATTRIBUTION
You must give the original author credit.
NON COMMERCIAL
You may not use this work for commercial purposes.
SHARE ALIKE
If you alter, transform, or build upon this work, you may distribute
the resulting work only under a licence identical to this one.
For any reuse or distribution, you must make clear to others the
licence terms of this work. Any of these conditions can be waived
if you receive permission from the author. Your fair dealings and
other rights are in no way affected by the above.

Many public thanks, Kriengkrai Satapornvanit and I hope your future work prospers. Now I am completely free to see if chemicals can be mined from the thesis:

  • I download the PDF.
  • I convert it to ASCII using pdftotext. This destroys the formatting, diacritics, tables, subscripts, superscripts – in fact almost everything except the words. And – for most theses – these are still in the right order. (Unlike Eric Morecambe, this is not a joke – PDF often has the words in an arbitrary order).
  • I start OSCAR (http://oscar3.sf.net) as a Server and paste in the text. It takes about 30 seconds for OSCAR to read the whole thesis and interpret the chemistry. (This portable version of OSCAR does not have a complete lexicon – full versions need to be run on a server).
  • OSCAR (originally dubbed the “journal eating-robot”) eats the text. Here’s a typical section:

stirling0.PNG
You can see that OSCAR has recognised many words as likely chemical terms (in yellow) and knows the structure of the underlined ones (the full version would know all of them). It’s not 100% accurate – you can see it thinks “P” is the element phosphorus – but Peter Corbett has addressed this in later versions.
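
The steps above can be sketched roughly as follows. The first function simply shells out to the `pdftotext` command-line tool if it is installed. The second is emphatically not OSCAR and does not use its API: it is a toy dictionary lookup with a three-word lexicon that I have made up, included only to show how a bare “P” can be mistaken for phosphorus. The example sentence, with “P” as a statistical probability, is my guess at how such a false positive might arise, not a quotation from the thesis.

```python
import re
import shutil
import subprocess


def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text with pdftotext; return False if the tool is absent."""
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
    return True


# Toy lexicon (an assumption for illustration); a real lexicon is vastly larger.
TOY_LEXICON = {
    "acetone": "solvent",
    "ethanol": "solvent",
    "P": "element phosphorus",  # element symbols are matched case-sensitively
}

TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def tag_terms(text, lexicon=TOY_LEXICON):
    """Return (word, meaning) pairs for every token found in the lexicon."""
    hits = []
    for match in TOKEN.finditer(text):
        word = match.group()
        # Try the exact form first (for symbols such as "P"), then lower case.
        key = word if word in lexicon else word.lower()
        if key in lexicon:
            hits.append((word, lexicon[key]))
    return hits


sentence = "Prawns were exposed to acetone controls; mortality differed (P < 0.05)."
print(tag_terms(sentence))
```

A context-free lookup like this has no way of knowing that the “P” is statistics rather than chemistry, which is why later versions need to look at the surrounding words.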

So this allows us to collect metadata from theses automatically. OSCAR can tell us in a few seconds that this thesis is concerned with specific pesticides. That’s part of the basis of the SPECTRa-T project. Since we’ve benefited from Open Source theses, maybe we should do the whole project on an Open Wiki…

Posted in chemistry, etd2007, open issues | 1 Comment

Alicia's Open Science Thesis

Jean-Claude Bradley and coworkers have pioneered the concept of Open Science in chemistry – and it goes beyond that. On UsefulChem he writes:

The fact that Alicia’s masters thesis “Synthesis of Diketopiperazines, Possible Malaria Enoyl Reducatase Inhibitors Using Open Source Science” is being written on a wiki was noted by Pharyngula, A Blog around the Clock and Pimm – Partial Immortalization.
I am particularly happy that Attila from Pimm has obtained permission from his supervisor to write at least part of his thesis on his blog. Outside of the sciences, I recall Mark Wagner doing something similar for his thesis on educational gaming. Also see Laura Blankenship’s thesis on blogging in the classroom.

Yes – there has been a lot of interest in this innovative approach and I’m delighted to echo it. Since they wish this to be an open process here are my comments directly for Alicia to use if she wishes:

  • I didn’t see any license on the thesis. I’m hoping it is Creative Commons sharealike (like J-C’s blog) but it’s not explicit. If that is confirmed I can highlight the thesis at Uppsala next week
  • Do all the molecules have machine-readable connection tables? I noticed some that link through to SMILES – but do they all? And it would be useful to have InChIs so we could search the thesis.
  • Much of the data is rendered as bitmaps of rather low quality. Could we have the reaction schemes and particularly the spectra in greater clarity? (Obviously I’d prefer CML or SVG or JCAMP… )

My immediate technical goal would be the creation of a datument (everything in XML) for the thesis – I’m not going to do all that myself. But I would be keen to see the reaction sequences in animated SVG…
The same goes, of course, for anyone else writing Open theses.

Posted in blueobelisk, chemistry, data, open issues | 5 Comments

JISC meeting on institutional repositories

JISC has issued a summary on the conference I have just attended, About Digital repositories: Dealing with the digital deluge (Manchester, June 5-6, 2007). The summary deals with the plenary sessions – I might have commented if this blog had been awake. So these are fuzzier thoughts from two days on.
There was a welcome emphasis on the thesis. This is the one area in which institutions have complete control over the scholarly process. (Not true if the thesis – as in some European countries – consists of bound published papers, so this is a Britocentric viewpoint). So – and we have a very small window of opportunity – the institutions can really show how publishing – scientific communication – SHOULD be done. I’ve blogged about this before but it means really cutting adrift from the printed page, the static moment of publication, even the role of the team rather than individual, the value of data … etc. Can I be optimistic? Please let’s try something soon.
There was a major shift from repositories of text (exemplified by ePrints/DSpace and other current technologies where only holy PDF matters) to DATA. Examples were Keith Jeffery’s presentation and a fine session from David Shotton of Oxford on image repositories. Not surprisingly chemistry got a prominent airing (Simon Coles and ourselves).
I am still very worried about our gift of intellectual property to publishers. In discussion about images some people assume that an image in a published paper is the copyright of the creator. Well we know that Wiley doesn’t take this view and I suspect few other non-OpenAccess publishers do. BUT there are technical solutions. If we use rich formats (XML, SVG, even GIF) we can embed rights information in the object. So here is a simple idea:

Adapt our image generation and processing software so that it always contains a license specifying the author and asserting Science Commons or Creative Commons licenses. Allow the author to change this away from default – I suspect few will do so. Then, when the images reach the publisher they can only reassign the image copyright by specifically asking the author. This will become rather unpopular…
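
For SVG this needs very little machinery. The following is a minimal sketch using only the Python standard library; it assumes the licence is recorded as RDF inside the SVG `metadata` element using the Dublin Core and Creative Commons namespaces, and the function names, example author and default licence are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
CC = "http://creativecommons.org/ns#"

# Register prefixes so the serialised file is readable.
for prefix, uri in (("", SVG), ("rdf", RDF), ("dc", DC), ("cc", CC)):
    ET.register_namespace(prefix, uri)

# Default applied unless the author chooses otherwise (an assumed default).
DEFAULT_LICENCE = "http://creativecommons.org/licenses/by/2.5/"


def add_licence(svg_root, author, licence_url=DEFAULT_LICENCE):
    """Attach author and licence to an SVG element tree as embedded RDF."""
    metadata = ET.SubElement(svg_root, f"{{{SVG}}}metadata")
    rdf = ET.SubElement(metadata, f"{{{RDF}}}RDF")
    work = ET.SubElement(rdf, f"{{{CC}}}Work", {f"{{{RDF}}}about": ""})
    ET.SubElement(work, f"{{{DC}}}creator").text = author
    ET.SubElement(work, f"{{{CC}}}license", {f"{{{RDF}}}resource": licence_url})
    return svg_root


def read_licence(svg_root):
    """Return the embedded licence URL, or None if the image carries none."""
    element = svg_root.find(f".//{{{CC}}}license")
    return None if element is None else element.get(f"{{{RDF}}}resource")


# Hypothetical image and author, for illustration only.
image = ET.Element(f"{{{SVG}}}svg", {"width": "100", "height": "100"})
add_licence(image, "A. N. Author")
serialised = ET.tostring(image, encoding="unicode")
print(serialised)
```

Because the licence travels inside the file, it survives being detached from the author’s web site, and anyone wishing to reassign the rights has to remove it deliberately.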

And while in discursive mode: “why do we need institutional repositories?” I used to think it was obvious – now I think it’s rather confusing. And I tried to answer the question “why does a scientist need institutional repositories?” and haven’t come up with a very good answer. So here are some facets – see what you think:
Political:

  • Universities wish to advertise their success to applicants, funders, reviewers (RAE, etc.). Fair enough – they pay our salaries, and I have an obligation within reason – but it doesn’t directly help me.
  • Libraries need to reinvent themselves and repositories are an important area where they can do this. Generally not very compelling for the average scientist.

Financial:

  • Repositories are new and so there is funding. That’s valuable for us (thank you JISC) in developing new informatics strategies and tools. But not very general.
  • Repositories will generate a new funding opportunity for data curation/archival etc. Perhaps.

Scholarship and scientific practice

  • Scientists will be able to archive their own data, funded by the institution. That’s compelling. Often the users of public databases are those who originally deposited the data, lost it, and then found it again in the repository. But are universities the best place to put scientific data? Can they all support chemistry, archaeology, astronomy, etc.? These are hard. They are much more likely to happen in domain-specific repositories. No reason why these shouldn’t be supported by LIS staff, but you won’t find them in every institution.
  • Liberation of data (à la SPECTRa). I argue that this is much more likely to happen at a departmental level than at an institutional one. There was a general agreement that mandates don’t work well and that incentives are critical. The normal unit of loyalty is the research group or department.
  • Sharing data. Yes, but this is primarily cultural and is much more likely to be encouraged at domain level.

And I had a general worry. We are seeing the enormous force of world-centric technology – Google, Flickr, and the new RDF/SPARQL enthusiasts. This is on a scale orders of magnitude above what institutions can do. Also the blogosphere can often move much more quickly than the average IT project. So we are always in danger of being overtaken and overwhelmed by the collaborative tools and practices “out there”. We need to do exactly those things – and no more – that we are specialists at – education, scientific research, domain-specific ontologies, etc.
So at the end of the meeting the audience was asked for questions and suggestions. It’s too late for the current JISC funding round (I think) but I suggested that we award some modest bursaries directly to graduate students and undergraduates to build their ideas of information management systems for their own requirements (e.g. thesis creation). After all, wasn’t there a fairly large Internet company founded by students? – give them their chance.

Posted in data, open issues | 2 Comments

Chemistry/science theses urgently wanted!

As I have blogged (Electronic Theses (ETD2007) – June 8th, 2007) I shall be demonstrating the power of the eThesis next week at Uppsala. We now have technology that will identify the chemistry in a thesis and automatically re-use it in many ways. These include:

  • machine extraction of metadata/terminology
  • identification of named entities (especially chemical names)
  • validation of the contents of the thesis
  • validation of the structure of the thesis
  • conversion of the thesis into different formats (e.g. to use SI units)
  • comparison of similarity between theses
  • linking theses to existing ontologies and other resources

For example we could see a thesis repository as a SPARQL endpoint.
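
As an illustration of one item in the list above, comparison of similarity between theses, here is a minimal sketch. It assumes that each thesis has already been reduced to a set of extracted chemical terms, and it uses Jaccard similarity, which is simply my choice for the example and not necessarily what our software does. The term sets are invented.

```python
def jaccard(terms_a, terms_b):
    """Shared terms divided by all terms; 0.0 when both sets are empty."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical term sets for two theses.
thesis_1 = {"acetone", "ethanol"}
thesis_2 = {"ethanol", "benzene"}

print(round(jaccard(thesis_1, thesis_2), 3))
```

With real theses the sets would come from the named-entity step, and a score near 1 would suggest two theses working with much the same chemistry.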
However most existing theses, even if publicly visible, are not automatically re-usable without explicit permission. I’d very much like to have a few exemplars which I can show at the Uppsala meeting next week.
I’d be very grateful if any reader(s) have a thesis (possibly their own) or a collection of theses (ideally already posted on the web) which:

  • we can use for text-mining without further permission
  • has an attribution for author and institution
  • is likely to contain a significant amount of chemistry [1]
  • is electronic and machine-readable (Word, PDF, NOT TIFF)
  • can be made immediately available

[1] many scientific areas which are not themselves chemistry (bioscience, materials, geoscience, environment) may contain chemical terminology – e.g. in methods and materials.
The purpose of this request is to develop ways of enhancing the value of theses as mentioned above and in general we would not expect our software to “discover new science” or to explicitly criticize the thesis – that would be unkind. (Although when eTheses become more common you can expect this to become more common!)
(Please see WWMM for email address)

Posted in chemistry, etd2007 | Leave a comment