petermr's blog

#ILI2009: Why can't I search Institutional Repositories?

Posted on June 20, 2009 by pm286

I was just about to blog on the way that institutional repositories hide information rather than advertise it, when I found my thoughts had been anticipated by Dave Haden, Open access search?, Jurn blog, June 12, 2009. He puts the simple question

Pouring out all this open access content is all very well, but where’s the competition and development in open access search?

And where are the simple common standards for flagging open content for search-engine discovery and sorting, for that matter?

He’s absolutely right – and it’s shameful that academia has almost no systematic search for its content. In 1994 the first search engine “the Jump Engine” was developed at the University of Stirling, and there were offerings from national labs, etc. but these were soon taken over by Altavista (yes, there was life before Google). This was when the web was open and spam-free.

I’m making a partial exception for my collaborators at PennState, Lee Giles and colleagues who have developed CiteSeer and ChemXSeer. But those are mainly aimed at formally published scholarly articles and little at institutional content.

Now of course I’m viewing things from the outside, as an independent curator and social entrepreneur, not a librarian or OA evangelist. But it seems to me that burying your PhD thesis deep in a repository cattle-car – seemingly with only a few keywords, an ugly template and an impenetrable URL for company – isn’t serving it or the author very well. Especially in terms of metadata and tagging leading to full-text search discovery. As the authors of “Experiences in Deploying Metadata Analysis Tools for Institutional Repositories” recently wrote in Cataloging & Classification Quarterly (No. 3/4, 2009)…

“Current institutional repository software provides few tools to help metadata librarians understand and analyse their collections.”

DH: Which doesn’t bode well for search-engines aiming to hook into and sort the same metadata. That sort of statement might have been acceptable in 1999, but it’s a damning statement to hear from librarians in 2009. And another paper in the same issue concludes that there is…

“a pressing need for the building of a common data model that is interoperable across digital repositories”.

And he goes on to give a simple idea for making metadata exposed.

Posted in Uncategorized | 2 Comments

Slashdot: Doctor Who and the Daleks

Posted on June 19, 2009 by pm286

Whereas the common metric of acclaim for the academic is some trivial metric such as the impact factor, the real geek aspires to be slashdotted (http://en.wikipedia.org/wiki/Slashdot). This is real fame as 5.5 million nerds read the site. So – directly or indirectly – through Glyn Moody my post on the Doctor Who Model of Open Source has hit /. and there are the predictable suggestions as to who are the Daleks.

Who else could it be – “Embrace, extend and EX-TER-MIN-ATE”:

(http://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish)

I can tell you from our first hand experience that they can go up and down stairs now.

The discussion is at: Slashdot

Posted in Uncategorized | Leave a comment

PDFs

Posted on June 19, 2009 by pm286

Egon Willighagen is one of the people whose support has kept me going through tough times. Here he supports my criticism of PDF (No, PDFs really do suck!). For background: I had posted a criticism of PDF and a lively discussion took place on FriendFeed. I understand it is questionable as to whether FF discussions should be quoted – they are public but can be taken out of context – so I didn’t. I had a few supporters, but a surprising number of naysayers, the general tenor of which was that PDF is easy to read and HTML is difficult to read. So I’ve done an experiment, but first Egon’s post:

No, PDFs really do suck!

EW: A typical blog by Peter MR made (again), The ICE-man: Scholarly HTML not PDF, the point of why PDF is to data what a hamburger is to a cow, in reply to a blog by Peter SF, Scholarly HTML.

This led to a discussion on FriendFeed. A couple of misconceptions:

FF: “But how are we going to cite without paaaaaaaaaaaage nuuuuuuuuuuumbers?”

EW: We don’t. Many online-only journals can do without; there is DOI. And if that is not enough, the legal business has means of identifying paragraphs, etc, which should provide us with all the methods we could possibly need in science.

FF: Typesetting of PDFs, in most journals, is superior to HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes.

EW: Ummm… this is supposed to be Science, not a California Glossy. It seems that pretty looks are causing a major body count in the States. Otherwise, HTML+CSS can likely beat any pretty looks of PDF, or at least match it.

FF: As I seem to be the only physicist/mathematician who comments on these sort of things, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time.

EW: HTML+MathML is well established, and default Firefox browsers have no problem showing mathematical equations. The Blue Obelisk QSAR descriptor ontology has been using such a setup for years. If you use TeX to author your equations, you can convert them to HTML too.

FF: We can mine the data from the PDF text.

EW: Theoretically, yes. Practically, it is money down the drain. PDF is particularly nasty here, as it breaks words at the end of a line, and can even make words consist of unlinked series of characters positioned at (x,y). PDF, however, can contain a lot of metadata, but that is merely a hack, and an unneeded workaround. Worse, it is hardly used regarding chemistry. PDF can contain PNG images which can contain CML; the tools are there, but not used, and there are more efficient technologies anyway.

EW: I, for one, agree with Peter on PDF: it really sucks as a scientific communication medium.

So here’s an experiment with a sample size of one. I went to BiomedCentral, took the first journal I came across which had a ta-ble. (A ta-ble is a coll-ect-ion of num-bers in rows and col-umns). Sometimes I read tables, but sometimes I put them into a spread-sheet. (A spread-sheet is soft-ware that lets you cal-cul-ate things). The article is http://www.biomedcentral.com/1471-2105/9/545 which is called the full-text and is in HTML. I went to Table 1 and found:

[Image: Table 1 as shown in the HTML full text]

I then went to the PDF (which is not seen by BMC as the full-text) http://www.biomedcentral.com/content/pdf/1471-2105-9-545.pdf and found the same table:

[Image: the same table as shown in the PDF]

[I have reproduced them at the same size as they came up in my open-source browser. The HTML was rendered naturally by the browser with no help from me. The PDF required me to download a closed-source proprietary plugin from Adobe.]

I am not an expert on readability but I would like to see the researched arguments that say the HTML is worse for humans than the PDF (actually I think it’s better).

But here is the clincher. As a scientist I don’t just want to read the paper with my eyes. I want to use the numbers. Maybe I want to see how column 1 varies with column 2. The natural way to do this is to cut-and-paste the table. (This is done in each case by sweeping out the table with the cur-sor and pressing the Ctrl-key and the C key on the key-board at the same time. The data is now on the clip-board). I then open up Ex-cel (because I am in the pay of Mi-cro-soft) and “paste” the clipboard into the spread-sheet. This is what I get from the PDF version.

[Image: the PDF table pasted into Excel, with all the data in one column]

All the data has gone into one column. The tabular nature has been completely destroyed. And the cut-and-paste was done with Adobe’s own tool, so even Adobe doesn’t know what a table is in the PDF. (I have been taken to task for criticizing PDF because some people don’t use Adobe tools).

Here’s the HTML version. I have highlighted a cell to show that all cells are in correct columns:

[Image: the HTML table pasted into Excel, with one cell highlighted]
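The reason the HTML pastes cleanly is that rows and cells are explicit in the markup, so very little code is needed to recover them. What follows is only a minimal Python sketch under stated assumptions, not part of the experiment above: the class name `TableExtractor`, the helper `table_rows` and the sample data are made up for illustration, and the markup of the real BMC table is assumed rather than checked.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return the table in `html` as a list of rows (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data, NOT the table from the BMC article.
SAMPLE = """
<table>
  <tr><th>Method</th><th>Score</th></tr>
  <tr><td>A</td><td>0.91</td></tr>
  <tr><td>B</td><td>0.87</td></tr>
</table>
"""

rows = table_rows(SAMPLE)
for row in rows:
    print("\t".join(row))  # tab-separated, ready for a spreadsheet
```

Nothing comparable exists for the PDF: there the cell boundaries are not in the file at all and would have to be guessed from character positions.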

I was called “a bit … dogmatic”. Yes I am. This seems to me so self-evidently a case for using HTML over PDF that I can’t think of any reason why PDF should be used.

And kudos to BMC. They have realised that HTML is a better digital medium than PDF. Are their readers cancelling subscriptions? No…

Oops… BMC is an Open Access publisher. It is forcing its authors to pay for their manuscripts to be converted into horrid HTML. I expect they’ll start sending their papers elsewhere…


Posted in Uncategorized | 14 Comments

Wellcome welcomes NPG's textmining policy; I don't

Posted on June 18, 2009 by pm286

From Open Access News, interspersed with my comments:

NPG permits text-mining on green OA manuscripts

Nature Publishing Group (NPG) will explicitly permit academic reuse of archived author manuscripts. Head of Content Licensing David Hoole announced the development today at the OAI6 meeting in Geneva, Switzerland. Researchers can now data-mine and text-mine author manuscripts from NPG journals archived in PubMed Central and other academic repositories.

PMR: one of the really important aspects of PubMed Central is that it contains millions (sic) of abstracts, of which some have openly accessible full-text. However readers cannot assume they can text-mine this without contravening publisher contracts.

“NPG supports reuse for academic purposes of the content we publish. We want the excellent research that we publish to help further discovery, and recognize that data-mining and text-mining are important aspects of that,” said David Hoole.

PMR: Nature is among the more innovative publishers and has realised the value of text-mining for some time. It originally came out with the (almost useless) OTMI which published all the sentences in some publications but (as Eric Morecambe observed) “not necessarily in the right order”. Here, however, they are starting (sic) to do the right thing.

Under NPG’s terms of reuse, users may view, print, copy, download and text and data-mine the content for the purposes of academic research. Re-use should only be for academic purposes, commercial reuse is not permitted. Full conditions are available [here].

PMR: Oh dear, “non-commercial” yet again. Apart from the questionable motivation, it’s almost impossible to define. If we include material in a text-book is that “commercial”? [Yes, I have read the conditions and we can’t]

The re-use permissions apply to author manuscripts, of articles published in NPG’s journals, which have been archived in PubMed Central, UK PubMed Central (UKPMC) and other institutional and subject repositories. The terms were developed in consultation with the Wellcome Trust, the leading biomedical research charity….

“The Wellcome Trust is supportive of NPG’s efforts to make archived content more reusable,” said Sir Mark Walport, Director of the Wellcome Trust. “This is an important development because it shows that reuse can be facilitated, independent of business model, for text-mining and academic research.” …

PMR: If Wellcome have actually OK’ed the terms pointed to, then this is not an important development.

NPG’s re-use terms will be included in the metadata of these archived manuscripts.

PMR: anything which helps make things clearer is useful. Where is the metadata kept?

Peter Suber’s Comment. OA supporters have disagreed on whether text-mining is covered by fair use (or fair dealing etc.) or whether it requires fresh permission. Regardless of where you came down on that, it’s good to have explicit permission. (On the other hand, if permission is unnecessary, then it wouldn’t be good if researchers and publishers began to believe that it was; but that’s a different issue.) I regard this as a small but welcome step beyond gratis green OA to libre green OA.

PMR: I rarely disagree with PeterS, but this is not a step towards “libre”. I do not welcome this step.

Here are Nature’s terms:

Academic research only
1. Archived content may only be used for academic research. Any content downloaded for text based experiments should be destroyed when the experiment is complete.

PMR: This is not libre. It’s incredibly restrictive. It forbids the use of this material in creating corpora (which are essential for building text-mining tools properly). In fact, by forbidding the creation of corpora, publishers as a whole are holding academic research back.

Wholesale re-publishing is prohibited
3. Archived content may not be published verbatim in whole or in part, whether or not this is done for Commercial Purposes, either in print or online.

PMR: If I do proper text-mining (as opposed to trivial lexical matching) and build tools such as OSCAR I need to be able to show the sources used to train and develop the tools. This is good science. Forbidding me to show my sources is bad science.

4. This restriction does not apply to reproducing normal quotations with an appropriate citation. In the case of text-mining, individual words, concepts and quotes up to 100 words per matching sentence may be reused, whereas longer paragraphs of text and images cannot (without specific permission from NPG).

PMR: This is no more than fair use. 100 words is far too small for creating text-mining tools. It is not “libre”; it is restrictive.

It would be a disaster if other publishers copy Nature and if Wellcome adopt this appalling policy as the standard. Wellcome are a major guardian of scientists’ rights and in this case they are not doing so.

Posted in Uncategorized | 12 Comments

Dear MP – protect us from HADOPI-UK

Posted on June 18, 2009 by pm286

The Open Rights Group has pointed out that “Digital Britain” – a major recent government report – contains recommendations that Ofcom (hitherto the defender of citizens’ rights) now enforces a HADOPI-like policy to counter alleged copyright infringements. I have therefore written to my MP:

Dear David Howarth,

I wrote to you recently on the absolute need to maintain Net Neutrality
– the right of everyone to have the same access to the Internet. And
your colleague indicated that you and your party have taken the issue
on board and are researching this.

I have recently got more details of the very real threat (in “Digital
Britain”). The government are requiring Ofcom – originally a guardian of
citizens’ rights – to act as a “police force” to defend the rights of
big business (see
http://www.openrightsgroup.org/2009/06/digital-britain-closing-down-the-open-internet/)
This article is a clear indication of the threat, e.g.:

“If ‘secondary legislation’ (rubber-stamped papers to parliament) is
passed, new powers would be given to Ofcom to require ISPs to restrict
access of alleged infringers.

“There is no suggestion of a requirement to take users to court before
curtailing their access. This looks like HADOPI-lite: muzzling of users
and potential harm to the internet’s infrastructure and lawful
businesses, to protect failing business models in the entertainment.
Regulations around enforcement will be drafted by industry and approved
by the regulator, Ofcom.”

I commend this article and the OpenRightsGroup. In my own area
(scientific research, on which so much in Cambridge depends) there is a
real threat that large publishers could call on Ofcom to ban users from
access to scientific information that they felt had been illegally
accessed. The “guilty without appeal” process generates fear which is
destructive of innovation.

Follow the link to see the whole article by the ORG.

Posted in Uncategorized | Leave a comment

Internet Librarian International 2009 – I'd like your ideas

Posted on June 18, 2009 by pm286

I have been honoured to be asked to speak at the Internet Librarian International 2009 #ILI2009 meeting this October (Oct 14,15,16, London). We’ve had a phone call and some snippets of thought to outline what I might talk about but – as I always make clear – I don’t know what I shall actually say till I find out what the audience is like. I shall – by necessity – be urging that we change. I’m not quite sure into what, but here are some ideas:

We must get young people immediately involved in changing the future.
We must be immersed in how the real web works, not how we would like it to work. Unless we think like Google, Wikipedia, Twine, Twitter we miss the point. They, not we, are reality.
We must reform the practice of copyright. We may be getting close to civil disobedience. Because unless we do we shall not control our future but be controlled by others.
We must move fast.
We must find ways of collaborating using the web. I’ve just been talking with Rufus Pollock about how the Open Knowledge Foundation can use digital media to grow a community of practice.

Last time I was asked to speak (at the JISC Libraries of the Future 2009) I used this blog to try to gather ideas about what I should say. That didn’t start out very successfully, in that I got no responses. So – whether deliberately or not – I turned up the outrage button and started to get input – on the blog and more valuably on FriendFeed. FF gives much more loquacious feedback than the blog, and Twitter even more so.

I came away from #LOTF09 convinced that (a) libraries were going to have to work harder to change towards the future and that (b) academia was slowly starting to get the point – e.g. Harvard would start challenging the publishers to insist that Harvard’s work was Open to the world. The details of this are still being worked/fought out (and I gather there is another publisher cartel lobbying the US congress/senate next week). But we have to get more universities fighting for their rights.

Of course Internet Librarian is not just academia and certainly not just science. But as Brian Kelly (another speaker at ILI2009) makes clear (http://ukwebfocus.wordpress.com/2009/06/18/respect-copyright-and-subvert-it/) we have to change:

Free your materials: Make use of Creative Commons for the materials that you create.
Take a risk management approach: Change does not occur without taking risks. So be prepared to take risks, but assess the risks and make an informed decision.
Be open about the risks: Share the approaches you have taken with others. Help them to assess the risks they may face in reusing your content.
But change is needed: And remember that there will be people and organisations within our sector who will have vested interests in maintaining the status quo. If, for example, you are involved in negotiating copyright deals, you may be concerned that your empire would be threatened by the widespread availability of open content. Or maybe you simply don’t want to rock the boat!

So I need your ideas! In the debate about LOTF09 Brian called me a “critical friend” – someone who was aligned in the same direction but prepared to challenge. So I am supportive of libraries and Libraries (even though I am adhering to a personal view rather than any manifestation). I don’t know what the Library of the future will look like. But whatever it is it must belong to the commons of the web and it must be managed in their interest.

Posted in Uncategorized | 3 Comments

The ICE-man: Scholarly HTML not PDF

Posted on June 16, 2009 by pm286

I’m picking up on Peter Sefton’s monster post and one of his phrases suddenly hit me:

academia is one of the few places where PDF is considered acceptable as a means of communication

I thought about it and I realised – it’s true. This awful mess we are in is of our own making. Or rather our own supine acceptance of the PDF served up by scholarly publishers. So why does academia use PDF?

Because the publishers like it
Because it looks like a good way to preserve things

And I can’t think of any other reason. It’s awful to index, awful to add behaviour to, and awful as a means of developing interoperability (which is trumpeted for repositories but hasn’t happened). It is directly against the spirit of the web. HTML has been one of the great successes of the web (HTTP was another, as were URIs).

PDF announces: “I don’t care about modern information. I can’t think for myself.” The medium is the message.

Apart from advertising brochures another area where PDF flourishes is regulatory systems. Pharma companies like PDF because it’s much more difficult to search than XML (or HTML) and so harder to find those bits which they don’t want found. And the regulators like it because the pages allow for easy certification.

Is that all academia is about? Helping the publishers certify their page count? And making it difficult to search their pages?

So here’s a fuller version of Peter’s section:

Scholarly HTML

Against this background I will confine myself to the dimensions I really care about, which are how to make word processors produce good quality HTML, and document interoperability. I’ve been over and over why this is important here, but here’s a summary.

On the authoring side, offline word processors like Microsoft Word and OpenOffice.org Writer are probably still the best all-round compromise for academic authoring in those disciplines which don’t use some other format like LaTeX. For now. I expect this to change soon; we are starting to see document drafting in Google Docs (which lacks citation services and styles and easy embedding of diagrams so far), and if Google Wave realises its promise then I think it could be an end-to-end scholarly communications platform.

PMR: Fully agreed. Word processors are complicated because documents are complicated (unless you default to bitmaps such as PDF. I have looked under the cover and PDF is truly awful)

On the delivery side, academia is one of the few places where PDF is considered acceptable as a means of communication whereas on a normal website it is regarded as an impediment to usability. We need to be getting scholarly works into HTML so we can do more with them; meshing them with data and visualisations and delivering them to mobile devices.

While we wait for Google Wave to take over the world, what I’d like to see is a Word toolbar much like the ICE toolbar to support scholarly authoring but with better integration into Word than we have had the resources to make so far here in Toowoomba. It should let people create well structured documents which can be pushed to academic systems (journals, repositories and learning systems), and not just in PDF or Word format but in some kind of formally specified Scholarly HTML. I think that idea had some support at our meeting, but Lee Dirks in particular pointed out that it would need to be done with reference to a stakeholder group who can help define and own this Scholarly HTML thing. I’d be interested in ideas on who these stakeholders might be;

Publishers obviously, where MS Research have great contacts.

Repository owners, particularly the discipline repositories like arXiv and PubMed Central.

The eResearch community; I hope that I can get the Australian National Data Service (ANDS) interested in this stuff.

The Electronic Thesis and Dissertation (ETD) movement. (My group is involved in this via our CAIRSS repository support service, the Australasian Digital Thesis program in Australia will come to CAIRSS at some point.)

The eLearning community, maybe.

But actually, where this matters most is on the long tail:

Thousands of small repositories and journals are stuck with paper-on-screen because that’s all their tools support.

The small but growing group of users who want to do more with the versions of their documents they deposit in repositories.

I’d appreciate any thoughts about who might be interested in defining a scholarly profile of HTML – a few people told me they’re following these posts so please speak up in the comments.

I’m interested, obviously. My requirements – which Peter knows of course – are that we can embed CML (Chemical Markup Language) and other markup languages. And that we can start to use RDF (RDFa?).

Please, academia, wake up and embrace the digital Semantic, not the ePaper, future.

Posted in Uncategorized | 14 Comments

Breaking Powerpoint with an ICE-axe

Posted on June 15, 2009 by pm286

Readers of this blog and those who have seen me present know that I don’t use PowerPoint – partly on principle (it leads to dumbing down of communications) and partly because I want to do things that PowerPoint can’t do:

hold semantic content
copy existing web pages
jump from slide x to slide y in the middle of a presentation

So I use XHTML for my slides. (If you don’t know what that is, it’s just ordinary HTML conforming to modern standards).

There’s a downside to this – it’s difficult to “give people copies of my slides”. That’s because:

I select from ca 5,000 un-slides [see below] and never know which I am going to give. No-one wants 5,000 un-slides
I can’t remember which slides I used and in which order
Many slides don’t make much sense if my speech is absent.

That is why I am always grateful to people who video my presentations.

There may be an answer. I mentioned this when I last visited the ICE-man, Peter Sefton, in Toowoomba and he’s addressed these issues in his latest post (Desktop Repositories: Smashing up PowerPoint). (I am planning to reply to Peter’s posts but there is so much in them it overwhelms me):

Les Carr has been experimenting with desktop repository services. He started by wondering how he might manage the thousands of PowerPoint slides and presentations he has, moved on to converting them into images, with embedded textual metadata, then put them in ePrints on the desktop and started speculating on how slides might be reassembled into new presentations and exported.

These workflows are exactly what we have been looking at with The Fascinator Desktop, our nascent eResearch repository platform. Our goal is to index and understand everything on an academic’s desktop, including presentations, documents, video, images, audio, data of all kinds, everything; via a plugin architecture which will be easily scriptable. We’re in the middle of a two week development sprint getting some of the pieces in place for this, so I thought that picking up on Les Carr’s PowerPoint work would make for a good target for the end of next week.

PMR. There’s too much to reproduce here, but we are working closely with Peter:

A lot of this is similar to Jim Downing’s Lensfield project – we have talked about harmonizing our projects.

PMR . And most relevantly to my problem:

If we had all that in place we could finally help Peter Murray-Rust with his presentations, which are made up of web pages selected from a huge library of un-slides many of which included embedded data visualizations. By indexing all his individual pages we could let him ’shop’ for the ones he wants, order them and then create a presentation-by-reference which could be de-referenced and blogged or reposited. (Peter, can you make your slide library available to us for experimentation?)

Yes, I can and will. The main problem is that a lot of what I have is simply scraped and comes to zillions of Mbytes, so I’ll look out for simple ones and post them to you. (I should (== I don’t but ought to) have these exposed on our web site and I’ll be trying to do this anyway.)

I’ve been looking for this for many years – maybe it’s finally starting to happen.

Posted in Uncategorized | Leave a comment

ETD2009; Make Electronic Theses properly visible

Posted on June 13, 2009 by pm286

This is my last post on ETD2009 #etd09. It was great to meet and re-meet many of the people who have developed the ideas and practice of electronic theses and dissertations. We had a great evening out on the Pittsburgh river and many useful discussions.

So I apologize if I am critical of current practices in theses and institutional repositories. Please argue back. And I acknowledge I was not present for the whole meeting.

I got the overwhelming impression that the major purpose of putting theses in IRs was to preserve them. There was an after-lunch talk from Deanna Marcum from Library of Congress which stressed preservation and the benefits of copyright (LOC was permitted to copy things so they could be preserved). She mentioned the “data deluge” – labs with terabytes/day – and acknowledged this was a preservation problem. Later, when the delegates were asked for straw polls for why we should have ETDs in repositories the largest vote was for preservation. Although there was some appreciation of the fact that theses now had a wider readership, there was little discussion of how they could enhance the visibility of theses.

And absolutely no expressed appreciation of the fact that someone might wish to download 10,000 theses at once.

There were far too many presentations about metadata-gateways to theses. The infrastructure still seems to be:

precious thesis submitted to IR, in precious PDF.
IR-metadata expert spends time indexing this “properly” as authors are no good at metadata and full-text doesn’t work. [I challenged the latter absolutely and pointed out that for anything other than text – maths, chemistry, protein sequences, etc. human metadata experts are irrelevant].
A commercial metadata organisation is allowed access to thesis metadata to create complex archaic arcane metadata structure where users (probably not even readers but subject librarians) search for individual items by metadata.
Thesis is embargoed from view if there is any FUD it might offend a publisher.

This is so far out of track with the C21 that I don’t know where to begin.

In the modern web we are developing Linked Open Data. This linking is largely being done by robots. It works like this:

Information provider (e.g. scientist) creates information in web-friendly form. HTML is designed for the web, so use that. As the web evolves we will use RDF, microformats, etc. This information will be added by robots. But for now bog-standard HTML works very well.
Expose the information on web pages.

That’s it.

The search engines are smarter and more numerous than metadata specialists. They know how to get the best out of full text. The search engines in our group at Cambridge understand chemical language. Soon, very soon, they will understand chemical diagrams. That is 100x more than can be added by a metadata specialist, even a chemical one.
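As a rough illustration of how little a robot needs once theses are exposed as plain HTML, here is a minimal Python sketch of a full-text index. It is a toy under stated assumptions and not the software used by the Cambridge group: the names `TextExtractor`, `page_text` and `build_index`, the `example.org` URLs and the two sample pages are all made up for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the human-readable text of a page, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # how many <script>/<style> elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def page_text(html):
    """Return the visible text of an HTML page (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(pages):
    """Map each lower-cased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in re.findall(r"[a-z0-9]+", page_text(html).lower()):
            index[word].add(url)
    return index


# Made-up pages standing in for openly published theses.
PAGES = {
    "http://example.org/thesis/1":
        "<html><body><h1>Synthesis of benzene derivatives</h1></body></html>",
    "http://example.org/thesis/2":
        "<html><body><p>Protein folding and benzene rings</p>"
        "<script>var x = 'tracking';</script></body></html>",
}

index = build_index(PAGES)
print(sorted(index["benzene"]))
```

A real robot would of course fetch the pages over HTTP and understand far more than bare words, but the point stands: every word in the full text becomes searchable without anyone typing in metadata.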

The answer is simple. Create Open Theses in HTML and publish them. Use IRs if you think that’s a useful way of making them permanent – but it’s not required.

That’s all.

So that we can index the academic web, it would be useful to have a user-accessible list of academic IRs for our robots to scan academia. In RDF, please.
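Such a list need not be complicated. The Python sketch below writes one as N-Triples, the simplest RDF serialisation, and is only an assumption about what it could look like: the repository URLs are placeholders, the class `http://example.org/vocab#InstitutionalRepository` is hypothetical, and only `rdf:type` and the Dublin Core `title` property are real vocabulary terms.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
# Hypothetical class, invented for this sketch; no such vocabulary is implied.
REPO_CLASS = "http://example.org/vocab#InstitutionalRepository"


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def repositories_to_ntriples(repos):
    """Turn (url, title) pairs into N-Triples, two statements per repository."""
    lines = []
    for url, title in repos:
        lines.append(f"<{url}> <{RDF_TYPE}> <{REPO_CLASS}> .")
        lines.append(f"<{url}> <{DCT_TITLE}> {literal(title)} .")
    return "\n".join(lines) + "\n"


# Placeholder entries, not real repositories.
REPOS = [
    ("http://example.org/repo/alpha", "Alpha University Repository"),
    ("http://example.org/repo/beta", "Beta Institute Theses"),
]

ntriples = repositories_to_ntriples(REPOS)
print(ntriples, end="")
```

A robot can read a file like this line by line and visit each repository in turn; a richer list would also say where each repository's full-text theses start.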

[I will give my own views on preservation later. I do care about it. But not to the exclusion of making material visible.]

Posted in Uncategorized | 4 Comments

Copyright in Scientific Theses is holding us back; Ignore it

Posted on June 12, 2009 by pm286

I feel the dread hand of copyright hanging Mordor-like over the whole area of scholarly publishing. I heard to my horror in PennState that one University had embargoed all its theses in case they violated copyright. So I tested this in my talk and asked “are there repositories that embargo all their content for fear of copyright?” and got a few nodding heads. So I am taking this as fact, and asking:

Why is no-one except me angry about the way that copyright (or exaggerated fear of it) is stifling electronic innovation in academia?

So for example, I asked one speaker who proudly talked about their thesis aggregator “how many of your theses are available under CC-BY or equivalent and can I download all of these and data mine them?” Apparently they weren’t his theses – he just aggregated metadata and I would have to approach the author of every single thesis to find out what the permissions are.

The whole meeting seems to be asleep about the urgency to liberate these theses into the digital Open. I am depressed. I don’t think it’s going to happen any time soon. Providing access to single humans for single views on single theses is all that seems available. Maybe some commercial company does some full-text indexing somewhere, but that’s no use to me. We could process 10,000 theses tomorrow and extract the chemistry. The recall won’t be great but it will still be thousands of results. But there is no way that anyone seems interested in this. Theses are precious jewels which require a priesthood to access – not for my robots.

So I am angry. It is not just the fault of the libraries – faculty, especially senior faculty, bear half the blame. But we are sitting on a goldmine of scientific information in academic theses and we are deterred from using them by copyright FUD. There is an implicit assumption that copyright is one of the god-given commandments – it seems almost revered here.

So let’s abandon copyright in science. What does it gain us? Almost nothing, unless you author a successful textbook. Nowhere else is copyright the slightest use to a scientist and it stands in their way at every step. There are faculty who can’t use their own research work in teaching. There are libraries which can’t let the world see their theses. There are librarians who spend their time negotiating deals with publishers so they can access their own work. Even Laputa could not have designed the bizarre copyright system.

I stress that this is for SCIENCE. I agree that if you are working in creative arts you may wish to protect your work. But scientists don’t. And they are being held back by the assumptions that apply to creative works.

So I would urge that science declares war on copyright. It is fiendishly complex, and wastes vast amounts of time. I know that, theoretically, we can’t as copyright is a legal thing. But since the British Library is trying to get the UK government (what is left of it) to change the law, why don’t we just assume that the spirit of the law and the future letter is that copyright in science should be effectively irrelevant.

SO AS A FIRST STEP LET’S JUST PUBLISH ALL OUR **SCIENCE** THESES OPENLY AND ALLOW UNRESTRICTED DOWNLOADING AND RE-USE?

I can’t see any reason why not. Any publisher who sues a university for publishing work done at that University and for no financial profit has to show that this is not fair use. OK, it’s then down to the lawyers, but I suspect that few publishers will relish suing large prestigious institutions.

And in that way the faculty-library complex can regain some of the sense of urgency demanded by students and researchers.

Posted in Uncategorized | 6 Comments
