petermr's blog

Linked Open Repositories: “We can do it in an afternoon”

Posted on August 4, 2011 by pm286

I have suggested that we can and should create Linked Open Repositories (/pmr/2011/08/04/linked-open-repositories/ ) and that it might take a week. I expected this timescale to be challenged and that I would be seriously wrong.

I was. Dan, who was a wonderful summer student with us and who I am proud to say now works with Digital Science, says I am out by a factor of 14:

Dan Hagon says:

August 4, 2011 at 11:15 am

I’d say this could be done in an afternoon. Here I won’t cover the LOD conversion part, just the part about getting the raw data.

First note that this has kind of already been done as part of the JISC-funded EThOS project now hosted at the BL http://ethos.bl.uk/ but at present I don’t see any way to access the underlying dataset other than by the search interface. Instead you can do the OAI-PMH harvesting for yourself.

Yes – that’s the sort of problem. Dan and I can use the OAI-PMH (for theses only?). How many others can?

The OAI-PMH protocol is well-documented http://www.openarchives.org/OAI/openarchivesprotocol.html but you can save yourself the trouble of playing around with resumption tokens and such-like by using the Python library pyoai http://www.infrae.com/download/OAI/pyoai

We have to write a program to do it. Read Dan’s post for the details…

We can now iterate through this list of repos… [snipped]

[…] Unfortunately there doesn’t appear to be consistency in the naming of sets …

With this in hand we can now get the data back we want:

http://eprints.ecs.soton.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=uketd_dc&set=747970653D746865736973

… you would now also need to determine the subject of the thesis to find just those that are Chemistry. Note also that the list of repos in ListFriends XML format certainly doesn’t cover all Universities in the UK.
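As a minimal sketch of what that request involves (standard library only, no network access), the Python below rebuilds the ListRecords URL quoted above and decodes its set name. The helper names are made up for illustration and are not part of pyoai. The decoding relies on an observation about this one repository: its set name is the hex encoding of a readable string. Other repositories name their sets differently, as Dan notes.

```python
from urllib.parse import urlencode

# Base URL and set name are taken from the example request above.
BASE_URL = "http://eprints.ecs.soton.ac.uk/cgi/oai2"
THESIS_SET = "747970653D746865736973"


def decode_set_spec(hex_spec: str) -> str:
    """Decode a hex-encoded set name into readable text.

    Assumption: the repository hex-encodes its set names, as this one
    appears to. This is not guaranteed for other repositories.
    """
    return bytes.fromhex(hex_spec).decode("ascii")


def list_records_url(base_url: str, metadata_prefix: str, set_spec: str) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{base_url}?{query}"


decoded = decode_set_spec(THESIS_SET)
url = list_records_url(BASE_URL, "uketd_dc", THESIS_SET)
print(decoded)
print(url)
```

A real harvester would still have to fetch the URL, parse the XML and follow resumption tokens, which is the part pyoai is said to take care of.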

So it can be done. But it’s an afternoon’s work. That doesn’t sound too bad ..

LET’S JUST DO IT?

Posted in Uncategorized | 2 Comments

Linked Open Repositories:

Posted on August 4, 2011 by pm286

At http://www.repositoryfringe.org/ we have a competition – run by JISC/Mahendra_Mahey. It’s primarily for hackers and they were hard at work last night, even forgoing the delights of the Edinburgh Fringe (to which the M-R clan succumbed instead). But one category is for a “good Idea”. So I’ll enter for this one and blog it – so even if I don’t win the prize the blogosphere can go ahead and implement it. (Mahendra/Assessors – if you are short of time, just jump to “Proposal” – you know the problem).

Background and Problem

Institutional repositories are designed for capturing the output of universities and other scholarly organizations. They have things like theses, preprints, priceless digital objects, teaching and learning objects. Quite varied. Not as full as they should be. So here’s a simple question:

Find me all chemistry theses in UK repositories.

And although we ought to be able to, we cannot answer this very simple question.

BTW if you want incentive, the Netherlands can expose all their theses. Portugal exposes its theses. Even France exposes its theses. But not the UK.

Why not? Repositories are searched by Bingle, aren’t they? Yes. But Bingle doesn’t give a complete list – it just gives a few pages. It doesn’t “expose its API”. And anyway Bingle might stop indexing them tomorrow. We can’t rely on Bingle, and more seriously we shouldn’t. We want something better. Owned by the academic community.

Well, can’t we use OAI-PMH to search them? What’s that? It’s a specialist protocol for harvesting metadata from academic sites. And, possibly, we could use it to search for chemistry theses. But the hit rates would be very low. The content probably isn’t metadata-labelled as “chemistry” or as “thesis”.

But *I* know a chemistry thesis when I see the first page of it. It will say something like “A thesis submitted for the Doctor of Philosophy in the University of Laputa”. It has words such as “thesis” and “University” somewhere in the page. Machines can read a document and decide whether it’s a chemistry thesis. Better than most humans. All they need is the document.

So if I can get all the content of all the UK repositories I can find all the theses, simply by iterating over the whole content. [Iterate is described by Alice: “Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop”]
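To make "iterate and decide" concrete, here is a deliberately crude Python sketch. It assumes we already have the first-page text of every document, which is exactly the thing we cannot currently get. The function names, the sample documents and the short list of chemistry words are all invented for illustration. A real classifier would be far better than this keyword test.

```python
import re

# Hypothetical, far-from-complete vocabulary used only for this sketch.
CHEMISTRY_TERMS = {"chemistry", "synthesis", "catalysis", "spectroscopy", "crystallography"}


def words(text):
    """Return the set of lower-cased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def looks_like_chemistry_thesis(first_page):
    """Crude test: mentions 'thesis', 'university' and a chemistry term."""
    found = words(first_page)
    is_thesis = "thesis" in found and "university" in found
    return is_thesis and bool(found & CHEMISTRY_TERMS)


def find_chemistry_theses(documents):
    """Begin at the beginning, go on to the end, keep the matching ids."""
    return [doc_id for doc_id, page in documents if looks_like_chemistry_thesis(page)]


# Invented sample documents: (identifier, first-page text).
sample = [
    ("doc-1", "A thesis submitted in the University of Laputa. Department of Chemistry."),
    ("doc-2", "Annual report of the University of Laputa."),
    ("doc-3", "A thesis submitted in the University of Laputa. Department of History."),
]
matches = find_chemistry_theses(sample)
print(matches)
```

The point is the shape of the loop, not the quality of the test: once the content can be iterated, the decision rule can be swapped for something much smarter.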

And there is a simple way to express the content – organize it as LINKED OPEN DATA. http://en.wikipedia.org/wiki/Linked_Data .

What’s Linked Open Data? It’s TB-L’s great idea of the semantic web, giving everything an identifier, an address and using RDF. Wikipedia is in LOD (as DBPedia). Genome data is present as LOD. Government data is there as well.

Nearly everything except university content.

Isn’t it terribly difficult? And what’s this RDF anyway? No – it’s very easy – as my proposal will show:

PROPOSAL FOR LINKED OPEN REPOSITORIES

Step 0. Create a node in the LOD diagram/graph called “UK repositories”. Tell TimBL about it.

Step 1. Under this node, create a list of UK repositories in RDF. There are about 100-200?? This could be done in the evening, in a bar. Use ORE to create an iterable list (“Table of Contents”). The list should point to each UK repository.

Step 2. Get all repository owners to provide an RDF list of their contents. This is technically trivial. All repository software has a button labelled “Dump as RDF”. Put the list (use ORE?) on the repository web page.

That’s it. We’ve spent 100 million GBP in real and implicit costs to create UK repositories and their content. This proposal could be completed in a week.
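For anyone wondering what the list in Step 1 might look like, here is a rough Python sketch that writes it as N-Triples using plain strings. The list URI and the second repository URI are placeholders on example domains; the first repository is the one from Dan's comment. It uses the ORE terms ore:Aggregation and ore:aggregates in the simplest possible way and leaves out the Resource Map that a full ORE description would include, so treat it as an illustration and not as a finished design.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE_AGGREGATION = "http://www.openarchives.org/ore/terms/Aggregation"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three parts are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def repository_list_ntriples(list_uri, repository_uris):
    """Describe the list as an ORE aggregation pointing at each repository."""
    lines = [triple(list_uri, RDF_TYPE, ORE_AGGREGATION)]
    lines += [triple(list_uri, ORE_AGGREGATES, uri) for uri in repository_uris]
    return "\n".join(lines)


# Placeholder URIs: only the Southampton address comes from the text above.
UK_REPOSITORIES = "http://example.org/uk-repositories"
repositories = [
    "http://eprints.ecs.soton.ac.uk/",
    "http://repository.example.ac.uk/",
]
rdf = repository_list_ntriples(UK_REPOSITORIES, repositories)
print(rdf)
```

Each repository would then publish its own list of contents in the same style (Step 2), so a machine can walk from the top node down to every item.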

As Tim says: JUST DO IT.

Posted in Uncategorized | 1 Comment

Figshare: how to publish your data to write your thesis quicker and better

Posted on August 3, 2011 by pm286

I’m at the JISC repo fringe (#rfringe11) in Edinburgh (If you want to follow it the live blog is great: http://www.repositoryfringe.org/ ). I was really excited to meet Mark Hahnel today – the creator of Figshare (http://figshare.com/ ). Mark is exceptional in that he has not only done research in Cell biology and just finished his thesis on stem cells, but also developed Figshare – a tool to publish his data to the open web.

Here’s the OKF blog and Mark’s account of Figshare : http://blog.okfn.org/2011/03/02/introducing-figshare-a-new-way-to-share-open-scientific-data/ . Some extracts:

The following post is from Mark Hahnel, founder of the Science 3.0 network and member of the Open Knowledge Foundation’s Working Group on Open Data in Science.

Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures:

What? Publishing your data. To everyone else? And spending valuable thesis time doing it? Wasting time on things that didn’t work? And giving your competitors an advantage?

In fact there is a very good self-centered reason for publishing your data as you do the experiment. It means you are always in control of your data. You won’t need to frantically hunt for the missing gel that you thought had only one spot on it but now you aren’t sure. The spectrum that clearly showed the methyl group was on an aromatic ring. The crystal structure that showed the metal was zinc not magnesium. If you publish your data, openly, at the first possible opportunity then you know it’s safe. It will have all the metadata it needs. It will be coupled to the bibliography.

And this means that your thesis will take less time to write and be of higher quality. Because the discipline of publishing data will become second nature. You won’t forget to label the axes. You won’t wonder whether the distance was in Bohr or Angstroms, the energy in kcal or kJ. In fact you will be working towards continuous integration for your thesis.

What’s that? It’s a software concept. Every time you make a small increase in your program’s capability, you make sure it still works perfectly. Every time you collect a new piece of data you make sure that it’s in the right place, describing the right sample. It’s so easy to muddle things if you don’t record them properly at the time. And this saves time and worry.
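As a toy illustration of the idea applied to data, the Python sketch below checks each new record for the metadata you would otherwise forget. The field names are invented for this example and are not part of Figshare or any real tool; the check is the kind of thing a continuous integration job could run every time data is added.

```python
# Hypothetical metadata fields that every data record should carry.
REQUIRED_FIELDS = ("sample_id", "x_label", "y_label", "units")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Invented example records.
good_record = {
    "sample_id": "S1",
    "x_label": "wavelength",
    "y_label": "absorbance",
    "units": "nm",
}
bad_record = {"sample_id": "S2", "x_label": "", "y_label": "energy"}

good_problems = missing_metadata(good_record)
bad_problems = missing_metadata(bad_record)
print(good_problems)
print(bad_problems)
```

If the check fails on the day the data is collected, fixing it takes a minute. If it fails in the week before submission, the missing label may be gone for good.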

So the benefit is primarily to YOU. We can expect Figshare and related sites to expand their capabilities. There will be a bibliographic spine – you’ll be able to look for figures that might look a “bit like yours”. Or, indeed if you are repeating a protocol, you’d hope they looked a lot like yours. And rather than simply trawling through the journal of indeterminate biology in the hope you’ll find relevant diagrams you’ll be able to search directly.

Figshare is so simple in concept it will succeed (that doesn’t mean it was easy to write!). And because it’s written by someone intimately concerned with the research, and in tune with writing theses it fits perfectly. Mark’s talking tomorrow and I’m looking forward to it – I hope the live blog catches it.

Posted in Uncategorized | Leave a comment

What’s wrong with scholarly publishing? It’s only for academics.

Posted on August 2, 2011 by pm286

I have just been at a wonderful conference in Canyons, Utah “Accelerating discovery: Human-computer symbiosis 50 years on” (https://sites.google.com/site/licklider50/ ). This drew from Lick’s vision – 50 years ago – of machines and humans interacting in a symbiosis to help each other by contributing the things they are good at. And I’ll write more about this.
But a surprisingly large amount of the meeting was about scientific “publishing”. In large part because it is critical to modern machine intelligence which looks at how knowledge can be used by computers. If we publish data, then computers can use that. (If we don’t, they can’t).
And most of the time we do not publish our work.
Surely “publish or perish”? If academics don’t publish, they don’t advance – and may get terminated.
True – but it’s often a special type of “publication”. Publication only readable by academics, because only they have libraries which subscribe to these publications. And often not even read by them. Used primarily for the purpose of evaluating the worth of academics and their institutions. And, as I have started to argue here and shall continue, it’s not even very good at doing this.
One of the scientists at licklider50 was Cameron Neylon – a great proponent of Open science (and who has talked about this at the PMR hackfest earlier). For “technical reasons” I have used an earlier version of his slides (2009) to illustrate my argument (http://www.slideshare.net/CameronNeylon/nesta-science-in-society ).
Here is his model of the scholarly research cycle

And how it can be linked to other scientists so that ideas and publications spread…

But the diagram also (unconsciously or serendipitously) highlights the primary problem with academic research:
It’s CLOSED.
It’s closed in that the communication process is primarily to other scientists (IN RICH UNIVERSITIES). It does not reach to scientists outside the “top” universities. One research project that I heard about at licklider50 was Open Source Drug Discovery (OSDD). In India. Its aim is to discover new compounds active against tuberculosis. [I’ll write more later].
Because one person dies every three minutes of tuberculosis.
The people working on this project don’t even have roads to their houses. But they do have mobile connections. And they have a passionate desire to develop new medicines.
But they don’t have full access to the outputs of modern research. Because it’s closed to all except rich universities.
And it’s not just India. Here’s an example of a non-university scientist in the West. Sounds like a medical charity. They are seriously disadvantaged by the current system.

So what’s wrong?
The aims of many academic researchers. Somehow their vision has got transformed into a simple cycle:

1. Get funding
2. Do research
3. Publish in prestigious journal to get personal recognition, career advancement and more grants.
4. Go to 1

The primary purpose of publication for most academics is self-advancement.

Well, we don’t expect people to be complete altruists. But the pressure to succeed (strongly reinforced by the institutions) is so big that other aspects are secondary. There is no pressure to actually communicate the results of the research to the world in a usable form. (Remember it can cost 40 USD for 1 day’s read of a 2 page paper – that effectively disadvantages the OSDD community, doesn’t it?)
And more from Cameron:
“UK Scientists spend 12 billion GBP (20 billion USD) of the taxpayers’ money”
Yes. And much of this goes into the closed, inward-looking cycle of academia. To publish research for the benefit of the university community. Not more generally.
So let’s re-examine the Cycle. It really looks like this:

If 12 billion is input into research then surely it should reach more people than the academics? Shouldn’t it reach everyone?
Some academics will argue this is a caricature – that their research is widely disseminated. But is it UNIVERSALLY disseminated?
But we are still locked into this dysfunctionality. People will publish in closed journals because it advances their career.
Bangladesh learnt earlier this year that it was too rich to continue to get free access to journals under the HINARI WHO scheme: http://www.scidev.net/en/features/hinari-and-the-dream-of-free-journal-access.html. But we shouldn’t even be discussing whether Bangladesh should have free access to medical research. Shouldn’t we be working for this as a basic human right?
Universities ought to be highlighting this problem and trying to change it. Instead they are compounding it by buying into the false values of scholarly publishing – that where you publish matters more than the benefit to humankind.
Where will change come from? I still think the system is on target for a crash. That’s not the most constructive plan….
I think the most realistic action is by funders being really tough on grantees. Absolutely insisting that their funding goes to making sure that outputs are OPEN. And by disadvantaged non-readers at all levels lobbying for change. After all most of us are taxpayers – shouldn’t we argue that our money should lead to better value?
(Please excuse formatting – I am working on it)

Posted in Uncategorized | 3 Comments

Why YOU need a data management plan

Posted on August 1, 2011 by pm286

The following appeared on noticeboards in the Chemistry Department – the Panton Arms [1] is just 200 metres away

Maybe some reader of this blog can help…

But the real message is that Data Management Plans are most valuable to the actual people involved. So I would start graduate training with a data management course…

But NOT run by the Library or Department or the University.

Run by THIRD-YEAR PhD students. These are the people who know the problems of not doing things properly in year one. When they come to write their theses, they know the pain of not having all the data at hand.

And I would develop a thesis-oriented tool that takes ideas from versioned repositories (such as Mercurial (http://en.wikipedia.org/wiki/Mercurial ) and Git http://en.wikipedia.org/wiki/Git_%28software%29 ) and merges them with a continuous integration tool such as Jenkins (http://en.wikipedia.org/wiki/Jenkins_%28software%29 ). This is what we do for software and the peace of mind it gives is enormous. I *know* that all our software is permanently archived [2] and that it works.

Wouldn’t you like to know that your thesis is permanently archived during the whole of your research (not just after submission)? And that it is up to date (i.e. you have all the figures and tables already prepared and certified as fit-for-publish)?

It’s technically straightforward. It simply needs universities to provide a system. IMO this is more important than putting resources into traditional Institutional repositories.

Scientists do not and will never use IRs in their present form. But they will use a good data management tool.
If libraries or information support wish to interact more with researchers (there is very poor engagement at present) then supporting continuous data integration during the research is the best place to invest effort.

[1] Perhaps this pub is familiar? See http://pantonprinciples.org/about/

[2] Scientists are concerned about archiving for the here and now. Not for 100 years. Bitbucket and Sourceforge are quite stable enough.

Posted in Uncategorized | 1 Comment

What is civil disobedience?

Posted on July 27, 2011 by pm286

I have already blogged about the Aaron Swartz case and civil disobedience /pmr/2011/07/20/the-ethics-of-%E2%80%9Cstealing%E2%80%9D-scientific-articles-and-civil-disobedience/ . I expect to blog more as the case unfolds (assuming it is carried through) but this post is to clarify what I think are the prerequisites for CD if it becomes relevant in the area of scholarly publishing. First, read http://en.wikipedia.org/wiki/Civil_disobedience from which I quoted.

Civil disobedience is a very powerful tool but it must be used with thought, care and bravery to be effective. It is the deliberate breaking of a law, or regulation or practice on moral grounds (“conscience”), and not for personal gain. As such the motivation must be clear, and I believe, before the act. At present I am unclear in the Swartz case what the motivation was before the case and what will be presented in court. I cannot therefore label it as CD or not-CD, though his past actions were certainly not. Peter Suber has discussed CD (http://www.earlham.edu/~peters/writing/civ-dis.htm ) and has made it clear that he does not regard Swartz’s previous actions (“Guerilla OA”) as CD and from what Peter has written I agree completely (1. Suber, P. Guerilla OA. Open Access News. 21 August, 2008. http://www.earlham.edu/~peters/fos/2008/09/guerilla-oa.html ). This does not mean I may not personally support Swartz (I have signed the petition). More on this later.

I am familiar with CD in the context of protesting for peace and the right not to fight and for countries not to possess weapons of mass destruction. I grew up in the shadow of UK conscription and was prepared to face a tribunal (and possible prison) and argue why I was not prepared to join the armed forces. In this I was following the example of people like http://en.wikipedia.org/wiki/Kathleen_Lonsdale pre-eminent scientist (first female FRS) and Quaker peace campaigner:

Kathleen Lonsdale became a Quaker in 1935, simultaneously with her husband. Both of them were committed pacifists and were attracted to Quakerism in no small part for this reason. She served a month in Holloway prison during the Second World War because she refused to register for civil defence duties or pay a fine for refusing to register. At the annual meeting of the British Quakers in 1953 she delivered the keynote Swarthmore Lecture, under the title Removing the Causes of War. [my emphasis]

This action (refusing to take part in the war action) is perhaps more an act of conscience than CD but the effect is the same – prepared to suffer if necessary for clear principles. A similar action was http://en.wikipedia.org/wiki/Draft-card_burning where groups of individuals systematically burned their draft cards (the requirement to sign up for war duties):

Beginning in May 1964, some activists burned their draft cards at antiwar rallies and demonstrations. By May 1965 it was happening with greater frequency. To limit this kind of protest, in August 1965, the United States Congress enacted a law to broaden draft card violations to punish anyone who “knowingly destroys, knowingly mutilates” his draft card. Subsequently, 46 men were indicted for burning their draft cards at various rallies, and four major court cases were heard. One of them, United States v. O’Brien, was argued before the Supreme Court. The act of draft card burning was defended as a symbolic form of free speech, a constitutional right guaranteed by the First Amendment. The Supreme Court decided against the draft card burners; it determined that the federal law was justified and that it was unrelated to the freedom of speech. This outcome was criticized by legal experts.

This is clear civil disobedience – the preparedness to suffer for a clearly held view which is in apparent conflict with the law. Note that it is not always clear what the law is, and that cases involving CD may need to be resolved in high-profile cases. The resolution sometimes changes the status quo, sometimes upholds it.

I am more familiar with the acts of CD associated with nuclear weapons in the UK. I have protested at bases, but always within the law. However I have supported those who have deliberately challenged the law (this is not inconsistent). I have visited Greenham Common, and quote at length from http://www.greenhamwpc.org.uk/ (Greenham Common Women’s Peace Camp):

On the 5th September 1981, the Welsh group “Women for Life on Earth” arrived on Greenham Common, Berkshire, England. They marched from Cardiff with the intention of challenging, by debate, the decision to site 96 Cruise nuclear missiles there. On arrival they delivered a letter to the Base Commander which among other things stated ‘We fear for the future of all our children and for the future of the living world which is the basis of all life’.

This was the prime motivation – there was a secondary one in that common land had been appropriated for a nuclear missile base:

At a time when the USA and the USSR were competing for nuclear superiority in Europe, the Women’s Peace Camp on Greenham Common was seen as an edifying influence. The commitment to non-violence and non-alignment gave the protest an authority that was difficult to dismiss – journalists from almost every corner of the globe found their way to the camp and reported on the happenings and events taking place there.

…

The protest, committed to disrupting the exercises of the USAF, was highly effective. Nuclear convoys leaving the base to practice nuclear war, were blockaded, tracked to their practice area and disrupted. Taking non-violent direct action meant that women were arrested, taken to court and sent to prison.

And the camp finally achieved its objectives:

A number of initiatives were made by women in Court testing the legality of nuclear weapons. Also, challenges to the conduct and stewardship of the Ministry of Defence as landlords of Greenham Common. In 1992 Lord Taylor, Lord Chief Justice, delivering the Richard Dimbleby Lecture for the BBC, referring to the Bylaws case (won by Greenham women in the House of Lords in 1990) said ‘…it would be difficult to suggest a group whose cause and lifestyle were less likely to excite the sympathies and approval of five elderly judges. Yet it was five Law Lords who allowed the Appeal and held that the Minister had exceeded his powers in framing the byelaws so as to prevent access to common land’.

Here we see the power of nonviolent civil disobedience. The key included:

The conduct and integrity of the protest mounted by the Women’s Peace Camp was instrumental in the decision to remove the Cruise Missiles from Greenham Common. [my emphasis]

If civil disobedience comes to appear inevitable (for whatever cause) then integrity of purpose and action is key.

Posted in Uncategorized | 2 Comments

The benefits and limitations of Green Open Access

Posted on July 20, 2011 by pm286

In a reply to my exposition of Green, Gold, Gratis and Libre, Steve Hitchcock comments:

Steve Hitchcock says:

July 20, 2011 at 5:03 pm

Peter, In this blog post you say “Modern e-science requires documents over which the reader/user has rights of re-use, which is why Green self-archiving is of little value to high-volume information analysts.” In your next post on the Aaron Swartz/JSTOR case you arrive at a concluding point: “I am concerned that academic institutions will continue to develop their role as “police for publishers” rather than pressuring for democratic and legal change in the system.” What institutional change do you have in mind? In this context one is to provide, and mandate the use of, ‘green’ open access institutional repositories, but you appear to rule this out. Institutions can influence what their researchers and authors do more directly than they can act for ‘democratic and legal change’. I know from reading your posts and mails over many years that you prefer the libre OA approach above others, but you seem unsure who should take the lead on this, publishers or institutions. The one you choose will determine the starting point: green gratis OA (which institutions can provide), or libre OA (which institutions cannot provide for journal published content).

Steve is from Southampton, which is one of the shining examples of how to manage scholarship on a University-wide scale – with deposition mandates, clear IT infrastructure, etc. Researchers probably get more implicit and explicit support for self-archiving than almost anywhere else.

Green OA has the following advantages over Gold OA (I am assuming we compare gratis with gratis and libre with libre). (I am not including hybrid Gold in this – operationally it has almost no benefit over Green)

It costs no cash and the effort (particularly with a system like Soton’s Eprints with Chris Gutteridge to help) is fairly small

I cannot immediately think of any other universal advantages – I will add them as I go along and as they are pointed out

The following advantage(s) are common to both Gold and Green

They get indexed by search engines such as Google and Bing. I am not aware of any independent academic archive of Green OA or Gold OA. In fact I have a suggestion for doing exactly that which I will put in a later blog. I do not regard deposition in an IR as making Open content more discoverable than on a publisher’s web site – I suspect they are roughly equivalent – Bingle will index both.

The following are the advantages of Gold:

The licence is clear, both on the document itself and in the context. (Green OA almost never confers any rights explicitly, and the context may well not include rights.)
The documents may be systematically discovered by iterating through the publisher’s tables of contents. This is VERY important, perhaps the most striking advantage of Gold (whether gratis or libre). I can for example download all BMC content whenever I wish, subject only to the courtesy of agreeing a robot-friendly protocol when I want. Can I systematically download all Green material from the 100 UK repositories? I doubt it: (a) how do I discover it? (b) when I have discovered it how does my machine know the rights?
With Gold it is almost always possible to know whether the content is libre. It is almost impossible to determine the gratis/libre status on Green. I am therefore assuming that there are very few Green documents where I can trivially determine that they are libre.

The advantages of libre are enormous. I am assuming a high correlation between Gold = libre and Green = gratis. Effectively only Gold gives me a significant amount of libre. The advantages:

I can copy and reproduce some or all of the content
I can rework the text into book chapters
I can include the diagrams as slides
I can compute the tables in R or other statistics programs
I can extract the chemistry (yes we can extract the chemistry automatically).
I can use the material as a corpus for developing textmining
I can use the corpus to extract information
I can use the corpus to compare documents, including detection of plagiarism
I can make my own overlay journal (and we are doing exactly that with Acta Crystallographica E)
I can create resources on the web of Linked Open Data
I can create Open Research Reports for diseases (OKF/JISC hackathon in December)

And much more.

A caution. Some Greenophiles such as Stevan Harnad have told me I can do all this with Green material. I believe that in every case I would be breaking contract and/or copyright law. If anyone can convince me that almost all Green carries implicit rights to do this I would change my view. But I am very sceptical.

Gold Open Access has one major limitation:

It normally costs a considerable amount of money.

SteveH says:

green gratis OA (which institutions can provide),

This is not correct. The providers of the permission for Green gratis are the publishers. Some publishers such as the American ******** Society have been solidly set against Green Open Access of any sort. The institutions cannot provide Green. They can help authors find out WHETHER they have a right to self-archive as Green and they can – perhaps – lobby publishers to persuade them to allow Green SA. They can provide the technology to do it and they can provide implicit and explicit support. But they cannot provide it absolutely.

I need tens of thousands of articles. I need to know I am legally and contractually able to obtain and re-use them. If SteveH or anyone else can show how this can be done with Green articles in Repositories I’d be grateful.

As a touchstone it is impossible even to get all the UK theses published last year. Impossible to determine their rights. Impossible to know how to write a universal downloader. That’s much the same with Green, which need not even be in IRs.

Please – anyone – adjust this analysis.

Posted in Uncategorized | 14 Comments

The ethics of “stealing” scientific articles and civil disobedience

Posted on July 20, 2011 by pm286

I have been alerted to the following article in the Boston Globe about a Cambridge[Mass] Man who has been accused of “stealing” 4 million scientific articles. http://www.boston.com/Boston/metrodesk/2011/07/cambridge-man-accused-hacking-mit-computers-steal-scientific-papers/6SVnqu3Yfo7OIrLQOYSz5M/index.html?comments=all#readerComm

A Cambridge man [Swartz] who was a fellow at Harvard University’s Edmond J. Safra Center for Ethics is now facing federal charges that he hacked into a Massachusetts Institute of Technology computer archive system to steal more than 4 million articles from scientific journals and academic work.

…

Swartz has advocated for the elimination of barriers to the distribution of information over the Internet, and for the widest public distribution of information in libraries. He is also a co-founder of reddit.com.

…

However, the organization [JSTOR, the repository and resupplier of these articles] said that “a substantial portion of our publisher partners’ content was downloaded in an unauthorized fashion using the network at the Massachusetts Institute of Technology, one of our participating institutions. The content taken was systematically downloaded using an approach designed to avoid detection by our monitoring systems,” the statement said.

I shan’t reprint the whole article – this might infringe copyright. But I want to comment on one statement in it – and then more generally

The articles and journals listed under the JSTOR system are available through a paid subscription, with some subscriptions costing as much as $50,000. A portion of the fee is in turn paid over to copyright holders.

The subscriptions are paid to the PUBLISHERS. (I do not know whether JSTOR receives the full subscription and then relays some or all to the publisher). The publisher collects subscription revenues from JOURNAL subscriptions which may contain articles where the authors have, and have NOT, transferred copyright. In neither case do the AUTHORS receive any payment.

I do not personally advocate criminal damage, and I am currently reasonably scrupulous to avoid deliberately violating copyright law or the contract that my institution has signed with the publishers. I say “reasonably” because the whole area of law and contracts in this area is so complex that there is no human that understands it in all its details (it varies by country and individual institution). I am also conscious that I am employed here and as such my actions can disadvantage my employer. On two previous occasions my actions, perfectly legal actions, have caused the University to be cut off by publishers. Their server algorithms for “stealing” content had been triggered and reacted automatically. (ASIDE – how many of you are aware that the publisher alone decides what is and what is not legitimate usage of their content? They can just cut the institution off).

I believe that our laws and contracts for access to scientific literature need serious revision. I believe that the current situation is unethical and that decisions are made for reasons that do not help science and frequently hinder it. When one believes that laws must be changed, there are two main ways of doing it.

One is to work within the law and put personal and political pressure on the people and organizations involved. That is now what this blog has evolved to – I still do science, but only half of what I could (I did spend this morning writing code to calculate metabolism – but it only makes much sense if I can text-mine the literature – the literature that has been authored by my world colleagues). I campaign through this blog, through the OKF, and elsewhere and I rely on the viral spread of ideas to those who can be infected by them. Fortunately I live in a country which has established a tradition of free speech over centuries.

The other is deliberate breaking of the law. This is what Swartz has done. The Globe article is unclear but I assume he did not intend to benefit personally from his action. He did it to fight for a principle. (It was unclear whether he advertised his actions before or after). It can reasonably be described as http://en.wikipedia.org/wiki/Civil_disobedience

Ronald Dworkin held that there are three types of civil disobedience:

“Integrity-based” civil disobedience occurs when a citizen disobeys a law he feels is immoral, as in the case of northerners disobeying the fugitive slave laws by refusing to turn over escaped slaves to authorities.
“Justice-based” civil disobedience occurs when a citizen disobeys laws in order to lay claim to some right denied to him, as when blacks illegally protested during the Civil Rights Movement.
“Policy-based” civil disobedience occurs when a person breaks the law in order to change a policy (s)he believes is dangerously wrong.^[19]

Civil disobedience has a long history in the UK (see http://en.wikipedia.org/wiki/William_Penn )

… following his 1670 arrest with William Meade. Penn was accused of preaching before a gathering in the street, which Penn had deliberately provoked in order to test the validity of the new law against assembly. Penn pleaded for his right to see a copy of the charges laid against him and the laws he had supposedly broken, but the judge (the Lord Mayor of London) refused – even though this right was guaranteed by the law. Furthermore, the judge directed the jury to come to a verdict without hearing the defence.^[51]

Despite heavy pressure from the Lord Mayor to convict Penn, the jury returned a verdict of “not guilty”. When invited by the judge to reconsider their verdict and to select a new foreman, they refused and were sent to a cell over several nights to mull over their decision. The Lord Mayor then told the jury, “You shall go together and bring in another verdict, or you shall starve”, and not only had Penn sent to jail in loathsome Newgate Prison (on a charge of contempt of court), but the full jury followed him, and they were additionally fined the equivalent of a year’s wages each.^[52]^[53] The members of the jury, fighting their case from prison in what became known as Bushel’s Case, managed to win the right for all English juries to be free from the control of judges.^[54] This case was one of the more important trials that shaped the future concept of American freedom (see jury nullification)^[55] and was a victory for the use of the writ of habeas corpus as a means of freeing those unlawfully detained.

There are many other examples where civil disobedience has had similar effects in changing the law and policies.

In this blog post I am not advocating civil disobedience. But I am pointing out that the strains in the system are becoming larger. There is a growing feeling of inequality not only in scholarly publishing but in the more general access to human knowledge. The battle for http://en.wikipedia.org/wiki/Net_neutrality is critical to our development as a free knowledge-based world community.

I am making predictions, not issuing calls to action. If the feelings of injustice continue to grow I expect that we shall see more of this kind of action. The Net makes it easy to spread ideas and gather support. I am concerned that academic institutions will continue to develop their role as “police for publishers” rather than pressuring for democratic and legal change in the system. I appreciate the difficulties – we grow up in a society where we respect the law and where we are innately bound to work within it. But circumstances change and laws become outdated and counterproductive. Where this is not addressed major fractures are inevitable.

LATER NOTE: Read the comments below as well, which give greater background.

Posted in Uncategorized | 12 Comments

Green and Gold Open Access? Libre and Gratis. Reasons why readers and re-users matter

Posted on July 19, 2011 by pm286

I have just been reading Peter Suber’s latest SOAN http://www.arl.org/sparc/publications/articles/oanewsletter-oa-and-copyright.shtml (a monthly Open Access newsletter) and also his interview with Richard Poynder (short version http://poynder.blogspot.com/2011/07/peter-suber-leader-of-leaderless.html contains pointers to the full version).

PeterS is, for many of us, the person who has led Open Access to where it is today. His textual discourse is something we should all aspire to. Beautifully and simply wordsmithed, with all the arguments completely and fairly laid out. He has never ranted.

A lunchtime break gives me the opportunity to raise some questions about “Open Access”. Open Access *is* complex and the terminology has sometimes been wayward. It is now converging on two axes: Green-Gold and gratis-libre. This classification has taken years to resolve and during that time there has been much confusion. I’m afraid I have to say that several publishers benefit from the confusion and may deliberately promote it by non-standard terminology and poor labelling of products. Indeed if there is one message I would like everyone, especially publishers, to take away from this blogpost it is that precise terminology and clear labelling are essential. If, for example, as an author you pay 3000 USD to create an “Open Access” publication the publisher owes it to you to label it properly and to make it clear what benefits you have received that you may not have got from a non-Open product.

The term “Open Access” by itself is used so variably that all you can determine is that you can see the publication somewhere for free, hopefully for eternity. A responsible publisher should make it clear what the label means. We must also distinguish between visibility and the rights of an arbitrary reader to re-use some or all of the material. I am particularly concerned about rights as I wish to carry out textmining on a massive scale and many types of Open Access forbid this for various reasons. It should therefore be trivially clear on a publication what rights the reader (including a machine) has. This is technically straightforward and only laziness, ignorance or deliberate subterfuge are preventing it.

The right to view and the right to use are, unfortunately, entangled with when, where and how the document is published, and may depend on versions. This means that the business rules of almost every publisher differ from those of every other publisher. If you are a full-time information professional (e.g. a librarian or informatician – maybe a funder) then you have time to manage the most important publishers. But for the average author and reader it is unnecessarily confusing. For that reason it can be very hard to get the average person to spend time on the issues. So here goes (if I get this wrong then it shows how complex it is). Wikipedia is, as always, effectively definitive: http://en.wikipedia.org/wiki/Open_access_%28publishing%29

Colour axis (independent of Gratis/libre and independent of authorside fees)

This is particularly confusing because several colour axes have been in use for different purposes (and may still be so). Moreover colours have no mnemonic value.

Gold applies only to publication through publishers. When an author submits a manuscript here, the publisher has the responsibility of making the manuscript Open in whatever form and preserving it indefinitely (maybe with a third party). Gold publication may or may not carry author-side fees (for example the Beilstein Journal of Organic Chemistry is a gratis OA publisher with no fees, while BMC, IUCr and PLoS journals have authorside fees). Gold may be gratis or libre. Generally Gold is provided in a single completely Open Access journal (i.e. where all papers are required to be OA). Examples are BMC, PLoS and Acta Crystallographica E. A “Gold publisher” is a deprecated term as some publishers (e.g. BMC) have some closed publications.

Green relates to self-archiving, normally of material published in a conventional journal.
Assuming the author has the right, they may or may not choose to self-archive (i.e. by putting it on their website or in their Institutional Repository). The place and number of such archivals *may* be controlled by agreement with the publisher (e.g. you may/mayNot have multiple archivals, use the IR, etc.). There may also be regulations on *what* you archive. These may cover the pre-publication version (e.g. before peer-review) or the author’s corrected manuscript. They often do not allow archival of the final “publisher’s PDF”. Some universities will help researchers archive their publications. I can personally vouch that self-archival can be a time-consuming business (not a “one-click” process). It may also depend on having a very clear personal record of the timeline of interaction with the publisher.

Funders spend much effort negotiating with publishers as to exactly what form of colour is allowed and what type of self-archival.

A special form of (usually Gold) OA is the “hybrid journal”. This is where a single Gold publication appears in the same journal/issue as closed publications. IMO it must be carefully labelled so the reader/user can determine their rights. I see little value in hybrid publications – the publisher gains double revenue and the major benefit (in science) of automatic re-use is probably impossible to determine without a human.

The gratis-libre axis

This applies only to the rights of a reader and the terms are descended from Richard Stallman and others. The terms “Free” and “Open” should be avoided when talking about this axis. Gratis is “free as in beer” and libre is “free as in speech”. Gratis grants no rights, other than to read; libre grants significant rights. The fundamental Open Access declarations (Budapest, Berlin, Bethesda) defined Open Access in “libre”-oriented prose. Unfortunately much of that clarity has become muddied, and is only now refocussing. Libre must, IMO, be accompanied by a precise definition of the rights of the re-user. I would urge these to be compliant with the Open Definition (http://www.opendefinition.org/).

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

I reiterate – Gold/Green and Gratis/Libre are formally independent.

I was surprised to see from Peter Suber’s interview that a large proportion of Gold OA was not libre. Peter often measures by journals, whereas I measure by articles, especially in STM. Since most Gold OA is now likely to come from funder mandates it would worry me a lot if they were only buying Gratis for their money.

Why is libre so important? What do you get for your money? (assuming you pay and this isn’t donated by the journal).

You get certainty for your reader (assuming the libre rights are well defined). You should certainly get a clear licence or contract for your payment.
Assuming the libre is OpenDefinition compliant your reader can re-use the material for almost anything. This includes teaching, book chapters, slide shows, movies, databases, textmining, data mining.
You SHOULD get a clear indication on/in the document itself of (a) what the authorship is and (b) what the reader’s rights are.

If you get an undefined gratis document you cannot assume ANY of these things by default. To add rights to a self-archived document is often problematic. You cannot make assumptions that a given document carries rights unless it actually carries them. Institutional Repositories compound this, often by failing to state rights, failing to add rights to documents or even worse (as Cambridge and I suspect many others do) adding the blanket disclaimer:

This reiterates the default copyright position that:

Readers have no rights by default

“fair use” does not apply in the UK. DSpace does not know the author’s date of death so can never assert that an item is formally out of copyright. Therefore by default:

Unless the author/self-archivist makes a special effort, the reader has no rights of use over the deposited item

Modern e-science requires documents over which the reader/user has rights of re-use, which is why Green self-archiving is of little value to high-volume information analysts. Moreover the author has not only to indicate that the item is libre, they also have to do it in a way where the information is easily discovered.

It is incredibly difficult to discover libre Open Access items unless they are published under the Gold system in a “Gold journal” – i.e. where every paper is guaranteed to be libre. Here are some simple questions, which despite the large amount of resource poured into the system I cannot even start to answer:

Find me all libre papers published by the American Chemical Society, Springer, Elsevier, Wiley, Royal Soc. Chemistry.
Make available a machine-readable licence illustrating the rights I/myCrawler have
Find me all libre theses in the Cambridge DSpace
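As a rough illustration of the second question, here is a minimal sketch of what a machine check for libre rights could look like. It assumes (hypothetically) that a repository or publisher exposes a Dublin Core style metadata record in which dc:rights holds a licence URI. The record layout, the example records and the allow-list of licence prefixes are all my own assumptions for the sketch, not anything a particular repository is known to provide.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the Dublin Core element set, where dc:rights lives.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical allow-list for this sketch: licence URI prefixes that only
# require attribution and/or share-alike (or waive rights entirely).
LIBRE_LICENCE_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "https://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
    "https://creativecommons.org/licenses/by-sa/",
    "http://creativecommons.org/publicdomain/zero/",
    "https://creativecommons.org/publicdomain/zero/",
)


def libre_licences(record_xml: str) -> List[str]:
    """Return the dc:rights values that start with a known libre licence URI."""
    root = ET.fromstring(record_xml)
    found = []
    for rights in root.iter("{%s}rights" % DC_NS):
        value = (rights.text or "").strip()
        if value.startswith(LIBRE_LICENCE_PREFIXES):
            found.append(value)
    return found


def is_libre(record_xml: str) -> bool:
    """A record counts as libre only if it explicitly carries a libre licence."""
    return bool(libre_licences(record_xml))


# Made-up example records, purely for illustration.
LABELLED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
</record>"""

UNLABELLED = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Another example thesis</dc:title>
  <dc:rights>All rights reserved</dc:rights>
</record>"""

print(is_libre(LABELLED))
print(is_libre(UNLABELLED))
```

The sketch deliberately treats a missing or free-text rights statement as "not libre", which mirrors the default position that a reader (or crawler) has no rights unless the item actually carries them.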

I believe there are publishers out there who are trying to be constructive players in the Open Access market, i.e. giving the authors/funders value for their libre-fee. (I suspect there are some who are dragging their feet and giving as little value as possible for large fees). So constructive publishers, here is a check list:

Are all your libre publications labelled, both on the splash page and in the text itself with the readers’ rights? (A simple statement of CC-BY accomplishes this)
If you run hybrid journals is it easily possible to search for the libre content? Both by human and machine. This is not simply provided by a label saying “libre” but is a systematic exposition – effectively a separate TOC for libre content.

After all the author/funder can be paying a lot for a hybrid publication – all parties should regard this as an honourable transaction, not some back-street bargain.

And for repository managers:

Do all items carry explicit rights? Do you make it easy for these to be added? (DSpace does not.)
Does a reader/machine have an index of all your libre content?
Are your licences OKD-compliant and machine-readable?

And for funders:

Are you insisting on full libre for your fee payments?
Are you highlighting the value to society by creating indexes of the libre documents you have sponsored? (yes I know authors don’t comply always!)
Are you advocating the value of re-use rather than just visibility and showing it is value for money?

Posted in Uncategorized | 3 Comments

What’s wrong with scientific publishing? The challenges to ethical behaviour

Posted on July 19, 2011 by pm286

Here’s a comment from a blog some days ago which is so compelling I reproduce it in full. It needs little comment from me.

Nuwan says:

July 18, 2011 at 5:40 pm

I think scientific publications are a victim of our own “research success measurement yardstick”. I did my EECS graduate work in a far east university. Situation here is something like, your productivity as a researcher equals to the number of publication you write a year. On the first day I showed up in the graduate school, head of research summoned me and said “I want you to publish a journal paper and a conference paper every year! I won’t accept your thesis until you publish 2 journal papers”. In another words, he is putting the status quo — publish or perish — in few sentences. This pressure is even worse for junior academics, who are trying to build an academic career. Unless they author/co-author 20+ journal papers a year, their advancements in an academic institutions is most often ill fated.

I think this is deleterious for the whole of sciences. Such quantitative success measures lead to enormous pressures on researchers, which eventually leads to:

1. Publishing poor quality papers with half baked ideas or less rigorous experimental evidence
2. Helping unheard/unrecognized journals to proliferate
3. Researchers losing their integrity and proliferations of research malpractices
e.g. – fabrication of data, dishonesty, plagiarism, fragmenting single publication into multiple publications (just to get the brownie points), intellectual piracy (trying to get your name into colleague’s publications), publishing same results in multiple journals under different titles.

I was quite frustrated in academia and it eventually lead to my untimely departure, as I couldn’t stand what was happening around. After doing a long and thorough investigation, when I publish a paper, I see others have published half a dozen by the means of malpractices listed above. In university administrators perspective, I am nothing but an unproductive “dead-wood”. Finally I decided to do a 9-5 job in industry to earn the bread, and do research in spare time. This way, I won’t have any of the drawbacks being attached to an academic institution, and allow me to be more independent and honest researcher.

I wish the science community (as well as universities) reward more for “quality” research and publications rather than pure volume. It is my hypothesis, this is the key reason why science is not progressing at present. As most researchers have to “survive” in their respective institutions, hence they work on research that leads to predictable results, which translate into papers; rather engage in high quality/productive research, which always comes with high risk, long term rigorous investigations and most often not, big price tags.

I empathise with these comments – I have seen similar in the blogosphere over the last few years. I have no idea how common they are. It was very worrying that 2 years ago Acta Crystallographica detected 50+ publications from one institution which were all fake. The credit goes to the crystallographic community for detecting this – I suspect that a significant amount (probably not a large percentage) of scientific publications are partly or wholly fraudulent. In Cheminformatics, for example, (where I am on an editorial board) there is a culture of not publishing data (its IP is protected, and it gives the authors an “advantage”), using closed software (you make money from it) and not revealing all your analysis methods in detail. Although some editors are trying to change it, the culture of not allowing reproducibility (and not being interested in it) is still there. Almost by definition very few chemoinformatics papers can be reproduced from what is published in the paper. I am not saying any of the published work is fraudulent (I think quite a lot of it is meaningless, and that also leads to unnecessary forms of publication) but it would be difficult to detect problems simply by reading the paper.

Posted in Uncategorized | 2 Comments
