petermr's blog

A Scientist and the Web


Archive for July, 2011

What is civil disobedience?

Wednesday, July 27th, 2011

I have already blogged about the Aaron Swartz case and civil disobedience. I expect to blog more as the case unfolds (assuming it is carried through), but this post is to clarify what I think are the prerequisites for CD if it becomes relevant in the area of scholarly publishing. First, read the post from which I quoted.

Civil disobedience is a very powerful tool but it must be used with thought, care and bravery to be effective. It is the deliberate breaking of a law, regulation or practice on moral grounds (“conscience”), and not for personal gain. As such the motivation must be clear and, I believe, declared before the act. At present I am unclear in the Swartz case what the motivation was before the case and what will be presented in court. I cannot therefore label it as CD or not-CD, though his past actions were certainly not. Peter Suber has discussed CD and has made it clear that he does not regard Swartz’s previous actions (“Guerilla OA”) as CD, and from what Peter has written I agree completely (1. Suber, P. Guerilla OA. Open Access News. 21 August, 2008). This does not mean I may not personally support Swartz (I have signed the petition). More on this later.

I am familiar with CD in the context of protesting for peace and the right not to fight and for countries not to possess weapons of mass destruction. I grew up in the shadow of UK conscription and was prepared to face a tribunal (and possibly prison) and argue why I was not prepared to join the armed forces. In this I was following the example of people like Kathleen Lonsdale, pre-eminent scientist (first female FRS) and Quaker peace campaigner:

Kathleen Lonsdale became a Quaker in 1935, simultaneously with her husband. Both of them were committed pacifists and were attracted to Quakerism in no small part for this reason. She served a month in Holloway prison during the Second World War because she refused to register for civil defence duties or pay a fine for refusing to register. At the annual meeting of the British Quakers in 1953 she delivered the keynote Swarthmore Lecture, under the title Removing the Causes of War. [my emphasis]

This action (refusing to take part in the war action) is perhaps more an act of conscience than CD, but the effect is the same – a preparedness to suffer if necessary for clear principles. A similar action was the burning of draft cards in the USA, where groups of individuals systematically destroyed the documents recording their requirement to sign up for war duties:

Beginning in May 1964,[1][2] some activists burned their draft cards at antiwar rallies and demonstrations. By May 1965 it was happening with greater frequency. To limit this kind of protest,[3] in August 1965, the United States Congress enacted a law to broaden draft card violations to punish anyone who “knowingly destroys, knowingly mutilates” his draft card.[4] Subsequently, 46 men were indicted[5] for burning their draft cards at various rallies, and four major court cases were heard. One of them, United States v. O’Brien, was argued before the Supreme Court. The act of draft card burning was defended as a symbolic form of free speech, a constitutional right guaranteed by the First Amendment. The Supreme Court decided against the draft card burners; it determined that the federal law was justified and that it was unrelated to the freedom of speech. This outcome was criticized by legal experts.

This is clear civil disobedience – the preparedness to suffer for a clearly held view which is in apparent conflict with the law. Note that it is not always clear what the law is, and that cases involving CD may need to be resolved in high-profile cases. The resolution sometimes changes the status quo, sometimes upholds it.

I am more familiar with the acts of CD associated with nuclear weapons in the UK. I have protested at bases, but always within the law. However I have supported those who have deliberately challenged the law (this is not inconsistent). I have visited Greenham Common, and quote at length from an account of the Greenham Common Women’s Peace Camp:

On the 5th September 1981, the Welsh group “Women for Life on Earth” arrived on Greenham Common, Berkshire, England. They marched from Cardiff with the intention of challenging, by debate, the decision to site 96 Cruise nuclear missiles there. On arrival they delivered a letter to the Base Commander which among other things stated ‘We fear for the future of all our children and for the future of the living world which is the basis of all life’.

This was the prime motivation – there was a secondary one in that common land had been appropriated for a nuclear missile base:

At a time when the USA and the USSR were competing for nuclear superiority in Europe, the Women’s Peace Camp on Greenham Common was seen as an edifying influence. The commitment to non-violence and non-alignment gave the protest an authority that was difficult to dismiss – journalists from almost every corner of the globe found their way to the camp and reported on the happenings and events taking place there.

The protest, committed to disrupting the exercises of the USAF, was highly effective. Nuclear convoys leaving the base to practice nuclear war were blockaded, tracked to their practice area and disrupted. Taking non-violent direct action meant that women were arrested, taken to court and sent to prison.

And the camp finally achieved its objectives:

A number of initiatives were made by women in Court testing the legality of nuclear weapons. Also, challenges to the conduct and stewardship of the Ministry of Defence as landlords of Greenham Common. In 1992 Lord Taylor, Lord Chief Justice, delivering the Richard Dimbleby Lecture for the BBC, referring to the Bylaws case (won by Greenham women in the House of Lords in 1990) said ‘…it would be difficult to suggest a group whose cause and lifestyle were less likely to excite the sympathies and approval of five elderly judges. Yet it was five Law Lords who allowed the Appeal and held that the Minister had exceeded his powers in framing the byelaws so as to prevent access to common land’.

Here we see the power of nonviolent civil disobedience. The key factors included:

The conduct and integrity of the protest mounted by the Women’s Peace Camp was instrumental in the decision to remove the Cruise Missiles from Greenham Common. [my emphasis]

If it comes to the point where civil disobedience appears inevitable (for whatever cause), then integrity of purpose and action is key.


The benefits and limitations of Green Open Access

Wednesday, July 20th, 2011

In a reply to my exposition of Green, Gold, Gratis and Libre, Steve Hitchcock comments:


Steve Hitchcock says:

July 20, 2011 at 5:03 pm  

Peter, In this blog post you say “Modern e-science requires documents over which the reader/user has rights of re-use, which is why Green self-archiving is of little value to high-volume information analysts.” In your next post on the Aaron Swartz/JSTOR case you arrive at a concluding point: “I am concerned that academic institutions will continue to develop their role as “police for publishers” rather than pressuring for democratic and legal change in the system.” What institutional change do you have in mind? In this context one is to provide, and mandate the use of, ‘green’ open access institutional repositories, but you appear to rule this out. Institutions can influence what their researchers and authors do more directly than they can act for ‘democratic and legal change’. I know from reading your posts and mails over many years that you prefer the libre OA approach above others, but you seem unsure who should take the lead on this, publishers or institutions. The one you choose will determine the starting point: green gratis OA (which institutions can provide), or libre OA (which institutions cannot provide for journal published content).

Steve is from Southampton, which is one of the shining examples of how to manage scholarship on a University-wide scale – with deposition mandates, clear IT infrastructure, etc. Researchers probably get more implicit and explicit support for self-archiving than almost anywhere else.

Green OA has the following advantages over Gold OA (I am assuming we compare gratis with gratis and libre with libre). (I am not including hybrid Gold in this – operationally it has almost no benefit over Green)

  • It costs no cash and the effort (particularly with a system like Soton’s Eprints with Chris Gutteridge to help) is fairly small

I cannot immediately think of any other universal advantages – I will add them as I go along and as they are pointed out.

The following advantage(s) are common to both Gold and Green

  • They get indexed by search engines such as Google and Bing. I am not aware of any independent academic archive of Green OA or Gold OA. In fact I have a suggestion for doing exactly that which I will put in a later blog. I do not regard deposition in an IR as making Open content more discoverable than on a publisher’s web site – I suspect they are roughly equivalent – Bingle will index both.

The following are the advantages of Gold:

  • The licence is clear, both on the document itself and in the context. (Green OA almost never confers any rights explicitly, and the context may well not include rights.)
  • The documents may be systematically discovered by iterating through the publisher’s tables of contents. This is VERY important, perhaps the most striking advantage of Gold (whether gratis or libre). I can for example download all BMC content whenever I wish, subject only to the courtesy of agreeing a robot-friendly protocol. Can I systematically download all Green material from the 100 UK repositories? I doubt it: (a) how do I discover it? (b) when I have discovered it, how does my machine know the rights?
  • With Gold it is almost always possible to know whether the content is libre. It is almost impossible to determine gratis/libre status on Green. I am therefore assuming that there are very few Green documents where I can trivially determine that they are libre.
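The "iterate through the publisher's tables of contents" approach above can be sketched in a few lines. This is a toy illustration only: the URL pattern, journal name and fetch function are invented, not any real publisher's API, and a real crawler would honour robots.txt and the agreed protocol.

```python
# A minimal sketch of systematic discovery via a publisher's TOC pages.
# The URL scheme below is hypothetical, purely for illustration.
import time


def toc_urls(journal, volumes, issues_per_volume):
    """Generate hypothetical table-of-contents URLs for every issue."""
    return [
        f"https://publisher.example.org/{journal}/vol{v}/issue{i}"
        for v in range(1, volumes + 1)
        for i in range(1, issues_per_volume + 1)
    ]


def polite_fetch_all(urls, fetch, delay=1.0):
    """Fetch each TOC page, pausing between requests (robot-friendly)."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)
    return pages


urls = toc_urls("openjournal", volumes=2, issues_per_volume=3)
print(urls[0])    # https://publisher.example.org/openjournal/vol1/issue1
print(len(urls))  # 6
```

The point is that for a fully Gold journal this enumeration is trivial and complete; for Green material scattered across a hundred repositories there is no equivalent pattern to iterate over.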

The advantages of libre are enormous. I am assuming a high correlation between Gold = libre and Green = gratis. Effectively only Gold gives me a significant amount of libre. The advantages:

  • I can copy and reproduce some or all of the content
  • I can rework the text into book chapters
  • I can include the diagrams as slides
  • I can compute the tables in R or other statistic programs
  • I can extract the chemistry (yes we can extract the chemistry automatically).
  • I can use the material as a corpus for developing textmining
  • I can use the corpus to extract information
  • I can use the corpus to compare documents, including detection of plagiarism
  • I can make my own overlay journal (and we are doing exactly that with Acta Crystallographica E)
  • I can create resources on the web of Linked Open Data
  • I can create Open Research Reports for diseases (OKF/JISC hackathon in December)

And much more.
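To make the textmining items above concrete, here is a deliberately tiny sketch of corpus-based term counting, the simplest building block of the kind of analysis listed. The "corpus" here is three invented example titles, standing in for a collection of libre full texts.

```python
# A toy illustration of corpus textmining over libre content.
# The three texts are invented examples, not real articles.
from collections import Counter
import re

corpus = [
    "Crystal structure of a novel zinc complex",
    "Zinc binding sites in protein crystal structures",
    "Automated extraction of chemistry from the literature",
]


def term_frequencies(texts):
    """Count lower-cased word occurrences across the whole corpus."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


freqs = term_frequencies(corpus)
print(freqs["zinc"])     # 2
print(freqs["crystal"])  # 2
```

None of this is legally possible at scale unless the corpus is libre: the value of the rights is precisely that a machine may read, copy and transform every document.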

A caution. Some Greenophiles such as Stevan Harnad have told me I can do all this with Green material. I believe that in every case I would be breaking contract and/or copyright law. If anyone can convince me that almost all Green carries implicit rights to do this I would change my view. But I am very sceptical.

Gold Open Access has one major limitation:

  • It normally costs a considerable amount of money.

SteveH says:

green gratis OA (which institutions can provide),

This is not correct. The providers of the permission for Green gratis are the publishers. Some publishers such as the American ******** Society have been solidly set against Green Open Access of any sort. The institutions cannot provide Green. They can help authors find out WHETHER they have a right to self-archive as Green and they can – perhaps – lobby publishers to persuade them to allow Green SA. They can provide the technology to do it and they can provide implicit and explicit support. But they cannot provide it absolutely.

I need tens of thousands of articles. I need to know I am legally and contractually able to obtain and re-use them. If SteveH or anyone else can show how this can be done with Green articles in Repositories I’d be grateful.

As a touchstone, it is impossible even to get all the UK theses published last year. Impossible to determine their rights. Impossible to know how to write a universal downloader. That’s much the same with Green, which need not even be in IRs.

Please – anyone – adjust this analysis.




The ethics of “stealing” scientific articles and civil disobedience

Wednesday, July 20th, 2011

I have been alerted to the following article in the Boston Globe about a Cambridge [Mass.] man who has been accused of “stealing” 4 million scientific articles.

A Cambridge man [Swartz] who was a fellow at Harvard University’s Edmond J. Safra Center for Ethics is now facing federal charges that he hacked into a Massachusetts Institute of Technology computer archive system to steal more than 4 million articles from scientific journals and academic work.

Swartz has advocated for the elimination of barriers to the distribution of information over the Internet, and for the widest public distribution of information in libraries. He is also a co-founder of

However, the organization [JSTOR, the repository and resupplier of these articles] said that “a substantial portion of our publisher partners’ content was downloaded in an unauthorized fashion using the network at the Massachusetts Institute of Technology, one of our participating institutions.  The content taken was systematically downloaded using an approach designed to avoid detection by our monitoring systems,” the statement said.

I shan’t reprint the whole article – this might infringe copyright. But I want to comment on one statement in it – and then more generally

The articles and journals listed under the JSTOR system are available through a paid subscription, with some subscriptions costing as much as $50,000. A portion of the fee is in turn paid over to copyright holders.

The subscriptions are paid to the PUBLISHERS. (I do not know whether JSTOR receives the full subscription and then relays some or all to the publisher). The publisher collects subscription revenues from JOURNAL subscriptions which may contain articles where the authors have, and have NOT, transferred copyright. In neither case do the AUTHORS receive any payment.

I do not personally advocate criminal damage, and I am currently reasonably scrupulous to avoid deliberately violating copyright law or the contract that my institution has signed with the publishers. I say “reasonably” because the whole area of law and contracts in this area is so complex that there is no human that understands it in all its details (it varies by country and individual institution). I am also conscious that I am employed here and as such my actions can disadvantage my employer. On two previous occasions my actions, perfectly legal actions, have caused the University to be cut off by publishers. Their server algorithms for “stealing” content had been triggered and reacted automatically. (ASIDE – how many of you are aware that the publisher alone decides what is and what is not legitimate usage of their content? They can just cut the institution off).

I believe that our laws and contracts for access to scientific literature need serious revision. I believe that the current situation is unethical and that decisions are made for reasons that do not help science and frequently hinder it. When one believes that laws must be changed, there are two main ways of doing it.

One is to work within the law and put personal and political pressure on the people and organizations involved. That is now what this blog has evolved to – I still do science, but only half of what I could (I did spend this morning writing code to calculate metabolism – but it only makes much sense if I can text-mine the literature – the literature that has been authored by my world colleagues). I campaign through this blog, through the OKF, and elsewhere and I rely on the viral spread of ideas to those who can be infected by them. Fortunately I live in a country which has established a tradition of free speech over centuries.

The other is deliberate breaking of the law. This is what Swartz has done. The Globe article is unclear but I assume he did not intend to benefit personally from his action. He did it to fight for a principle. (It was unclear whether he advertised his actions before or after.) It can reasonably be described as civil disobedience:

Ronald Dworkin held that there are three types of civil disobedience:

  • “Integrity-based” civil disobedience occurs when a citizen disobeys a law he feels is immoral, as in the case of northerners disobeying the fugitive slave laws by refusing to turn over escaped slaves to authorities.
  • “Justice-based” civil disobedience occurs when a citizen disobeys laws in order to lay claim to some right denied to him, as when blacks illegally protested during the Civil Rights Movement.
  • “Policy-based” civil disobedience occurs when a person breaks the law in order to change a policy (s)he believes is dangerously wrong.[19]

Civil disobedience has a long history in the UK:

… following his 1670 arrest with William Meade. Penn was accused of preaching before a gathering in the street, which Penn had deliberately provoked in order to test the validity of the new law against assembly. Penn pleaded for his right to see a copy of the charges laid against him and the laws he had supposedly broken, but the judge (the Lord Mayor of London) refused – even though this right was guaranteed by the law. Furthermore, the judge directed the jury to come to a verdict without hearing the defence.[51]

Despite heavy pressure from the Lord Mayor to convict Penn, the jury returned a verdict of “not guilty”. When invited by the judge to reconsider their verdict and to select a new foreman, they refused and were sent to a cell over several nights to mull over their decision. The Lord Mayor then told the jury, “You shall go together and bring in another verdict, or you shall starve”, and not only had Penn sent to jail in loathsome Newgate Prison (on a charge of contempt of court), but the full jury followed him, and they were additionally fined the equivalent of a year’s wages each.[52][53] The members of the jury, fighting their case from prison in what became known as Bushel’s Case, managed to win the right for all English juries to be free from the control of judges.[54] This case was one of the more important trials that shaped the future concept of American freedom (see jury nullification)[55] and was a victory for the use of the writ of habeas corpus as a means of freeing those unlawfully detained.

There are many other examples where civil disobedience has had similar effects in changing the law and policies.

In this blog post I am not advocating civil disobedience. But I am pointing out that the strains in the system are becoming larger. There is a growing feeling of inequality not only in scholarly publishing but in the more general access to human knowledge. This battle is critical to our development as a free knowledge-based world community.

I am making predictions, not issuing calls to action. If the feelings of injustices continue to grow I expect that we shall see more of this kind of action. The Net makes it easy to spread ideas, gather support. I am concerned that academic institutions will continue to develop their role as “police for publishers” rather than pressuring for democratic and legal change in the system. I appreciate the difficulties – we grow up in a society where we respect the law and where we are innately bound to work within it. But circumstances change and laws become outdated and counterproductive. Where this is not addressed major fractures are inevitable.

LATER NOTE: Read the comments below as well, which give greater background.

Green and Gold Open Access? Libre and Gratis. Reasons why readers and re-users matter

Tuesday, July 19th, 2011

I have just been reading Peter Suber’s latest SOAN (a monthly Open Access newsletter) and also his interview with Richard Poynder (the short version contains pointers to the full version).

PeterS is, for many of us, the person who has led Open Access to where it is today. His textual discourse is something we should all aspire to. Beautifully and simply wordsmithed, with all the arguments completely and fairly laid out. He has never ranted.

A lunchtime break gives me the opportunity to raise some questions about “Open Access”. Open Access *is* complex and the terminology has sometimes been wayward. It is now converging on two axes: Green-Gold and gratis-libre. This classification has taken years to resolve and during that time there has been much confusion. I’m afraid I have to say that several publishers benefit from the confusion and may deliberately promote it by non-standard terminology and poor labelling of products. Indeed if there is one message I would like everyone, especially publishers, to take away from this blogpost it is that precise terminology and clear labelling are essential. If, for example, as an author you pay 3000 USD to create an “Open Access” publication, the publisher owes it to you to label it properly and to make it clear what benefits you have received that you may not have got from a non-Open product.

The term “Open Access” by itself is used so variably that all you can determine is that you can see the publication somewhere for free, hopefully for eternity. A responsible publisher should make it clear what the label means. We must also distinguish between visibility and the rights of an arbitrary reader to re-use some or all of the material. I am particularly concerned about rights as I wish to carry out textmining on a massive scale and many types of Open Access forbid this for various reasons. It should therefore be trivially clear on a publication what rights the reader (including a machine) has. This is technically straightforward and only laziness, ignorance or deliberate subterfuge are preventing it.

The right to view and the right to use are, unfortunately, entangled with when, where and how the document is published, and may depend on versions. This makes the business rules for almost every publisher different from every other publisher. If you are a full-time information professional (e.g. a librarian or informatician – maybe a funder) then you have time to manage the most important publishers. But for the average author and reader it is unnecessarily confusing. For that reason it can be very hard to get the average person to spend time on the issues. So here goes (if I get this wrong then it shows how complex it is). Wikipedia is, as always, effectively definitive.

Colour axis (independent of Gratis/libre and independent of authorside fees)

This is particularly confusing because several colour axes have been in use for different purposes (and may still be so). Moreover colours have no mnemonic value.

Gold applies only to publication through publishers. When a manuscript is submitted, the publisher takes on the responsibility of making it Open in whatever form and preserving it indefinitely (maybe with a third party). Gold publication may or may not carry author-side fees (for example the Beilstein Journal of Organic Chemistry is a gratis OA publisher with no fees, while BMC, IUCr and PLoS journals have author-side fees). Gold may be gratis or libre. Generally Gold is provided in a single completely Open Access journal (i.e. where all papers are required to be OA). Examples are BMC, PLoS and Acta Crystallographica E. “Gold publisher” is a deprecated term as some publishers (e.g. BMC) have some closed publications.

Green relates to self-archiving, normally of material published in a conventional journal. Assuming the author has the right, they may or may not choose to self-archive (i.e. by putting it on their website or in their Institutional Repository). The place and number of such archivals *may* be controlled by agreement with the publisher (e.g. you may or may not have multiple archivals, use the IR, etc.). There may also be regulations on *what* you archive: these may cover the pre-publication version (e.g. before peer-review) or the author’s corrected manuscript, and often do not allow archival of the final “publisher’s PDF”. Some universities will help researchers archive their publications. I can personally vouch that self-archival can be a time-consuming business (not a “one-click” process). It may also depend on having a very clear personal record of the timeline of interaction with the publisher.

Funders spend much effort negotiating with publishers as to exactly what form of colour is allowed and what type of self-archival.

A special form of (usually Gold) OA is the “hybrid journal“. This is where a single Gold publication appears in the same journal/issue as closed publications. IMO it must be carefully labelled so the reader/user can determine their rights. I see little value in hybrid publications – the publisher gains double revenue and the major benefit (in science) of automatic re-use is probably impossible to determine without a human.

The gratis-libre axis

This applies only to the rights of a reader, and the terms are descended from Richard Stallman and others. The terms “Free” and “Open” should be avoided when talking about this axis. Gratis is “free as in beer” and libre is “free as in speech”. Gratis grants no rights other than to read; libre grants significant rights. The fundamental Open Access declarations (Budapest, Berlin, Bethesda) defined Open Access in “libre”-oriented prose. Unfortunately much of that clarity has become muddied, and is only now refocussing. Libre must, IMO, be accompanied by a precise definition of the rights of the re-user. I would urge these to be compliant with the Open Definition:

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.


I reiterate – Gold/Green and Gratis/Libre are formally independent.

I was surprised to see from Peter Suber’s interview that a large proportion of Gold OA was not libre. Peter often measures by journals, whereas I measure by articles, especially in STM. Since most Gold OA is now likely to come from funder mandates it would worry me a lot if they were only buying Gratis for their money.

Why is libre so important? What do you get for your money? (assuming you pay and this isn’t donated by the journal).

  • You get certainty for your reader (assuming the libre rights are well defined). You should certainly get a clear licence or contract for your payment.
  • Assuming the libre is OpenDefinition compliant your reader can re-use the material for almost anything. This includes teaching, book chapters, slide shows, movies, databases, textmining, data mining.
  • You SHOULD get a clear indication on/in the document itself of (a) what the authorship is and (b) what the reader’s rights are.

If you get an undefined gratis document you cannot assume ANY of these things by default. To add rights to a self-archived document is often problematic. You cannot make assumptions that a given document carries rights unless it actually carries them. Institutional Repositories compound this, often by failing to state rights, failing to add rights to documents or even worse (as Cambridge and I suspect many others do) adding the blanket disclaimer:

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

This reiterates the default copyright position that:

Readers have no rights by default

“fair use” does not apply in the UK. DSpace does not know the author’s date of death so can never assert that an item is formally out of copyright. Therefore by default:

    Unless the author/self-archivist makes a special effort, the reader has no rights of use over the deposited item

Modern e-science requires documents over which the reader/user has rights of re-use, which is why Green self-archiving is of little value to high-volume information analysts. Moreover the author has not only to indicate that the item is libre, they also have to do it in a way where the information is easily discovered.

It is incredibly difficult to discover libre Open Access items unless they are published under the Gold system in a “Gold journal” – i.e. where every paper is guaranteed to be libre. Here are some simple questions, which despite the large amount of resource poured into the system I cannot even start to answer:

  • Find me all libre papers published by the American Chemical Society, Springer, Elsevier, Wiley, Royal Soc. Chemistry.
  • Make available a machine-readable licence illustrating the rights I/myCrawler have
  • Find me all libre theses in the Cambridge DSpace
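The discovery problem behind those questions can be sketched very simply. Given repository records, a machine can only safely select items whose rights are explicit; the record layout and field names below are invented for illustration, not any real repository schema.

```python
# A sketch of why explicit rights metadata matters for discovery.
# The records and the "licence" field are hypothetical.
records = [
    {"title": "Paper A", "licence": "CC-BY"},
    {"title": "Paper B", "licence": None},  # rights unstated: unusable
    {"title": "Paper C", "licence": "All rights reserved"},
    {"title": "Thesis D", "licence": "CC-BY-SA"},
]

# Licences a crawler could treat as libre (a simplification).
LIBRE = {"CC-BY", "CC-BY-SA", "CC0"}


def libre_items(items):
    """Keep only items whose stated licence grants re-use rights."""
    return [r["title"] for r in items if r["licence"] in LIBRE]


print(libre_items(records))  # ['Paper A', 'Thesis D']
```

In practice the filter fails at the first line: most repositories never populate anything like the `licence` field, so the machine must discard almost everything by default.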


I believe there are publishers out there who are trying to be constructive players in the Open Access market, i.e. giving the authors/funders value for their libre fee. (I suspect there are some who are dragging their feet and giving as little value as possible for large fees.) So, constructive publishers, here is a checklist:

  • Are all your libre publications labelled, both on the splash page and in the text itself with the readers’ rights? (A simple statement of CC-BY accomplishes this)
  • If you run hybrid journals is it easily possible to search for the libre content? Both by human and machine. This is not simply provided by a label saying “libre” but is a systematic exposition – effectively a separate TOC for libre content.

After all the author/funder can be paying a lot for a hybrid publication – all parties should regard this as an honourable transaction, not some back-street bargain.

And for repository managers:

  • Do all items carry explicit rights? Do you make it easy for these to be added? (DSpace does not.)
  • Does a reader/machine have an index of all your libre content?
  • Are your licences OKD-compliant and machine-readable?
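For repository managers wondering what "explicit and machine-readable" might look like in practice, here is one minimal possibility: rights carried as structured data alongside the item. The schema is invented for illustration; only the CC-BY licence URI is real.

```python
# A sketch of machine-readable rights on a repository item.
# The record structure is hypothetical, not a DSpace or standard schema.
import json

item = {
    "id": "thesis-2011-0042",
    "title": "An example thesis",
    "licence": {
        "name": "CC-BY 3.0",
        "uri": "https://creativecommons.org/licenses/by/3.0/",
        "okd_compliant": True,
    },
}

serialized = json.dumps(item)    # what a crawler would download
parsed = json.loads(serialized)  # what the crawler's machine reads back
print(parsed["licence"]["name"])           # CC-BY 3.0
print(parsed["licence"]["okd_compliant"])  # True
```

Anything of this shape, consistently applied, would let a machine answer the rights question without a human reading the splash page.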

And for funders:

  • Are you insisting on full libre for your fee payments?
  • Are you highlighting the value to society by creating indexes of the libre documents you have sponsored? (Yes, I know authors don’t always comply!)
  • Are you advocating the value of re-use rather than just visibility, and showing it is value for money?



What’s wrong with scientific publishing? The challenges to ethical behaviour

Tuesday, July 19th, 2011

Here’s a comment from a blog some days ago which is so compelling I reproduce it in full. It needs little comment from me.

Nuwan says:

July 18, 2011 at 5:40 pm 

I think scientific publications are a victim of our own “research success measurement yardstick”. I did my EECS graduate work in a far east university. Situation here is something like, your productivity as a researcher equals to the number of publication you write a year. On the first day I showed up in the graduate school, head of research summoned me and said “I want you to publish a journal paper and a conference paper every year! I won’t accept your thesis until you publish 2 journal papers”. In another words, he is putting the status quo — publish or perish — in few sentences. This pressure is even worse for junior academics, who are trying to build an academic career. Unless they author/co-author 20+ journal papers a year, their advancements in an academic institutions is most often ill fated.

I think this is deleterious for the whole of sciences. Such quantitative success measures lead to enormous pressures on researchers, which eventually leads to:

1. Publishing poor quality papers with half baked ideas or less rigorous experimental evidence
2. Helping unheard/unrecognized journals to proliferate
3. Researchers losing their integrity and proliferations of research malpractices
e.g. – fabrication of data, dishonesty, plagiarism, fragmenting single publication into multiple publications (just to get the brownie points), intellectual piracy (trying to get your name into colleague’s publications), publishing same results in multiple journals under different titles.

I was quite frustrated in academia and it eventually lead to my untimely departure, as I couldn’t stand what was happening around. After doing a long and thorough investigation, when I publish a paper, I see others have published half a dozen by the means of malpractices listed above. In university administrators perspective, I am nothing but an unproductive “dead-wood”. Finally I decided to do a 9-5 job in industry to earn the bread, and do research in spare time. This way, I won’t have any of the drawbacks being attached to an academic institution, and allow me to be more independent and honest researcher.

I wish the science community (as well as universities) reward more for “quality” research and publications rather than pure volume. It is my hypothesis, this is the key reason why science is not progressing at present. As most researchers have to “survive” in their respective institutions, hence they work on research that leads to predictable results, which translate into papers; rather engage in high quality/productive research, which always comes with high risk, long term rigorous investigations and most often not, big price tags.

I empathize with these comments – I have seen similar ones in the blogosphere over the last few years. I have no idea how common they are. It was very worrying that two years ago Acta Crystallographica detected 50+ publications from one institution which were all fake. The credit goes to the crystallographic community for detecting this – I suspect that a significant amount (probably not a large percentage) of scientific publications are partly or wholly fraudulent. In cheminformatics, for example (where I am on an editorial board), there is a culture of not publishing data (its IP is protected, and it gives the authors an “advantage”), of using closed software (you make money from it), and of not revealing all your analysis methods in detail. Although some editors are trying to change it, the culture of not allowing reproducibility (and not being interested in it) is still there. Almost by definition very few cheminformatics papers can be reproduced from what is published in the paper. I am not saying any of the published work is fraudulent (I think quite a lot of it is meaningless, and that also leads to unnecessary forms of publication) but it would be difficult to detect problems simply by reading the paper.

MOTSI: What is a citation?

Monday, July 18th, 2011

We are all now judged by citations. But what *IS* a citation? It’s not easy to answer… and it may not be quite what you think. Wikipedia gives:

Broadly, a citation is a reference to a published or unpublished source (not always the original source). More precisely, a citation is an abbreviated alphanumeric expression (e.g. [Newell84]) embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work for the purpose of acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears. Generally the combination of both the in-body citation and the bibliographic entry constitutes what is commonly thought of as a citation (whereas bibliographic entries by themselves are not).[PMR's emphasis]

A prime purpose of a citation is intellectual honesty: to attribute prior or unoriginal work and ideas to the correct sources, and to allow the reader to determine independently whether the referenced material supports the author’s argument in the claimed way.


Bibliographies, and other list-like compilations of references, are generally not considered citations because they do not fulfill the true spirit of the term: deliberate acknowledgment by other authors of the priority of one’s ideas.

By the definition of this Wikipedia article it is clear that a citation consists not only of the bibliographic entry reference, but also of the in-text context. Using our own paper (Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry, BalaKrishna Kolluru, Lezan Hawizy, Peter Murray-Rust, Junichi Tsujii, Sophia Ananiadou) we find a typical context in which the citation occurs.


  1. Different aspects [of text-mining] such as named entity recognition (NER), tokenisation and acronym detection require bespoke approaches because of the complex nature of such texts [1]–[5].

The [1]–[5] represent five pieces of prior work, whose role is defined by the surrounding language. This “sentiment” is very difficult to analyse exactly by machine and requires a human to describe the type of the citation. In this case it is prior work in the field. Here are the resolved bibliographic references:

Kemp N, Lynch M (1998) Extraction of information from the text of chemical patents. 1. Identification of specific chemical names. Journal of Chemical Information and Computer Sciences 4: 544–551.

Murray-Rust P, Rzepa H (1999) Chemical markup, XML, and the worldwide web. 1. Basic principles. Journal of Chemical Information and Computer Sciences 39: 928–942.

Murray-Rust P, Mitchell J, Rzepa H (2005) Chemistry in bioinformatics. BMC Bioinformatics 6: 141.

Banville D (2006) Mining chemical structural information from the drug literature. Drug Discovery Today 11: 35–42.

Kolrik C, Hofmann-Apitius M, Zimmermann M, Fluck J (2007) Identification of new drug classification terms in textual resources. Bioinformatics 13: 264–272.

Note that the references themselves are NOT citations; it is the combination of each of them with the context (sentence 1) that defines the citation. This is important as most “citations” do not fulfil this criterion. Note also that two of the references are to works authored in part by one of the authors (PMR); these, when expressed as citations, are sometimes called “self-citations”. (But note that several authors are involved in each case.) These citations – in the first paragraph of the paper – can be assumed to reference fairly important prior work on which the paper probably builds.

These citations can be seen to lend some merit to the work described by the references. But it is because of the complete citation that we lend the merit – not just because the references are in the paper. In using citations, therefore, we should always include the sentiment. There are many types of sentiment – we looked at this in our Sciborg project:

Each citation is labelled with exactly one category. The following top-level four-way distinction applies:

  • Weakness: Authors point out a weakness in cited Work

  • Contrast: Authors make contrast/comparison with cited work (4 categories)

  • Positive: Authors agree with/make use of/show compatibility or similarity with cited work (6 categories),


  • Neutral: Function of citation is either neutral, or weakly signalled, or different from the three functions stated above.


Some of these might count positively to an author’s reputation, others would be negative.
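The distinction between a bare bibliographic reference and a true citation can be made concrete as a data structure. This is a sketch, with hypothetical field names, loosely following the four-way scheme above (Weakness / Contrast / Positive / Neutral):

```python
from dataclasses import dataclass

# A bibliographic reference alone is not a citation; a citation pairs a
# reference with its in-text context and a sentiment category.

@dataclass(frozen=True)
class Reference:
    authors: str
    year: int
    title: str

@dataclass(frozen=True)
class Citation:
    reference: Reference
    context: str    # the sentence in which the citation appears
    sentiment: str  # "Weakness" | "Contrast" | "Positive" | "Neutral"

ref = Reference("Murray-Rust P, Rzepa H", 1999,
                "Chemical markup, XML, and the worldwide web")
cit = Citation(ref,
               "Different aspects such as NER ... require bespoke approaches [2].",
               "Positive")

print(cit.sentiment)  # Positive
```

A “citation count” computed over `Citation` objects could then be filtered or weighted by sentiment; one computed over bare `Reference` objects cannot.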


Here’s a similar assessment in botanical systems. A typical extract:

In 2007, Stephen McLaughlin published “Tundra to Tropics: The Floristic Plant Geography of North America” in Sida, Botanical Miscellany. McLaughlin is one of the few authors who included data sources in his work. He stated that “The 245 local floras selected for this study are listed in Appendix A” (p. 3). Although listed in an appendix, they are not included in the bibliography. Instead, his bibliography consists of 28 other publications, mostly books and articles in books, but also articles in Thomson Reuters-monitored journals.


In other words floras (which are citable) may be excluded from “citation analysis” because the authors put them elsewhere in the document. Automated methods of citation analysis cannot pick this up. In some of my own work many references may be in tables.


So a true citation carries sentiment and describes the purpose of the citation. In #jiscopencite (sister to #jiscopenbib) David Shotton and colleagues are developing a citation typing ontology (CiTO).


But, unless you tell me different, the “citation” used in current metrics and in the Science Citation Index and its modern descendants is in fact only a bibliographic reference. It carries no sentiment. Which is why citation counts are skewed by negative sentiment. And some types of common citation (methods or software) can achieve very large counts.


None of this is included in the Journal Impact Factor, which AFAIK simply extracts bibliographic references. A person can get increased “citations” for being criticized. Being controversial may increase your metrics. None of this is surprising, other than to those who think the evaluation of research can be automated.
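The difference sentiment makes to a count can be shown in a few lines. This is a toy sketch with invented data, not any real metric:

```python
# Toy sketch: a naive "citation count" (what reference-counting indexes
# compute) versus a sentiment-aware count that excludes citations whose
# function is to criticise. The data and categories are invented.

citations_of_paper_x = [
    {"citing": "A", "sentiment": "Positive"},
    {"citing": "B", "sentiment": "Neutral"},
    {"citing": "C", "sentiment": "Weakness"},  # a criticism
    {"citing": "D", "sentiment": "Weakness"},  # another criticism
]

naive_count = len(citations_of_paper_x)
sentiment_aware = sum(1 for c in citations_of_paper_x
                      if c["sentiment"] != "Weakness")

print(naive_count, sentiment_aware)  # 4 2
```

The naive count rewards the paper for being criticised twice; the sentiment-aware count does not.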


There are many other reasons why “citations” (bibliographic references) are seriously flawed. I’m not the first to bring these up:

  • Errors in the bibliographic references themselves. One study shows that “an author named ‘I. INTRODUCTION’ has published hundreds of papers. Similarly, according to CiteSeer, the first and third authors of this article have a coauthor named ‘I. Introducción’.” This makes us laugh, but sickly, in that our careers are based on this level of inaccuracy. There are many other problems, such as author identification and disambiguation (ORCID may solve some, but not all, of this). #jiscopencite reveals that many bibliographic references are simply inaccurate. José H. Canós Cerdá, Eduardo Mena Nieto, and Manuel Llavador Campos continue: “Citation analysis needs an in-depth transformation. Current systems have been long criticized due to shortcomings such as lack of coverage of publications and low accuracy of the citation data. Surprisingly, incomplete or incorrect data are used to make important decisions about researchers’ careers. We argue that a new approach based on the collection of citation data at the time the papers are created can overcome current limitations, and we propose a new framework in which the research community is the owner of a Global Citation Registry characterized by high-quality citation data handled automatically. We envision a registry that will be accessible by all the interested parties and will be the source from which the different impact models can be applied.”
  • Lack of transparency in what sources are used, and in what is, and is not, citable – the target of a bibliographic reference. Web pages? Broadcast talks? Our media change; if we wish to use them for evaluation, WE, not unaccountable commercial organizations, should be in control.
  • Lack of sentiment (above). Without knowing why something is cited we cannot attribute motivation and value.


For me the biggest problem is the lack of transparency – if this problem is addressed, the other two follow.

Fourteen years ago, Cameron suggested:


A universal citation database has significant potential to act as a catalyst for reform in scholarly communication by leveling the playing field between alternative forms of scholarly publication. This would happen in two important ways. First, the citation database would ensure that publications in any form are equally visible (but not necessarily equally accessible) to the literature research process. Regardless of which publication venue an author chooses, all that she/he need to do to make her/his work visible is to cite appropriate previous works. Publication venues would then compete on the important values that they bring to the publication process, such as refereeing standards, editorial control, quality of presentation, timeliness of dissemination and so forth. Publications would no longer enjoy an unfair competitive advantage simply by virtue of being indexed in a particular literature database.

The second way that a universal citation database would promote fairer competition among publication venues is by providing a method for evaluating the significance of individual papers independent of the publication venue chosen. University faculty members are often critically concerned with the recognition that their work receives because of its importance to the evaluation of their academic careers. Because the significance of papers is often judged solely by the perceived quality of the venues in which they are published, this encourages a very conservative approach to choice of publication venues. By providing citation data as an independent means of demonstrating the significance of a particular work, a universal citation database has the potential to encourage authors to choose publication venues for other qualities.

And the implication was clear – academia could and should have initiated this. As I have implied elsewhere, academia has sleepwalked past this opportunity and has now generated a mass of unregulated bibliographic reference collections. Their quality and coverage are not transparent so I cannot judge them – other than to say that non-transparency generally has little value.

So – as in so much else – IF we created semantic publications, and IF we made them Open, many of the problems would be solved. We should use semantic bibliography (as we have developed in our Open Bibliography project). We should label that with sentiment to create true citations. These should be semantic and published Openly at time of first publication. This would solve most of our problems of bad data, missing data, control by unaccountable third parties, etc.

But it would require authors to change their habits. To adopt new and better ways of authoring papers. And there is an established industry – including academia – which benefits from the low-quality processes we currently have.

We are constantly told

“Authors will never do that”

And that’s true IF, but only IF, academia doesn’t care.

Let’s try the following:

“scientists will never assess the safety of their reactions. It’s too much trouble”

“scientists will never bother to report experiments on animals. It’s too much trouble”

So IF academia required scholarly publications to have semantic, accurate citations (not just bibliographies) we could solve this in months. The technology is not the problem.

Academics are the problem.

We are in the grip of our own creations, which we cannot control. In this case our Monster of the Scholarly Id is the “citation”. Let’s tame it.







Journal review system: a reviewer’s perspective

Sunday, July 17th, 2011

Quite by chance I have just received an update of a review I did for [a gold open access scientific journal]. I omit all confidential info:

Dear Dr. Murray-Rust,

Thank you for your review of this manuscript.  The Editor has made a decision on this paper and a copy of the decision letter can be found below.
You can also access your review comments and the decision letter by logging onto the Editorial Manager as a Reviewer.

[Dear Author… ]

Before your manuscript can be formally accepted, your files will be checked by the [publisher's] production staff. Once they have completed these checks, they will return your manuscript to you so that you may attend to their requests and make any changes that you feel necessary.

To speed the publication of your paper you should look very closely at the PDF of your manuscript. You should consider this text to have the status of a production proof. Your paper will be tagged and laid out to produce professional PDF and online versions. However, the text you have supplied will be faithfully represented in your published manuscript exactly as you have supplied it.

So as far as the author and reviewer are concerned everything is driven by PDF (confirming Cameron’s experience). PDF is a well-known destroyer of semantic information. This, of course, is common to all publishers. We have allowed them to create this monster and force it on us.

PDF holds back the development of semantically supported science.

The destruction of semantic data: The PLoS community replies

Sunday, July 17th, 2011

I posted yesterday about an article in PLoS ONE where I criticized the author/editor/publisher for destroying semantic data. It has generated 11 replies and you should read them before this post so as to get all points of view.

It turns out it is the PLoS system that carries out this transformation. There are several vigorous defences of PLoS so I will try to be objective.

The first, and IMO fundamental, concern is that this is a system which (however good or bad) is developed by the publisher and thrust on authors, reviewers and readers/re-users. That is true of almost all publishers and it is one of the MOTSI – that we have handed over to publishers the representation of our knowledge. In the print era this might have been acceptable but in the century of the semantic web I find it inexcusable. PLoS is no worse than others, and because it’s Open it exposes its XML (closed publishers, of course, do not).

What I have objected to is that the information submitted by the authors is transformed (in my case without my knowledge or consent) into a dumbed-down version. Here is an example (from our paper on text-mining):

What we submitted (I don’t like submitting images but sometimes it is the only way):

It’s quite readable to young sighted humans. Now here is the “Powerpoint-friendly” version:

Cameron (below) argues that most “readers” will want to display the included material as a slideshow. But this is systematic destruction of the information (by reducing the resolution when it wasn’t necessary).

I’ll take Cameron’s and others’ replies and comment on them – as objectively as I can. BTW I use “PLoS” because we can expect some answers, but the arguments below are generic (they may differ in detail from publisher to publisher).

>>> FWIW the review process is done entirely via PDFs so it is not straightforward to tell what the native format of any part of the paper is for the reviewer. I would agree that this is bad but its consistent with most of the journal systems I’ve worked with. Obviously this makes reviewing for data formats difficult or impossible but what’s probably worse is that it discourages you from asking the question. In some ways the BMC system is better in this respect because the files are there in front of you with big icons saying what the format is.

PMR>> exactly so. The reviewers have to accept what the journal thrusts on them. It’s impossible for them to get close to the data. So this is a publisher-enforced policy that hinders the publication of semantic information. The reviewer sees not what the data *is* but what it looks like. How many reviewers might like to cut and paste (or better) the authors’ data into their own data analysis tool? This is true for most publishers – they send reviewers the PDFs because it’s easier for the publishers. IMO this leads to poorer science because it’s impossible for the reviewers to have access to the data (even if submitted).

>>>However I think I disagree with Peter about the destruction element here. The html version of the paper is explicitly designed in the PLoS system for human reading (admittedly by sighted people). I actually find that floating window and the ability to click through figures very useful and I’d imagine that it makes that process simple if everything, figures and tables, are the same format. Given that the tabular data is available in the XML [PMR I'll address that later], which is where you’d go to dig out data, I don’t think its a question of destruction but of differing priorities.

PMR>> OK, who determines the priorities? Not the authors, although they pay PLoS for the publication. The reviewers? The editors? Or PLoS management?

>>>The person who wants to cut and paste the numbers from the table is going to be annoyed but the person who wants to grab the figure and drop it into a presentation is going to be happy. And I suspect the latter may be the more common re-use case.

PMR>> I am surprised. I would have thought that many readers actually want to have access to the data – in data form – on their machine. I don’t spend my time presenting other people’s published material as parts of slide shows, but maybe I am the exception. Where I do, I would not drag-n-drop an unreadable Powerpoint-friendly table – I would create it in a form where the audience could read the most important bits. Maybe I would have to do some editing and cropping…

>>> The ideal would obviously be to have both, contextually presented depending on what the user (human or machine) wants. PLoS have focussed very hard on making their html rendering attractive to human readers and have as a result pulled into a situation where html downloads are much greater than pdf downloads which I would see as a good thing.

PMR>> I would agree that HTML is far better than PDF and HTML5 is better than HTML and Scholarly HTML should be what we aim for.

>>> The price, with limited resources, is things like this, which are obviously suboptimal.

PMR>> This I fail to see. If you already have tables marked up in XML it’s trivial, yes really trivial, to convert them to HTML tables. It would take me 30 minutes to write a stylesheet to extract the tables and translate them to HTML (they are effectively that already). And putting the links into the HTML shouldn’t be rocket science.
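As a sketch of how trivial that conversion is — using Python’s standard library rather than an XSLT stylesheet, on an invented JATS-like fragment (the real article XML would come from the publisher):

```python
import xml.etree.ElementTree as ET

# Minimal sketch: pull <table> elements out of article XML and serialize
# them as HTML. JATS-style article XML already uses <table>/<tr>/<td>,
# so almost no translation is needed. The fragment below is invented.

article_xml = """
<article>
  <table-wrap id="T6">
    <label>Table 6</label>
    <table>
      <tr><th>Question</th><th>Yes</th><th>No</th></tr>
      <tr><td>Shares data openly</td><td>12</td><td>30</td></tr>
    </table>
  </table-wrap>
</article>
"""

root = ET.fromstring(article_xml)
html_tables = [ET.tostring(t, encoding="unicode")  # already valid HTML markup
               for t in root.iter("table")]
print(html_tables[0])
```

A real stylesheet would also carry across the `<label>` and any footnotes, but the core extraction is a handful of lines.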

>>>What would a system look like that achieved all of these goals – presenting the easily cut and pasted whole for those who wanted it, plus the cut and pasted data for the humans who want that, plus the marked up data for those who want that? At least there’s a DOI for each element so a content negotiation scheme would in principle be possible. It also re-raises the question of standardising the form in which a paper points to its data on an external service such as Dryad – how should that link be made machine discoverable in a general way?

PMR>> exactly. My concern was that by turning semantic tables into images the publisher(s) give the impression they don’t care about data. BMC (to pick another Open Access publisher) does care about data. So should PLoS

Andy Turner>>> It is easy to find the XML for the table in the article XML

PMR>> yes – and I found it. It gives no explanation of what it is, how to use it, whether you need special tools, etc.

AT>>>and it has an XLink so there is perhaps really very little to find issue with about this.

PMR>> The XLink in the XML points to an IMAGE (see the mimetype). And that’s what I take issue with.

AT>>> Perhaps the enhancement wanted is to add buttons for XML (small, medium and large) i.e. XML Table Values Only, XML Table with Metadata and context links, XML for the article. Perhaps also there could be a download package for all this as zip, tar.gz etc…

PMR>> Exactly. This would be a big enhancement. And if PLoS and BMC and EGU and IUCr and… (maybe even some closed access publishers) all used the same approach it would solve the problem. Because which reader or re-user wants a different approach to each publisher?

QUESTION. Yes, I can find the XML and – because I understand XML – I can locate the tables and I can write a stylesheet to extract them. But most people can’t. Is there something I’m overlooking? An open set of tools that everyone except me has access to? Or is it actually cutting and pasting each individual field out of the XML?
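For the record, the cut-and-paste step can be scripted in a few lines — a sketch on an invented table fragment, not a general tool, and assuming the table markup is the plain `<tr>`/`<td>` form found in the article XML:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sketch of the step most readers lack tools for: turning a table in
# article XML into something loadable in a spreadsheet (CSV).
# The XML fragment is invented for illustration.

table_xml = """
<table>
  <tr><th>Question</th><th>Yes</th><th>No</th></tr>
  <tr><td>Shares data openly</td><td>12</td><td>30</td></tr>
</table>
"""

table = ET.fromstring(table_xml)
out = io.StringIO()
writer = csv.writer(out)
for row in table.iter("tr"):
    # each child of <tr> is a <th> or <td> cell
    writer.writerow([(cell.text or "").strip() for cell in row])
print(out.getvalue())
```

But my point stands: a reader should not need to write even this much code to get at data the authors submitted as data in the first place.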




How to share data and how not to

Saturday, July 16th, 2011

I have been pointed to a paper in PLoS ONE on data sharing.

I haven’t read the text but I am afraid I have to comment adversely on the way the data are presented, because it illustrates a fundamental reason why data cannot be shared. This is data from Table 6 in the paper:

This is described as “a Powerpoint-friendly Image”. It’s unreadable to a human (though if you step back it becomes decipherable).

What does it represent? The paper describes a survey carried out by questionnaire (a “survey instrument”) and the results are presented in 29 tables. The entries in the tables are small amounts of text, and numbers. Here’s Table 6:

So this is a table of data. And it is transmitted as a TIFF. And if you want a “powerpoint friendly image” it appears to be a PNG.

In simple terms this completely destroys data.

Now I know some of the people involved – Carol, Cameron Neylon (who edits this, and has a journal on reproducible computing) and the folks at PLoS.

Something has gone terribly wrong here. Maybe in the authoring, maybe in the reviewing, maybe in production.

Open Access by itself solves a bit, but not enough. We also need to make our Open products BETTER than previously. Any of CSV, HTML or even XLS would be possible for the tables. Then the data could be shared.

We have to move towards fully interoperable semantic Open data.

What’s wrong with scholarly publishing? The MOTSI

Saturday, July 16th, 2011

NOTE: You may find my allegory of “Monsters of the Id” irrelevant to scholarly publishing. If so, skip this. But do not doubt that scholarly publishing needs changing – drastically and soon – and that I, at least, am committed to finding ways for that to happen, before it happens outwith our control.

I have used the term “Monsters of the Scholarly Id” (MOTSI) to describe the dysfunctionalities in scholarly publishing created unconsciously by academia, driven by its innate need for self-glorification. This may seem OTT so I’ll ramble through the background and the idea.

I start with my perception, shared by many, that scholarly publishing is increasingly dysfunctional. Obviously not everyone will agree. A CEO of a publishing company which sees revenues increase over the decade by 9% or so is not going to complain. I’ve blogged before on Richard Poynder interviewing Springer’s CEO. Read it – it chills me that this is purely about revenue – not any sense of providing useful goods in response to a market demand. A senior editor of a “successful” closed access journal isn’t going to complain – s/he probably gets paid expenses at least and lots of brownie points. A researcher with lots of citations and H-index karma isn’t going to complain. The 1-in-a-hundred researcher who has got a paper into NatSciCell may be able to get a job on the strength of it.

But many, many feel severe dysfunction. I’ll come to the causes later – they may not be so different from performing arts, or authors of fiction – the system does not allow everyone to succeed. But science is different. If we simply strive for the “excellent” (whatever that is) we neglect the good on which science is built. We have to separate the good from the unacceptable.

At a SciFoo camp about 3 years ago we had a discussion about scientific publishing (this has been a common theme at SciFoo). Two young attendees felt that the situation was so bad they were going to write an article for NatSci, but this never got written. But it’s a common theme on the blogosphere.

So what’s the Id? I grew up in an era when – I think – Freudian theory was almost regarded as proven fact. I believed in the id, ego and superego and I’ll replay them here using Wikipedia’s article on Id, ego and super-ego.

Id, ego and super-ego are the three parts of the psychic apparatus defined in Sigmund Freud’s structural model of the psyche; they are the three theoretical constructs in terms of whose activity and interaction mental life is described. According to this model of the psyche, the id is the set of uncoordinated instinctual trends; the ego is the organised, realistic part; and the super-ego plays the critical and moralising role.[1]


The id comprises the unorganised part of the personality structure that contains the basic drives. The id acts according to the “pleasure principle“, seeking to avoid pain or unpleasure aroused by increases in instinctual tension.[2]

The id is unconscious by definition:

“It is the dark, inaccessible part of our personality, what little we know of it we have learned from our study of the dream-work and of the construction of neurotic symptoms, and most of that is of a negative character and can be described only as a contrast to the ego. We approach the id with analogies: we call it a chaos, a cauldron full of seething excitations… It is filled with energy reaching it from the instincts, but it has no organisation, produces no collective will, but only a striving to bring about the satisfaction of the instinctual needs subject to the observance of the pleasure principle.”[3]

And this article specifically references my inspiration:

  • In the classic 1956 movie Forbidden Planet, the destructive forces at large on the planet Altair IV are finally revealed to be “monsters from the id” — destructive psychological urges unleashed upon the outside world through the operation of the Krells’ “mind-materialisation machine”. The example is of significance because of the unusual degree of insight it demonstrates: the creature eventually revealed follows classical psychoanalytic theory in being literally a dream-like primary process “condensation” of different animal parts. The plaster cast of its footprint, for example, reveals a feline pad combined with an avian claw. As a crew member observes, “Anywhere in the galaxy this is a nightmare”.


So my allegorical approach is to see the dysfunctions of scholarly publishing as arising from the subconscious of academia. The drive to achieve, the drive to be recognised and glorified. The need for gratification. And where uncontrolled, the id triumphs at the cost of rational behaviour.

Ultimately in Forbidden Planet the only solution is to destroy the planet, at a cost of destroying the good that the Krell have bequeathed. I’m not suggesting that we destroy scholarly publishing. But I think it possible that the monsters it has created will, if untackled, lead to catastrophic changes.

There is no super-ego of academia. Indeed it is not clear whether the uncoordinated behaviour of 10,000 institutions can have a super-ego – a controlling intelligence. For me it is tragic that Universities are not collectively addressing their role – in public – and getting feedback. Maybe they do this in closed national sessions with the great-and-the-good of government. Politicians have blogs and tweet. Stephen Fry tweets. Where is the vice-chancellor who reaches out to today’s world? Where, indeed, are the senior academics? There are a few – a very few – and we may meet them at ScienceOnline in September in London. But academia does not care about the common wo/man. It looks inward, not outward. Where is the world leadership? And that is one of the causes of the problems.

So, while academia gazes inwards, the planet needs it more than ever. The sleepwalking has consequences outside scholarly publication. Where is the communal action to address climate change, resistance to disease, ageing, hunger and many other predictable problems? Why shouldn’t universities work together? But they are set up to compete, to generate their own feeling that they are better than their neighbours. And so the MOTSI are rife in scholarly publishing.

What are the MOTSI? I probably haven’t thought of them all and I’m hoping for your input as well. The MOTSI are things that we have created unconsciously. They are not Frankenstein monsters or Mr Hydes which we have deliberately created and been unable to control. Because in those cases the creator is often aware of the dysfunction even if they cannot control it. The MOTSI have emerged during our sleep. Some are, in principle, controllable if we woke to the need to do so. They include, in no particular order:

  • The revenue-oriented publisher (which includes scholarly societies)
  • The “citation” and citation metrics (which I will set as homework)
  • Journal branding and the journal impact factor
  • New Journal SPAM
  • The PDF and the monoculture of publishing technology

I’d value your input on:

“What is a citation?”

This is not trivial. I do not know the answer. But if we are using this as a measure of a person’s worth (and hence their institution) we owe ourselves the responsibility of defining it.