Monthly Archives: September 2011

Access to scientific publications should be a fundamental right

In my last post I reviewed a paper in Nature and gave a précis for the “scholarly poor” – the internet citizens who are not employed by rich universities. (Universities have an arrogant attitude that access to the literature only matters for them, and they spend billions of taxpayers’ money in their libraries just for themselves). One of my commenters, an ex-colleague (@walter blackstock), noted that it would cost 32 USD to read the Nature article (2 pages, rented for 1 day). He’s a scientist – he just doesn’t happen to work in a university – so he has no access to the scientific literature.

The lack of access to scientific literature is now a global shame.

Think of all the people (in all countries, not just the rich west) who cannot read the literature:

  • Patients and patient groups
  • Government and policy makers (e.g. in NGOs)
  • Environmental groups
  • Many practitioners of medicine who do not have access to (say) a national service (e.g. NHSNet)
  • Retired people (many of whom are actively involved in changing the world)
  • Schoolchildren (yes, much of the literature is – or should be – understandable by them)
  • Interested citizens who want to understand the world we live in

And, starkly, people die because of lack of access to scientific information. That sounds sensationalist, but it’s self-evidently true. If a patient group cannot read the medical literature they cannot make informed decisions. The conventional wisdom that patients are incapable of understanding the literature is no longer true in this century. The Open Source Drug Discovery project in India is more concerned with stopping people dying than with a precious h-index from closed-access journals. The countries which see sea levels rising and weather patterns changing have a right to know what the scientific world is doing and saying – and indeed to be a full part of it.

I have blogged at length on this (http://blogs.ch.cam.ac.uk/pmr/2011/07/09/what-is-wrong-with-scientific-publishing-and-can-we-put-it-right-before-it-is-too-late/ and subsequent posts), and comments such as Walter’s have catalysed me to continue.

So I am arguing that Access to scientific publications should be a fundamental right

 

That may sound dramatic so I will put it in context. http://en.wikipedia.org/wiki/Human_rights lists the substantive human rights; among them is the right to water.

Note that this does not mean they are free-of-cost (free-as-in-beer). Water costs money. But it is a fundamental right:

In November 2002, the United Nations Committee on Economic, Social and Cultural Rights issued a non-binding comment affirming that access to water was a human right:

the human right to water is indispensable for leading a life in human dignity. It is a prerequisite for the realization of other human rights.

—United Nations Committee on Economic, Social and Cultural Rights

This principle was reaffirmed at the 3rd and 4th World Water Councils in 2003 and 2006. This marks a departure from the conclusions of the 2nd World Water Forum in The Hague in 2000, which stated that water was a commodity to be bought and sold, not a right.[99] There are calls from many NGOs and politicians to enshrine access to water as a binding human right, and not as a commodity.[100][101] According to the United Nations, nearly 900 million people lack access to clean water and more than 2.6 billion people lack access to basic sanitation. On July 28, 2010, the UN declared water and sanitation as human rights. By declaring safe and clean drinking water and sanitation as a human right, the U.N. General Assembly made a step towards the Millennium Development Goal to ensure environmental sustainability, which in part aims to “halve, by 2015, the proportion of the population without sustainable access to safe drinking water and basic sanitation”.

Note “access to water as a binding human right, and not as a commodity”. We pay for our water in the UK, and I am (generally) happy to do this. The condition is that we must be free of monopolists and cartels, and of indiscriminate and arbitrary sanctions. In the UK water is managed in a capitalist manner, so it is regulated politically through a water regulator (Ofwat). There are flaws in this but the system at least removes the appalling monopoly control imposed by capitalist publishers. Making water a right commits the world to finding a solution, not to building individual riches.

Then there is the Internet.

The United Nations has proposed to make Internet access a human right. This push was made when it called for universal access to basic communication and information services at the UN Administrative Committee on Coordination. In 2003, during the World Summit on the Information Society, another claim for this was made.[1][2]

In some countries such as Estonia,[3] France,[4] Finland,[5] Greece[6] and Spain,[7] Internet access has already been made a human right. This is accomplished by authorizing universal service-contracted providers with the duty to extend a mandatory minimum connection capability to all remaining desiring home users in the country.

This does not mean that access to the internet is free of cost. I pay my broadband willingly enough. But it means that in countries more enlightened than Britain I cannot be cut off by publishers.

Publishers could cut off my broadband??

Three strikes

Main article: Graduated response

In response to copyright infringement using peer-to-peer software, the creative industries, reliant on copyright, advocate what is known as a “graduated response” which sees consumers disconnected after a number of notification letters warning that they are infringing copyright. The content industry has sought to gain the co-operation of internet service providers (ISPs), asking them to provide subscriber information for IP addresses identified by the content industry as engaged in copyright infringement.[9] The proposal for internet service providers to cut off Internet access to a subscriber who had received three warning letters of alleged copyright infringement was initially known as “three strikes”, based on the baseball rule of “three strikes and you’re out“. Because “three strikes” was understood to refer to physical assault,[citation needed] the approach was later termed “graduated response”. Media attention has focused on attempts to implement such an approach in France (see the HADOPI law) and the UK (see the Digital Economy Act 2010), though the approach, or variations of it, has been implemented in a number of other countries, or attempts are made to do so.[10]

Yes – if they think that I have been violating “their” copyright they can – without judicial process – get me cut off. It is they, not a court, who decide whether it’s a violation. Wiley pursued Shelley Batts with legal threats for posting one graph from a scientific article. (Yes Wiley, until you change the rights of access to your material I shall keep dragging up this appalling violation of scientific rights). If she had done it three times they could have had her cut off.

So I am proposing that we should press for

Access to scientific publications should be a fundamental right

 

This makes simple sense. If justification were needed:

  • Scientific publication is a vital tool for improving the quality of life of humankind (or even saving lives).
  • Scientific publication is largely funded by the public purse and non-profit institutions. These institutions wish the widest possible dissemination of their research.
  • Scientific publications are created by scientists and others with no intention of monetary reward. Scientists also wish the widest possible dissemination of their research.
  • The culture of this century expects information to be freely available.

If we are able to adopt this simple principle, then much else follows. It follows that cost of access should be regulated, or subsidised wholly or partly. It follows that there should be no restrictions on the flow or re-use of scientific information (the dissemination cost is near zero).

We have – through decades of academic indifference – landed ourselves with a band of publisher robber-barons and it will be a struggle to escape. But there is much experience of how to do this. There is – in many countries – strong political will. It can and must happen.

 

Revamping the funding system (Ioannidis)

In this week’s Nature (not sure whether the scholarly poor have to pay for this so I précis a bit) there’s a useful review: “More time for research: Fund people not projects”, John P. A. Ioannidis, Nature 477, 529–531 (29 September 2011), doi:10.1038/477529a. It highlights the failures of the current system – huge amounts of effort are expended by applicants and reviewers and there is great dissatisfaction. The current trajectory may make things consistently worse. So Ioannidis gives a table of possibilities. I reproduce bits of it here without permission (I am visiting NPG on Monday so they can throw me in their dungeon if they want):

For each option below I give the pros, the cons, an example, and who would be funded.

Egalitarian (fund everybody). Pros: avoids peer-review biases; gives sufficient amounts to scientists doing low-cost research; small administrative burden. Cons: does not support large research efforts; does not recognize exceptional scientists. Example: some universities fund the salaries of all their faculty. Who would be funded: all.

PMR: This is how I started (1967). It was called the dual-support system. You could rely on normal lab equipment, consumables, travel, and access to technical support. We were very well supported. It was actually so lavish that you were never brought to account – you could do what you wanted. I did a lot of stupid and pointless stuff. But I was also able to build the core of the chemical informatics program that I am still pursuing.

Aleatoric (fund at random). Pros: avoids peer-review biases; small administrative burden. Cons: will not capture all deserving scientists. Example: Foundational Questions Institute. Who would be funded: flexible.

PMR: For small amounts of funding with a large number of applicants this may be useful. For example if I have a 15% chance of getting a summer student funded I’d be happy to go into a lottery. But it doesn’t scale.

Assessment of career. Pros: captures career trajectory; has gold-standard status. Cons: is vulnerable to favouritism; inappropriate for young researchers; is labour-intensive. Example: MacArthur Fellows Program. Who would be funded: few elite scientists (or else administratively burdensome).

PMR: I was awarded a CIBA-GEIGY fellowship for a sabbatical year with Jack Dunitz when I had little formal track record (DPhil + 6 years in post). It changed my life. It wasn’t really assessment of career as much as assessment of promise. I hope I have fulfilled some of that.

Automated impact indices. Pros: eliminates favouritism; evaluates many applicants with ease; approaches objectivity. Cons: there are many indices, all with flaws, and no consensus about the best one to use; indices can be gamed; databases have shortcomings (such as imperfect citation coverage, entry errors and name-disambiguation problems). Example: UK Research Excellence Framework. Who would be funded: flexible.

PMR: The only attraction of complete automation is efficient bureaucracy (and we know how often automation fails). It has the same objectivity as an income tax form. Formally correct but utterly depressing. It favours regression to the mean. There is no creativity in funding.

Scientific citizenship. Pros: may improve science, if good practices are rewarded and bad ones penalized. Cons: automation is not yet possible for data gathering, and is difficult for some citizenship practices; has peer-review biases. Example: financial incentives to peer reviewers. Who would be funded: could be extended to many scientists, but only for aspects that can be automated.

PMR: quoting without permission:

Funding systems could reward good scientific citizenship practices, such as data sharing4, high-quality methods, careful study design and meticulous reporting of scientific work5. Openness to collaboration, non-selective publication of ‘negative’ findings, balanced discussion of limitations in articles and high-quality contributions to peer-review, mentoring, blogging or database curation could also be encouraged. Researchers might be rewarded for publishing reproducible data, protocols and algorithms. However, some citizenship practices are difficult to capture in automated databases, so would be subject to the disadvantages of peer assessment.

I am not sure “Scientific citizenship” is the best term – this is as much about multivariate indicators of value and esteem (i.e. going beyond the mindless “how many papers and how often cited”). It can be gamed (probably more easily than citations). However it’s obviously something that must be pursued, and rapidly, though it overlaps with several other approaches.

Projects with broad goals. Pros: proposals are easy to write and review; formulating work can be flexible; permits targeted innovation. Cons: does not eliminate project proposals; is vulnerable to favouritism; holds potential for exaggerated promises and claims. Examples: NIH Director’s Pioneer Awards; Howard Hughes Medical Institute. Who would be funded: few elite scientists.

PMR: I think there are two axes here – breadth / freedom, and patronage. I’ve had patronage and it can be extremely valuable (though often not meritocratic). Patronage often devolves to selected institutions, e.g. in the eScience program Cambridge was awarded a Centre. This allowed people in Cambridge to apply selectively for funds and I happened to be in a position to get 6 PDRA years fairly easily. Similarly there are often targeted programs designed around a group of institutions – if you happen to be in them it’s much easier to get funding. Perhaps the most specific was the Cambridge-MIT Institute (http://www.cmi.cam.ac.uk ) which “helped develop DSpace – a groundbreaking future-proof digital archive” (their words, not mine). I didn’t directly get funding but it helped to get JISC funding. And in eScience there were several projects where I could become a Co-I – eMinerals, materialsGrid.

 

But my biggest patron was Unilever and to them I am very grateful. They provide(d) infrastructure, project funding and a lot of freedom. Patronage is not egalitarian, can be abused, and can be misdirected. Similarly I regard Microsoft as having elements of a patron. The long association through eScience makes it much easier to agree new mutually beneficial projects.

 

There are other models. I think cooperatives/collaborations (e.g. Mat Todd’s Open Drug Discovery and Open Source Drug Discovery in India) are valuable new models and can mix public and private funding. And while not everyone can receive money, there is greater opportunity for spreading the net and reaching out to citizen science. Similarly some of the funding that the OKF receives allows for considerable freedom and I applaud this – even more because this clearly supports a common good rather than fostering a single person’s career. It’s probably peripheral to mainstream research but it’s often still research.

 


 

Comments on the Bourne plot and other EPSRC matters

For whatever reason I get very few comments on this blog (note: I try to be courteous to posters). Comments normally come when I upset people and overstep a line (“turning up the outrage knob”). I don’t generally do this consciously. Anyway I have got two useful comments on my criticism of the EPSRC Bourne plot (http://blogs.ch.cam.ac.uk/pmr/2011/09/28/epsrc-will-eventually-have-to-disclose-where-the-bourne-incredulity-comes-from-so-please-do-it-now/ ).

I’ll post the comments and reply:

Jeremy Bentham says: September 28, 2011 at 9:17 am 

We need to raise the quality of this debate – I’m not saying EPSRC is right but it doesn’t help by distorting what they are actually saying. Let’s make a bit more effort to understand what EPSRC have done, why they have done it, and start arguing on a basis of accurate information. EPSRC have actually been very open about why they are doing this, what the process has been, and what inputs they used to make strategic decisions.

Bourne’s team have actually published quite a detailed explanation on how they made *informed judgements* about the physical sciences portfolio which is represented by the QI plot (just because a diagram isn’t quantifiable doesn’t mean it isn’t representative of something)

Read this and please blog again http://tiny.cc/f66xw

PMR: Thanks Jeremy. I agree that my challenge included discourse that I would not use in a measured response. It may trivialize some issues but it also emphasizes the real sense of injustice and arbitrariness felt by many (not just me). It’s a political, not an analytical, post and perhaps more suited to Prime Minister’s Question Time. I have read many of the chemistry documents you list and will comment. I cannot presently see how those documents translate into the Bourne diagram.

Walter Blackstock says: September 28, 2011 at 12:44 pm 

You may be interested in Prof Timothy Gowers’s deconstruction of the EPSRC announcement. http://gowers.wordpress.com/2011/07/26/a-message-from-our-sponsors/

PMR: This is an exceptionally powerful piece. http://en.wikipedia.org/wiki/Timothy_Gowers is not only a mathematics superstar but also a star of the blogosphere. He created and catalysed the Polymath project, where a virtual group of mathematicians proved an extremely challenging unsolved conjecture. He is rightly hailed as a prophet of new web-based scholarship and I try to emulate the approach in activities such as Quixote. The piece is deliberately understated and he uses the EPSRC language itself as a weapon against their mathematics policy. (In simple terms, the policy is that only certain areas of maths are worth funding.)

PMR Comments on JB and EPSRC chemistry.

I will reread the documents and see what I may have missed. If there is a clear explanation of how Bourne came to create the diagram then I (and many others) have missed it and would welcome it. Let me take:

http://www.epsrc.ac.uk/SiteCollectionDocuments/Publications/reports/ChemistryIR2009.pdf

in which an invited group of top international chemists analysed the state of UK chemistry. I’m going to take two areas that I know a bit about, chemoinformatics and polymers (specifically polymer informatics, as we were sponsored by Unilever for a 4-year project). Both of these are represented by Bourne as the lowest quality of chemistry (presumably in the UK?). I am naturally upset at being represented as working in areas of very low quality – I believe it is the scientist and the actual science performed that should be judged, not the discipline. (There are limits to this, such as homeopathy and creationism, but all the sciences in Bourne are mainstream hypothesis- or data-driven science). Let’s remember that Einstein worked in a patent office and van’t Hoff in a veterinary school (for which he was lambasted by Kolbe).

There is no mention of chemoinformatics in R2009. If Bourne took that as an indication of quality that’s completely unjustified. Here’s polymers (my emphasis):

P17: There are examples of excellence that can be readily identified in synthesis, catalysis including biocatalysis, biological chemistry (specifically bioanalytical), supramolecular chemistry, polymers and colloids and gas-phase spectroscopy and dynamics.

P21: Polymer Science and Engineering

Following world-wide trends, the UK polymer chemistry community has made some important and excellent choices. In a coordinated action a few universities have set-up excellent programmes covering the entire chain-of-knowledge in polymer science and engineering. In this way, the UK is ensured a leading position in this important field for both scientific and industrial needs. There is strong evidence of growth and impact in polymer synthesis in the UK Chemistry community. There are several centres that have been created and whose members have established international reputations. The polymer synthesis community has placed the field on the map in the UK. There is interest in controlled polymer synthesis, water-processible and functional polymers. A very strong activity that has international recognition is the synthesis and processing of organic electronic materials where the UK is pre-eminent. Applications include polymer LEDs, field effect transistors, organic photovoltaics and sensors.

This bears no relation to the low quality assigned to “Structural Polymers and technology”.

Bourne labels chemoinformatics as the lowest quality of all: “Refocus – funding targeted towards identified priorities”. A simple reading of this is “you (==me) have been doing rubbish work and we are going to reform you”. Like a failing school.

So if this is going to happen, at least I need to know the reason for it. Who made the decision, and on what evidence? If the evidence is there (even that a group of the great-and-the-good made the plot), I’ll comment objectively. Scientists are used to being told their work is poor or misguided – it is one of the strengths of science.

But so far the community as a whole seems to be finding it hard to get hard evidence.

EPSRC will eventually have to disclose where the Bourne Incredulity comes from, so please do it now

 

I blogged yesterday (http://blogs.ch.cam.ac.uk/pmr/2011/09/27/the-clarke-bourne-discontinuity/ ) about Paul Clarke’s exposure (http://shear-lunacy.blogspot.com/2011/09/bourne-incredulity.html ) of the EPSRC “decision making” process, which appears to use a two-dimensional plot of “quality” vs “importance”. It has now seriously upset me. It needs the spotlight of Openness turned upon it.

Why is it so upsetting?

  • It appears to be used as an automatic method of assessing the fundability of research proposals. This in itself is highly worrying. Everyone agrees that assessing research is hard and inexact. Great scientists and scientific ideas are often missed, especially when they challenge orthodoxy. Bad or (more commonly) mediocre science is funded because it’s in fashion. “Safe” is often more attractive than “exciting”. The only acceptable way is to ask other scientists (“peers”) to assess scientific proposals, which we do without payment as we know it is the bedrock of acceptable practice. There is not, and never will be (until humans are displaced by machines) an automatic way of assessing science. Indeed when a publisher rejects or accepts papers without putting them out to peer review most scientists are seriously aggrieved. For a Research Council (which allocates public money) it is completely unacceptable to (not) allocate resources without peer review.

     

    And “peer-review” means by practising scientists, not office apparatchiks.

  • It appears to be completely unscientific. The use of unmeasurable terms such as “importance” and “quality” is unacceptable for a scientific organization. We live with the use of “healthy”, “radiant”, “volume” (of hair), as marketing terms – at best an expression of product satisfaction – at worst dangerously misleading. But “quality”?

     

    It might be possible to use “quality” as a shorthand for some metric. Perhaps the average impact factor (arggh!) of a paper in that discipline. Or the average score given by reviewers in previous grant proposals (urrgh!). Those would indicate the current metric madness, but at least they can be made Open and reproducible. “Importance”? The gross product of industries using the discipline? The number of times the term occurs in popular science magazines? Awful, but at least reproducible.

     

    But this graph, which has no metric axes and no zero point (this is the ENGINEERING and SCIENCE Research Council), breaks high-school criteria of acceptability. Suppose the scales run from 99 to 100? Then all subjects are ultra-important and all subjects are top quality. Any reviewer would trash this immediately.
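To make the axis objection concrete, here is a toy illustration (the numbers are invented and have nothing to do with the real plot): the same set of scores looks dramatically different, or essentially identical, depending on whether the axis is truncated or runs from zero.

    # Toy illustration of why a plot with no zero point misleads.
    # The "quality" numbers below are invented purely for this demonstration.
    import matplotlib.pyplot as plt

    disciplines = ["A", "B", "C", "D"]
    quality = [99.2, 99.5, 99.7, 99.9]          # hypothetical scores
    positions = range(len(disciplines))

    fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))
    for ax, limits, title in [(ax_truncated, (99, 100), "Axis runs 99-100"),
                              (ax_full, (0, 100), "Axis runs 0-100")]:
        ax.bar(positions, quality)
        ax.set_xticks(list(positions))
        ax.set_xticklabels(disciplines)
        ax.set_ylim(*limits)                    # truncated vs full scale
        ax.set_title(title)
        ax.set_ylabel("'quality' (arbitrary units)")

    plt.tight_layout()
    plt.savefig("axis_illustration.png")        # identical data, very different stories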

I intend to pursue this – in concert with Paul and Prof. Tony Barrett from Imperial College. We have to expose the pseudoscience behind this. (Actually it would make an excellent article for Ben Goldacre’s “Bad Science” column in the Guardian). We first have to know the facts.

Paul and Tony have used FOI to get information from the EPSRC. They haven’t yet got any. I have suggested they should use http://whatdotheyknow.com as it records all requests and correspondence.

The EPSRC cannot hide. They are a government body and required to give information. If they try a Sir-Humphrey approach (“muddle and fluff”) we go back with specific questions. And if they fluff those then it can be taken to the Information Commissioner.

The information will definitely come out.

So when it does, what are the bets?

  • 1-2: they can’t trace the origin, mumble and fluff
  • 1-1: created by someone in the EPSRC office (“from all relevant data”)
  • 2-1: copied from some commissioned report (probably private) from “consultants”
  • 2-1: a selected panel of experts
  • 3-1: created from figures dredged up on industry performance and reviewers (see above)
  • 3-1: copied from some other funding body (e.g. NSF)
  • 3-1: the field (i.e. some other origin – a dream, throwing darts)

EPSRC: the origins of this monstrosity will be revealed anyway, so it will be best for you to make a clean breast of it now. You are not the most popular research council – the RSC, the Royal Society and several others have written to ask you to think again about arbitrary and harmful funding decisions. The longer this goes on, the less credibility you will have.

Europe says (scientific) data should be Open. Neelie gets it! Do publishers?

Even in the burgeoning era of Openness it’s great to see a strong stance from many political leaders. Here’s European Commission Vice-President Neelie Kroes, Brussels, 22 September 2011, as reported by Ton Zijlstra: http://epsiplatform.eu/news/news/ec_vp_kroes_on_standardization_and_open_data (I have edited savagely)

Speaking at the Open Forum Europe Summit 2011 yesterday in Brussels, European Commission Vice-President Neelie Kroes delivered a speech titled “From Common Standards to Open Data“.

In her speech she presented the progress made towards a new legal framework for European Standardisation and interoperability. Noting that “standards are indispensable for openness, freedom and choice“, she said that in the coming years the focus of work will be to make the proposed legal framework become law as soon as possible. Furthermore, the new European Interoperability Framework has been created, which will help create more and better interoperability, such as for cross-border public services in Europe.

We are going to take action: we are going to open up Europe’s public sector

I am convinced that the potential to re-use public data is significantly untapped. Such data is a resource, a new and valuable raw material. If we became able to mine it, that would be an investment which would pay off with benefits for all of us. (PMR emphasis) …

Third, benefits for science. Because research in genomics, pharmacology or the fight against cancer increasingly depends on the availability and sophisticated analysis of large data sets. Sharing such data means researchers can collaborate, compare, and creatively explore whole new realms. We cannot afford for access to scientific knowledge to become a luxury, and the results of publicly funded research in particular should be spread as widely as possible.

PMR: to hear a politician urging this is enormously encouraging

And, perhaps most importantly, benefits for democracy because it enhances transparency, accessibility and accountability. After all, what could be more natural than public authorities who have collected information on behalf of citizens using their tax money giving it back to those same citizens. New professionals such as data journalists are our allies in explaining what we do.”

PMR: lots of exciting stuff on Open Data omitted…

We’ll also be looking at charging regimes because expensive data isn’t “open data”. In short, getting out the data under reasonable conditions should be a routine part of the business of public administrations.

We are planning two data portals to give simple and systematic access to public data at European level. First we should have, by next spring, a portal to the Commission’s own data resources. And second, for 2013, I am working on a wider, pan-European data portal, eventually giving access to re-usable public sector information from across the EU, federating and developing existing national and regional data portals.”

PMR: I’ve missed a good deal out. Read it. The message is:

  • Politicians care about open data
  • Politicians care about open science data
  • Politicians aren’t likely to be very happy with people who try to keep scientific data closed

So things are looking up. Open Data has arrived. I’m hoping that scientific publishers will realise that the future has arrived. Because if not, they will find they don’t have many allies.

The Clarke-Bourne discontinuity

I have (regrettably) only just discovered Paul Clarke’s blog http://shear-lunacy.blogspot.com/2011/09/bourne-incredulity.html . Paul is a synthetic chemist and over this year has challenged science funding policy, especially that of the UK’s EPSRC. He has mobilised opinion by collecting signatories to letters to EPSRC and getting the Royal Society of Chemistry to take this seriously (which they now do). It’s a great example of the power of the blogosphere. Continual, concerted action by individuals has a massive effect.

Read Paul’s posts this year (rather than a summary from me). Essentially the EPSRC has abandoned traditional methods of funding (PhD studentships and responsive mode (funding excellent ideas from anyone)) in favour of a metric mumbo-jumbo. EPSRC has decided that it will fund high quality, high importance subjects in “sponsorship” mode. This means, effectively, that they have set themselves up as kingmakers and supporters of the Holy Roman Empire of the chosen, rather than understanding that science comes from everywhere and leads anywhere and the skill is in letting this happen. In times of reduced funding everyone will suffer, but this model selects a few privileged groups and showers the limited wealth on them. The model is defensible within a commercial company aiming for particular markets – it does not work for public science.

Unfortunately it leads to the rule of numbers, and here is the EPSRC graph (“Bourne graph”) of desirability (copied from Paul’s blog). (Since it appears to be a work of fiction it is formally copyrightable but I’ll take that risk).

 

It has two axes and until the algorithm for coordinates is revealed they remain completely subjective. (There can be valid reasons for using bibliometry to XY-plot disciplines and these include interdisciplinary links as in the MESUR and Eigenfactor projects. But these are descriptive of information flow, not value). Apparently FOI has been invoked to try to get clarity without success. I am not sure it’s on WhatDoTheyKnow and I’ll suggest that.

To play the game, my own discipline of chemoinformatics scores high on importance and ultra-low on quality. OK, I have been critical of chemoinformatics but I doubt that EPSRC is listening to me. Who decided this? And how? There is no evidence that these indicators have any basis. Are they UK or worldwide?

And polymers – another area we have worked in. Low quality (what an arrogant statement to make without justification) and low importance (why did Unilever fund us strategically?).

And although synthetic organic chemistry is of high importance it is being systematically underfunded by EPSRC. The discipline that allows people to create and modify molecules addressing disease, materials, agrochemistry…

The subject which EPSRC itself selected as a Grand Challenge in Dial-A-Molecule.

I have no objection to graphs of this sort being dreamed up in awayday workshops. The problem is that they become hardcoded into policy tools, so that certain keywords guarantee that you won’t get funded.

Anyway great kudos to Paul for leading this activity. If I had known I would have signed his letters. He is in a “typical” chemistry post in that extradisciplinary work such as blogging is usually a negative factor. Campaigning takes away time from writing grants and doing research. Challenging the CEO of EPSRC probably doesn’t do you any good in getting funding.

Am. Chem. Soc. Skolnik Award 2012 (Henry Rzepa + PMR) and some future thoughts

Henry Rzepa and I have been honoured by the Herman Skolnik award of the Chemical Information Division (CINF) of the American Chemical Society. Details (from Phil McHale) can be found here:

http://www.ccl.net/cgi-bin/ccl/message-new?2011+09+26+014

Other than – naturally – feeling pleased for ourselves there are a number of points:

  • It is the first joint award that has been made in the history of the award. Our mutual support has been essential and kept us going – I have often referred to us as a “symbiote”. Joint awards are common in many disciplines and represent the need for cooperation and collaboration. Chemical informatics must, we believe, be a cooperative approach and there are signs that this is starting to happen.
  • It gives formal legitimacy to our drive for an Open semantic interoperable chemical information infrastructure. While we have never doubted that what we have designed and created is part of the present and future of chemical (and other scientific) information it is often a lonely journey. Sometimes it is possible to think that the time has passed and “it will never happen”. This award gives us great confidence to take ideas forward (and these are ideas shared by many others). It also – possibly – gives us greater believability when promoting particular actions or designs.
  • It pays tribute to the many others who have helped the ideas flourish. This is not a ritual Hollywood acceptance speech – Henry and I have been privileged to be part of a much wider community (Blue Obelisk, Quixote) which has been essential in allowing the ideas (not “our” ideas) to flourish, and even more importantly be implemented. CML owes a considerable amount of its credibility to the code written in Blue Obelisk and Quixote, and this has been done by scientists with free will (i.e. they have bought into the ideas and given their time). So I see Henry and myself as catalysts – we have helped to coordinate and focus the latent potential of the last two decades. Because we are catalysts we do not rely on the need for massive resources, and when people respond it is their work, not ours.
  • I am also very conscious of my very strong support from a number of people/organizations, especially in the Unilever Centre, Churchill College, Unilever and Microsoft Research. These have given me the resources to develop the tools and designs over the last 5 years, and without them it would have been very difficult to keep going and creating tangible information components. This means that our current trajectory is strong. I am continually picking up indications of the work being adopted by scientists and within organisations.

Last month I celebrated my 70th birthday. Unfortunately birthdays are part of the pseudo-scientific numerical metrics – anyone over 67 in this country finds it very difficult to get grants (and I am grateful to some funders, including JISC, for finding age irrelevant). It means (not necessarily a bad thing) that I can generally not act as a Principal Investigator. But, since our work over the last 20 years has relied on persuasion rather than coercion, nothing really changes. So many of my current efforts will continue, but perhaps in a different management setting, and perhaps less obviously coordinated.

My role now is that of a scientist working through joint opportunities, perhaps even patronage (a common enough method two centuries ago). I have several ongoing activities and opportunities.

  • I will continue to have a place in the Unilever Centre (including writing up more work, and collaborating on computational metabolism)
  • I have a visiting position at the European Bioinformatics Inst. with Christoph Steinbeck working in his group (ChEBI) – and extensions to metabolism and NMR spectra.
  • I am continuing to take the Quixote idea forward and visiting Pacific Northwest National Laboratory (PNNL) next month to collaborate on semantics and a glossary for their Open Source NWChem program.
  • We will continue to support OSCAR4 and OPSIN – and now is the time to broaden the community. Both programs are now cutting edge and competitive with closed source offerings. But we need your involvement
  • I hope to spend several months next year in Australia (more later, hopefully) – developing ontologies and semantics with Nico Adams.
  • And we have just submitted a proposal to JISC for Open Bibliography 2 and should hear in a few days.
  • And a whole host of new things in the Open Knowledge Foundation.

I personally believe that awards should look to the future as well as the past and I hope this gives some indication that the Skolnik award gives new impetus and hopefully new collaborators.

There are some special facets relating to the American Chemical Society, who have honoured us. Readers of this blog may get the impression that I am “anti-ACS”. I am not (or I would not be a member and would not be accepting this award). I owe my involvement in chemical informatics to superb workshops and publications run by the ACS in the 1970s. They convinced me to go into informatics in the pharma industry. Their consistent support of informatics over at least four decades (and much longer if we count Chemical Abstracts) has been outstanding in science. But change is needed.

I believe that much of the future depends on “trusted” organizations and among these are universities, government and their organs, museums, libraries, scholarly societies/ scientific unions. None are perfect, but they are the best we have (“Indeed, it has been said that democracy is the worst form of government except all those other forms that have been tried from time to time.” http://en.wikiquote.org/wiki/Winston_Churchill). So with the learned societies. They are democratic in structure and they are the best we have (unfettered capitalism, as practised by some scholarly publishers, is not an attractive self-regulating alternative). Part of my role, I believe, is to help show ways forward for some aspects of learned societies. I believe that some of the opinions in this blog have influenced scholarly organizations.

The problem with large learned societies is that they can become corporate-like, where decisions are taken for purely market reasons rather than for the good of the society and its role. The ACS is not unique in this – it seems to happen in many disciplines (including bibliography). There is a size beyond which an organization is in danger of losing its roots and values. And the larger an organization, the slower it is to react to anything. We thus have tensions along several axes:

  • Size.
  • Publishing and competition. As I have blogged, I think scholarly publishing is broken. It fosters local undemocratic monopolies. It is one of the weirdest markets there is – there are no conventional market forces and no elasticity. The customers have agreed to work in a system that severely disadvantages them. The problem is that modern scholarly publishing tends to corrupt, including corrupting scholarly organizations.
  • Commercial interests. I am not an anti-capitalist, but capitalism per se does not promote the ethics that I would like to see in scholarship. Chemistry has a very large commercial base, and this cannot overrule scholarly interests.
  • Web democracy. The web has opened new ideas of democracy and participation. These will not disappear. Yes, I am a web-idealist and I have seen ideals slip away over two decades but we are currently fighting for our democratic information future. The scholarly societies must recognize and build on this or they will continue to be riven by this tension.

Things are happening in at least the last three of these that offer hope. We are seeing pharma companies raising the pre-competitive level, and the ideas of Open Source Drug Discovery entering a commercial setting. Could we hope for “Open Source” scholarly publishing?

Henry and I have the opportunity to run the Skolnik symposium on Aug 21st 2012 in Philadelphia. We want the form and content of this to reflect the work that we and others have done. It’s also a chance to get new ideas displayed and promoted. This is not a travelling award with a lecture but if people are interested in my visiting them I have a somewhat flexible diary.

Open APIs: My attempts to be Openly Open

Having argued that we need to define Open APIs better I’ll share my experiences as a contributor to Open Knowledge and the challenges that this poses. The first message:

Openness takes effort

Creating copyrighted works costs no effort. Every keystroke, every digital photo is protected by copyright. Whereas to make something Open you have to specifically add permission to use it. This is really tedious. It’s particularly difficult when the tools used have no natural support for adding permissions. So

    Most of the time I fail to add explicit permissions

That’s a fact. I am not proud of it. But, to give an example, I have just discovered that almost all my material in Cambridge DSpace does not provide Open permissions. That was never my intention. But the tools don’t allow me to click a single button and change it. I have to add rights to every single document (I have ca. 90). Meanwhile the automatic system continues to pronounce “Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.” This may be true, but it isn’t the sign of a community trying to make material Open.

I have spent the last 20 minutes trying to find out how to add permissions to my material. It’s impossible. (DSpace is one of the worst systems I have encountered). So, unless you get everything right in DSpace at time of submission you cannot edit the permissions. And, of course, by default the permissions are NO, NO, NO. I should add that some of the material has been uploaded by my colleagues. So next rule:

    Unless EVERYONE involved in the system understands and operates Openness it won’t get added.

So all my intended Openness has been wasted. No-one can re-use my material because we cannot find out how to do it.

Now take this blog. I started without a licence. Then I added an NC one. Or rather I asked the blogmeister to add it (I was not in control of the Blog). And that’s the point. Very often you have to ask someone else to actually add the Openness. Then I was convinced of the error of NC and changed to CC-BY.

However in late 2010 the sysadmins changed the way the blog was served. This caused many problems, but among others it destroyed the CC-BY notice and licence. So:

    Sometimes the systems will destroy the Openness.

So I would have to put in a ticket to have the CC-BY licence restored. All these little things add up to a lot of hassle. Just to stay where we are.

So to summarise so far. (Remember I WANT everything to be OKD-Open unless indicated).

  • Blog. Not formally Open. Needs a licence added to the site and to each post?
  • DSpace. Almost nothing formally Open. Unable to change that personally. Would have to spend time with the repositarians.
  • Posts to lists. I *think* the OKF lists have a blanket Openness. But I’m not sure.
  • Tweets. No. And I wouldn’t know how to make them Open.
  • Papers. When we publish in BMC or PLoS they are required to be Open and obviously are. Pubs in IUCr are “Green” because we publish on CIF and this is free.

Now some better news. All our software is in Open Source Repositories.

  • Software/Sourceforge. Required to be Open. Licence in (some of) the source code. Probably LICENSE.txt indicating Artistic Licence.
  • Bitbucket. Ditto.

     

    Open Source software Openness is fairly trivial to assert.

Services.

  • OPSIN (http://opsin.ch.cam.ac.uk/ ). This is a free service (intended to be OKD-Open, but not labelled) which converts chemical names to structures. Software is Open Source (Artistic). Data comes from the user. Output should be labelled as Open (but isn’t); that would require Daniel to add licence info to the website. The primary software (OPSIN) is Open Source. However I expect the webserver is the typical local configuration of Java/Python/Freemarker etc. and doesn’t easily transport. So what do we have to do about glueware? If it doesn’t port, is Openness relevant? (A sketch of calling the service follows below.)
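To make the service side concrete, here is a minimal sketch of how a client might call OPSIN over HTTP. The URL pattern and output formats shown are my assumptions about the public service, not its documented interface; check the OPSIN site before relying on them.

    # Minimal sketch of a client for the OPSIN name-to-structure service.
    # ASSUMPTION: the service accepts URLs of the form
    #   http://opsin.ch.cam.ac.uk/opsin/<chemical name>.<format>
    # (e.g. "smi" for SMILES, "cml" for CML); consult the OPSIN site for
    # the real interface.
    import urllib.parse
    import urllib.request

    OPSIN_BASE = "http://opsin.ch.cam.ac.uk/opsin/"   # assumed endpoint

    def name_to_structure(name, fmt="smi"):
        """Convert a chemical name to a structure string via the OPSIN service."""
        url = OPSIN_BASE + urllib.parse.quote(name) + "." + fmt
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8").strip()

    if __name__ == "__main__":
        # a trivial name; the output would be the corresponding SMILES string
        print(name_to_structure("acetic acid"))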

Services are naturally non-Open by default.

Glueware is a problem.

Data.

  • Crystaleye. 250,000 crystal structures (http://wwmm.ch.cam.ac.uk/crystaleye ). We worked hard on this. We have added “All data on this site is licensed under PDDL and all Open Data buttons point implicitly to the PDDL licence.” on the top page, “All of the data generated is ‘Open’, and is signified by the appearance of the icon on all pages.” at the bottom, and the Open Data button on each page. And yet, according to CKAN, it’s still not Open because it cannot be downloaded in bulk. (Actually it can, and Jim Downing wrote a downloader. This was non-trivial effort, and I’ll come on to this later; a hypothetical sketch of such a downloader follows below). So this is our best attempt (other than software) at Openness.

    Even when the data items are OKD-Open, the site is not necessarily so.
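Purely as an illustration of what “downloadable in bulk” involves (this is emphatically not Jim Downing’s downloader, and the entry page, link structure and file extensions are hypothetical), a crawler might walk the HTML pages and save every linked data file:

    # Hypothetical sketch of a bulk downloader for an Open crystallography site.
    # The entry URL, page structure and file extensions are illustrative only.
    import os
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect href attributes from anchor tags on one page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    if key == "href" and value:
                        self.links.append(value)

    def fetch(url):
        with urllib.request.urlopen(url) as response:
            return response.read()

    def harvest(start_url, out_dir="crystal_dump", extensions=(".cif", ".cml")):
        """Download every data file linked from start_url (one level deep)."""
        os.makedirs(out_dir, exist_ok=True)
        collector = LinkCollector()
        collector.feed(fetch(start_url).decode("utf-8", errors="replace"))
        for href in collector.links:
            if href.lower().endswith(extensions):
                file_url = urllib.parse.urljoin(start_url, href)
                target = os.path.join(out_dir, os.path.basename(href))
                with open(target, "wb") as handle:
                    handle.write(fetch(file_url))

    # harvest("http://wwmm.ch.cam.ac.uk/crystaleye/")   # hypothetical entry page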

So here we are, trying to be Open, finding it a big effort, and failing on several counts. I bet that is generally true for others, especially if they didn’t plan it from the start. So

Openness has to be planned from the start as part of the service.

(The trouble is that much of what we have done wasn’t formally planned! Much is successful experimentation over several years).

 

Open APIs: fundamentals and the cases of KEGG and Wikipedia

It is now urgent and vital that we define what an “Open API” is. The phrase is widely used, usually without any indication of what it offers and what restrictions – if any – it imposes. This blog is a first pass – I don’t expect to get everything “right” and I hope we have comments that evolve towards something generally workable. Among other things we shall need:

  • An agreement that this matters and that we must strive for OKD-open
  • Tools to help us manage it
  • A period of constructive development in trying to create fully Open APIs and a realisation of the problems and costs involved

I shall also list some additional criteria that I think are important or critical

Firstly the word “Open” (capitalised as such) is intended to convey the letter and the spirit of the Open Definition:

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

This is a necessary but not sufficient condition for an “Open API”. What is the “API” bit?

It stands for “Application Programming Interface” (http://en.wikipedia.org/wiki/Application_programming_interface ). In the current context it means a place (usually a website) where specific pieces of data can be obtained on request. It is often called a “service” and hence comes under the Open Service Definition.

“A service is open if its source code is Free/Open Source Software and non-personal data is open as in the Open Knowledge Definition (OKD).”

This is necessary, but not sufficient for what we now need. The rationale for the addition of F/OSS software is explained in
http://www.opendefinition.org/software-service/

The Open Software Service Definition defines ‘open’ in relation to online (software) services.

An online service, also known under the title of Software as a Service (SaaS), is a service provided by a software application running online and making its facilities available to users over the Internet via an interface (be that HTML presented by a web-browser such as Firefox, via a web-API or by any other means).

PMR: generally agreed. This can cover databases, repositories and other services. I shall try to illustrate

With an online-service, in contrast to a traditional software application, users no longer need to ‘possess’ (own or license) the software to use it. Instead they can simply interact via a standard client (such as web-browser) and pay, where they do pay, for use of the ‘service’ rather than for ‘owning’ (or licensing) the application itself.

PMR: I don’t fully understand this. I think there has to be an option for gratis access, else how does the system qualify as Open? But we do have to consider costs.

The Definition

An open software service is one:

  1. Whose data is open as defined by the Open Knowledge Definition with the exception that where the data is personal in nature the data need only be made available to the user (i.e. the owner of that account).
  2. Whose source code is:
    1. Free/Open Source Software (that is available under a license in the OSI or FSF approved list — see note 3).
    2. Made available to the users of the service.

I shall revisit “whose data” later and particularly the need to add a phrase such as “and is made available”

Notes

  1. The Open Knowledge Definition requires technological openness. Thus, for example, the data shouldn’t be restricted by technological means such as access control and should be available in an open format.

PMR: Agreed. It may also mean that you do not need to buy/licence proprietary tools to access the data. Is a PDF document Open? The software required to READ it is usually closed. An additional concern here is the use of DRM (Digital Rights Management).

  2. The OKD also requires that data should be accessible in some machine automatable manner (e.g. through a standardized open API or via download from a standard specified location).

PMR: This is critical. I read this as “ALL the data”.

  3. The OSI approved list is available at: http://www.opensource.org/licenses/ and the FSF list is at: http://www.gnu.org/philosophy/license-list.html
  4. For an online-service simply using an F/OSS licence is insufficient since the fact that users only interact with the service and never obtain the software renders many traditional F/OSS licences inoperative. Hence the need for the second requirement that the source code is made publicly available.

PMR: Services almost always involve a hotch-potch of code at the server side (e.g. servlets, database wrappers, etc.). This can be a problem.

  5. APIs: all APIs associated with the service will be assumed to be open (that is their form may be copied freely by others). This would naturally follow from the fact that the code and data underlying any of the APIs are open.

PMR: This relates to documentation, I assume

  6. It is important that the service’s code need only be made available to its users so as not to impose excessive obligations on providers of open software services.

PMR: I read this as “here’s the source code but we are not under any obligation to install it for you or to make it work”. I agree with this.

As examples, the OSD cites Google Maps (not Open) and Wikipedia (Open):

  • Code: Mediawiki is currently F/OSS (and is made available)
  • Data: Content of Wikipedia is available under an ‘open’ licence.

One of the oft-quoted aspects of F/OSS is the “freedom to fork” (http://en.wikipedia.org/wiki/Fork_%28software_development%29 , and http://lwn.net/Articles/282261/ ). Forking is often a “bad idea” but it is the ultimate tool for preserving Openness, because it means that if the original knowledge stops being Open (becomes closed, dies, or is inoperable) then at least in theory someone can take the copy and continue its existence. I think this is fundamental for Open APIs.

The APIs must provide (implicitly or explicitly) the ability for someone to fork the software and content.

It doesn’t have to be easy and it doesn’t have to be cost-free. It just has to be *possible*.

The case of KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/ ) is a clear example of an Open service being closed (http://www.genome.jp/kegg/docs/plea.html ). In brief, the laboratory running the services used to make everything freely available (and implicitly Open) but now:

 

Starting on July 1, 2011 the KEGG FTP site for academic users will be transferred from GenomeNet at Kyoto University to NPO Bioinformatics Japan, and it will be available only to paid subscribers. The publicly funded portion, the medicus directory, will continue to be freely accessible at GenomeNet. The KEGG FTP site for commercial customers managed by Pathway Solutions will remain unchanged. The new FTP site is available for free trial until the end of June.

I would like to emphasize that the KEGG web services, including the KEGG API, will be unaffected by the new mechanism to be introduced on July 1, 2011. Our policy for the use of the KEGG web site will remain unchanged. The only change will be to FTP access. We have already introduced a “Download KGML” link for the KGML files that used to be available only by FTP, and will continue to improve the functionality of KEGG API. I would be very grateful if you could consider obtaining a KEGG FTP subscription as your contribution to the KEGG project.

I am not passing any moral judgment – you cannot pay people with promises. But the point is that an Open Service has become closed. With the “right-to-fork” it is possible to “clone” all the Open material (possibly with FTP) before the closure date and maintain an Open version. This may or may not be cost-effective, but it’s possible.
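In practical terms, “cloning before the closure date” could be as simple as mirroring the open FTP area while it is still accessible. A minimal sketch follows; the host, path and directory-listing support are entirely hypothetical and are not the real KEGG layout.

    # Minimal sketch of mirroring an FTP directory tree while it is still open.
    # Host and path are hypothetical; the server must support the MLSD command.
    import os
    from ftplib import FTP

    def mirror(ftp, remote_dir, local_dir):
        """Recursively copy remote_dir from an FTP server into local_dir."""
        os.makedirs(local_dir, exist_ok=True)
        ftp.cwd(remote_dir)
        for name, facts in ftp.mlsd():
            if name in (".", ".."):
                continue
            if facts.get("type") == "dir":
                mirror(ftp, remote_dir + "/" + name, os.path.join(local_dir, name))
                ftp.cwd(remote_dir)            # come back after recursing
            else:
                with open(os.path.join(local_dir, name), "wb") as handle:
                    ftp.retrbinary("RETR " + name, handle.write)

    if __name__ == "__main__":
        ftp = FTP("ftp.example.org")           # hypothetical host
        ftp.login()                            # anonymous login
        mirror(ftp, "/pub/open-data", "local_mirror")   # hypothetical path
        ftp.quit()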

So what is the KEGG API mentioned above and is it Open? Almost certainly not. It may be useful but it is clear that neither the software nor the complete contents of the database are available.

By contrast Wikipedia remains an Open API. It’s possible to clone enough of the software that matters and all of the content. Installing the software is probably non-trivial (yes, I can run Mediawiki, but there are all sorts of other things: configuration files, quality bots, etc.). And cloning the content means dumping a snapshot at a given time. But at least, if we care enough, it is LEGALLY and technically possible.
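On the content side the snapshot is concrete: full dumps are published at dumps.wikimedia.org, and individual pages can be pulled programmatically through the MediaWiki API. A small sketch (the query parameters follow the standard MediaWiki action API, but treat the details as indicative rather than definitive):

    # Sketch: fetch the current wikitext of one Wikipedia article via the
    # MediaWiki action API. Full snapshots come from dumps.wikimedia.org;
    # this only illustrates that programmatic, licence-respecting access exists.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def get_wikitext(title):
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        }
        request = urllib.request.Request(
            API + "?" + urllib.parse.urlencode(params),
            headers={"User-Agent": "open-api-demo/0.1 (example)"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.loads(response.read().decode("utf-8"))
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]       # the raw wikitext

    if __name__ == "__main__":
        print(get_wikitext("Open Knowledge Foundation")[:500])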

In the next post I will examine some of our own resources and how close they are to “OKD and OSD-open”. We fall down on details but we succeed in motivation.

Why do we continue to use Citations?

I have just got the following mail from Biomed Central about an article we published earlier this year (edited to remove marketing spiel, etc.)

Dear Dr Murray-Rust,

We thought you might be interested to know how many people have read your article:

ChemicalTagger: A tool for semantic text-mining in chemistry
Lezan Hawizy, David M. Jessop, Nico Adams and Peter Murray-Rust
Journal of Cheminformatics, 3:17   (16 May 2011)
http://www.jcheminf.com/content/3/1/17

Total accesses to this article since publication: 2117

This figure includes accesses to the full text, abstract and PDF of the article on the Journal of Cheminformatics website. It does not include accesses from PubMed Central or other archive sites (see http://www.biomedcentral.com/info/libraries/archive). The total access statistics for your article are therefore likely to be significantly higher.

Your article is ‘Highly accessed’ relative to age. See http://www.biomedcentral.com/info/about/mostviewed/ for more information about the ‘Highly accessed’ designation.

These high access statistics demonstrate the high visibility that is achieved by open access publication.

I agree. It does not, of course, mean that 2117 people have read the whole article. I imagine it removes obvious bots. Of course there could be something very compelling in the words in the title. After all (http://blogs.ch.cam.ac.uk/pmr/2011/07/08/plos-one-text-mining-metrics-and-bats/ ) the word “bats” in the title of one PLOSOne paper got 200,000 accesses (or it might have been “fellatio” – I wouldn’t like to guess). So I looked up “tagger” in Urban Dictionary and its main meaning is a graffiti writer. Maybe some of those could use a “chemicaltagger”? But let’s assume it’s noise.

So “Chemicaltagger” has been heavily accessed and probably even read by some accessors. Let’s assume that 10% of accessors – ca 200 – have read at least parts of the paper. That possibly means the paper is worth something. But not to the lords of the assessment exercise. Only the holy citation matters. So how many citations? Google Scholar (using its impenetrable, but at least free-if-not-open system) gives 3. Where from? Well from our prepublication manuscripts in DSpace. If we regard these as self-citations (disallowed by some metricmeisters) we get a Humpty sum:

3 – 3 = 0

So the paper is worthless.

If we wait 5 years maybe we’ll get 20 citations (I don’t know). But it’s a funny world where you have to wait 5 years to find out whether something electronic is valued.

So aren’t accesses better than citations? After all don’t we use box office receipts to tell us how good films are? Or viewing figures to tell us the value of a program? ["good" and "value" having special meanings, of course]. So why this absurd reliance on citations? After all Wakefield got 80 citations for his (retracted) paper on MMR Vaccine and autism. Many were highly critical. But it ups the index!

The reason we use citations as a metric is not that they are good – they are awful – but that they are easy. Before online journals the only way we could find out if anyone had noticed a paper was in the reference list. Of course references can be there for many reasons – some positive, some neutral, some negative and many completely ritual. They weren’t devised as a way of measuring value but as a way of helping readers understand the context of the paper and giving due credit (positive and negative) to others.

But, because academia is largely incapable of developing its own system of measuring value, it now relies on others to gather figures. And pays them lots of money. Citations are big business – probably 200–1,000 million USD per year. So it’s easier for academia to pay others precious funds. And all parties have a vested interest in keeping this absurd system going. Not because it’s good, but because it saves trouble. And of course the vendors of citation data will want to preserve the market.

This directly stifles academic research in textmining of typed citation data (i.e. trying to understand WHY a citation was provided). Big businesses with lawyers (e.g. Google) are allowed to mine data from academic papers. Researchers such as us are forbidden. Because bibliometrics is a massive business, any disruptive technology (e.g. Chemicaltagger, which could also be used for citations) is blocked by legal means. And we have to deprecate access data because that threatens the holy cow and holy income of citation sales.
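To show what “typed” citations means in practice, here is a toy sketch that classifies the sentence around a citation as supportive, critical or neutral using a few keyword cues. The cue lists are invented for illustration; real work (Chemicaltagger-style or otherwise) would need proper natural-language processing and, crucially, legal access to the full text.

    # Toy sketch: assign a crude "type" to a citation from its surrounding sentence.
    # The cue words are illustrative only; a real system would use proper NLP.
    SUPPORT_CUES = {"consistent with", "confirms", "agrees with", "supports",
                    "as shown by", "demonstrated"}
    CRITICAL_CUES = {"contrary to", "refutes", "retracted", "fails to",
                     "could not reproduce", "disagrees with", "flawed"}

    def citation_type(sentence):
        """Return 'supportive', 'critical' or 'neutral' for a citing sentence."""
        text = sentence.lower()
        if any(cue in text for cue in CRITICAL_CUES):
            return "critical"
        if any(cue in text for cue in SUPPORT_CUES):
            return "supportive"
        return "neutral"

    if __name__ == "__main__":
        examples = [
            "Our results are consistent with those of Smith et al. [12].",
            "We could not reproduce the findings reported in [7].",
            "Sequence data were processed as described previously [3].",
        ]
        for sentence in examples:
            print(citation_type(sentence), "-", sentence)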

The sooner we get academic texts safely minable – in bulk – the sooner we shall be able to have believable information. But I think there are many vested interests who will be trying to prevent this. After all, what does objectivity matter?