Comments on the Bourne plot and other EPSRC matters

For whatever reason I get very few comments on this blog (note: I try to be courteous to posters). Comments normally come when I upset people and overstep a line (“turning up the outrage knob”). I don’t generally do this consciously. Anyway I have got two useful comments on my criticism of the EPSRC Bourne plot (/pmr/2011/09/28/epsrc-will-eventually-have-to-disclose-where-the-bourne-incredulity-comes-from-so-please-do-it-now/ ).

I’ll post the comments and reply:

Jeremy Bentham says: September 28, 2011 at 9:17 am 

We need to raise the quality of this debate – I’m not saying EPSRC is right but it doesn’t help by distorting what they are actually saying. Let’s make a bit more effort to understand what EPSRC have done, why they have done it, and start arguing on a basis of accurate information. EPSRC have actually been very open about why they are doing this, what the process has been, and what inputs they used to make strategic decisions.

Bourne’s team have actually published quite a detailed explanation on how they made *informed judgements* about the physical sciences portfolio which is represented by the QI plot (just because a diagram isn’t quantifiable doesn’t mean it isn’t representative of something)

Read this and please blog again http://tiny.cc/f66xw

PMR: Thanks Jeremy. I agree that my challenge included discourse that I would not use in a measured response. It may trivialize some issues but it also emphasizes the real sense of injustice and arbitrariness felt by many (not just me). It’s a political, not an analytical, post and perhaps more suited to Prime Minister’s Question Time. I have read many of the chemistry documents you list and will comment. I cannot presently see how those documents translate into the Bourne diagram.

Walter Blackstock says: September 28, 2011 at 12:44 pm 

You may be interested in Prof Timothy Gowers’s deconstruction of the EPSRC announcement. http://gowers.wordpress.com/2011/07/26/a-message-from-our-sponsors/

PMR: This is an exceptionally powerful piece. http://en.wikipedia.org/wiki/Timothy_Gowers is not only a mathematics superstar but also a star of the blogosphere. He created and catalysed the Polymath project, in which a virtual group of mathematicians proved an extremely challenging unsolved conjecture. He is rightly hailed as a prophet of new web-based scholarship and I try to emulate the approach in activities such as Quixote. The piece is deliberately understated and he uses the EPSRC's own language as a weapon against their mathematics policy. (In simple terms, that policy is that only certain areas of maths are worth funding.)

PMR Comments on JB and EPSRC chemistry.

I will reread the documents and see what I may have missed. If there is a clear explanation of how Bourne came to create the diagram then I (and many others) have missed it and would welcome it. Let me take:

http://www.epsrc.ac.uk/SiteCollectionDocuments/Publications/reports/ChemistryIR2009.pdf

in which an invited group of top international chemists analysed the state of UK chemistry. I'm going to take two areas that I know a bit about, chemoinformatics and polymers (specifically polymer informatics, as we were sponsored by Unilever for a 4-year project). Both of these are represented by Bourne as the lowest quality of chemistry (presumably in the UK?). I am naturally upset at being represented as working in areas of very low quality – I believe it is the scientist and the actual science performed that should be judged, not the discipline. (There are limits to this, such as homeopathy and creationism, but all the sciences in Bourne are mainstream hypothesis- or data-driven science.) Let's remember that Einstein worked in a patent office and van't Hoff in a veterinary school (for which he was lambasted by Wislicenus).

There is no mention of chemoinformatics in the IR2009 report. If Bourne took that as an indication of quality that's completely unjustified. Here's polymers (my emphasis):

P17: There are examples of excellence that can be readily identified in synthesis, catalysis including biocatalysis, biological chemistry (specifically bioanalytical), supramolecular chemistry, polymers and colloids and gas-phase spectroscopy and dynamics.

P21: Polymer Science and Engineering. Following world-wide trends, the UK polymer chemistry community has made some important and excellent choices. In a coordinated action a few universities have set-up excellent programmes covering the entire chain-of-knowledge in polymer science and engineering. In this way, the UK is ensured a leading position in this important field for both scientific and industrial needs. There is strong evidence of growth and impact in polymer synthesis in the UK Chemistry community. There are several centres that have been created and whose members have established international reputations. The polymer synthesis community has placed the field on the map in the UK. There is interest in controlled polymer synthesis, water-processible and functional polymers. A very strong activity that has international recognition is the synthesis and processing of organic electronic materials where the UK is pre-eminent. Applications include polymer LEDs, field effect transistors, organic photovoltaics and sensors.

This bears no relation to the low-quality rating given to "Structural Polymers and technology".

Bourne labels chemoinformatics as the lowest quality of all: “Refocus – funding targeted towards identified priorities”. A simple reading of this is “you (==me) have been doing rubbish work and we are going to reform you”. Like a failing school.

So if this is going to happen, at least I need to know the reason for it. Who made the decision, and on what evidence? If the evidence is there (even that a group of the great-and-the-good made the plot), I'll comment objectively. Scientists are used to being told their work is poor or misguided – it is one of the strengths of science.

But so far the community as a whole seems to be finding it hard to get hard evidence.


EPSRC will eventually have to disclose where the Bourne Incredulity comes from, so please do it now

 

I blogged yesterday (/pmr/2011/09/27/the-clarke-bourne-discontinuity/ ) about Paul Clarke's exposure (http://shear-lunacy.blogspot.com/2011/09/bourne-incredulity.html ) of the EPSRC "decision making" process, which appears to use a 2-D plot of "quality" vs "importance". It has now seriously upset me. It needs the spotlight of Openness turned upon it.

Why is it so upsetting?

  • It appears to be used as an automatic method of assessing the fundability of research proposals. This in itself is highly worrying. Everyone agrees that assessing research is hard and inexact. Great scientists and scientific ideas are often missed, especially when they challenge orthodoxy. Bad or (more commonly) mediocre science is funded because it's in fashion. "Safe" is often more attractive than "exciting". The only acceptable way is to ask other scientists ("peers") to assess scientific proposals, which we do without payment as we know it is the bedrock of acceptable practice. There is not, and never will be (until humans are displaced by machines), an automatic way of assessing science. Indeed when a publisher rejects or accepts papers without putting them out to peer review most scientists are seriously aggrieved. For a Research Council (which allocates public money) it is completely unacceptable to allocate, or withhold, resources without peer review.

     

    And “peer-review” means by practising scientists, not office apparatchiks.

  • It appears to be completely unscientific. The use of unmeasurable terms such as “importance” and “quality” is unacceptable for a scientific organization. We live with the use of “healthy”, “radiant”, “volume” (of hair), as marketing terms – at best an expression of product satisfaction – at worst dangerously misleading. But “quality”?

     

    It might be possible to use “quality” as a shorthand for some metric. Perhaps the average impact factor (arggh!) of a paper in that discipline. Or the average score given by reviewers in previous grant proposals (urrgh!). Those would indicate the current metric madness, but at least they can be made Open and reproducible. “Importance”? The gross product of industries using the discipline? The number of times the term occurs in popular science magazines? Awful, but at least reproducible.

     

    But this graph, which has no metric axes and no zero point (this is the ENGINEERING and PHYSICAL SCIENCES Research Council), breaks high-school criteria of acceptability. Suppose the scales run from 99 to 100? Then all subjects are ultra-important and all subjects are top quality. Any reviewer would trash this immediately. (The sketch below illustrates the point with made-up numbers.)
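To make the axis-range point concrete, here is a minimal matplotlib sketch using invented "quality" scores (the numbers are purely illustrative, not from EPSRC): the same data plotted on a 99–100 axis looks like a dramatic spread, while on a 0–100 axis with a true zero the differences all but vanish.

    # Minimal sketch: how an axis with no zero point exaggerates differences.
    # The "quality" scores below are invented purely for illustration.
    import matplotlib.pyplot as plt

    disciplines = ["A", "B", "C", "D", "E"]
    quality = [99.2, 99.4, 99.5, 99.7, 99.9]  # hypothetical scores, all essentially equal

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    # Left: a "Bourne-style" view - no zero point, axis spans only 99-100
    ax1.bar(disciplines, quality)
    ax1.set_ylim(99, 100)
    ax1.set_title("Axis 99-100: spread looks dramatic")

    # Right: the same numbers with a true zero - the differences all but vanish
    ax2.bar(disciplines, quality)
    ax2.set_ylim(0, 100)
    ax2.set_title("Axis 0-100: essentially identical")

    for ax in (ax1, ax2):
        ax.set_ylabel("'quality' (arbitrary)")

    plt.tight_layout()
    plt.show()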

I intend to pursue this – in concert with Paul and Prof. Tony Barrett from Imperial College. We have to expose the pseudoscience behind this. (Actually it would make an excellent article for Ben Goldacre's "Bad Science" column in the Guardian.) We first have to know the facts.

Paul and Tony have used FOI to get information from the EPSRC. They haven’t yet got any. I have suggested they should use http://whatdotheyknow.com as it records all requests and correspondence.

The EPSRC cannot hide. They are a government body and required to give information. If they try a Sir Humphrey approach ("muddle and fluff") we go back with specific questions. And if they fluff those then it can be taken to the Information Commissioner.

The information will definitely come out.

So when it does, what are the bets?

  • 1-2: they can’t trace the origin, mumble and fluff
  • 1-1: created by someone in the EPSRC office (“from all relevant data”)
  • 2-1: copied from some commissioned report (probably private) from “consultants”
  • 2-1: a selected panel of experts
  • 3-1: created from figures dredged up on industry performance and reviewers (see above)
  • 3-1: copied from some other funding body (e.g. NSF)
  • 3-1: the field (i.e. some other origin – a dream, throwing darts )

EPSRC: the origins of this monstrosity will be revealed anyway, so it would be best to make a clean breast of it now. You are not the most popular research council – the RSC, the Royal Society and several others have written to ask you to think again about arbitrary and harmful funding decisions. The longer this goes on, the less credibility you will have.



Europe says (scientific) data should be Open. Neelie gets it! Do publishers?

Even in the burgeoning era of Openness it's great to see a strong stance from many political leaders. Here's European Commission Vice-President Neelie Kroes, Brussels, 22 September 2011 (reported by Ton Zijlstra): http://epsiplatform.eu/news/news/ec_vp_kroes_on_standardization_and_open_data (I have edited savagely)

Speaking at the Open Forum Europe Summit 2011 yesterday in Brussels European Commission Vice-President Neelie Kroes has delivered a speech titled “From Common Standards to Open Data“.

In her speech she presented the progress made towards a new legal framework for European Standardisation and interoperability. She said that "standards are indispensable for openness, freedom and choice", and that in the coming years the focus of work will be to make the proposed legal framework become law as soon as possible. Furthermore, the new European Interoperability Framework has been created, which will help create more and better interoperability, such as for cross-border public services in Europe.

We are going to take action: we are going to open up Europe’s public sector

I am convinced that the potential to re-use public data is significantly untapped. Such data is a resource, a new and valuable raw material. If we became able to mine it, that would be an investment which would pay off with benefits for all of us. (PMR emphasis) …

Third, benefits for science. Because research in genomics, pharmacology or the fight against cancer increasingly depends on the availability and sophisticated analysis of large data sets. Sharing such data means researchers can collaborate, compare, and creatively explore whole new realms. We cannot afford for access to scientific knowledge to become a luxury, and the results of publicly funded research in particular should be spread as widely as possible.

PMR: To hear a politician urging this is enormously encouraging.

And, perhaps most importantly, benefits for democracy because it enhances transparency, accessibility and accountability. After all, what could be more natural than public authorities who have collected information on behalf of citizens using their tax money giving it back to those same citizens. New professionals such as data journalists are our allies in explaining what we do.”

PMR: lots of exciting stuff on Open Data omitted…

We’ll also be looking at charging regimes because expensive data isn’t “open data”. In short, getting out the data under reasonable conditions should be a routine part of the business of public administrations.

We are planning two data portals to give simple and systematic access to public data at European level. First we should have, by next spring, a portal to the Commission’s own data resources. And second, for 2013, I am working on a wider, pan-European data portal, eventually giving access to re-usable public sector information from across the EU, federating and developing existing national and regional data portals.”

PMR: I’ve missed a good deal out. Read it. The message is:

  • Politicians care about open data
  • Politicians care about open science data
  • Politicians aren’t likely to be very happy with people who try to keep scientific data closed

So things are looking up. Open Data has arrived. I’m hoping that scientific publishers will realise that the future has arrived. Because if not, they will find they don’t have many allies.


The Clarke-Bourne discontinuity

I have (regrettably) only just discovered Paul Clarke’s blog http://shear-lunacy.blogspot.com/2011/09/bourne-incredulity.html . Paul is a synthetic chemist and over this year has challenged science funding policy, especially the UK EPSRC’s. He has mobilised opinion by collecting signatories to letters to EPSRC and getting the Royal Society of Chemistry to take this seriously (which they now do). It’s a great example of the power of the blogosphere. Continual, concerted action by individuals has a massive effect.

Read Paul’s posts this year (rather than a summary from me). Essentially the EPSRC has abandoned traditional methods of funding (PhD studentships and responsive mode (funding excellent ideas from anyone)) in favour of a metric mumbo-jumbo. EPSRC has decided that it will fund high quality, high importance subjects in “sponsorship” mode. This means, effectively, that they have set themselves up as kingmakers and supporters of the Holy Roman Empire of the chosen, rather than understanding that science comes from everywhere and leads anywhere and the skill is in letting this happen. In times of reduced funding everyone will suffer, but this model selects a few privileged groups and showers the limited wealth on them. The model is defensible within a commercial company aiming for particular markets – it does not work for public science.

Unfortunately it leads to the rule of numbers, and here is the EPSRC graph (“Bourne graph”) of desirability (copied from Paul’s blog). (Since it appears to be a work of fiction it is formally copyrightable but I’ll take that risk).

[The EPSRC “importance” vs “quality” plot (“Bourne graph”), copied from Paul Clarke’s blog]

It has two axes, and until the algorithm behind the coordinates is revealed they remain completely subjective. (There can be valid reasons for using bibliometrics to XY-plot disciplines, including interdisciplinary links as in the MESUR and Eigenfactor projects. But these are descriptive of information flow, not value.) Apparently FOI has been invoked to try to get clarity, without success. I am not sure it’s on WhatDoTheyKnow and I’ll suggest that.

To play the game, my own discipline of chemoinformatics scores high on importance and ultra-low on quality. OK, I have been critical of chemoinformatics but I doubt that EPSRC is listening to me. Who decided this? And how? There is no evidence that these indicators have any basis. Are they UK or worldwide?

And polymers – another area we have worked in. Low quality (what an arrogant statement to make without justification) and low importance (why did Unilever fund us strategically?).

And although synthetic organic chemistry is of high importance it is being systematically underfunded by EPSRC. The discipline that allows people to create and modify molecules addressing disease, materials, agrochemistry…

The subject which EPSRC itself selected as a Grand Challenge in Dial-a-Molecule.

I have no objection to graphs of this sort being dreamed up in awayday workshops. The problem is when they become hardcoded into policy tools, so that certain keywords guarantee that you won’t get funded.

Anyway great kudos to Paul for leading this activity. If I had known I would have signed his letters. He is in a “typical” chemistry post in that extradisciplinary work such as blogging is usually a negative factor. Campaigning takes away time from writing grants and doing research. Challenging the CEO of EPSRC probably doesn’t do you any good in getting funding.


Am. Chem. Soc. Skolnik Award 2012 (Henry Rzepa + PMR) and some future thoughts

Henry Rzepa and I have been honoured by the Herman Skolnik award of the Chemical Information Division (CINF) of the American Chemical Society. Details (from Phil McHale) can be found here:

http://www.ccl.net/cgi-bin/ccl/message-new?2011+09+26+014

Other than – naturally – feeling pleased for ourselves there are a number of points:

  • It is the first joint award that has been made in the history of the award. Our mutual support has been essential and kept us going – I have often referred to us as a “symbiote”. Joint awards are common in many disciplines and represent the need for cooperation and collaboration. Chemical informatics must, we believe, be a cooperative approach and there are signs that this is starting to happen.
  • It gives formal legitimacy to our drive for an Open semantic interoperable chemical information infrastructure. While we have never doubted that what we have designed and created is part of the present and future of chemical (and other scientific) information it is often a lonely journey. Sometimes it is possible to think that the time has passed and “it will never happen”. This award gives us great confidence to take ideas forward (and these are ideas shared by many others). It also – possibly – gives us greater believability when promoting particular actions or designs.
  • It pays tribute to the many others who have helped the ideas flourish. This is not a ritual Hollywood acceptance speech – Henry and I have been privileged to be part of a much wider community (Blue Obelisk, Quixote) which has been essential in allowing the ideas (not “our” ideas) to flourish, and even more importantly be implemented. CML owes a considerable amount of its credibility to the code written in Blue Obelisk and Quixote, and this has been done by scientists with free will (i.e. they have bought into the ideas and given their time). So I see Henry and myself as catalysts – we have helped to coordinate and focus the latent potential of the last two decades. Because we are catalysts we do not rely on the need for massive resources, and when people respond it is their work, not ours.
  • I am also very conscious of my very strong support from a number of people/organizations, especially in the Unilever Centre, Churchill College, Unilever and Microsoft Research. These have given me the resources to develop the tools and designs over the last 5 years, and without them it would have been very difficult to keep going and creating tangible information components. This means that our current trajectory is strong. I am continually picking up indications of the work being adopted by scientists and within organisations.

Last month I celebrated my 70th birthday. Unfortunately birthdays are part of the pseudo-scientific numerical metrics – anyone over 67 in this country finds it very difficult to get grants (and I am grateful to some funders, including JISC, for finding age irrelevant). It means (not necessarily a bad thing) that I can generally not act as a Principal Investigator. But, since the last 20 years of our work have relied on persuasion rather than coercion, nothing really changes. So many of my current efforts will continue, but perhaps in a different management setting, and perhaps less obviously coordinated.

My role now is a scientist working in joint opportunities, perhaps even patronage (a common enough method two centuries ago). I have several ongoing activities and opportunities.

  • I will continue to have a place in the Unilever Centre (including writing up more work, and collaborating on computational metabolism)
  • I have a visiting position at the European Bioinformatics Institute, working with Christoph Steinbeck in his group (ChEBI) on extensions to metabolism and NMR spectra.
  • I am continuing to take the Quixote idea forward and am visiting the Pacific Northwest National Laboratory (PNNL) next month to collaborate on semantics and a glossary for their Open Source NWChem program.
  • We will continue to support OSCAR4 and OPSIN – and now is the time to broaden the community. Both programs are now cutting-edge and competitive with closed-source offerings. But we need your involvement.
  • I hope to spend several months next year in Australia (more later, hopefully) – developing ontologies and semantics with Nico Adams.
  • And we have just submitted a proposal to JISC for Open Bibliography 2 and should hear in a few days.
  • And a whole host of new things in the Open Knowledge Foundation.

I personally believe that awards should look to the future as well as the past and I hope this gives some indication that the Skolnik award gives new impetus and hopefully new collaborators.

There are some special facets relating to the American Chemical Society who have honoured us. Readers of this blog may get the impression that I am “anti-ACS”. I am not (or I would not be a member and not be accepting this award). I owe my involvement in chemical informatics to superb workshops and publications run by the ACS in the 1970’s. They convinced me to go into informatics in the pharma industry. Their consistent support of informatics over at least 4 decades (and much longer if we count Chemical Abstracts) has been outstanding in science. But change is needed.

I believe that much of the future depends on “trusted” organizations and among these are universities, government and their organs, museums, libraries, scholarly societies/ scientific unions. None are perfect, but they are the best we have (“Indeed, it has been said that democracy is the worst form of government except all those other forms that have been tried from time to time.” http://en.wikiquote.org/wiki/Winston_Churchill). So with the learned societies. They are democratic in structure and they are the best we have (unfettered capitalism, as practised by some scholarly publishers, is not an attractive self-regulating alternative). Part of my role, I believe, is to help show ways forward for some aspects of learned societies. I believe that some of the opinions in this blog have influenced scholarly organizations.

The problem with large learned societies is that they can become corporate-like where decisions are taken for purely market reasons rather than the good of the society and its role. The ACS is not unique in this – it seems to happen in many disciplines (including bibliography). There is a size beyond which an organization is in danger of losing its roots and values. And the larger an organization the slower it is to react to anything. We thus have tensions along several axes:

  • Size.
  • Publishing and competition. As I have blogged, I think scholarly publishing is broken. It fosters local undemocratic monopolies. It is one of the weirdest markets there is – there are no conventional market forces and no elasticity. The customers have agreed to work in a system that severely disadvantages them. The problem is that modern scholarly publishing tends to corrupt, including corrupting scholarly organizations.
  • Commercial interests. I am not an anti-capitalist, but capitalism per se does not promote the ethics that I would like to see in scholarship. Chemistry has a very large commercial base, and this cannot overrule scholarly interests.
  • Web democracy. The web has opened new ideas of democracy and participation. These will not disappear. Yes, I am a web-idealist and I have seen ideals slip away over two decades but we are currently fighting for our democratic information future. The scholarly societies must recognize and build on this or they will continue to be riven by this tension.

Things are happening in at least the last three of these that offer hope. We are seeing pharma companies raise the level of what they treat as pre-competitive, and ideas of Open Source Drug Discovery emerging in a commercial setting. Could we hope for “Open Source” scholarly publishing?

Henry and I have the opportunity to run the Skolnik symposium on Aug 21st 2012 in Philadelphia. We want the form and content of this to reflect the work that we and others have done. It’s also a chance to get new ideas displayed and promoted. This is not a travelling award with a lecture but if people are interested in my visiting them I have a somewhat flexible diary.


Open APIs: My attempts to be Openly Open

Having argued that we need to define Open APIs better I’ll share my experiences as a contributor to Open Knowledge and the challenges that this poses. The first message:

Openness takes effort

Creating copyrighted works takes no extra effort: every keystroke, every digital photo is automatically protected by copyright. Whereas to make something Open you have to specifically add permission to use it. This is really tedious. It’s particularly difficult when the tools used have no natural support for adding permissions. So

    Most of the time I fail to add explicit permissions

That’s a fact. I am not proud of it. But, to give an example, I have just discovered that almost all my material in Cambridge DSpace does not provide Open permissions. That was never my intention. But the tools don’t allow me to click a single button and change it. I have to add rights to every single document (I have ca. 90). Meanwhile the automatic system continues to pronounce “Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.” This may be true, but it isn’t the sign of a community trying to make material Open.

I have spent the last 20 minutes trying to find out how to add permissions to my material. It’s impossible. (DSpace is one of the worst systems I have encountered). So, unless you get everything right in DSpace at time of submission you cannot edit the permissions. And, of course, by default the permissions are NO, NO, NO. I should add that some of the material has been uploaded by my colleagues. So next rule:

    Unless EVERYONE involved in the system understands and operates Openness it won’t get added.

So all my intended Openness has been wasted. No-one can re-use my material because we cannot find out how to do it.

Now take this blog. I started without a licence. Then I added an NC one. Or rather I asked the blogmeister to add it (I was not in control of the Blog). And that’s the point. Very often you have to ask someone else to actually add the Openness. Then I was convinced of the error of NC and changed to CC-BY.

However in late 2010 the sysadmins changed the way the blog was served. This caused many problems, but among others it destroyed the CC-BY notice and licence. So:

    Sometimes the systems will destroy the Openness.

So I would have to put in a ticket to have the CC-BY licence restored. All these little things add up to a lot of hassle. Just to stay where we are.

So to summarise so far. (Remember I WANT everything to be OKD-Open unless indicated).

  • Blog. Not formally Open. Needs a licence added to the site and to each post?
  • DSpace. Almost nothing formally Open. Unable to change that personally. Would have to spend time with the repositarians.
  • Posts to lists. I *think* the OKF lists have a blanket Openness. But I’m not sure.
  • Tweets. No. And I wouldn’t know how to make them Open.
  • Papers. When we publish in BMC or PLoS they are required to be Open and obviously are. Pubs in IUCr are “Green” because we publish on CIF and this is free.

Now some better news. All our software is in Open Source Repositories.

  • Software/Sourceforge. Required to be Open. Licence in (some of) the source code. Probably LICENSE.txt indicating Artistic Licence.
  • Bitbucket. Ditto.

     

    Open Source software Openness is fairly trivial to assert.

Services.

  • OPSIN (http://opsin.ch.cam.ac.uk/ ). This is a free service (intended to be OKD-Open, but not labelled) which converts chemical names to structures. The software (OPSIN) is Open Source (Artistic Licence); the data comes from the user. The output should be labelled as Open (but isn’t) – that would require Daniel to add licence info to the website. However I expect the webserver is the typical local configuration of Java/Python/Freemarker etc. and doesn’t easily transport. So what do we have to do about glueware? If it doesn’t port, is Openness relevant? (A sketch of calling the service follows below.)

Services are naturally non-Open by default.

Glueware is a problem.
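To make the OPSIN example concrete, here is a minimal sketch of calling the service to turn a chemical name into a SMILES string. The URL pattern (/opsin/<name>.smi) is an assumption based on how the service has been described – check the site’s own documentation before relying on it.

    # Minimal sketch: ask the OPSIN web service to parse a chemical name.
    # The /opsin/<name>.smi URL pattern is an assumption; check the service docs.
    import urllib.parse
    import urllib.request

    def name_to_smiles(name: str) -> str:
        """Return the SMILES string that OPSIN derives from a chemical name."""
        encoded = urllib.parse.quote(name)
        url = f"http://opsin.ch.cam.ac.uk/opsin/{encoded}.smi"
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8").strip()

    if __name__ == "__main__":
        print(name_to_smiles("2,4,6-trinitrotoluene"))

Because the software itself is Open Source, anyone could in principle run their own instance of the service – which is exactly the “right to fork” argued for in the previous post.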

Data.

  • Crystaleye. 250,000 crystal structures (http://wwmm.ch.cam.ac.uk/crystaleye ). We worked hard on this. We have added “All data on this site is licensed under PDDL and all Open Data buttons point implicitly to the PDDL licence” on the top page, “All of the data generated is ‘Open’, and is signified by the appearance of the icon on all pages” at the bottom, and the Open Data button on each page. And yet, according to CKAN, it’s still not Open because it cannot be downloaded in bulk. (Actually it can, and Jim Downing wrote a downloader; this was non-trivial effort, and I’ll come back to it later – a sketch follows below.) So this is our best attempt (other than software) at Openness.

    Even when the data items are OKD-Open, the site is not necessarily so.
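For the “downloadable in bulk” criterion, here is a minimal sketch of what a bulk downloader has to do: walk an index page and fetch every linked data file. The index URL and the “.cml” link suffix are placeholders rather than Crystaleye’s real layout – the actual downloader (Jim Downing’s) deals with the site’s true structure.

    # Minimal sketch of a bulk downloader for an Open data site:
    # read an index page, collect the links, and fetch each data file.
    # The index URL and ".cml" suffix are placeholders, not the real site layout.
    import os
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect href attributes from anchor tags on an index page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    if key == "href" and value:
                        self.links.append(value)

    def bulk_download(index_url, suffix=".cml", out_dir="crystaleye_dump"):
        os.makedirs(out_dir, exist_ok=True)
        html = urllib.request.urlopen(index_url).read().decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            if href.endswith(suffix):
                url = urljoin(index_url, href)
                target = os.path.join(out_dir, os.path.basename(href))
                urllib.request.urlretrieve(url, target)
                print("fetched", url)

    # Example (placeholder index page):
    # bulk_download("http://wwmm.ch.cam.ac.uk/crystaleye/some-index.html")

Even this toy version shows why “can be downloaded in bulk” is a non-trivial criterion: someone has to write, run and maintain the crawler.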

So here we are, trying to be Open, and finding it a big effort, and failing on several counts. I bet that is generally true for others, especially if they didn’t plan it from the start. So

Openness has to be planned from the start as part of the service.

(The trouble is that much of what we have done wasn’t formally planned! Much is successful experimentation over several years).

 


Open APIs: fundamentals and the cases of KEGG and Wikipedia

It is now urgent and vital that we define what an “Open API” is. The phrase is widely used, usually without any indication of what it offers and what restrictions – if any – it imposes. This blog is a first pass – I don’t expect to get everything “right” and I hope comments will evolve it towards something generally workable. Among other things we shall need:

  • An agreement that this matters and that we must strive for OKD-open
  • Tools to help us manage it
  • A period of constructive development in trying to create fully Open APIs and a realisation of the problems and costs involved

I shall also list some additional criteria that I think are important or critical.

Firstly the word “Open” (capitalised as such) is intended to convey the letter and the spirit of the Open Definition:

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.

This is a necessary but not sufficient condition for an “Open API”. What is the “API” bit?

It stands for “Application Programming Interface” (http://en.wikipedia.org/wiki/Application_programming_interface ). In the current context it means a place (usually a website) where specific pieces of data can be obtained on request. It is often called a “service” and hence comes under the Open Service Definition.

“A service is open if its source code is Free/Open Source Software and non-personal data is open as in the Open Knowledge Definition (OKD).”

This is necessary, but not sufficient for what we now need. The rationale for the addition of F/OSS software is explained in
http://www.opendefinition.org/software-service/

The Open Software Service Definition defines ‘open’ in relation to online (software) services.

An online service, also known under the title of Software as a Service (SaaS), is a service provided by a software application running online and making its facilities available to users over the Internet via an interface (be that HTML presented by a web-browser such as Firefox, via a web-API or by any other means).

PMR: Generally agreed. This can cover databases, repositories and other services. I shall try to illustrate this below.

With an online-service, in contrast to a traditional software application, users no longer need to ‘possess’ (own or license) the software to use it. Instead they can simply interact via a standard client (such as web-browser) and pay, where they do pay, for use of the ‘service’ rather than for ‘owning’ (or licensing) the application itself.

PMR: I don’t fully understand this. I think there has to be an option for gratis access – otherwise how does the system qualify as Open? But we do have to consider costs.

The Definition

An open software service is one:

  1. Whose data is open as defined by the Open Knowledge Definition with the exception that where the data is personal in nature the data need only be made available to the user (i.e. the owner of that account).
  2. Whose source code is:
    1. Free/Open Source Software (that is available under a license in the OSI or FSF approved list — see note 3).
    2. Made available to the users of the service.

I shall revisit “whose data” later, and particularly the need to add a phrase such as “and is made available”.

Notes

  1. The Open Knowledge Definition requires technological openness. Thus, for example, the data shouldn’t be restricted by technological means such as access control and should be available in an open format.

PMR: Agreed. It may also mean that you do not need to buy/licence proprietary tools to access the data. Is a PDF document Open? The software required to READ it is usually closed. An additional concern here is the use of DRM (Digital Rights Management).

  2. The OKD also requires that data should be accessible in some machine automatable manner (e.g. through a standardized open API or via download from a standard specified location).

PMR: This is critical. I read this as “ALL the data”.

  3. The OSI approved list is available at: http://www.opensource.org/licenses/ and the FSF list is at: http://www.gnu.org/philosophy/license-list.html
  4. For an online-service simply using an F/OSS licence is insufficient since the fact that users only interact with the service and never obtain the software renders many traditional F/OSS licences inoperative. Hence the need for the second requirement that the source code is made publicly available.

PMR: Services almost always involve a hotch-potch of code at the server side (e.g. servlets, database wrappers, etc.). This can be a problem.

  5. APIs: all APIs associated with the service will be assumed to be open (that is their form may be copied freely by others). This would naturally follow from the fact that the code and data underlying any of the APIs are open.

PMR: This relates to documentation, I assume

  6. It is important that the service’s code need only be made available to its users so as not to impose excessive obligations on providers of open software services.

PMR: I read this as “here’s the source code but we are not under any obligation to install it for you or to make it work”. I agree with this.

As examples, the OSD cites Google Maps (not Open) and Wikipedia (Open):

  • Code: Mediawiki is currently F/OSS (and is made available)
  • Data: Content of Wikipedia is available under an ‘open’ licence.

One of the oft-quoted aspects of F/OSS is the “freedom to fork” (http://en.wikipedia.org/wiki/Fork_%28software_development%29 and http://lwn.net/Articles/282261/ ). Forking is often a “bad idea” but it is the ultimate tool in preserving Openness, because it means that if the original knowledge stops being Open (becomes closed, dies, is inoperable) then at least in theory someone can take the copy and continue its existence. I think this is fundamental for Open APIs.

The APIs must provide (implicitly or explicitly) the ability for someone to fork the software and content.

It doesn’t have to be easy and it doesn’t have to be cost-free. It just has to be *possible*.

The case of KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/ ) is a clear example of an Open service being closed (http://www.genome.jp/kegg/docs/plea.html ). In brief, the laboratory running the services used to make everything freely available (and implicitly Open) but now:

 

Starting on July 1, 2011 the KEGG FTP site for academic users will be transferred from GenomeNet at Kyoto University to NPO Bioinformatics Japan, and it will be available only to paid subscribers. The publicly funded portion, the medicus directory, will continue to be freely accessible at GenomeNet. The KEGG FTP site for commercial customers managed by Pathway Solutions will remain unchanged. The new FTP site is available for free trial until the end of June.

I would like to emphasize that the KEGG web services, including the KEGG API, will be unaffected by the new mechanism to be introduced on July 1, 2011. Our policy for the use of the KEGG web site will remain unchanged. The only change will be to FTP access. We have already introduced a “Download KGML” link for the KGML files that used to be available only by FTP, and will continue to improve the functionality of KEGG API. I would be very grateful if you could consider obtaining a KEGG FTP subscription as your contribution to the KEGG project.)

I am not passing any moral judgment – you cannot pay people with promises. But the point is that an Open Service has become closed. With the “right-to-fork” it is possible to “clone” all the Open material (possibly with FTP) before the closure date and maintain an Open version. This may or may not be cost-effective, but it’s possible.

So what is the KEGG API mentioned above and is it Open? Almost certainly not. It may be useful but it is clear that neither the software nor the complete contents of the database are available.

By contrast Wikipedia remains an Open API. It’s possible to clone enough of the software that matters and all of the content. Installing the software is probably non-trivial (yes, I can run MediaWiki, but there are all sorts of other things: configuration files, quality bots, etc.). And cloning the content means dumping a snapshot at a given time. But at least, if we care enough, it is LEGALLY and technically possible.
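As a concrete sketch of the content half of that clone, here is a minimal script that fetches a snapshot dump of the English Wikipedia. The dumps.wikimedia.org location is the standard public dump site; the exact filename varies between snapshots, so treat it as indicative and check the dump index first.

    # Minimal sketch: exercising the "right to fork" Wikipedia's content by
    # streaming a snapshot dump to disk. Check the dump index for the current
    # filename; this one is indicative.
    import shutil
    import urllib.request

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    def fetch_dump(url=DUMP_URL, target="enwiki-snapshot.xml.bz2"):
        # Stream the (very large) file rather than reading it into memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        print("saved snapshot to", target)

    if __name__ == "__main__":
        fetch_dump()

Whether anyone actually does this is beside the point; what matters for Openness is that they legally and technically can.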

In the next post I will examine some of our own resources and how close they are to “OKD and OSD-open”. We fall down on details but we succeed in motivation.


Why do we continue to use Citations?

I have just got the following mail from Biomed Central about an article we published earlier this year (edited to remove marketing spiel, etc.)

Dear Dr Murray-Rust,

We thought you might be interested to know how many people have read your article:

ChemicalTagger: A tool for semantic text-mining in chemistry
Lezan Hawizy, David M. Jessop, Nico Adams and Peter Murray-Rust
Journal of Cheminformatics, 3:17   (16 May 2011)
http://www.jcheminf.com/content/3/1/17

Total accesses to this article since publication: 2117

This figure includes accesses to the full text, abstract and PDF of the article on the Journal of Cheminformatics website. It does not include accesses from PubMed Central or other archive sites (see http://www.biomedcentral.com/info/libraries/archive). The total access statistics for your article are therefore likely to be significantly higher.

Your article is ‘Highly accessed’ relative to age. See http://www.biomedcentral.com/info/about/mostviewed/ for more information about the ‘Highly accessed’ designation.

These high access statistics demonstrate the high visibility that is achieved by open access publication.

I agree. It does not, of course, mean that 2117 people have read the whole article. I imagine it removes obvious bots. Of course there could be something very compelling in the words in the title. After all (/pmr/2011/07/08/plos-one-text-mining-metrics-and-bats/ ) the word “bats” in the title of one PLOSOne paper got 200,000 accesses (or it might have been “fellatio” – I wouldn’t like to guess). So I looked up “tagger” in Urban Dictionary and its main meaning is a graffiti writer. Maybe some of those could use a “chemicaltagger”? But let’s assume it’s noise.

So “Chemicaltagger” has been heavily accessed and probably even read by some accessors. Let’s assume that 10% of accessors – ca 200 – have read at least parts of the paper. That possibly means the paper is worth something. But not to the lords of the assessment exercise. Only the holy citation matters. So how many citations? Google Scholar (using its impenetrable, but at least free-if-not-open system) gives 3. Where from? Well from our prepublication manuscripts in DSpace. If we regard these as self-citations (disallowed by some metricmeisters) we get a Humpty sum:

3 – 3 = 0

So the paper is worthless.

If we wait 5 years maybe we’ll get 20 citations (I don’t know). But it’s a funny world where you have to wait 5 years to find out whether something electronic is valued.

So aren’t accesses better than citations? After all, don’t we use box office receipts to tell us how good films are? Or viewing figures to tell us the value of a programme? [“good” and “value” having special meanings, of course]. So why this absurd reliance on citations? After all, Wakefield got 80 citations for his (retracted) paper on the MMR vaccine and autism. Many were highly critical. But it ups the index!

The reason we use citations as a metric is not that they are good – they are awful – but that they are easy. Before online journals the only way we could find out if anyone had noticed a paper was in the reference list. Of course references can be there for many reasons – some positive, some neutral, some negative and many completely ritual. They weren’t devised as a way of measuring value but as a way of helping readers understand the context of the paper and giving due credit (positive and negative) to others.

But, because academia is largely incapable of developing its own system of measuring value, it now relies on others to gather figures. And pays them lots of money. Citations are big business – probably 200–1,000 million USD per year. So it’s easier for academia to pay precious funds to others. And all parties have a vested interest in keeping this absurd system going. Not because it’s good, but because it saves trouble. And of course the vendors of citation data will want to preserve the market.

This directly stifles academic research in textmining of typed citation data (i.e. trying to understand WHY a citation was provided). Big businesses with lawyers (e.g. Google) are allowed to mine data from academic papers. Researchers such as us are forbidden. Because bibliometrics is a massive business. And any disruptive technology (e.g. Chemicaltagger, which could also be used for citations) must be prevented by legal means. And we have to deprecate access data because that threatens the holy cow and holy income of citation sales.

The sooner we get academic texts safely minable – in bulk – the sooner we shall be able to have believable information. But I think there are many vested interests who will be preventing this. After all what does objectivity matter?


Journal of Cheminformatics special issue: Visions of a Semantic (Molecular) Future

Over approximately the past three months, colleagues and I have been writing and editing 15 articles for the Journal of Cheminformatics on “Visions of a Semantic (Molecular) Future”. We’ve finally got to the stage where all 15 articles have been accepted and are in the final stages of processing. We expect the “issue” to appear RSN (“Real Soon Now”).

Most of the submitted drafts can be found here:

http://www.dspace.cam.ac.uk/handle/1810/238409/browse?type=title&sort_by=1&order=ASC&rpp=20&etal=-1&null=&offset=20

(Note that DSpace is poorly designed for managing collections of documents so we haven’t been able to provide our own title page which links to the articles and explains them – for that you have to know the “handles” and manually edit them. So there are also additional materials).

I wrote an editorial which can be found in full here http://www.dspace.cam.ac.uk/handle/1810/238399 and I’ll quote some sections. Note that several of the papers are general and the chemistry is almost incidental.

I’d like to thank the editorial staff of Biomed Central very much. These aren’t ritual thanks – many publishers deserve few thanks, given their attitude of holding scholarship in the dark ages for their own benefit rather than serving readers and authors. It’s also no thanks to Springer (who own BMC) – see /pmr/2010/11/11/versitaspringer-%E2%80%93-please-edit-our-commercial-journals-for-free-so-we-can-sell-them-to-you/ – whose interview with Richard Poynder (http://poynder.blogspot.com/2011/01/interview-with-springers-derk-haank.html) showed that Springer simply regards academia as a (guaranteed) source of income.

I have commented before on how important BMC has been in establishing the credibility of Gold Open Access: /pmr/2010/06/11/reclaiming-our-scholarship-tribute-to-vitek-tracz-and-bmc/

Vitek, Matt Cockerill and others have shown that a publisher aimed at providing a service to the community can make a viable income (including profit). That in itself is valuable. But BMC, probably more than any other OA publisher, has caught the spirit of OA and more generally Openness in that it has been active in developing new facets to Openness (such as the Open Data awards and the adoption of the Panton Principles). And I confidently expect to be working in collaboration with BMC in the future and reporting it on this blog.

Iain Hrynaszkiewicz, Jan Kuras, Bailey Fallon and the editors (Christoph Steinbeck and David Wild) have all helped to adopt new features in this issue. Dan Zaharevitz’s contribution is unusual – it’s a transcript of his talk, which I think captures the historical aspects of cheminformatics far better than sentences with passive verbs. Henry Rzepa, our Open Bibliography group and I eat our own dogfood, and the editors have accepted this (Elsevier totally destroyed my last attempt to publish in HTML).

But the conventional publication process is out-of-date. The reviews have been useful. They’ve caught batches of typos, and we have added sections in response. Some reflect the different slants on publication and the tension between the new and the conventional. There are probably still glitches.

It’s taken about 1-2 months to write the articles (some of the authors like writing, some do not!). And about 10 weeks for the papers to go through the review process (most were posted to DSpace on 2011-07-04). And a bit more before they appear in print. I have to give great thanks to Charlotte (Bolton) who acted as amanuensis (and also to EPSRC who provided Pathways to Impact funding for the symposium and publication process).

So the timescale is probably about as good as it gets. But because BMC is an OA publisher we’ve posted the manuscripts in DSpace and Google (and perhaps you, dear reader) have been reading them. So publication was effectively immediate.

What have we gained from the formal publication process? Undoubtedly there will be people who don’t read blogs who will read them because they are in J.Cheminf. They have a formal bibliographic entry in a way that blogs don’t (yet) – but that will change. They are better because of the review process.

But the main apparent value is that they are citable, and citable for establishing the personal merit of the authors. For me that’s irrelevant – for some of the authors it’s very important. But *why* is the publishing of papers still stressed in this fashion? A paper about OSCAR4 is far less use in practice than the material we provided for the launch (tutorials, examples, downloads, etc.). Open Bibliography will be judged by how well it supports Open Scholarship – not by a paper. Henry and me recounting the development of CML might well be better in a video. Dan Z is certainly better in video! We have to change and there are increasing indications that non-paper outputs will start to be valued.

So here are some snippets from my editorial:

The articles have a common theme of representing information in a semantic manner – i.e. being largely “understandable” by machine. This theme is common across science and many of the articles can and should be read by people outside the chemical sciences, including information scientists, librarians, etc. An emergent phenomenon of the last two decades is that information systems can grow without top-down directions. This is disruptive in that it empowers anyone with energy and web-skills, and is most powerful when exercised in communities of people with similar or complementary skills.

It is often possible to move very quickly, and in our hackfests (one was prepended to the symposium) we have shown that it is possible to prototype within a day or two. This creates a new generation of scientist-hackers (I use “hacker” as “A person who enjoys exploring the details of programmable systems and stretching their capabilities” [1]). Several of the authors in this issue would regard themselves as “hackers” and enjoy communicating through software and systems rather than written English. This stretches the boundaries of the possible but also creates tension where the mainstream world cannot react on a hacker timescale and with hacker ethics.

More generally many scientists and information professionals are increasingly frustrated with the conventional means of disseminating science. Most conventional publishers regard scientific articles as “their content” and a very recent article (2011-06-20) from the STM publishers [2] indicates that the publishers believe they have the right to determine how content is, or more often is not, used. As an example most forbid by default indexing, textmining, repurposing, even of factual data to which the scientist has a legitimate subscription. This has an entirely negative effect on information-driven science, preventing even the development of the technology.

Generally, therefore, there is a culture of bottom-up change (“web democracy”) which looks to the modern web and examples of empowerment. (There are also examples of disempowerment such as attacks on Net-neutrality, walled gardens, information monopolies, vendor lock-in, etc. and this contrast activates many in the modern informatics world). There are several articles, therefore, whose main theme is the access to Open information.


I now believe that in many cases it is unethical to restrict access to publicly funded science. Lessig, in his CERN talk (“Scientific Knowledge Should Not Be Reserved For Academic Elite” [3]), showed that it would cost 500 USD for him to read the top 10 papers relating to his child’s condition. These papers are effectively only available to academics in rich universities. A colleague recently told me he had spent a month researching the literature of his child’s condition (to critically effective purpose) and we agreed he could only do this because he was a professor at a University. That is one reason I support the Open Knowledge Foundation and its projects to define and obtain Open information (of which Open Bibliography [4] in this issue is typical).


Because of this, chemistry has almost no public ontologies, and we have a vicious circle. Without ontologies, authors cannot reasonably be expected to create semantic information, and without a clear need for semantic information, the community will not take on the considerable load of creating ontologies. Several of the articles argue that the creation of lightweight dictionaries and other semantic metadata is affordable by the community and I believe that if the communal will is present, then it would be possible through bodies such as IUPAC and others, to create a full semantic infrastructure for much of the current published chemistry.

The current legal and contractual restrictions on re-using chemical data are seriously holding chemistry behind other subjects. These articles in this issue are not the place for polemics but we hope that traditional creators of information resources in chemistry will now think carefully about the value of making their data fully Openly available. This will be a considerable act of faith, because it will need a change in business model. Some of those providers have been traditionally held in high esteem by the community and if they use that esteem they have the opportunity to change the practice of chemical informatics.


A major feature underlying all of the papers is to give an insight into the process of creating an information ecology. Some of them represent scientific discoveries (e.g. Rzepa) but most are concerned with building a coherent infrastructure usable by the community. It may be useful to liken this infrastructure to the development of instrumentation in many branches of science. Science depended on the microscope, the telescope, the spectrograph, the Geiger counter and many other types of instrumentation. There is sometimes a modern tendency to discount instrumentation and infrastructure as not being ‘proper science’. We hope that this issue will redress that balance


Several of the articles (CML [13], OSCAR [14], OPSIN, dictionaries [15], WWMM [16]) in this issue cover a decade of work. We hope this will be useful to scientists and scholars who wish to implement new ideas and to give them some idea of what works, and what, more commonly, does not work. Sometimes only the passage of time and persistence achieves some level of success. Again, the short-termism of many infrastructural projects militates against developing a good platform for the future


A number represent growing points whose development is highly unpredictable. These include the WWMM [16], where the vision of a distributed peer-to-peer knowledge resource has had to wait a decade until it could be implemented. The Quixote project is only months old but takes this vision and has already built an impressive prototype, which I expect to set the model for computationally-based knowledge repositories. These projects rely heavily on community, and this is most clearly shown in the Blue Obelisk movement [20] which aims to, and has largely succeeded in, creating an Open infrastructure for cheminformatics. A major motivation for this has been not just that software and data should be universally available but also that this is the only manner in which science can be reputably validated both by humans and machines. An example of the need for such validation is shown in Henry Rzepa’s article [21].


The relative stagnation of chemical informatics suggests that change is unlikely to happen from within chemistry. As progress occurs in other areas (retail, bioscience etc.) chemistry may be dragged into the semantic world regardless. If chemists wish to retain control over their own systems they will be wise to start investing in Open semantic environments, because otherwise the rest of the world will do it for them.

How can chemical informatics survive and prosper? I think the most likely model will be Open publishing, not just of texts but data and other resources, mandated and paid for by funders. Those publishers which are able to adopt an Open model rather than continuing to maintain their own walled gardens, will ultimately triumph, and probably more rapidly than we expect.



Open API (or glorious API?)

It is becoming critical that we (everyone?) define what is meant by “open API” and what it means operationally. This post introduces the problem – the next will suggest some ways forward.

Why does this matter? Isn’t “open” an indication of goodwill towards others? A general philosophy that we’d like to share things and work together? That things should be free?

No.

The problem is that “open” is used in so many contexts, often without thought, that it can become almost meaningless. And if you take it lightly it will cost you money or land you in court.

What does “Open Access” mean? I am sure all readers know.

Except they don’t. If you are asked to pay 3000 USD as the author of an Open Access scholarly article, what are you getting? And what are you offering to the rest of the world? Often it is seriously unclear. Why pay 3000 USD if you can post your article as “Green Open Access”? Are you allowed to post your article? Can you re-use it?

In fact don’t you have to read the small print of every single publisher (if you can find it, which usually I cannot)? And make sure that what you do isn’t going to end you up in court? Yes, I’m serious. If you post a single image from a Wiley journal you are still in danger of being sued or having your subscription cut off (http://scienceblogs.com/retrospectacle/2007/04/when_fair_use_isnt_fair_1.php ). Claiming that you had some nebulous “open” right or “fair use” isn’t going to remove the lawyers. Wiley still require you to ask permission to re-use “their” material (even if you wrote it or drew the pictures).

In short I believe “Open” is only useful as an operational term if it is clearly defined as something that frees us from the threat of lawyers.

Many people use “open” as Humpty Dumpty uses “glory”:

 “There’s glory for you!”
   “I don’t know what you mean by ‘glory,’ ” Alice said.
   Humpty Dumpty smiled contemptuously. “Of course you don’t—till I tell you. I meant ‘there’s a nice knock-down argument for you!’ “
   “But ‘glory’ doesn’t mean ‘a nice knock-down argument,’ ” Alice objected.
   “When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”

Here’s a conversation I had with a vendor of information systems about 2 years ago at a JISC meeting:

V: “We have an Open API” (implying this was a GOOD THING)

Me: “Can you let me have a copy of the spec?”

V: “No, it’s confidential to customers”

Me: “If I purchased your system could I share the API with others?”

V: “No, that’s a breach of contract” (i.e. I might be sued).

I have had this conversation with other vendors. When I questioned them I was told that I had a different idea of Open from theirs. (True.) This use of “open” seems to be about as useful as “healthy”. The most charitable interpretation is that they have actually documented their API. “Open” is frequently a marketing word, a word to make you feel good (about the “open”-ers), or just fuzz to show that the heart is in the right place.

And as such I shall replace it by Humpty’s word, “glorious”.

V: We have a glorious API.

Me: No quibble. Meaningless marketspeak, but I’m used to that.

So whenever you hear “open”, substitute “glorious” and see if you have lost any information.

Open Source does this well. I know that SourceForge, Outercurve, Apache, Bitbucket and GitHub contain Open Source programs. And if I look this up at the OSI I find: http://www.opensource.org/docs/osd

The OSD begins: “Open source doesn’t just mean access to the source code. The distribution terms of open-source software must comply with the following criteria.” The criteria are simple (they fit on a page) and crystal clear to English speakers (and have of course been translated). I’m just giving the headings here, but READ them.

1. Free Redistribution

2. Source Code The program must include source code, and must allow distribution in source code as well as compiled form

3. Derived Works The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.

4. Integrity of The Author’s Source Code

5. No Discrimination Against Persons or Groups

6. No Discrimination Against Fields of Endeavor

7. Distribution of License The rights attached to the program must apply to all to whom the program is redistributed

8. License Must Not Be Specific to a Product

9. License Must Not Restrict Other Software

10. License Must Be Technology-Neutral

 

This doesn’t stop you running a business on Open Source (Red Hat, Kitware ++). Or having moral control – as long as you can exercise it through e-charisma. But, in principle and usually in practice, anyone has the right to copy and fork your code. It may be frowned upon, but it will not bring the lawyers.

Whereas if you fork copyrighted material – even material “freely” available on the web, and even material not created by the copyright holder – lock the door or leave the country. (The original idea that academics signed over their copyright to publishers so that publishers could protect academics from pirates seems tragically distant now. Publishers “own” OUR material for their own ends.)

So the Open Access declaration (I use Budapest http://www.soros.org/openaccess/read ) had the same noble principles:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

These are great principles, and COULD have been crafted into a legal framework that ensured readers could re-use Open material without fear. But in practice the community did not address this, with the result that until recently no one knew what “open access” meant in practice. If I post a “green open access” copy of a “publisher’s article” and re-use it for any purpose, I can still be sued. There is no legal gift, no legal guarantee.

The major progress here has been the emergence of “Open Access” publishers. These are – in the main – characterised by the use of CC-BY licences: a licence which EXPLICITLY gives the reader/user rights. With Open Access publishers you can sleep soundly.

Note that if a document is not completely Open its status is effectively closed in legal terms. This is not a quibble – ask the lawyers when they come after you.

Sadly, Institutional Repositories have almost completely failed to promote Open Access. Almost no content carries explicit rights, and without those rights you can only assume that the content is closed. If you doubt this, try to find more than 5% of any IR which is explicitly marked as Open/CC-BY. And how did you search for it? By hand – because repositories generally don’t provide search-by-legal-rights. So almost all content in IRs is “glorious”.
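To make the point concrete, here is a minimal sketch (not a robust harvester) of what a machine check could look like: pull one page of Dublin Core records from a repository’s OAI-PMH interface and count those whose dc:rights field names an explicit open licence. The endpoint URL is hypothetical, the script follows no resumptionTokens, and in most real repositories the rights field is simply empty – which is exactly the problem.

```python
# Minimal sketch: count records with an explicit open licence in dc:rights.
# The endpoint is a hypothetical example; a real survey would page through
# resumptionTokens and handle errors.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

endpoint = "https://repository.example.ac.uk/oai/request"   # hypothetical endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

root = ET.fromstring(requests.get(endpoint, params=params, timeout=30).content)

total = explicitly_open = 0
for record in root.iter(OAI + "record"):
    total += 1
    text = " ".join(r.text or "" for r in record.iter(DC + "rights")).upper()
    # Crude test: treats any CC-BY/CC0 mention as open
    # (a stricter check would exclude NC/ND variants).
    if "CC-BY" in text or "CC0" in text or "CREATIVECOMMONS.ORG/LICENSES/BY" in text:
        explicitly_open += 1

print(f"{explicitly_open}/{total} records in this batch carry an explicit open licence")
```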

The Open Knowledge Foundation has defined “Open Knowledge” very clearly (http://www.opendefinition.org/ ):

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

AND the OKF has spent time cataloguing which licences are OKD-compliant and which are not. For example, CC-BY is compliant and CC-NC is not. OSI licences are compliant. A document without a licence is not, de facto, compliant. So if something is OKD-compliant you can sleep; otherwise you can’t.
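In code, that compliance check is little more than a lookup against the OKF’s list. The sketch below is illustrative only – the identifiers and the helper are my own, not an official OKF tool, and the authoritative register lives at opendefinition.org – but it captures the operational rule: explicit compliant licence, or treat as closed.

```python
# Illustrative only: a hand-maintained allowlist standing in for the OKF's
# authoritative register of OKD-compliant licences (see opendefinition.org).
OKD_COMPLIANT = {
    "CC-BY", "CC-BY-SA", "CC0", "PDDL", "ODC-BY", "ODbL",   # open content/data licences
    "MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0",         # OSI-approved licences also qualify
}
NON_COMPLIANT = {"CC-BY-NC", "CC-BY-ND", "CC-BY-NC-SA"}     # NC/ND restrictions fail the definition

def okd_status(licence):
    """Classify a licence identifier; None means no licence was stated."""
    if licence is None:
        return "closed: no licence means no rights granted"
    if licence in OKD_COMPLIANT:
        return "open: you can sleep soundly"
    if licence in NON_COMPLIANT:
        return "closed: the restriction fails the Open Definition"
    return "unknown: treat as closed until checked against opendefinition.org"

print(okd_status("CC-BY"))     # open
print(okd_status("CC-BY-NC"))  # closed
print(okd_status(None))        # closed (unlicensed)
```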

All of this leads to a recent discussion (http://lists.okfn.org/pipermail/open-bibliography/2011-September/thread.html ) on OKF’s Open-bibliography list about the use of “Open” (http://lists.okfn.org/pipermail/open-bibliography/2011-September/001141.html).

David Weinberger <self at evident.com> wrote:

 

> LibraryCloud is a metadata server gathering library metadata (circ data, user reviews, etc.) and making it openly available via APIs and Linked Open Data.

......

> We are on the verge of making it accessible to a limited public (API key required, daily queries limited to 3,152). We're interested in contributing what we can as we can. (No, we cannot make its catalog available in its entirety. We wish.)

 

These two paragraphs contradict each other. Is LibraryCloud an open data provider, or not? Either the data is open, and it is possible to get hold of the entire dataset with a clear open license for what you can do with it, or it is not open. It is wrong to call data open if it is subject to arbitrary access restrictions like 3K entries a day.

--Jim [Pitman]
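PMR: Jim’s point can be made concrete with a back-of-envelope calculation. The catalogue size below is purely an assumption for illustration (the thread gives no figure); only the order of magnitude matters.

```python
# Purely illustrative arithmetic: time to mirror a catalogue through a
# rate-limited API. The catalogue size and one-record-per-query assumption
# are mine, not figures from the LibraryCloud thread.
records = 12_000_000        # assumed catalogue size (illustration only)
per_query = 1               # assume each query returns one record
daily_limit = 3_152         # the daily query cap quoted in the thread
days = records / (per_query * daily_limit)
print(f"~{days:,.0f} days, i.e. about {days / 365:.0f} years, to obtain the whole dataset")
```

However generous the intent, that is some way from “free availability … without financial, legal, or technical barriers” in the Budapest sense.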

 

A lively and fruitful discussion followed, with some arguing that “open” == OKD-compliant and others arguing that “open” is an arbitrary point on a spectrum. For which read “glorious”. Here are two typical passages:

 

If something isn't "open" according to [OKD] strict standards it isn't open at all. This completely misses the fact that "open" as in the Harvard API may be completely fine and useful for nearly all real world purposes.

 

PMR: The good intention is clear, but “open” gives no other information. It does not keep the lawyers away.

we cannot make all our catalog data available for bulk download. That is a limitation we all regret but there it is. I would argue that because the data we make available we make available without restriction, it is reasonable to use "open" as a modifier.

 

PMR: This shows the complexity. Perhaps the individual items *are* Open. In which case, good – and in that case give each of them a licence.

The sad fact is that in many cases “open” == “glorious”.

If we are to operate usefully within the law, then the only practicable way is to use the Open Definition. Everything else may be an interesting point on a political spectrum, but only the OKD makes us safe.

And it brings in the promised land of infinite re-use of knowledge.

There *are* technical concerns with OKD-Open APIs, and I’ll discuss them in the next post.

 

Posted in Uncategorized | 3 Comments