#sparc2012 a manifesto in absentia for Open Data

Dear #sparc2012

I am very sorry that I can't be physically present with you, especially since we are at a critical time for #scholpub. I'd have liked to meet and come up with new ideas on how to change the world. As it is, the iffy technology (your words) means I shall write a blog post and then either I or John @wilbanks will present it. Maybe John will splash this blog up, click to the relevant pages, and say what he thinks needs saying. Or maybe he'll read it verbatim and perhaps add comments. Whatever. I hope that either way my message will get through.

I've been asked to talk about Rights and Open Data. The Rights I care about are not those of academia, nor authors, nor publishers, but those of the 99% of the world who cannot get effective access to scholarship. The #scholarlypoor. If you accept that, then here's the first principle:

  1. Access to the fruits of publicly funded research is a fundamental human right

We spend about 300 billion USD every year on Science, Technology, Engineering and Medicine (STEM). [I'll probably use "science" to cover all, and all figures are ± half an order of magnitude, so 100-1000 B USD.] That's about 50 dollars for every inhabitant of the planet. And almost no inhabitant (including 99% of those in rich nations) has effective access to this output. Now we have won many human rights and we can win this one.

It will cost money to deliver, but then it will be gratis (free to consume) and libre (free to reuse). I pay for my water. I pay for education. Those who can't afford them still have equal access. These are human rights which we have largely solved. It's the same with science. I can re-use my education as often as I wish without permission.

[Note. The "OA definition" of "libre" as "some restrictions removed" is unacceptable. Some of us (including Wikipedia and OKF) are actively working to make sure "libre" is properly defined and used.]

The second principle

  2. Scientific data should be libre at the point of creation

This is why we initiated the Panton Principles. There are many reasons for making scientific data libre:

  • It belongs to all of us
  • It is required to validate the science (most scientific papers are merely advertisements for the work, not the work itself – who said that?)
  • It can be re-used in millions of planned and unexpected ways

Scientists and academia have lost control of the authoring process. They must regain it, and part of this is to regain control of data. So corollary:

2a. Science data should be stamped as Libre (Panton)

Almost all data is now produced either from instruments or from scientific software. (In my field of computational chemistry there is probably 10 B spent per year on computation: machines, people, software. Maybe 100 million (computational) jobs or more.) All the public fruits of this could be collected and stamped as libre. Similar ideas apply to images (the microscope software could stamp with "Open Data", the phone app taking gel pictures could do the same). Everything on Figshare or Dryad could be watermarked.
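As a rough illustration of what such stamping could look like, here is a minimal Python sketch that writes a machine-readable licence "stamp" next to a data file. The sidecar file naming, the field names and the functions `stamp_libre` and `is_stamped_libre` are assumptions made up for this example, not an existing Panton, instrument-vendor or repository convention; the licence URL is the Creative Commons CC0 1.0 dedication.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical stamp format: the field names below are illustrative only.
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def stamp_libre(data_path, creator, licence_url=CC0_URL):
    """Write a JSON sidecar declaring the data file libre; return its path."""
    data_path = Path(data_path)
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    stamp = {
        "file": data_path.name,
        "sha256": digest,  # ties the stamp to these exact bytes
        "creator": creator,
        "licence": licence_url,
        "open_data": True,
    }
    sidecar = data_path.with_name(data_path.name + ".licence.json")
    sidecar.write_text(json.dumps(stamp, indent=2), encoding="utf-8")
    return sidecar


def is_stamped_libre(data_path):
    """True if a sidecar exists, matches the file's bytes, and declares open data."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".licence.json")
    if not sidecar.exists():
        return False
    stamp = json.loads(sidecar.read_text(encoding="utf-8"))
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return stamp.get("sha256") == digest and stamp.get("open_data") is True
```

The point of the sketch is that the stamp is applied by the software that creates the data, at the moment of creation, so that neither the scientist nor a later reader has to guess at the terms.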

Huge amounts of fruitless effort are spent on bad licences. Unfortunately some of these seem deliberate – to confuse rather than help. The latest Wiley paid OA (4000 USD, "fully open access") "Chemistry Open" has so many restrictions that it's effectively closed. Why do the library community and SPARC not challenge this? So the way forward has to be clear licences.

  3. Open Access and Data require clear, libre licences

The Open Access community has failed to address this for 10 years (since BOAI/BBB). BOAI/BBB were/are great declarations, in the tradition of liberation. But most of the OA community honours them in name only. Everyone at #sparc2012 should be able to recite, by heart:

"By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited."

And, for the purposes of data, free to "use, re-use, and redistribute" (OKDefinition).

    3a. Libre material should be clearly stamped for human and machine discoverability and reuse

UK/PubMedCentral shows the current problem clearly. It is impossible to search for "Open Access" material and even harder – almost impossible – to search for BOAI-libre material (i.e. minable). Our recent @ccess group is trying to index the malaria literature for BOAI-Openness and it has to be done paper-by-paper – IMO this is unacceptable after 10 years. University IRs are even worse. Here's mine http://www.lib.cam.ac.uk/repository/about/end_user_terms.html

"Unless otherwise noted, Deposited Works in DSpace@Cambridge are made freely available for access, printing and download for the purposes of non-commercial research or private study only."

So – Institutional Repositories are set up just for academics – no one else matters. You can't use DSpace@cam for:

  • Teaching schoolchildren
  • Ideas for high-tech business (Cambridge is the UK's centre for high-tech)
  • Helping a patient understand their disease
  • Writing books
  • and 101 more examples (see Mike Taylor's http://whoneedsaccess.org )

So the next principle

  4. Only use CC-BY, CC0 and other BOAI-compliant licences.

Abandon NC, Non-commercial. It effectively prevents anything useful. (Maybe Mike Carroll will cover this, but it needs restating again and again). Corollary:

    4a. Publishers of Open Access ("Gold OA") should use BOAI-compliant licences.

Ross Mounce (a graduate student) has done a tremendous job of collating the hybrid OA licences of major publishers and out of over 100 finds that only 5% are BOAI-compliant. Authors are paying lots of money (1000-5000 USD) for this, publishers are restricting re-use to the point of uselessness, and academia accepts this without a squeak. Surely this is where SPARC should be labelling offerings as BOAI-acceptable or non-acceptable. But no, we have given in and allowed this mess of "slightly Open Access". Some of the publisher terms are so badly written, piling restriction on restriction, that they are probably not even executable consistently.

And now some more general ideas on "textmining". Over the last 2 weeks I have blogged about information mining (a better term as we can mine images, speech and video for facts as well). The core is defined in: http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/ . I've been trying to do this for years and failing to get permission. Publishers (and libraries) have a three-valued logic:

  • Yes you may (very rare)
  • No you may not (common)
  • Mumble (make nice noises but actually say nothing). Mumble includes "let's set up a meeting", "let's talk with your librarians", "I'll refer you to our director of marketing" and much else. Mumble means hoping the problems will go away. Silence is Mumble.

Understand that mining is about LOTS of information. And the web can cope with lots of information. As an example ONE STUDENT (albeit the very smart Daniel Lowe) last week downloaded 65,000 patents and extracted 500,000 chemical reactions AND interpreted the text and diagrams AS ATOMS and MOLECULES. Very high quality semantic information. Here's what we CAN mine:

  • Patents
  • BMC and PLoS
  • Supplementary info on publishers' web pages (FWIW we have downloaded 250,000 crystal structures in this way and haven't crashed any servers)

What we CAN'T mine are:

  • Closed access publications
  • Green OA (can't find it and anyway no rights)
  • Gold hybrid OA (can't find it and cannot machine-read the licences to find BOAI)
  • UK/PubMedCentral (impossible to find the BOAI-compliant subset)
  • Institutional Repositories (impossible to navigate and no rights in most cases anyway)
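If licences were stated as machine-readable URLs, finding the minable subset would be a simple filter rather than a paper-by-paper slog. A minimal Python sketch, assuming each article record carries a Creative Commons licence URL under a `licence` key (the record layout and the function names are hypothetical); following the principle above, it treats only CC-BY and CC0 as BOAI-compliant and rejects NC, ND and anything it cannot recognise.

```python
from urllib.parse import urlparse

# Licence codes treated as BOAI-compliant in this sketch (an assumption
# following the text: CC-BY and CC0 only).
LIBRE_CODES = {"by", "zero"}
CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def cc_licence_code(url):
    """Return the CC licence code from a URL, e.g. 'by-nc', or None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in CC_HOSTS:
        return None
    parts = [p for p in parsed.path.lower().split("/") if p]
    # Expected shapes: /licenses/by/3.0/ or /publicdomain/zero/1.0/
    if len(parts) >= 2 and parts[0] in ("licenses", "publicdomain"):
        return parts[1]
    return None


def is_boai_libre(url):
    """True only for licence URLs whose code is in LIBRE_CODES."""
    return cc_licence_code(url) in LIBRE_CODES


def minable(records):
    """Keep the records whose stated licence is BOAI-compliant."""
    return [r for r in records if is_boai_libre(r.get("licence", ""))]
```

Records with no licence statement at all are treated as not minable, which is exactly the situation for most of the categories listed above.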

I've been asking Elsevier for 2.5 years whether we can text-mine. Have I got "Yes" or have I got "Mumble"? I'll post today's mail and let you judge.

I'm not the only one. Here's Max Haussler, writing to publishers for permission to text-mine http://text.soe.ucsc.edu/progress.html . Some have taken two years of negotiations. Half haven't responded. This is an industry that Eefke Smit says is extremely helpful to requesters.

Where the publishers do respond, they want to control what research we can and can't do. See UBC and Heather Piwowar negotiating with Elsevier http://researchremix.wordpress.com/2012/03/05/talking-text-mining-with-elsevier/ . "Alicia had indicated that Elsevier facilitates text mining on a project-by-project basis". For me this is unacceptable. There is no reason in the world why Elsevier or any other publisher should "facilitate my research". ("facilitate" is Newspeak as is "Universal Access" – means of restricting access). HP: "two of my text mining use cases require reuse rights that are outside the standard Elsevier agreement". Yes, Elsevier writes additional restrictions in the contract. HP: "I asked for the text of the standard reuse agreement. It was sent to me but I was asked not to share it publicly because 'it is a legal element'". So we cannot even know what we aren't allowed to do. This is Universal Access??

So another right:

  5. Universities/subscribers should refuse to sign any contracts more restrictive than copyright itself; they should publicize the contracts

Librarians or purchasing officers should read contracts before they sign them and publicise any clauses which restrict subscribers' Rights. (I can't imagine how librarians all over the world have signed that we may not index the literature we buy (sorry, rent).)

    5a. Where the extracted material (e.g. facts) does not violate copyright then it should be made public and posted Openly under a libre licence

I'd like to think that all involved (publishers, universities) would wish to sign up to the principles I have outlined. I'm happy for #sparc2012 to take any or all of them and wordsmith them.

And finally I'd like volunteers to work with me on extracting useful factual chemistry from the closed literature.

 

0 thoughts on “#sparc2012 a manifesto in absentia for Open Data”

  1. Claire Stewart

    Peter: is Ross's research ("Ross Mounce (a graduate student) has done a tremendous job of collating the hybrid OA licences of major publishers and out of over 100 finds that only 5% are BOAI-compliant.") publicly posted? Thank you for this post/manifesto; we were sorry not to have you at SPARC2012 yesterday but John filled admirably -- two for one in fact.

  2. Mike Taylor

    An excellent list of principles.

    Can I offer one that is even more fundamental and would cost barrier-based publishers absolutely nothing to implement? Every publisher should clearly, explicitly and visibly state what terms content is published under. Just knowing what rights we have would be an important start. At the moment, my feeling is that the only safe approach with (say) Elsevier is to do nothing at all with their articles, even when I have a subscription, because I just can't tell what I am and am not allowed to do. I'm sure this sort of chilling effect can't be deliberate.

