petermr's blog

#okfest #openscience: the hacking begins: OpenBib to analyse patterns of science funding

Posted on September 18, 2012 by pm286

The great thing about a good hackathon is that you don’t know what you will be doing until you get together and pool ideas and expertise. Today we have our Open culture and science hack-day. About 20 “streams” ranging from Wikimedia-editing to #openbiblio. And we are working on a project that comes out of two ideas:

Tom Olijhoek is one of the driving forces behind @ccess – true Open Access to the scientific literature. Tom runs OpenMalaria. His current idea is that science can be influenced through the way it’s funded – especially in the areas of medicine and food. Can we discover this from what appears in the public literature? Although most published science is CLOSED, we can pick up a lot from the metadata and from the 5% of Open Access science.
In parallel we have the #openbiblio project which is building the technology to make Open Bibliography fully re-usable. Examples can be found on http://bibserver.org (technology) and http://bibsoup.net (collections of bibliography)

We’re combining these in a project to scrape the bibliography from BioMedCentral (one of the few sources of Open Access publishing). Here’s the team starting to make Open Access BioMedCentral BibSoup (note Gulliver the Open Access turtle):

And here’s the actual team:

L2R: Jenny Molloy, Laura Newman, Michael Bauer (and some still at lunch) and Daniel Lombraña González (who’s running the PyBossa project) and me with the iPhone. (And Chuff the OK-API).

Michael is writing the scraper and already has a chunk of bibliography. He’s got a background in hacking and biomedicine and is now with OKFN.

If you are reading this remotely and want to be involved we have an Etherpad – tweet @petermurrayrust


#okfest: The preparation

Posted on September 17, 2012 by pm286

Today is when people start arriving en masse for OKFest – the main sessions start tomorrow. I’m hanging out with the organizers, volunteers, etc. The staff and volunteers are doing the last-minute stuff, which is always more than you think. The programmes and bags have to be packed. The name labels have to be checked. The venues have to be got ready, and so on. I’m fortunate – I have relatively little to do – other than running a panel on Wednesday. So here are some photos.

Immediately after arriving – now the work starts:

This is a maker project. I think it’s a machine to make things. Very bravely the instructions were sent from Chicago and cut in Helsinki. Are all the bits there? More or less, yes! I’ve helped by trimming off some scarf.

And here the real work is happening. Staff and volunteers hacking the final details

Sam Leon doing the analytics:


#okfest name badges

Posted on September 16, 2012 by pm286

OKFest is about fun, and making things (and a lot more). So here are examples of our laser-cut badges.

Chuff went to Massimo, smiled and asked could he have a name badge. “#OKFEST, CHUFF, OKAPI”. He is now the smartest OKAPI in the whole world:

Kat has done a huge amount of work and deserves an even more special badge. So Stuart Childs (@sc_r) has made special light-up badges. They are ultra cool and shine until the battery gives out. Here’s the whole crew displaying:

Clockwise: Juha, Stuart (with 2 badges), Massimo, Kat (hand only)


Open Content Mining: Authors and Readers should control the process; act before it’s too late

Posted on September 16, 2012 by pm286

Scientists write articles so people can read them. But now we’ve realised that machines can read them as well. And we have argued (/pmr/2012/09/16/our-manifesto-the-right-to-read-is-the-right-to-mine-universities-you-must-fight-for-open-content-mining-before-its-too-late/) that if you subscribe to read an article you also have the right to use machines to read it and extract and re-publish facts.

It’s critically important that we unite behind this view. Because at present:

The STM publishers prevent content mining
There are signs that they wish to develop this as a new activity (for which they will undoubtedly charge, build walled gardens, and otherwise restrict access to their content).

“their content”? We wrote it. And while they might tweak the odd bit of text they cannot and must not alter the facts. So the facts are created directly by scientific research.

The pharmaceutical industry is desperate for these facts. So the publishers got the pharma industry to meet them (I think in Bruges) a few months ago and have hammered out an agreement.

http://www.stm-assoc.org/2012_09_12_PDR_ALPSP_STM_Text_Mining_Press_Release.pdf

I don’t know the details, because they are almost certainly secret. (I’ll write to STM and ALPSP and they can save me the bother by replying to this blog). If the details aren’t secret then one-cheer. (No more cheers till we see the details).

We’ve already seen on this blog that Springer are reselling their image content – double or even triple dipping into freely given content. So Springer, do you intend to charge for content mining in a fourth dipping process?

Why do I trust the worst rather than the best in this deal? Here’s the meat of what has been reported (I know no more and would be grateful for insight). Sorry it’s an image but the original is a PDF:

Nothing has been said about cost/prices so I won’t speculate. The implication here is that the results can be mounted within the pharma company but not published.

I have spent three years trying to get permission to text-mine, e.g. from Elsevier’s Directorate of Universal Access, without any progress. Universal Access is extremely helpful (because they say so). Heather Piwowar spent months getting an agreement for one group in one university (UBC). She tweeted (and I have her brave permission to re-quote):

“I am LOATH negotiating with publishers. It gives me hives” (Hives is a disease)

Elsevier have negotiated 20 text-mining contracts over the last 5 years – that’s an average of about 0.33 a month. There is no way they can or will scale to meet this demand (even if they wanted to). Then there are 100 other publishers, all currently with restrictive licences.
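The rate is easy to check: 20 contracts over 5 years (60 months) works out to about 0.33 a month. Here is a minimal Python sketch of that arithmetic. The figure of 1000 institutions is purely a hypothetical illustration of mine, not a number from any publisher, and the function names are my own.

```python
# Figures quoted in this post: 20 text-mining contracts negotiated over 5 years.
CONTRACTS = 20
YEARS = 5


def contracts_per_month(contracts: int, years: int) -> float:
    """Average number of contracts signed per month."""
    return contracts / (years * 12)


def years_to_cover(institutions: int, contracts: int, years: int) -> float:
    """Years needed to sign one contract per institution at the same pace."""
    contracts_per_year = contracts / years
    return institutions / contracts_per_year


rate = contracts_per_month(CONTRACTS, YEARS)
print(f"{rate:.2f} contracts per month")

# HYPOTHETICAL: suppose 1000 institutions each needed their own deal.
print(f"{years_to_cover(1000, CONTRACTS, YEARS):.0f} years at the current pace")
```

Whatever the true number of institutions, one-at-a-time negotiation at this pace cannot keep up, and that is before counting the other publishers.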

So the danger is that whatever is negotiated here will be put in front of Universities/Librarians, whose track record is that they don’t publicly challenge any contracts.

Please do not sign any content-mining contracts without alerting the world.

It is critical that we, the scientific and machine-readership, argue for our rights, not the commercial benefit of publishers.

This is where YOU have to make a stand.


#okfest is and will be amazing

Posted on September 16, 2012 by pm286

I am at #okfest in Helsinki – 13 (THIRTEEN) separate tracks on Openness. I am already gutted that I shall miss most of them because they are in parallel. I’m helping run Open Research and Education and obviously that will be much of my time. (BTW we are full up so I won’t urge you to come, sorry).

As an example the OKF is running all the conference management itself. Name labels? Make them! Here’s Massimo – a doctoral student in the media lab in Aalto and he’s running the lab’s laser printer to make the labels from plywood. The laser printer is on the left:

And here is what he creates:

(This one is acrylic; you can see the label is for a “CREW” member.) My job (I think) is to help punch out the little chads in the middle.

Chuff the OKAPI has already been tweeted, with a comment (I think) from IBM that it should be a RESTful API. So here’s the restful Chuff:

The humans are:

Joris Pekel (Open heritage GPP / Amsterdam)
Juha Huuskonen (OKFestival coordinator / Helsinki)
Kat Braybrooke (OKFestival coordinator / London)

It’s just so exciting to be at a meeting where people discuss how to share and change the world. I’ve met Stuart who runs an Open laser workshop in Leeds and works with DaveMR (Mo-seph) and Matt Venn.


Our manifesto: “The Right to Read is the Right to Mine”; Universities: you must fight for Open Content Mining before it’s too late

Posted on September 16, 2012 by pm286

Over the last ten years University (Libraries) have signed or re-signed one million contracts with scholarly publishers (e.g. Elsevier) which forbid re-use of the subscribed material. Thus, for example, if your university rents a scientific journal for 5000 USD a year (a not uncommon figure) you are not allowed to extract factual information from this and republish. If you buy a BOOK (a fast disappearing thing) you can extract the facts by hand and re-publish them in Times or Trafalgar Square. But not if it’s electronic. It is of course much easier to do and much saner and what this century is all about.

But publishers add restrictive clauses to their licences and librarians just sign them.

One million times my rights have been signed away without my even knowing. The very least that should have happened is that the libraries should have alerted the rest of the world and refused to sign. But no – the only thing they are worried about is price (and they aren’t very good at keeping that down – Elsevier make 30% profit). Everything else in the information world has gone down in price, but scholarship costs more each year. And, of course, it’s actually written by you and me and given to the publishers. They don’t even produce it in a modern efficient manner – in a non-protected market #scholpub would go out of business in a year. (stop ranting, PMR and get to the point).

Jenny Molloy has collected a range of publisher restrictions that libraries sign (see full paper http://www.dspace.cam.ac.uk/bitstream/1810/243749/1/ofa.pdf). Here’s one (from Elsevier):

“Schedule 1.2(a) General Terms and Conditions “RESTRICTIONS ON USAGE OF THE LICENSED PRODUCTS/ INTELLECTUAL PROPERTY RIGHTS” GTC1] “Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement. “

Fairly clear. Readers cannot do ANYTHING with machines. The others are just as restrictive. I cannot imagine how anyone could sign this without alerting the world to the problem.

And it gets worse. Elsevier will “allow” text-mining, but only if the individual scientist and their librarians negotiate a secret deal with Elsevier (as Heather Piwowar and UBC were required to do). This is completely unacceptable and doesn’t scale.

So we (Diane Cabell, Jenny and I) using the OKF lists have created a manifesto. The only things you need to remember are in bold type:

Principle 1: Right of Legitimate Accessors to Mine

We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes. The right to read is the right to mine

Principle 2: Lightweight Processing Terms and Conditions

Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required. Users and providers should encourage machine processing

Principle 3: Use

Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts and context text as fair use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited. Facts don’t belong to anyone.

And Diane wrote a superb supporting text (see paper) which explains the rationale, the law, and what we should do. Jenny and I stitched it together in a slightly frantic rush, added pictures, tables, references, etc. http://www.dspace.cam.ac.uk/bitstream/1810/243749/1/ofa.pdf. I have been elected to the Fellowship of the OpenForum Academy (http://www.openforumacademy.org/) who are meeting on Sept 24. I can’t go, so I have offered this paper.

The publishers have woken up to the fact that text-mining matters. They are starting to do secret deals with subscribers (I’ll write about their deal with pharma next blog). They’ll start to create walled gardens, special extra terms and who knows what.

Whereas it’s actually our RIGHT to do this.

So universities and librarians – are you going to watch while yet another set of rights disappears uncontested?

Or are you going to fight for my (and everyone else’s) rights?


#animalgarden at Digital Research 2012 (#openbiblio and BibSoup) and OKFest

Posted on September 14, 2012 by pm286

It’s a very busy time for #animalgarden – the group of animals committed to Openness. Last month they made the allegorical movie of weak chemical AI (“Magic Chemical Panda”, /pmr/2012/08/17/animalgarden-present-the-chemical-chinese-room-at-the-american-chemical-society-meeting/). Now they’ve been busy on #openbiblio and #okfest.

[PMR: I presented #openbiblio among a series of Open tools I call “Liberation Software” designed to create Open information, especially in #scholpub and Open Scholarship.]

Mahendra Mahey ran a great evening session for new ideas / software along the lines of Dragon’s Den. We all had to pitch. I showed the #animalgarden video explaining how Bibsoup works https://vimeo.com/35458484

Gulliver is the Open Access turtle from BioMed Central. He’s very keen on making things Open.

We’re VERY busy. We’ve got a new member of #animalgarden who is helping us with #okfest next week

It’s Chuff the OKAPI

Chuff wears the OK logo.

OKFest is growing rapidly. We’ve got sessions on science and open-access. It’s impossible to take in everything.

It’s a must-attend event.


Data Liberation and the Long Tail: (and a puzzle)

Posted on September 7, 2012 by pm286

Next Tuesday I am giving an invited talk at Oxford on Open Data http://digital-research.oerc.ox.ac.uk/ , http://digital-research.oerc.ox.ac.uk/programme , and am also involved with a session run by the OKF immediately afterwards. As always I don’t know what I am going to say until 0200 of the morning of the talk – this gives a chance to talk with delegates and get a feel for what is valuable.

I’ll touch on at least the following:

The Long Tail. Scientific disciplines which have little formal information infrastructure but huge amounts of science. Disciplines such as bioscience (outside mainstream bioinformatics-support, such as phylogenetics), chemistry, materials science, observational sciences (other than astronomy), much computational and simulation research. Much of the data is valuable but thrown away. I estimate billions (sic) of dollars is wasted through non-existent infrastructure.
Graduate Students. A seriously misused resource. Much of the innovation comes from third-year postgraduates and we need to give them expression.
Software/informatics as a first-class activity. Builders of scientific software are often denigrated as not “doing proper science”, but they are every bit as important as the scientists who build telescopes and other instruments.
Bottom-up communities. There is a huge cognitive/informatics surplus if we treat the citizen community as equals and not inferiors. (Much of the software we work with is developed outside “research universities”.) We should be helping this grow.
Liberation software. I and others are building software which will free data in dark silos, repositories, theses, journals, etc. I’ll present some of this in the afternoon briefly. The main battle we face is closed minds and vested interests; liberation software will leapfrog many of these.

I’ll be showing some of this in action, but here’s a taster. It comes from the supplemental information in a paper behind a publisher’s firewall. I don’t know if I am allowed to show it, but I’ll take the chance. It’s a mass spectrum – in simple terms it measures the mass of a molecule (here to 4 sig figures). Here are some questions. (Please add answers as comments because then I know people are interested and also I might learn something). [BTW this is how it appears in the paper – I assume the journal prints text upside down to make it easy for Australians, but I have to hang from the ceiling to read this.]

UPDATE: Walter Blackstock has given some answers and I reply

Questions (in order of difficulty):

What’s the constitutional formula of the compound? (relatively easy for chemists)
How many peaks are there? (harder than it looks)
How would you find where this diagram was published? (very hard)

On Tuesday I will show how Liberation Software AMI2 can be used to answer Q 3.


Ross Mounce’s Visualization of “Gold” Open Access Rights and Prices

Posted on August 31, 2012 by pm286

This blog highlights some splendid work done by Ross Mounce, one of our Panton Fellows. Ross actually started this before he applied to us, but he’s done and thought a lot since so we can claim a little reflected glory.

The work is blogged at http://rossmounce.co.uk/2012/08/30/a-visualization-of-gold-open-access-options/ .

“To try and publicize the variety of Gold Open Access article publication options on offer, I’ve decided to create a visualization of the journal data that has previously been collected as part of my survey of ‘Open Access’ publisher licenses’ spreadsheet. ” [RM]. So the data can be found there.

For those who don’t know, a scholarly publication with a major publisher is only readable if your library has a subscription (up to more than 10,000 USD/year for a single journal) or if you pay one-off fees (40 USD for one day for one article). This means that most of the world (including most people in the rich West) does not have access and normally suffers by remaining ignorant.

There are three main approaches to scholarly publishing:

Create and run publications with no charges for publication or reading “Sponsored Publication”. IMO this is what we should ultimately be aiming at but critics dismiss this as “Fairy Godmother”. There is after all 15 Billion USD spent by universities per year so some of this could be put to use. Nonetheless many journals work this way, but not normally large ones.
Make an agreement with a publisher that a copy of the article can be put on a permanent site (“Green”). This copy is not normally the final published article (“the PDF”) but something close. Publishers have no legal requirement to allow this and many don’t. The copies have to be made by the authors and many don’t take the trouble. Nonetheless some academics believe that by passage of years and campaigning they can force all academics to deposit green manuscripts.
Pay the publisher (APCs or Article Processing Charges) to make the final article publicly readable (“Gold”). There are two mechanisms:

Choose a journal where all articles are Open Access. Examples of such are PLoS and BiomedCentral journals, Acta Crystallographica E, Atmospheric Chemistry and Physics and many more. This is straightforward if there is a journal in your subject (though I and others question the need for journals). You pay the price (NOT the cost) (AFTER the article is accepted, so not “vanity publishing”) and your article is published Openly. I have managed to do this for almost all my recent publications – but it costs money.
Choose a “closed” journal, i.e. where most of the articles are not readable by the public and pay the journal APCs. This is “hybrid Gold”. For many scientists this is the only viable option, since most of their natural outlets are closed. One obvious concern is that we are paying twice – once to publish and once for people to buy the journal (most of which is closed). The publishers assert that they lower their prices to account for this and although they don’t disclose accounts we trust them because publishers are by nature trustable, as are banks.

Before commenting on Ross’s data I’ll comment that there are no effective market forces for Gold. If I want to publish in Journal X I have to pay whatever the journal sets. This ranges (as Ross shows) from 160 USD for Acta Crystallographica E to 10,000 for Nature (not on the plot as it is in Nature statements rather than on the web page).

Those outside academia may well be baffled by a charge of 10,000 USD to publish a paper that an author has authored for free (authors are not paid and no-one says they should be) and academics have reviewed for free. You can buy a good used car for that. The journal incurs costs in managing the peer-review (but not normally doing it), making the journal look nice (which most people can do with Open tools for free), hiring lawyers to stop people copying articles, hiring web experts to build tools to stop people reading articles, hiring salespeople to persuade people to buy journals, and paying large dividends to shareholders.

So IMO and Ross’s it’s important to change the way we publish science so that everyone can read it.

INCLUDING MACHINES.

Because even if you can read an article you are normally explicitly debarred from using machines to read it, or especially lots of articles. I have argued that this is costing humankind huge amounts of lost value.

There is a formal way to ensure that machines ARE allowed to read articles, and that’s to add a licence explicitly allowing them to do this. The only well-known licences that are acceptable for this are CC-BY or CC0.
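That rule is simple enough to write down as code. Here is a minimal Python sketch, assuming licences turn up as free-text strings on a journal page. The function names and the normalisation are my own illustration, not anything a publisher or Creative Commons provides, and a missing licence is treated as "not allowed".

```python
from typing import Optional

# The two licences named above as acceptable for machine reading.
MINEABLE = {"CC-BY", "CC0"}


def normalise(licence: str) -> str:
    """Upper-case, unify separators and drop version numbers.

    'cc by 4.0' and 'CC-BY' both become 'CC-BY'.
    """
    tokens = licence.upper().replace("_", " ").replace("-", " ").split()
    # Drop tokens such as '3.0' or '4.0' that only carry a version number.
    tokens = [t for t in tokens if not t[0].isdigit()]
    return "-".join(tokens)


def allows_mining(licence: Optional[str]) -> bool:
    """True only for an explicit CC-BY or CC0 statement."""
    if not licence:
        return False  # no licence stated, so nothing is explicitly allowed
    return normalise(licence) in MINEABLE


for text in ["CC-BY", "cc by 4.0", "CC0 1.0", "CC-BY-NC", None]:
    print(text, allows_mining(text))
```

The check is deliberately strict: anything with an extra clause bolted on fails, and so does silence.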

But many publishers do not provide CC-BY even if authors have spent thousands of dollars. This is a unilateral decision by most publishers and IMO this is immoral, and unethical. There is no justification for this (it does NOT protect authors – scholarly norms do that). These lesser licences include CC-NC (“non-commercial”) and CC-ND (“no derivatives”). Many – including me – have argued that these are counter-productive to scholarship. The publishers include them for a variety of motives:

Laziness, incompetence and ignorance. This is not excusable – after all we are paying publishers zillions – but it’s probably the easiest to change. So Ross’ plot is a real opportunity to name-and-shame publishers who couldn’t be bothered to think about licences. The worst category on the diagram is “no clear licence” and we hope that many publishers will realise that by a simple process of adding a single phrase to their publication they could whizz to the top.
A misguided idea that “non-commercial” is a good thing. It isn’t. Its main effect is to hit academics themselves (e.g. can’t use in books), small businesses, government (buying publications is a commercial act), etc. If you aren’t convinced we’ll help change your mind.
A desire to milk the system for every last drop. Publishers want to retain the right to resell the paper and its diagrams as reprints, in books, etc. Free-to-read is not free-to-reuse.
And other means of trying to control academics, libraries etc in a confusing and highly profitable market.

So armed with that, re-read Ross’ plot. “Good” is at the top, “unacceptable” at the bottom. Some points will not be in the right place on the diagram. There are several reasons:

It’s often very difficult to find out the price (e.g. when there are page charges and colour charges (coloured electrons cost more on the internet)).
Many publishers (especially those with society journals) have many different journals
There are special deals – if you belong to some institutions they get reduced author rates
The licence information is so badly written it’s impossible to work out what’s happening (answer, use a CC licence – either CC-BY or CC0)
Some publishers offer more than one licence. I can’t understand why – they should offer only the most liberal.

Then there is the question of ownership and copyright. But that’s another day.

The price axis is one of the areas we should be addressing. The price bears NO RELATIONSHIP to the cost (except for journals at the LH side of the plot, like Acta Crystallographica E). It doesn’t COST Nature 10000 USD to publish an article that has been written and reviewed for free. It doesn’t COST Perrier umpteen dollars to fill a bottle with water that comes out of the ground. These are vanity prices, and academics don’t care as long as the taxpayer or students are paying for library bills.

But you form your own conclusions from the plot. Comment on this blog or Ross’ if you think the data are wrong.


Lee Dirks

Posted on August 30, 2012 by pm286

Lee Dirks died yesterday with his wife in a car accident in Peru:

http://latino.foxnews.com/latino/news/2012/08/29/2-americans-peruvian-die-in-peru-highway-accident/

Many have already written about Lee today, e.g.: Savas Parastatidis
http://savas.me/2012/08/rip-my-friend-lee-dirks/

And John Wilbanks: http://del-fi.org/post/30531593681/remembering-lee-dirks

So I’ll try to add something different.

I met Lee several years ago after Tony Hey moved to run Microsoft External Research. Lee was the person we immediately interacted with and who was the lynchpin of the relationship. Lee was fun, focused, dynamic, everywhere, with a huge involvement. He was the centre of any group. He was fun to listen to, relaxing, entertaining. You never felt stress when Lee was around.

He made things happen. We worked together for three years on Chem4Word (http://research.microsoft.com/en-us/projects/chem4word/ ) and I think this is one of the many things that he would like to be remembered for:

The chemistry is important but it’s not the most important thing. The task was enormous: create a working, modern chemical authoring system for the Word/.NET environment. By conventional methods it would never have happened. And indeed it started slowly, with Lee steering Microsoft to work effectively with an external group on – literally – a daily basis. But as we developed Lee was able to spot opportunities and change direction when it really mattered. And something that could never have been dreamt of at the start of the project – Lee steered it to being completely Open Source.

We had a lot of laughs – you cannot survive a project like that without them. And Lee was at the centre. For me, you are still with us.

