petermr's blog

A Scientist and the Web

 

Jean Claude Bradley Memorial Symposium; July 14th; let’s take Open Notebook Science to everyone

July 4th, 2014

On July 14th we are holding a memorial meeting for Jean-Claude Bradley in Cambridge. Do come; it’s open for all. [NOTE: we hope to get live streaming for those who can't come.]
http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium

Jean-Claude Bradley was one of the most influential open scientists of our time. He was an innovator in all that he did, from Open Education to bleeding edge Open Science; in 2006, he coined the phrase Open Notebook Science. His loss is felt deeply by friends and colleagues around the world.

On Monday July 14, 2014 we shall gather at Cambridge University to honour his memory and the legacy he leaves behind with a highly distinguished set of invited speakers to revisit and build upon the ideas which inspired and defined his life’s work.

Speakers

Simon Coles, University of Southampton, UK
Robert Hanson, St. Olaf College, USA
Nina Jeliazkova, Ideaconsult, Bulgaria
Andrew Lang, Oral Roberts University, USA
Daniel Lowe, NextMove Software, UK
Cameron Neylon, PLOS, USA
Peter Murray-Rust, Cambridge University, UK
Noel O’Boyle, NextMove Software, UK
Henry Rzepa, Imperial College London, UK
Valery Tkachenko, Royal Society of Chemistry, UK
Matthew Todd, University of Sydney, Australia
Antony Williams, Royal Society of Chemistry, UK
Egon Willighagen, Maastricht University, Netherlands

For me this is not to look back but forward. Science and science communication are in crisis. We need bold, simple visions to take us out of this, and Open Notebook Science (ONS) does exactly that. It:

  • is inclusive. Anyone can be involved at any level. You don’t have to be an academic.
  • is honest. Everything that is done is Open, so there is no fraud, no misrepresentation.
  • is immediate. The science is available as it happens. Publication is not an operation, but an attitude of mind.
  • is preserved. ONS ensures that the record, and the full record, persists.
  • is repeatable or falsifiable. The full details of what was done are there so the experiment can be challenged or repeated at any time.
  • is inexpensive. We waste 100 Billion USD / year of science through bad practice so we save that immediately. But also we get rid of paywalls, lawyers, opportunity costs, nineteenth century publishing practices, etc.

and a lot more. I shall take the chance to show the opportunities:

“Open Notebook Science NOW!” – Peter Murray-Rust, University of Cambridge and Shuttleworth Fellow
Open Notebook Science can revolutionise science in the same way as Open Source has changed software. Its impact will be massive: greatly increased quality, removal of waste and duplication, and an inclusive approach to involving citizens in science. It’s straightforward to do in many areas of science, especially computational. I shall present an ONS model which we can all follow and adapt. The challenge is changing minds and to do that we should start young.

 

Mozilla Global Science Hack – A must-attend event for scientists who want programs

July 2nd, 2014

In 3 weeks from now we’ll have a massive global hack for science. Many scientists probably think software is something that other people do. “I’m not a programmer” is a frequent cry. But things are changing. Programming is increasingly about finding out what the problem is, and finding tools and people who can help solve it. If you can run a chromatograph, or a mass spectrometer or a PCR machine you can use and build programs.

The main thing is your frame of mind. If you can organize and run an experiment, you can organize data. If you can organize data you are effectively doing computing. I had the great opportunity to go to a Software Carpentry course last year and it changed my life. It showed me that I needed to understand how I think and how I work and that the rest comes relatively naturally. And it showed the value of friends.

You want a program to do X? Thinking of writing it? Chances are that much of it exists already. Much of what programs do is universal – sorting, matching, transforming, searching. And we have great toolkits – R, Python, Apache, and for visualisation D3, etc. So much of the solution is knowing what, and who, is out there.

So I’m off to Mozilla, in the heart of London. I went there for the first time a month ago – a great place that is human-friendly. Here’s the blurb – join us!

A multi-site sprint this July

(Also posted on the Software Carpentry blog.)

We’ll be holding our first-ever global sprint on July 22-23, 2014. This event will be modeled on Random Hacks of Kindness: people will work with friends and colleagues at sites around the globe, then hand off to participants west of them as their days end and others’ begin. We will set up video conferencing between the various locations and a show-and-tell at the end (and yes, there will be stickers and t-shirts).

We have booked space for the sprint at the Mozilla offices in Paris, London, Toronto, Vancouver, and San Francisco. If you aren’t in one of those cities, but are willing to help organize in your area, please add yourself to this Etherpad. We’ll hash out the what and how at the next Software Carpentry lab meeting—it’s a community event, so we’d like the community to choose what to sprint on—but please get the date in your calendar: it just wouldn’t be a party without you.

Visit of Richard Stallman (RMS) to Cambridge

July 1st, 2014

Richard Stallman (RMS) from MIT stayed with us for 2 days last week. Since RMS has a 9000-word rider on what he needs and doesn’t need when visiting, I hope I will help future hosts by adding some comments. TL;DR It’s hard work.

[RMS (St IGNUsias) selling PMR a GNU; (C) Murray-Rust, CC-BY]

I have a great regard for what RMS has done – Emacs, GNU, the 4 Freedoms. I heard him talk some years ago on Software Patents in Europe and it was great – he knew far more about the European system of government than I did; he had a clear political plan of action (who to write to, and when).  We’d corresponded but only met very briefly in a noisy room.

I posted on the dangers of publishers taking over our data, and he wrote and said he was coming to Cambridge (to talk at OWASP) and would like to talk. He mailed subsequently and said he was looking for somewhere to stay, so we offered him a bed. We’d read the rider – food requirements, temperature, music, dinner guests, etc. We were prepared for a somewhat eclectic visitor.

In retrospect we should have prepared for an Old Testament prophet or mediaeval itinerant monk. (The dressing up as St IGNUsias – above – is actually quite a close parallel and a valuable addition to the rider.) Be prepared to arrange/fund taxi rides, random food browsing, and a flexible timetable.  In fact RMS didn’t require an internet cable – he used our wireless.

But the strange thing was that we had nothing to say to each other. RMS no longer writes software and does not seem engaged in practical politics or action other than raising money for FSF through sale of swag. His message – at least for these two days – was “everyone is snooping on us” (PMR agrees and is equally concerned) and “We must only run Free software” (Free as in speech, epitomised by GPL). For me GPL has the virtue of forestalling SW patents but when I raised it he seemed to downplay it. If he has a current agenda it’s not clear to me. The “Open” word is verboten in discourse – I wished to explore whether there was any difference between Free Data and Open Data (a term I promoted 9 years ago) but we didn’t.  So there was neither a practical agenda nor a dialectic.

The visit probably had the same impact on the household as most itinerant Prophets have.

And the animals are very happy to have a new addition (Connochaetes gnou). If you believe in the GNU-slash-Linux bintarian theology here it is:

GNU

 

 

 

Content Mining: we can now mine images (of phylogenetic trees and more)

June 25th, 2014

The reason I use “content mining” and not “Text and Data Mining” is that science consists of more than text – images, audio, video, code and much more. Text is the best known and the most immediately tractable and many scientific disciplines have developed Natural Language Processing (NLP). In our group Lezan Hawizy, Peter Corbett, David Jessop, Daniel Lowe and others have developed ChemicalTagger, OSCAR, Patent Analysis, and OPSIN (http://www-pmr.ch.cam.ac.uk/wiki/Main_Page ). So contentmine.org is exactly that – an org that mines content.

But words are often a poor way of representing science and images are common. A general approach to processing all images is very hard and 2 years ago I thought it was effectively impossible. However with hard work some subsets can be tractable. Here we show you some of the possibilities in phylogenetic trees (evolutionary trees). What is described below is simple to follow and simple to carry out, but it took me some months of exploration to find the best strategy. And I owe a great debt to Noureddin Sadawi who introduced me to thinning – I haven’t used his code but his experience was invaluable.

But you don’t need to worry. Here’s a typical tree. It’s from PLoS ONE (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 ): “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”.

pone.0036933.g001

The tree has been wrapped into a circle with the Root at the centre and the leaves/tips on the edge of the circle. To transcribe this manually would take hours – we show it being done in a second.

There isn’t always a standard way of doing things but for many diagrams we have to:

  • flatten (remove shades of gray)
  • separate colours (often by flattening them)
  • threshold (remove noise and background)
  • thin (remove all pixels except the 1-pixel-thick backbone)

and here is the thinned diagram:

cleaned

You’ll see that the lines are all still there but exactly 1 pixel thick. (We’ve lost a few colours, but that’s irrelevant for this example). Now we are going to look at the tree (and ignore the labels):

cleaned0

This has been selected automatically on pixel count, but we can also use bounding boxes and many shape characteristics.
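The thresholding and select-by-pixel-count steps can be sketched in a few lines. This is only an illustration in plain Python, not the code we actually run: the function names, the cutoff of 128 and the toy image are all assumptions made for the example.

```python
from collections import deque


def threshold(gray, cutoff=128):
    """Return a binary grid: True where a pixel is darker than the cutoff.

    The cutoff of 128 is an arbitrary illustrative value.
    """
    return [[value < cutoff for value in row] for row in gray]


def connected_components(binary):
    """Group foreground pixels into 8-connected components of (row, col) pairs."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Toy grayscale image: a small T-shaped "tree", one isolated dark speck
# at (1, 5), and one light-gray pixel at (3, 4) that thresholding removes.
gray = [
    [255, 255, 255, 255, 255, 255],
    [255,   0,   0,   0, 255,  10],
    [255, 255,   0, 255, 255, 255],
    [255, 255,   0, 255, 200, 255],
]

binary = threshold(gray)
components = connected_components(binary)
tree = max(components, key=len)  # select the component with most pixels
print(len(components), len(tree))
```

On a real diagram the labels and legend form many small components, so picking the largest one is a crude but often effective way to isolate the tree.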

We now analyse the structure and break it into connected components – a topological tree – by standard traversal methods. We end up with nodes and edges – this is a snapshot of an SVG.

graphAndChars

[The black lines are artifacts of Inkscape]. So we have identified every node and every edge. The next thing is to trace the edges – that’s easy if they are straight, but here they are curved. Ideally we plan to fit circles, but we’ll use segments for the time being:

segments

The curves are actually straight-line segments, but… no matter.

It’s now a proper phylogenetic tree! And we can serialize it as Newick (or NexML if we wanted).

((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));

And here is an interactive tree by posting that string into http://www.trex.uqam.ca/ (try it yourself).

tree1
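Serializing a tree as Newick is a short recursive walk. The sketch below assumes a tree held as nested Python tuples with strings as leaves; that representation is purely illustrative and is not how our own code stores the graph. It writes topology only, with no branch lengths, like the string above.

```python
def to_newick(node):
    """Serialise a nested tree to Newick (topology only, no branch lengths).

    A leaf is a string label; an internal node is a tuple of child nodes.
    """
    def render(n):
        if isinstance(n, str):
            return n
        return "(" + ",".join(render(child) for child in n) + ")"

    return render(node) + ";"


# A tiny hypothetical tree reusing a few of the node labels from above.
small_tree = ("n122", (("n121", "n205"), "n39"))
print(to_newick(small_tree))
```

The printed string can be pasted into any Newick-aware viewer in the same way as the full tree.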

So – to summarize – we have taken a phylogenetic tree that may have taken hundreds of hours to compute and extracted the key data. (Smart people will ask “what about the text labels?” – be patient, that’s coming).

… in a second.

That scales to over a million images per year on my single laptop! And the technology scales to many other disciplines and it’s completely Open Source (Apache2). So YOU can use it – as long as you give us the credit for writing it.

 

 

 

Is this a scam or a new low for Elsevier?

June 20th, 2014

I got the following mail today. I genuinely don’t know whether it’s a scam or an unacceptable spam from Elsevier:

Measurement <measurement@elsevier.com>

9:54 AM (18 minutes ago)

Dear Dr. Peter Murray-Rust,
You have received this system-generated message because you have been registered by an Editor for the Elsevier Editorial System (EES) – the online submission and peer review tracking system for Measurement.

Here is your username and confidential password, which you will need to access EES at http://ees.elsevier.com/meas/
Your username is: REDACTED
Your password is: REDACTED

The first time you log into this new account, you will be guided through the process of creating a consolidated ‘parent’ profile to which you can link all your EES accounts.

If you have already created a consolidated profile, please use the username and password above to log into this site. You will then be guided through an easy process to add this new account to your existing consolidated profile.

Once you have logged in, you can always view or change your password and other personal information by selecting the “change details” option on the menu bar at the top of the page. Here you can also opt-out for marketing e-mails, in case you do not wish to receive news, promotions and special offers about our products and services.

TECHNICAL TIPS:
1) Please ensure that your e-mail server allows receipt of e-mails from the domain “elsevier.com“, otherwise you may not receive vital e-mails.
2) We would strongly advise that you download the latest version of Acrobat Reader, which is available free at: http://www.adobe.com/products/acrobat/readstep2.html
3) For first-time users of Elsevier Editorial System, detailed instructions and tutorials for Authors and for Reviewers are available at: http://help.elsevier.com/app/answers/list/p/7923

Kind regards,
Elsevier Editorial System
Measurement

For further assistance, please visit our customer support site at http://help.elsevier.com/app/answers/list/p/7923. Here you can search for solutions on a range of topics, find answers to frequently asked questions and learn more about EES via interactive tutorials. You will also find our 24/7 support contact details should you need any further assistance from one of our customer support representatives.

I went to the sites and although they had Elsevier logos they were of low quality and didn’t have the normal branding that is so beloved of Elsevier.
So I think it’s a scam with fake emails and URLs.
But if it isn’t, then it’s appalling. To take me and my email into a company system, add me to the system without my permission is appalling. If it turns out to be Elsevier I shall write to David Willetts, MP.
And of course they are wasting their time as I have publicly committed to have nothing to do with helping Elsevier.

Content Mining hackday in Edinburgh; we solve Scraping

June 20th, 2014

IMG_1188

[P Murray-Rust, CC0]
We had our hack day in Edinburgh yesterday on content mining.
First, massive thanks to:
  • Mark MacGillivray for organising the event in Informatics Forum
  • Informatics Forum for being organised
  • Claire and Ianthe from Edinburgh library for sparkle and massive contributions to content mining
  • PT (Sefton) for organising material for the publishing and forbearance when it got squeezed in the program
  • Richard Smith-Unna who took time off holiday to develop his quickscrape code.
  • CottageLabs in person and remotely
  • Cameron Neylon and PLoS for Grub/Tucker etc.
  • and everyone who attended
Several participants tweeted that they enjoyed it
Claire Knowles @cgknowles Thanks to @ptsefton for inviting us and @petermurrayrust for a fun day hacking #dinosaur data with @kimshepherd @ianthe88 & @cottagelabs
So now it’s official – content mining is fun! You’ll remember we were going to
  • SCRAPE material from PLOS (and other Open) articles. And some of these are FUN! They’re about DINOSAURS!!
  • EXTRACT the information. Which papers talk about DINOSAURS? Do they have pictures?
  • REPUBLISH as a book. Make your OWN E-BOOK with Pictures of DINOSAURS with their FULL LATIN NAMES!!

About 15 people passed through and Richard Smith-Unna and Ross Mounce were online. Like all hackdays it had its own dynamics and I was really excited by the end. We had lots of discussion, several small groups crystallised and we also covered molecular dynamics. We probably didn’t do full justice to PT’s republishing technology; that’s how it goes. But we came up with graphic art for DINOSAUR games!

We made huge progress on the overall architecture (see image) and particularly  on  SCRAPING. Ross had provided us with 15 sets of URLs from different publishers, all relating to Open DINOSAURS.

 

File | Description | Last change
APP-dinosaur-DOIs.txt | APP CC-BY articles, there are more that are free access but I have on… | 4 days ago
BioMedCentral-dinosaur-articlelinks.txt | BMC article links NOT DOI’s, filtered out ‘free’ but not CC BY articles | 4 days ago
Dinosauria_valid_genera.csv | List of valid genera in Dinosauria downloaded from PaleoDB. It includ… | 4 days ago
Elsevier-CCBY-dinosaur-DOIs.txt | 3 Elsevier CC BY articles | 4 days ago
FrontiersIn-dinosaur-35articlelinks.txt | FrontiersIn | 4 days ago
Hindawi-dinosaur-DOIs.txt | Pensoft & Hindawi | 4 days ago
JournalofGeographyandGeology_DOI.txt | Create JournalofGeographyandGeology_DOI.txt | 2 days ago
Koedoe-DOI.txt | PDF scan but CC BY from 1986 | 2 days ago
MDPI-dinosaur-DOI.txt | MDPI one article | 4 days ago
README.md | Added one Evolution (Wiley) article | 4 days ago
RoyalSocietyOA-dinosaur-DOIs.txt | just one | 4 days ago
SAJournalofScience-DOI.txt | 1 CC BY article on African dinosaurs | 2 days ago
SATNT-DOI.txt | 1 CC-BY article in Afrikaans | 2 days ago
Wiley-CCBY-dinosaurs.txt | Added one Evolution (Wiley) article | 4 days ago
peerj-dinosaur-DOIs.txt | 8 PeerJ article DOIs | 4 days ago
pensoft-dinosaur-DOIs.txt | Pensoft & Hindawi | 4 days ago
plos-biology-dinosaurs-DOIs.txt | 20 PLOS Biology | 4 days ago
plos-one-dinosaur-DOIs.txt | first commit | 4 days ago
Hard work, and we hope to automate it through CRAWLING, but that’s another day. So could we scrape files from these? Remember they are all Open so we don’t even have to invoke the mighty power of Hargreaves yet. However the technology is the same whether it’s Open or paywalled-and-readable-because-Cambridge-pays-lots-of-money.
We need a different scraper for each publisher (although sometimes a generic one works).  Richard Smith-Unna has created the quickscrape platform https://github.com/ContentMine/quickscrape. In this you have to create a *.json for each publisher (or even journal).
The first thing is to install quickscrape. Node.js, like Java, is WORA, write-once-run-anywhere (parodied as WODE, write-once-debug-everywhere). RSU has put a huge amount of effort into this so that most people installed it OK, but a few had problems. This isn’t RSU’s fault, it’s a feature of dependencies in any modern language – versions and platforms and libraries. Thanks to all yesterday’s hackers for being patient and to RSU for breaking his holiday to support them. (Note – we haven’t quite hacked Windows yet, but we will.) For non-hacker workshops, i.e. where we don’t expect so many technical computer experts, we have a generic approach to distribs.
Then you have to decide WHAT can be scraped. This varies from whole articles (e.g. HTML) to images (PNG) to snippets of text (e.g. licences). What really excited and delighted me was how quickly the group understood what to do and then went about it without any problems. The first task was to list all the scrapable material and we used a GoogleSpreadsheet for this. It’s not secret (quite the reverse) but I’m just checking permissions and other technicalities before we release the URL with world access.
hackday
You’ll see (just) that we have 15 publishers and about 20 attributes. Who did it? Which scraper (note with pleasure that RSU’s generic scraper was pretty good!)? Did it work? If not this means customising the scraper. 9.5/15 is wonderful at this stage.
The great thing is that we have built the development architecture. If I have the Journal of Irreproducible Dinosaurs then I can write a scraper. And if I can’t it will get mailed out to the Content Mine community and they/we’ll solve it. So fairly shortly we’ll have a spreadsheet showing how we can scrape all the journals we want. In many instances (e.g. BioMedCentral) all the journals (ca 250) use the same technology so one-scraper-fits-all.
If YOU have a favourite journal and can hack a bit of Xpath/HTML then we’ll be telling you how you can tackle it and add to the spreadsheet. For the moment just leave a comment on this blog.
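To give a feel for what a per-publisher scraper definition amounts to, here is a rough Python sketch. It is NOT quickscrape and does not use quickscrape's JSON format: the SCRAPERS dictionary, the scrape function and the sample page are all made up for illustration. The idea is simply a table of field names and XPath expressions, applied to a page.

```python
import xml.etree.ElementTree as ET

# Hypothetical definitions: field name -> (XPath, attribute to read).
# The citation_* meta tag names are the ones many journal pages expose,
# but which tags a given publisher provides has to be checked by hand.
SCRAPERS = {
    "generic": {
        "title": (".//meta[@name='citation_title']", "content"),
        "doi": (".//meta[@name='citation_doi']", "content"),
        "fulltext_pdf": (".//meta[@name='citation_pdf_url']", "content"),
    },
}


def scrape(page, definition):
    """Apply each XPath in the definition; missing fields come back as None."""
    root = ET.fromstring(page)
    result = {}
    for field, (xpath, attribute) in definition.items():
        element = root.find(xpath)
        result[field] = element.get(attribute) if element is not None else None
    return result


# A made-up, well-formed page standing in for a real article landing page.
PAGE = """<html><head>
<meta name="citation_title" content="A made-up dinosaur paper"/>
<meta name="citation_doi" content="10.0000/example.0001"/>
</head><body><p>Text</p></body></html>"""

record = scrape(PAGE, SCRAPERS["generic"])
print(record)
```

Real publisher pages are rarely well-formed XML, so a forgiving HTML parser is needed in practice; the sketch only shows why a new journal usually means a new table of selectors rather than new code.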

Hackday 2014-06-19 in Edinburgh – a radically new approach to Scholarly Communication in the Digital Enlightenment

June 13th, 2014
Summary: Help us change the way we communicate Science and the Humanities in the Digital Enlightenment. Free [1]. EVERYONE can help.

Edinburgh is the capital of the Scottish Enlightenment where free thinkers changed the way we think about and run the world. Next week (June 19th) we’ll be running a hackday to change the way that we communicate Science and the Humanities.

For 400 years we have relied on the “printed journal” and “articles” (e.g. “PDFs”) and now we’re doing something completely different. Authors should be able to do what *they* want and readers should be able to read in the way *they* want. And readers aren’t just lecturers, they are 4-year olds, patients and machines. 4-year olds LOVE DINOSAURS.

We’ve built most of the basics. We are going to:

  • SCRAPE material from PLOS (and other Open) articles. And some of these are FUN! They’re about DINOSAURS!!
  • EXTRACT the information. Which papers talk about DINOSAURS? Do they have pictures?
  • REPUBLISH as a book. Make your OWN E-BOOK with Pictures of DINOSAURS with their FULL LATIN NAMES!!

[I'm serious about the 4-year olds. I have two high quality data points where 4-year olds LOVE Binomial names. This hackday is NOT designed for kids... but future ones maybe]

For the Techies:

  • Ross Mounce has zillions of Open DOIs about dinosaurs (i.e. a list of papers).
  • Richard Smith-Unna has built the world’s latest and greatest scraper (quickscrape) for journal articles. Anyone who can edit a file can learn to use it in 5 minutes
  • Peter Murray-Rust and friends have written AMI which can extract many types of information from articles. The simplest method is regexes, but we can do phylogenetic trees from diagrams, chemistry and much else. All in a giant Java Jar. This can filter out either the articles you want or just the bits you want!
  • Peter Sefton has built scholarly authoring systems that academics actually want to use!! We’ll probably use eBook technology which can reassemble the bits that AMI has found and you want to read. All the adverts are gone! We can make ebooks for a given subject, or today’s publications, or methods for cloning mosquitoes or all the graphs about climate change…
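For the curious, the regex approach to extraction can be sketched as follows. This is an illustration in Python, not AMI itself (which is Java): the pattern, the function name and the three-genus list are assumptions for the example, with the tiny set standing in for a real list of valid genera such as one downloaded from PaleoDB.

```python
import re

# Stand-in for a real list of valid dinosaur genera (illustrative only).
VALID_GENERA = {"Tyrannosaurus", "Triceratops", "Velociraptor"}

# Capitalised word followed by a lower-case word: a candidate binomial name.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def find_binomials(text, genera=VALID_GENERA):
    """Return 'Genus species' candidates whose genus is in the known list."""
    hits = []
    for match in BINOMIAL.finditer(text):
        genus, species = match.groups()
        if genus in genera:
            hits.append(f"{genus} {species}")
    return hits


text = ("This paper compares Tyrannosaurus rex with Triceratops horridus "
        "in Late Cretaceous faunas.")
print(find_binomials(text))
```

The genus filter is what stops "This paper" counting as a species. It is still crude: a phrase like "Velociraptor was" would be reported as a binomial, which is one reason regexes are only the simplest method.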
In hackdays YOU decide what you want to do, find friends and explore. You might create something wonderful or you might just have fun.
YES! Edinburgh has DINOSAUR skeletons.
Mark writes:
“Room 1.15, Informatics Forum, University of Edinburgh, George Square, Edinburgh”
This room has tables and seats for 12 people comfortably, and another 8 folding seats for people to dot around – I was not sure how big we were aiming, but the Forum also has a fair bit of open space if we need to de-camp some people. There is a computer and a projector too, and a whiteboard.
I should keep track of how many people plan to attend, to make sure we have space. So, could we add the following to the summary:
“If you would like to join us, please email mark@cottagelabs.com to confirm attendance”
[we think Cameron said food provided by PLoS! - we're checking ]

We launch The Content Mine In Vienna, Interviews, Talks and our first public Workshop

June 13th, 2014

Last week was one of the most exciting in my life – but also among the hardest I have worked. I travelled from Budapest to Vienna to be the guest of the Austrian Science Fund (FWF) and to give a lecture: http://www.fwf.ac.at/en/aktuelles_detail.asp?N_ID=597 . I changed the title to “Open Notebook Science” in honour of the late Jean-Claude Bradley and to promote his ideas. My talk’s on Slideshare: [http://www.slideshare.net/petermurrayrust/open-notebook-science].

Before that I had given two interviews – one to ORF ( http://en.wikipedia.org/wiki/ORF_%28broadcaster%29 ), the Austrian public broadcasting network Österreichischer Rundfunk. Here’s the interview – I haven’t seen a translation but web translators give a reasonable version: http://science.orf.at/stories/1740033/ . I explained why science was important beyond the walls of academia and why we needed to liberate scientific knowledge.

Then the “launch” of The Content Mine ( http://contentmine.org ), my Shuttleworth Fellowship project, which aims to extract 100,000,000 facts from the scientific literature. The philosophy is not that *I* do this but that *WE* do this. To do that we have to:

  • have reliable, compelling, distributable software. That’s hard. But we’ve got one of the best small teams in the world – it would be hard to think of a better one. That’s because we are developer-scholars – we are not only very experienced in the coding and design of information, but we are also expert in our own right in our fields (Chemistry, Phylogenetics, Plant Genetics, and Informatics/ScholarlyPublishing). That means we know where we are going, know what works (or rather what *doesn’t* work!) and know who else in the world is doing similar stuff. And because I’m funded by the Shuttleworth Foundation there’s a guarantee that we won’t get bought by Elsevier or Macmillan or Thomson-Reuters. I wouldn’t swap any of the team for ten million dollars – that’s how important they are to my life.
  • show YOU how to become part of US. The goal is to create a community. We’re in very good touch with Wikimedia, Mozilla, Software Carpentry, OpenStreetMap, Open Knowledge, Blue Obelisk, Apache, so our community will be recognisable in that environment. And also think of WellcomeTrust, Austrian Science Fund, RCUK, NIH, to get a feel for how we relate to science funders. We’ve only been going 3 months so we want to see a community evolve rather than design it prematurely. When it’s strong and energetic it will start to suggest where we should be going organisationally. We also work closely with domain repositories such as PubChem, EuropePubMedCentral, Treebank, Dryad, Crystallography Open Database, etc.
  • At present we are reaching out through workshops. We’re doing several this summer – Edinburgh, Berlin/OKFest, Wikimania, OK Brazil, and one or two more yet to be finalised. We’re informed by the Software Carpentry philosophy, where we run a workshop for a sponsor, and during the workshop train apprentices. Then these apprentices will be able to help run new workshops and then perhaps their own workshops. So although Michelle and I ran this workshop, there will be later ones with different leaders.

So we ran our first public workshop on 2014-06-04 at the Institute of Science and Technology Austria (IST Austria). We advertised it as:

Workshop with Peter Murray-Rust and Michelle Brook: “Can we build an intelligent scientific reader?”

Venue: IST Austria, Am Campus 1, 3400 Klosterneuburg
Time: 4th of June 2014, 10:00 a.m. – 4 p.m. (ballroom)
Participants: 10 places are still available (first come, first serve)
Registration: send an email (incl. first name, surname, institution, email) asap but until 30/5/2014 to falk.reckling@fwf.ac.at

Workshop Description
The workshop will be suitable for anyone interested in biological science and not frightened of installing and running pre-prepared programs and data (following written guidance and with support from those present in the room). The aim is to introduce computational methods for processing scientific papers, enabling analysis of multiple papers in a rapid fashion. These techniques include how to download multiple files, extract concepts and facts from the literature and figures, using Natural Language Processing and Computer Vision.

Technical expertise required
Very little expertise is required beyond general use of a computer. Much more important is a willingness to learn and experiment. However we will ensure options are made available for those who are confident/technically able, including providing opportunities to develop their own tools for analysis.

We got 18 brave people, mainly compsci but also bioscientists, and it went well. Michelle is getting formal feedback. We’re hard at work taking our own criticism on board (Michelle collected a very thorough set of observations). It was hard work, but we now know we can do it and it works. The main emphasis was on understanding the concept (with highlighter pens and paper!), scraping, extraction, and how to work as a community. We’ve got attendees who want to follow up on how they can use it! That’s the philosophy.

Then the next day an all-day hack run by OKFN Austria (Stefan Kasberger and Peter Kraker (Panton Fellow)) – http://okfn.at/2014/05/19/content-mining-meetup-with-peter-murray-rust/ . A wonderful hackspace (metalab), couches, soft drinks + honour payment, bits of kit lying around – graffiti – you know the sort of thing.

And then at the end 4 invited speakers (including PMR). We are very impressed by OKFN Austria – the day drew perhaps 25 people. And a lovely city.

But Exhausting! At the end I crashed for a long night. (In writing my Shuttleworth Quarterly report I was asked “What was your greatest loss during this quarter?” Answer: SLEEP!)

Much more to come – a hackday in Edinburgh next week to be announced later today.

 

My MPs say “You can ignore Elsevier’s TDM click-through API and we urge your library to do so too”

June 11th, 2014

 

A little while ago I wrote to Minister David Willetts through my MP Julian Huppert on two issues:

  1. Elsevier’s misselling of Open Access Articles (later described by Elsevier as their “bumpy road to Open Access”)
  2. Elsevier’s unnecessary click-through API which would constrain researchers and get them and libraries to sign away their rights.

Today I have got a reply on both points which I reproduce below.

1) TL;DR They’ve talked with Elsevier about the bumpy road (i.e. charging people for Open Access). You’ll have to read between the lines as to what was actually said, but it might be “David, we’re terribly sorry, grovel grovel [1]”

2) They held firm and said “yes, the point of the law was that researchers could mine facts (etc.) without having to sign publisher APIs”. “Yes, PMR has a right to do it and you can’t stop it”. After all, if they didn’t say that, what’s the point of the law? Elsevier and the other publishers have lost that battle and should move on.

Just in case any other publishers think the message wasn’t clear, here it is. So thank you very much David and Julian. You have worked hard and consistently for that. And I and other researchers in the UK will show that your effort has unleashed a massive potential for increasing wealth, human well-being and enhancing the status of the UK. I’ll be blogging on that RSN.

 

 

 

I have redacted my address so that GCHQ can’t say where I live and tell the NSA. (Ha!)

willetts1

 

 

TL;DR Elsevier are very slowly responding to my criticisms. It seems the more money a company takes in the harder it is for them to get their Systems right. Good that they encourage Gold OA; Bad that they exercise no price control; Ugly that they think “Access to Research” is more than a cosmetic gesture. (That’s the one where citizens can cycle through the snow to their nearest library, have an hour to read a dumb screen, cannot cut and paste, cannot copy and cannot print; what we want is legal access over the Internet, not some Charles Dickens stupidity).

NOW the more exciting part…

[Image: willetts2]

TL;DR. A UK academic has the legal right to carry out TDM for non-commercial purposes unless THEIR LIBRARY stops them, by agreeing that they will act as publisher police. And, LIBRARIES, the government is making sure you know this. Ignorance is unacceptable. So why might you sign? The publisher might sweet-talk you into doing this, just as a washing-machine salesforce sells you “insurance” that is worse than your current legal rights. Remember PPI? Click-through licences are as honest as mis-sold PPI. They’ll offer you a “better price” if you agree to constrain your researchers.

The carrot for not signing is that your researchers will thank you and praise the library for freeing them from Elsevier’s wish to agree to their research project. You will have a warm fuzzy feeling that you have stood up for freedom. Libraries are more important to researchers than publishers!

The stick… you can’t hide. The FOI flying squad can find out whether you’ve signed the click-through or other TDM restrictions. Resistance is futile. No “it’s too difficult to tell you”, “we can’t find our contracts”, etc. There’ll be a giant UK spreadsheet (promise!) with your institution on it.

It’s easy. When a publisher salesperson comes to you, mumble the mantra: “Yes to TDM; no to click-through”. They’ll try anything, but use the Force and be strong.

[1] a well-known parliamentary expression

Elsevier’s new API approach to Content Mining should be avoided by all Librarians

June 10th, 2014

Yesterday Elsevier updated its approach to Text and Data Mining. This is a rapid response. Elsevier’s material is at http://www.elsevier.com/connect/how-does-elseviers-text-mining-policy-work-with-new-uk-tdm-law and is italicised here. My emphasis in Elsevier’s text is [thus]. My comments are interleaved.

TL;DR [summary] Elsevier’s new approach is unnecessary and should be avoided by all libraries and researchers.

Some arguments below suggest that mining is better and easier with Elsevier’s APIs. This is an untested assertion. There are many Free and Open tools that can mine content. Unix tools are quite satisfactory and we have developed Free and Open content mining tools at http://contentmine.org.

There is a SUMMARY at the end

How does Elsevier’s text mining policy work with new UK TDM law?

By Gemma Hersh | Posted on 9 June 2014

In January, Elsevier announced a new text and data mining policy, which allows academic researchers at subscribing institutions to text mine subscribed content for non-commercial research purposes.

PMR: we and others showed that this was deeply and utterly flawed and contained many clauses which were solely for Elsevier’s benefit.

Last week, a new UK text and data mining copyright exception came into force which allows researchers with lawful access to works to make copies of these for the purposes of non-commercial text and data mining. Accordingly, it’s a good opportunity to reflect on how our policy and the exception work together.

PMR: YES, and my blog posts will reflect. Note that I have lawful access to all works I want to mine for my non-commercial research purposes.

Elsevier and the UK TDM copyright exception

A new UK text and data mining copyright exception came into force on June 1st. What is it and how do Elsevier’s systems accommodate this requirement?

  • An exception to copyright is when someone is allowed to copy a work without seeking the permission of the rights holder. In this instance, researchers with lawful access to works published by Elsevier can copy these without asking,  [using tools we have provided for this purpose], provided they are doing the copying to carry out non-commercial text and data mining.

PMR: The highlighted phrase is completely spurious. We can copy the material with OUR tools, which are Open, or with anyone else’s, such as GNU/Linux tools. Many readers may misread this phrase as part of the legislation; it is FUD and its introduction is completely irresponsible.

  • Elsevier offers an Application Programming Interface (API) to facilitate text and data mining of content held on Science Direct. This API makes the process [easier and more efficient] for researchers compared to manual downloading and mining of articles. It also helps us to provide a good experience to human readers and to miners at the same time.

This is an untested assertion and written from a marketing perspective rather than an actual study. It is unlikely to be easier than FreeOpen tools which work for all publishers’ output.

  • Under the UK legislation, publishers can use “reasonable measures to maintain the stability and security” of their networks, and so the [requirement to use] this API is fully compatible with the copyright exception. 

PMR: So this appears to be a MANDATORY API; if we do not use it Elsevier will take action. This is INCOMPATIBLE with the new legislation, which allows miners to ignore restrictions imposed by publishers.

  • Our approach to TDM remains under review and continual refinement. We have already made changes based on [researcher feedback during our pilot] and will continue to do so in order to support researchers.

PMR: Where is this “researcher feedback”? NAME them and publish the full details. No one has consulted me or many of the other proponents of unrestricted mining under the law. It’s always possible to find someone who will provide support for some case, but that’s neither scientific nor responsible.

  • We believe text and data mining is important for advancing science, and we are keen to provide tools to support researchers who wish to mine no matter where they are located.

 This is vacuous marketing mumble.

Related resources

Elsevier has provided [text and data mining support for researchers since 2006].

PMR: Not for me. I spent years trying to get a reasonable approach. https://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/

We designed our policy framework to span across all legal environments as research is global, and this framework complements the UK exception. Since the beginning of the year, in accordance with our policy, we have started to include text and data mining rights for non-commercial purposes in all new ScienceDirect subscription agreements and upon renewal for existing academic customers. [The UK law adds weight to our position; we are ensuring that those with "lawful access" (in UK legislation speak) have the right to mine our works].

PMR: The UK law allows ME to mine Elsevier content WITHOUT the rights included in contracts. Read those clauses carefully, LIBRARIANS. It is highly likely that you will be giving up some of MY rights.

Contrary to what some have suggested, [our policy was not designed to undermine library lobbying for copyright exceptions for text and data mining], but rather to position us to continue to offer flexible and scalable solutions to support researchers no matter where they are based.

PMR: Last year the massed mainstream publishers INCLUDING ELSEVIER fought against the European libraries, funders, JISC, SURF, etc. to require licences for content mining (“Licences 4 Europe”). The talks in Brussels broke down. Neelie Kroes stated that licences were not the answer.

PMR: So it was all a misunderstanding? Elsevier wasn’t fighting us? Orwell calls this DOUBLESPEAK. Just reading the previous sentence should convince you that publishers are not “our partners”.

What the law alone cannot do – in the UK or elsewhere – [is resolve some of the technical sticking points that often frustrate a researcher's mining experience]. That’s why our policy facilitates text mining via an Application Programming Interface (API).

PMR The FreeOpen software can already deal with the technical sticking points

The advantages of using APIs for text mining

As users of many popular websites will know, it is standard best practice for users (well, their machines) to be asked to use APIs or other download mechanisms when the website in question holds a lot of content. That’s the case with ScienceDirect, which holds over 12.5 million articles and almost 20,000 books, and we are among many other large platforms, including Wikipedia, PubMed Central and Twitter, in asking for our API to be used for downloading and mining content. We do this to provide researchers with an optimum text mining experience.

PMR: Wikipedia and PubMedCentral (on whose advisory board I am) have public and democratic approaches to governance and control. Elsevier’s API is developed without any significant community input. If I saw an Elsevier API Advisory Board, with public minutes and transparency of the stature of PubMedCentral, I would be prepared to engage.

PMR: APIs also allow websites to monitor (snoop on) who uses the API, for what purpose and when. They also allow the provider to present the particular view (often limited or distorted) that it wishes to promote.

For starters, access via the API provides full-text content of ScienceDirect in XML and plaintext formats, [which researchers tell us they prefer to HTML] for mining.

PMR Weasel words (Wikipedia term). I (PMR) find good standards-conformant HTML totally acceptable and often superior. I will be happy to report publicly whether Elsevier’s HTML is standards-conformant.

Similarly, experience in our pilots has indicated that text miners prefer API access for automated text mining for several other reasons, one being that content is available from our APIs without all of the extraneous information that is added to web pages intended for human consumption but which make text mining more difficult (e.g., presentational JavaScript, navigational controls and images, website branding, advertisements). Access via our API also provides content to researchers in stable, well-documented formats; by contrast, HTML coding can change at any time, making it arduous to keep “screen-scraping” scripts up to date.

PMR: Human readers are no doubt clamouring for the extraneous information, yearning for website branding, and reading the site for the advertisements. Our content mining tools can avoid this clutter.
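As a rough illustration of how little is needed to skip that clutter, here is a minimal sketch using only Python’s standard library. It is not ContentMine’s code; the list of tags treated as clutter (`SKIP_TAGS`) and the function name `extract_text` are assumptions made up for this example, and real publisher pages will need their own rules.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter rather than article text.
# ASSUMPTION: this list is illustrative, not taken from any publisher's markup.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # how many clutter elements we are currently inside
        self.chunks = []       # pieces of retained text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text with scripts, styles and navigation removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><script>var x = 1;</script><style>p {}</style></head>"
    "<body><nav>Home | Journals</nav>"
    "<p>The melting point was 42 C.</p>"
    "<footer>Publisher branding</footer></body></html>"
)
print(extract_text(page))
```

The same filter works on any publisher’s HTML, which is the point of mining from the page rather than from a per-publisher API.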

It’s not just text miners who benefit from our API, but users of ScienceDirect who are there to read content rather than download and mine it. Their user experience of ScienceDirect can be maintained at the highest level, as bulk downloading needed for mining is done elsewhere, via our API. If bulk downloading over a short period of time took place on the ScienceDirect site, [the system's stability would be compromised, affecting researchers of every hue]. By contrast, our API is designed to cope with high-frequency requests from automated bots and crawlers in a very efficient manner which enables us to scale our systems to meet demand.

PMR: I shan’t comment on what human ScienceDirect readers want; Cameron Neylon has already demolished the idea that commercial publishers cannot provide robust servers for all types of use.
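Whatever one makes of the capacity argument, a miner working directly from the website can keep the request rate modest without any publisher API. The following is a minimal sketch of a client-side throttle; the class name `Throttle` and the two-second interval are assumptions chosen for illustration, not a figure from the legislation or from any publisher.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: call wait() before each download.
# ASSUMPTION: 2.0 seconds is an arbitrary, conservative choice.
throttle = Throttle(min_interval=2.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url with the HTTP client of your choice...
```

The clock and sleep functions are passed in only so that the behaviour can be checked without real waiting.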

PMR: I do not understand why the hue (=colour) of researchers is important. In the UK and many other countries this is objectionable language and should not appear on a reputable publisher’s site. Please apologise and remove it, or I shall report this.

The Explanatory Notes published alongside the UK legislation make clear that publishers are able to impose “reasonable measures to maintain the stability and security” of their networks, as long as researchers are able to benefit from the exception to carry out non-commercial research. In other words, researchers with lawful access to works can copy these for the purposes of non-commercial text and data mining, and publishers have a role to play in managing this process. [[1] The “reasonable measures” include requesting that miners carry out text mining via a separate API], in line with Elsevier’s existing policy, and we have received numerous reassurances from the UK Government [[2] that use of our API will be in compliance with the law].

PMR [1] You may request but you may not require.

PMR [2] And ignoring your API is ALSO in compliance with the law.

PMR. If the law is interpreted as “the publisher decides whether an activity is compliant with the law” then the law is pointless.

We will continue to monitor how our API is used and to make tweaks and changes to our policy in response to community feedback. We have already made several adjustments. For example, we no longer request a project description as part of the API registration process, and we now allow TDM output to be hosted in an institutional repository. We also know, for example, [that researchers would like to mine third-party images and graphics that they cannot currently download automatically via our API].

PMR: Yes. I would like to mine images and I will mine images. If Elsevier does not provide images through their API this is an unassailable argument for getting them directly from the website as the law allows.

[We of course make this content available to researchers on request],

PMR You didn’t (“of course”) make anything available to me during the three years I “negotiated” with you.

… but we are looking at how we might ensure that the rights of [third-party content owners] are respected whilst at the same time providing researchers with all of the content they want immediately via our API.

PMR. More FUD. We have a complete right to mine third-party content as well. Elsevier’s “ensuring rights” is a process that is of indeterminate duration.

And we are a signatory to the new CrossRef Prospect text and data mining service, which aims to allow researchers to mine content from a range of publishers through one single portal.

PMR CrossRef is set up by publishers and guided by the publishers who finance it.

Further, we’re looking at how we ensure that researchers [know what they can and cannot do with content, or where to go for further information], without giving the impression that we are claiming ownership over non-copyrightable facts and data.

PMR. I know what I can do and where I can go without Elsevier’s help. And it’s likely that miners may choose to come to http://contentmine.org and similar community sites for information provided by the community for the community.

We’ve already altered our output terms, so that researchers can redistribute 200 characters in addition to text entity matches; [researchers] told us that our previous inclusion of text entity matches within that 200 character limit sometimes caused problems when displaying lengthy chemical formulas.

PMR “Researchers” was actually me. It’s polite to credit sources.
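To make the difference between the two output terms concrete, here is a small sketch of the arithmetic. It is only an interpretation: the function `snippet`, its parameters, and both readings of the rule (match counted inside the limit versus context allowed in addition to the match) are assumptions for illustration, not the wording of Elsevier’s terms.

```python
def snippet(text, start, end, limit=200, count_match=False):
    """Return the entity match text[start:end] with surrounding context.

    count_match=False: up to `limit` characters of context IN ADDITION to the match.
    count_match=True:  match and context together are capped at `limit` characters.
    ASSUMPTION: both readings are illustrative interpretations only.
    """
    if count_match:
        end = min(end, start + limit)   # an over-long match is itself truncated
        budget = limit - (end - start)  # what is left over for context
    else:
        budget = limit
    before = min(start, budget // 2)
    after = min(len(text) - end, budget - before)
    before = min(start, budget - after)  # reuse allowance not needed after the match
    return text[start - before:end + after]


# A 250-character "formula" with a little text on either side.
doc = "x" * 10 + "F" * 250 + "y" * 10

old_style = snippet(doc, 10, 260, count_match=True)
new_style = snippet(doc, 10, 260)
print(len(old_style), len(new_style))
```

Under the first reading a long chemical formula uses up the whole allowance and is cut short; under the second it survives intact with its context.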

In short, we will continue to do what we have always done: work with the research community to support their research, listen to feedback and respond to changing needs. Our text and data mining policy is a reflection of this and will continue to evolve accordingly.

PMR More FUD and mumble.

SUMMARY.

  • LIBRARIANS: DO NOT SIGN AWAY ANY RIGHTS.
  • NO-ONE ABSOLUTELY NEEDS ELSEVIER’S API
  • IF ELSEVIER “MANDATES” AN API WE CAN IGNORE IT UNDER UK LAW.
  • ELSEVIER’S CURRENT API PROVIDES MUCH LESS THAN THE WEBSITE
  • THERE ARE FREE/OPEN TOOLS THAT ARE AN ACCEPTABLE ALTERNATIVE APPROACH