Monthly Archives: December 2011

Semantic Physical Science Workshop

We (Charlotte Bolton and I) are preparing the material for the Semantic Physical Science Workshop in January (10/12). A major feature of this is our Jumbo-Converters which convert legacy log files to semantic CML. To do that we are cleaning up and testing the code – which runs to probably tens of thousands of lines designed by Jim Downing and Sam Adams and implemented mainly by me.

To make it usable it has to be cleaned of historical cruft, tested and documented, probably all together and iteratively. We've had several iterations of wrappers for J-C including two versions of "Lensfield". Currently it looks like we are going back to a much simpler command-line interface, and putting some responsibility on the user to write the wrapper. This is a common problem – workflows are hard and local and don't seem to generalise or abstract well. Moreover when you commit to one it's very hard to remove and change to another. So, as we have done with OSCAR, we've whittled away the wrapper stuff.

We didn't get as far as we'd hoped for today, because of this:

It's a Sparrowhawk (Accipiter nisus) and although we have seen them in our garden from time to time, this one – a female – has started to use our spruce tree as its dining table. Sparrowhawks eat mainly birds (superb article in Wikipedia, http://en.wikipedia.org/wiki/Eurasian_Sparrowhawk, where we learn that this one might eat 1500 Great Tits a year). This looked like it was eating a Blackbird and we recovered the following:

Which I am guessing is a female blackbird (Turdus merula).

We managed to get the photo with my birdwatching telescope and a phone equivalent pointed at the eyepiece.

Since the hawk takes about an hour to eat a bird and because it went off for another one and came back again, it took up a lot of our time. So the evening will have to make up for what I didn't manage during the day.

Videos from BiomedCentral: IainH and Gulliver Turtle (Panton #4 and #5) and thanks to Musopen

I have spent the last days on and off editing the material that Laura Newman (OKFN) and I collected from BiomedCentral – interviewing Iain Hrynaszkiewicz and also Gulliver Turtle. Iain has got the (slightly edited) video/audio and hopes to let me have comments shortly, after which I can release a final edited version.

Before I talk about the details I want to say how much I appreciate what BiomedCentral has done for the process of publishing science Openly. They have been going a bit over 10 years and when they started their business model was unproven. I paid tribute to this last year in http://blogs.ch.cam.ac.uk/pmr/2010/06/11/reclaiming-our-scholarship-tribute-to-vitek-tracz-and-bmc/ . I have also commended their initiatives in going beyond the mainstream. While most publisher "open access" (such as it is) has to be dragged year-by-year from apparently resistant publishers, BMC has gone out in front and wants to show what we should be doing. So there is a lot of this in Iain's interview.

Editing videos is hard work and I would be grateful for advice from people who've done it. It depends critically on the material and Iain was a superb interviewee. He knew what the questions would be and had prepared thoroughly, so when we interviewed him the replies were fluent, without hesitation, deviation or repetition. There are about 20 questions and answers – here is a typical one –

and the whole interview lasts about 28 minutes. So I am planning to create:

  • A complete edited movie of 29 minutes
  • 20 snippets (Q + A), each in its own movie (1-2 mins each)

Each Q+A has the interviewer (mainly Laura) in the semi-background, but quite audible and then Iain's response. The audio seems very clear – it was an empty room with a lapel mike for Iain and a camera mike for the interviewer.

Q: should I create 20 snippets or try to bundle them into larger themes?

Q: where should I post them (currently I will use VIMEO with a CC-BY licence)?

I also wish to get a transcript of the session (this is very important for indexing by search engines). Last time we asked for OKF volunteers and it took ages. I am considering Mechanical Turk, which will cost about 1 USD/min of video, so ca 30 USD. There's a good tutorial on this (http://waxy.org/2008/09/audio_transcription_with_mechanical_turk/ ) so it seems to be very cost-effective and I expect high quality (given the simplicity of the task and the clarity of the material).

Meanwhile I have also created the final version of Gulliver Turtle's interview (http://vimeo.com/34259668).

I wanted to add music to the slideshow so that it added to the atmosphere, and @davemurrayrust offered his CC-BY material (http://mo-seph.com/). However it was too good, in that the reader/listener paid more attention to the music than the text. So I started looking for CC-BY or CC-PD music and was pointed to a wonderful site (http://musopen.org). This has many hundreds of public domain recordings (sic, CC-PD), mainly from "the classics". So it was a question of selecting something that added to the video.

I couldn't find Carnival of the Animals so first tried Schumann's Kinderszenen – but we all agreed it was too sentimental. So the animals now have Bach's Anna Magdalena in the background (far better than I can play it!). It's fairly easy to add music – you have to trim it to the right length. It's repeated three times to fit and has a fade at the end. I'd value comments, but I am thinking of using it as the basic AnimalGarden background for any generally "happy" photocomic.

So then I resurrected the slideshow that I had given at the Serpentine Gallery (http://blogs.ch.cam.ac.uk/pmr/2011/10/16/garden-marathon-at-serpentine/ ) and added music to it. This was harder, as the themes were Innocence, Greed and Treachery, and Hope. Still choosing from Musopen (and again many thanks) I chose Anna Magdalena, Winter (4 Seasons) and Brandenburg 6-1. Certainly I am really happy to have found a PD site that gives me so much choice.

(There will be no music for IainH's interview)

Panton Discussions #4 and #5

Yesterday Laura Newman (OKF) and I met with Iain Hrynaszkiewicz of BiomedCentral and recorded a Panton discussion. Panton discussions are irregular discussions with figures in the Open world and have traditionally taken place in the Panton Arms in Cambridge, where the Panton Principles (http://pantonprinciples.org/ ) were announced.

The previous three discussions were with:

  • Richard Poynder
  • David Dobbs
  • Peter Murray-Rust (interviewed by Richard Grant, F1000)

Until recently discussions depended on my finding someone with a recorder and preferably video camera as well. Now I have permanent access to a camera and have learnt how to use it, so we don't have to get people to Cambridge (though it's nice to try). We've also got funding for Panton Fellowships (OSI) so that's an extra resource and has enabled Laura and me to work together.

IainH has been a great advocate of Open data and this recording explores his ideas. I'm currently editing it but here is the most controversial part as a teaser – how does Iain pronounce his name?:

http://dl.dropbox.com/u/6280676/ih_prononunce.wmv

This clip is a test of whether blogging videos works in my environment.

When AnimalGarden found out we were going to BMC they insisted we had to have an interview with Gulliver Turtle – the Open Access Turtle. So here are a few snippets of their recording:

The current movie is in *.wmv – sorry if you can't read it. I am getting a converter so the full interview should be in MP4.

The Open Access Movement is disorganized; this must not continue

I am going to have to reply to an article by Stevan Harnad (http://openaccess.eprints.org/index.php?/archives/862-guid.html ) where he argues inter alia that gratisOA (e.g. through Green, CC-restricted) rather than libreOA (e.g. through Gold, or CC-BY) should be adopted because:

"Note that many peer-reviewed journal article authors may not want to allow others to make and publish re-mixes of their verbatim texts"

"It is not at all it clear, however, that researchers want and need the right to make and publish re-mixes of other researchers' verbatim texts."

"Nor is it clear that all or most researchers want to allow others to make and publish re-mixes of their verbatim texts."

"Hence Gratis OA clearly fulfils an important, universal and longstanding universal need of research and researchers."

This is a new, specious and highly damaging assertion that I have to challenge. If we restrict ourselves to STM publishing (where almost all of the funders' efforts are concentrated) there is not a shred of evidence that any author wishes to restrict the re-use of their publications through licences. My analysis is:

  • Most scientists don't care about Open Access. (Unfortunate, but we have to change that)
  • Of the ones that care, almost none care about licence details. If they are told it is "open Access" and fulfils the funders' requirements then they will agree to anything. If the publisher has a page labelled "full Open Access – CC-NC – consistent with NIH funding" then they won't think twice about what the licence is.
  • Of the ones who care I have never met a case of a scientist – and I want to restrict the discussion to STM – who wishes to restrict the use of their material through licences. No author says "You can look at my graph, but I am going to sue you if you reproduce it" (although some publishers, such as Wiley did in the Shelley Batts affair, and presumably still do).

My larger point, however, is that the OA movement is disorganised and because of that is ineffective. The movement:

  • Cannot agree on what "open access" means in practice
  • Appears to be composed of factions which while they agree on some things disagree on enough others to make this a serious problem.
  • Does not sufficiently alert its followers to serious issues
  • Has no simple central resource for public analysis of the major issues.
  • Is composed of isolated individuals and groups rather than acting on a concerted strategy
  • Spends (directly or indirectly) large amounts of public money (certainly hundreds of millions of dollars in author-side fees) without changing the balance of the market
  • Has no clear intermediate or end-goals

I accept that all movements have differences of opinion. This happened in the Open Source movement. Richard Stallman proposed a model of [UPDATED] Free Software with a strong political/moral basis. Software should be free and should be a tool of liberation, and hence the viral GPL (similar to CC-BY-SA). Others see software as a public good where optimum value is obtained by requiring libre publication but not restricting downstream works to this model (similar to CC-BY). These coexist fairly well in two families – copyleft and copyright. There are adherents of both – my own approach is copyright where anyone can use my software downstream for whatever purpose save only crediting the authors (and with ArtisticLicence requiring forks to use a different name).

The point relevant to Open Access is that this market / movement is regulated. They have formed The Open Source Initiative (OSI, http://www.opensource.org/) which

"is a non-profit corporation with global scope formed to educate about and advocate for the benefits of open source and to build bridges among different constituencies in the open source community."

"One of our most important activities is as a standards body, maintaining the Open Source Definition for the good of the community. The Open Source Initiative Approved License trademark and program creates a nexus of trust around which developers, users, corporations and governments can organize open source cooperation."

This is critical.

When I find an Open Source program, I know what I am getting. When I find an Open Access paper I haven't a clue what I am getting. When I publish my code as Open Source I can't make up the rules. I must have a licence and it must be approved by OSI (they have a long list of conformant licences and discuss why some other licences are non-conformant).

The Open Source movement decided early on that the Non-Commercial clause in licences was inappropriate / incorrect / unworkable and NO OSI licences allow such restrictions. This is one of the great achievements of Open Source.

And it's policed. When I published some GPL-licenced CML-generating code I added the constraint "if you alter this code you cannot claim the result is CML". I was contacted by the FSF and told this was incompatible with the GPL. So I changed it.

In short the OS community cares about what Open Source is, how it is defined, how it is labelled and whether the practice conforms to the requirements.

By contrast the OA community does not care about these things.

That's a harsh thing to say, and there are many individuals and organizations who do care. But as a whole there is no coherent place where the OA movement expresses its concern and where concerns can be raised.

The problem can be expressed simply. "Open Access" was defined in the Budapest and other declarations. Budapest (see http://www.earlham.edu/~peters/fos/boaifaq.htm ) says:

"By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited."

Everyone (including Stevan) would agree that this is now consistent with what is (belatedly) being labelled as OA-libre. Note that Stevan was a signatory to this definition of Open Access.

My immediate concern is that unless we organize the definition, labelling and practice of Open Access we are simply giving OA-opponents or OA-doubters carte blanche to do whatever they like without being brought to account. We are throwing away hundreds of millions of dollars in a wasteful fashion. We are exposing people to legal action because the terms are undefined.

In short "Open Access" is a legally meaningless term. And, whether you like it or not, the law matters. If you try to re-use non-libre material because it was labelled "Open Access" you could still end up in court.

As a UK taxpayer I fund scientists to do medical research (through the MRC). The MRC has decided (rightly) that the results of scientific research should be made Open. But they are not Open according to the BOAI declaration. Every paper costs taxpayers 3000 USD to publish and we do not get our money's worth.

The fundamental problem is that this is an unregulated micro-monopoly market and no-one really cares. The BBC discussed today how some bus operators have 70% of the market. That means they have an effective monopoly. They can run the services they want, not what we need. The only possible constraint is government regulation.

Scientific publishing is an unregulated pseudo-monopoly market where the publishers make the rules and make the prices. The bargaining (such as it is) comes from:

  • Libraries (who are fragmented, and who care only about price, not rights)
  • Funders (who are fragmented across disciplines and countries)
  • Small scholarly organisations such as SPARC. (BTW I thought SPARC was advocating for OA-libre but I can't see much sign)
  • Individuals such as Stevan, Peter Suber, Alma Swan, with relatively little coordination and no bargaining power

These have been ineffectual in creating a coherent OA market. BMC and PLoS have been very useful in showing that OA is possible on several fronts – without them I think OA in STM would be effectively dead.

So the OA movement desperately needs coordination. Coordination of:

  • Terminology
  • Labelling
  • Dissemination of information and coordinated search
  • Advocacy
  • Enforcement

It also needs to coordinate on price-bargaining in a more effective way than libraries do at the moment. We (our governments, our charities, our universities) provide the money but they don't coordinate how it's spent or what value they get.

So my simple proposal is that we need an Open Access Institute/Initiative similar to the OSI for Open Source. It would cost a small fraction of what we already pay in unregulated Open Access fees (who, for example, challenges the 5000 USD that Nature charges for a hybrid paper?). It costs more to run a car than to publish one hybrid paper per year in Nature.

Stevan's response will be: "let's concentrate on getting all papers published as Green before we worry about anything else". I don't agree with this and I will explain more later. The OAI will have to accommodate such differences of opinion and label the approaches properly rather than allowing everyone to redefine Open Access as they think best or trying to get everyone under a common ultra-fuzzy label.

[UPDATE: I typed "Open Source" instead of "Free Software" by mistake and apologize to RMS for the slip.]

 

LBM2011 Singapore; a milestone in Text-mining and Natural Language Processing, OSCAR, OPSIN, ChemicalTagger

Yesterday I gave an invited plenary lecture at http://lbm2011.biopathway.org/ - The Fourth International Symposium on Languages in Biology and Medicine (LBM 2011), Nanyang Technological University, Singapore, 14th and 15th December, 2011.

The meeting was on Natural Language Processing – using computational techniques to "understand" science in publications and get machines to help us with the often boring and error-prone part of extracting detailed meaning. It's an exciting field and progress is steady.

The meeting itself was great. Very high standard of talks. I understood most of them both in their intent and their methodology. And a really great atmosphere – ca 30 people in a relaxed atmosphere and prepared to exchange ideas.

I realised the night before that this lecture represented a milestone in my NLP/TextMining career so I took considerable time to adjust it to the audience and the occasion. Normally in my lectures I don't know what I am going to say (and choose from HTML slides). This time I wanted to pay tribute to all the people who have contributed so much over the years and have brought us to this milestone. So here's the second slide of my talk:

(I've only included each person once). If there is anyone who is omitted let me know.

I'd like first to thank the people in the Centre I have had the chance to work with (including of course Jim Downing). They've been unusual in a good way in that they haven't been obsessed with academic competition. They have worked as a team, creating joint products, and they have also put a high premium on creating things that are useful and work. That's not so common in academia and this group has traded H-indexes for software and systems that are out there and being used. If only academia gave credit for that they would be stars. That time will come.

I'm proud to say that they've all joined high-tech UK companies which have to be part of our future. Making digital things that people want to buy. Thermodynamics owes more to the steam engine than the steam engine to thermodynamics. We should be learning from the companies that this group has gone into. I'm proud of the software engineering that Jim introduced and that the group adopted without mandates or coercion but simply because it was so evidently right.

That pride has gone into the three products, OPSIN, OSCAR4 and ChemicalTagger.

In a real sense we can draw a line under them. They work, they are "out there" and they are used. We don't know how much they are used because people are so secretive. I'd guess that there are probably 20-100 installations of OSCAR. We get little feedback because the software works. (We've got no formal feedback from OSCAR2 and we know that it's widely used).

And I'm going to be unusually boastful, because it's for them, not me:

OPSIN, OSCAR and ChemicalTagger are the best in their class that we know about. There may be private confidential programs that we don't know about, but hey! Because they are Open Source people don't try to compete and duplicate the functionality. They re-use it. So OSCAR is used in Bioclipse, and used at EBI for their chemical databases and ontology. Open source doesn't necessarily make a program functionally better per se but it allows other people to work on it. More testers, more bugs discovered, more progress.

Why can we draw a line?

Because essentially we have done what we needed to. We've built them as frameworks and we are confident that the frameworks will work for some time before they need refactoring (everything needs refactoring). So if you think OSCAR4 has less functionality than OSCAR3, that's because it's modular. There is no point in US writing web interfaces that you need to put on your server. Instead we have written an API that is so simple and powerful it's 2 lines of code in its basic form. Easy to understand, easy to test, easy to install, easy to customise.

There's lots more to do, but it doesn't involve rewriting the programs. OSCAR is designed to be extended through APIs. If you want to use a new corpus there's an API. A new dictionary/lexicon? An API. A new machine-learning algorithm? Yes, an API. It should be hours, not years to reconfigure. Here's how to do it:

ChemicalEntityRecogniser myRecogniser = new PatternRecogniser();

Oscar oscar = new Oscar();

oscar.setRecogniser(myRecogniser);

oscar.setDictionaryRegistry(myDictionaryRegistry);

List<ResolvedNamedEntity> entities = oscar.findResolvableEntities(text);

Five lines of code (of course someone has to write that recogniser and the dictionary but then you can plug and play them). So if you want OSCAR to use Conditional Random Fields, find an Open Source library (there are lots) and bolt it in as a Recogniser.
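The plug-and-play idea can be illustrated without OSCAR itself. What follows is a minimal, self-contained sketch and is not the OSCAR API: the `Recogniser` interface, the two toy recognisers and the `Pipeline` class are hypothetical names invented for this illustration. It only shows the design choice, namely that the calling code stays the same when a different recogniser is bolted in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PluggableRecogniserSketch {

    /** Hypothetical plug-in point (an assumption, not the OSCAR interface). */
    interface Recogniser {
        List<String> findEntities(String text);
    }

    /** Toy recogniser: reports every token found in a fixed dictionary. */
    static class DictionaryRecogniser implements Recogniser {
        private final Set<String> dictionary;

        DictionaryRecogniser(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        @Override
        public List<String> findEntities(String text) {
            List<String> found = new ArrayList<>();
            for (String token : text.split("\\W+")) {
                if (dictionary.contains(token.toLowerCase())) {
                    found.add(token);
                }
            }
            return found;
        }
    }

    /** Toy recogniser: reports words ending in a few common chemical suffixes. */
    static class SuffixRecogniser implements Recogniser {
        private static final Pattern SUFFIX =
                Pattern.compile("\\b\\w+(?:ane|ene|ol|one)\\b");

        @Override
        public List<String> findEntities(String text) {
            List<String> found = new ArrayList<>();
            Matcher matcher = SUFFIX.matcher(text);
            while (matcher.find()) {
                found.add(matcher.group());
            }
            return found;
        }
    }

    /** Hypothetical pipeline: the recogniser can be swapped, nothing else changes. */
    static class Pipeline {
        private Recogniser recogniser;

        void setRecogniser(Recogniser recogniser) {
            this.recogniser = recogniser;
        }

        List<String> run(String text) {
            return recogniser.findEntities(text);
        }
    }

    public static void main(String[] args) {
        String text = "The sample was dissolved in ethanol and washed with benzene.";
        Pipeline pipeline = new Pipeline();

        // First a dictionary lookup...
        pipeline.setRecogniser(
                new DictionaryRecogniser(new HashSet<>(Arrays.asList("benzene"))));
        System.out.println(pipeline.run(text));

        // ...then a different strategy, plugged in with a single call.
        pipeline.setRecogniser(new SuffixRecogniser());
        System.out.println(pipeline.run(text));
    }
}
```

A machine-learning recogniser (Conditional Random Fields, say) would slot in the same way: wrap the library behind the one-method interface and pass it to `setRecogniser`.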

Yes, group, I am proud of you and yesterday was the day I said so publicly. I used your own (Powerpoint) slides!

So where does it leave us? What does the signpost point to?

I said yesterday that Language Processing research had two goals and I'll prepend a third:

  1. LP research develops new approaches to LP. Our main contribution here has been to add chemistry and we've covered most of what's involved. There is no reason why the technology shouldn't be extended to different human languages, different corpora. We've not made any great Chomskian-like breakthroughs in understanding language itself. We haven't been able to compare chemical corpora because we don't have a collection of Open ones.
  2. LP uses discourse to give insights into the science itself. I'd hoped to do that in chemistry, but the universal refusal of chemistry publishers to provide Open corpora has meant that we have been restricted to patents. And patents, designed to conceal as well as to reveal, are not where new ideas in the fundamentals of chemistry come from. Contrast bioscience, where language is a primary tool for understanding the discipline.
  3. LP as a tool provides new useful knowledge to the whole world. Here again we are stymied by the publishers. As this blog has shown publishers are unwilling to make papers Openly available and for the extracted knowledge also to be Open. At a conservative estimate publishers have held back LP by a decade.

I then divided chemical LP into two areas:

  1. Chemical LP in chemical corpora. There's nothing useful we can do here until the scientific literature is Opened. There are 10 million syntheses published a year, and even being pessimistic, PubCrawler and OSCAR could analyse 2 million of these (I think it's higher). Richard Whitby in Dial-a-Molecule could use all this in his project for designing the molecules we will need in the future. But we are simply legally forbidden.

     

    So I am giving up LP in chemical documents. There is no point. Some commercial companies will possibly use OSCAR or OSCAR-like tools to do a small bit of this – but necessarily inefficiently. We can forget the idea of a chemical Bingle. Chemistry remains stagnated.

     

  2. Chemical LP in biological documents. Unlike the chemists, biologists really need and want LP/Textmining. They are also hampered by the restrictive practices, but they can probably work out the scope (so long as they don't publish the extracted data – all our data are belong to the publishers). There are areas such as metabolism (where Peter Corbett had some great and easily implementable ideas) which would yield massive results. Metabolism has to do with why drugs work and why they don't work. It matters. Lots of it is in the existing literature but technically and legally locked up, gathering dust.

So I am encouraging the bioscientists to use our software. I am happy to work with anyone – I am not slavishly tied to generating REF points. There is some valuable chemistry to discover by mining the bio-literature.

And I intend to go more into the patient-oriented literature and to use LP to help the scholarly poor. Because it may help them to become scholarly richer in spite of everything. And I picked up quite a lot about medical LP at the meeting so I'm fired up.

Workshop and Symposium on Semantic Physical Science

We are delighted to announce that we are running a Workshop and Symposium on Semantic Physical Science (see below). This will explore how to use physical scientific data in semantic form and will explore: creation by humans and machines, specifications including dictionaries/schemas, writing and using code (Java, FORTRAN), creating semantic data/annotation, data repositories, and publication/re-use.

We are encouraged by the success of a 1-day workshop we ran in Melbourne last month where we explored the ideas and technology. The feedback was very positive, suggesting that the time has come for Physical Science to embrace semantics. Common components are dictionaries, units of measurement, errors, metadata and the workshop/symposium will explore how to create and use these. We are also encouraged by a recent JISC/OKF/W3C-SWAT4LS life sciences hackathon that we helped to run (which showed that substantial progress can be made in 2-3 days)

This workshop is made possible by support from EPSRC. Please feel free to mail this to other scientific groups and organizations and mailing lists, especially in chemistry, earth and materials sciences.

 

Workshop and Symposium on Semantic Physical Science, Unilever Centre for Molecular Science Informatics, Cambridge UK (2012-01-10/12)

We are running a hands-on workshop (January 10th/11th 2012) and symposium (January 12th 2012) on Semantic Physical Science, supported by EPSRC ("Pathways to Impact"). At these events, we will be investigating how semantic technologies (dictionaries, mark-up languages, ontologies, data-typing) can be applied to the capture, publication, preservation and re-use of data in the physical sciences (especially chemistry and materials science). We have invited 25 scientists (particularly from the fields of crystallography/solid state, analytical spectroscopy and computational chemistry; see list below) to a two-day workshop where we will review and create toolkits and protocols. We are delighted to see very great interest from national laboratories and national providers of services.

The results of the workshop and general talks on semantic principles will be presented at the full day symposium, which will be of particular interest to creators of chemical software, publishers, repository managers and funders who encourage data publication. The symposium is open to everyone without charge. The approximate program will be released shortly but some details will reflect the progress made in the preceding workshop. There will be significant time for discussion. If you wish to attend the symposium, please email spsworkshop@gmail.com to register; places are limited to 50 – first come, first served!

Please distribute this flyer to anyone who you feel may be interested.

There may be one or two places still available in the workshop – please email spsworkshop@gmail.com for information.

Confirmed attendees:

Nico Adams (CSIRO), Simon Coles (University of Southampton), Clyde Davies (Microsoft), Bert de Jong (Pacific Northwest National Laboratory), Martin Dove (University of Cambridge/Queen Mary University of London), Jorge Estrada (Zaragoza Scientific Center for Advanced Modeling), Marcus Hanwell (Kitware), Marcus Kraft (University of Cambridge), Mahendra Mahey (JISC), Brian McMahon (IUCr), Karl Mueller (Pacific Northwest National Laboratory), Weerapong Phadungsukanan (University of Cambridge), Henry Rzepa (Imperial College), William Shelton (Pacific Northwest National Laboratory), Paul Sherwood (STFC Daresbury Laboratory), Michael Simmons (University of Cambridge), Christoph Steinbeck (EBI), Jens Thomas (STFC Daresbury Laboratory), Andrew Walker (University of Bristol), Alex Wade (Microsoft), Nancy Washton (Pacific Northwest National Laboratory), Mark Williamson (University of Cambridge), Erica Yang (Science and Technology Facilities Council)

JISC/OKFN/SWAT4LS Hackathon: Bibsoup and disease video

This week's hackathon showed how much could be accomplished in a (short) day-and-a-half. I've already given a brief overview but here I discuss our project – Open Research Reports and disease – in detail.

David Shotton and Tanya Gray introduced Minimal Information for Infectious Disease Investigations (MIIDI). [Lively debate about whether the Minimal Info idea was useful in bioscience]. Graham Steel and Gilles Frydman brought the patient axis: ACOR (Assoc Cancer Online Resources) and PatientsLikeMe. What is absolutely clear is that:

  • Patients want to be and must be equals in the use of information
  • Patients have a huge interest, huge energy and increasingly community experience and knowledge
  • They are seriously disadvantaged by lack of access to full-text articles (academics do not realise this)
  • Bibliography is critically useful

So we are developing Bibsoup to take in selected Pubmed IDs (UKPMC) and ingest them into a Bibserver to provide a BibSoup instance. Mark MacGillivray has developed a facetted browsing technology, based on BibJSON, which is both formal and fluid. It's easy for people to annotate and should be straightforward to extend.
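To give a feel for what faceted browsing over bibliographic records involves, here is a rough sketch. It is not BibServer code and does not use the BibJSON specification directly: records are modelled as plain string maps, and the field names (`title`, `year`, `journal`) and sample values are assumptions chosen for illustration. A facet is simply a count of records per value of a field, and picking a facet value narrows the record set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FacetCountSketch {

    /** Count how many records carry each value of the given field. */
    static Map<String, Integer> facetCounts(List<Map<String, String>> records, String field) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable display
        for (Map<String, String> record : records) {
            String value = record.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keep only the records whose field equals the chosen facet value. */
    static List<Map<String, String>> filter(List<Map<String, String>> records,
                                            String field, String value) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (value.equals(record.get(field))) {
                kept.add(record);
            }
        }
        return kept;
    }

    /** Build one illustrative record; the field names are assumptions. */
    static Map<String, String> record(String title, String year, String journal) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("title", title);
        r.put("year", year);
        r.put("journal", journal);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = Arrays.asList(
                record("Paper A", "2010", "Journal X"),
                record("Paper B", "2011", "Journal X"),
                record("Paper C", "2011", "Journal Y"));

        // Facet on year: prints {2010=1, 2011=2}
        System.out.println(facetCounts(records, "year"));

        // Drill down on one journal, then facet the remainder: prints {2010=1, 2011=1}
        System.out.println(facetCounts(filter(records, "journal", "Journal X"), "year"));
    }
}
```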

Along with other presentations I have captured ours (3.20 mins) and annotated some of the footage. Currently in my Public dropbox (http://dl.dropbox.com/u/6280676/hackathon.mp4 )

We are excited with how effective the hackathon was in bringing several people together into the BibSoup cluster and how the various ideas came together. We are hoping to create a future event where patients are participants. Spent a day on Friday with Gilles Frydman discussing what would be involved.

Again thanks to Gilles, Graham, Jenny, Mark, Andrea, Mahendra, Naomi and everyone else.

The hackathon was a great success (Open Research Reports, SWAT4LS, JISC, OKFN, Open Bibliography/BibSoup …)

The 1.5 day hackathon prequel to the SWAT4LS workshop was a great success – at least all the 30+ people there thought so. This is just a brief report and thanks. First enormous thanks to Andrea Splendiani and Mahendra Mahey (JISC) who jointly came up with the idea. And JISC who sponsored the event. Thanks to Jenny Molloy (OKFN) who spent a lot of time preparing and advertising and to everyone who contributed (i.e. everyone)

Very briefly we started with an evening session where people presented ideas – some prepared (Professor OWL showed her video), some extempore. Ideas about our ORR interests (Open Citations (David Shotton and Tanya Gray), Open Bibliography, and patient-centred information and decisions (Gilles Frydman and Graham Steel)). But also how to build networks of gene-drug interactions, design an artificial genome, map the incidence of disease, etc.

Then next day Mahendra gently organized us into groups – about 6 critical masses of people who felt they could create something by the end of the day. Not necessarily software – it could be exploration, resources, specs, etc. Mahendra, Naomi Lillie (recently joined OKF staff) and I had a roving brief interviewing tables, people and generally recording the event.

I have recently discovered the joys of video recording and since Mahendra's was full I rushed out to Tottenham Court Road for a tripod. We started to record individual attendees and find out who, why, what, etc. ULU isn't the easiest place to record as it's got large echoing rooms and passages, banging doors, people with drills, police cars, and students shrieking with laughter. So some of the early efforts were poor quality. Mahendra has a colleague who can apparently work magic and he's taken them all away to edit. But anyway at lunchtime I rushed out again and got a lapel mike and spare card and here we are. Here I'll just put stills from the movies. But I am very impressed with the quality – you can often see every word on the screen.

[Sorry I don't have names for everyone – feel free to annotate through blog comments]

Graham McDawg, Gilles KosherFrog and Jenny Molloy

Building an artificial genome

The BibSoup cluster (anticlockwise): XX, Naomi Lillie, Tanya Gray, Mark MacGillivray, Jenny Molloy, Gilles Frydman, Graham Steel

BibSoup being presented (Jenny Molloy)

The details [there was also a great demo on screen by Mark – I'll post it when transcoded]

Soup of the evening, beautiful BibSoup

McDawg – stand back – I'm a scientist

David Shotton (middle rear, beard) and others

I am really excited about the whole thing – the different disciplines and experiences really came together. There was a lot of interest in Open Bibliography and Citations and BibSoup. These can become central tools in the semantic web – and when allied with UKPMC they specifically serve bioscience. The disease theme was very strong – not just our group. And great interest in patient-centred approaches.

More on all this later.

“Open Access” and Non-Commercial Licences - summary

This is the last post I shall make – on this blog – on this subject.

Recent discussion has highlighted that this issue is much larger and much more critical than I thought even a few days ago. It is critical that we assemble clear, dispassionate arguments to show overwhelmingly that CC-NC is totally incompatible with "Open Access" publishing and to persuade funders and publishers to move as rapidly as possible to a fully Open (Open Definition) licence. I am therefore taking this to the discussion group(s) of the Open Knowledge Foundation (http://lists.okfn.org/mailman/listinfo/okfn-discuss ) where it will have a wider audience and more informed comment.

Ross Mounce has put together a number of comments, the first of which states

"i.e. this mess has caused irreparable damage to the re-usability of the literature."

I completely agree with this – and every month that it remains makes the future of Open Scholarly publishing worse. You can read all the comments but here are some which discuss the issue:

http://blogs.ch.cam.ac.uk/pmr/2011/12/06/acceptance-of-cc-nc-has-sold-readers-and-authors-seriously-short/#comment-101293 (Richard Kidd, Royal Society of Chemistry)

http://blogs.ch.cam.ac.uk/pmr/2011/12/06/acceptance-of-cc-nc-has-sold-readers-and-authors-seriously-short/#comment-101279 (Daniel Mietchen)

http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/#comment-101229 (Mike Linksvayer)

http://blogs.ch.cam.ac.uk/pmr/2011/12/05/%E2%80%9Copen-access%E2%80%9D-and-%E2%80%9Cnon-commercial%E2%80%9D-%E2%80%93-yet-again-can-any-publisher-justify-fees-for-hybrid-articles/#comment-101292 (Ross Mounce)

http://blogs.ch.cam.ac.uk/pmr/2011/12/05/%E2%80%9Copen-access%E2%80%9D-and-%E2%80%9Cnon-commercial%E2%80%9D-%E2%80%93-yet-again-can-any-publisher-justify-fees-for-hybrid-articles/#comment-101291 (Ross Mounce)

Anecdotal surveys in the comments by Daniel, Ross and PMR show that a very high proportion of well-known publishers are using CC-NC.

Acceptance of CC-NC has sold readers and authors seriously short

I was at the AGM of UK PubMedCentral last Monday and asked about the Open Access subset of PMC – those papers where authors/funders have paid large amounts of money to ensure their papers are "Open Access". I asked about the licence, fully expecting these to be all CC-BY, and was appalled to hear that most of them were only available as CC-NC. This appears to be near universal – most major publishers only allow "Open Access" to be CC-NC.

Very simply, this is a disaster.

Because CC-NC gives the reader or re-user almost no additional rights. The author is paying anything up to 3000 currency units for something which is little more than permission to put the article on their web page. And as far as I can see the funders have acquiesced to this. Whether it was the best they could negotiate or whether it's an oversight I don't know – hopefully a funder will let us know.

I and others have written at length on the restrictions imposed by NC. NC forbids any commercial use. Commercial is not related to motivation – profit/non-profit, etc. It is whether there is an exchange of some form of goods. Among the things NC forbids are:

  • Public text- and data-mining. A third party could make commercial use of the results
  • Republication of diagrams, etc. in journals. Publication is a commercial act.
  • Creation of learning materials. Students pay for their education.

And many more. You may think I am being picky and that no-one would object to this. But a licence is a legal document and these are commercial activities, regardless of the motivation.

So, simply, CC-NC forbids almost all downstream activity, rendering the "Open Access" valueless other than for human eyes.

Why do the publishers do this? After all, BMC publishes all material as CC-BY and nothing terrible has happened to it. Why especially should a scholarly society, which is meant to foster communication and science, actually restrict its use? I won't speculate, but leave it to them to reply. Note that making something CC-NC is not giving the reader permissions, it's effectively removing permissions from Open Access CC-BY.

I find the whole cluster of "Open Foo" deeply worrying. These have names such as:

  • Open Choice
  • Author Choice
  • Open Science (this is an appalling name – completely at odds with normal usage and conveying no information)
  • Free content

When you see something described as "Open Choice" instead of Open Access it's a very good indication that you will have to read the small print. Often there isn't a licence.

I can't find licences on the RSC FAQs. "You may deposit the accepted version of the submitted article in other repository(ies) as required, with no embargo period, except that you are not permitted to deposit your work in any commercial service." This isn't a licence, it's mumble.

Springer (http://www.springer.com/open+access/authors+rights?SGWID=0-176704-12-467999-0 ): "The copyright will remain with you and the article will be published under the Creative Commons Attribution-Noncommercial License. The cost of Springer Open Choice (USD 3.000/ EUR 2.000) is – as stated on the NIH web site – a permissible cost in your grant." That is CC-NC. Since Springer allow self-archiving (Green) the 2000 EUR is buying almost nothing (perhaps a slightly different form of the manuscript?).

Nature (http://www.nature.com/srep/policies/index.html#license-agreement ): "Papers are published under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported or the Attribution-NonCommercial-ShareAlike 3.0 Unported licence, at the choice of the authors."

And I could go on with other publishers.

One feature of these "Open Foo" products is that they are different for every publisher. Some bury the licence several pages down, others don't mention it (RSC). It's a huge amount of work to make sense of this. It shouldn't have to be people like me.

The bottom line is that this is an unregulated market. Some of us are looking to the funders to act as regulators. They are not. They probably feel that 2000 for a CC-NC licence is what they want.

Unfortunately it's not what I want and I have the feeling, yet again, that we have been sold short.

In summary:

  • I'd like the RSC to justify NC because I can't
  • I'd like one or more funders to say why they have accepted such bad value in CC-NC licences.