petermr's blog

A Scientist and the Web

 

Archive for April, 2012

A pictorial Amusement

Monday, April 30th, 2012

I dropped in to see our computer officers today – they’ve just had an aircon failure and I was offering sympathy – they have a lot to deal with. While there I noticed this splendid spanner (== wrench/US). I love tools and this one has a majesty of its own in a computer office. It’s about 40 cm long (see ruler) and we guess it’s about 2 kilos.

I naturally assumed it was for something like bolting units to the floor, but that’s not why it was ordered. The reason is gently amusing (perhaps you can make some guesses).

Meanwhile tomorrow I’ll be blogging about text-mining. I’ve been hacking code furiously over the last 5 days and feeling it. There is a lot I need to write about but textmining is the priority.

 

Text-mining the scholarly literature: towards a set of universal Principles; Update and strategy

Wednesday, April 25th, 2012

For some years I have seen the primary literature as an enormous untapped resource of scholarly information. We humans are very good at some aspects of “reading the literature” but there are many areas where machines are better and should be used. These include scale (hundreds of thousands of manuscripts), checking, validation, transformation (e.g. scientific units), deduction (many papers have implicit semantics), aggregation of knowledge, and much more. We are now reaching the time when the technology of “text-mining” is mature enough to deploy and, for example, my group and I have developed among the best tools in the world for mining chemistry. I am now expanding that to other fields which I will describe in later posts.

In general the readers of the scholarly literature (who may include the #scholarlypoor) have been seriously frustrated by the restrictions imposed by publishers and universally agreed to by librarians. Most subscriptions to most major journals have terms forbidding readers to mine/crawl/index/extract etc. This is not a consequence of copyright – it is an additional restriction imposed by publishers and apparently automatically assented to by academic purchasing systems (mainly libraries). This automatic assent has done scholarship a grave disservice, so I give the library community a chance to correct the historical record:

Has any library ever publicly challenged the terms of use [on mining] set by publishers? I haven’t seen any. But I’d be grateful to know public cases, and what happened. My current view is that publishers set conditions and that libraries accept them verbatim, which, unfortunately, means that they don’t have a track record of fighting for text-mining or other freedoms.

Moving on, the UK Hargreaves report has recommended removing these restrictions (which are not legally required) and also modifying copyright law. My grapevine suggests there is a high probability that significant changes will be made and that “text-mining” will become widely available without requiring explicit permission. We should prepare for this, and any responsible publisher and library/purchaser should be preparing for this.

A month ago I and colleagues in OKF submitted cases to the Hargreaves process. As part of that I asked 6 major publishers whether I could “text-mine” their journals. Naomi Lillie of OKF is summarising the results and I will keep you in suspense till then. It’s fair to say some were helpful, some were not and some were fuzzy (for whatever motivation).

A number of publishers said we should discuss it with the library. There is no need for this. My group and I can text-mine material by ourselves – in one week Daniel Lowe extracted 500,000 chemical reactions from the US Patent Office without needing any help. Nick Day has built PubCrawler and extracted 200,000 crystal structures from supplemental information without any help. The only thing I need is:

  • An assurance I won’t be sued for behaving like a responsible scholar
  • An assurance that my institution won’t get cut off for (my) responsible behaviour

In case anyone in the publishing or library communities doesn’t understand what “responsible” means, it means:

  • I do not intend deliberately to re-publish the publishers’ manuscripts (“the PDF”) in bulk without valid scholarly reason.

I am a responsible scholar. I conform to health and safety. I obey the law of the UK. I do not steal. I can justify the expenditures on my grants. I attempt to value and promote human equality in my scholarship. I try to give credit where it is due. Responsible scholarship is a fundamental principle which I believe applies to almost all readers of the scholarly literature. Occasionally I and others fail – there are ample mechanisms for addressing these without forbidding textmining.

So this post asserts my absolute right as a subscriber to the scholarly literature to carry out textmining and to disseminate the results to anyone. I do not need any other permissions.

A number of details follow which I’ll address in later posts.

At present, therefore, a group of us – under the aegis of the Open Knowledge Foundation – is drafting a set of principles for textmining. They include:

We shall come up with a manifesto/set-of-principles. This will be a statement of our rights and our responsibilities. It is not a negotiation, any more than Tom Paine or the Founding Fathers negotiated in the construction of their declarations. Or, more recently, the BBB declarations of Open Access. Those declarations are priceless – it’s just a pity that there are not enough people who believe in them enough to push for their universal acceptance. We shall not make the same mistake with the principles of textmining.

 

Panton Fellows, Principles in Japanese, #pantonscience

Tuesday, April 24th, 2012

 

 

It’s been an exciting week in Pantonia. I have been very active with our new Panton Fellows (http://science.okfn.org/2012/04/03/introducing-our-panton-fellows/). Last Monday Ross Mounce came over to Cambridge and we looked in depth at liberating information about phylogenetic trees. This is exciting and keeps me up at night and active on train journeys. And yesterday I took the train to Oxford to visit Sophie Kershaw, who’s putting together a radically different course for graduates, with emphasis on reproducible computing. I’m deliberately downplaying both of these here, as they’ll be telling you all about what they are doing.

Part of yesterday was an evening meeting run by Jenny Molloy – a new Open Science group with about 12 of us in the Oxford eResearch Centre (OeRC), where we met Dave de Roure, who took us out to dinner at the Royal Oak. While there we discussed in some depth what needs to be done for text-mining, with a group that included Diane Cabell and Dave Shotton. It’s really great to see critical mass in this way. I will have a LOT to write about textmining.

So today I met with Ayumi Koso (above) from Tokyo. Ayumi works with the Japanese government in Tokyo on the National Bioscience Database Centre (NBDC). She has already translated the Panton Principles into Japanese (http://pantonprinciples.org/translations/#Japanese). She’s staying in Cambridge and so today she has a chance to meet some OKF people. Here’s our visit to the Panton Arms, preceded by a visit to Hinxton/Sanger Centre to visit Tim Hubbard (OKF advisory). And this afternoon Laura Newman will be coming round to meet.

I am really fortunate to be living in the middle of all this.

(We’ve decided today that the Panton hashtag is #pantonscience)

 

My virtual talk in Poland

Friday, April 13th, 2012

 

I am presenting a talk in Poland today – although I am in Rome. http://www.eifl.net/events/open-science-education-conference-poland

Open science & education conference, Poland

13 Apr 2012 – 14 Apr 2012. Nicolaus Copernicus University (NCU) invites you to the Third International Conference on Open Access, which will take place on 13-14 April in Bydgoszcz, Poland, at Collegium Medicum NCU. This year’s theme is Open science and Open education.

I hope to be on skype from Rome airport – it may be rather hairy.

I tried to create a set of slides and run an audio over them. I used PowerPoint, because it allowed narration easily. It was quite easy to create and I made a 6 minute introduction. But trying to upload it was a disaster – the connection is so asymmetric that the upload was taking hours and crashing. So I have changed the strategy.

We’ll play the first 6 minutes. http://dl.dropbox.com/u/6280676/first.pptx

Then if we can’t skype it is worth playing the OKF/JennyMolloy/PMR video http://vimeo.com/31861413

If we can skype then Cameron Neylon has agreed to click through the points and links below while I speak.

Start with the “Academic Spring”: The Guardian’s remarkable and remarkably apposite article, http://www.guardian.co.uk/science/2012/apr/09/wellcome-trust-academic-spring

General points

  • Most science research/data is never properly published or used => Bad science, duplication
  • This costs/loses 100 Billion+ per year; so HUGE opportunities for new business/products. Europe or Silicon Valley??
  • The long-tail of science; scholarship OUTSIDE academia?
  • Conventional publication does not work for data
  • Diversity. No single solution. Communities of scholarship. HEP, Astronomy, Chemistry
  • Domain repositories essential; Inst Repos don’t work for science
  • OPEN. Must be BOAI-compliant: use CC-BY/CC0
  • Are universities the solution or the problem?
  • Sustainability. Funders and National Laboratories
  • Mandates are poor instruments; Culture must change. Rewards?
  • Create an author-centric culture/technology. Semantic documents. “ScienceForge”
  • Sustainability. Alliance with wealth-generation industries?
  • Text-mining VERY topical
  • Theses. Must become centralised semantic, Europe?? NL++, UK--
  • Demos: text-mining, repositories
  • Growing points:
    • Open (Web) Technology continues to advance
    • Linked Open Data / Semantic Web
    • Graduate students
    • Scholarly poor
    • Wikip(m)edia
    • Open Knowledge Foundation

“Slide links” – bold is priority

And a big thank you to everyone in Poland for their patience and to Cameron for helping.

Horizon2020 what I said in Rome (and what Neelie said)

Wednesday, April 11th, 2012

I always try to blog what I said in meetings as I don’t (can’t) use traditional slides. Today I scraped slides off the web and my talk was significantly different from what I had prepared. This was in considerable part because of what had been said in the morning by, among others, Neelie Kroes (Deputy European commissioner) and Geoffrey Boulton (Royal Society). They anticipated many of my concerns (previous blog post) and I could simply praise them for it. Neelie Kroes was very impressive. She knew the field very well and was clearly fundamentally committed to making it happen. Europe can feel proud of her.

I was able to ask her a question – shouldn’t we be supporting young people, and how can we get them to contribute to European wealth creation? Why no Euro Google/Facebook, etc.? She was very excited and recounted how she’d been to a young person’s hacker camp (in Spain?) with thousands(?) camping in tents. And how when she asked a 14-year old “aren’t you afraid of giving away information?” he said “you don’t get it, it’s about sharing”. We exchanged cards and I’m hoping that I can get some of the young people in the OKF involved.

I’d love to blog other aspects of the meeting – I don’t even know whether it’s being tweeted.

Open Infrastructure for Open Science/Data; and Academic Spring

Wednesday, April 11th, 2012

 

I am presenting this afternoon in Rome to an important group of science-oriented people/organizations – about 70 people will be there. As always I try to talk to people before the presentation. I’ve got 20 minutes, and I want to get across both ideas and examples. So I can’t do it all. This is my “checklist” for things I think are important. (Almost all my “slides” are scraped from the web and I will publish the links shortly in a separate blog).

  • Most science research/data is never properly published or used => Bad science, duplication
  • This costs/loses 100 Billion+ per year; so HUGE opportunities for new business/products. Europe or Silicon Valley??
  • The long-tail of science; scholarship OUTSIDE academia?
  • Conventional publication does not work for data
  • Diversity. No single solution. Communities of scholarship. HEP, Astronomy, Chemistry
  • Domain repositories essential; Inst Repos don’t work for science
  • OPEN. Must be BOAI-compliant: use CC-BY/CC0
  • Are universities the solution or the problem?
  • Sustainability. Funders and National Laboratories
  • Mandates are poor instruments; Culture must change. Rewards?
  • Create an author-centric culture/technology. Semantic documents. “ScienceForge”
  • Sustainability. Alliance with wealth-generation industries?
  • Text-mining
  • Theses. Must become centralised semantic, Europe?? NL++, UK--
  • Demos: text-mining, repositories
  • Growing points:
    • Open (Web) Technology continues to advance
    • Linked Open Data / Semantic Web
    • Graduate students
    • Scholarly poor
    • Wikip(m)edia
    • Open Knowledge Foundation

e-Infrastructures for Open Science – my talk in Rome

Monday, April 9th, 2012

I have been invited to Rome to help start the Horizon 2020 Consultation for future European funding:

http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/agenda_alea_rome.pdf

It’s an important meeting and follows a morning presentation by Neelie Kroes (European Digital Agenda) and responses by National scientific societies.

http://www.allea.org/Content/ALLEA/General%20Assemblies/General%20Assembly%202012/GA_draft_programme_overview_final.pdf

This blog helps me to coordinate my ideas and also acts as a record of them. My remit is to introduce this theme: “Open e-Infrastructures for Open Science” which then devolves into 3 parallel sessions:

  • Open global data infrastructure
  • Open scientific content
  • Open research culture

I’m interested in all of these and shall try to address all of them – of course there is a lot of overlap. I apologize for any UK-centricity but I hope the issues are general. I bring the following experience:

  • A practising scientist in “long-tail science”, mainly on the informatics side. Heavily involved in the UK eScience programme.
  • Spent time in industry and academia.
  • Active in the Open Knowledge movement (especially science data and open source code/informatics).

I’ll divide this into these areas and probably be slightly controversial:

What have we learned in the last 10 years of eScience?

The UK eScience programme broke much new ground. Its greatest success was bringing groups of scientists and computationalists together and that continues (e.g. in the Oxford eResearch Centre) and that has made it eminently worthwhile. But I’ll also comment on things that didn’t work:

  • Top-down design. Technology progresses so rapidly in the Internet world that trying to design the future doesn’t work.
  • Academic-industry infrastructure. The problems of shared vision and a secure collaborative infrastructure are too difficult and expensive for either partner. Instead we should concentrate on areas where industry can share the results of academic work through an Open Infrastructure
  • Universities are generally not the best place for managing collaborative research infrastructure on an ongoing basis. Institutional repositories do not effectively serve science. In contrast inter/national research organisations have the infrastructure and the mission to make this happen.
  • There was and is very little investment in infrastructure for “long-tail science” – I exclude bioscience supported by EBI/NCBI, etc. There are no useful repositories for many of the disciplines, few ontologies, and little interest in the dissemination of science
  • Academia and many scientists are conservative and increasingly driven by self-interest. Open practice will not happen rapidly.

What are the current problems?

The eScience programme has had little impact on the current practice of science. Informatics is carried out using whatever commodity tools are available and the culture is dominated by commercial scientific publication. This has not changed in 10 years and is now seriously holding back innovation in several ways.

  • The result of research is a “PDF”, not scientific information
  • The rewards are almost solely based on “citations” – a flawed measure of value
  • Almost everyone outside academia (and many within), “the scholarlyPoor”, is denied effective access to scientific output.
  • Young researchers are stifled by the system and institutionalised.

There is little incentive to change the system or to build a better infrastructure.

And alongside this we have the battle between commercial closed “walled gardens” and Open knowledge (CC-BY, CC0 – anything other is almost valueless). Academia is NOT committed to Openness – it points inwards and builds systems for itself, not the world. And there is a dysfunctional academic-publisher complex which reinforces stagnation.

What are we losing?

We can consider this both in world terms and European terms. There is now huge potential in new information industries downstream of scientific publication: a “Google for Science”. I have estimated to the UK Hargreaves enquiry that in chemistry alone this could be “low billions” worldwide. Are we going to let Silicon Valley capture yet another new market?

  • We lose the value of the research we fund.
  • We lose the opportunity of creating new information industries
  • We make seriously bad decisions
  • Our science is worse – often unchallenged or duplicated
  • Our culture does not reward change.

What are the growing points?

We must not ignore the rest of the world. Our greatest human capital is OUTSIDE academia. Examples of worldwide growth are:

  • Wikipedia etc. (probably the greatest communal effort to build quality public science in many disciplines)
  • OpenStreetMap. An unfunded project that shows what one person can make happen, and which within a few years has become a world resource and standard
  • Open Source software.
  • Open Knowledge
  • Open science (Open Source Drug Discovery). Open Science moves faster than conventional because it grows communities rapidly, shares knowledge and avoids mistakes.
  • Internet-aware interest and practice groups (e.g. Malaria World)
  • Young people. One graduate year can create a high-quality growing point: Figshare, Altmetrics, and PMR group (OSCAR/OPSIN chemical NLP, Crystaleye – all now being taken up). Give undergraduates and graduates encouragement to explore and innovate
  • Open publishers (PLoS, Wellcome, BMC(Springer))

What should we do?

We have to change the culture. I don’t know how to do that in detail, but here are some things I’d like to see happen.

  • A scientist-oriented system for scientific research. “ScienceForge”. It’s been solved for computer programming (“SourceForge”) without any central investment. It can’t be top-down; it has to grow organically. It has to support scientists in their daily work so naturally that they don’t notice it. A scientist should then be able to share their work anywhere. It should support embargoed publication, on scientists’ terms not publishers’.
  • 3rd year graduate students designing the informatics structure and training
  • Use multiple metrics for science output not just “citations”
  • Actively involve the “scholarlypoor” outside academia. Reward successful extra-academic enterprise
  • Put national laboratories at the centre of the infrastructure for long-tail scientific information.
  • Develop sustainable profitable business models based on Open practice.

I’ll think of some more on the plane. And I shall, as always, react to what is said before me. I am very impressed with Neelie Kroes and hope I get a chance to meet her.

Congratulations to our Panton Fellows, Sophie and Ross

Friday, April 6th, 2012

This blog has been down but I can now congratulate our 2 new Panton fellows, Sophie Kershaw and Ross Mounce. Laura Newman has written an account http://blog.okfn.org/2012/03/30/introducing-our-panton-fellows/ and I took photos.

I will write much more (I’m very busy at present) and will be meeting with both later this month. Congratulations to both.

Read Laura’s blog.