Text-mining the scholarly literature: towards a set of universal Principles; Update and strategy

For some years I have seen the primary literature as an enormous untapped resource of scholarly information. We humans are very good at some aspects of “reading the literature” but there are many areas where machines are better and should be used. These include scale (hundreds of thousands of manuscripts), checking, validation, transformation (e.g. scientific units), deduction (many papers have implicit semantics), aggregation of knowledge, and much more. We are now reaching the time when the technology of “text-mining” is mature enough to deploy and, for example, my group and I have developed among the best tools in the world for mining chemistry. I am now expanding that to other fields which I will describe in later posts.

In general, readers of the scholarly literature (who may include the #scholarlypoor) have been seriously frustrated by the restrictions imposed by publishers and universally agreed to by librarians. Most subscriptions to most major journals have terms forbidding readers to mine/crawl/index/extract etc. This is not a consequence of copyright – it is an additional restriction imposed by publishers and apparently automatically assented to by academic purchasing systems (mainly libraries). This automatic assent has done scholarship a grave disservice, so I give the library community a chance to correct the historical record:

Has any library ever publicly challenged the terms of use [on mining] set by publishers? I haven’t seen any. But I’d be grateful to hear of public cases, and what happened. My current view is that publishers set conditions and that libraries accept them verbatim, which, unfortunately, means that libraries have no track record of fighting for text-mining or other freedoms.

Moving on, the UK Hargreaves report has recommended removing these restrictions (which are not legally required) and also modifying copyright law. My grapevine suggests there is a high probability that significant changes will be made and that “text-mining” will become widely available without requiring explicit permission. We should prepare for this, and any responsible publisher and library/purchaser should be preparing for this.

A month ago colleagues in the OKF and I made submissions to the Hargreaves process. As part of that I asked 6 major publishers whether I could “text-mine” their journals. Naomi Lillie of the OKF is summarising the results and I will keep you in suspense till then. It’s fair to say some were helpful, some were not, and some were fuzzy (for whatever motivation).

A number of publishers said we should discuss it with the library. There is no need for this. My group and I can text-mine material ourselves – in one week Daniel Lowe extracted 500,000 chemical reactions from the US Patent Office without needing any help. Nick Day has built PubCrawler and extracted 200,000 crystal structures from supplemental information without any help. The only things I need are:

  • An assurance I won’t be sued for behaving like a responsible scholar
  • An assurance that my institution won’t get cut off for (my) responsible behaviour

In case anyone in the publishing or library communities doesn’t understand what “responsible” means, it means:

  • I do not intend deliberately to re-publish the publishers’ manuscripts (“the PDF”) in bulk without valid scholarly reason.

I am a responsible scholar. I conform to health and safety. I obey the law of the UK. I do not steal. I can justify the expenditures on my grants. I attempt to value and promote human equality in my scholarship. I try to give credit where it is due. Responsible scholarship is a fundamental principle which I believe applies to almost all readers of the scholarly literature. Occasionally I and others fail – there are ample mechanisms for addressing these without forbidding textmining.

So this post asserts my absolute right as a subscriber to the scholarly literature to carry out textmining and to disseminate the results to anyone. I do not need any other permissions.

A number of details follow which I’ll address in later posts.

At present, therefore, a group of us – under the aegis of the Open Knowledge Foundation – is drafting a set of principles for textmining. They include:

We shall come up with a manifesto/set-of-principles. This will be a statement of our rights and our responsibilities. It is not a negotiation, any more than Tom Paine or the Founding Fathers negotiated in the construction of their declarations. Or, more recently, the BBB declarations of Open Access. Those declarations are priceless – it’s just a pity that not enough people believe in them strongly enough to push for their universal acceptance. We shall not make the same mistake with the principles of textmining.

 


Panton Fellows, Principles in Japanese, #pantonscience

 

 

It’s been an exciting week in Pantonia. I have been very active with our new Panton Fellows (http://science.okfn.org/2012/04/03/introducing-our-panton-fellows/). Last Monday Ross Mounce came over to Cambridge and we looked in depth at liberating information about phylogenetic trees. This is exciting and keeps me up at night and active on train journeys. And yesterday I took the train to Oxford to visit Sophie Kershaw, who is putting together a radically different course for graduate students, with an emphasis on reproducible computing. I’m deliberately downplaying both of these here, as they’ll be telling you all about what they are doing.

Part of yesterday was an evening meeting run by Jenny Molloy – a new Open Science group with about 12 of us in the Oxford eResearch Centre (OeRC) – where we met Dave de Roure, who took us out to dinner in the Royal Oak. While there we discussed in some depth, with Diane Cabell and Dave Shotton, what needs to be done for text-mining. It’s really great to see critical mass in this way. I will have a LOT to write about textmining.

So today I met with Ayumi Koso (above) from Tokyo. Ayumi works with the Japanese government on the National Bioscience Database Centre (NBDC). She has already translated the Panton Principles into Japanese (http://pantonprinciples.org/translations/#Japanese). She’s staying in Cambridge and so today has a chance to meet some OKF people. Here’s our visit to the Panton Arms, preceded by a trip to Hinxton/Sanger Centre to see Tim Hubbard (OKF advisory board). And this afternoon Laura Newman will be coming round to meet her.

I am really fortunate to be living in the middle of all this.

(We’ve decided today that the Panton hashtag is #pantonscience)

 


My virtual talk in Poland

 

I am presenting a talk in Poland today – although I am in Rome. http://www.eifl.net/events/open-science-education-conference-poland

Open science & education conference, Poland

13 Apr 2012 – 14 Apr 2012 Nicolaus Copernicus University (NCU) invites you to the Third International Conference on Open Access, that will take place on 13-14 April in Bydgoszcz, Poland, at Collegium Medicum NCU. This year’s theme is Open science and Open education. More information.

I hope to be on Skype from Rome airport – it may be rather hairy.

I tried to create a set of slides and record audio over them. I used PowerPoint, because it allowed narration easily. It was quite easy to create and I made a 6-minute introduction. But trying to upload it was a disaster – my connection’s upstream bandwidth is so limited that the upload was taking hours and crashing. So I have changed the strategy.

We’ll play the first 6 minutes. http://dl.dropbox.com/u/6280676/first.pptx

Then, if we can’t Skype, it is worth playing the OKF/JennyMolloy/PMR video http://vimeo.com/31861413

If we can Skype, then Cameron Neylon has agreed to click through the points and links below while I speak.

Start with the “Academic Spring” – The Guardian’s remarkable and remarkably apposite piece: http://www.guardian.co.uk/science/2012/apr/09/wellcome-trust-academic-spring

General points

  • Most science research/data is never properly published or used => Bad science, duplication
  • This costs/loses 100 Billion+ per year; so HUGE opportunities for new business/products. Europe or Silicon Valley??
  • The long-tail of science; scholarship OUTSIDE academia?
  • Conventional publication does not work for data
  • Diversity. No single solution. Communities of scholarship. HEP, Astronomy, Chemistry
  • Domain repositories essential; Inst Repos don’t work for science
  • OPEN. Must be BOAI-compliant: use CC-BY/CC0
  • Are universities the solution or the problem?
  • Sustainability. Funders and National Laboratories
  • Mandates are poor instruments; Culture must change. Rewards?
  • Create an author-centric culture/technology. Semantic documents. “ScienceForge”
  • Sustainability. Alliance with wealth-generation industries?
  • Text-mining VERY topical
  • Theses. Must become centralised semantic, Europe?? NL++, UK–
  • Demos: text-mining, repositories
  • Growing points:
    • Open (Web) Technology continues to advance
    • Linked Open Data / Semantic Web
    • Graduate students
    • Scholarly poor
    • Wikip(m)edia
    • Open Knowledge Foundation

“Slide links” – bold is priority

And a big thank you to everyone in Poland for their patience, and to Cameron for helping.


Horizon 2020: what I said in Rome (and what Neelie said)

I always try to blog what I said in meetings as I don’t (can’t) use traditional slides. Today I scraped slides off the web and my talk was significantly different from what I had prepared. This was in considerable part because of what had been said in the morning by, among others, Neelie Kroes (European Commission Vice-President for the Digital Agenda) and Geoffrey Boulton (Royal Society). They anticipated many of my concerns (previous blog post) and I could simply praise them for it. Neelie Kroes was very impressive. She knew the field very well and was clearly fundamentally committed to making it happen. Europe can feel proud of her.

I was able to ask her a question – shouldn’t we be supporting young people, and how can we get them to contribute to European wealth creation? Why is there no European Google/Facebook, etc.? She was very excited and recounted how she’d been to a young people’s hacker camp (in Spain, I think) with perhaps thousands camping in tents. And how, when she asked a 14-year-old “aren’t you afraid of giving away information”, he said “you don’t get it, it’s about sharing”. We exchanged cards and I’m hoping that I can get some of the young people in the OKF involved.

I’d love to blog other aspects of the meeting – I don’t even know whether it’s being tweeted.


Open Infrastructure for Open Science/Data; and Academic Spring

 

I am presenting this afternoon in Rome to an important group of science-oriented people/organizations – about 70 people will be there. As always I try to talk to people before the presentation. I’ve got 20 minutes, and I want to get across both ideas and examples. So I can’t do it all. This is my “checklist” for things I think are important. (Almost all my “slides” are scraped from the web and I will publish the links shortly in a separate blog).

  • Most science research/data is never properly published or used => Bad science, duplication
  • This costs/loses 100 Billion+ per year; so HUGE opportunities for new business/products. Europe or Silicon Valley??
  • The long-tail of science; scholarship OUTSIDE academia?
  • Conventional publication does not work for data
  • Diversity. No single solution. Communities of scholarship. HEP, Astronomy, Chemistry
  • Domain repositories essential; Inst Repos don’t work for science
  • OPEN. Must be BOAI-compliant: use CC-BY/CC0
  • Are universities the solution or the problem?
  • Sustainability. Funders and National Laboratories
  • Mandates are poor instruments; Culture must change. Rewards?
  • Create an author-centric culture/technology. Semantic documents. “ScienceForge”
  • Sustainability. Alliance with wealth-generation industries?
  • Text-mining
  • Theses. Must become centralised semantic, Europe?? NL++, UK–
  • Demos: text-mining, repositories
  • Growing points:
    • Open (Web) Technology continues to advance
    • Linked Open Data / Semantic Web
    • Graduate students
    • Scholarly poor
    • Wikip(m)edia
    • Open Knowledge Foundation

e-Infrastructures for Open Science – my talk in Rome

I have been invited to Rome to help start the Horizon 2020 Consultation for future European funding:

http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/agenda_alea_rome.pdf

It’s an important meeting and follows a morning presentation by Neelie Kroes (on the European Digital Agenda) and responses by national scientific societies.

http://www.allea.org/Content/ALLEA/General%20Assemblies/General%20Assembly%202012/GA_draft_programme_overview_final.pdf

This blog helps me to coordinate my ideas and also acts as a record of them. My remit is to introduce this theme: “Open e-Infrastructures for Open Science” which then devolves into 3 parallel sessions:

  • Open global data infrastructure
  • Open scientific content
  • Open research culture

I’m interested in all of these and shall try to address all of them – of course there is a lot of overlap. I apologize for any UK-centricity but I hope the issues are general. I bring the following experience:

  • A practising scientist in “long-tail science”, mainly on the informatics side. Heavily involved in the UK eScience programme.
  • Spent time in industry and academia.
  • Active in the Open Knowledge movement (especially science data and open source code/informatics).

I’ll divide this into these areas and probably be slightly controversial:

What have we learned in the last 10 years of eScience?

The UK eScience programme broke much new ground. Its greatest success was bringing groups of scientists and computationalists together; that continues (e.g. in the Oxford eResearch Centre) and has made the programme eminently worthwhile. But I’ll also comment on things that didn’t work:

  • Top-down design. Technology progresses so rapidly in the Internet world that trying to design the future doesn’t work.
  • Academic-industry infrastructure. The problems of shared vision and a secure collaborative infrastructure are too difficult and expensive for either partner. Instead we should concentrate on areas where industry can share the results of academic work through an Open Infrastructure.
  • Universities are generally not the best place for managing collaborative research infrastructure on an ongoing basis. Institutional repositories do not effectively serve science. In contrast, inter/national research organisations have the infrastructure and the mission to make this happen.
  • There was and is very little investment in infrastructure for “long-tail science” – I exclude bioscience supported by EBI/NCBI, etc. There are no useful repositories for many of the disciplines, few ontologies, and little interest in the dissemination of science.
  • Academia and many scientists are conservative and increasingly driven by self-interest. Open practice will not happen rapidly.

What are the current problems?

The eScience programme has had little impact on the current practice of science. Informatics is carried out using whatever commodity tools are available, and the culture is dominated by commercial scientific publication. This has not changed in 10 years and is now seriously holding back innovation in several ways.

  • The result of research is a “PDF”, not scientific information.
  • The rewards are almost solely based on “citations” – a flawed measure of value.
  • Almost everyone outside academia (and many within) – “the scholarly poor” – is denied effective access to scientific output.
  • Young researchers are stifled by the system and institutionalised.

There is little incentive to change the system or to build a better infrastructure.

And alongside this we have the battle between commercial closed “walled gardens” and Open knowledge (CC-BY, CC0 – anything else is almost valueless). Academia is NOT committed to Openness – it points inwards and builds systems for itself, not the world. And there is a dysfunctional academic-publisher complex which reinforces stagnation.

What are we losing?

We can consider this both in world terms and European terms. There is now huge potential in new information industries downstream of scientific publication (“Google for Science”). I have estimated to the UK Hargreaves enquiry that in chemistry alone this could be worth “low billions” worldwide. Are we going to let Silicon Valley capture yet another new market?

  • We lose the value of the research we fund.
  • We lose the opportunity of creating new information industries.
  • We make seriously bad decisions.
  • Our science is worse – often unchallenged or duplicated.
  • Our culture does not reward change.

What are the growing points?

We must not ignore the rest of the world. Our greatest human capital is OUTSIDE academia. Examples of worldwide growth are:

  • Wikipedia etc. (probably the greatest communal effort to build quality public science in many disciplines)
  • Open Streetmap. An unfunded project that shows what one person can make happen, and which within a few years has become a world resource and standard.
  • Open Source software.
  • Open Knowledge
  • Open science (Open Source Drug Discovery). Open Science moves faster than conventional science because it grows communities rapidly, shares knowledge and avoids mistakes.
  • Internet-aware interest and practice groups (e.g. Malaria World)
  • Young people. One graduate year can create a high-quality growing point: Figshare, Altmetrics, and PMR group (OSCAR/OPSIN chemical NLP, Crystaleye – all now being taken up). Give undergraduates and graduates encouragement to explore and innovate
  • Open publishers (PLoS, Wellcome, BMC(Springer))

What should we do?

We have to change the culture. I don’t know how to do that in detail, but here are some things I’d like to see happen.

  • A scientist-oriented system for scientific research: “ScienceForge”. This has been solved for computer programming (“SourceForge”) without any central investment. It can’t be top-down; it has to grow organically. It has to support scientists in their daily work so naturally that they don’t notice it. A scientist should then be able to share their work anywhere. It should support embargoed publication, on scientists’ terms, not publishers’.
  • 3rd year graduate students designing the informatics structure and training
  • Use multiple metrics for science output not just “citations”
  • Actively involve the “scholarlypoor” outside academia. Reward successful extra-academic enterprise
  • Put national laboratories at the centre of the infrastructure for long-tail scientific information.
  • Develop sustainable profitable business models based on Open practice.

I’ll think of some more on the plane. And I shall, as always, react to what is said before me. I am very impressed with Neelie Kroes and hope I get a chance to meet her.


Congratulations to our Panton Fellows, Sophie and Ross

This blog has been down but I can now congratulate our 2 new Panton fellows, Sophie Kershaw and Ross Mounce. Laura Newman has written an account http://blog.okfn.org/2012/03/30/introducing-our-panton-fellows/ and I took photos.

I will write much more (I’m very busy at present) and will be meeting with both later this month. Congratulations to both.

Read Laura’s blog.


The Guardian Open Day – C21 publishing as it should be

TomMR made me take a day off to go to the Guardian Open Day http://www.guardian.co.uk/news/blog/2012/mar/24/the-guardian-open-weekend-live-blog . For non-UK readers, the Guardian (http://en.wikipedia.org/wiki/The_Guardian, originally the Manchester Guardian) is about 190 years old and one of the few non-profit major daily newspapers. The Guardian put on show many of its regular features and beyond – and for us one of the highlights was the crossword sessions run by “Paul” and “Araucaria”. I’ll devote a blog post to that – you’ll see why.

But the session which most excited me was on the Guardian Open Digital Platform. I’d come across this before, as both Timetric and the OKF have worked with the Guardian, especially on data and data-journalism. The Guardian team is absolutely committed to Openness. They see their content as something to be re-used – for example I could reformat the Guardian and produce my own newspaper. They work with Facebook, creating a new entry point to a different generation of young people, many of whom never read newspapers. No wonder that the Guardian has the second highest online presence in the UK (the much larger and much … Daily Mail is first).

They work with Open source and Open content. They see a vision beyond the traditional newspaper. They don’t know what it looks like or even what role they have in shaping it – leader? Infrastructure? Early adopter? But they want to be the first there.

Literally abutting onto them is a major scientific publisher, Macmillan/Nature Publishing Group. What a contrast!


My response to Hargreaves on copyright reform: I request the removal of contractual restrictions and independent oversight

Jenny Molloy, Diane Cabell, Laura Newman and I have been working to create a considered, hopefully powerful and constructive response to the consultation following the Hargreaves report, which recommends reform of UK copyright. (This is not a formal OKF response – the OKF deliberately does not pursue advocacy – but it has been put together using OKF community processes and tools.) We have created a joint response, but I felt that I could also give personal evidence about the effect of the current publisher-imposed contractual and technical restrictions on information mining.

 

I shall comment later in detail (and hope that this will generate lively discussion). Here I simply highlight my claim that the downstream market for chemical information alone is worth at least a billion, and that much value is lost through the restrictions. I outline some of the types of lost value and, while some are slightly anecdotal, I hope they are compelling. I also make the case for moving control from the publishers to an independent body.

 

I thank Jenny, Diane and Laura for help.

 

Dear Mr Taffy Yui

 

Please find below a response to the IPO [Intellectual Property Office] copyright consultation from Peter Murray-Rust (pm286@cam.ac.uk)

Jenny Molloy
Coordinator, Open Science Working Group
Open Knowledge Foundation

Personal experience and evidence from Professor Peter Murray-Rust.

I have been involved in developing and deploying text and other forms of data mining in chemistry and related sciences (e.g. biosciences and material sciences) for ten years. I have developed open source tools for chemistry (OSCAR [1], OPSIN [2], ChemicalTagger [3]), which have been developed with funding from EPSRC, JISC, DTI and Unilever PLC. These tools represent the de facto open source standard and are used throughout the world. In November 2011, I gave an invited plenary lecture on their use to LBM 2011 (Languages in Biology and Medicine) in Singapore [4]. 


These tools are capable of very high throughput and accuracy. Last week we extracted and analysed 500,000 chemical reactions from the US Patent Office’s service; approximately 100,000 reactions per processor per day. Our machine interpretation of chemical names (OPSIN) is over 99.5% accurate, better than any human. The extractions are complete, factual records of the experiments, to the extent that humans and machines could use them to repeat the work precisely or to identify errors made by the original authors.
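As a concrete illustration of what “machine interpretation of chemical names” means, here is a minimal sketch in Java of calling OPSIN to turn a systematic name into a SMILES string. It is illustrative only: the example name is arbitrary, and the accessor names are taken from OPSIN’s published Java API and may vary between releases.

    import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
    import uk.ac.cam.ch.wwmm.opsin.OpsinResult;

    public class NameToSmiles {
        public static void main(String[] args) {
            // OPSIN's entry point: a parser for systematic chemical names.
            NameToStructure nts = NameToStructure.getInstance();
            // Parse one example name of the kind found in patents and papers.
            OpsinResult result = nts.parseChemicalName("2,4,6-trinitrotoluene");
            if (result.getStatus() == OpsinResult.OPSIN_RESULT_STATUS.SUCCESS) {
                // A machine-readable structure (SMILES) recovered from plain text.
                System.out.println(result.getSmiles());
            } else {
                System.out.println("Could not interpret name: " + result.getMessage());
            }
        }
    }

A loop of essentially this form, run over the hundreds of thousands of names we extract, is what turns free text into the structured, checkable records described above.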


It should be noted that many types of media other than text provide valuable scientific information, especially graphs and tables, images of scientific phenomena, and audio/video captures of scientific factual material. Many publishers and rights agencies would assert that graphs and machine-created images were subject to copyright, while I would call them “facts”. I therefore often use the term “information mining” rather than “text mining”.


It is difficult to estimate the value of this work precisely, because we are currently restricted from deploying it on the current scientific literature by contractual restrictions imposed by all major publishers. However it is not fanciful to suggest that our software could be used in a “Chemical Google” indexing the scientific literature and therefore potentially worth low billions.

Some indications of value are:

1. My research cost £2 million in funding and, because of its widespread applicability, would conservatively be expected to be valued at several times that amount. The UK has a number of highly valued textmining companies such as Autonomy [5], Linguamatics [6], and Digital Science (Macmillan) [7]. Our work is highly valuable to them, as they both use our software [under Open licence] and recruit our staff when they finish. In this sense we have already contributed to UK wealth generation.


2. The downstream value of high quality, high throughput chemical information extracted from the literature can be measured against conventional abstraction services, such as the Chemical Abstracts Service of the ACS [8] and Reaxys [9] from Elsevier, with a combined annual turnover of perhaps $500-1,000 million. We believe our tools are capable of building the next and better generation of chemical abstraction services, and they would be direct competitors in this high value market. This supports our valuation of chemical textmining in the low billions.


3. The value of the tools themselves is difficult to estimate, but chemical informatics has for many years been a traditional SME activity in the UK and would have been expected to grow if textmining had been permitted. Companies such as Hampden Data Services, ORAC, Oxford Molecular and Lhasa have values in the tens to hundreds of millions.


4. I come from a UK pharmaceutical industrial background (15 years in Glaxo). I know from personal experience and discussions with other companies that it is not uncommon for drugs which fail to have post-mortems showing that the reason for failure could have been predicted from the original scientific literature, had it been analysed properly. Such failures can run to $100 million and the lack of ability to use the literature in an effective modern manner must contribute to serious loss of both effort and opportunity. My colleague Professor Steve Ley has estimated that because of poor literature analysis tools 20-25% of the work done in his synthetic chemistry lab is unnecessary duplication or could be predicted to fail. In a 20-year visionary EPSRC Grand Challenge (Dial-a-molecule) Prof Richard Whitby of Southampton is coordinating UK chemists, including industry, to design a system that can predict how to make any given molecule. The top priority is to be able to use the literature in an “artificially intelligent manner” where machines rather than humans can process it, impossible without widespread mining rights.


5. The science and technology of information mining itself is seriously held back by the current contractual restrictions. The acknowledged approach to building quality software is to agree on an open, immutable, ‘gold standard’ corpus of relevant literature, against which machine learning methods are trained. We have been forbidden by rights holders from distributing such corpora, and as a result our methods are seriously delayed (I estimate by at least three years) and are impoverished in their comprehensiveness and applicability. It is difficult to quantify the lost opportunities, but my expert judgement is that by linking scientific facts, such as those in the chemical literature, to major semantic resources such as Linked Open Data [10] and DBPedia [11] an enormous number of potential opportunities arise, both for better practice, and for the generation of new wealth generating tools. 
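For readers unfamiliar with the methodology, scoring against a gold standard simply means comparing a tool’s output with hand-annotated reference text. A minimal sketch in Java, using made-up example entities rather than real corpus data (the class name and the two sets are purely hypothetical):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class GoldStandardEval {
        public static void main(String[] args) {
            // Chemical names a human annotator marked up in one sentence (hypothetical data).
            Set<String> gold = new HashSet<>(Arrays.asList(
                    "2,4,6-trinitrotoluene", "toluene", "nitric acid"));
            // Names a mining tool extracted from the same sentence (hypothetical output).
            Set<String> found = new HashSet<>(Arrays.asList(
                    "2,4,6-trinitrotoluene", "toluene", "sulfuric acid"));

            Set<String> correct = new HashSet<>(found);
            correct.retainAll(gold); // true positives: extracted AND in the gold standard

            double precision = (double) correct.size() / found.size();
            double recall = (double) correct.size() / gold.size();
            double f1 = 2 * precision * recall / (precision + recall);
            System.out.printf("precision=%.2f recall=%.2f F1=%.2f%n", precision, recall, f1);
        }
    }

Because the annotated texts themselves cannot be redistributed, every group must rebuild such reference sets privately – which is precisely the source of the delay estimated above.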


Note: Most of my current work involves factual information, and I believe is therefore not subject to copyright. However, it is impossible to get clarification on this, and publishers have threatened to sue scientists for publishing factual information. I have always erred on the side of caution, and would greatly value clear guidelines from this process, indicating where I have an absolute right to extract without this continuing fear.


In response to Consultation Question 103 


“What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?”


The difficulties I have faced are not even due to copyright problems as I understand them, but to additional contractual and technical barriers that publishers impose on access to their information for the purposes of extracting facts and redistributing them for the good of science and the wider community.


The barriers I have faced over the last five years appear common to all major publishers and include not only technical constraints (e.g. denial of access by publishers’ robot-blocking technology) but also difficulties in establishing what the copyright/contractual restrictions actually are, which I do not wish to break. It is extremely difficult to get clear permission to carry out any work in this field, and while a court might find that I had not been guilty of violating copyright or contract, I cannot rely on this. Therefore, I have taken the safest course of not deploying my world-leading research.


Among the publishers with which I have had correspondence are Nature Publishing Group, the American Chemical Society, the Royal Society of Chemistry, Wiley, Elsevier and Springer. None has given me explicit permission to use their content for the unrestricted extraction of scientific facts by automated means, and many have failed even to acknowledge my request for permission. I have, for example, challenged the assertion made by the Publishing Research Consortium that ‘publishers seem relatively liberal in granting permission’ for content mining. [12]


In conclusion, I stress that any need to request permission drastically reduces the value of text mining. I have spent at least a year’s worth of my time attempting to get permissions rather than actually carrying out my research. At LBM 2011 I asked other participants, and they universally agreed that it was effectively impossible to get useful permissions for text mining. This is backed up by the evidence of Max Haeussler to the US OSTP [13] and his comprehensive analysis of publisher impediments, which shows that it has taken some publishers over two years to agree any permissions, while many others have failed to respond within 30 days of being asked [14]. I do not believe, therefore, that this problem can be solved by goodwill assertions from the publishers. Part of the Hargreaves-initiated reform should be to assert the rights that everyone has in using the scientific factual literature for human benefit.


In response to Consultation Question 77 


“Would an exception for text and data mining that is limited to non commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?”


Non-commercial clauses are completely prejudicial to effective use of text mining, because many of the providers and consumers will be commercial. For example, UK SMEs could not use a corpus produced under these conditions, nor could they develop added downstream value.


I have had discussions with several publishers who have insisted on imposing NC restrictions on material. They are clearly aware of what the NC clause does, and it is difficult to understand their motives in insisting on it, other than to protect the publishers’ own interests by denying widespread exploitation of the content. In two recent peer-reviewed papers, it has been convincingly shown that NC adds no benefits, is almost impossible to operate cleanly, and is highly restrictive of downstream use. [15, 16]

Alternative Solutions:

These contractual restrictions have been introduced unilaterally by publishers without effective challenge from the academic and wider community. The publishers have shown that they are not impartial custodians of the scientific literature. I believe this is unacceptable for the future and that a different process for regulation and enforcement is required. The questions I would wish to see addressed are:

  • Which parts of the scientific literature are so important that they should effectively be available to the public? One would consider, at least:
    • facts (in their widest sense, i.e. including graphs, images, audio/visual)
    • additional material such as design of experiments, caveats from the authors, discussions
    • metadata such as citations, annotations, bibliography
  • Who should decide this? It must not be the publishers. Unfortunately many scientific societies also have a large publishing arm (e.g. the Royal Society of Chemistry) and they cannot be seen as impartial. I would suggest either the British Library, or a subgroup of RCUK and other funding bodies.
  • How should it be policed and how should conflicts be resolved? Where possible the regulator I propose should obtain agreement from all parties before any potential violation. If that is not possible, then the onus should be on the publishers to challenge the miners, through the regulator. Ultimately there is always final recourse to the law.

[1] http://www.jcheminf.com/content/3/1/41

[2] http://pubs.acs.org/articlesonrequest/AOR-PcYgSy87ettZWfqyvHmN

[3] http://www.jcheminf.com/content/3/1/17

[4] http://lbm2011.biopathway.org/

[5] http://www.autonomy.com/

[6] http://www.linguamatics.com/

[7] http://www.digital-science.com/

[8] http://www.cas.org/

[9] https://www.reaxys.com/info/

[10] http://linkeddata.org/

[11] http://dbpedia.org/About

[12] Smit, Eefke and van der Graaf, Maurits, ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011. http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf

[13] http://www.whitehouse.gov/sites/default/files/microsites/ostp/scholarly-pubs-%28%23226%29%20hauessler.pdf

[14] See also Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled ‘Current coverage of Pubmed, Requests for permission sent to publishers’, at http://text.soe.ucsc.edu/progress.html

[15] Hagedorn, Mietchen, Morris, Agosti, Penev, Berendsohn & Hobern, ‘Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information’, ZooKeys 150 (2011), Special issue ‘e-Infrastructures for data publishing in biodiversity science’, 127-149.

[16] Carroll MW (2011) Why Full Open Access Matters. PLoS Biol 9(11): e1001210. doi:10.1371/journal.pbio.1001210



ACS Fall meeting Skolnik Symposium “Molecular Science and the Semantic Web”: Invitation to submit abstracts

As recipients of the Skolnik Award, Henry Rzepa and I are organizing a symposium at the ACS Fall Meeting in Philadelphia (August 19-23): http://portal.acs.org/portal/acs/corg/content?_nfpb=true&_pageLabel=PP_ARTICLEMAIN&node_id=395&content_id=CNBP_029137&use_sec=true&sec_url_var=region1&__uuid=2f5d0717-e31c-47f1-af30-0a1337afd759. Depending on how many abstracts we receive, it will last between 1 and 3 days, most likely 1.5-2. The theme of our symposium is

“Molecular Science and the Semantic Web”

Many readers will already be aware of the symposium and we have already received some abstracts (the deadline is March 25, i.e. this week, and it is strict). However, some may have held back because some symposia in the past have been invitation-only. That is not the case here – anyone may submit an abstract, and Henry and I will be the primary judges of their suitability. The abstracts should address the title above and should ideally have a strong basis in modern Semantic Web thinking and practice (for example “Web 3.0”, but not limited to that). Abstracts are short (150 words) and all abstracts are indexed by Chemical Abstracts and some other indexing agencies.

The “Semantic Web” theme honours the ideas of TimBL (Tim Berners-Lee) and can cover things like tools, Linked Open Data, and Open communities. We are aware that some disciplines may be ahead of chemical practice in the Semantic Web. A small number of presentations might be from “outside chemistry” if the authors can convince us that their work has a direct bearing on future progress in chemistry.

Product placements of tools and data are unlikely to be acceptable.

A very small number of presentations may be remote (with Henry or me managing the real-time process). These are completely at our discretion and are likely to be limited to people we know and can guarantee will provide compelling input.

Please note that the ACS does not provide expenses for speakers.
