Monthly Archives: March 2012

The Guardian Open Day - C21 publishing as it should be

TomMR made me take a day off to go to the Guardian Open day http://www.guardian.co.uk/news/blog/2012/mar/24/the-guardian-open-weekend-live-blog . For non-UK readers, the Guardian (http://en.wikipedia.org/wiki/The_Guardian , originally the Manchester Guardian) is 180 years old and one of the few non-profit, major daily newspapers. The Guardian put on show many of its regular features and beyond – and for us one of the highlights was the crossword sessions run by "Paul" and "Araucaria". I'll devote a blog post to that – you'll see why.

But the session which most excited me was the Guardian Open Digital Platform. I'd come across this before, as both Timetrics and the OKF have worked with the Guardian, especially on data and data-journalism. The Guardian team is absolutely committed to Openness. They see their content as something to be re-used – for example I could reformat the Guardian and produce my own newspaper. They work with Facebook, creating a new entry point for a different generation of young people, many of whom never read newspapers. No wonder that the G has the second highest online presence in the UK (the much larger and much … Daily Mail is first).

They work with Open source and Open content. They see a vision beyond the traditional newspaper. They don't know what it looks like or even what role they have in shaping it – leader? Infrastructure? Early adopter? But they want to be the first there.

Literally abutting them is a major scientific publisher, Macmillan/Nature Publishing Group. What a contrast!

My response to Hargreaves on copyright reform: I request the removal of contractual restrictions and independent oversight

Jenny Molloy, Diane Cabell, Laura Newman and I have been working to create a considered, hopefully powerful and constructive response to the Hargreaves report recommending the reform of UK copyright. (This is not a formal OKF response – OKF deliberately does not pursue advocacy – but has been done using OKF community processes and tools.) We have created a response from all of us, but I felt that I could give personal evidence about the effect of the current publisher-imposed contractual and technical restrictions on information mining.

 

I shall comment later in detail (and hope that this will generate lively discussion). Here I simply highlight my claim that the downstream market for chemical information alone is at least a billion and that much value is lost through the restrictions. I outline some of the types of lost value and, while some are slightly anecdotal, I hope they are compelling. I also make the case for removing control from the publishers to an independent body.

 

I thank Jenny, Diane and Laura for help.

 

Dear Mr Taffy Yui

 

Please find below a response to the IPO [Intellectual Property Office] copyright consultation from Peter Murray-Rust (pm286@cam.ac.uk)

Jenny Molloy
Coordinator, Open Science Working Group
Open Knowledge Foundation

Personal experience and evidence from Professor Peter Murray-Rust.

I have been involved in developing and deploying text and other forms of data mining in chemistry and related sciences (e.g. biosciences and material sciences) for ten years. I have developed open source tools for chemistry (OSCAR [1], OPSIN [2], ChemicalTagger [3]), which have been developed with funding from EPSRC, JISC, DTI and Unilever PLC. These tools represent the de facto open source standard and are used throughout the world. In November 2011, I gave an invited plenary lecture on their use to LBM 2011 (Languages in Biology and Medicine) in Singapore [4]. 


These tools are capable of very high throughput and accuracy. Last week we extracted and analysed 500,000 chemical reactions from the US patent office service; approximately 100,000 reactions per processor per day. Our machine interpretation of chemical names (OPSIN) is over 99.5% accurate, better than any human. The extractions are complete, factual records of the experiment, to the extent that humans and machines could use them to repeat the work precisely or to identify errors made by the original authors. 


It should be noted that many types of media other than text provide valuable scientific information, especially graphs and tables, images of scientific phenomena, and audio/video captures of scientific factual material. Many publishers and rights agencies would assert that graphs and machine-created images were subject to copyright while I would call them "facts". I therefore often use the term "information mining" rather than "text mining".


It is difficult to estimate the value of this work precisely, because we are currently restricted from deploying it on the current scientific literature by contractual restrictions imposed by all major publishers. However it is not fanciful to suggest that our software could be used in a "Chemical Google" indexing the scientific literature and therefore potentially worth low billions.

Some indications of value are:

1. My research cost £2 million in funding, and because of its widespread applicability, would be conservatively expected to be valued at several times that amount. The UK has a number of highly valued textmining companies such as Autonomy [5], Linguamatics [6], and Digital Science (Macmillan) [7]. Our work is highly valuable to them, as they both use our software [under Open licence] and recruit our staff when they finish. In this sense already, we have contributed to UK wealth generation.


2. The downstream value of high quality, high throughput chemical information extracted from the literature can be measured against conventional abstraction services, such as the Chemical Abstracts Service of the ACS [8] and Reaxys [9] from Elsevier, with a combined annual turnover of perhaps $500-1,000 million. We believe our tools are capable of building the next and better generation of chemical abstraction services, and they would be direct competitors in this high value market. This supports our valuation of chemical textmining in the low billions.


3. The value of the tools themselves is difficult to estimate, but Chemical Informatics has for many years been a traditional SME activity in the UK and would have been expected to grow if textmining had been permitted. Companies such as Hampden Data Services, ORAC, Oxford Molecular and Lhasa have values in the 10-100 millions.


4. I come from a UK pharmaceutical industrial background (15 years in Glaxo). I know from personal experience and discussions with other companies that it is not uncommon for drugs which fail to have post-mortems showing that the reason for failure could have been predicted from the original scientific literature, had it been analysed properly. Such failures can run to $100 million and the lack of ability to use the literature in an effective modern manner must contribute to serious loss of both effort and opportunity. My colleague Professor Steve Ley has estimated that because of poor literature analysis tools 20-25% of the work done in his synthetic chemistry lab is unnecessary duplication or could be predicted to fail. In a 20-year visionary EPSRC Grand Challenge (Dial-a-molecule) Prof Richard Whitby of Southampton is coordinating UK chemists, including industry, to design a system that can predict how to make any given molecule. The top priority is to be able to use the literature in an "artificially intelligent manner" where machines rather than humans can process it, impossible without widespread mining rights.


5. The science and technology of information mining itself is seriously held back by the current contractual restrictions. The acknowledged approach to building quality software is to agree on an open, immutable, 'gold standard' corpus of relevant literature, against which machine learning methods are trained. We have been forbidden by rights holders from distributing such corpora, and as a result our methods are seriously delayed (I estimate by at least three years) and are impoverished in their comprehensiveness and applicability. It is difficult to quantify the lost opportunities, but my expert judgement is that by linking scientific facts, such as those in the chemical literature, to major semantic resources such as Linked Open Data [10] and DBPedia [11] an enormous number of potential opportunities arise, both for better practice, and for the generation of new wealth generating tools. 


Note: Most of my current work involves factual information, and I believe is therefore not subject to copyright. However, it is impossible to get clarification on this, and publishers have threatened to sue scientists for publishing factual information. I have always erred on the side of caution, and would greatly value clear guidelines from this process, indicating where I have an absolute right to extract without this continuing fear. 


In response to Consultation Question 103 


"What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?"


The difficulties I have faced are not even due to copyright problems as I understand it, but to additional contractual and technical barriers imposed by publishers to access their information for the purposes of extracting facts and redistributing them for the good of science and the wider community.


The barriers I have faced over the last five years appear common to all major publishers and include not only technical constraints (e.g. the denial of access to the literature by publishers' robot-blocking technology) but also difficulties in establishing what the copyright/contractual restrictions are, which I do not wish to break. It is extremely difficult to get clear permissions to carry out any work in this field, and while a court might find that I had not been guilty of violating copyright/contract, I cannot rely on this. Therefore, I have taken the safest course of not deploying my world-leading research.


Among the publishers with which I have had correspondence are Nature Publishing Group, American Chemical Society, Royal Society of Chemistry, Wiley, Elsevier and Springer. None have given me explicit permission to use their content for the unrestricted access of scientific facts by automated means and many have failed even to acknowledge my request for permission. I have for example challenged the assertion made by the Publishing Research Consortium that 'publishers seem relatively liberal in granting permission' for content mining. [12]


In conclusion, I stress that any need to request permissions drastically reduces the value of text mining. I have spent at least a year's worth of my time attempting to get permissions as opposed to actually carrying out my research. At LBM 2011, I asked other participants, and they universally agreed that it was effectively impossible to get useful permissions for text mining. This is backed up by the evidence of Max Haeussler to the US OSTP [13] and his comprehensive analysis of publisher impediments, where it has taken some publishers over two years to agree any permissions, while many others have failed to respond within 30 days of being asked [14]. I do not believe, therefore, that this problem can be solved by goodwill assertions from the publishers. Part of the Hargreaves-initiated reform should be to assert the rights that everyone has in using the scientific factual literature for human benefit.


In response to Consultation Question 77 


"Would an exception for text and data mining that is limited to non commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?"


Non-commercial clauses are completely prejudicial to effective use of text mining, because many of the providers and consumers will be commercial. For example, the UK SMEs could not use a corpus produced under these conditions, nor could they develop added downstream value. 


I have had discussions with several publishers who have insisted on imposing NC restrictions on material. They are clearly aware of its role, and it is difficult to understand their motives in insisting on NC, other than to protect the publishers' own interests by denying the widespread exploitation of the content. In two recent peer-reviewed papers, it has been convincingly shown that NC adds no benefits, is almost impossible to operate cleanly, and is highly restrictive of downstream use. [15, 16]

Alternative Solutions:
These contractual restrictions have been introduced unilaterally by publishers without effective challenge from the academic and wider community. The publishers have shown that they are not impartial custodians of the scientific literature. I believe this is unacceptable for the future and that a different process for regulation and enforcement is required. The questions I would wish to see addressed are:
Which parts of the scientific literature are so important that they should effectively be available to the public? One would consider, at least:

  • facts (in their widest sense, i.e. including graphs, images, audio/visual)
  • additional material such as design of experiments, caveats from the authors, discussions
  • metadata such as citations, annotations, bibliography

Who should decide this? It must not be the publishers. Unfortunately many scientific societies also have a large publishing arm (e.g. Royal Soc Chem) and they cannot be seen as impartial. I would suggest either the British Library, or a subgroup of the RCUK and other funding bodies.

How should it be policed and conflicts resolved? Where possible the regulator I propose should obtain agreement from all parties before potential violation. If not possible, then the onus should be on the publishers to challenge the miners, through the regulator. Ultimately there is always final recourse to the law.

[1] http://www.jcheminf.com/content/3/1/41

[2] http://pubs.acs.org/articlesonrequest/AOR-PcYgSy87ettZWfqyvHmN


[3] http://www.jcheminf.com/content/3/1/17


[4] http://lbm2011.biopathway.org/


[5] http://www.autonomy.com/


[6] http://www.linguamatics.com/


[7] http://www.digital-science.com/


[8] http://www.cas.org/


[9] https://www.reaxys.com/info/


[10] http://linkeddata.org/


[11] http://dbpedia.org/About


[12] Smit, Eefke and van der Graaf, Maurits, 'Journal Article Mining', Publishing Research Consortium, Amsterdam, May 2011. http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf.


[13] http://www.whitehouse.gov/sites/default/files/microsites/ostp/scholarly-pubs-%28%23226%29%20hauessler.pdf


[14] See also Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled 'Current coverage of Pubmed, Requests for permission sent to publishers', at http://text.soe.ucsc.edu/progress.html

[15] Hagedorn, Mietchen, Morris, Agosti, Penev, Berendsohn & Hobern, 'Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information', ZooKeys 150 (2011): 127-149, Special issue: 'e-Infrastructures for data publishing in biodiversity science'.


[16] Carroll MW (2011) Why Full Open Access Matters. PLoS Biol 9(11): e1001210. doi:10.1371/journal.pbio.1001210


ACS Fall meeting Skolnik Symposium “Molecular Science and the Semantic Web”: Invitation to submit abstracts

As recipients of the Skolnik award, Henry Rzepa and I are organizing the symposium at the ACS Fall meeting in Philadelphia (Aug 19-23): http://portal.acs.org/portal/acs/corg/content?_nfpb=true&_pageLabel=PP_ARTICLEMAIN&node_id=395&content_id=CNBP_029137&use_sec=true&sec_url_var=region1&__uuid=2f5d0717-e31c-47f1-af30-0a1337afd759 . Depending on how many abstracts we receive this will last between 1 and 3 days, most likely 1.5-2. The theme of our symposium is

"Molecular Science and the Semantic Web"

Many readers will already be aware of the symposium and we have already received some abstracts (the deadline is March 25, i.e. this week, and it is strict). However some may have held back, as some symposia in the past have been invitation-only. This is not the case here – anyone may submit an abstract and Henry and I will be the primary judges of their suitability. The abstracts should address the title above and should ideally have a strong basis in modern Semantic Web thinking and practice (for example "Web 3.0" but not limited to that). Abstracts are short (150 words) and all abstracts are indexed by Chemical Abstracts and some other indexing agencies.

The "Semantic Web" theme honours the ideas of TimBL and can cover things like tools, linked open data, and Open communities. We are aware that some disciplines may be ahead of chemical practice in the Semantic Web. A small number of presentations might be from "outside chemistry" if the authors can convince us that their work can have a direct bearing on future progress in chemistry.

Product placements of tools and data are unlikely to be acceptable.

A very small number of presentations may be remote (with Henry or me managing the real-time process). These are completely at our discretion and likely to be limited to people we know and we can guarantee will provide compelling input.

Please note that the ACS does not provide expenses for speakers.

Update; and a semantic amusement for you

I've been extremely busy so this is just to let everyone know I am still working on a number of threads.

  • Chem4Word. We had a really valuable discussion yesterday (sic) in Microsoft Research Cambridge with Alex Wade, Joe Townsend, Clyde Davies and me. We went over the code for Clyde's benefit as he is writing it up (a) for further C4W work and (b) for #semphyssci publication. Even though I was PI of the project and heavily involved there are swathes of code I didn't even know existed. It's a VERY impressive piece of work and Joe Townsend and several others can take great pride. A major part of the next phase is with Nico Adams whom I shall be visiting soon.
  • Panton Fellowships. We are delighted in the very high quality response to the PFs and Laura Newman (OKF) has been doing a great job servicing the applications for us to make decisions on who to interview (skype).
  • Hargreaves. Jenny Molloy, Diane Cabell and I are putting together a response to IPO/Hargreaves. (I've got responses from 6 publishers I wrote to – thanks! – and will summarise and post them to this blog, probably in a day or two.)
  • Semantic Physical Science. I am really excited. We have now developed a completely declarative approach to forcefields such that it should be possible to define the complete problem on the fly using MathML and CML (MathCML). Given that a forcefield (misnamed) evaluates the energy as a function of molecular geometry, atom types and a parameterised forcefield it will be possible to code this in a page or two of declarative code supported by standard libraries. The forcefield can be manipulated (e.g. to calculate derivatives) so it should be possible to both optimise geometry and elaborate trajectories in a declarative manner. With Mark Williamson, Andrew Walker, Martin Dove and Jens Thomas.
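
To make the declarative forcefield idea concrete, here is a minimal sketch in plain Python rather than MathML/CML. Everything in it is an assumption for illustration: the `FORCEFIELD` dictionary layout, the harmonic bond term, and the parameter values (arbitrary units) are hypothetical and are not taken from the project. The point is only that once the forcefield is data, the same generic routines can evaluate the energy, take derivatives (here by finite differences) and optimise geometry.

```python
import math

# Hypothetical declarative description: the forcefield is data, not code.
# Force constants and equilibrium lengths are made-up values in arbitrary units.
FORCEFIELD = {
    "bond_terms": [
        {"atoms": (0, 1), "k": 450.0, "r0": 0.96},
        {"atoms": (0, 2), "k": 450.0, "r0": 0.96},
    ],
}


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy(coords, forcefield):
    """Sum of harmonic bond energies 0.5 * k * (r - r0)^2 over all declared terms."""
    total = 0.0
    for term in forcefield["bond_terms"]:
        i, j = term["atoms"]
        r = distance(coords[i], coords[j])
        total += 0.5 * term["k"] * (r - term["r0"]) ** 2
    return total


def gradient(coords, forcefield, h=1e-6):
    """Central-difference derivative of the energy with respect to every coordinate."""
    grad = []
    for a in range(len(coords)):
        row = []
        for d in range(3):
            plus = [list(c) for c in coords]
            minus = [list(c) for c in coords]
            plus[a][d] += h
            minus[a][d] -= h
            row.append((energy(plus, forcefield) - energy(minus, forcefield)) / (2 * h))
        grad.append(row)
    return grad


def optimise(coords, forcefield, step=1e-3, iterations=500):
    """Plain steepest descent on the declared energy function."""
    coords = [list(c) for c in coords]
    for _ in range(iterations):
        grad = gradient(coords, forcefield)
        for a in range(len(coords)):
            for d in range(3):
                coords[a][d] -= step * grad[a][d]
    return coords


start = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (-0.3, 1.0, 0.0)]
final = optimise(start, FORCEFIELD)
```

Swapping in a different forcefield here means editing the dictionary, not the functions; a symbolic representation such as MathML would additionally allow analytic rather than numerical derivatives.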

And so a semantic amusement.

A cup contains 200 ml of water and an apple (4 cm radius) is placed on top.

From reading this description what can be deduced by:

  • A 10 year-old Anglophone child
  • A first year undergraduate scientist
  • A logician
  • "Shallow thought" – the accumulation of current "AI" – e.g. Wolfram Alpha, True Knowledge, Cyc, Google, Wikipedia and any other engines you think would be relevant (the problem is given to them cold – they are not trained in this domain).

(I'm interested because I want to develop "Shallow thought" for chemistry – more on that later).

#sparc2012 a manifesto in absentia for Open Data

Dear #sparc2012

I am very sorry that I can't be physically present with you, especially since we are at a critical time for #scholpub. I'd have liked to meet and come up with new ideas on how to change the world. As it is, the iffy technology (your words) means I shall write a blog and then either I or John @wilbanks will present it. Maybe John will splash this blog up, click to the relevant pages, and say what he thinks needs saying. Or maybe he'll read it verbatim and perhaps add comments. Whatever. I hope that either way my message will get through.

I've been asked to talk about Rights and Open Data. The Rights I care about are not those of academia, nor authors, nor publishers, but those of the 99% of the world who cannot get effective access to scholarship. The #scholarlypoor. So, if you accept that, here's the first principle:

  1. Access to the fruits of publicly funded research is a fundamental human right

We spend about 300 Billion every year on Science Technology Engineering Medicine (STEM). [I'll probably use "science" to cover all and all figures are +- half an order of magnitude so 100-1000 B USD]. That's about 50 dollars for every inhabitant. And almost no inhabitant (including 99% of those in rich nations) has effective access to this output. Now we have won many human rights and we can win this one.

It will cost money to deliver, but then it will be gratis (free to consume) and libre (free to reuse). I pay for my water. I pay for education. Those who can't afford them still have equal access. These are human rights which we have largely solved. It's the same with science. I can re-use my education as often as I wish without permission.

[Note. The "OA definition" of "libre" as "some restrictions removed" is unacceptable. Some of us (including Wikipedia and OKF) are actively working to make sure "libre" is properly defined and used.]

The second principle

  2. Scientific data should be libre at the point of creation

This is why we initiated the Panton Principles. There are many reasons for making scientific data libre:

  • It belongs to all of us
  • It is required to validate the science (most scientific papers are merely advertisements for the work, not the work itself – who said that?)
  • It can be re-used in millions of planned and unexpected ways

Scientists and academia have lost control of the authoring process. They must regain it, and part of this is to regain control of data. So corollary:

2a. Science data should be stamped as Libre (Panton)

Almost all data is now produced either from instruments or from scientific software. In my field of computational chemistry there is probably 10 B spent per year on computation (machines, people, software), maybe 100 million (computational) jobs or more. All the public fruits of this could be collected and stamped as libre. Similar ideas apply to images (the microscope software could stamp with "Open Data", the phone app taking gel pictures could do the same). Everything on Figshare or Dryad could be watermarked.

Huge amounts of fruitless effort are spent on bad licences. Unfortunately some of these seem deliberate – to confuse rather than help. The latest Wiley paid OA (4000 USD, "fully open access") "Chemistry Open" has so many restrictions that it's effectively closed. Why do the library community and SPARC not challenge this? So the way forward has to be clear licences.

  3. Open Access and Data require clear, libre licences

The Open Access community has failed to address this for 10 years (since BOAI/BBB). BOAI/BBB were/are great declarations, in the tradition of liberation. But most of the OA community honours them in name only. Everyone at #sparc2012 should be able to recite, by heart:

"By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited."

And, for the purposes of data, free to "use, re-use, and redistribute" (OKDefinition).

    3a. Libre material should be clearly stamped for human and machine discoverability and reuse

UK/PubMedCentral shows the current problem clearly. It is impossible to search for "Open Access" material and even harder – almost impossible – to search for BOAI-libre material (i.e. minable). Our recent @ccess group is trying to index the malaria literature for BOAI-Openness and it has to be done paper-by-paper – IMO this is unacceptable after 10 years. University IRs are even worse. Here's mine http://www.lib.cam.ac.uk/repository/about/end_user_terms.html

Unless otherwise noted, Deposited Works in DSpace@Cambridge are made freely available for access, printing and download for the purposes of non-commercial research or private study only.

So – Institutional Repositories are set up just for academics – no one else matters. You can't use DSpace@cam for:

  • Teaching schoolchildren
  • Ideas for high-tech business (Cambridge is the UK's centre for high-tech)
  • Helping a patient understand their disease
  • Writing books
  • and 101 more examples (see Mike Taylor's http://whoneedsaccess.org )

So the next principle

  4. Only use CC-BY, CC0 and other BOAI-compliant licences.

Abandon NC, Non-commercial. It effectively prevents anything useful. (Maybe Mike Carroll will cover this, but it needs restating again and again). Corollary:

    4a. Publishers of Open Access ("Gold OA") should use BOAI-compliant licences.

Ross Mounce (a graduate student) has done a tremendous job of collating the hybrid OA licences of major publishers and out of over 100 finds that only 5% are BOAI-compliant. Authors are paying lots of money (1000-5000 USD) for this, publishers are restricting re-use to the point of uselessness and academia accepts this without a squeak. Surely this is where SPARC should be labelling offerings as BOAI-acceptable or non-acceptable. But no, we have given in and allowed this mess of "slightly Open Access". Some of the publisher terms are so badly written, piling restriction on restriction, that they are probably not even executable consistently.

And now some more general ideas on "textmining". Over the last 2 weeks I have blogged about information mining (a better term as we can mine images, speech and video for facts as well). The core is defined in: http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/ . I've been trying to do this for years and failing to get permission. Publishers (and libraries) have a three-valued logic:

  • Yes you may (very rare)
  • No you may not (common)
  • Mumble (make nice noises but actually say nothing). Mumble includes "let's set up a meeting", "let's talk with your librarians" "I'll refer you to our director of marketing" and much else. Mumble means hoping the problems will go away. Silence is Mumble.

Understand that mining is about LOTS of information. And the web can cope with lots of information. As an example ONE STUDENT (albeit the very smart Daniel Lowe) last week downloaded 65,000 patents and extracted 500,000 chemical reactions AND interpreted the text and diagrams AS ATOMS and MOLECULES. Very high quality semantic information. Here's what we CAN mine:

  • Patents
  • BMC and PLoS
  • Supplementary info on publisher's web pages (FWIW we have downloaded 250,000 crystal structures in this way and haven't crashed any servers)

What we CAN'T mine are:

  • Closed access publications
  • Green OA (can't find it and anyway no rights)
  • Gold hybrid OA (can't find it and cannot machine-read the licences to find BOAI)
  • UK/PubMedCentral (impossible to find the BOAI-compliant subset)
  • Institutional Repositories (impossible to navigate and no rights in most cases anyway)
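
If licences were stamped in a machine-readable way, finding the minable subset would be a few lines of code. The following is a sketch only: the record layout and the `licence_url` field are hypothetical, and treating just CC-BY and CC0 as BOAI-compliant is a simplifying assumption for the example, not a definitive reading of BOAI.

```python
import re

# Assumption for this sketch: only CC-BY and CC0 are treated as BOAI-compliant.
BOAI_COMPLIANT = {"by", "zero"}

# Matches Creative Commons licence URLs such as
# http://creativecommons.org/licenses/by-nc/3.0/ and captures the licence code.
CC_URL = re.compile(
    r"^https?://creativecommons\.org/(?:licenses|publicdomain)/([a-z-]+)/\d+\.\d+"
)


def licence_code(url):
    """Return the CC licence code ('by', 'by-nc', 'zero', ...) or None if unrecognised."""
    match = CC_URL.match(url.strip().lower())
    return match.group(1) if match else None


def is_minable(record):
    """True only when the record carries a recognised, BOAI-compliant licence stamp."""
    url = record.get("licence_url") or ""
    return licence_code(url) in BOAI_COMPLIANT


# Hypothetical metadata records; the field names are invented for illustration.
records = [
    {"title": "Paper A", "licence_url": "http://creativecommons.org/licenses/by/3.0/"},
    {"title": "Paper B", "licence_url": "http://creativecommons.org/licenses/by-nc/3.0/"},
    {"title": "Dataset C", "licence_url": "http://creativecommons.org/publicdomain/zero/1.0/"},
    {"title": "Paper D"},  # no licence stamp at all: cannot be assumed minable
]

minable_titles = [r["title"] for r in records if is_minable(r)]
```

Note the design choice that a missing or unrecognised stamp counts as "not minable": this mirrors the cautious position described above, where the absence of a clear licence means the content cannot safely be used.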

I've been asking Elsevier for 2.5 years whether we can text-mine. Have I got "Yes" or have I got "Mumble"? I'll post today's mail and let you judge.

I'm not the only one. Here's Max Haeussler, writing to publishers for permission to text-mine http://text.soe.ucsc.edu/progress.html . Some have taken two years of negotiations. Half haven't responded. This is an industry that Eefke Smit says is extremely helpful to requesters.

Where the publishers do respond, they want to control what research we can and can't do. See UBC and Heather Piwowar negotiating with Elsevier http://researchremix.wordpress.com/2012/03/05/talking-text-mining-with-elsevier/ . "Alicia had indicated that Elsevier facilitates text mining on a project-by-project basis". For me this is unacceptable. There is no reason in the world why Elsevier or any other publisher should "facilitate my research". ("facilitate" is Newspeak as is "Universal Access" – means of restricting access). HP: "two of my text mining use cases require reuse rights that are outside the standard Elsevier agreement". Yes, Elsevier writes additional restrictions in the contract. HP: "I asked for the text of the standard reuse agreement. It was sent to me but I was asked not to share it publicly because 'it is a legal element'". So we cannot even know what we aren't allowed to do. This is Universal Access??

So another right:

  5. Universities/subscribers should refuse to sign any contracts more restrictive than copyright itself; they should publicize the contracts

Librarians or purchasing officers should read contracts before they sign them and publicise any clauses which restrict subscribers' Rights. (I can't imagine how librarians all over the world have signed that we may not index the literature we buy (sorry, rent).)

    5a. Where the extracted material (e.g. facts) does not violate copyright then it should be made public and posted Openly under a libre licence

I'd like to think that all involved (publishers, universities) would wish to sign up to the principles I have outlined. I'm happy for #sparc2012 to take any or all of them and wordsmith them.

And finally I'd like volunteers to work with me on extracting useful factual chemistry from the closed literature.

 

Permission for information-mining : Update and response from Royal Society of Chemistry

In our current search [some requests only went out on Saturday] for factual information from publishers on permission for "text-mining" the position is:

  • Elsevier. Permission granted in principle for PM-R. [PMR and community are now gearing up to extract factual chemistry from Elsevier journals. First step will be to create a complete index of all content (e.g. in Open biblio/Bibsoup) and then decide on strategy. Top driver is Mat Todd's need to find antimalarial compounds – so we'll look in chemistry journals first.]
  • Wiley. Request [2012-03-07] to Bob Campbell transferred to Duncan Campbell.
  • Nature. Request to Philip Campbell [2012-03-10] transferred to appropriate department.
  • American Chemical Society. No reply yet.
  • Springer. Request [2012-03-10] transferred internally and significant useful response [next mail]
  • Royal Society of Chemistry. [2012-03-10] Significant response from Richard Kidd. See below. Note that Richard has often commented on this blog.

    Dear Peter 

    Thanks for your request. It's good to see from this and the accompanying blog post you still have some positive memories of text mining with some publishers. So far, we have mainly supplied articles for academic text mining purposes as one-off deliveries – such as for the SESL project, and the 50,000+ articles we supplied to both the ChETA and TREC Chem projects. Often it is easier for miners to bulk load within their own systems than crawling to collect, but we recognise that times are changing.

    We ask you talk to your librarian colleagues, both in terms of them being happy with what you're doing under the agreed licenses with RSC, and so they understand what ongoing value the results of any mining exercise derives from the RSC subscription.  

    This ongoing value issue is important in terms of text mining implications for us. Along with most publishers we supply counter stats to librarians of usage within their institution – and, as you know, when renewal times comes these are used to judge which journals are of most value. Our concern is if the mining extracts and republishes sufficient content from the publications as to reduce apparent usage (and citation) of the published papers in future. At the moment full text downloads are the major measure we have (rightly or wrongly in principle) for the librarian to judge if publications are of value to the institution, and republication of extracted facts and data at least potentially could affect this. Done right, the effect can be positive, but it could also be detrimental.

    Some of Cameron's suggested principles of research data mining would have been a valuable addition to your proposed non-negotiables, to reduce concerns that future derived would reduce usage of the original papers by your institution and others:

    * Always link back to the version of record of the research output you have mined.

    * Include elements and snippets by reference, not by value. Restrict content replication to that reasonably allowed by Fair Use provisions or enabled by licences, and required for efficient services

    * Only redistribute content where copyright terms explicitly allow it

    * Respect API service limits where posted and develop polite tooling with exponential back-off where appropriate

    (a couple of principles deleted, due to non-relevance to this specific question rather than disagreement)

    Finally, a correction. You say we cut off access a  few years ago. My recollection is slightly different and I have the correspondence if you'd like to  see it, from 2006. We didn't cut you off, though we suggested we would block one IP address if the downloading continued without any contact. We discussed it amicably – explanation made it clear and the download behaviour was modified for both sides to be happy with continuation. But it's an excellent illustration of why we appreciate being asked about the approach – as in this case the downloader was trying to retrieve non-existent issues, filling our developers' mailboxes with 404 alerts. So while you think we're only concerned about server load with on-demand mining, you can end up killing other systems we have to improve customer service. Mike Taylor clearly values publishers who try to stay on top of broken links ;-)

    I would also ask that you include our response verbatim if you are using it in any of your Hargreaves submissions, and of course we will be preparing our own submission. 

    In summary, we would strongly appreciate discussion on the extent of the factual information you intend to republish (I have seen the examples on the blog), together with the involvement of your librarian colleagues in the process – for current agreements, and effects on future usage and value measures.

    Best wishes

     

    Richard

This is a useful response. It doesn't, however, give me permission to text-mine RSC content. It suggests I contact my librarian. I have done so at regular intervals – I think they recognise I don't need technical help from them – I am simply alerting them to what I am doing.

Text-mining distorts the publisher metrics on value? Surely that can be overcome technically. If that's the only problem let's create a dark cache and I'll play in the sandbox. This is the sort of thing where, with goodwill on both sides, a solution is straightforward.

Is it progress? Difficult to say – it's no good to me or Mat Todd as it doesn't advance my current ability to mine the RSC literature.
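Richard's request for "polite tooling with exponential back-off", and his complaint about repeated 404s, can both be addressed in a few lines. What follows is a minimal Python sketch under stated assumptions, not the crawler actually used here. The names `backoff_delays`, `polite_fetch`, `http_get`, the `USER_AGENT` string and the set of retryable status codes are all hypothetical choices made for illustration. The idea is to double the wait after each temporary failure, and to give up at once on a permanent failure such as a 404 rather than asking again.

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical identifying string so the publisher can see who is crawling.
USER_AGENT = "example-miner/0.1 (contact: miner@example.org)"

# Assumption: these statuses are treated as temporary and worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, retries=5):
    """Waiting times in seconds before each retry: base, base*factor, ... capped."""
    return [min(base * factor ** i, max_delay) for i in range(retries)]


def http_get(url):
    """Default fetcher: return (status_code, body_bytes) for a URL."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def polite_fetch(url, fetch=http_get, sleep=time.sleep,
                 retries=5, base=1.0, jitter=True):
    """Fetch a URL, backing off exponentially on temporary errors.

    Returns the body on success, or None if the error is permanent
    (for example a 404) or the retries are exhausted.
    """
    delays = backoff_delays(base=base, retries=retries)
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        # Permanent failure, or no retries left: stop asking.
        if status not in RETRYABLE or attempt == retries:
            return None
        wait = delays[attempt]
        if jitter:
            # A little randomness so parallel miners do not retry in lockstep.
            wait += random.uniform(0, base)
        sleep(wait)
    return None
```

The `fetch` and `sleep` parameters are injectable only so the logic can be exercised without touching a real server. A real miner would also respect any limits the publisher posts.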

Information mining from Springer full-text: I ask for freedom

This is the last of the current series of requests to publishers for freedom to mine factual information. Note "freedom", not "permission". I don't ask permission to speak in public, I take it as a freedom. I have now sent such requests to Elsevier, Springer, Wiley, Royal Soc Chemistry, Amer. Chem. Soc. and Nature Publishing Group. [If anyone wishes to contact other publishers feel free to use the text of my letters and let me know].

I'll publish updates, hopefully daily, with publisher responses. I've given everyone a hard deadline because Hargreaves/IPO has a hard deadline.

Wim van der Stelt (Executive Vice President Corporate Strategy) is the only person whose email address I know in Springer so I hope he can find the right place for a rapid answer.

Wim,
[We corresponded earlier. If you are not the correct person in Springer to answer the question below please can you forward it to the person who is, let me know their name/email and ask them to reply substantively to me.]

We are making representations in response to the Hargreaves report and in particular about the freedom to use machines to extract and publish factual information from scientific publications without legal and technical barriers.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to Springer journals and we would like to begin to extract information on a systematic basis for Open scientific research. This applies to all Springer journals, not just BMC and Springer Open. We don't need technical help or permission from Springer. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from Springer and (b) that we shall not be cut off by Springer robots. We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from Springer journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of science. Elsevier has already publicly said we can mine their content for research and we'll be publishing the facts under an Open licence.

Best wishes,

Peter

Hargreaves and information mining: I request freedom to mine factual data from Nature Publishing Group full-text

I have sent the following letter to Philip Campbell, editor of Nature. In it I request freedom to mine factual information without legal or technical barriers. We have worked closely with Timo Hannay (then of NPG and now of Digital Science, another Macmillan company in the same building). Digital Science has great interest in published information and (maybe) uses some of our toolkit such as OPSIN.

Philip,
We are making a submission in response to the Hargreaves report and specifically about the freedom to extract and publish factual information from scientific publications. I have appreciated your cooperation in the past over the requirement to publish data that supports scientific research. I have copied Timo who, as you know, has supported our research here in developing semantic informatics, including tools for extraction. This involved a summer student and in-kind support for our Sciborg (EPSRC) project. You'll know that two of our staff have since joined Timo's Digital Science; and we are very proud to produce valuable human resources.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (OPSIN accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to NPG journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don't need technical help or permission from NPG. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from NPG and (b) that we shall not be cut off by NPG robots (unfortunately this happened some years ago). We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from the full text of NPG journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Elsevier has already publicly said we can mine their content for research and we'll be publishing the facts under an Open licence.

Best wishes,

Peter

Hargreaves and Information mining: I ask the American Chemical Society for freedom to mine factual data

Here is a letter I have sent to Madeleine Jacobs, CEO of the American Chemical Society (ACS) and former director of publications. In it I ask for freedom to extract factual information from the full text of ACS journals.

Henry Rzepa and I are joint recipients of a prestigious ACS award and are organizing a symposium in the Fall. We hope to be able to show what we have managed to do with extraction of factual data from full-text. Here I ask Madeleine for assurance we can do this without barriers from ACS.

 

Dear Madeleine,

Unfortunately we've not yet been able to meet though our paths have crossed for several years. (I have copied in Dave Martinsen in ACS Publications whom I have known for 20 years).

You'll know that I am this year's recipient (joint with Henry Rzepa) of the Society's CINF Division Herman Skolnik award. Part of the award is for our work in machine extraction of semantic chemical information (in Chemical Markup Language, CML) and re-use for new scientific opportunities. As a Skolnik medallist Henry and I are organizing part of this year's Fall CINF meeting and shall be demonstrating some of our achievements. In particular we wish to show the great opportunities that semantic chemistry gives and particularly the ability to use the factual information in the primary literature.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall. For example, our OPSIN name-to-structure tool (published last year in the Society's J. Chem. Inf. Model. [1] and highly accessed) has accuracy > 99.5% and recall > 95%. The University of Cambridge is a subscriber to ACS journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don't need technical help or permission from the ACS. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from ACS and (b) that we shall not be cut off by ACS robots (unfortunately this happened some years ago even though we hadn't violated anything). We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from ACS journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Alicia Wise, Director of Universal Access at Elsevier, has already publicly said we can mine their content for research and we'll be publishing their factual data under an Open licence. As a result we should have a great opportunity to show the power of the semantic approach at the Fall Symposium.

And, of course, I would be delighted to meet you there!

Best wishes,

Peter

[1] http://pubs.acs.org/doi/abs/10.1021/ci100384d?journalCode=jcisd8


Information mining for Hargreaves and Open Science: I ask the Royal Society of Chemistry

I've now asked the Royal Society of Chemistry for permission to extract factual information from the journals to which Cambridge subscribes. For background for non-chemists, the RSC has supported our research in information mining through funding summer students, and in kind for the Sciborg (EPSRC) and the CheTA (JISC) projects. For example, our Experimental Data Checker (OSCAR2) is hosted on the RSC website and very widely used for checking the quality of chemical papers before and after publication. Chemspider is a novel, volunteer-populated resource for collecting and validating chemical information (http://www.chemspider.com ).

David, Richard,
We are preparing a response to the Hargreaves report about information mining from scientific publications. As you know we have developed a world class set of Open Source tools for chemical information extraction, some of them with your support - for which public thanks!

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (OPSIN accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to RSC journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don't need technical help or permission from the RSC. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from RSC and (b) that we shall not be cut off by RSC robots (unfortunately this happened some years ago). We wish to start immediately to show Hargreaves the benefit of information mining - they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from RSC journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Elsevier has already publicly said we can mine their content for research and we'll be publishing the facts under an Open licence. This means that Chemspider (Tony Williams copied) can immediately use all this information in the Chemspider resource.

Best wishes,

Peter

One of the immediate benefits is our collaboration with Mat Todd (Sydney) who is running an Open project for discovering novel antimalarials. The RSC publishes much high-quality research in (for example) its journal "Organic and Biomolecular Chemistry" and Mat will be able to scan the factual list of factual compounds and factual data for leads to develop antimalarials.