Update; and a semantic amusement for you

I’ve been extremely busy, so this is just to let everyone know I am still working on a number of threads.

  • Chem4Word. We had a really valuable discussion yesterday (sic) in Microsoft Research Cambridge with Alex Wade, Joe Townsend, Clyde Davies and me. We went over the code for Clyde’s benefit as he is writing it up (a) for further C4W work and (b) for #semphyssci publication. Even though I was PI of the project and heavily involved there are swathes of code I didn’t even know existed. It’s a VERY impressive piece of work and Joe Townsend and several others can take great pride. A major part of the next phase is with Nico Adams whom I shall be visiting soon.
  • Panton Fellowships. We are delighted with the very high quality of the response to the PFs, and Laura Newman (OKF) has been doing a great job servicing the applications so that we can decide whom to interview (by Skype).
  • Hargreaves. Jenny Molloy, Diane Cabell and I are putting together a response to IPO/Hargreaves. (I’ve got responses from 6 publishers I wrote to – thanks! – and will summarise and post them to this blog, probably in a day or two.)
  • Semantic Physical Science. I am really excited. We have now developed a completely declarative approach to forcefields such that it should be possible to define the complete problem on the fly using MathML and CML (MathCML). Given that a forcefield (misnamed) evaluates the energy as a function of molecular geometry, atom types and a parameterised forcefield it will be possible to code this in a page or two of declarative code supported by standard libraries. The forcefield can be manipulated (e.g. to calculate derivatives) so it should be possible to both optimise geometry and elaborate trajectories in a declarative manner. With Mark Williamson, Andrew Walker, Martin Dove and Jens Thomas.
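
To give a flavour of the declarative forcefield idea, here is a minimal sketch in Python rather than MathML/CML. Everything in it is an assumption made for illustration: the names `FORMS`, `FORCEFIELD`, `energy` and `gradient`, the single harmonic bond term and the parameter values are all hypothetical, and the derivative is taken by finite differences instead of by manipulating the expression symbolically as MathCML would allow.

```python
import math

# Functional forms are looked up by name, so the forcefield itself is pure data.
FORMS = {
    # harmonic bond stretch: E = 0.5 * k * (r - r0)^2
    "harmonic_bond": lambda r, k, r0: 0.5 * k * (r - r0) ** 2,
}

# Hypothetical parameter table keyed by a sorted pair of atom types.
# The numbers are illustrative only, not taken from any real forcefield.
FORCEFIELD = {
    ("H", "O"): {"form": "harmonic_bond", "k": 500.0, "r0": 0.96},
}

def energy(coords, atom_types, bonds):
    """Energy as a function of geometry, atom types and the parameter table."""
    total = 0.0
    for i, j in bonds:
        term = FORCEFIELD[tuple(sorted((atom_types[i], atom_types[j])))]
        r = math.dist(coords[i], coords[j])
        total += FORMS[term["form"]](r, term["k"], term["r0"])
    return total

def gradient(coords, atom_types, bonds, h=1e-6):
    """Central finite-difference derivative of the energy for every coordinate."""
    grad = []
    for i in range(len(coords)):
        row = []
        for axis in range(3):
            plus = [list(c) for c in coords]
            minus = [list(c) for c in coords]
            plus[i][axis] += h
            minus[i][axis] -= h
            diff = energy(plus, atom_types, bonds) - energy(minus, atom_types, bonds)
            row.append(diff / (2 * h))
        grad.append(row)
    return grad

# A water-like geometry with one stretched O-H bond.
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (0.0, 1.06, 0.0)]
atom_types = ["O", "H", "H"]
bonds = [(0, 1), (0, 2)]
print(energy(coords, atom_types, bonds))
print(gradient(coords, atom_types, bonds))
```

The point of the sketch is only that the evaluator never mentions a specific molecule or parameter: swapping the table swaps the forcefield, and the gradient is what a geometry optimiser or a trajectory would consume.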

And so a semantic amusement.

A cup contains 200 ml of water and an apple (4 cm radius) is placed on top.

From reading this description what can be deduced by:

  • A 10 year-old Anglophone child
  • A first year undergraduate scientist
  • A logician
  • “Shallow thought” – the accumulation of current “AI” – e.g. Wolfram Alpha, True Knowledge, Cyc, Google, Wikipedia and any other engines you think would be relevant (the problem is given to them cold – they are not trained in this domain).

(I’m interested because I want to develop “Shallow thought” for chemistry – more on that later).

Posted in Uncategorized | Leave a comment

#sparc2012 a manifesto in absentia for Open Data

Dear #sparc2012

I am very sorry that I can’t be physically present with you, especially since we are at a critical time for #scholpub. I’d have liked to meet and come up with new ideas on how to change the world. As it is the iffy technology (your words) means I shall write a blog and then either I or John @wilbanks will present it. Maybe John will splash this blog up, click to the relevant pages, and say what he thinks needs to be said. Or maybe he’ll read it verbatim and perhaps add comments. Whatever. I hope that either way my message will get through.

I’ve been asked to talk about Rights and Open Data. The Rights I care about are not those of academia, nor authors, nor publishers, but those of the 99% of the world who cannot get effective access to scholarship. The #scholarlypoor. So here’s the first principle – if you accept that then

  1. Access to the fruits of publicly funded research is a fundamental human right

We spend about 300 Billion every year on Science Technology Engineering Medicine (STEM). [I’ll probably use “science” to cover all and all figures are +- half an order of magnitude so 100-1000 B USD]. That’s about 50 dollars for every inhabitant. And almost no inhabitant (including 99% of those in rich nations) has effective access to this output. Now we have won many human rights and we can win this one.

It will cost money to deliver, but then it will be gratis (free to consume) and libre (free to reuse). I pay for my water. I pay for education. Those who can’t afford them still have equal access. These are human rights which we have largely solved. It’s the same with science. I can re-use my education as often as I wish without permission.

[Note. The “OA definition” of “libre” as “some restrictions removed” is unacceptable. Some of us (including Wikipedia and OKF) are actively working to make sure “libre” is properly defined and used.]

The second principle

  2. Scientific data should be libre at the point of creation

This is why we initiated the Panton Principles. There are many reasons for making scientific data libre:

  • It belongs to all of us
  • It is required to validate the science (most scientific papers are merely advertisements for the work, not the work itself – who said that?)
  • It can be re-used in millions of planned and unexpected ways

Scientists and academia have lost control of the authoring process. They must regain it, and part of this is to regain control of data. So corollary:

2a. Science data should be stamped as Libre (Panton)

Almost all data is now produced either from instruments or from scientific software. In my field of computational chemistry there is probably 10 B spent per year on computation (machines, people, software), and maybe 100 million (computational) jobs or more. All the public fruits of this could be collected and stamped as libre. Similar ideas apply to images (the microscope software could stamp with “Open Data”, the phone app taking gel pictures could do the same). Everything on Figshare or Dryad could be watermarked.
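
As a rough illustration of how cheap such stamping could be, here is a minimal Python sketch. It assumes the software writes its results as a simple key-value record; the function name `stamp_libre`, the field names and the example record are all hypothetical. CC0 is used because it is one of the dedications the Panton Principles recommend.

```python
import json

# Assumed stamp; the field names are made up for this sketch.
LIBRE_STAMP = {
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "license_name": "CC0 1.0",
    "panton_principles": True,
}

def stamp_libre(record):
    """Return a copy of a result record carrying a machine-readable libre stamp."""
    stamped = dict(record)  # leave the caller's record untouched
    stamped.update(LIBRE_STAMP)
    return stamped

# Hypothetical output of a computational chemistry job (values are illustrative).
job = {"program": "hypothetical-qm-code", "molecule": "H2O", "energy_hartree": -76.0}
print(json.dumps(stamp_libre(job), indent=2, sort_keys=True))
```

If the stamp is written by the program at the moment the data is created, nobody has to remember to add it later.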

Huge amounts of fruitless effort are spent on bad licences. Unfortunately some of these seem deliberate – to confuse rather than help. The latest Wiley paid OA (4000 USD, “fully open access”) “Chemistry Open” has so many restrictions that it’s effectively closed. Why do the library community and SPARC not challenge this? So the way forward has to be clear licences.

  3. Open Access and Data require clear, libre licences

The Open Access community has failed to address this for 10 years (since BOAI/BBB). BOAI/BBB were/are great declarations, in the tradition of liberation. But most of the OA community honours them in name only. Everyone at #sparc2012 should be able to recite, by heart:

“By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

And, for the purposes of data, free to “use, re-use, and redistribute” (OKDefinition).

    3a. Libre material should be clearly stamped for human and machine discoverability and reuse

UK/PubMedCentral shows the current problem clearly. It is impossible to search for “Open Access” material and even harder – almost impossible – to search for BOAI-libre material (i.e. minable). Our recent @ccess group is trying to index the malaria literature for BOAI-Openness and it has to be done paper-by-paper – IMO this is unacceptable after 10 years. University IRs are even worse. Here’s mine http://www.lib.cam.ac.uk/repository/about/end_user_terms.html

Unless otherwise noted, Deposited Works in DSpace@Cambridge are made freely available for access, printing and download for the purposes of non-commercial research or private study only.

So – Institutional Repositories are set up just for academics – no one else matters. You can’t use DSpace@cam for:

  • Teaching schoolchildren
  • Ideas for high-tech business (Cambridge is the UK’s centre for high-tech)
  • Helping a patient understand their disease
  • Writing books
  • and 101 more examples (see Mike Taylor’s http://whoneedsaccess.org )

So the next principle

  4. Only use CC-BY, CC0 and other BOAI-compliant licences.

Abandon NC, Non-commercial. It effectively prevents anything useful. (Maybe Mike Carroll will cover this, but it needs restating again and again). Corollary:

    4a. Publishers of Open Access (“Gold OA”) should use BOAI-licences.

Ross Mounce (a graduate student) has done a tremendous job of collating the hybrid OA licences of major publishers and out of over 100 finds that only 5% are BOAI-compliant. Authors are paying lots of money (1000-5000 USD) for this, publishers are restricting re-use to the point of uselessness, and academia accepts this without a squeak. Surely this is where SPARC should be labelling offerings as BOAI-acceptable or non-acceptable. But no, we have given in and allowed this mess of “slightly Open Access”. Some of the publisher terms are so badly written, piling restriction on restriction, that they are probably not even executable consistently.

And now some more general ideas on “textmining”. Over the last 2 weeks I have blogged about information mining (a better term as we can mine images, speech and video for facts as well). The core is defined in: /pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/ . I’ve been trying to do this for years and failing to get permission. Publishers (and libraries) have a three-valued logic:

  • Yes you may (very rare)
  • No you may not (common)
  • Mumble (make nice noises but actually say nothing). Mumble includes “let’s set up a meeting”, “let’s talk with your librarians” “I’ll refer you to our director of marketing” and much else. Mumble means hoping the problems will go away. Silence is Mumble.

Understand that mining is about LOTS of information. And the web can cope with lots of information. As an example ONE STUDENT (albeit the very smart Daniel Lowe) last week downloaded 65,000 patents and extracted 500,000 chemical reactions AND interpreted the text and diagrams AS ATOMS and MOLECULES. Very high quality semantic information. Here’s what we CAN mine:

  • Patents
  • BMC and PLoS
  • Supplementary info on publisher’s web pages (FWIW we have downloaded 250,000 crystal structures in this way and haven’t crashed any servers)

What we CAN’T mine are:

  • Closed access publications
  • Green OA (can’t find it and anyway no rights)
  • Gold hybrid OA (can’t find it and cannot machine-read the licences to find BOAI)
  • UK/PubMedCentral (impossible to find the BOAI-compliant subset)
  • Institutional Repositories (impossible to navigate and no rights in most cases anyway)
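
If every article carried a machine-readable licence, finding the minable subset would be a few lines of code rather than a paper-by-paper slog. A minimal Python sketch, assuming each article record has a `license` field holding a short identifier; the record layout, the identifiers and the function name `minable` are hypothetical, and the set of acceptable licences simply follows the principle above (CC-BY and CC0).

```python
# Assumed short identifiers for licences treated as BOAI-compliant in this sketch.
BOAI_COMPLIANT = {"CC-BY", "CC0"}

def minable(records):
    """Keep only records whose stated licence is in the BOAI-compliant set.

    Records with no licence field, or with NC-style licences, are excluded:
    if a machine cannot see the right to reuse, it cannot act on it.
    """
    return [r for r in records if r.get("license") in BOAI_COMPLIANT]

# Hypothetical article records.
articles = [
    {"id": "a1", "license": "CC-BY"},
    {"id": "a2", "license": "CC-BY-NC"},
    {"id": "a3"},  # no licence stated at all
    {"id": "a4", "license": "CC0"},
]
print([r["id"] for r in minable(articles)])
```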

I’ve been asking Elsevier for 2.5 years whether we can text-mine. Have I got “Yes” or have I got “Mumble”? I’ll post today’s mail and let you judge.

I’m not the only one. Here’s Max Haussler, writing to publishers for permission to text-mine http://text.soe.ucsc.edu/progress.html . Some have taken two years of negotiations. Half haven’t responded. This is an industry that Eefke Smit says is extremely helpful to requesters.

Where the publishers do respond, they want to control what research we can and can’t do. See UBC and Heather Piwowar negotiating with Elsevier http://researchremix.wordpress.com/2012/03/05/talking-text-mining-with-elsevier/ . “Alicia had indicated that Elsevier facilitates text mining on a project-by-project basis”. For me this is unacceptable. There is no reason in the world why Elsevier or any other publisher should “facilitate my research”. (“facilitate” is Newspeak as is “Universal Access” – means of restricting access). HP: “two of my text mining use cases require reuse rights that are outside the standard Elsevier agreement”. Yes, Elsevier writes additional restrictions in the contract. HP: “I asked for the text of the standard reuse agreement. It was sent to me but I was asked not to share it publicly because ‘it is a legal element'”. So we cannot even know what we aren’t allowed to do. This is Universal Access??

So another right:

  5. Universities/subscribers should refuse to sign any contracts more restrictive than copyright itself; they should publicize the contracts

Librarians or purchasing officers should read contracts before they sign them and publicise any clauses which restrict subscribers’ Rights. (I can’t imagine how librarians all over the world have signed that we may not index the literature we buy (sorry, rent).)

    5a. Where the extracted material (e.g. facts) does not violate copyright then it should be made public and posted Openly under a libre licence

I’d like to think that all involved (publishers, universities) would wish to sign up to the principles I have outlined. I’m happy for #sparc2012 to take any or all of them and wordsmith them.

And finally I’d like volunteers to work with me on extracting useful factual chemistry from the closed literature.

 

Permission for information-mining : Update and response from Royal Society of Chemistry

In our current search [some requests only went out on Saturday] for factual information from publishers on permission for “text-mining” the position is:

  • Elsevier. Permission granted in principle for PM-R. [PMR and community are now gearing up to extract factual chemistry from Elsevier journals. First step will be to create a complete index of all content (e.g. in Open biblio/Bibsoup) and then decide on strategy. Top driver is Mat Todd’s need to find antimalarial compounds – so we’ll look in chemistry journals first.]
  • Wiley. Request [2012-03-07] to Bob Campbell transferred to Duncan Campbell.
  • Nature. Request to Philip Campbell [2012-03-10] transferred to appropriate department.
  • American Chemical Society. No reply yet.
  • Springer. Request [2012-03-10] transferred internally and significant useful response [next mail]
  • Royal Society of Chemistry. [2012-03-10] Significant response from Richard Kidd. See below. Note that Richard has often commented on this blog.

    Dear Peter 

    Thanks for your request. It’s good to see from this and the accompanying blog post you still have some positive memories of text mining with some publishers. So far, we have mainly supplied articles for academic text mining purposes as one-off deliveries – such as for the SESL project, and the 50,000+ articles we supplied to both the ChETA and TREC Chem projects. Often it is easier for miners to bulk load within their own systems than crawling to collect, but we recognise that times are changing.

    We ask you talk to your librarian colleagues, both in terms of them being happy with what you’re doing under the agreed licenses with RSC, and so they understand what ongoing value the results of any mining exercise derives from the RSC subscription.  

    This ongoing value issue is important in terms of text mining implications for us. Along with most publishers we supply counter stats to librarians of usage within their institution – and, as you know, when renewal times comes these are used to judge which journals are of most value. Our concern is if the mining extracts and republishes sufficient content from the publications as to reduce apparent usage (and citation) of the published papers in future. At the moment full text downloads are the major measure we have (rightly or wrongly in principle) for the librarian to judge if publications are of value to the institution, and republication of extracted facts and data at least potentially could affect this. Done right, the effect can be positive, but it could also be detrimental.

    Some of Cameron’s suggested principles of research data mining would have been a valuable addition to your proposed non-negotiables, to reduce concerns that future derived would reduce usage of the original papers by your institution and others:

    * Always link back to the version of record of the research output you have mined.

    * Include elements and snippets by reference, not by value. Restrict content replication to that reasonably allowed by Fair Use provisions or enabled by licences, and required for efficient services

    * Only redistribute content where copyright terms explicitly allow it

    * Respect API service limits where posted and develop polite tooling with exponential back-off where appropriate

    (a couple of principles deleted, due to non-relevance to this specific question rather than disagreement)

    Finally, a correction. You say we cut off access a  few years ago. My recollection is slightly different and I have the correspondence if you’d like to  see it, from 2006. We didn’t cut you off, though we suggested we would block one IP address if the downloading continued without any contact. We discussed it amicably – explanation made it clear and the download behaviour was modified for both sides to be happy with continuation. But it’s an excellent illustration of why we appreciate being asked about the approach – as in this case the downloader was trying to retrieve non-existent issues, filling our developers’ mailboxes with 404 alerts. So while you think we’re only concerned about server load with on-demand mining, you can end up killing other systems we have to improve customer service. Mike Taylor clearly values publishers who try to stay on top of broken links 😉

    I would also ask that you include our response verbatim if you are using it in any of your Hargreaves submissions, and of course we will be preparing our own submission. 

    In summary, we would strongly appreciate discussion on the extent of the factual information you intend to republish (I have seen the examples on the blog), together with the involvement of your librarian colleagues in the process – for current agreements, and effects on future usage and value measures.

    Best wishes

     

    Richard

This is a useful response. It doesn’t, however, give me the freedom to text-mine RSC content without further permission. It suggests I contact my librarians. I have done so at regular intervals – I think they recognise I don’t need technical help from them – I am simply alerting them to what I am doing.

Text-mining distorts the publisher metrics on value? Surely that can be overcome technically. If that’s the only problem let’s create a dark cache and I’ll play in the sandbox. This is the sort of thing where, with goodwill on both sides, a solution is straightforward.

Is it progress? Difficult to say – it’s no good to me or Mat Todd as it doesn’t advance my current ability to mine the RSC literature.
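
On the technical side, the “polite tooling with exponential back-off” that Richard mentions is not hard to build. Here is a minimal Python sketch of the idea. It is not our actual crawler: the names `fetch_politely` and `TemporaryFailure` are hypothetical, and the function that really talks to the server is passed in, so the back-off logic can be seen (and tested) without touching anyone’s website.

```python
import time

class TemporaryFailure(Exception):
    """Raised by the injected fetch function when the server asks us to slow down."""

def fetch_politely(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each temporary failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the failure once max_tries attempts have been used up.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except TemporaryFailure:
            if attempt == max_tries - 1:
                raise  # give up rather than hammer the server
            sleep(base_delay * 2 ** attempt)
```

A real fetch function would raise `TemporaryFailure` only for “slow down” responses and would treat a 404 as a permanent failure, so that a non-existent issue is asked for once and never again.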

Information mining from Springer full-text: I ask for freedom

This is the last of the current series of requests to publishers for freedom to mine factual information. Note “freedom”, not “permission”. I don’t ask permission to speak in public, I take it as a freedom. I have now sent such requests to Elsevier, Springer, Wiley, Royal Soc Chemistry, Amer. Chem. Soc. and Nature Publishing Group. [If anyone wishes to contact other publishers feel free to use the text of my letters and let me know].

I’ll publish updates, hopefully daily, with publisher responses. I’ve given everyone a hard deadline because Hargreaves/IPO has a hard deadline.

Wim van der Stelt (Executive Vice President Corporate Strategy) is the only person whose email address I know in Springer so I hope he can find the right place for a rapid answer.

Wim,
[We corresponded earlier. If you are not the correct person in Springer to answer the question below please can you forward it to the person who is, let me know their name/email and ask them to reply substantively to me.]

We are making representations in response to the Hargreaves report and in particular about the freedom to use machines to extract and publish factual information from scientific publications without legal and technical barriers.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to Springer journals and we would like to begin to extract information on a systematic basis for Open scientific research. This applies to all Springer journals, not just BMC and Springer Open. We don’t need technical help or permission from Springer. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from Springer and (b) that we shall not be cut off by Springer robots. We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from Springer journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in /pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of science. Elsevier has already publicly said we can mine their content for research and we’ll be publishing the facts under an Open licence.

Best wishes,

Peter

Hargreaves and information mining: I request freedom to mine factual data from Nature Publishing Group full-text

I have sent the following letter to Philip Campbell, editor of Nature. In it I request freedom to mine factual information without legal or technical barriers. We have worked closely with Timo Hannay (then of NPG and now of Digital Science, another Macmillan company in the same building). Digital Science has great interest in published information and (maybe) uses some of our toolkit such as OPSIN.

Philip,
We are making a submission in response to the Hargreaves report and specifically about the freedom to extract and publish factual information from scientific publications. I have appreciated your cooperation in the past over the requirement to publish data that supports scientific research. I have copied Timo who, as you know, has supported our research here in developing semantic informatics, including tools for extraction. This involved a summer student and in-kind support for our Sciborg (EPSRC) project. You’ll know that two of our staff have since joined Timo’s Digital Science; and we are very proud to produce valuable human resources.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (OPSIN accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to NPG journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don’t need technical help or permission from NPG. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from NPG and (b) that we shall not be cut off by NPG robots (unfortunately this happened some years ago). We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from the full text of NPG journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in /pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Elsevier has already publicly said we can mine their content for research and we’ll be publishing the facts under an Open licence.

Best wishes,

Peter

Hargreaves and Information mining: I ask the American Chemical Society for freedom to mine factual data

Here is a letter I have sent to Madeleine Jacobs, CEO of the American Chemical Society (ACS) and former director of publications. In it I ask for freedom to extract factual information from the full-text of ACS journals.

Henry Rzepa and I are joint recipients of a prestigious ACS award and are organizing a symposium in the Fall. We hope to be able to show what we have managed to do with extraction of factual data from full-text. Here I ask Madeleine for assurance we can do this without barriers from ACS.

 

Dear Madeleine,

Unfortunately we’ve not yet been able to meet though our paths have crossed for several years. (I have copied in Dave Martinsen in ACS Publications whom I have known for 20 years).

You’ll know that I am this year’s recipient (joint with Henry Rzepa) of the Society’s CINF Division Herman Skolnik award. Part of the award is for our work in machine extraction of semantic chemical information (in Chemical Markup Language, CML) and re-use for new scientific opportunities. As Skolnik medallists Henry and I are organizing part of this year’s Fall CINF meeting and shall be demonstrating some of our achievements. In particular we wish to show the great opportunities that semantic chemistry gives and particularly the ability to use the factual information in the primary literature.

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall. For example, our OPSIN name-to-structure tool (published last year in the Society’s J. Chem. Inf. Model. [1] and highly accessed) has accuracy > 99.5% and recall > 95%. The University of Cambridge is a subscriber to ACS journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don’t need technical help or permission from the ACS. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from ACS and (b) that we shall not be cut off by ACS robots (unfortunately this happened some years ago even though we hadn’t violated anything). We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from ACS journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in /pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Alicia Wise, Director of Universal Access at Elsevier, has already publicly said we can mine their content for research and we’ll be publishing their factual data under an Open licence. As a result we should have a great opportunity to show the power of the semantic approach at the Fall Symposium.

And, of course, I would be delighted to meet you there!

Best wishes,

Peter

[1] http://pubs.acs.org/doi/abs/10.1021/ci100384d?journalCode=jcisd8


Information mining for Hargreaves and Open Science: I ask the Royal Society of Chemistry

I’ve now asked the Royal Society of Chemistry for permission to extract factual information from the journals to which Cambridge subscribes. For background for non-chemists, the RSC has supported our research in information mining through funding summer students, and in kind for the Sciborg (EPSRC) and the CheTA (JISC) projects. For example our Experimental Data Checker (OSCAR2) is hosted on the RSC website and very widely used for checking the quality of chemical papers before and after publication. Chemspider is a novel, volunteer populated, resource for collecting and validating chemical information (http://www.chemspider.com )

David, Richard,
We are preparing a response to the Hargreaves report about information mining from scientific publications. As you know we have developed a world class set of Open Source tools for chemical information extraction, some of them with your support – for which public thanks!

We are now in the position where we can extract factual chemical information from the full text of articles with high precision and recall (OPSIN accuracy is > 99.5% and recall > 95%) and with great speed and cost-effectiveness. The University of Cambridge is a subscriber to RSC journals and we would like to begin to extract information on a systematic basis for Open scientific research. We don’t need technical help or permission from the RSC. We have copied Cambridge University Library staff.

This mail is to ask your assurance that we can do this without (a) legal/contractual barriers from RSC and (b) that we shall not be cut off by RSC robots (unfortunately this happened some years ago). We wish to start immediately to show Hargreaves the benefit of information mining – they have a deadline for 2012-03-21 so we would like your agreement by 2012-03-15. All we require is:

YES: you may mine and publish factual information from RSC journals without additional payment and without restriction from legal and technical barriers.

I hope you can trust me to act responsibly on not violating copyright and being considerate to your robots. I have set out more details and a non-exhaustive illustration of facts in /pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable .

Unfortunately any other reply than YES by 2012-03-15 will be regarded as unacceptable for the purposes of Hargreaves.

You will note that we are also approaching other major publishers of chemistry. Elsevier has already publicly said we can mine their content for research and we’ll be publishing the facts under an Open licence. This means that Chemspider (Tony Williams copied) can immediately use all this information in the Chemspider resource.

Best wishes,

Peter

One of the immediate benefits is our collaboration with Mat Todd (Sydney) who is running an Open project for discovering novel antimalarials. The RSC publishes much high-quality research in (for example) its journal “Organic and Biomolecular Chemistry” and Mat will be able to scan the factual list of factual compounds and factual data for leads to develop antimalarials.

Information-mining: Discussions with Wiley

I’ve now heard from Duncan (sic) Campbell in Wiley. [I include his email because he is currently the contact point for information mining.] I include our recent correspondence (Duncan in italics).

On Fri, Mar 9, 2012 at 4:16 PM, Campbell, Duncan – Oxford <dcampbell@wiley.com> wrote:

Peter

 

I’ve copied in my colleagues in the Cambridge University Library

Thanks for getting in touch. We would be happy to discuss your specific requirements for text-mining Wiley content, and how we can work with you to enable mining in a mutually-acceptable manner.


Excellent. You’ll appreciate that this is a matter of great public interest at present and an opportunity to show how helpful publishers are, so I’ll be posting the correspondence on my blog.

I don’t have specific requirements. I have the technology to extract facts from Wiley publications and do scientific research on them and I’d like to do that. In the first instance I’ll analyze which journals contain chemistry and extract all the chemical facts and then do research on them. Since the data are factual there is no question of copyright being violated.
As our group is the leading creator of Open Source information-mining software for chemistry and we are regarded as among the world’s experts I have a large number of collaborators. There are a large number of projects already but we add at least one a week so there’s no point in burdening you with the details. Here are just 5 to show you the power.

  • Scanning the literature for potential antimalarial compounds (Mat Todd). We have to search for every compound as there is no golden rule for finding drugs against this killer disease.
  • Finding second harmonic generators for solar panels, leading to increased energy efficiency and greenness for the planet.
  • Computing the human metabolome. Again we have to find all instances where compounds have been mentioned that might be human metabolites.
  • Improving the eco-friendliness of chemical reactions. What solvents have been used in what reactions? Can we use solvents that are more friendly to the planet? Again we need to look at every reaction.
  • Improving the accuracy of computational chemistry. There are billions of dollars spent on trying to predict the structure of matter. We want to find every paper and find the most cost-effective methods.

There are also many added benefits in scientific information-mining research itself where I am an acknowledged world expert (sorry to sound boastful, it’s just to assure you I know what I’m doing).

I’m not asking you to get involved in any of the technical details and we don’t need any special technology from the publisher, any special versions of the articles or any APIs. There is no need to involve CUL in details. All we need is:

  1. To download and analyze, using machines, papers from Wiley journals to which we have subscriptions (we use web-friendly crawling protocols)
  2. An assurance from Wiley that you will not impose technical and legal/contractual barriers.
  3. To be able to publish the data on which the science is based (science without data is almost worthless as you know)

We give you an assurance that we shan’t deliberately publish any copyright material such as the complete verbatim Version of Record.

 
 

We are keen to enhance the usage of our journal content by encouraging text and data mining, and welcome the opportunity to work on a specific project with you that would enable us to gain further experience in this area.  As you’ll appreciate, at this stage there are still questions around access, processing and distribution of the outputs of text mining, which Wiley, in common with most other STM publishers, is working through.

     I look forward to hearing from you further.

 There is an urgency. We are keen to start some of these projects within a day or two as we want to present to the Hargreaves enquiry how valuable text-mining can be. We therefore only need from you an assurance that we can employ factual mining and to get into the report we’ll need this by 2012-03-14. I am afraid promises of intent are worthless at this stage. There is only one acceptable answer:

YES – you can go ahead without further permission from Wiley

anything else, I’m afraid will be a NO for Hargreaves.


Textmining Update: Max Haeussler’s Questions to publishers: They have a duty to reply

Today’s update is limited to one topic. Getting replies from publishers.

In the digital age it is in principle possible for readers to contact publishers directly through a new technology called “email”. It’s very recent – about 20-30 years old – and allows people to send messages to other people over the Internet. Often what happens is that I send an email to someone I know and – within minutes – that person mails back. We can even have complete dialogues within a day.

It would be really valuable to be able to email publishers and get a reply. For example Bob Campbell of Wiley said last week that “all I had to do was ask” Wiley for permission to text-mine. So to mail someone you need something called their “email address”. If you don’t know it you can’t mail them. If you do know it you *can* mail them. But you don’t know whether they’ve got it (this is actually a well-known problem in modern communications science).

Now I’m not the only one doing this. Max Haeussler has been wanting to text-mine for biological research for several years. He’s published an account of trying to get permissions from publishers. http://text.soe.ucsc.edu/progress.html

Read this. Max has contacted 26 publishers, some in 2009 and the latest in 2012-02 (last month). 13 of these have not replied. About 5 have a relatively simple answer giving some or all permission. Several have taken 2 or more years to give an answer. Others have not yet arrived at an answer.

The STM publishers (Eefke Smit) have stressed how easy it is to get permission from publishers to do text-mining. Are Max and I just unlucky in not getting a positive rapid helpful response?

The main problem is that publishers do not make it easy for readers (Max and I are readers) to have any dialogue with them. It’s not that they deliberately make it hard, they just don’t actively make it easy. Visit the Wiley site http://eu.wiley.com/WileyCDA/Section/index.html and see if you can find where to ask for permission to text-mine.

“Rights and permissions” http://eu.wiley.com/WileyCDA/Section/id-403436.html – sounds useful:

The quick and easy way to clear permissions via Rights Link®, the Copyright Clearance Center’s online service

You can now obtain permission to reproduce any Wiley or Wiley Blackwell article (in whole or in part) directly from the article abstract on Wiley Online Library.

  • Click on the ‘Request Permission’ link
  • Follow the online instructions and select your requirements from the drop down options to gain a ‘quick quote’
  • Create a RightsLink® account to complete and pay for your transaction (if you do not already have one)
  • Read and accept our Terms & Conditions and Download your license

Beware! Anything with “Rights Link®” means you are at the bottom of the rathole. These rights are the right given by the author to Wiley for Wiley to charge us twice for the article – first to read it and then to copy it or re-use it. There’s a complete list of organizations who will charge for re-using material in Wiley journals http://eu.wiley.com/WileyCDA/Section/id-301726.html . If you try to ask these about text-mining you’ll go even deeper.

Where’s the address for helpful answers on text-mining? There isn’t one. Where’s the department? Who knows?

You have to know the person to contact, and their email, and hope they haven’t moved department. Luckily I had had this personal offer from Bob Campbell. I searched the Wiley site and found his address and got back the following mail:

Dear Peter
At the meeting in Rhodes House I said that anyone interested in mining our journal content should contact us. Any such inquiries will be treated on a case by case basis. I followed up with an email to you suggesting you contact my colleague Duncan Campbell. I am copying him in.
Yours
Bob

Well, at least I’ve made contact with one Campbell in Wiley. I expected that Duncan (sic) would mail me but I can’t find anything in my mailbox over the last 2 days so I assume he hasn’t. So I have to mail him. It’s just another example of Institutional anti-readerism.

Anyway I have mailed him. I’ll report here.

So, publishers, it is unacceptable that readers find it so much of an effort to contact you and have a constructive discussion. I’ve been through several Elsevier staff over 2.5 years before I happened on Alicia. Probably like Max. It’s a grotesque waste of our time, and I would have thought, of your time. An efficient organization would have solved the problem years ago. (But hang on, that doesn’t matter – you can just put the (monopoly) prices up to cover the inefficiency.)

Now I am getting back to some chemical lambda calculus for relaxation.

 


Textmining: Update, Wiley, Nature and Hargreaves. And Elsevier allows me unrestricted text-mining! Thanks!!!

I shall continue to update on a daily basis.

Hargreaves

We have formed a small group to coordinate our reply to Hargreaves and this will take place on the OKF open-science and open-access lists (and @ccess). Please let us know of any useful experience in access to published material.

Nature

Today Richard van Noorden (of Nature) posted a useful article http://www.nature.com/news/trouble-at-the-text-mine-1.10184 on the current frustration within the research community about not being able to textmine when and where they want. It’s moderately well balanced. However it doesn’t say anything critical about Nature:

“Nature Publishing Group in London, which publishes this journal, says that it does not charge existing subscribers to mine content to which they already have access, subject to contract.”

RvN didn’t say that NPG sent Max Haeussler a quote for 85,000 USD to mine Nature content. I talked with RvN, gave him a lot of material for his article and pointed him to MaxH. I said that I would expect RvN to be objective in his report and not favour Nature. He said he would, but had to get his copy agreed. In the end he decided not to use any of my material. That’s fine: journalists collect more material than they can use.

Elsevier

I have had a useful set of email communications with Alicia Wise of Elsevier. Today she has agreed that I can go ahead with textmining as I wish! Thank you Alicia!

Hi Peter,

As I indicated to you when we met in Oxford, we (at Elsevier) have no problem in principle with you text mining for research purposes. There are some practical matters to resolve through discussion. With regret I have formed the view that you are not – at this time – really seeking practical solutions. If this changes please do let me know as we remain willing to work with you and other colleagues at Cambridge – and elsewhere – who need and want to text mine.

While I am here, I would like to stress the real value of librarians in these discussions. Your library colleagues at Cambridge have – both directly and through JISC Collections – relationships and existing agreements with a wide array of publishers. They are constructive partners for us all in facilitating text mining and scaling up as we move forward.

With kind wishes,

Alicia

(Alicia Wise, Director of Universal Access, Elsevier, @wisealic)

Preamble

  • I am actively seeking practical solutions. I’m going to start tomorrow! (I’ll let our library know in case there are teething glitches). Last week we (Daniel Lowe) mined 1,000,000 (1 million) chemical reactions from US patents. This is for RESEARCH purposes (we are not going to sell them). We are analysing how well the technology works and then what types of chemistry are most effective. This feeds into the EPSRC Dial-A-Molecule Grand Challenge looking at how we can create better chemical synthesis for drugs. It could lead to a radical improvement of chemistry that’s RESEARCH. We are going to put the results up on DSpace and Figshare and our own Quixote so everyone else can do research on them as well. NOTE: I didn’t need any help from the USPTO or Cambridge Library.
  • I simply want to do the same with papers in Elsevier journals. I shan’t release any of the final PDF. I’m just going to publish the factual material – and conveniently in DSpace and Figshare and Quixote. This is research because the science is done for a different purpose than invention and generally is aimed towards novelty rather than production. So we get a whole new set of chemistry. It’s also done on a different scale – much novel chemistry doesn’t scale directly into production.

This is VERY good news. Thank you Alicia. It’s not everything I have asked for but it’s of real value. We can mine Elsevier journals for research purposes. We start today!!!

I assume you will trust me as to what RESEARCH in chemical text-mining is – I’m a world expert, honoured by the ACS for this work. And I assume you will trust me not to publish copyright content – I haven’t done so in 10 years of semantic research. I shan’t publish the VoR PDF nor the author’s final manuscript. But I shall publish all the factual data on which the RESEARCH relies and all the bibliography metadata which is required to manage the output.

So here’s what I am going to do:

  • Use our Pubcrawler software to systematically retrieve all publications from Elsevier journals. (We can do this – we don’t need any technical help from Elsevier or our Library and we don’t need Sciverse, Scopus, Reaxys, Science Direct or any other Elsevier product.) We shall only use the material for information mining.
  • We shall determine which papers contain chemistry using our OSCAR4 software. This is the best Open Source software for chemical textmining and probably as good as, if not better than, closed proprietary tools.
  • We shall filter the articles into those that have a significant proportion of chemistry and those that don’t and concentrate on the former.
  • We shall then extract and analyse the chemical names and formulae. Where possible we shall try to match redundant information (e.g. names and structure diagrams).
  • We shall extract the factual data (spectra) and check their validity against the chemical structure using our OSCAR2 software (Open Source). Many papers contain many errors (even Elsevier papers contain many errors). We’ll show where papers contain errors (and that’s a real benefit to scientific RESEARCH)
  • We shall use computational chemistry to compute the properties of the compounds and compare them with experiment. That’s really valuable RESEARCH. 15% of all supercomputer time is on compchem and there is a desperate need to calibrate its usefulness.
  • We shall extract the chemical reactions. There is very little research done in academia on the phenomenology of published reactions – we did some of this last year at the Open Science Summit where we analysed chemical reactions for eco-friendliness (the “Green Chain Reaction”). We’ll now be able to show whether the chemistry in Elsevier journals is more eco-friendly than in patents.
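
To make the filtering step in the list above concrete, here is a minimal Python sketch of how articles might be sorted into "significant chemistry" and "not". It is only an illustration under stated assumptions: it does not use OSCAR4 (which is a Java tool doing real chemical entity recognition), and the tiny lexicon, the crude formula pattern, the `Article` class and the 5% threshold are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a chemical entity recogniser such as OSCAR4:
# a tiny lexicon plus a crude pattern for formulae like H2O or C2H5OH.
CHEMICAL_LEXICON = {"ethanol", "acetone", "toluene", "benzene", "methanol"}
# Element-like symbols with optional counts, at least two of them,
# and at least one digit somewhere (so plain acronyms are not matched).
FORMULA_PATTERN = re.compile(r"^(?=.*\d)(?:[A-Z][a-z]?\d*){2,}$")


@dataclass
class Article:
    identifier: str
    text: str


def tokenize(text):
    """Split text into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def is_chemical(token):
    """Rough guess at whether a token names a compound or formula."""
    return token.lower() in CHEMICAL_LEXICON or bool(FORMULA_PATTERN.match(token))


def chemistry_fraction(article):
    """Fraction of tokens in the article that look chemical."""
    tokens = tokenize(article.text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_chemical(token))
    return hits / len(tokens)


def filter_chemistry(articles, threshold=0.05):
    """Keep only articles whose chemical fraction reaches the (assumed) threshold."""
    return [a for a in articles if chemistry_fraction(a) >= threshold]
```

A real pipeline would swap `is_chemical` for a proper recogniser and tune the threshold against a hand-labelled sample of journals; the shape of the filter would stay much the same.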

     

That’s the start. It’ll take us a day or two to deploy the software on Elsevier journals but after that only a few days to do the analysis. Because it’s research we shall publish it (choice of publisher is currently Open) and the referees will demand that we make the data available. So we have to put it up publicly and we have DSpace and Figshare and our own Quixote system to do this.

This is really exciting! Thanks Alicia!
