I shall continue to update on a daily basis.
We have formed a small group to coordinate our reply to Hargreaves and this will take place on the OKF open-science and open-access lists (and @ccess). Please let us know of useful experience in access to published material.
Today Richard van Noorden (of Nature) posted a useful article http://www.nature.com/news/trouble-at-the-text-mine-1.10184 on the current frustration within the research community about not being able to textmine when and where they want. It’s moderately well balanced. However, it doesn’t say anything critical about Nature:
“Nature Publishing Group in London, which publishes this journal, says that it does not charge existing subscribers to mine content to which they already have access, subject to contract.”
RvN didn’t say that NPG sent Max Haeussler a quote for 85,000 USD to mine Nature content. I talked with RvN, gave him a lot of material for his article and pointed him to MaxH. I said that I would expect RvN to be objective in his report and not favour Nature. He said he would, but had to get his copy agreed. In the end he decided not to use any of my material – that’s fine, journalists collect more material than they can use.
I have had a useful set of email communications with Alicia Wise of Elsevier. Today she has agreed that I can go ahead with textmining as I wish! Thank you Alicia!
As I indicated to you when we met in Oxford, we (at Elsevier) have no problem in principle with you text mining for research purposes. There are some practical matters to resolve through discussion. With regret I have formed the view that you are not – at this time – really seeking practical solutions. If this changes please do let me know as we remain willing to work with you and other colleagues at Cambridge – and elsewhere – who need and want to text mine.
While I am here, I would like to stress the real value of librarians in these discussions. Your library colleagues at Cambridge have – both directly and through JISC Collections – relationships and existing agreements with a wide array of publishers. They are constructive partners for us all in facilitating text mining and scaling up as we move forward.
With kind wishes,
(Alicia Wise, Director of Universal Access, Elsevier, @wisealic)
- I am actively seeking practical solutions. I’m going to start tomorrow! (I’ll let our library know in case there are teething glitches). Last week we (Daniel Lowe) mined 1,000,000 (1 million) chemical reactions from US patents. This is for RESEARCH purposes (we are not going to sell them). We are analysing how well the technology works and then what types of chemistry are most effective. This feeds into the EPSRC Dial-A-Molecule Grand Challenge looking at how we can create better chemical synthesis for drugs. It could lead to a radical improvement of chemistry: that’s RESEARCH. We are going to put the results up on DSpace and Figshare and our own Quixote so everyone else can do research on them as well. NOTE: I didn’t need any help from the USPTO or Cambridge Library.
- I simply want to do the same with papers in Elsevier journals. I shan’t release any of the final PDF. I’m just going to publish the factual material – and conveniently in DSpace and Figshare and Quixote. This is research because the science is done for a different purpose than invention and generally is aimed towards novelty rather than production. So we get a whole new set of chemistry. It’s also done on a different scale – much novel chemistry doesn’t scale directly into production.
This is VERY good news. Thank you Alicia. It’s not everything I have asked for but it’s of real value. We can mine Elsevier journals for research purposes. We start today!!!
I assume you will trust me as to what RESEARCH in chemical text-mining is – I’m a world expert, honoured by the ACS for this work. And I assume you will trust me not to publish copyright content – I haven’t done so in 10 years of semantic research. I shan’t publish the VoR PDF nor the author’s final manuscript. But I shall publish all the factual data on which the RESEARCH relies and all the bibliographic metadata which is required to manage the output.
So here’s what I am going to do:
- Use our Pubcrawler software to systematically retrieve all publications from Elsevier journals. (We can do this – we don’t need any technical help from Elsevier or our Library and we don’t need Sciverse, Scopus, Reaxys, Science Direct or any other Elsevier product.) We shall only use the material for information mining.
- We shall determine which papers contain chemistry using our OSCAR4 software. This is the best Open Source software for chemical textmining and probably as good as if not better than closed proprietary tools.
- We shall filter the articles into those that have a significant proportion of chemistry and those that don’t and concentrate on the former.
- We shall then extract and analyse the chemical names and formulae. Where possible we shall try to match redundant information (e.g. names and structure diagrams).
- We shall extract the factual data (spectra) and check their validity against the chemical structure using our OSCAR2 software (Open Source). Many papers contain many errors (even Elsevier papers contain many errors). We’ll show where papers contain errors (and that’s a real benefit to scientific RESEARCH).
- We shall use computational chemistry to compute the properties of the compounds and compare them with experiment. That’s really valuable RESEARCH. 15% of all supercomputer time is on compchem and there is a desperate need to calibrate its usefulness.
- We shall extract the chemical reactions. There is very little research done in academia on the phenomenology of published reactions – we did some of this last year at the Open Science Summit where we analysed chemical reactions for eco-friendliness (the “Green Chain Reaction”). We’ll be able now to show whether the chemistry in Elsevier journals is more eco-friendly than in patents.
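To make the "filter by proportion of chemistry" step in the list above concrete, here is a minimal sketch in Python. It is not OSCAR4 (which is a Java library with its own API) and it is not our actual pipeline: the tiny lexicon, the formula pattern, the 10% threshold and the names `chemistry_fraction` and `filter_chemistry` are all assumptions made up for illustration. A real run would replace the toy recogniser with a proper chemical entity recogniser.

```python
import re

# Hypothetical toy lexicon standing in for a real chemical name recogniser.
LEXICON = {"benzene", "ethanol", "toluene", "acetone", "pyridine", "methanol"}

# Very rough molecular-formula pattern: element-like symbols with optional
# counts, and at least one digit somewhere (e.g. H2O, C6H6, CH3OH).
FORMULA = re.compile(r"^(?=.*\d)(?:[A-Z][a-z]?\d*)+$")


def chemistry_fraction(text: str) -> float:
    """Return the fraction of tokens that look chemical under the toy rules."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in LEXICON or FORMULA.match(t))
    return hits / len(tokens)


def filter_chemistry(articles: dict[str, str], threshold: float = 0.1) -> list[str]:
    """Return the ids of articles whose chemical fraction meets the threshold."""
    return [
        article_id
        for article_id, text in articles.items()
        if chemistry_fraction(text) >= threshold
    ]


if __name__ == "__main__":
    # Made-up example texts, not real article content.
    sample = {
        "paper-1": "Benzene reacts with ethanol to give C6H6 and H2O",
        "paper-2": "The committee met on Tuesday to discuss the budget",
    }
    print(filter_chemistry(sample))
```

The threshold is the only tuning knob here; in practice it would have to be chosen by looking at how the recogniser scores a sample of papers that are known to be chemistry and a sample that are not.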
That’s the start. It’ll take us a day or two to deploy the software on Elsevier journals but after that only a few days to do the analysis. Because it’s research we shall publish it (choice of publisher is currently Open) and the referees will demand that we make the data available. So we have to put it up publicly and we have DSpace and Figshare and our own Quixote system to do this.
This is really exciting! Thanks Alicia!