Monthly Archives: November 2011

We’re coming to the Hackathon

The JISC/SWAT4LS/OKF Hackathon starts on Tuesday in London. http://www.swat4ls.org/workshops/london2011/ and see Jenny Molloy's blog post: http://science.okfn.org/2011/11/09/open-research-reports-trailer/

Here are some of the stars:

On the left is KosherFrog (alias @gfry, alias Gilles Frydman). (We'll create some better glasses for him.) Here's his twitter avatar:

In the middle is a patient – or rather the mother of a patient. She is carrying Roo's salbutamol inhaler. She needs access to medical information.

And on the right is McDawg. He's Graham Steel. Here's HIS twitter avatar.

We'll be creating semantic resources for disease.

Be there.

Cambridge Crystallographic Data Centre disputes non-re-usability of primary data (Am. Chem. Soc charges > 100 USD to view this discussion)

I have been alerted to a discussion in the letter pages of J. Chem. Inf. Modeling (an ACS journal). I normally read the literature through a paywall window (my home machine has no privileges and so I get a "citizen-enhanced" view of the primary literature; the enhancement is of course massively negative – I can't read most of it). For most things, if I can't read them they don't exist – an increasingly common approach. Occasionally I switch on access to the University VPN, which allows me to read the fulltext – thereby requiring the University to continue its subscription (in dollars) to this journal. Unless they use the paywall filter, academics in rich universities (which are the only real market for scholarly journals) have no idea how impoverished the world is. But many of my readers will appreciate this – they are the Scholarly Poor. And what follows can be understood by anyone – you don't have to be a chemist. Note that many research institutions do not subscribe to JCIM, so I expect most readers will have a "scholarly poor lens" on what follows.

  • Earlier this year a paper was published http://pubs.acs.org/doi/abs/10.1021/ci100223t

    Data-Driven High-Throughput Prediction of the 3-D Structure of Small Molecules: Review and Progress

    Alessio Andronico, Arlo Randall, Ryan W. Benz, and Pierre Baldi*

    School of Information and Computer Sciences, Institute for Genomics and Bioinformatics and Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697-3435, United States

    J. Chem. Inf. Model., 2011, 51 (4), pp 760–776 DOI: 10.1021/ci100223t Publication Date (Web): March 18, 2011 Copyright © 2011 American Chemical Society

I can't reproduce the abstract because, although it was written by the authors, they have signed over its ownership/copyright to the ACS. (ACS in their generosity allow you to read it at the end of the link above.) Note that the system is mounted at http://cosmos.igb.uci.edu/ . It contains the rubric:

Note: In as much as this Service uses data from the CSD [Cambridge Structural Database], it has been given express permission from the CCDC [Cambridge Crystallographic Data Centre]. At the request of the CCDC, no more than 100 molecules can be uploaded to the Service at a time, and the Service ought to be used for scientific purposes only, and not for commercial benefit or gain.

Well – that was a pretty challenging paper, wasn't it? (Sorry scholarly poor, I can't tell you what it said – but trust me – or pay 35 USD).

This elicited a response from the director of the CCDC. If you read the abstract you will see their involvement. (BTW I have no relation to them except geographical proximity, and the University has declared that for FOI purposes they don't belong to the University, although they are listed as a department.) Here is his 1-page response:

  • http://pubs.acs.org/doi/pdfplus/10.1021/ci2002523 Data-Driven High-Throughput Prediction of the 3-D Structure of Small Molecules: Review and Progress. A Response from The Cambridge Crystallographic Data Centre,

    Colin R Groom* The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.

He clearly disagrees with their contention. (Scholarly Poor you will have to fork out another 35 USD to read this single page). [2]

And the original authors responded:

  • (http://pubs.acs.org/doi/abs/10.1021/ci200460z ) Data-Driven High-Throughput Prediction of the 3-D Structure of Small Molecules: Review and Progress. A Response to the Letter by the Cambridge Crystallographic Data Center

    Pierre Baldi

    J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/ci200460z • Publication Date (Web): 22 Nov 2011

Wow! Some strong disagreement on matters of fact. (Stop whining, Scholarly Poor, and pay another 35 USD to read this letter – it's nearly 2 pages!) I'll reveal that it contains phrases like "simply false". And you can read the abstract, which contains the phrase "significant impediments to scientific research posed by the CCDC."

So that is a pretty damning indictment. Of the CCDC? Maybe, if you can read the letters. But certainly of the ACS. An important discussion about the freedom of re-use of the scholarly literature is hidden behind a paywall. The letters have been written by scientists and presumably reproduced verbatim by the ACS. What possible justification is there for the charge of 35 USD? There is no peer review involved. But then the ACS charges 35 USD for everything, including an 8-WORD retraction notice. (It's easier just to charge vast amounts of money than to think about what you are doing to science.)

So I am in a dilemma. How do I bring this discussion to public view? Because that is what a Scholarly Society SHOULD wish. I can't expect everyone to pay 105 USD. (The part of the first paper that is involved is only two sentences.) I have the following options:

  • Do nothing – this will perpetuate the injustices
  • Write summaries of the letters (absurd because it will distort the meaning)
  • Extract paragraphs and publish them under fair use. (There is no doctrine of fair use in the UK and I could be sued for any phrase extracted – I have already laid myself open to this with the phrase "simply false".)
  • Urge the authors of the letters to publish them Openly. In doing so they will break the conditions of publication and lay themselves open to legal action or having subscriptions to JCIM cut off
  • Write to the editor of the Journal suggesting it would be in the public interest to publish the letters? In general editors don't reply – but I know this one. But in any case I doubt they would do it, and it makes the situation worse
  • Or follow a reader's suggestion I haven't thought of

Because I am now going to continue to challenge the CCDC. I have been turned down on FOI grounds on a technicality (that the CCDC, although listed as a department of the University, isn't part of it for FOI). BTW it took the University FOI office 19.8 days to work that out.

If you read the last paper (shut up and pay!) you will see that the authors quote our work on Crystaleye and suggest that it, together with the Crystallography Open Database (COD), could and now should replace the CCDC. They say (I have removed all the letter "O"s [1] to avoid direct quoting – 35 USD will tell you where the O's are meant to be):

As histry shws, thse wh stand in the way f demcracy and scientific prgress end up lsing ver the lng-run. The reactinary attitude f the CCDC staff has started t backfire by energizing academic labratries arund the wrld t find alternative slutins arund the CCDC.

I agree with the sentiments expressed. The only problem is that the authors chose to do it behind a paywall.

I shall continue my campaign to liberate "our" data from the CCDC+Wiley/Elsevier/Springer monopoly. Sancho Panza (http://en.wikipedia.org/wiki/Sancho_Panza ) is welcome to join me.

[1] http://en.wikipedia.org/wiki/The_Wonderful_O James Thurber.

[2] UPDATE: I managed to get it for free but maybe I have a cached copy?

UPDATE: It now seems that most people can get the first letter ("Editorial") for free but I still have to pay for the UCI response

Scientists should NEVER use CC-NC. This explains why.

There is a really important article at http://www.pensoft.net/journals/zookeys/article/2189/creative-commons-licenses-and-the-non-commercial-condition-implications-for-the-re-use-of-biodiversity-information. (Hagedorn G et al)

[NOTE the OKF has a clear indication of the problems of CC-NC. They should add a link to Hagedorn. See my earlier blog post http://blogs.ch.cam.ac.uk/pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ ].

So, you aren't interested in biodiversity journals? Never read ZooKeys? (I didn't know it existed.) But in one day about 1200 people have accessed this article. Yet another proof that WHAT you publish matters, not WHERE. And hopefully this blog will send a few more that way.

I can't summarise all of it. The authors give a very detailed and, I assume, competent analysis of copyright applied to scientific content (data, articles, software) and its licensability under Creative Commons. Note that "This work is published under a Creative Commons Licence" – which so many people glibly write – is almost useless. It really means "This work is copyrighted [unless it's CC0] and to find out whether you have any rights you will have to look at the licence". So please, always specify WHICH CC licence you use.

The one you choose matters, because it applies the rule of LAW to your documents. If someone does something with them that is incompatible with the licence they have broken copyright law. For example, combining a CC-NC-SA licence with a CC-BY-SA licence is impossible without breaking the law.

There are so many misconceptions about NC. Many people think it's about showing that you want people to share your motivation. Motivation is irrelevant. The only thing that matters is whether a court would judge that the licensee's use breaks the formal non-commercial condition. There's little case law, but the Hagedorn paper argues that being a non-profit doesn't make a use non-commercial. Recovering costs can be seen as commercial. And so on.

We came across this when we wished to distribute a corpus of 42 papers used in training OSCAR3. The corpus was made available by the Royal Society of Chemistry. It was used (with contributions from elsewhere) to tune the performance of OSCAR3 on chemistry journals. Because training with a corpus is a key part of computational linguistics we wished to distribute the corpus (it's probably less than 0.1% of the RSC's published material – it would hardly affect their sales). After several years they agreed, on the basis that the corpus would be licenced as CC-NC. I pointed out very clearly that CC-NC would mean we couldn't redistribute the corpus as a training resource (and that this was essential since others would wish to recalibrate OSCAR). Yes, they understood the implications. No, they wouldn't change. They realised the problems it would cause downstream. So we cannot redistribute the corpus with OSCAR3. The science of textmining suffers again.

Why? If I understood correctly (and they can correct me if I have got it wrong) it was to prevent their competitors using the corpus. (The competitors include other learned societies.)

I thought that learned societies existed to promote their discipline. To work to increase quality. To help generate communal resources for the better understanding and practice of the science. And chemistry really badly needs communal resources – it's fifteen years behind bioscience because of its restrictive practices. But I'm wrong. Competition against other learned societies is more important than promoting the quality of science.

Meanwhile Creative Commons is rethinking NC. They realise that it causes major problems. There are several plans (see Hagedorn paper):

Creative Commons is aware of the problems with NC licenses. Within the context of the upcoming version 4.0 of Creative Commons licenses (Peters 2011), it considers various options of reform (Linksvayer 2011b; Dobusch 2011):

• hiding the NC option from the license chooser in the future, thus formally retiring the NC condition

• dropping the BY-NC-SA and BY-NC-ND variant, leaving BY-NC the only non-commercial option

• rebranding NC licenses as something other than CC; perhaps moving to a "non-creativecommons.org" domain as a bold statement

• clarifying the definition of NC

I'd support some of these (in combination), but not the last on its own: while the NC option is still available, many people will use it on the basis that it's the honourable thing to do (I made this mistake on this blog), and others will use it deliberately to stop the full dissemination of content.

What is the basis of the NaCTeM-Elsevier agreement? FOI should give the answer

In the previous posts (http://blogs.ch.cam.ac.uk/pmr/2011/11/25/textmining-nactem-and-elsevier-team-up-i-am-worried/ and http://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ ) I highlight concerns (not just mine) about the publicly announced collaboration between NaCTeM (The National Centre for Textmining at the University of Manchester) and Elsevier (henceforth N+E). I am now going to find out precisely the details of this collaboration and, when I have the answers, will be in a position to answer the following questions:

  • What is NaCTeM's mission for the nation? (NaCTeM formally has a responsibility to the Nation)
  • What public finance has NaCTeM had and what is planned in the future?
  • What public money has gone into the N+E?
  • What are the planned benefits to Elsevier?
  • What are the planned benefits of N+E to NaCTeM?
  • Are there plans to pass any of these benefits to the wider national community?

In particular my concerns are:

  • Will the benefits of this work be available only through Elsevier's Sciverse platform?
  • Are we getting value for money?

It may seem strange – and potentially confrontational – to use FOI to get this information rather than simply asking the University or NaCTeM. But the power of FOI is that the University has specialist staff to give clear unemotional answers. And in particular it will highlight precisely whether there are hidden confidential aspects. If so it will be especially important to assess whether this is in the Nation's interest. And, with the possibility that this will reveal material that is useful to the Hargreaves process and UK government (through my MP) it is important that my facts are correct.

For those who aren't familiar with the FOI process: each public institution has a nominated officer who must, within 20 working days, give answers to all questions (or show why s/he should not). I shall use http://whatdotheyknow.com – a superb site set up for this purpose, which means that everyone can follow the process and read the answers. FOI officers are required to respond promptly, and I hope that Manchester will do so – and be quicker than Oxbridge, who ritually take 19.8 days to respond. Note that I am not required to give my motivation. I shall request information in existing documents or known facts – this is not a place for future hypotheticals or good intentions.

 

Dear FOI University of Manchester,

I am requesting information under FOI about the National Centre for Text Mining (NaCTeM) and the University's recently announced collaboration of NaCTeM with Elsevier (http://www.manchester.ac.uk/aboutus/news/display/?id=7627 ). The information should be supported by existing documents (minutes, policy statements, etc.). I shall be concerned about the availability of resource material to the UK in general (i.e. beyond papers and articles). I use the word "Open" (capitalised) to mean information or services which are available for free use, re-use and redistribution without further permission (see http://opendefinition.org/ ). In general this means OSI-compliant Open Source for code and CC-BY or CC0 for content (CC-NC and "for academics only" are not Open).

General

  • What is the current mission statement of NaCTeM?
  • Does NaCTeM have governing or advisory bodies or processes? If so please list membership, dates of previous meetings and provide minutes and support papers.
  • List the current public funding (amounts and funders) for NaCTeM over the last three years and the expected public funding in the foreseeable future.
  • What current products, content and services are provided to the UK community (academic and non-academic) beyond NaCTeM itself?
  • What proportion of papers published by NaCTeM are fully Open?
  • What proportion and amount of software, content (such as corpora) and services provided by NaCTeM is fully Open?

Elsevier collaboration

  • Has the contract with Elsevier been formally discussed with (a) funders (b) bodies of the University of Manchester (e.g. senates, councils)? Please provide documentation.
  • Is there an advisory board for the collaboration?
  • Has any third party outside NaCTeM formally discussed the advantages and disadvantages of the Elsevier collaboration?
  • Please provide a copy of the contract between the University and Elsevier. Please also include relevant planning documents, MoUs, etc.
  • Please highlight the duration and the financial resource provided by (a) the University (b) Elsevier. Please indicate what percentage of Full Economic Costs (FEC) will be recovered from Elsevier. (I shall assume that a figure of less than 100% indicates that the University is "subsidising Elsevier" and one greater than 100% means the University gains.)
  • Please indicate what contributions in kind (software, content, services, etc.) are made by either party and what they are valued at.
  • Please outline the expected deliverables. Please indicate whether any of the deliverables are made exclusively available to either or both parties and over what planned timescale.
  • Are any of the deliverables Open?
  • What is the IP for the deliverables in the collaboration?
  • Are any of the deliverables planned to be resold as software, services or content beyond the parties?
  • Has NaCTeM or the University or any involved third party raised the concern that contributing to Sciverse may be detrimental to the UK community?
  • Please indicate clearly what the planned benefit of the collaboration is to the UK.

 

I shall post this tomorrow so please comment now if you wish to.

 

Textmining: My years negotiating with Elsevier

This post – which is long, but necessary – recounts my attempts to obtain permission to text-mine content published in Elsevier's journals. (If you wish to trust my account, the simple answer is: I have got nowhere, and I am increasingly worried about Elsevier's Sciverse as a monopolistic walled garden. If you don't trust this judgement, read the details.) What matters is that the publishers are presenting themselves as "extremely helpful and responsive to requests for textmining" – my experience is the opposite and I have said so to Eefke Smit of the STM publishers' association. In particular I believe that Elsevier made me and the chemical community a public promise 2 years ago and has failed to honour it.

Although it is about chemistry it is immediately understandable by non-scientists. It is immediately relevant to my concerns about the collaboration between the University of Manchester and Elsevier but has much wider implications for scientific text-mining in general. New readers should read recent blog posts here, including http://blogs.ch.cam.ac.uk/pmr/2011/11/25/the-scandal-of-publisher-forbidden-textmining-the-vision-denied/ which explains what scientific textmining can cover, and should also read forthcoming posts and comments.

I shall frequently use "we" to mean the group I created in Cambridge and extended virtual coworkers. I am not normally a self-promotionist, but it is important to realise that in the following history "we" are the leading group in chemical textmining, objectively confirmed by the ACS Skolnik award. "We" deserve a modicum of respect in this.

 

I start from common practice, logic, and legal facts. My basic premises are:

  • I have the fundamental and absolute right to extract factual data from the literature and republish it as Open content. "facts cannot be copyrighted" (though collections can). It has been common practice over two or more centuries for scientists to abstract factual data from the literature to which they have access (either by subscription or through public libraries). There are huge compilations of facts. A typical example is the NIST webbook; please look at http://webbook.nist.gov/cgi/cbook.cgi?ID=C64175&Units=SI&Mask=1#Thermo-Gas. This is a typical page (of probably >> 100,000) carefully abstracted from the literature by humans. It is legal, it is valuable and it is essential.
  • We have developed technology to automate this process. I argue logically that what a human can do, a machine can do too. But logic has no force in business or in court, and I am forbidden to deploy my technology by restrictive publisher contracts (see previous posts). So what is a perfectly natural extension of human practice to machines is forbidden for no reason other than the protection of business interests. It has no logical basis.
  • I wish to mine factual data from Elsevier journals, specifically "Tetrahedron" and "Tetrahedron Letters". I shall refer to these jointly as "Tetrahedron". The factual content in these journals is created by academics and effectively 100% of this factual content is published verbatim without editorial correction. Authors are required to sign over their rights to Elsevier (and even if there may be exceptions they are tortuous in the extreme and most authors simply sign). Elsevier staff refer to this as "Elsevier content". I shall always quote this phrase as otherwise it implies legitimacy which I dispute – I do not believe it is legally possible to sign over factual data to a monopolist third party. But it has never been challenged in court.
  • Everything I do is Open. I have no hidden secrets in my emails and anyone is welcome to write to the University of Cambridge under FOI and request any or all of my emails with Elsevier. I personally cannot publish many of them because they contain the phrase: "The information is intended to be for the exclusive use of the intended addressee(s). If you are not an intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this message is strictly prohibited." However I suspect an FOI request would overrule this.

     

I have corresponded verbally and by email with several employees of Elsevier. I have done this through my natural contacts as Elsevier provide no central place for me to discuss the questions. I shall anonymise some of the Elsevier employees. If they feel their position has been misrepresented they are welcome to post a comment here and it will be reported in full. If they send an email I reserve the right to publish it Openly.

The simple facts (which can partly be substantiated by FOI on my emails but are stated here without them) are:

  • About 5 years ago I wrote to all five editors of Tetrahedron and also the Elsevier office about the possibility of enhancing Tetrahedron content through text-mining. I did not receive a single reply.
  • Two years ago there was a textmining meeting at Manchester, organized by NaCTeM and UKOLN (http://www.nactem.ac.uk/tm-ukoln.php). At that meeting Rafael Sidi, Vice-President Product Management, Elsevier presented "Open Up" (30 mins). [He is the named Elsevier contact in the NaCTeM / Elsevier contract]. He gave no abstract and I do not have his slides. From a contemporaneous blog (http://namesproject.wordpress.com/2009/10/ ) "Rafael Sidi of Elsevier (who got through an eye-boggling 180 slides in 30 minutes!) emphasised the importance of openness in encouraging innovation". With no other record I paraphrase the subsequent discussion between him and me (and I would be grateful for any eyewitness accounts or recordings). If Rafael Sidi wishes to give his account, he is welcome to use this blog.

     

    Essentially Rafael Sidi enthusiastically stated that we should adopt open principles for scientific content and mashup everything with everything. I then asked him if I could textmine Tetrahedron and mashup the content Openly. He said I could. I then publicly said I would follow this up. I have taken this as a public commitment by Sidi (who was representing Elsevier very clearly) that factual content in Tetrahedron could be mined without further permission.

     

  • I then followed it up with mails and phone calls to Sidi. Suffice to report that all the drive came from me and that after six months I had made no progress. I then tried another tack with a different Elsevier contact. After another 6 months, no progress. I then raised this in October 2010 with a member of Elsevier staff involved with the Beyond the PDF initiative http://sites.google.com/site/beyondthepdf/ . Although not directly concerned with chemistry she took up the case (and I personally thank her for her efforts) and thought she had made progress (a) by getting Elsevier to draw up a contract allowing me to textmine Tetrahedron and (b) by relaying this to David Tempest (Deputy Director "Universal Access", Elsevier) who was "currently reviewing policies" and wrote "we have finalised our policy and guidelines I would be happy to discuss this further with you." [That was 9 months ago and I have heard nothing.]

The contract is public, apparently available to anyone to negotiate (though there are no rights – all decisions are made by Elsevier). I was told:

You can mine 5 years of Tetrahedron, and will be helped to do so by Frankfurt. You can talk to them about formats.  There are two conditions:

1) You agree with the SciVerse developer's agreement - on http://developer.sciverse.com/start this is http://developer.sciverse.com/developeragreement - this also means you are not allowed to provide access to the Tetrahedron content (no surprise)

2) You can send us a description of the project you are working on, specifically describing the entities you are interested in mining, and the way in which you will use them.

To summarise:

  • Elsevier decide whether I can mine "their" content. I have no right. I can only beg.
  • All my results belong to Elsevier and I cannot publish them. Specifically:

     

    3.1.3 the Developer has not used robots, spiders or any other device which could retrieve or index portions of the Elsevier website, the Elsevier content or the APIs for any unauthorized purpose, and Developer conforms to all ethical use guidelines as published on the Elsevier website;

So I cannot search their site except as they permit

3.1.4 the Developer acknowledges that all right, title and interest in and to the Elsevier content, and any derivative works based upon the Elsevier content, remain with Elsevier and its suppliers, except as expressly set forth in this Agreement, and that the unauthorized redistribution of the Elsevier content is not permitted;

"And any derivative works" means that everything I do – chemical structures, spectral data – everything BELONGS TO ELSEVIER. Note the phrase "Elsevier content". The whole agreement is based on the concept that Sciverse (their platform for publishing "Elsevier content") is being developed as a walled garden where no-one has rights other than Elsevier.

Well, I have only taken 18 months to get to that position. I might be able to negotiate something slightly better if I take another two or three years.

And, in any case, I am not begging for permission to do a project. I am asking for my right – both implied by current practice and also stated by Rafael Sidi.

[Incidentally, it will be interesting to see if the University of Manchester has signed up to these developer conditions.]

 

And that's where the matter rests. No progress…

 

 

But no – I then received a request from Elsevier asking if they can use my software. (Why? Because our group is a/the leading one in chemical information extraction.) I can't reproduce the request verbatim as it's confidential, and I have therefore omitted names; but here it is, interleaved with my reply (which was copied to all the people in Elsevier including Rafael Sidi):

Dear Mr. Murray-Rust,

With great interest I have read your description of the OSCAR 4 chemical entity recognizer. We (redacted) would like to evaluate OSCAR for use in our entity recognizer system and compare it to other analysers.

Because OSCAR is Open Source you may do this without permission.

A few months ago, I have done some comparisons with other annotators and can only say that OSCAR compares quite favourably and is easily deployed – that is to say, if it runs as a Java server.

I assume these comparisons are confidential to Elsevier

This type of functionality is included in the OSCAR 3 implementation and is really easy to access because no coding layers are required to go between our code and yours – just an http webrequest.

We are using .Net for all our development so a web interface would be real nice. I gather from the article posted (OSCAR4: a flexible architecture for chemical text-mining) that there are several wrappers around by several users – is there any chance that there is a .Net or HTTP wrapper that we might use? A short-cut in Java to build one ourselves?

I understand this to be a request for free consultancy. Unfortunately we have run out of free consultancy at present.

Do you have any advice here?

Normally I would reply in a positive light to anyone asking polite questions, but I have had two years of unfulfilled promises from Elsevier, so I will engage on one condition – that Elsevier honour the public promise that Rafael Sidi made two years ago.

Mr Sidi stated in public that I could have permission to use OSCAR on chemical reactions published in Elsevier journals (Tetrahedron, Tetrahedron Letters, etc.) and to make the results publicly Open. Over the last two years I have tried to get action on this (see copied people). The closest I got was an agreement which I would have to sign saying that all my work would belong exclusively to Elsevier and that I would not be able to publish any of it. (The current agreement that my library has signed for subscriptions to Elsevier is that all text-mining is explicitly and strictly forbidden.) Not surprisingly I did not sign this.

By Elsevier making a public promise I assumed I would be able to do research in this field and publish all the results. In fact Elsevier has effectively held back my work for this period and looks set to continue to do so. I regard Elsevier as the biggest obstacle to the academic deployment of textmining at present.

The work that you are asking me to help you with will be an Elsevier monopoly with restrictive redistribution conditions and I am not keen on supporting monopolies. If you can arrange for Elsevier to honour their promise I will be prepared to explore a business arrangement though I am making no promises at present.

Thank you very much,

I am sorry this mail is written in a less than friendly tone but I cannot at present donate time to an organisation which works against the direction of my research and academia in general. If Elsevier agrees that scientific content can be textmined without permission and redistributed (as it should be if it is to be useful) then you will have helped to make progress.

I have copied in your colleagues who have been involved in the correspondence over the last two years.

[Name redacted]

I am currently treating your request as confidential, since it says I should, but I do not necessarily regard my reply as such. You will understand that I need a reply.

Needless to say I have received no reply. You may regard my reply as rude, but it is the product of broken promises from Elsevier, delays, etc. So, Rafael Sidi, if you are reading this blog I would appreciate a reply and the uncontrolled permission to mine and publish data from Tetrahedron.

Because I shall forward your response (or the lack of one) to the UK government who will use your reply as an example of whether the publishers are helpful to those wanting to textmine the literature.

 

 

 

 

Textmining: NaCTeM and Elsevier team up; I am worried

A bit over two weeks ago the following appeared on DCC-associates: http://www.mail-archive.com/dcc-associates@lists.ed.ac.uk/msg00618.html

Mon, 07 Nov 2011 09:16:34 -0800

This press release may be of interest to list members.

University enters collaboration to develop text mining applications

07 Nov 2011

http://www.manchester.ac.uk/aboutus/news/display/?id=7627

The University of Manchester has joined forces with Elsevier, a leading provider of scientific, technical and medical information products and services, to develop new applications for text mining, a crucial research tool.

The primary goal of text mining is to extract new information such as named entities, relations hidden in text and to enable scientists to systematically and efficiently discover, collect, interpret and curate knowledge required for research.

The collaborative team will develop applications for SciVerse Applications, which provides opportunities for researchers to collaborate with developers in creating and promoting new applications that improve research workflows.

The University's National Centre for Text Mining (NaCTeM), the first publicly-funded text mining centre in the world, will work with Elsevier's Application Marketplace and Developer Network team on the project.

Text mining extracts semantic metadata such as terms, relationships and events, which enable more pertinent search. NaCTeM provides a number of text mining services, tools and resources for leading corporations and government agencies that enhance search and discovery.

Sophia Ananiadou, Professor in the University's School of Computer Science and Director of the National Centre for Text Mining, said: "Text mining supports new knowledge discovery and hypothesis generation.

"Elsevier's SciVerse platform will enable access to sophisticated text mining techniques and content that can deliver more pertinent, focused search results."

"NaCTeM has developed a number of innovative, semantic-based and time-saving text mining tools for various organizations," said Rafael Sidi, Vice President Product Management, Applications Marketplace and Developer Network, Elsevier.

"We are excited to work with the NaCTeM team to bring this expertise to the research community."

 

Now, I have worked with NaCTeM, and actually held a JISC grant (ChETA) in which NaCTeM were collaborators and which resulted in useful work, published articles and Open Source software. The immediate response to the news was from Simon Fenton-Jones:

Let me see if I got this right.

"Elsevier, a leading provider of scientific, technical and medical information products and services", at a cost which increases much faster than inflation, to libraries who can't organize their researchers to back up a copy of their journal articles so they can be aggregated, is to have their platform, Sciverse, made more attractive, by the public purse by a simple text mining tool which they could build on a shoestring.

Sciverse Applications, in return, will take advantage of this public largesse to charge more for the journals which should/could have been compiled by public digital curators in the first instance.

Hmm. So this is progress.

Hey. It's not my money!

 

[PMR: I think it's "not his money" because he writes from Australia, but he will still suffer]

PMR: I agree with this analysis. I posted an initial response (http://www.mail-archive.com/dcc-associates@lists.ed.ac.uk/msg00621.html )

 

No - it's worse. I have been expressly and consistently asking Elsevier for permission to text-mine factual data from their (sorry OUR) papers. They have prevaricated and fudged and the current situation is: "you can sign a text-mining licence which forbids you to publish any results and hands over all results to Elsevier".

I shall not let this drop - I am very happy to collect allies. Basically I am forbidden to deploy my text-mining tools on Elsevier content.

P.

 

I shall elaborate on this. I was about to write more, because I completely agree about the use of public money and the lack of benefit to the community. However I have been making enquiries and it appears that public funding for NaCTeM is being run down – effectively they are becoming a "normal" department of the university – with less (or no) "national" role.

However the implications of this deal are deeply worrying – because it further impoverishes our rights in the public arena and I will explain further later. I'd like to know exactly what NaCTeM and the University of Manchester are giving to Elsevier and what they are getting out of it.

This post will give them a public chance – in the comments section, please - to make their position clear.

 

The scandal of publisher-forbidden textmining: The vision denied

This is the first post of probably several on my concerns about textmining. You do NOT have to be a scientist to understand the point with total clarity. This topic is one of the most important I have written about this year. We are at a critical point where, unless we take action, our scholarly rights will be further eroded. What I write here is designed to be submitted to the UK government as evidence if required. I am going to argue that the science and technology of textmining is systematically restricted by scholarly publishers, to the serious detriment of the utilisation of publicly funded research.

What is textmining?

The natural process of reporting science often involves text as well as tables. Here is an example from chemistry (please do not switch off – you do not need to know any chemistry.) I'll refer to it as a "preparation" as it recounts how the scientist(s) made a chemical compound.

To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and the suspension was stirred at rt for 1-24 h. The reaction was diluted slowly with water and extracted with CH2Cl2. The organic layer was washed successively with water, brine, dried over Na2SO4, and concentrated to give the title compound as oil (0.8 g, 79%), which was used in the next reaction without further purification. MS (ESI, pos. ion) m/z: 247.1 (M-OH).

The point is that this is a purely factual report of an experiment. No opinion, no subjectivity. A simple, necessary account of the work done. Indeed if this were not included it would be difficult to work out what had been done and whether it had been done correctly. A student who got this wrong in their thesis would be asked to redo the experiment.

This is tedious for a human to read. However, during the C20 large industries were built on humans reading this and reporting the results. Two of the best-known abstracters are the ACS's Chemical Abstracts and Beilstein's database (now owned by Elsevier). These abstracting services have been essential for chemistry – to know what has been done and how to repeat it (much chemistry involves repeating previous experiments to make material for further synthesis, testing, etc.).

Over the years our group has developed technology to read and "understand" language like this. Credit to Joe Townsend, Fraser Norton, Chris Waudby, Sam Adams, Peter Corbett, Lezan Hawizy, Nico Adams, David Jessop, Daniel Lowe. Their work has resulted in an Open Source toolkit (OSCAR4, OPSIN, ChemicalTagger) which is widely used in academia and industry (including publishers). So we can run ChemicalTagger over this text and get:

EVERY word in this has been interpreted. The colours show the "meaning" of the various phrases. But there is more. Daniel Lowe has developed OPSIN which works out (from a 500-page rulebook from IUPAC) what the compounds are. So he has been able to construct a complete semantic reaction:

If you are a chemist I hope you are amazed. This is a complete balanced chemical reaction with every detail accurately extracted. The fate of every atom in the reaction has been worked out. If you are not a chemist, try to be amazed by the technology which can read "English prose" and turn it into diagrams. This is the power of textmining.
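As a flavour of what this involves, here is a toy sketch in Python. It is emphatically not the real ChemicalTagger, which uses a formal grammar and chemistry-aware natural language processing rather than regexes; the patterns and labels below are invented purely to illustrate the kind of phrase-level tagging involved:

import re

# Toy illustration only: ChemicalTagger itself uses chemistry-aware NLP and a
# formal grammar; these invented regexes merely show the kind of phrases it labels.
TEXT = ("To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) "
        "was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and "
        "the suspension was stirred at rt for 1-24 h.")

PATTERNS = {
    "MASS":   r"\d+(?:\.\d+)?\s*g\b",
    "VOLUME": r"\d+(?:\.\d+)?\s*mL\b",
    "AMOUNT": r"\d+(?:\.\d+)?\s*mmol\b",
    "TIME":   r"\d+(?:-\d+)?\s*h\b",
    "TEMP":   r"\brt\b",
    "ACTION": r"\b(?:added|stirred|diluted|extracted|washed|dried|concentrated)\b",
}

for label, pattern in PATTERNS.items():
    for match in re.finditer(pattern, TEXT):
        print(label, repr(match.group()))

The real toolkit goes much further: OPSIN resolves the chemical names to structures, so the tagged phrases can be assembled into the complete balanced reaction described above.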

There are probably about 10 million such preparations reported in the scholarly literature. There is an overwhelming value in using textmining to extract the reactions. In Richard Whitby's Dial-a-molecule project (EPSRC) the UK chemistry community identified the critical need to text-mine the literature.

So why don't we?

Is it too costly to deploy?

No.

Will it cause undue load on publisher servers?

No, if we behave in a responsible manner.

Does it break confidentiality?

No – all the material is "in the public domain" (i.e. there are no secrets)

Is it irresponsible to let "ordinary people" do this?

No.

Then let's start!

NO!!!!

BECAUSE THE PUBLISHERS EXPRESSLY FORBID US TO DO TEXTMINING

But Universities pay about 5-10 Billion USD per year as subscriptions for journals. Surely this gives us the right to textmine the content we subscribe to.

NO, NO, NO.

Here is part of the contract that Universities sign with Elsevier (I think CDL is California Digital Library, but Cambridge's is similar); see http://lists.okfn.org/pipermail/open-science/2011-April/000724.html for more resources:

The CDL/Elsevier contract includes [Schedule 1.2(a), General Terms and Conditions, "RESTRICTIONS ON USAGE OF THE LICENSED PRODUCTS/INTELLECTUAL PROPERTY RIGHTS", GTC1]:

"Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement."

 

What does that mean?

NO TEXTMINING. NO INDEXING. NO ROBOTS. NO NOTHING.

Whyever did the library sign this?

I have NO IDEA. It's one of the worst abrogations of our rights I have seen.

Did the libraries not flag this up as a serious problem?

If they did I can find no record.

So the only thing they negotiated on was price? Right?

Appears so. After all 10 Billion USD is pretty cheap to read the literature that we scientists have written. [sarcasm].

So YOU are forbidden to deploy your state-of-the art technology?

PMR: That's right. Basically the publishers have destroyed the value of my research. (I exclude CC-BY publishers but not the usual major lot).

What would happen if you actually did try to textmine it?

They would cut the whole University off within a second.

Come on, you're exaggerating.

Nope – it's happened twice. And I wasn't breaking the contract – they just thought I was "stealing content".

Don't they ask you first, to find out if there is a problem?

No. Suspicion of theft. Readers are Guilty until proven innocent. That's publisher morality. And remember that we have GIVEN them this content. If I wished to datamine my own chemistry papers I wouldn't be allowed to.

But surely the publishers are responsive to reasonable requests?

That's the line they are pushing. I will give my own experience in the next post.

So they weren't helpful?

You will have to find out.

Meanwhile you are going to send this to the government, right?

Right. The UK has commissioned a report on this. Prof Hargreaves. http://www.ipo.gov.uk/ipreview-finalreport.pdf

And it thinks we should have unrestricted textmining?

Certainly for science technical and medical.

So what do the publishers say?

They think it's over the top. After all they have always been incredibly helpful and responsive to academics. So there isn't a real problem. See http://www.techdirt.com/articles/20111115/02315716776/uk-publishers-moan-about-content-minings-possible-problems-dismiss-other-countries-actual-experience.shtml

Nonetheless, the UK Publishers Association, which describes its "core service" as "representation and lobbying, around copyright, rights and other matters relevant to our members, who represent roughly 80 per cent of the industry by turnover", is unhappy. Here's Richard Mollet, the Association's CEO, explaining why it is against the idea of such a text-mining exception:

If publishers lost the ability to manage access to allow content mining, three things would happen. First, the platforms would collapse under the technological weight of crawler-bots. Some technical specialists liken the effects to a denial-of-service attack; others say it would be analogous to a broadband connection being diminished by competing use. Those who are already working in partnership on data mining routinely ask searchers to "throttle back" at certain times to prevent such overloads from occurring. Such requests would be impossible to make if no-one had to ask permission in the first place.

They've got a point, haven't they?

PMR: This is appalling disinformation. This applies ONLY to content that is behind the publishers' paywalls. If there were any technical problems they would know where the requests came from and could arrange a solution.

Then there is the commercial risk. It is all very well allowing a researcher to access and copy content to mine if they are, indeed, a researcher. But what if they are not? What if their intention is to copy the work for a directly competing-use; what if they have the intention of copying the work and then infringing the copyright in it? Sure they will still be breaking the law, but how do you chase after someone if you don't know who, or where, they are? The current system of managed access allows the bona fides of miners to be checked out. An exception would make such checks impossible.

["managed access" == total ban]

If you don't immediately see that this is a spurious argument, then read the techdirt article. The ideal situation for publishers is if no-one reads the literature. Then it's easy to control. This is, after all, PUBLISHING (although Orwell would have loved the idea of modern publishing being to destroy communication).

Which leads to the third risk. Britain would be placing itself at a competitive disadvantage in the European & global marketplace if it were the only country to provide such an exception (oh, except the Japanese and some Nordic countries). Why run the risk of publishing in the UK, which opens its data up to any Tom, Dick & Harry, not to mention the attendant technical and commercial risks, if there are other countries which take a more responsible attitude.

So PMR doing cutting-edge research puts Britain at a competitive disadvantage. I'd better pack up.

But not before I have given my own account of what we are missing and the collaboration that the publishers have shown me.

And I'll return to my views about the deal between University of Manchester and Elsevier.

Open Research Reports Hackathon: What is Semantics? URIs and URLs

#orr2011

My colleagues and I are getting ready for the December Hackathon (JISC, OKF, SWAT4LS), which includes Open Research Reports and the Semantic Web For Life Sciences. The Hackathon can include any activity, but we are preparing material to bring along, based on Open Research for diseases, which is or can be semantified. We hope this will be an important step forward in making disease information more widely available and useful.

So what's Semantics? It's not a disease, is it?

No. It's a formal way of talking about things. Humans are (usually) very good at understanding each other even when they use fuzzy language. For example there is a sign next to our bicycle shed which says:

NO BICYCLES HERE

*We* all know this means:

"Do not put bicycles here"

and the study of this is called pragmatics http://en.wikipedia.org/wiki/Pragmatics .

Here are three sentences where English speakers easily distinguish the different meanings of the symbol "cold":

  • She has a cold
  • She has a cold sore
  • She has a cold foot

We will return to these later.

Unfortunately pragmatics is beyond the range of most computer systems, so we have to create formal systems for them – these are based on syntax (a common symbolic representation, http://en.wikipedia.org/wiki/Syntax ) and semantics (agreement on meaning, http://en.wikipedia.org/wiki/Semantics ). (Be warned that the border between these is fuzzy.)

Our syntax for the semantic web includes URIs (Uniform Resource Identifiers).

A: For example, we could identify the common cold by http://en.wikipedia.org/wiki/Common_cold .

B: HANG ON! That's not a name, it's an address. It's a Uniform Resource Locator (URL, http://en.wikipedia.org/wiki/URL, )

A: Yes. It's an address and also a name. The URI identifies the resources and also locates it.

B: But it might not be there – you might get a 404.

A: Wikipedia never 404s

B: Or someone could copy the page to another address. It's still the same page, but a different URL.

A: but it's not the definitive URI

B: Why not? And anyway the XML crew spent 10,000 mail messages arguing that names and addresses were different.

A: Well, they are the same now. Tim says so.

B: That's a distorted view of reality.

PMR: Hussssh! This has been a major debate for years and will continue to be so. Here's Tim (http://en.wikipedia.org/wiki/Linked_Data ):

Tim Berners-Lee outlined four principles of Linked Data in his Design Issues: Linked Data note, paraphrased along the following lines:

  1. Use URIs to identify things.
  2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
  3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

B: So these are conflated URIs ("HTTP URIs"). They only work if the thing is a web resource.

A: Here he is again:

Tim Berners-Lee gave a presentation on Linked Data at the TED 2009 conference. In it, he restated the Linked Data principles as three "extremely simple" rules:

  1. All kinds of conceptual things, they have names now that start with HTTP.
  2. I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
  3. I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.

Note that although the second rule mentions "standard formats", it does not require any specific standard, such as RDF/XML.
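(To make principles 2 and 3 concrete, here is a minimal Python sketch of "dereferencing" an HTTP URI with content negotiation. It uses DBpedia – the Linked Data mirror of Wikipedia – purely as an illustration; the exact redirects and format returned depend on the server.)

import urllib.request

# Dereference an HTTP URI (Linked Data principles 2-3): ask for RDF rather
# than HTML via the Accept header; the server redirects to a data document.
uri = "http://dbpedia.org/resource/Common_cold"
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Content-Type"))  # e.g. application/rdf+xml
    print(resp.read(400))                    # the first few bytes of RDF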

B: so it's only "conceptual things". Like "syntax". My cat cannot have an HTTP-URI.

A: not your cat. But TBL can be dereferenced: Look at http://en.wikipedia.org/wiki/Tim_Berners-Lee .

B: That's not him – it's a web page about him. You make it sound as if Wikipedia defines reality. If it's not in Wikipedia it doesn't exist. You are a Borgesian.

A: A what?

PMR: Shhhh! This is the sort of "robust discussion" we get into all the time. We are going to take a very simple approach to the semantic web. The advantage is it is easy to understand and will work. We are first of all going to give things precise labels.

B: Like a "cold"

PMR: exactly like that. We will call a cold "J00.0"

B: whatever for? I won't remember that.

A: You don't have to – the machines will. A cold will always be J00.0

B: Well why "J00.0"? Why not "common_cold", like Wikipedia (http://en.wikipedia.org/wiki/Common_cold )?

A: Because that's what the WHO calls it, in their International Classification of Diseases, 10th Revision (ICD-10) http://en.wikipedia.org/wiki/ICD-10 . PMR actually worked with the WHO (in Uppsala) to convert ICD-10 to XML. He knows it by heart.

PMR: well I did. I've forgotten most of it.

B: OK, well I suppose the WHO has a right to create names for diseases. But surely they aren't the only ones?

A: No – there's http://en.wikipedia.org/wiki/Medical_Subject_Headings (MeSH) – which calls it D003139. And ICD-9.

B: The ninth edition I suppose …

A: Yes. Calls it 460.

B: I bet they don't all agree on what a cold is.

PMR: No. There's lots of variation in medical terminology. There's the http://en.wikipedia.org/wiki/Unified_Medical_Language_System (UMLS) It:

is a compendium of many controlled vocabularies in the biomedical sciences (created 1986[1]). It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts.

B: and now this "ontology" word?

A: it's a formal system (http://en.wikipedia.org/wiki/Ontology_%28computer_science%29 ):

an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.

B: It's too complicated for me.

PMR: We are going to start simple. Ontologies tell computers how to distinguish between different meanings of the concept "cold". We'll just assume that we humans generally agree.

A: But doctors don't agree on diagnoses – how can we?

PMR: This isn't about whether you are actually infected by rhinovirus…

B: … ???

PMR: The virus that causes a cold. It looks like this:

[image: the rhinovirus capsid]
A: Yes – it's got icosahedral symmetry and was …

PMR: … back to the semantics. It's about putting the concept of "cold" into computers. We need a unique identifier and we can use the WHO one.

B: but J00.0 isn't unique. That's the number of my neighbour's car.

PMR: so we turn it into a URI. An HTTP-URI is unique because it's based on domain names, and they are unique.

A: but what domain name? Since the WHO invented it, let's use the HTTP-URL for the cold. That's http://apps.who.int/classifications/icd10/browse/2010/en#/J00-J06

B: but that should be http://apps.who.int/classifications/icd10/browse/2010/en#/J00 - but that doesn't resolve. And in any case I bet the "apps" bit changes. That's why addresses are no use for URIs

PMR: It's really up to authorities like WHO to give stable identifiers for this, that are persistent in name and address.

B: That's a tough order. Do you think the WHO are up to it?

PMR: Probably not yet. We'll probably need to invent a way round it. Perhaps with a PURL (http://en.wikipedia.org/wiki/Persistent_Uniform_Resource_Locator ).

B: and you said this was easy?

PMR: The Semantic Web community is working hard to make this easy for you, yes. Anyway, nearly there. Let's just use http://purl.org/who/classifications/icd10/J00.0 as a shorthand for "common cold"

B: "shorthand"

PMR: sorry, identifier. And address – which we can make resolvable by redirecting the PURL.

A: OK, we've now got an identifier system for all diseases. Will we always use ICD-10?

PMR: It'll make it easier for our ORR project and we shan't need mappings or ontologies.

A: So we can identify "cold sores" as http://purl.org/who/classifications/icd10/B00.1

PMR: Yes

B: You've convinced me that we can give each disease a unique identifier (whether we actually have the disease or not). But "cold sores" is not a disease – it's a symptom. And the disease is "Herpesviral vesicular dermatitis" according to WHO. The virus isn't a disease as such, so does it have its own identifier?

 

PMR: Yes. The virus (http://en.wikipedia.org/wiki/Herpes_simplex_virus ) is actually a combination of protein and DNA. Its classification is:

Family: Herpesviridae; Subfamily: Alphaherpesvirinae; Genus: Simplexvirus; Species: Herpes simplex virus 1 (HSV-1)

B: But that's not an identifier.

PMR: Agreed. So somewhere we need to find an identifier or work out a scheme for creating one.

B: So the semantic web won't work?

A: We are all at the stage of creating it. There's been a huge increase in identifier systems. There are now thousands in the Linked Open Data cloud. And that's the sort of thing we'll tackle in the Hackathon.

B: I'm knackered. I've learnt that we need HTTP-URIs for everything. We've just done diseases. If we want to do ORRs we need people, places, parasites and so on. All in semantic form, right?

PMR: Right.

B: so there's a HUGE amount of work to be done.

A: But lots of people are involved. And once we've done it, it will be persistent.

B: until we change our concepts…

PMR: But by then we shall already have shown how powerful it is.
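To make the identifier part of this dialogue concrete, here is a minimal sketch (Python with rdflib) of stating, in RDF, that the ICD-10, MeSH and Wikipedia identifiers discussed above all point at the common cold. The PURLs follow the scheme proposed above and are assumptions, not registered identifiers:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, SKOS

# The ICD-10 PURL scheme is the one proposed in the dialogue above (assumed,
# not registered); the MeSH and ICD-9 PURLs are likewise invented placeholders.
ICD10 = Namespace("http://purl.org/who/classifications/icd10/")

g = Graph()
cold = ICD10["J00.0"]

g.add((cold, RDFS.label, Literal("common cold", lang="en")))
g.add((cold, SKOS.exactMatch, URIRef("http://purl.org/mesh/D003139")))  # MeSH
g.add((cold, SKOS.exactMatch, URIRef("http://purl.org/icd9/460")))      # ICD-9
g.add((cold, RDFS.seeAlso, URIRef("http://en.wikipedia.org/wiki/Common_cold")))

print(g.serialize(format="turtle"))  # rdflib 6+ returns a string here

Once the identifiers are expressed as triples like these, machines can follow the links between terminologies without any grasp of pragmatics.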

 

 


 

Multiple Metrics for “Visions of a Semantic Molecular Future”

After posting the access metrics for VoaSMF, Egon Willighagen suggested that we also use some of the new alternative metrics ("altmetrics"). There's more than one such effort and they are to be welcomed (citation metrics by themselves are inefficient, imprecise and suffer a huge timelag). So people such as Egon, Heather Piwowar and Cameron Neylon are creating immediate metrics – literally hour-by-hour. Here we show "Total Impact" (http://total-impact.org/).

Before we show the figures let me commend this and similar efforts because:

  • They are not tied to a commercial business. (Journal Impact Factors have long been tainted with the suspicion that they are manipulated between aggregators and publishers – not for the benefit of science, but in the near-meaningless struggle between journals for branding.)
  • They are immediate. Within hours of publication you can get alternative metrics.
  • They are objective and democratic. Anyone can build tools to carry them out, which gives real scientists a say in how science is measured. I expect universities to give them the cold shoulder for some time, since they challenge the current trivial system. And they are free.
  • They are multivariate. A whole host of different metrics can be used. You can even make up your own.

The altmetrics software was originally hacked at a workshop that Cameron ran, I think. Anyway, it is typical of the quality and speed that can be achieved by people working together with a common vision and shared tools. Indeed (I hope) this shows the challenge to the lumbering publication systems that publishers build and force us to use. We are starting to liberate our expressiveness.

So here are our 15 articles and I comment later:

 

report for "Visions of a Semantic Molecular Future"

15 artifacts; created 15 Nov 2011, updated 15 Nov 2011
Permalink: http://total-impact.org/report.php?id=CXtjIz

[The report currently labels each paper a "Dataset" – there is a bug; these should be labelled "article". Total Impact is an open project; the source code is on GitHub.]

There are at least 6 metrics (a sketch of one artifact's record follows the list):

  • Tweets
  • Blogs (a limited selection)
  • Bookmarks (Cite-U-Like)
  • Wikipedia entries
  • Facebook shares and likes
  • Mentions
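
Here is a minimal sketch of the kind of multivariate record such a tool reports for a single artifact. The field names follow the list above; the counts are invented purely for illustration (the tweet count echoes the ~50 tweets mentioned below):

```
# One artifact's (invented) multivariate altmetrics record; the fields
# mirror the six metrics listed above, the numbers are illustrative only.
article_metrics = {
    "doi": "10.1186/1758-2946-3-36",   # John Wilbanks' article
    "tweets": 50,
    "blogs": 3,                        # only a limited selection is crawled
    "citeulike_bookmarks": 4,
    "wikipedia_mentions": 0,
    "facebook_shares_likes": 12,
    "mentions": 7,
}

for metric, value in article_metrics.items():
    print(f"{metric:22s} {value}")
```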

Most are sparsely populated – the exception being JohnW's tweets. These are real – the twittersphere resounded on the day of publication with a massive list of tweets about John's article. There are some technical issues – some metrics currently require DOIs, etc.

People may argue that these metrics can be gamed. (Of course John pulled people off the streets of SF to tweet his article.) Seriously, I think the accesses are reasonably accurate and haven't been gamed AFAIK. The altmetrics tools don't visit enough blogs, I think, but that will come.

The point is that rather than waiting 3 years to find out if anyone has read our articles we get a current picture. It doesn't surprise me that there are hundreds of accesses for each tweet or blog – we've had 300,000 downloads of Chem4Word and hardly a squeak. And most cheminformaticians don't tweet or blog or show themselves in the glare of Open social networks.

Thanks to the altmetricians for their efforts.

P.

Semantic Molecular Future: Article accesses during first 30 days

I believe that we seriously need a set of new metrics for scholarly publication. It should be multidimensional, and one of these dimensions should be accesses. [Yes, I know it's possible to game the system. Everything can be gamed. The only thing that can't is scientific reality, but that's increasingly irrelevant to academia.] So here is my contribution to access-metrics.

Our special issue of "Visions of a Semantic (Molecular) Future" in J. Cheminformatics (BioMed Central) has now been out for a month. Since BMC publish the accesses for each article (a) confidentially to the author and (b) publicly for the highest-accessed articles each month, I can start to do some simple analysis. (I'm only missing figures for one author, Cameron Neylon.)

Here are the stats after exactly one month (October 14th to November 14th); the actual stats are for a window of 30 days.

2639 Accesses Openness as infrastructure
John Wilbanks Journal of Cheminformatics 2011, 3:36 (14 October 2011)

1185 Accesses Open Bibliography for Science, Technology, and Medicine Richard Jones, Mark MacGillivray, Peter Murray-Rust, Jim Pitman, Peter Sefton, Ben O'Steen, William Waites Journal of Cheminformatics 2011, 3:47 (14 October 2011)

1018 Accesses Semantic science and its communication - a personal view Peter Murray-Rust Journal of Cheminformatics 2011, 3:48 (14 October 2011)

936 Accesses Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on Noel M O'Boyle, Rajarshi Guha, Egon L Willighagen, Samuel E Adams, Jonathan Alvarsson, Jean-Claude Bradley, Igor V Filippov, Robert M Hanson, Marcus D Hanwell, Geoffrey R Hutchison, Craig A James, Nina Jeliazkova, Andrew SID Lang, Karol M Langner, David C Lonie, Daniel M Lowe, Jérôme Pansanel, Dmitry Pavlov, Ola Spjuth, Christoph Steinbeck, Adam L Tenderholt, Kevin J Theisen, Peter Murray-Rust Journal of Cheminformatics 2011, 3:37 (14 October 2011)

822 Accesses Ami - The chemist's amanuensis Brian J Brooks, Adam L Thorn, Matthew Smith, Peter Matthews, Shaoming Chen, Ben O'Steen, Sam E Adams, Joe A Townsend, Peter Murray-Rust Journal of Cheminformatics 2011, 3:45 (14 October 2011)

681 Accesses The past, present and future of Scientific discourse Henry S Rzepa Journal of Cheminformatics 2011, 3:46 (14 October 2011)

531 Accesses OSCAR4: a flexible architecture for chemical text-mining David M Jessop, Sam E Adams, Egon L Willighagen, Lezan Hawizy, Peter Murray-Rust Journal of Cheminformatics 2011, 3:41 (14 October 2011)

429 Accesses CML: Evolution and design Peter Murray-Rust, Henry S Rzepa Journal of Cheminformatics 2011, 3:44 (14 October 2011)

420 Accesses Mining chemical information from open patents David M Jessop, Sam E Adams, Peter Murray-Rust Journal of Cheminformatics 2011, 3:40 (14 October 2011)

313 Accesses The semantics of Chemical Markup Language (CML): dictionaries and conventions Peter Murray-Rust, Joe A Townsend, Sam E Adams, Weerapong Phadungsukanan, Jens Thomas Journal of Cheminformatics 2011, 3:43 (14 October 2011)

280 Accesses The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age Sam Adams, Pablo de Castro, Pablo Echenique, Jorge Estrada, Marcus D Hanwell, Peter Murray-Rust, Paul Sherwood, Jens Thomas, Joe Townsend Journal of Cheminformatics 2011, 3:38 (14 October 2011)

265 Accesses Adventures in public data Dan W Zaharevitz Journal of Cheminformatics 2011, 3:34 (14 October 2011)

^^^ cut-off of top 25 papers this month ^^^

264 Accesses CMLLite: a design philosophy for CML Joe A Townsend, Peter Murray-Rust Journal of Cheminformatics 2011, 3:39 (14 October 2011)

263 Accesses The semantic architecture of the World-Wide Molecular Matrix (WWMM)
Peter Murray-Rust, Sam E Adams, Jim Downing, Joe A Townsend, Yong Zhang Journal of Cheminformatics 2011, 3:42 (14 October 2011)

??? Accesses Three stories about the conduct of science: Past, future, and present Cameron Neylon Journal of Cheminformatics 2011, 3:35 (14 October 2011)

=== and a previously-published article ===

235 Accesses
ChemicalTagger: A tool for semantic text-mining in chemistry
Lezan Hawizy*, David M Jessop*, Nico Adams and Peter Murray-Rust Journal of Cheminformatics 2011, 3:17 doi:10.1186/1758-2946-3-17 (This has been out for several months and is noted as "Highly Accessed")

Total: 10063 accesses for 15 papers in the first month – ca 20 accesses per day per paper (checked below).
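
A quick check of that back-of-envelope figure, using only the totals stated above:

```
# Sanity-check of "ca 20 accesses per day per paper" from the stated totals.
total_accesses = 10063
papers = 15
days = 30

print(f"{total_accesses / (papers * days):.1f} accesses per paper per day")
# -> 22.4, i.e. roughly the "ca 20" quoted above
```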

So what conclusions?

  • Clearly the figures aren't random, or created solely by bots. (I joked about the "wilbots", but I am sure these are real people reading John's article.) There were about 50 tweets mentioning this article, so people wanted other people to know about it.
  • The top articles are not significantly about chemistry. So it doesn't matter where material is published. This is a really important message. If you have something to say, then people will find it.
  • A LOT of people read Open Access material. Note that after 14 days the articles were also available on PubMed, so the true figures are probably double these.
  • I can't believe that these papers would have been nearly so widely read in the – effectively Closed – competitor, J. Chem. Inf. and Modeling. We have also had high readership there – the OPSIN paper was "highly accessed" – but I have no figures. But I suspect that only a small fraction of the current readership would have had access. So closed-access publication constrains innovation and constrains multidisciplinarity.

UPDATE:

Egon writes in a comment:

Peter, why not give Total Impact a try… click the 'Manually edit this collection' link in the left menu under 'Collect IDs from:' and add the DOIs for those papers, and hit the red 'Get Metrics' button…

PMR: I don't get anything back that makes sense… Nor do I know how to add PMIDs instead of DOIs.
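
For anyone who wants to try Egon's suggestion, the DOIs need not be typed by hand. J. Cheminformatics DOIs appear to follow the pattern 10.1186/1758-2946-<volume>-<article> (compare the ChemicalTagger DOI quoted above); assuming that pattern holds, the whole list can be generated from the article numbers:

```
# Generate the 15 DOIs from the article numbers (3:34 .. 3:48 above),
# assuming the 10.1186/1758-2946-<volume>-<article> pattern holds.
for n in range(34, 49):
    print(f"10.1186/1758-2946-3-{n}")
```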