What is the basis of the NaCTeM-Elsevier agreement? FOI should give the answer

In the previous posts (/pmr/2011/11/25/textmining-nactem-and-elsevier-team-up-i-am-worried/ and /pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ ) I highlighted concerns (not just mine) about the publicly announced collaboration between NaCTeM (the National Centre for Text Mining at the University of Manchester) and Elsevier (henceforth N+E). I am now going to find out precisely the details of this collaboration and, when I have the answers, will be in a position to answer the following questions:

  • What is NaCTeM’s mission for the nation? (NaCTeM formally has a responsibility to the Nation)
  • What public finance has NaCTeM had and what is planned in the future?
  • What public money has gone into the N+E?
  • What are the planned benefits to Elsevier?
  • What are the planned benefits of N+E to NaCTeM?
  • Are there plans to pass any of these benefits to the wider national community?

In particular my concerns are:

  • Will the benefits of this work be available only through Elsevier’s Sciverse platform?
  • Are we getting value for money?

It may seem strange – and potentially confrontational – to use FOI to get this information rather than simply asking the University or NaCTeM. But the power of FOI is that the University has specialist staff to give clear unemotional answers. And in particular it will highlight precisely whether there are hidden confidential aspects. If so it will be especially important to assess whether this is in the Nation’s interest. And, with the possibility that this will reveal material that is useful to the Hargreaves process and UK government (through my MP) it is important that my facts are correct.

For those who aren’t familiar with the FOI process: each public institution has a nominated officer who must, within 20 working days, give answers to all questions (or show why s/he should not). I shall use http://whatdotheyknow.com – a superb site set up for this purpose which means that everyone can follow the process and read the answers. FOI officers are required to respond promptly, and I hope that Manchester will do so – and be quicker than Oxbridge, who ritually take 19.8 days to respond. Note that I am not required to give my motivation. I shall request information in existing documents or known facts – this is not a place for future hypotheticals or good intentions.

 

Dear FOI University of Manchester,

I am requesting information under FOI about the National Centre for Text Mining (NaCTeM) and the University’s recently announced collaboration of NaCTeM with Elsevier (http://www.manchester.ac.uk/aboutus/news/display/?id=7627 ). The information should be supported by existing documents (minutes, policy statements, etc.). I shall be concerned about the availability of resource material to the UK in general (i.e. beyond papers and articles). I use the word “Open” (capitalised) to mean information or services which are available for free use, re-use and redistribution without further permission (see http://opendefinition.org/ ). In general this means OSI-compliant Open Source for code and CC-BY or CC0 for content (CC-NC and “for academics only” are not Open).

General

  • What is the current mission statement of NaCTeM?
  • Does NaCTeM have governing or advisory bodies or processes? If so please list membership, dates of previous meetings and provide minutes and support papers.
  • List the current public funding (amounts and funders) for NaCTeM over the last three years and the expected public funding in the foreseeable future.
  • What current products, content and services are provided to the UK community (academic and non-academic) other than to NaCTeM?
  • What proportion of papers published by NaCTeM are fully Open?
  • What proportion and amount of software, content (such as corpora) and services provided by NaCTeM is fully Open?

Elsevier collaboration

  • Has the contract with Elsevier been formally discussed with (a) funders (b) bodies of the University of Manchester (e.g. senates, councils)? Please provide documentation.
  • Is there an advisory board for the collaboration?
  • Has any third party outside NaCTeM formally discussed the advantages and disadvantages of the Elsevier collaboration?
  • Please provide a copy of the contract between the University and Elsevier. Please also include relevant planning documents, MoUs, etc.
  • Please highlight the duration and the financial resource provided by (a) the University (b) Elsevier. Please indicate what percentage of Full Economic Costs (FEC) will be recovered from Elsevier. (I shall assume that a figure of less than 100% indicates that the University is “subsidising Elsevier” and one greater than 100% means the University gains.)
  • Please indicate what contributions in kind (software, content, services, etc.) are made by either party and what they are valued at.
  • Please outline the expected deliverables. Please indicate whether any of the deliverables are made exclusively available to either or both parties and over what planned timescale.
  • Are any of the deliverables Open?
  • What is the IP position for the deliverables in the collaboration?
  • Are any of the deliverables planned to be resold as software, services or content beyond the parties?
  • Has NaCTeM or the University or any involved third party raised the concern that contributing to Sciverse may be detrimental to the UK community?
  • Please indicate clearly what the planned benefit of the collaboration is to the UK.

 

I shall post this tomorrow so please comment now if you wish to.

 

Posted in Uncategorized | 5 Comments

Textmining: My years negotiating with Elsevier

This post – which is long, but necessary – recounts my attempts to obtain permission to text-mine content published in Elsevier’s journals. (If you wish to trust my account, the simple answer is: I have got nowhere and I am increasingly worried about Elsevier’s Sciverse as a monopolistic walled garden. If you don’t trust this judgement, read the details.) What matters is that the publishers are presenting themselves as “extremely helpful and responsive to requests for textmining” – my experience is the opposite and I have said so to Efke Smit of the STM publishers’ association. In particular I believe that Elsevier made me and the chemical community a public promise 2 years ago and they have failed to honour it.

Although it is about chemistry it is immediately understandable by non-scientists. It is immediately relevant to my concerns about the collaboration between the University of Manchester and Elsevier but has much wider implications for scientific text-mining in general. New readers should read recent blog posts here, including /pmr/2011/11/25/the-scandal-of-publisher-forbidden-textmining-the-vision-denied/ which explains what scientific textmining can cover, and should also read forthcoming posts and comments.

I shall frequently use “we” to mean the group I created in Cambridge together with our extended virtual coworkers. I am not normally a self-promotionist, but it is important to realise that in the following history “we” are the leading group in chemical textmining, objectively confirmed by the ACS Skolnik award. “We” deserve a modicum of respect in this.

 

I start from common practice, logic, and legal facts. My basic premises are:

  • I have the fundamental and absolute right to extract factual data from the literature and republish it as Open content. “facts cannot be copyrighted” (though collections can). It has been common practice over two or more centuries for scientists to abstract factual data from the literature to which they have access (either by subscription or through public libraries). There are huge compilations of facts. A typical example is the NIST webbook; please look at http://webbook.nist.gov/cgi/cbook.cgi?ID=C64175&Units=SI&Mask=1#Thermo-Gas. This is a typical page (of probably >> 100,000) carefully abstracted from the literature by humans. It is legal, it is valuable and it is essential.
  • We have developed technology to automate this process. I argue logically that what a human can do, a machine can do too. But logic has no force in business or in court, and I am forbidden to deploy my technology by restrictive publisher contracts (see previous posts). So what is a perfectly natural extension of human practice to machines is forbidden for no reason other than the protection of business interests. It has no logical basis.
  • I wish to mine factual data from Elsevier journals, specifically “Tetrahedron” and “Tetrahedron Letters”. I shall refer to these jointly as “Tetrahedron”. The factual content in these journals is created by academics and effectively 100% of this factual content is published verbatim without editorial correction. Authors are required to sign over their rights to Elsevier (and even if there may be exceptions they are tortuous in the extreme and most authors simply sign). Elsevier staff refer to this as “Elsevier content”. I shall always quote this phrase as otherwise it implies legitimacy which I dispute – I do not believe it is legally possible to sign over factual data to a monopolist third party. But it has never been challenged in court.
  • Everything I do is Open. I have no hidden secrets in my emails and anyone is welcome to write to the University of Cambridge under FOI and request any or all of my emails with Elsevier. I personally cannot publish many of them because they contain the phrase: “The information is intended to be for the exclusive use of the intended addressee(s). If you are not an intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this message is strictly prohibited.” However I suspect an FOI request would overrule this.

     

I have corresponded verbally and by email with several employees of Elsevier. I have done this through my natural contacts as Elsevier provide no central place for me to discuss the questions. I shall anonymise some of the Elsevier employees. If they feel their position has been misrepresented they are welcome to post a comment here and it will be reported in full. If they send an email I reserve the right to publish it Openly.

The simple facts (which can partly be substantiated by FOI on my emails, but which are stated here without them) are:

  • About 5 years ago I wrote to all five editors of Tetrahedron and also the Elsevier office about the possibility of enhancing Tetrahedron content through text-mining. I did not receive a single reply.
  • Two years ago there was a textmining meeting at Manchester, organized by NaCTeM and UKOLN (http://www.nactem.ac.uk/tm-ukoln.php). At that meeting Rafael Sidi, Vice-President Product Management, Elsevier presented “Open Up” (30 mins). [He is the named Elsevier contact in the NaCTeM / Elsevier contract]. He gave no abstract and I do not have his slides. From a contemporaneous blog (http://namesproject.wordpress.com/2009/10/ ) “Rafael Sidi of Elsevier (who got through an eye-boggling 180 slides in 30 minutes!) emphasised the importance of openness in encouraging innovation”. With no other record I paraphrase the subsequent discussion between him and me (and I would be grateful for any eyewitness accounts or recordings). If Rafael Sidi wishes to give his account, he is welcome to use this blog.

     

    Essentially Rafael Sidi enthusiastically stated that we should adopt open principles for scientific content and mashup everything with everything. I then asked him if I could textmine Tetrahedron and mashup the content Openly. He said I could. I then publicly said I would follow this up. I have taken this as a public commitment by Sidi (who was representing Elsevier very clearly) that factual content in Tetrahedron could be mined without further permission.

     

  • I then followed it up with mails and phone calls to Sidi. Suffice to report that all the drive came from me and that after six months I had made no progress. I then tried another tack with another Elsevier contact. After another 6 months, no progress. I then raised this in 2010-10 with a member of Elsevier staff involved with the Beyond the PDF initiative http://sites.google.com/site/beyondthepdf/ . Although not directly concerned with chemistry she took up the case (and I personally thank her for her efforts) and thought she had made progress (a) by getting Elsevier to draw up a contract allowing me to textmine Tetrahedron and (b) by relaying this to David Tempest (Deputy Director “Universal Access”, Elsevier) who was “currently reviewing policies” and who wrote “we have finalised our policy and guidelines I would be happy to discuss this further with you.” [That was 9 months ago and I have heard nothing.]

The contract is public, apparently available to anyone to negotiate (though there are no rights – all decisions are made by Elsevier). I was told:

You can mine 5 years of Tetrahedron, and will be helped to do so by Frankfurt. You can talk to them about formats.  There are two conditions:

1) You agree with the SciVerse developer’s agreement – on http://developer.sciverse.com/start this is http://developer.sciverse.com/developeragreement – this also means you are not allowed to provide access to the Tetrahedron content (no surprise)

2) You can send us a description of the project you are working on, specifically describing the entities you are interested in mining, and the way in which you will use them.

To summarise:

  • Elsevier decide whether I can mine “their” content. I have no right. I can only beg.
  • All my results belong to Elsevier and I cannot publish them. Specifically:

     

    3.1.3 the Developer has not used robots, spiders or any other device which could retrieve or index portions of the Elsevier website, the Elsevier content or the APIs for any unauthorized purpose, and Developer conforms to all ethical use guidelines as published on the Elsevier website;

So I cannot search their site except as they permit.

3.1.4 the Developer acknowledges that all right, title and interest in and to the Elsevier content, and any derivative works based upon the Elsevier content, remain with Elsevier and its suppliers, except as expressly set forth in this Agreement, and that the unauthorized redistribution of the Elsevier content is not permitted;

“And any derivative works” means that everything I do – chemical structures, spectral data – everything BELONGS TO ELSEVIER. Note the phrase “Elsevier content”. The whole agreement is based on the concept that Sciverse (their platform for publishing “Elsevier content”) is being developed as a walled garden where no-one has rights other than Elsevier.

Well, I have only taken 18 months to get to that position. I might be able to negotiate something slightly better if I take another two or three years.

And, in any case, I am not begging for permission to do a project. I am asking for my right – both implied by current practice and also stated publicly by Rafael Sidi.

[Incidentally it will be interesting to see if the University of Manchester has signed up to this developer agreement.]

 

And that’s where the matter rests. No progress…

 

 

But no, I received a request from Elsevier asking if they can use my software. (Why? Because our group is a/the leading one in chemical information extraction). I can’t reproduce it as it’s confidential and I have therefore omitted names, but here is my reply (copied to all the people in Elsevier including Rafael Sidi):

Dear Mr. Murray-Rust,

With great interest I have read your description of the OSCAR 4 chemical entity recognizer. We (redacted) would like to evaluate OSCAR for use in our entity recognizer system and compare it to other analysers.

Because OSCAR is Open Source you may do this without permission.

A few months ago, I have done some comparisons with other annotators and can only say that OSCAR compares quite favourably and is easily deployed – that is to say, if it runs as a Java server.

I assume these comparisons are confidential to Elsevier

This type of functionality is included in the OSCAR 3 implementation and is really easy to access because no coding layers are required to go between our code and yours – just an http webrequest.

We are using .Net for all our development so a web interface would be real nice. I gather from the article posted (OSCAR4: a flexible architecture for chemical text-mining) that there are several wrappers around by several users – is there any chance that there is a .Net or HTTP wrapper that we might use? A short-cut in Java to build one ourselves?

I understand this to be a request for free consultancy. Unfortunately we have run out of free consultancy at present.

Do you have any advice here?

Normally I would reply in a positive light to anyone asking polite questions, but I have had two years of unfulfilled promises from Elsevier so I will engage on one condition – that Elsevier honour the public promise that Rafael Sidi made two years ago.

Mr Sidi stated in public that I could have permission to use OSCAR on chemical reactions published in Elsevier journals (Tetrahedron, Tett Letters, etc.) and to make the results publicly Open. Over the last two years I have tried to get action on this (see copied people). The closest I got was an agreement which I would have to sign saying that all my work would belong exclusively to Elsevier and that I would not be able to publish any of it. (The current agreement that my library has signed for subscriptions to Elsevier is that all text-mining is explicitly and strictly forbidden). Not surprisingly I did not sign this.

By Elsevier making a public promise I assumed I would be able to do research in this field and publish all the results. In fact Elsevier has effectively held back my work for this period and looks to continue to do it. I regard Elsevier as the biggest obstacle to the academic deployment of textmining at present.

The work that you are asking me to help you with will be an Elsevier monopoly with restrictive redistribution conditions and I am not keen on supporting monopolies. If you can arrange for Elsevier to honour their promise I will be prepared to explore a business arrangement though I am making no promises at present.

Thank you very much,

I am sorry this mail is written in a less than friendly tone but I cannot at present donate time to an organisation which works against the direction of my research and academia in general. If Elsevier agrees that scientific content can be textmined without permission and redistributed (as it should be if it is to be useful) then you will have helped to make progress.

I have copied in your colleagues who have been involved in the correspondence over the last two years.

[Name redacted]

I am currently treating your request as confidential as it says so, but I do not necessarily regard my reply as such. You will understand that I need a reply.

Needless to say I have received no reply. You may regard my reply as rude, but it is the product of broken promises from Elsevier, delays, etc. So, Rafael Sidi, if you are reading this blog I would appreciate a reply and the uncontrolled permission to mine and publish data from Tetrahedron.

Because I shall forward your response (or the lack of one) to the UK government who will use your reply as an example of whether the publishers are helpful to those wanting to textmine the literature.

 

 

 

 

Posted in Uncategorized | 22 Comments

Textmining: NaCTeM and Elsevier team up; I am worried

A bit over two weeks ago the following appeared on DCC-associates: http://www.mail-archive.com/dcc-associates@lists.ed.ac.uk/msg00618.html

Mon, 07 Nov 2011 09:16:34 -0800

This press release may be of interest to list members.

 

University enters collaboration to develop text mining applications

07 Nov 2011

http://www.manchester.ac.uk/aboutus/news/display/?id=7627

The University of Manchester has joined forces with Elsevier, a leading provider of scientific, technical and medical information products and services, to develop new applications for text mining, a crucial research tool.

The primary goal of text mining is to extract new information such as named entities, relations hidden in text and to enable scientists to systematically and efficiently discover, collect, interpret and curate knowledge required for research.

The collaborative team will develop applications for SciVerse Applications, which provides opportunities for researchers to collaborate with developers in creating and promoting new applications that improve research workflows.

The University's National Centre for Text Mining (NaCTeM), the first publicly-funded text mining centre in the world, will work with Elsevier's Application Marketplace and Developer Network team on the project.

Text mining extracts semantic metadata such as terms, relationships and events, which enable more pertinent search. NaCTeM provides a number of text mining services, tools and resources for leading corporations and government agencies that enhance search and discovery.

Sophia Ananiadou, Professor in the University's School of Computer Science and Director of the National Centre for Text Mining, said: "Text mining supports new knowledge discovery and hypothesis generation.

"Elsevier's SciVerse platform will enable access to sophisticated text mining techniques and content that can deliver more pertinent, focused search results."

"NaCTeM has developed a number of innovative, semantic-based and time-saving text mining tools for various organizations," said Rafael Sidi, Vice President Product Management, Applications Marketplace and Developer Network, Elsevier.

"We are excited to work with the NaCTeM team to bring this expertise to the research community."

 

Now I have worked with NaCTeM, and actually held a JISC grant (ChETA) in which NaCTeM were collaborators and which resulted in useful work, published articles and Open Source software. The immediate response to the news was from Simon Fenton-Jones:

Let me see if I got this right.

"Elsevier, a leading provider of scientific, technical and medical information products and services", at a cost which increases much faster than inflation, to libraries who can't organize their researchers to back up a copy of their journal articles so they can be aggregated, is to have their platform, Sciverse, made more attractive, by the public purse by a simple text mining tool which they could build on a shoestring.

Sciverse Applications, in return, will take advantage of this public largesse to charge more for the journals which should/could have been compiled by public digital curators in the first instance.

Hmm. So this is progress.

Hey. It's not my money!

 

[PMR: I think it’s “not his money” because he writes from Australia, but he will still suffer]

PMR: I agree with this analysis. I posted an initial response (http://www.mail-archive.com/dcc-associates@lists.ed.ac.uk/msg00621.html )

 

No – it’s worse. I have been expressly and consistently asking Elsevier for permission to text-mine factual data from their (sorry OUR) papers. They have prevaricated and fudged and the current situation is: “you can sign a text-mining licence which forbids you to publish any results and hands over all results to Elsevier”.

I shall not let this drop – I am very happy to collect allies. Basically I am forbidden to deploy my text-mining tools on Elsevier content.

P.

 

I shall elaborate on this. I was about to write more, because I completely agree about the use of public money and the lack of benefit to the community. However I have been making enquiries and it appears that public funding for NaCTeM is being run down – effectively they are becoming a “normal” department of the university – with less (or no) “national” role.

However the implications of this deal are deeply worrying – because it further impoverishes our rights in the public arena and I will explain further later. I’d like to know exactly what NaCTeM and the University of Manchester are giving to Elsevier and what they are getting out of it.

This post will give them a public chance – in the comments section, please – to make their position clear.

 

Posted in Uncategorized | 10 Comments

The scandal of publisher-forbidden textmining: The vision denied

This is the first of probably several posts about my concerns over textmining. You do NOT have to be a scientist to understand the point with total clarity. This topic is one of the most important I have written about this year. We are at a critical point where, unless we take action, our scholarly rights will be further eroded. What I write here is designed to be submitted to the UK government as evidence if required. I am going to argue that the science and technology of textmining is systematically restricted by scholarly publishers, to the serious detriment of the utilisation of publicly funded research.

What is textmining?

The natural process of reporting science often involves text as well as tables. Here is an example from chemistry (please do not switch off – you do not need to know any chemistry.) I’ll refer to it as a “preparation” as it recounts how the scientist(s) made a chemical compound.

To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and the suspension was stirred at rt for 1-24 h. The reaction was diluted slowly with water and extracted with CH2Cl2. The organic layer was washed successively with water, brine, dried over Na2SO4, and concentrated to give the title compound as oil (0.8 g, 79%), which was used in the next reaction without further purification. MS (ESI, pos. ion) m/z: 247.1 (M-OH).

The point is that this is a purely factual report of an experiment. No opinion, no subjectivity. A simple, necessary account of the work done. Indeed if this were not included it would be difficult to work out what had been done and whether it had been done correctly. A student who got this wrong in their thesis would be asked to redo the experiment.

This is tedious for a human to read. However during the C20 there have been large industries based on humans reading this and reporting the results. Two of the best known abstracters are the ACS’s Chemical Abstracts and Beilstein’s database (now owned by Elsevier). These abstracting services have been essential for chemistry – to know what has been done and how to repeat it (much chemistry involves repeating previous experiments to make material for further synthesis, testing, etc.).

Over the years our group has developed technology to read and “understand” language like this. Credit to Joe Townsend, Fraser Norton, Chris Waudby, Sam Adams, Peter Corbett, Lezan Hawizy, Nico Adams, David Jessop, Daniel Lowe. Their work has resulted in an Open Source toolkit (OSCAR4, OPSIN, ChemicalTagger) which is widely used in academia and industry (including publishers). So we can run ChemicalTagger over this text and get:

EVERY word in this has been interpreted. The colours show the “meaning” of the various phrases. But there is more. Daniel Lowe has developed OPSIN which works out (from a 500-page rulebook from IUPAC) what the compounds are. So he has been able to construct a complete semantic reaction:

If you are a chemist I hope you are amazed. This is a complete balanced chemical reaction with every detail accurately extracted. The fate of every atom in the reaction has been worked out. If you are not a chemist, try to be amazed by the technology which can read “English prose” and turn it into diagrams. This is the power of textmining.
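The kind of interpretation involved can be hinted at with a toy sketch. To be clear, this is my own illustration, not the OSCAR/ChemicalTagger code: the real tools use chemistry-aware grammars and dictionaries, not a handful of regular expressions. Still, even toy patterns recover the quantity phrases from the preparation above:

```python
import re

# Toy patterns for quantity-like phrases in a chemical preparation.
# Illustrative only: OSCAR4/ChemicalTagger use real grammars and
# chemical dictionaries, not regexes like these.
PATTERNS = {
    "MASS":   r"\d+(?:\.\d+)?\s*(?:mg|g)\b",
    "VOLUME": r"\d+(?:\.\d+)?\s*mL\b",
    "AMOUNT": r"\d+(?:\.\d+)?\s*mmol\b",
    "YIELD":  r"\d+(?:\.\d+)?\s*%",
    "TIME":   r"\d+(?:-\d+)?\s*h\b",
}

def tag(text):
    """Return (label, phrase) pairs for quantity-like phrases in `text`."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            hits.append((label, match.group()))
    return hits

prep = ("To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) "
        "was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and "
        "the suspension was stirred at rt for 1-24 h. The reaction was diluted "
        "slowly with water and extracted with CH2Cl2. The organic layer was "
        "washed successively with water, brine, dried over Na2SO4, and "
        "concentrated to give the title compound as oil (0.8 g, 79%).")

for label, phrase in tag(prep):
    print(label, phrase)
```

Even this toy version shows why machine extraction scales where human abstracting does not; the real tools add name-to-structure parsing (OPSIN) and full phrase grammars on top.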

There are probably about 10 million such preparations reported in the scholarly literature. There is an overwhelming value in using textmining to extract the reactions. In Richard Whitby’s Dial-a-molecule project (EPSRC) the UK chemistry community identified the critical need to text-mine the literature.

So why don’t we?

Is it too costly to deploy?

No.

Will it cause undue load on publisher servers?

No, if we behave in a responsible manner.
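"Responsible" here means nothing more exotic than self-imposed rate limiting. A minimal sketch (the class name and the delay figure are my own illustration, not any publisher-agreed protocol):

```python
import time

class PoliteFetcher:
    """Space out requests so they arrive no faster than one per `delay` seconds."""

    def __init__(self, delay=2.0):
        self.delay = delay    # minimum seconds between requests
        self._last = None     # monotonic timestamp of the previous request

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Simulate fetching three articles with a 0.1 s gap between requests.
fetcher = PoliteFetcher(delay=0.1)
start = time.monotonic()
for _ in range(3):
    fetcher.wait()            # a real miner would fetch one article here
elapsed = time.monotonic() - start
```

A real deployment would also honour robots.txt and any retry-after signals; the point is simply that server load is a solvable engineering problem, not a reason for a permissions regime.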

Does it break confidentiality?

No – all the material is “in the public domain” (i.e. there are no secrets)

Is it irresponsible to let “ordinary people” do this?

No.

Then let’s start!

NO!!!!

BECAUSE THE PUBLISHERS EXPRESSLY FORBID US TO DO TEXTMINING

But Universities pay about 5-10 Billion USD per year as subscriptions for journals. Surely this gives us the right to textmine the content we subscribe to.

NO, NO, NO.

Here is part of the contract that Universities sign with Elsevier (I think CDL is California Digital Library but Cambridge’s is similar) see http://lists.okfn.org/pipermail/open-science/2011-April/000724.html for more resources

The CDL/Elsevier contract includes [Schedule 1.2(a) General Terms and Conditions, “RESTRICTIONS ON USAGE OF THE LICENSED PRODUCTS / INTELLECTUAL PROPERTY RIGHTS”, GTC1]:

"Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement."

 

What does that mean?

NO TEXTMINING. No INDEXING. NO ROBOTS. No NOTHING.

Whyever did the library sign this?

I have NO IDEA. It’s one of the worst abrogations of our rights I have seen.

Did the libraries not flag this up as a serious problem?

If they did I can find no record.

So the only thing they negotiated on was price? Right?

Appears so. After all 10 Billion USD is pretty cheap to read the literature that we scientists have written. [sarcasm].

So YOU are forbidden to deploy your state-of-the art technology?

PMR: That’s right. Basically the publishers have destroyed the value of my research. (I exclude CC-BY publishers but not the usual major lot).

What would happen if you actually did try to textmine it?

They would cut the whole University off within a second.

Come on, you’re exaggerating.

Nope – it’s happened twice. And I wasn’t breaking the contract – they just thought I was “stealing content”.

Don’t they ask you to find out if there is a problem?

No. Suspicion of theft. Readers are Guilty until proven innocent. That’s publisher morality. And remember that we have GIVEN them this content. If I wished to datamine my own chemistry papers I wouldn’t be allowed to.

But surely the publishers are responsive to reasonable requests?

That’s the line they are pushing. I will give my own experience in the next post.

So they weren’t helpful?

You will have to find out.

Meanwhile you are going to send this to the government, right?

Right. The UK has commissioned a report on this. Prof Hargreaves. http://www.ipo.gov.uk/ipreview-finalreport.pdf

And it thinks we should have unrestricted textmining?

Certainly for scientific, technical and medical content.

So what do the publishers say?

They think it’s over the top. After all they have always been incredibly helpful and responsive to academics. So there isn’t a real problem. See http://www.techdirt.com/articles/20111115/02315716776/uk-publishers-moan-about-content-minings-possible-problems-dismiss-other-countries-actual-experience.shtml

Nonetheless, the UK Publishers Association, which describes its “core service” as “representation and lobbying, around copyright, rights and other matters relevant to our members, who represent roughly 80 per cent of the industry by turnover”, is unhappy. Here’s Richard Mollet, the Association’s CEO, explaining why it is against the idea of such a text-mining exception:

If publishers lost the ability to manage access to allow content mining, three things would happen. First, the platforms would collapse under the technological weight of crawler-bots. Some technical specialists liken the effects to a denial-of-service attack; others say it would be analogous to a broadband connection being diminished by competing use. Those who are already working in partnership on data mining routinely ask searchers to “throttle back” at certain times to prevent such overloads from occurring. Such requests would be impossible to make if no-one had to ask permission in the first place.

They’ve got a point, haven’t they?

PMR: This is appalling disinformation. This is ONLY the content that is behind the publishers’ paywalls. If there were any technical problems they would know where the requests come from and could arrange a solution.

Then there is the commercial risk. It is all very well allowing a researcher to access and copy content to mine if they are, indeed, a researcher. But what if they are not? What if their intention is to copy the work for a directly competing-use; what if they have the intention of copying the work and then infringing the copyright in it? Sure they will still be breaking the law, but how do you chase after someone if you don’t know who, or where, they are? The current system of managed access allows the bona fides of miners to be checked out. An exception would make such checks impossible.

[“managed access” == total ban]

If you don’t immediately see that this is a spurious argument, then read the techdirt article. The ideal situation for publishers is if no-one reads the literature. Then it’s easy to control. This is, after all, PUBLISHING (although Orwell would have loved the idea of modern publishing being to destroy communication).

Which leads to the third risk. Britain would be placing itself at a competitive disadvantage in the European & global marketplace if it were the only country to provide such an exception (oh, except the Japanese and some Nordic countries). Why run the risk of publishing in the UK, which opens its data up to any Tom, Dick & Harry, not to mention the attendant technical and commercial risks, if there are other countries which take a more responsible attitude.

So PMR doing cutting-edge research puts Britain at a competitive disadvantage. I’d better pack up.

But not before I have given my own account of what we are missing and the collaboration that the publishers have shown me.

And I’ll return to my views about the deal between University of Manchester and Elsevier.

Posted in Uncategorized | 21 Comments

Open Research Reports Hackathon: What is Semantics? URIs and URLs

#orr2011

I (and colleagues) are getting ready for the December Hackathon (JISC, OKF, SWAT4LS) which includes Open Research Reports and the Semantic Web For Life Sciences. The Hackathon can include any activity but we are preparing material to bring along based on Open Research for diseases and which is or can be semantified. We hope this will be an important step forward for making disease information more widely available and useful.

So what’s Semantics? It’s not a disease, is it?

No. It’s a formal way of talking about things. Humans are (usually) very good at understanding each other even when they use fuzzy language. For example there is a sign next to our bicycle shed which says:

NO BICYCLES HERE

*We* all know this means:

“Do not put bicycles here”

and the study of this is called pragmatics http://en.wikipedia.org/wiki/Pragmatics .

Here are three sentences where (English speakers) easily distinguish the difference between the meaning of the symbol “cold”:

  • She has a cold
  • She has a cold sore
  • She has a cold foot

We will return to these later.

Unfortunately pragmatics is beyond the range of most computer systems, so we have to create formal systems for them – these are based on syntax (a common symbolic representation, http://en.wikipedia.org/wiki/Syntax ) and semantics (agreement on meaning, http://en.wikipedia.org/wiki/Semantics ). (Be warned that the border between these is fuzzy.)

Our syntax for the semantic web includes names such as http://en.wikipedia.org/wiki/Common_cold .

B: HANG ON! That’s not a name, it’s an address. It’s a Uniform Resource Locator (URL, http://en.wikipedia.org/wiki/URL ).

A: Yes. It’s an address and also a name. The URI identifies the resource and also locates it.

B: But it might not be there – you might get a 404.

A: Wikipedia never 404s

B: Or someone could copy the page to another address. It’s still the same page, but a different URL.

A: but it’s not the definitive URI

B: Why not? And anyway the XML crew spent 10,000 mail messages debating whether names and addresses were different.

A: well they are the same now. Tim says so.

B: That’s a distorted view of reality.

PMR: Hussssh! This has been a major debate for years and will continue to be so. Here’s Tim (http://en.wikipedia.org/wiki/Linked_Data ):

Tim Berners-Lee outlined four principles of Linked Data in his Design Issues: Linked Data note, paraphrased along the following lines:

  1. Use URIs to identify things.
  2. Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
  3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

B: So these are conflated URIs (“HTTP URIs”). They only work if the thing is a web resource.

A: Here he is again:

Tim Berners-Lee gave a presentation on Linked Data at the TED 2009 conference. In it, he restated the Linked Data principles as three “extremely simple” rules:

  1. All kinds of conceptual things, they have names now that start with HTTP.
  2. I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
  3. I get back that information it’s not just got somebody’s height and weight and when they were born, it’s got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it’s related to is given one of those names that starts with HTTP.

Note that although the second rule mentions “standard formats”, it does not require any specific standard, such as RDF/XML.

B: so it’s only “conceptual things”. Like “syntax”. My cat cannot have an HTTP-URI.

A: not your cat. But TBL can be dereferenced: Look at http://en.wikipedia.org/wiki/Tim_Berners-Lee .

B: That’s not him – it’s a web page about him. You make it sound as if Wikipedia defines reality. If it’s not in Wikipedia it doesn’t exist. You are a Borgesian.

A: A what?

PMR: Shhhh! This is the sort of “robust discussion” we get into all the time. We are going to take a very simple approach to the semantic web. The advantage is it is easy to understand and will work. We are first of all going to give things precise labels.

B: Like a “cold”

PMR: exactly like that. We will call a cold “J00.0”

B: whatever for? I won’t remember that.

A: You don’t have to – the machines will. A cold will always be J00.0

B: Well why “J00.0”? Why not “common_cold”, like Wikipedia (http://en.wikipedia.org/wiki/Common_cold )?

A: Because that’s what the WHO call it, in their International Classification of Diseases, 10th Revision (ICD-10) http://en.wikipedia.org/wiki/ICD-10 . PMR actually worked with the WHO (in Uppsala) to convert ICD-10 to XML. He knows it by heart.

PMR: well I did. I’ve forgotten most of it.

B: OK, well I suppose the WHO has a right to create names for diseases. But surely they aren’t the only ones?

A: No – there’s http://en.wikipedia.org/wiki/Medical_Subject_Headings (MeSH), which calls it D003139. And ICD-9 …

B: The ninth edition I suppose …

A: Yes. Calls it 460.

B: I bet they don’t all agree on what a cold is.

PMR: No. There’s lots of variation in medical terminology. There’s the http://en.wikipedia.org/wiki/Unified_Medical_Language_System (UMLS) It:

is a compendium of many controlled vocabularies in the biomedical sciences (created 1986[1]). It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts.

B: and now this “ontology” word?

A: it’s a formal system (http://en.wikipedia.org/wiki/Ontology_%28computer_science%29 ):

an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.

B: It’s too complicated for me.

PMR: We are going to start simple. Ontologies tell computers how to distinguish between different meanings of the concept “cold”. We’ll just assume that we humans generally agree.

A: But doctors don’t agree on diagnoses – how can we?

PMR: This isn’t about whether you are actually infected by rhinovirus…

B: … ???

PMR: The virus that causes a cold. It looks like this:


A: Yes – it’s got icosahedral symmetry and was …

PMR: … back to the semantics. It’s about putting the concept of “cold” into computers. We need a unique identifier and we can use the WHO one.

B: but J00.0 isn’t unique. That’s the number of my neighbour’s car.

PMR: so we turn it into a URI. An HTTP-URI is unique because it’s based on domain names, and they are unique.

A: but what domain name? Since the WHO invented it, let’s use the HTTP-URL for the cold. That’s http://apps.who.int/classifications/icd10/browse/2010/en#/J00-J06

B: but that should be http://apps.who.int/classifications/icd10/browse/2010/en#/J00 – but that doesn’t resolve. And in any case I bet the “apps” bit changes. That’s why addresses are no use for URIs

PMR: It’s really up to authorities like WHO to give stable identifiers for this, that are persistent in name and address.

B: That’s a tough order. Do you think the WHO are up to it?

PMR: Probably not yet. We’ll probably need to invent a way round it. Perhaps with a PURL (http://en.wikipedia.org/wiki/Persistent_Uniform_Resource_Locator ).

B: and you said this was easy?

PMR: The Semantic Web community is working hard to make this easy for you, yes. Anyway, nearly there. Let’s just use http://purl.org/who/classifications/icd10/J00.0 as a shorthand for “common cold”

B: “shorthand”

PMR: sorry, identifier. And address – which we can make resolvable by redirecting the PURL.

A: OK, we’ve now got an identifier system for all diseases. Will we always use ICD-10?

PMR: It’ll make it easier for our ORR project and we shan’t need mappings or ontologies.

A: So we can identify “cold sores” as http://purl.org/who/classifications/icd10/B00.1

PMR: Yes

B: You’ve convinced me that we can give each disease a unique identifier (whether we actually have the disease or not). But a “cold sore” is not a disease – it’s a symptom. The disease is “Herpesviral vesicular dermatitis” according to the WHO. The virus isn’t a disease as such, so does it have its own identifier?

 

PMR: Yes. The virus (http://en.wikipedia.org/wiki/Herpes_simplex_virus ) is actually a combination of protein and DNA. Its classification is:

Family: Herpesviridae; Subfamily: Alphaherpesvirinae; Genus: Simplexvirus; Species: Herpes simplex virus 1 (HSV-1)

B: But that’s not an identifier.

PMR: Agreed. So somewhere we need to find an identifier, or work out a scheme for creating one.

B: So the semantic web won’t work?

A: We are all at the stage of creating it. There’s been a huge increase in identifier systems. There are now thousands in the Linked Open Data cloud. And that’s the sort of thing we’ll tackle in the Hackathon.

B: I’m knackered. I’ve learnt that we need HTTP-URIs for everything. We’ve just done diseases. If we want to do ORRs we need people, places, parasites and so on. All in semantic form, right?

PMR: Right.

B: so there’s a HUGE amount of work to be done.

A: But lots of people are involved. And once we’ve done it, it will be persistent.

B: until we change our concepts…

PMR: But by then we shall already have shown how powerful it is.

Posted in Uncategorized | Leave a comment

Multiple Metrics for “Visions of a Semantic Molecular Feature”

After posting the access metrics for VoaSMF, Egon Willighagen suggested that we also use some of the new alternative metrics (“altmetrics”). There’s more than one such effort and they are to be welcomed (citation metrics by themselves are inefficient, imprecise and suffer a huge timelag). So people such as Egon, Heather Piwowar and Cameron Neylon are creating immediate metrics – literally hour-by-hour. Here we show “Total Impact” (http://total-impact.org/).

Before we show the figures let me commend this and similar efforts because:

  • They are not tied to a commercial business. (Journal Impact factors have long been tainted with the suspicion that they are manipulated between aggregators and publishers. Not for the benefit of science, but in the near-meaningless struggle between journals for branding)
  • They are immediate. Within hours of publication you can get alternative metrics.
  • They are objective, democratic and free. Anyone can build tools to carry them out, which gives real scientists a say in how science is measured. I expect universities to give them the cold shoulder for some time, as they challenge the current trivial system.
  • They are multivariate. A whole host of different metrics can be used. You can make your own up.

The altmetrics software was originally hacked at a workshop that Cameron ran, I think. Anyway it is typical of the quality and speed that can be achieved by people working together with a common vision and shared tools. Indeed (I hope) this shows the challenge to the lumbering publication systems that publishers build and force us to use. We are starting to liberate our expressiveness.

So here are our 15 articles and I comment later:

 

report for Visions of a Semantic Molecular Future

updated 15 Nov 2011; created 15 Nov 2011; 15 artifacts

Permalink: http://total-impact.org/report.php?id=CXtjIz

[The 15 artifacts are labelled “Dataset”; this is a bug – they should be labelled “article”]

There are at least 6 metrics:

  • Tweets
  • Blogs (a limited selection)
  • Bookmarks (Cite-U-Like)
  • Wikipedia entries
  • Facebook shares and likes
  • Mentions

Most are sparsely populated – the exception being JohnW’s tweets. These are real – the twittersphere resounded for a day with a massive list of tweets about John’s article. There are some technical issues – some metrics currently require DOIs, etc.

People may argue that these metrics can be gamed. (Of course John pulled people off the streets of SF to tweet his article.) Seriously, I think the accesses are reasonably accurate and haven’t been gamed AFAIK. The altmetrics don’t visit enough blogs, I think, but that will come.

The point is that rather than waiting 3 years to find out if anyone has read our articles we get a current picture. It doesn’t surprise me that there are hundreds of accesses for each tweet or blog – we’ve had 300,000 downloads of Chem4Word and hardly a squeak. And most cheminformaticians don’t tweet or blog or show themselves in the glare of Open social networks.

Thanks to the altmetricians for their efforts.

P.

Posted in Uncategorized | Leave a comment

Semantic Molecular Future: Article accesses during first 30 days

I believe that we seriously need a set of new metrics for scholarly publication. It should be multidimensional and one of these should be accesses. [Yes, I know it’s possible to game the system. Everything can be gamed. The only thing that can’t is scientific reality, but that’s increasingly irrelevant to academia]. So here is my contribution to access-metrics.

Our special issue of “Visions of a Semantic (molecular) Future” in J. Cheminformatics (BiomedCentral) has now been out for a month. Since BMC publish the article accesses for each article (a) confidentially to the author and (b) publicly for the highest-accessed articles each month, I can start to do some simple analysis. (I’m only missing one author, Cameron Neylon.)

Here are the stats, after exactly one month (October 14th -> Nov 14th). The actual stats are for a window of 30 days.

2639 Accesses Openness as infrastructure
John Wilbanks Journal of Cheminformatics 2011, 3:36 (14 October 2011)

1185 Accesses Open Bibliography for Science, Technology, and Medicine Richard Jones, Mark MacGillivray, Peter Murray-Rust, Jim Pitman, Peter Sefton, Ben O’Steen, William Waites Journal of Cheminformatics 2011, 3:47 (14 October 2011)

1018 Accesses Semantic science and its communication – a personal view Peter Murray-Rust Journal of Cheminformatics 2011, 3:48 (14 October 2011)

936 Accesses Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on Noel M O’Boyle, Rajarshi Guha, Egon L Willighagen, Samuel E Adams, Jonathan Alvarsson, Jean-Claude Bradley, Igor V Filippov, Robert M Hanson, Marcus D Hanwell, Geoffrey R Hutchison, Craig A James, Nina Jeliazkova, Andrew SID Lang, Karol M Langner, David C Lonie, Daniel M Lowe, Jérôme Pansanel, Dmitry Pavlov, Ola Spjuth, Christoph Steinbeck, Adam L Tenderholt, Kevin J Theisen, Peter Murray-Rust Journal of Cheminformatics 2011, 3:37 (14 October 2011)

822 Accesses Ami – The chemist’s amanuensis Brian J Brooks, Adam L Thorn, Matthew Smith, Peter Matthews, Shaoming Chen, Ben O’Steen, Sam E Adams, Joe A Townsend, Peter Murray-Rust Journal of Cheminformatics 2011, 3:45 (14 October 2011)

681 Accesses The past, present and future of Scientific discourse Henry S Rzepa Journal of Cheminformatics 2011, 3:46 (14 October 2011)

531 Accesses OSCAR4: a flexible architecture for chemical text-mining David M Jessop, Sam E Adams, Egon L Willighagen, Lezan Hawizy, Peter Murray-Rust Journal of Cheminformatics 2011, 3:41 (14 October 2011)

429 Accesses CML: Evolution and design Peter Murray-Rust, Henry S Rzepa Journal of Cheminformatics 2011, 3:44 (14 October 2011)

420 Accesses Mining chemical information from open patents David M Jessop, Sam E Adams, Peter Murray-Rust Journal of Cheminformatics 2011, 3:40 (14 October 2011)

313 Accesses The semantics of Chemical Markup Language (CML): dictionaries and conventions Peter Murray-Rust, Joe A Townsend, Sam E Adams, Weerapong Phadungsukanan, Jens Thomas Journal of Cheminformatics 2011, 3:43 (14 October 2011)

280 Accesses The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age Sam Adams, Pablo de Castro, Pablo Echenique, Jorge Estrada, Marcus D Hanwell, Peter Murray-Rust, Paul Sherwood, Jens Thomas, Joe Townsend Journal of Cheminformatics 2011, 3:38 (14 October 2011)

265 Accesses Adventures in public data Dan W Zaharevitz Journal of Cheminformatics 2011, 3:34 (14 October 2011)

^^^ cut-off of top 25 papers this month ^^^

264 Accesses CMLLite: a design philosophy for CML Joe A Townsend, Peter Murray-Rust Journal of Cheminformatics 2011, 3:39 (14 October 2011)

263 Accesses The semantic architecture of the World-Wide Molecular Matrix (WWMM)
Peter Murray-Rust, Sam E Adams, Jim Downing, Joe A Townsend, Yong Zhang Journal of Cheminformatics 2011, 3:42 (14 October 2011)

??? Accesses Three stories about the conduct of science: Past, future, and present Cameron Neylon Journal of Cheminformatics 2011, 3:35 (14 October 2011)

=== and a previously-published article ===

235 Accesses
ChemicalTagger: A tool for semantic text-mining in chemistry
Lezan Hawizy*, David M Jessop*, Nico Adams and Peter Murray-Rust Journal of Cheminformatics 2011, 3:17 doi:10.1186/1758-2946-3-17 (This has been out for several months and is noted as “Highly Accessed”)

Total 10063 for 15 papers in 1 initial month == ca 20 accesses per day per paper

So what conclusions?

  • Clearly the figures aren’t random, or created solely by bots. (I joked about the “wilbots”, but I am sure these are real people reading John’s article.) There were about 50 tweets mentioning this article, so people wanted other people to know about it.
  • The top articles are not significantly about chemistry. So it doesn’t matter where material is published. This is a really important message. If you have something to say, then people will find it.
  • A LOT of people read Open Access material. Note that after 14 days the articles were also available on PubMed, so the true figures are probably double these.
  • I can’t believe that these papers would have been nearly so widely read in the effectively closed-access competitor, J. Chem. Inf. and Modeling. We have also had high readership there – the OPSIN paper was “highly accessed” – but I have no figures. I suspect that only a small fraction of the current readership would have access. So closed-access publication constrains innovation and constrains multidisciplinarity.

UPDATE:

Egon writes in a comment:

Peter, why not give Total Impact a try… click the ‘Manually edit this collection’ link in the left menu under ‘Collect IDs from:’ and add the DOIs for those papers, and hit the red ‘Get Metrics’ button…

PMR: I don’t get anything back that makes sense… Nor do I know how to add PMIDs instead of DOIs


 

Posted in Uncategorized | 10 Comments

Anyone can run their own Semantic Repository (Chempound, Quixote)

#quixote

#chempound

The eResearch workshop on Semantic Physical Science was a technical success and has convinced me that anyone can now deploy our Chempound semantic repository. If you don’t know what a repository does, here is an explanation (you need to know this):

“It’s a Useful Pot,” said Pooh. “Here it is. And it’s got ‘A Very Happy Birthday with love from Pooh’ written on it. That’s what all that writing is. And it’s for putting things in. There!”

When Eeyore saw the pot, he became quite excited.

“Why!” he said. “I believe my Balloon will just go into that Pot!”

“Oh, no, Eeyore,” said Pooh. “Balloons are much too big to go into Pots. What you do with a balloon is, you hold the balloon ”

“Not mine,” said Eeyore proudly. “Look, Piglet!” And as Piglet looked sorrowfully round, Eeyore picked the balloon up with his teeth, and placed it carefully in the pot; picked it out and put it on the ground; and then picked it up again and put it carefully back.

“So it does!” said Pooh. “It goes in!”

“So it does!” said Piglet. “And it comes out!”

“Doesn’t it?” said Eeyore. “It goes in and out like anything.”

“I’m very glad,” said Pooh happily, “that I thought of giving you a Useful Pot to put things in.”

“I’m very glad,” said Piglet happily, “that I thought of giving you something to put in a Useful Pot.”

But Eeyore wasn’t listening. He was taking the balloon out, and putting it back again, as happy as could be….

[reproduced without permission; however, my aunt was Marjorie Milne and I am sure she would forgive me]

If you can follow this, you will also be able to follow my account of how a Chempound server works, and how YOU can set one up and run it.

Our Chempound repository is somewhere to put chemistry (and other science) objects in and take them out.

Now a “repository” sounds frightening, but it’s only a piece of software. But you have to understand the concept of “server”.

The client-server model (http://en.wikipedia.org/wiki/Client%E2%80%93server_model ) is for me one of the great advances of the last 50 years. Partly because it decouples and modularises functionality and represents clean design. Partly because it allows complex operations to be concentrated in one place and so made more easily maintainable, especially when there is software that cannot be deployed everywhere (complexity, licence, etc.).

[Diagram from Wikipedia: the server is in the middle – the clients are distant in space and can be disconnected without disturbing the server or the other clients.]

But mainly because the HTTP servers of the 1990’s brought power and democracy to individuals.

Huh?

Yes. I used to think that information systems could only be set up by a priesthood. You had a mainframe and dumb terminals. You couldn’t do anything without a mainframe. The early generations of client-server were proprietary and opaque. A different protocol for each system.

But HTTPD and NCSA changed that (http://en.wikipedia.org/wiki/NCSA_HTTPd ). I have heard it said that the great breakthrough for the takeoff of the web was not the HTML browser but the HTTPD server. Not Mosaic (fantastic though that was) but the NCSA server.

In 1994 I discovered that I could run a server!

It was a revelation. I could publish whatever I wanted to whomever I wanted! I was free. I was doing it through Birkbeck College Crystallography – all they had to do was give me a directory where I could put all my stuff and then run the server software.

Ordinary people could set up their own radio stations on the new World Wide Web.

Now we have become accustomed to this. We can tweet with zero effort. Get a WordPress blog and tell the world what we think. The client-server model means that the client doesn’t have to listen if it doesn’t want to! Publishing on the web doesn’t mean that people have to take any notice. Which is what democracy is.

So Chempound now brings democracy to physical science.

Anyone can set up a server, but you have to have a place where you can run one. If you want others to play, that means having a web-hosting service that is able to run a Java-based server. That may be a question of talking to your university/company sysadmin, or alternatively you can pay a few dollars to get your own domain.

But even if you don’t have this you can publish to yourself! You can discover the power of semantic resources on your own laptop. Everyone can publish to http://localhost:8080 and practice.

By now you will have realised that you need two bits of software:

  • The client. That’s easy; your browser is all you need. That’s because Chempound uses REST (a convention for using HTTP).
  • The server. That’s Chempound.

(Actually you also need another piece of software to load the data because it needs to be converted into semantic form).

A repository should support CRUD (http://en.wikipedia.org/wiki/Create,_read,_update_and_delete ). At present we don’t normally support Update but rather retransform and reinsert the whole entry. This is because Chempound/Quixote or Chempound/Crystaleye are “final” snapshots of a piece of work. And I shan’t cover Delete today.

OK – what do you have to do? I’m only going to paint outlines here as it’s all been documented by Jorge Estrada. Don’t worry about the details. This is NOT a full set of instructions. The point is to show how easy it is:

  • You must have a machine for the server which runs Java and you must tell it where JAVA_HOME is. Normally no problem. If you don’t have Java you will have to install it. Again normally not a problem
  • You need some generic Java server – either Jetty or Tomcat. They may be bundled in our distrib to make it easy for you.
  • You’ll need some workspace for the server to put stuff.

If you know Maven (and I’d recommend it) you can follow Procedure 2.1:

Procedure 2.1. Steps to install a Chempound server

[You will need Mercurial to download the Chempound code. This is very easy – suggest you use Tortoise Mercurial]

  1. Clone the Quixote Chempound sources from https://bitbucket.org/chempound/quixote-repository
  2. Create a directory for Chempound to store its files during runtime:
  3. Launch the Chempound server. You will need to provide the path to the workspace directory (chempound.workspace) and the root URL where Chempound will be running (chempound.uri).

    A typical execution will run Chempound at localhost and port 8080:

That takes a few minutes at most.

OR you can run it directly under Jetty. We supply a huge file, quixote-webapp-version-jar-with-dependencies.jar, with everything you need. You will still have to install Jetty.

Procedure 2.2. Installing Jetty

  1. If you do not have Jetty installed, download a current Jetty distribution from http://jetty.codehaus.org/jetty/. This will be a file similar to jetty-distribution-7.4.5.v20110725.tar.gz
  2. Unpack the downloaded file. You will find a directory with several files (including start.jar) and directories. This directory will be referred here on as /path/to/jetty

  1. Clone the Quixote Chempound sources from https://bitbucket.org/chempound/quixote-repository
  2. Create the WAR package and the JAR with the dependencies with the Maven package phase. After this step, you will find the files quixote-webapp-version.war and quixote-webapp-version-jar-with-dependencies.jar in the target directory.
  3. Copy both the WAR and JAR files to the webapps directory of your Jetty server, changing the name of the WAR file to just quixote.war
  4. Create a directory for Chempound to store its files during runtime:
  5. Configure Jetty to run Chempound as the root application by deleting the Jetty default files (.xml) under the contexts directory of your Jetty installation:

    And then, create a quixote.xml configuration file in that same directory.

     

    In this file, you would replace URL with the URL of your Chempound service[2] (for example, http://localhost:8080/). /PATH/TO/CHEMPOUND/WORKSPACE should point to the directory you want to use as workspace for Chempound. Finally, when specifying the JAR file with the Chempound dependencies, you would substitute VERSION with the appropriate value.

  6. Finally, launch the Jetty server from the /path/to/jetty directory:
  7. You can stop the server at any time by typing CTRL+C.

Again a few minutes. Now you have a working Chempound repository, running on http://localhost:8080

But there isn’t anything in it!

OK, Eeyore, let’s put a balloon in it. (Or more accurately, ingest a legacy compchem logfile):

Procedure 3.1. Steps to build the Quixote utils software

  1. Clone the Quixote utils repository from https://bitbucket.org/sea36/quixote-utils repository.
  2. Build the JAR packages needed, using the Maven profile uberjar and the target package.

    OR we will supply quixote-utils-0.1-SNAPSHOT-jar-with-dependencies.jar directly and you can omit this step.

Depositing NWChem log files in a Chempound server

Using the JAR packages created previously, you can deposit your NWChem log files by running the following command:

java -cp target/quixote-utils-0.1-SNAPSHOT-jar-with-dependencies.jar net.quixote.utils.DepositNWChem {chempoundSwordEndpoint} myfile.log

Where myfile.log is the file you wish to ingest.

Now DepositNWChem carries out the following magic:

  • Converts myfile.log to CML (myfile.cml)
  • Validates the CML
  • Converts the CML to RDF (myfile.rdf)

We now have three files and we direct SWORD2 to:

  • Upload all of them
  • Index them
  • Add them to the RDF triplestore
  • Create a web page

That’s a great deal of magic for one command! Thanks to Sam, Jorge, and the PMR group for all the code. At present it has to be done from the commandline (which is best) but it would be easy to create a simple GUI so you could select files and upload them. Volunteers from GUI-writing addicts?

The server allows you to browse and search the repository. It “exposes a SPARQL endpoint”. That means you can search the RDF. So I have described the CR of CRUD.

It’s alpha. We are proud of it, but it has bugs. If you like hacking alpha software please let us know. If you are interested in public semantic chemical software and content let us know. Because with CSIRO and PNNL and YOU we are going to revolutionise semantic physical science – starting with computational chemistry, solid state, and spectra.

Posted in Uncategorized | Leave a comment

Semantic Physical Science at eResearch and the great Projector foul-up

#eres2011

Nico Adams, Alex Wade and I ran a 1-day workshop yesterday at eResearch. It was very well attended (30 participants), and the participants were mainly well prepared – we’d asked for Java to be installed if possible. The programme was ambitious, roughly:

 

  • Nico – intro to semantics
  • PMR – markup languages
  • Nico – details of semantics , RDF, ontologies, etc.
  • Alex – semantic documents and Word add-ins
  • PMR – textmining to create semantics
  • PMR+Nico Create and populate your own (chemistry) repository (Chempound/Quixote)

Special thanks to Sam Adams and Jorge Estrada who have created and documented Chempound and Quixote. This allows us to install the system on delegates’ machines, ingest legacy files (e.g. NWChem), convert to CML, create RDF, ingest and expose a SPARQL endpoint. Jorge had produced excellent instructions but I failed to remember that CLASSPATH separators are different on Windows and struggled for a day or so (remember there is only a short conversation window). I got it working the night before, and then created the documentation in the early morning. (This ended in a slightly messy state and it needs tidying.) If I feel motivated today (I am watching VIC/WA at the MCG) I will hack and post.

So we’re just about ready to go. Nico kicks off – his first slide has half the text and lines missing. It’s a projector foul-up. When Nico shows an RDF graph, all the edges (lines) are missing, which slightly destroys the meaning. But it’s because Nico has a Mac.

I should be fine – I have Windows. I offer Nico my machine. It doesn’t work at all. The screen remains blank (actually blue). We switch displays, resolution, etc. Nothing. An audiovisual guru is found. He hacks away during the break, changing every setting on my machine. There is a brief flash as both screens show the slide then nothing. I have to abandon this.

This makes it difficult for me to present. I have to use Nico’s Mac and I don’t know how to click on a Mac – some Vulcan death-grip of the fingers. And this isn’t PowerPoint – it’s me running software live; after all, that’s what the workshop is about. We cover rather less than planned.

When Alex presents on another Windows machine it’s fine.

For the first two slides. Then the top half of the screen is covered in creeping darkness. Alex is reduced to showing us MathML in dark text on a dark background. We believe him.

But the good thing is that it generates excellent discussion about semantics. We have a good range of participants, from newcomers to experts, so there is a lot of useful discussion about what to mark up, when, whether triple stores work, etc.

We finish on a technical high. Many of the participants were able to run stuff. We got one or two Chempound repositories working.

It’s probably a good time to write a blog post describing the software and resources.

After the tea break.



Open Research Reports: Links, Video and Prezi

@jenny_molloy has created a top-class summary of #ORR2011 at http://science.okfn.org/2011/10/29/okfn-at-oss2011-open-research-reports/ . It will point you to the discussions, the presentations, the latest URLs for the workshop, etc. We have a wiki at http://wiki.okfn.org/Wg/Science/swat4ls_hackathon

Jenny and I have created a video (http://dl.dropbox.com/u/6280676/ORRVideo.m4v ) describing the *why* of ORR2011 – why is it necessary? It’s about 5.5 minutes, and everyone to whom I’ve shown it has understood the point.

It had to be made on a tight timescale (ca 3 days). I looked for existing footage that we could use, including:

  • Me talking at Open Science Summit (#oss2011)
  • Jenny talking at Open Science Summit (#oss2011)
  • Some of our #oss2011 slides
  • Graham Steel at Cambridge earlier this year representing patients
  • Prof. Mary Abukutsa-Onyango discussing the importance of Open Access for research from Kenya and other African countries (with Leslie Chan, a co-creator of ORR).
  • Interlude slides from Jenny.

I was in Washington State, Jenny in Oxford, so we iterated slowly and at strange times of the day. I would say “let’s have 33 secs of this and 22 of that, and can you voice over this and add a caption here”, while Jenny was learning how to use the equipment in Oxford (BTW thanks to Oxford University Computing Services, OUCS). And it would turn out that she couldn’t voice over because the ambient noise was too loud, and there would be glitches, and so on. Then I would see the next draft, show it to people and get comments about … So Jenny would re-edit at strange times of the day (and not at weekends, because the facilities were closed, etc.).

I think Jenny has done a tremendous job. (Apart from her starring role, which was anyway pretty ad lib as we only finished the slides just before the #oss2011 presentation).

So it’s large (50 Mbytes), but it’s a video. It needs a CC-BY notice; treat it as CC-BY and we’ll try to add this later. BTW all the components are CC-BY, so we were able to use Mary’s footage without prior permission as we were in a rush. Maybe we should put it on YouTube?

So this will entice people to come. But we’d also like to tell people what *semantic* means. So maybe there will be a second edition.

Meanwhile Mark MacGillivray has converted “animal garden” to Prezi. Fantastic job, Mark. Because it was in PowerPoint, Mark wasn’t able to translate the speech bubbles. I should have dumped it as PDF (what am I saying?). Anyway it’s great and here it is: http://prezi.com/curtjkrhlagu/animal-garden/. Open it full screen and it will change every few seconds, I think.

Enjoy (if you are a proponent of openness). Else prepare to be converted.
