What is Data and what should be Open?

I got a long and thoughtful comment from Steve Bachrach on Open Data which contains important points.

Great post on an important topic. I'm just going to throw some random thoughts together here.

One of the problems in this area (copyrights, databases, patents) is that the governing legal rules differ from country to country. I applaud the notion of generating a chemistry community – maybe even a science community – set of best practices. This can, I hope, cross boundaries.

PMR: Yes. That's why the Open Knowledge Foundation and Science Commons efforts are critical. They understand the legal issues. Talis/Jordan Hatcher donated legal services to address this.

I worry about the need to claim that data is open, as in a sense diminishing what is rightfully there from the start. My understanding is that data cannot be copyrighted here in the US. While the statement "the melting point of benzene is 5 °C" can be copyrighted, the underlying fact is in the public domain and one can freely use that information for any purpose. The same is true about spectra – the IR image of the absorption of benzene is copyrightable (how one displays data can be a creative effort – see below) but the fact that benzene does not absorb at 1900 cm-1 is data and can be freely re-used.

PMR: This is a logical position and one I used to take. However many scientists do not know or appreciate this. Moreover many publishers will stick copyright notices on this type of material. Hence it is important for the author to assert that all data is Open.

As part of best practices I would like to see journals insist that all data used in preparing the article be submitted as part of the article. The data should then be made available for re-use with no limitations or inhibitions (i.e. no cost). Currently, that would be as supporting materials but I hope that in the future this data might be more intimately connected with the article itself. In an ideal world, data should be packaged in a way to facilitate re-use, but that is something I am willing to let the world grow into slowly.

PMR: this is the position I have consistently taken for several years. I am also working with my colleagues and Microsoft and JISC to develop the tools to make it effortless in the chemistry community.

I think you need to be very careful about the notion to Assert that Data covers images and tables and other ways of representing data. Representation of data can very much be a creative process and should be copyrightable. For example, a landscape photographer is simply taking an image of data – the way the countryside appears on a particular date and time. Let that photographer be Ansel Adams, and he obtains a representation of data that is coloured by the artist's perceptions and biases and influences and creates a product that I call art – and that image should be protected. A chemist may do the same thing with an array of data to create a plot that clearly and distinctly manipulates the data to make a point. In such a case, I would protect the image (the plot) with copyright and insist that the underlying data (such as the excel file) be published simultaneously within the supporting materials and that data is open in the public domain. Things get a bit tricky here too – does the photo of a gel run by a biochemist count as data or as a creative image? I think the former, but I'm open to discussion.

PMR: This is a critical point and I’ll stress it in my talk tomorrow.

The primary issue is whether the author wishes to copyright the image and whether Community Norms support this copyrighting. The primary problem is that the author often hands over copyright of all images to the publisher who has put no creative effort into them. I would argue the following:

  • There are many areas of science where images and tables are the natural way of communicating the science. The alternative of words or unorganised numbers is often counterproductive. In these cases the Community Norms should urge the author to use appropriate PDDL or CC0 methods to protect the images, while asserting that they belong to the community.

  • There are cases where the author puts creative effort into the image (and, to a lesser extent, the table). In most cases that is so that the reader understands the underlying data better. I suspect that few authors would wish to prevent others reproducing their work as long as attribution was prominent; indeed they would encourage it. The problem is that copyright in publishers' hands leads to reduced communication.

To do this the author should automatically stamp all images and tables with their authorship and their OpenData intention. This will become easier once tools support it. I have argued that all computational chemistry programs should emit an open data message by default, with a runtime/command-line switch to remove this if required. Similarly I intend that our Chem4Word tool automatically stamps OpenData on every molecular structure unless disabled. This does NOT mean the molecule cannot be patented; it only means that when released the image or connection table can be freely used.
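As a purely illustrative sketch of what such a default stamp might look like (the flag name and notice text below are my own inventions, not the interface of any existing program or of Chem4Word):

```python
import argparse

# Hypothetical notice; real wording would point at a formal dedication (e.g. PDDL or CC0).
OPEN_DATA_NOTICE = ("# OpenData: the author asserts that this output may be freely "
                    "copied and re-used without further permission.")

def main():
    parser = argparse.ArgumentParser(description="toy computational-chemistry driver")
    # Open data is asserted by default; an explicit switch is needed to suppress it.
    parser.add_argument("--no-open-data", action="store_true",
                        help="suppress the default OpenData statement")
    args = parser.parse_args()

    # ... run the calculation and write the normal output here ...

    if not args.no_open_data:
        print(OPEN_DATA_NOTICE)

if __name__ == "__main__":
    main()
```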

The same should be done for gels, pictures of cells, movies of mice, etc. There is no moral reason for publishers to own this. It’s the fundamental information infrastructure of science.

Bottom line is I fully support an effort to create best practices and think that the Panton Principles are a great starting point!

PMR: thanks

Posted in Uncategorized | Leave a comment

Web Democracy through mySociety

To convince you of the power of Web Democracy here's what I have been able to do in the last three hours. I lamented the state of UK politics and that I was unable to get a reply from my MEP on Net Neutrality. I wrote to the BBC and got a reply:

Thank you for sending your email to the Broadcasting House programme on BBC Radio Four. This is an automated reply to confirm that your message has reached us and will be drawn to the attention of the programme’s presenter and producers.

If you are asking for specific information, we’ll do our best to supply it. Please bear with us, because we receive a great many emails and our priority must be to maintain our on-air service. We may not be able to meet complicated requests.

If you are making a complaint or comment which requires a personal response, please forward your email to bhfeedback@bbc.co.uk

This is exactly what I needed: clear and courteous. Now if I had got that from my MEP I would have been happy. As it was, however, I got a reply from a mySociety volunteer (anonymised as Anon):

Thanks for the nice feedback for the site [mySociety].

Neither MPs nor MEPs are covered by the Freedom of Information Act,
and are under no obligation to reply to any letters from their
constituents either, although most do.

For more information on what you can do if you haven’t received a
reply please see:
<<http://www.writetothem.com/about-qa#noreply>>

You can also write to your MEP via this site.

Again, exactly what I had hoped for. In retrospect I should have poked around the mySociety site and found it for myself. But this is Web Democracy where everyone contributes and Anon has done exactly that.

WriteToThem does exactly what it says on the tin. Let’s take you through it (I’m posting this with screenshots as I can now do this in ICE!).


So I now get a list of my MEPs.

I hadn’t thought of writing to all of them. I’ll do that.

Read this before writing your message

Please be polite, concise and to the point; if you abuse your MEPs you devalue the service for all users.

Use your own words; we block copy and pasted identical messages.

It's a waste of time writing to MEPs other than your own; your message will be ignored.

MEPs can help you on proposed European directives (laws), and questions on the European Parliament, Commission, or Union. However, once passed, EU laws become the responsibility of the UK to implement, so you may wish to go back and contact your MP about them in that case. Similarly if your letter is about a local or national issue, please go back and contact a regional or national representative, as your MEP will be unable to help you in that case.

Note that MEPs cannot help raise an issue with the European Court of Human Rights. The Convention is incorporated into UK law, so any challenge must start in the UK legal system.

Exactly what I want. Now for the list of MEPs

[screenshot: the list of my MEPs]

This is presented very clearly. So I'll write:

I'm writing on the issue of Net Neutrality in Europe, which I consider to be a fundamental democratic right and also a likely competitive advantage in wealth generation. This is particularly important for the Cambridge area, with its large number of knowledge economy companies, which will be essential if we are to achieve the vision of the Lisbon protocol.

I wrote to Andrew Duff before the vote on Net Neutrality earlier this month but received no reply. I would like the following:

  • An acknowledgement of this mail, even if automatic. There is nothing more destructive of the democratic process than the impression that one's elected representatives do not care.

  • A summary of the current position on Net Neutrality in Europe after the vote. My colleagues tell me that freedom has not been assured and that big business can still expect to control access to the Net.

Peter Murray-Rust

For the record my previous letter is on my blog at http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1742 (Dear MEP, Please save Our Internet).

I fill in my name and address and submit the letter. WTT asks me to wait while they process my letter…

and I get:

All done. We'll send your message now.

If, for some reason, we couldn't send your message to your representative, we will email you to tell you. Otherwise, you can sit back and relax: your message is on its way.

Get email from your MP in the future

and have a chance to discuss what they say in a public forum

Your email:

Great. Now let’s see whether I get any reply.

And many thanks to mySociety. It’s changing our democratic process by the day. Look at it, and see if there are services you can responsibly use.

Posted in Uncategorized | 2 Comments

Can you help OPSIN with disambiguating chemical names?

[This post is requesting input from the chemical community]

In creating our IUPACName2Structure converter OPSIN we need to cater for a diversity of usage. This arises from:

  • ambiguity and multiple approaches in the IUPAC rules

  • authors of documents who ignore or adapt the IUPAC rules

  • programs which ignore or adapt IUPAC rules

It's difficult for Daniel and me to assert that a given approach is better than another so I'm turning to the court of public informed chemical opinion. Ideally we would like to get to a system where an OPSIN user community, rather like Wikipedia, develops a communal view based on the best reading of the literature and current practice. This post is really to set the scene. If there are any contributors to the IUPAC rules then their views should be given special weight.

In principle we should adopt IUPAC practice wherever we can determine with certainty what it is.

In practice we have run 3 of the commoner commercial programs to see how they interpret (not generate) chemical names. There are differences of opinion (not just errors of implementation).

So here are two fundamental questions:

  • Do spaces in chemical names matter? If so what are the rules?

  • Do hyphens in chemical names matter? If so what are the rules?

Here is our standard example of ambiguous practice. The name chloroethylbenzene is, I think, a valid IUPAC name. But it is ambiguous and can represent 5 structures (one of which can have further stereoisomers). Pubchem correctly lists all 5:

2. CID 69330 – p-Chloroethylbenzene; 1-Chloro-4-ethylbenzene; Benzene, 1-chloro-4-ethyl- …
   IUPAC: 1-chloro-4-ethylbenzene | MW: 140.610060 g/mol | MF: C8H9Cl

3. CID 6995 – o-Chloroethylbenzene; 2-Ethylchlorobenzene; Benzene, 1-chloro-2-ethyl- …
   IUPAC: 1-chloro-2-ethylbenzene | MW: 140.610060 g/mol | MF: C8H9Cl

4. CID 231496 – Ethylchlorobenzene; Phenethyl chloride; (2-Chloroethyl)benzene …
   IUPAC: 2-chloroethylbenzene | MW: 140.610060 g/mol | MF: C8H9Cl

So we'd welcome comments, answering the following and similar questions:

  • If OPSIN is given chloroethylbenzene should it refuse to parse it because of ambiguity? (We intend that OPSIN should, as far as possible, show where the ambiguity occurs.)

  • If it does parse it, should it guess one of the five structures? (We are working on returning generic structures but that’s not for now).

  • Should the guess be random or should it be informed by some principles (including the popularity of chloroethylbenzene as a synonym)?

  • Can punctuation help to remove the ambiguity? If so, which of the following are un- or less ambiguous: chloro ethylbenzene, chloroethyl benzene, chloro ethyl benzene? Remember that the spaces may not be put in by the author, but by a wordprocessor or technical editor. Similar considerations apply to: chloro-ethylbenzene, chloroethyl-benzene, chloro-ethyl-benzene. Remember that wordprocessors and editors may hyphenate names.

  • What about incomplete locants? Which of the following are unambiguous? 1-chloroethylbenzene, 2-chloroethylbenzene, 3-chloroethylbenzene, 4-chloroethylbenzene.

  • Would the context influence you? Is a study of 2-chloroethylbenzene, 3-chloroethylbenzene and 4-chloroethylbenzene less ambiguous?

Note that in some cases ambiguity can be resolved by enumerating the chemically sensible structures. Thus decachloroethylbenzene is unambiguous assuming normal valence rules but any number other than 10 chlorine substituents is ambiguous. And for, say, tetrachloroethylbenzene there will be quite a lot of isomers! We should not load this fortuitous resolution onto OPSIN and I would throw decachloroethylbenzene out as ambiguous (no locants).
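To make the ambiguity concrete, here is a minimal sketch that simply enumerates the five readings as SMILES (the structures are the ones discussed on this blog; the code is illustrative and not part of OPSIN):

```python
# "chloroethylbenzene" carries no locants, so it can mean either
# (chloroethyl)benzene or chloro-ethyl-benzene; enumerating both readings
# gives five distinct structures.
CHLOROETHYLBENZENE_READINGS = {
    "(1-chloroethyl)benzene": "CC(Cl)c1ccccc1",
    "(2-chloroethyl)benzene": "ClCCc1ccccc1",
    "1-chloro-2-ethylbenzene": "CCc1ccccc1Cl",
    "1-chloro-3-ethylbenzene": "CCc1cccc(Cl)c1",
    "1-chloro-4-ethylbenzene": "CCc1ccc(Cl)cc1",
}

def is_ambiguous(readings):
    """A name is ambiguous if it admits more than one distinct structure."""
    return len(set(readings.values())) > 1

if __name__ == "__main__":
    for name, smiles in CHLOROETHYLBENZENE_READINGS.items():
        print(f"{name:26s} {smiles}")
    print("ambiguous:", is_ambiguous(CHLOROETHYLBENZENE_READINGS))
```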

Your comments will be very useful and also any suggestions as to the governance of this in future.

Posted in Uncategorized | 4 Comments

Can we create Web Democracy for the UK?

I would not normally write about politics on this blog, but non-Brits may not have caught the raw anger of the UK electorate about the betrayal of trust by their elected representatives (Members of Parliament). I believe that web democracy is now essential for modern government. By web democracy I mean the processes that so many of us have developed in our own work. I am not suggesting that conventional government is replaced by Web processes but that web processes should be used to supplement the process of government and be baked into that process. That is why Net Neutrality matters so much.

This morning BBC ran its regular Broadcasting House program where they asked 3-4 of the great and the good to comment: an unelected bishop, Tony Benn and the unelected Lord Sainsbury. Lord Sainsbury was the Minister for Science who did much to resist the introduction of Open Access into the UK (or certainly did nothing to promote it). The program asked us to mail our comments; I have no idea what they do with them but I wrote rapidly (a sort of megatweet):

In your current program (2009-05-17:0900) you ask for input to create a new democracy for the UK

Government can be made transparent through the Internet.

An outstanding example of web democracy is given by mySociety (http://www.mysociety.org/) and similar organisations which provide sites for public transparency. Examples are TheyWorkForYou (what MPs have said and voted for) and WhatDoTheyKnow (FOI requests). The Cabinet Office has supported some of this.

Transparent public information is now key to democracy

The first point is that I am more likely to get my voice heard through the media than through the formal process of my MP. (I recently wrote twice to my MEP about Net Neutrality and also left a message on their answerphone and got no reply. This is not democracy.) The Guardian ran a story on Thursday about Heather Brooke, a journalist who fought for 5 years for freedom of information about MPs' expenses and was fought aggressively and relentlessly by the MPs and the Speaker (the person who is meant to run the House of Commons but publicly fails to do so). The MPs have deliberately tried to cover up their expenses, and it's to the credit of the Daily Telegraph that we now have them.

The argument that we can vote out MPs after 5 years is not tenable in the Internet age. I'm not asking for mob justice, but I am asking for Open Government. That's what mySociety has done so spectacularly. You can now make formal requests for information from any public body. I'm also impressed by Downing Street posting petitions on their site (as for the campaign against the EPSRC policy I blogged about).

So let's see TheySpendYourMoney.org, listing MPs' expenses. It's trivial to do technically. Or ListenToUs.org where they are forced to reply to our views.

While on this subject, we have local and European elections next month. I always try to vote. I’d like to emphasize the importance of Web Democracy and might be persuaded to carry a placard with a simple message. Any suggestions?

Posted in Uncategorized | 2 Comments

Should the Foology Society sell its journals to commercial publishers?

I received a request from a well-known learned society, which I will anonymise as the Foology Society (and use female gender to anonymise people). They have agreed for me to blog this and actively invite comment either on this blog or by mailing me to pass on. A well-known scientist and long-standing member and officer of the Society (Prof. Foo) rang me and asked if I could give her informal advice about whether the Society should sell its flagship journal to a commercial publisher. The motivation was not primarily to raise revenue, but fear about the commercial prospects of society journals. She sent me the following which epitomizes the concerns of many of the society officers:

“Libraries will target most of their cost cutting attentions on the smaller academic/not-for-profit sector subscriptions in order to protect the large commercial contracts such as the “Big Deal” and similar consortia. The days of small independent publishing are over, out-licensing is the only way to protect publishing income.”

She regards this as a catastrophe for the society and the journal and asked if I could provide contrary views.

I immediately replied that on no account should the society sell its journal: it was the crown jewels and far too many societies had sold these. I'll first of all give my own views, which relate only to STM journals – and then suggest how she can get other views and also get help.

The journal system of today arose from learned societies. In 1970 many countries had their own domain journals: France, Germany, Scandinavia, Switzerland, the UK, the USA and Canada all had national societies which ran flagship journals. There was a close liaison between the society journal and the membership and most members would receive one or more journals. Some scientists belonged to more than one society so as to get the journal. Libraries would subscribe to most national journals in a domain.

I've been a founder member (Treasurer) of a learned society (Molecular Graphics, now Molecular Graphics and Modelling) which in 1982 also launched a journal, run from the start by a commercial publisher (Butterworth). The society raised its income from membership and meetings (equipment manufacturers would pay to exhibit and we always made a profit). To be fair there were no paid officers, but that is how many societies started. We got no income and incurred no costs from the journal, but members got it free.

As I blogged recently a major asset in C21 will be trust. I still trust learned societies to behave honorably (and when they do not it is deeply upsetting). I do not now trust commercial publishers to act honorably in all circumstances. The lobbying in Congress, Parliament, Europe by commercial publishers is often directly against the interests of scientists, most notably through the draconian imposition of copyright. The PRISM affair highlighted the depths to which some publishers will go to protect their income rather than the integrity of the domain. For Elsevier to finance PRISM to discredit Open Access science as junk while publishing fake journals means that no society can rely on their integrity.

Prof Foo shows how poisonous FUD is now a perceived part of libraries' thinking. It may or may not be accurate but it is the perception. She is right that libraries are not likely to protect societies. Indeed I regard individual libraries now as the wrong place to purchase journal subscriptions. It is easy for a professional salesperson from a large publisher to win against the average library, which is not trained in sales and negotiating. The divide-and-conquer strategy has been too successful already.

I would take individual libraries out of the purchasing system immediately and make this a national or consortium process. I believe Brazil buys journals nationally and so every university has access to every subscribed journal. This cuts management costs and almost certainly also cuts journal subscriptions. In the UK possible organs could be JISC or the British Library.

It is possible that consortia such as Highwire can provide a critical mass but I don’t know enough about their purchasing or selling influence (if any). I believe they might provide the sort of bundle that Prof Foo needs.

How is the Foology Soc to continue to remain solvent? That's not easy, but if the journal is part of it then it, not a commercial publisher, should be in control. My own view is that societies must become points of rapid innovation, perhaps by teaming up with Universities and maybe through organs such as JISC.

I gave a list of people who might be able to help, but I hope very much that YOU will also give help, by commenting here or by email.

Because if we do not get ideas, Foology will die. Its soul will pass to the megapublishers.

Posted in Uncategorized | 18 Comments

The Panton Principles: A breakthrough on data licensing for public science?

I've been invited to a COST meeting in Porto (PT) on Monday. COST is an intergovernmental framework for European cooperation and I'm a member of a working group on automating computational chemistry (D37). But this is different – HLA-NET MC-WG is "A European network of the HLA diversity for histocompatibility, clinical transplantation, epidemiology and population genetics". They are interested in how they share data (and materials) through an Open process. I didn't need to know anything about sharing materials, privacy, etc. [BTW I may use data as a singular noun deliberately].

So last Tuesday we had a visit from Cameron Neylon (who incidentally gave a brilliant talk on Open Science, blogged by Nico Adams) and our group went to the Panton Arms (pub) for lunch. I took the opportunity to get Rufus Pollock of the Open Knowledge Foundation to discuss and clarify what our common stance is on Open Data. Cameron has blogged very comprehensively and usefully on the meeting. Rufus has also summarized views from Open Data Commons.

The critical thing to realise is that Open Scientific Data is not Open Software nor Open Content. It may sound arrogant but it can be difficult for a non-scientist to realise that it is different from maps, from Shakespeare, from photography, from government publications, from cricket scores. Scientists by default collect data, or calculate it, to justify their conclusions, to prove they have done the work, and to allow others to repeat the work.

It should be free, as in air.

They expect others to use it, without their permission. This could be to prove the original ideas right, or to prove them wrong. It could be to mine the data for ideas the original scientists missed. No scientist likes being proved wrong, or having someone else find ideas that they have missed. But it's a central part of science. A scientist who says "you can't use my published data" has no credibility today.

That's not to say some scientists don't try to hold their data back and mine the maximum from it before publishing. But it is becoming increasingly required by funders, by universities (in theses) and by some publishers that the data justifying a publication should be published in some way at the time of article publication. And by default there should be no restrictions on copying, re-use, or republishing for whatever purpose and by whomever. I may not like it if my data is used to make weapons, or that a commercial organisation republishes it for money. But that is the implied contract I make by being a scientist. If I don't like weapons derived from science there are other ways I can make my views known other than by adding restrictions, and at times I have.

To summarize. Data itself must be completely free. The question is how to ensure that it is.

The Open Science and Open Knowledge community has been discussing this for about 2 years. We seem to be agreed that legal tools are counterproductive, and that moderation is best applied by the community. This is represented by Community Norms: agreed practices that cause severe disapproval and possibly action when broken.

Our current crisis in Britain illustrates this. Huge numbers of Members of Parliament have been fiddling their expenses. They've been spending taxpayers' money on cleaning their castle moats, buying second homes, antique rugs and so on. Huge amounts. This is, apparently, within the parliamentary guidelines.

But it is against the court of public opinion. It violates our Community Norms. The defence that it is within the rules illustrates the futility of the rules.

And it is incredibly difficult to draft good rules. So we’ve decided not to try to use the standard tools of copyright or licences.

For us, Data are born Open. The question is how to state that. The simplest way is just to add the OKF's Open Data button to the data. That's a statement of intent. It says "you can do whatever you like with this data without asking my permission". In many cases I think that is adequate.

However the community has also investigated the legal aspect and how to provide a formal means of stating this in legal terms. This isn't easy but the two approaches, Public Domain Dedication and Licence (PDDL) and Creative Commons CC0, are roughly equivalent. I hope it's useful to say that PDDL comes out of an Open Knowledge philosophy and deals with collections and other non-scientific content, whereas CC0 springs more directly from science. And it IS complex: I am meant to be an expert and I still find the details difficult. Here's the CC0 FAQ:

How is CC0 different from the Public Domain Dedication and License (PDDL) published by Open Data Commons?

The PDDL is intended only for use with databases and the data they contain. CC0 may be used for any type of content protected by copyright, such as journal articles, educational materials, books, music, and art, as well as databases and data. And just like our licenses, CC0 has the added benefit of being expressed in three ways: through a human-readable deed (a plain-language summary of CC0), the legal code, and digital code. The digital code is a machine-readable translation of CC0 that helps search engines and other applications identify CC0 works by their terms of use.

This was the background when we tried to achieve a common view in the Panton. I’ll let Cameron take it from here:

The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking the argument has devolved into two camps. Firstly those who have a belief in the value of share-alike or copyleft provisions of GPL and similar licenses. Many of these people come from an Open Source Software or Open Content background. The primary concern of this group is spreading the message and use of Open Content and to prevent freeloaders from being able to use Open material and not contribute back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both these goals. Also included here are those who think that it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.

On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don't tend to be concerned about freeloading (or in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from being able to make our own content as free as we would like due to copyleft provisions. I prefer to make all my content CC-BY (or CC0 where possible). I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that data is different to other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that smells of lawyers, like something called a license, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.

PMR: I am completely aligned with Cameron. The added precision of legality is seriously outweighed by its difficulty and downstream problems. Cameron again:

What I think was productive about the discussion on Tuesday is that we focused on what we could agree on with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies but most refer to following best practice and that best practice is thin on the ground in most areas. With funders wielding the ultimate potential stick there is great potential to bootstrap good practice by providing clear guidance and tools to make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.

So we agreed on the following (I think anyone should feel free to correct me of course!):

1. A simple statement is required along the forms of "best practice in data publishing is to apply protocol X". Not a broad selection of licenses with different effects, not a complex statement about what the options are, but "best practice is X".

2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.

3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.

4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.

So in aggregate I think we agreed a statement similar to the following:

Where a decision has been taken to publish data deriving from public science research, best practice to enable the re-use and re-purposing of that data, is to place it explicitly in the public domain via {one of a small set of protocols e.g. cc0 or PDDL}.

PMR: agreed. The biggest danger is NOT making the assertion that the data is Open. There may be second-order problems from CC0 or PDDL but they are nothing compared to the uncertainty of NOT making this simple assertion. Do not try to be clever and use SA, NC or other restricted licenses. Simply state the data are Open. Cameron finishes:

The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps issues of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that fund it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states for instance:

BBSRC expects research data generated as a result of BBSRC support to be made available … no later than the release through publication … in line with established best practice in the field [CN – my emphasis]

The key point for me that came out of the discussion is perhaps that we can't and won't agree on a general solution for data but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If this is the case then it is a very useful step forward.

PMR: completely agreed. Now there are some important actions:

  • get funders, universities and well-intentioned publishers to agree on this approach, with appropriate modifications. It should be sufficient to see the Open Data button to know that the data are free for re-use.

  • Assert that Data covers images and tables and other ways of representing data. It is archaic and bizarre that data presented as an image are copyrightable. We must change this; it's far more important than the second-order problems of PDDL/CC0.

Cameron has called this "A breakthrough on data licensing for public science". If others agree, let's call it the Panton Principles of Open Data.

Posted in Uncategorized | 15 Comments

ICE is working

I have hopefully got ICE working again. The error message from ICE flashed up for ca. 20 ms, so I used screen capture to try to trap it. It took me 20 minutes and about 30 tries, but at least I had the cricket to watch (it's stopped raining).

It seems the problems were because Windows has two directories for programs:

C:\Program Files\

and

C:\Program Files (x86)\

Argggh.

ICE and OpenOffice install in the second. By copying the directories into the first I can now post again.

So I hope the technical qualities of the blogs will improve.

Posted in Uncategorized | Leave a comment

OPSIN: why it can become the de facto name2structure

In a previous post I reviewed our chemical language processing tools – OSCAR and OPSIN. This post updates progress on OPSIN, the IUPACName2Structure converter.

Why do we need a name2structure converter? It's because chemists use language to communicate the identities of objects. It's possible to talk simple chemistry over the phone whereas it wouldn't be easy to describe star maps, isotherms, engineering drawings, etc. And, because of this, chemists often abbreviate names – it's easier to say "mesitylene" than "1,3,5-trimethylbenzene" or "DDT" instead of "paradichlorodiphenyltrichloromethane" (experts will cringe at the horror of this name, which is seriously non-systematic and which could not be worked out by man or machine. There is, however, a lovely limerick based on it).

The rules for naming compounds are set out by the International Union of Pure and Applied Chemistry. Even if you are not a chemist, have a look at the IUPAC Nomenclature Home Page, which represents years of devoted work by chemists, much of the organization done by Gerry Moss. There are many reasons why the field is complicated:

  • almost all compounds can be named in many ways. Thus CH3-O-CH3 could be called methyl ether, dimethyl ether, 2-oxa-propane and so on. IUPAC has recommendations for which of these should be used but they are often ignored, and sometimes are honoured in the breach. Most practising chemists, unless they routinely patent a lot of compounds, neither know these recommendations nor care.
  • Errors are common. Letters can be elided, brackets missed, etc. and plain mistakes made. How many readers could say accurately what the structure (if any) is of capric chloride, caproic chloride, caproyl chloride, caprilyl chloride, and capriloyl chloride? Don't be a goat, it matters:

Buy Caprylic Acid Tablets. Stay fit and healthy, naturally. HollandandBarrett.com/CaprylicAcid

AND

Capric Acid Bulk tankers, drums and other sizes call 877-KIC-Bulk for pricing

So nomenclature is a black art. It's semi-finite in that there are currently a finite number of compounds known (some 10s of millions) and a finite set of rules that can be used to generate an infinite set of names. In a similar way there are a finite set of English words that can be used to generate an infinite set of articles. So, in principle, we could encode a finite set of rules, updated every year when IUPAC generates more rules, that would completely interpret chemical name space.
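As a toy illustration of what encoding "a finite set of rules" might look like, here is a deliberately tiny sketch that handles only names of the form locant-substituent(…)benzene; real IUPAC nomenclature is enormously larger and messier, which is exactly the point:

```python
# Toy vocabulary: only a handful of substituent prefixes are known.
SUBSTITUENTS = {"chloro", "bromo", "methyl", "ethyl"}

def parse_substituted_benzene(name):
    """Interpret names like '1-chloro-4-ethylbenzene' as {locant: substituent}.

    Anything outside this tiny grammar raises ValueError, the analogue of a
    parser refusing a name it cannot interpret safely.
    """
    if not name.endswith("benzene"):
        raise ValueError("only substituted benzenes are handled in this sketch")
    prefix = name[: -len("benzene")].rstrip("-")
    if not prefix:
        return {}  # plain benzene
    tokens = prefix.split("-")
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating locant-substituent pairs")
    substituents = {}
    for locant, group in zip(tokens[0::2], tokens[1::2]):
        if not locant.isdigit() or group not in SUBSTITUENTS:
            raise ValueError(f"cannot interpret '{locant}-{group}'")
        substituents[int(locant)] = group
    return substituents

print(parse_substituted_benzene("1-chloro-4-ethylbenzene"))  # {1: 'chloro', 4: 'ethyl'}
```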

In practice however the labour of doing this has been too great for anyone. Even the market leaders in name2structure would not correctly interpret all the examples in the IUPAC rulebook. There's a very long tail – many rules which apply to only a few compounds – or none – in the 30 million. Not cost-effective at this stage. [There would be a cost-effective way if IUPAC rules were semantically encoded, but that's many years away, if it happens at all.]

Ideally there should be one name2structure converter, sanctioned by IUPAC, just like there is one InChI, sanctioned by IUPAC. In bioscience this would have happened. But in chemistry we have a mess of competitive products, of very variable quality. They cost money (some are free to academics), have many errors, have no agreed standard of quality, have no believable metrics, and have no way of taking input from the community.

A classic picture of anticommons.

So why are we developing OPSIN? In research terms it’s a “solved problem”. We are frequently told academia shouldn’t do things that the commercial sector does better.

In fact we are doing things better and we are doing language research. The motivations are:

  • generic use of language. Chemistry often uses phrases like "substituted pyridines". There is no formal way of representing this concept and we are developing languages that provide a grammar. This is hard, it's research, and it's valuable for the community, for example in interpreting patents (a toy sketch of one possible representation follows this list).
  • disambiguation. This is a key problem in NLP and certainly worthy of research. What does "chloroethylbenzene" mean? It's ambiguous and could be any of 5 structures (ClCCc1ccccc1, CC(Cl)c1ccccc1, Clc1ccccc1CC, Clc1cc(CC)ccc1, Clc1ccc(CC)cc1), one of which has further stereoisomers. Which did the author mean? Can this be deduced from context? OPSIN will indicate whether a structure is ambiguous and in time may even attempt to reason about what was meant.
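As promised above, here is a toy sketch of one way a generic concept such as "substituted pyridines" might be represented. The representation below is invented purely for illustration and is not OPSIN's:

```python
from dataclasses import dataclass

@dataclass
class GenericStructure:
    parent_smiles: str       # the core skeleton
    open_positions: tuple    # ring positions that may carry an unspecified substituent

# "substituted pyridines": pyridine with any carbon position open to substitution.
substituted_pyridines = GenericStructure(
    parent_smiles="c1ccncc1",          # pyridine
    open_positions=(2, 3, 4, 5, 6),    # carbons; position 1 is the nitrogen
)

def matches(generic, named_positions):
    """True if a concrete substitution pattern is allowed by the generic."""
    return all(p in generic.open_positions for p in named_positions)

print(matches(substituted_pyridines, (2, 6)))  # a 2,6-disubstituted pyridine -> True
print(matches(substituted_pyridines, (1,)))    # substitution at the ring nitrogen -> False
```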

These are the research reasons. We’ve now been joined by Daniel Lowe, a first-year PhD student supported by Boehringer Ingelheim to do research into machine interpretation of patents containing chemistry. Daniel’s made an excellent start, primarily by extending OPSIN. When he took this over from PeterC it was not a competitive tool.

Now it is.

How do we measure its success? There are no agreed corpora or metrics for chemistry NLP so we have to be careful. The essentials are to be Open and systematic and to invite community buy-in.

In essence Daniel has taken a representative set of 10000 "formally correct" IUPAC names and analysed them with OPSIN and 2 other commercial programs. (You will appreciate that it is not easy to get funding to buy programs simply to test them, so there are others we cannot use.) At present we find for one corpus progA ~ OPSIN ~ progB and in two others progA > OPSIN > progB (yes, you will be kept guessing).

Treat all metrics with great suspicion, but OPSIN's recall (i.e. names it translates correctly) is around 80% and it has the lowest error rate (incorrectly translated names) of all programs (ca. 1%). [You should ask "on what corpus?" – and shortly we'll tell you and Open it.]
We believe that the main reason why OPSIN < progA is vocabulary. Adding vocabulary is tedious as there is a very long tail. It's good to do while watching cricket (as I am doing) but it's still slow.

So this is the time when we can invite crowdsourcing. Until recently that wasn’t an option, but now OPSIN has a good infrastructure and it’s possible to add vocabulary without having to modify code. Much of OPSIN’s vocabulary is in external files which are fairly easy to modify and which won’t break the system.
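As a rough sketch of the idea (the file format below is invented for illustration and is much simpler than OPSIN's real template files), vocabulary can live in a data file that contributors edit without touching code:

```python
import csv
import io

# Hypothetical vocabulary file: one trivial name per line with a structure.
VOCAB_FILE = """\
name,smiles
mesitylene,Cc1cc(C)cc(C)c1
manxane,C12CCCC(CCC1)CCC2
"""

def load_vocabulary(text):
    """Read a name -> SMILES table from a vocabulary file."""
    return {row["name"]: row["smiles"] for row in csv.DictReader(io.StringIO(text))}

vocab = load_vocabulary(VOCAB_FILE)
print(vocab["manxane"])  # the bicyclo[3.3.3]undecane skeleton
```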

OPSIN has, of course, always been Open Source and so – in principle – anyone could modify it. But in practice many OS projects have an incubation period where the infrastructure is being built and it's very difficult to have an uncontrolled community process. Now we can offer a controlled community process where large numbers of people can make small but useful contributions.

There are two methods of approach, and we’ll start with the first:

  • become a developer on Sourceforge and modify the template files to add vocabulary. Some examples of vocabulary we are missing are carbohydrates, nucleic acid components and amino acids.
  • We should develop an interface that allows users of OPSIN to add vocabulary interactively. Thus if it fails to parse 1,5-dihydroxymanxane, OPSIN would tell the user it didn't know what manxane was and ask for a structure plus locants.

So if you are interested in helping with OPSIN please let us know. Half a dozen vocabulary contributors could make rapid progress.

And when this is done we’ll have a tool that interprets IUPAC names and which, as it is Open, can become a de facto standard.

Posted in "virtual communities", Uncategorized | Leave a comment

funding models for software, OSCAR meets OMII

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem – there isn't a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn't. I can count the following approaches (there may be combinations):

  • Institutional funding. That’s the model that ICE: The Integrated Content Environment uses. The major reason is that the University has a major need for the tool and it’s cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from institutional funding. Thus all the major repository software (DSPACE, ePrints, Fedora) and content/courseware (Moodle, Sakai) have a large formal member base of institutions with subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating. An excellent example of this is George Sheldrick's SHELX, where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions – in ca. 1974 – when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados there was a single variable A(32768) in which different locations had defined meanings only in George's head. Add EQUIVALENCE and blank COMMON, and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route – an academic group either licenses the software to a commercial company, or sets up a company themselves, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, such as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the BlueObelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it’s now having an important impact on chemical informatics – it encourages innovation and public control of quality. Most of the components are created on marginal costs. It’s why we have taken the view that – at the start – all our software is Open. I’ll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one – a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research – gorillas can be friendly – who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We’ve been working together on Chem4Word which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing with Lee Dirks, Alex Wade and Tony Hey the future of C4W
  • Public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this…

So what and why and who and where are OMII?

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK’s software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative – ENGAGE – sponsored by JISC – whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE (info@omii.ac.uk) or visit their website (www.engage.ac.uk).

ENGAGE combines the expertise of OMII-UK and the NGS – the UK's foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. "The response from researchers has been very positive" says Chris Brown, project leader of the interview phase, "we are learning a lot about their perceptions of e-Research and the problems they have faced". Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.
The results of the interviews will be reviewed during ENGAGE's second phase. This phase will identify and publicise the 'big issues' that are hindering e-Research adoption, and the 'big wins' that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK's expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. "It's very early in the interview phase, but we're already learning that researchers want to be better informed of new developments and are keen for more training and support." says Chris Brown.
ENGAGE is a JISC-funded project that will collaborate with two other JISC projects – e-IUS and e-Uptake – to further e-Research community engagement within the UK. "To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them" says Neil Chue Hong, OMII-UK's director, "We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE."

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we've been working with OMII. We had our first f2f meeting on Thursday, when Neil and two colleagues, Steve and Steve, came up from Southampton (that's where OMII is centered although they have projects and colleagues elsewhere). We had a very useful session where OMII have taken ownership of the process of refactoring OSCAR and also evangelising it. They've gone into OSCAR's architecture in depth and commented favourably on it. They are picking PeterC's brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)

And then there is the evangelism. Part of OMII-ENGAGE's remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I'll pass it on to the OMII folks.
… and now OPSIN…

Posted in "virtual communities", nmr, open notebook science, Uncategorized, XML | 2 Comments

OPSIN and OSCAR – Chemical language processing

This post is about new developments in our chemical language processors OSCAR and OPSIN and about how OMII (eScience) and we are taking them forward. We also have a JISC project with NaCTeM – CheTA – and I'll write more about that later.

Many of you will know that we have been interested for several years in the Natural Language Processing (NLP) of chemistry texts. "Text-mining" – the extraction of information from texts – is now commonplace (and will remain so until we move away from PDF as the only means of communication). Our interest has been wider – with Ann Copestake and Simone Teufel in the Computer Laboratory we've been trying to get machines to understand the language of chemical discourse – "why was this paper written?", "what is the author's relation to others?", etc.

But to do this we needed language processing tools which were chemistry-specific, and since 2002 we've developed the OSCAR and OPSIN tools (see http://sourceforge.net/projects/oscar3-chem). OSCAR was the first, developed initially by Joe Townsend and Chris Waudby through summer studentships from the Royal Society of Chemistry. The first version of OSCAR was developed to check the validity of data in chemical syntheses and has been mounted on the RSC's website for 5-6 years.

I know from hearsay that this is widely used though I don't have any download figures. This software is variously referred to as OSCAR and internally as OSCAR-DATA or OSCAR1. It is a measure of its quality that it has been mounted for > 5 years and has run with no reported problems and required no maintenance. I continue to emphasize the value of making undergraduates full members of the research and development process; that is why in our group we continue to highlight their importance.

You will need some terms now:

  • chemical natural language processing – applying the full power of NLP to chemically oriented text. This includes approaches such as treebanking, where we try to interpret all the possible meanings of a sentence or phrase: "time flies like an arrow" (Marx) or "pretty little girls school". There are relatively few systems which do this, at least in public.
  • chemical entity recognition. A subset of chemical NLP where the parsers identify words and phrases representing chemical concepts. To do this properly it's necessary to recognize the precise phrase. Thus "benzene sulfonic acid" represents a single phrase and to interpret it as "benzene" and "sulfonic acid" is wrong. We also recognize phrases to do with reactions, enzymes, apparatus, etc. This is an area where we have put in a lot of work.
  • Chemical name recognition is an important subset of chemical entity recognition. Names can be recognised by at least (a) direct lookup – required for trivial or trade names (“cholesterol”, “panadol”) (b) machine-learning techniques on letter or n-gram frequencies and (c) interpretation (below).
  • Chemical name interpretation, e.g. of (IUPAC) names (e.g. 1-chloro-2-methyl-benzene). The International Union of Pure and Applied Chemistry (IUPAC) oversees the rules for naming chemicals, which run to hundreds of pages. It looks algorithmic to code or decode chemical names. It is NOT. Some computer scientists have taken this as a toy language system and been defeated, because it is actually a natural language with rules, exceptions, irregular formations and a great deal of non-semantic vocabulary. It includes combinations (semi-systematic) such as 7-methyl-guanosine, where if you don't know what guanosine is you can make little progress (but not none, you know there is a methyl group).
  • Information extraction. The (often large-scale) extraction of information from documents. This is never 100% “correct”, partly through lack of vocabulary, partly through variations in language including “errors”, and partly because of ambiguity. We use the terms recall (how many of the known chemical phrases were actually found) and precision (how many of the retrieved phrases were correctly identified as chemical). Note that this requires agreement as to which phrases are chemical and this must be done by humans. This annotated corpus requires much tedious work, and to be useful must be redistributable in the community. Without it any reported metrics on the performance of tools are essentially worthless. There is commercial value in extracting chemical information and so, unfortunately, most metrics in this area are published as marketing figures. Note that the performance of a tool is not absolute but depends critically on the selection of documents on which it is run.
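A minimal worked example of these two metrics, with made-up numbers purely to show the arithmetic (the phrases and the hypothetical tool output are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved phrases that are truly chemical.
    Recall: fraction of the known chemical phrases that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    return len(true_positives) / len(retrieved), len(true_positives) / len(relevant)

# Toy annotated corpus: the human-agreed chemical phrases in a passage...
gold = {"benzene sulfonic acid", "acetyl chloride", "7-methyl-guanosine", "cholesterol"}
# ...and what a hypothetical tool reported for the same passage.
found = {"benzene sulfonic acid", "acetyl chloride", "cholesterol", "arrow"}

p, r = precision_recall(found, gold)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.75, recall = 0.75
```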

During this process Joe and Chris enhanced OSCAR by adding chemical name recognition using n-grams and Bayesian methods. This gave a tool which was able to recognize and interpret large amounts of the world's published chemical syntheses. It's at that stage that we run into the non-technical problems such as publisher firewalls, contracts, copyright and all the defences mounted against the free digital era (but that's a different post).

The next phase was a collaborative grant between Ann Copestake and Simone Teufel of the Cambridge Computer Laboratory and myself, funded by EPSRC (SciBorg). I re-emphasize that SciBorg is about many aspects of language processing besides information extraction. We were delighted to include publishers as partners: the RSC, the International Union of Crystallography and Nature Publishing Group. All these have contributed corpora, although these are not wholly Open.

In NLP an important aspect is interpreting sentence structure through part-of-speech tagging. Thus "dihydroxymanxane reacts with acetyl chloride" has the structure NounPhrase Verb Preposition NounPhrase. There's a splendid tool, WordNet, that will interpret natural language components – here is what it does for "acetyl chloride" (identifying it as a Noun). But it fails on "dihydroxymanxane" – not surprising as my colleague Willie Parker coined the name manxane in 1972 and the dihydroxy derivative is generated semi-systematically. There are an infinite number of chemical names and we need tools to identify and interpret them.
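A toy sketch of the idea (this is not how OSCAR works; the lexicon and suffix list below are invented for the example): a small lexicon plus a chemical-suffix fallback will tag the unknown word "dihydroxymanxane" as chemical where a general-purpose lexicon fails.

```python
import re

# Tiny lexicon for the example sentence; anything unknown falls through to the suffix rule.
LEXICON = {"reacts": "VERB", "with": "PREP", "acetyl": "CHEM", "chloride": "CHEM"}
CHEMICAL_SUFFIX = re.compile(r"(ane|ene|yne|ol|yl|ide|ine|one|oxy)$")

def tag(token):
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    if CHEMICAL_SUFFIX.search(word):
        return "CHEM"   # unknown word, but it looks like a chemical name
    return "NOUN"

sentence = "dihydroxymanxane reacts with acetyl chloride".split()
print([(w, tag(w)) for w in sentence])
# [('dihydroxymanxane', 'CHEM'), ('reacts', 'VERB'), ('with', 'PREP'),
#  ('acetyl', 'CHEM'), ('chloride', 'CHEM')]
```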

OSCAR was therefore developed further by Peter Corbett to recognise chemical names in text and our indications are that its methods are not surpassed by any other tool. Remember that results are absolutely dependent on an annotated corpus and on the actual corpora analysed. It's easy for any tool to get good results on the corpus it's been trained on and lousy ones for different material. But, on a typical corpus from RSC publications, OSCAR3 scores over 80% combined precision and recall. (Before you brag that your tool can do better, the study also showed that expert chemists only agreed 90%, so that is the upper limit. If chemists cannot agree on something, then machines cannot either.)

OSCAR3 is now widely used. There have been over 2600 downloads from SourceForge (yes, of course OSCAR3 is Open Source). We get little feedback because chemistry is a secretive science but this at least means that there are relatively few bugs. Of course there may also be people who find they can’t install OSCAR3 but don’t contact us. The European Patent Office has used OSCAR3 on over 70,000 patents.

So OSCAR can justify some effort to make it even more usable and that’s why we have approached OMII. See below…

When we first started OSCAR we realised that we needed a name2structure parser if we were going to understand the chemistry. It's valuable to know that dihydroxymanxane is a chemical, but even better if we know it is 1,5-dihydroxybicyclo[3.3.3]undecane because chemists can interpret that. So I started by writing a separate tool to interpret chemical names (there weren't any other Open Source programs to do this, and there still aren't). Joe Townsend took over and researched the literature for parsing methods, and handed this over to PeterC at the start of SciBorg. Peter made useful enhancements to this and included it as a subcomponent, OPSIN. Peter deliberately did enough work to interpret common chemical names and included it in the OSCAR processing chain.

I want to be very clear. OPSIN has never been promoted as a tool to compete with commercial name2structure tools (there are 3-4) . It was an adjunct in the Sciborg program. If PeterC or I had spent more time increasing its power it would have been at the expense of what the grant was for. It met its given purpose well – to highlight the value of automatic translation and markup of names, and led – in part – to the RSC’s development of Project Prospect where chemical concepts in publications are semantically marked. From time to time we see anecdotal reports that OPSIN is not up to the standard of commercial tools and that is used as an argument for poor quality in Open Source projects and – sometimes – the relative inability of academics to do things properly. That’s unfair, but we have to bite our lips.

That’s now massively changing and I believe that in a few months time OSCAR and OPSIN will be seen as a community standard in chemical language processing and chemical entity interpretation. Being Open Source that will lead to increased community effort which has the power to leapfrog some of the commercial offerings. More in the next blog post.

Posted in "virtual communities", Uncategorized | 4 Comments