Net Neutrality: can you help me and my MP?

I have been campaigning for Net Neutrality and the web as a democratic process (web Democracy). As a result I got a comment on my blog from my MP, David Howarth (LibDem), who felt I had treated him somewhat unfairly. He has a valid point and I will try to find common ground (perhaps by clarifying timescales and recording of correspondence). I'm also asking readers of this blog to help, either by commenting on the protocol of communicating with MPs or by pointing to useful information.

I wrote to my MEP (Andrew Duff, also LibDem) about Net Neutrality in Europe. I didn't get any acknowledgement or reply, which upset me. So I used the mySociety service WriteToThem to send a letter in public (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1955). This brought a useful reply, which I published (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1984). In it Andrew Duff wrote:

 I am aware that the UK along with France is
pushing strongly to change the wording agreed by the Parliament.  This
action will delay the progress of the report and put at risk the many
important aspects in this package which will increase competition and boost
jobs.  I therefore urge you to raise your concerns with the UK Government
and ask them to support the article which ensures that an internet cut-off
can only be imposed after approval from a competent legal authority.

I did so by using WriteToThem to write to David Howarth essentially asking him to take the issue to the government.

I got an acknowledgement promising that the matter would be brought to David Howarth's attention, but heard nothing more until 2 days ago, when I got a followup from WriteToThem (WTT) noting that it was 2 weeks since I had written and asking what feedback I had had. I assumed that it was reasonable to expect a substantive reply within two weeks and so posted a somewhat critical comment (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2051). David Howarth commented:

Dear Dr Murray-Rust,
I think you are being slightly unfair. This is not an issue I have any expertise in. My gut instinct would be to agree with you, but there is more to policy than gut instinct. Would you prefer a thought-through reply or a statement of unthinking prejudice?
Regards,
David Howarth

I appreciate this reply and admit that I have been slightly unfair. I had based my action on the similarity with the Freedom of Information Act, which requires public institutions to provide substantive replies within 20 working days (and for which support is provided by mySociety's sister site WhatDoTheyKnow?). I had assumed that WTT's followup suggested that a substantive reply could be expected within 14 elapsed days (this is my error and not WTT's).

So I accept that David Howarth is working on the matter and I also know from colleagues that he has addressed other problems promptly. I also completely agree that if time gives a better response then I am prepared to wait. I am unaware of the UK’s proposed policy (as opposed to the EU’s) and so I will try to help David Howarth by asking the readership:

PLEASE CAN YOU PROVIDE LINKS EXPLAINING THE CONCERNS TO WHICH ANDREW DUFF IS REFERRING?

I will forward this message to David Howarth through WTT.

Posted in Uncategorized | 2 Comments

The Doctor Who Model of Open Source

How do we sustain Open Source in a distributed world? We are facing this challenge with several of our chemical software creations/packages. People move, institutions change. Open Source does not, of itself, grow and flourish: it needs nurturing. Many packages require a lot of work before they are in a state to be usefully enhanced by the community; "throw it over the wall and it will flourish" does not work.

Many OS projects have clear governance and (at least implicitly) funded management. Examples are Apache, Eclipse, etc. Many others have a BDFL (Benevolent Dictator For Life), with characters such as RMS, Linus Torvalds (Linux), Guido van Rossum (Python), Larry Wall (Perl), etc. These command worldwide respect and they have income models similar to those of literary giants. These models don't (yet?) work for chemistry.

Instead the Blue Obelisk community seems to have evolved a Doctor Who model. You’ll recall that every few years something fatal happens to the Doctor and you think he is going to die and there will never be another series. Then he regenerates. The new Doctor has a different personality, a different philosophy (though always on the side of good). It is never clear how long any Doctor will remain unregenerated or who will come after him. And this is a common theme in the Blue Obelisk.

Many projects have a start-up time. This is not surprising and is usually a Good Thing, as the project does not know where it is going and almost always needs a strong framework on which to build. It's not normally useful to have a random self-selected community try to continually refactor complex platforms; the oligarchic meritocracy can come later. This start-up is normally done by a single BD, or possibly a small group working together, who know what they want to do and need to create something that works before the first release. This is the long dark night of the sole developer working towards something that they believe is valuable, for which initially they will get no recognition, often no funding, and not infrequently criticism from the wider community: "this is a promising beginning but it doesn't work, it's not got enough features, there's no support". I've been there on multiple occasions. It's lonely.

However at some stage the software obtains critical mass, at least enough to attract like-minded Open Source developers. They often come from surprising sources. The s/w is not normally good enough to release to the community in general as it lacks features, testing, documentation, etc. But its developers know it is on the right path.

It’s at that stage that the Doctor Who succession may begin. Often the software has lain fallow for some years (as I think happened with Jmol). The official site provides:

Jmol was originally intended to be a fully functional replacement for XMol which was a molecular viewing program developed at the Minnesota Supercomputer Center….

XMol’s demise left a need for a similar tool. Dan Gezelter, the originator of Jmol, chose to avoid the same problems by making Jmol open source. …

Later, Bradley A. Smith took over the project and did a lot of work in streamlining the project as well as the software. ..

In the end of 2002, Egon Willighagen became the new project leader and a start was made with integrating Jmol with The Chemical Development Kit, …

Miguel joined the project at the end of 2002, with the explicit goal of helping build Jmol into a viable replacement for the Chime plugin (www.mdlchime.com)

Shortly after 10.2 was released, Bob Hanson started leading the work on Jmol source code

This is an excellent exemplar: a piece of code written for a specific purpose (XMol was fairly basic and ran on X Windows), which then lay fallow before it passed through 5 Doctors, and there is no indication that Bob is not with us for a long series! But at each stage the project had enough visibility and enough e-charisma to attract high-quality developers who could take over when required and add their own personality.

Note that these Doctors have the same force within the community as the Doctor has on TV, and as a BDFL has with their developers. Each Doctor finds their own way of regulating and encouraging development. Miguel used to make very clear lists of questions which were to be answered on a clear timescale. Bob has a similar but different way of gathering and prioritising.

[The following is not historically researched and I welcome contributions to set the record straight.]

Babel:

Babel (Pat Walters) => OELIB (Matt Stahl + ?) (language fork Java) => JOELib (Joerg Wegner)

OELIB (language fork C++)=> OpenBabel (Geoff Hutchison)

CDK:

CDK (Christoph Steinbeck, Egon Willighagen, Dan Gezelter) => Egon

OSCAR:

PeterMR => Joe Townsend + Chris Waudby => Peter Corbett

OPSIN

PeterMR => Joe Townsend => Peter Corbett => Daniel Lowe

But however we got there and wherever we are going we have a single Doctor for each of these.

[There are variants. Currently JUMBO has a BDFL (PeterMR), mainly because it has been continually refactored and it is asking a lot for anyone else to be involved in this maelstrom. With the success of Chem4Word (and dotNUMBO) that may change. ]

What happens if a Doctor dies and does not regenerate? Well, just like the TV series, we know it will work out. The software above is sufficiently widely used that we are sure that someone would step in. Miguel came from nowhere (Gallifrey?); he wasn't a chemist but a computer scientist, and he filled the role perfectly and handed over to Bob in an exemplary manner. Doctor-based Blue Obelisk works.

I’d be interested to know of other OS projects where this model has been generally successful.

Posted in Uncategorized | 23 Comments

Open Source increases the quality of science

There's a vibrant discussion of my assertion that Chemical Open Source will win; see http://www.abhishek-tiwari.com/2009/06/chemical-open-source-for-free-how-far.html

Abhishek Tiwari argues:

Academia is already enjoying everything for free, for example most of chemoinformatics toolkits (JChem, OEChem), workflow solutions (PipeLine Pilot) and many others commercial software are freely available for academic uses. However at same time finding a commercial customer for open source software service in chemistry or biology is an arduous task. Pharmaceutical companies are maintaining huge BIO-IT departments, and some of them have created a back door to exploit the cheap and free software through their academic collaborations. Peter argue that anything offered to academia should be free and industry should be charged for that, does this include the academic labs with industry connections? I don’t understand why academia or anyone should be offered such a liberty especially when they are working for their industrial collaborations. Further, in my opinion Open Source and Free should not be considered as low or zero initial cost. Unlike academia hosted community project that involves cheap labor of PhD and Postdoctoral students, open source transformation for commercial vendors is never easy and they need to find a competitive way to survive. In either case it is totally justified if software producer happily charge you for their time and resources.

and there have been many comments from Open Source and Blue Obelisk supporters such as Egon Willighagen, Deepak Singh, and Rajarshi Guha. I'll add mine here:

If the only argument were about how to support a business creating standard software for chemistry, Abhishek would have a reasonable case. For example, a tool to manage the departmental payroll needs to be competent and supported, and it's perfectly reasonable that someone should be paid for it. (Interestingly, even such tools are increasingly being built on Open Source, but I'm not using that argument.) No, the reasons are specific to science:

  • closed source can produce Bad Science. I remember, as an undergraduate student running crystallographic calculations, finding a bug in the program. It generated the wrong numbers. If I had used the results I would have created bad science. So I had to read the source code (effectively machine code) and explain the bug to the author. This was then corrected and other users could not make the same mistake. In contrast, if you cannot read the source there can be no guarantee that the science is not corrupted. It's worse when you are threatened with legal action for reporting bugs.

  • closed source stifles innovation. Science builds on other science. When A reports a finding in the literature, B can build on it without permission. However if A creates a computer program and closes it, B cannot integrate it into their software. Suppose A calculates a molecular property and B wishes to develop machine-learning software to understand its significance: B is forbidden to do so. So B either has to wait till A creates a machine-learning algorithm B', or duplicate the property calculation as A'. This leads to a plethora of clones. Every software house creates equivalent components of unknown quality: unknown because they are closed, and because there is no way for any independent assessment of their quality. Result: an anticommons where no one develops anything new.

  • Inflated claims are made about quality. The primary motive of companies is to make money. There is nothing wrong with that, but when your products look the same as your competitors' there is a natural tendency to inflate yours over theirs. "A's fingerprints are better than B's" is a meaningless scientific statement as it is untestable. Assessments are made by conversations in bars, not by proper metrics.

  • Funded science is stifled. If I want to test a new hypothesis in chemoinformatics (if the subject still has standing as a science) I have to use commercial tools. I cannot do repeatable science (for the reasons above) and I cannot build on them. So the only possibility is to write your own. That is very difficult. You don't get publications for duplicating existing software, so you have to do it on a shoestring and because you believe in the cause. You have to find bits of funding here, volunteers there. The saving grace is that the Open Source codes collaborate rather than compete. For example I wrote my own molecular viewer because there wasn't a good one in Java. I spent a lot of time with a volunteer who used Java3D (groan). Then I decided that rather than write my own I would use Jmol: I would give up the glory of writing a viewer for the chance to develop Chemical Markup Language. If I had been in a company I would have put more coders on the viewer to try to beat the competition. But, in all of this, you don't get funding.

But now we are beating the closed source components. As that happens we can return to doing science properly. We can develop the next generation of real chemoinformatics. I want to build intelligent chemistry in silico. The Chemist's Amanuensis, as we called Sciborg. I am nearly ready to start.

Because I know that I can leave much of the work to collaborators, and play my part in bringing the scientific method back to chemoinformatics.

Posted in Uncategorized | 2 Comments

Cambridge/OMII Workshop on OSCAR and OPSIN

We are delighted to be working with OMII-UK on the refactoring of OSCAR and OPSIN, our tools for chemical entity extraction and chemical name2structure conversion. There is now a lot of interest in and uptake of these tools, and we are running a 1-day workshop. Here are the details; anyone is welcome to come, first-come first-served.

As you may know we are working with OMII-UK under the JISC ENGAGE program to refactor, repackage and enhance the OpenSource OSCAR and OPSIN software. OSCAR is now ca 7 years old and is widely used in chemistry and bioscience for the identification of chemical entities in text. Informal studies have shown it probably has the highest precision and recall of any commonly used tool.  OPSIN’s name2Structure has been informally tested against corpora of names, and again it is not far behind the leading commercial tool and has a smaller error rate.
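For readers unfamiliar with the terms: precision is the fraction of extracted entities that are correct, and recall is the fraction of true entities that were found. A minimal sketch of how such an evaluation works in principle (the function and the toy entity lists are illustrative, not the actual OSCAR evaluation code):

```python
# Hypothetical illustration: precision and recall of an entity extractor,
# comparing the entities it found against a gold-standard annotation.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)  # entities found AND correct
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: these entity mentions are invented, not from a real corpus.
found = ["benzene", "toluene", "pyridine"]
truth = ["benzene", "pyridine", "ethanol", "acetone"]
p, r = precision_recall(found, truth)  # p = 2/3, r = 0.5
```

A tool with high precision rarely reports spurious entities; one with high recall rarely misses real ones. Informal comparisons like those mentioned above trade these two off.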

We get feedback from many groups and have started to get offers of help in kind to enhance these tools. We are therefore inviting a number of collaborators and early adopters to a relatively informal workshop to explore what you would like to see done to OSCAR/OPSIN.

This is provisionally scheduled for 9th July 2009 and will be held in the Unilever Centre, Cambridge. There is no charge for attendance at the workshop and lunch will be provided. The programme will include presentations of the experience of 2-3 early adopters; a wish-list
session; a tour through the architecture; and a roadmap of the projected enhancements through the ENGAGE project.  We would welcome input from you into the future of OSCAR/OPSIN, and how the evolution of these applications would best meet your future needs.

Anyone interested in OSCAR/OPSIN, both non- and for-profit, is welcome to attend and should contact s.brewer@ecs.soton.ac.uk at OMII-UK.

For more information about OMII-UK, please see http://www.omii.ac.uk/, and for ENGAGE, please go to http://www.engage.ac.uk/.  You can find out more information about OSCAR from http://oscar3-chem.sourceforge.net/, and http://wwmm.ch.cam.ac.uk/.

Posted in Uncategorized | 3 Comments

Universities as publishers?

UCL has become the first large UK HE institution to mandate Open Access. Here’s part of Nature’s account (reproduced with thanks but without explicit permission under fair use):

University College London (UCL) has become the latest institution to adopt an open-access publishing policy, adding to a rapid increase in such mandates over the past year.

Open-access analysts say the move foreshadows a series of announcements from many other UK universities that have been considering similar policies.

Under UCL’s system, all research published by university staff will be placed online in an institutional free-to-access repository but only when publishers’ copyright rules allow.

UCL, one of the UK's leading research-intensive universities, announced on 3 June that it had established a publications board to implement the policy. The policy will take full effect with the beginning of the 2009-2010 academic year, says Paul Ayris, director of UCL library services.

An important comment is added later:

UCL's move is unlikely to improve public access to scientific research papers, as national bodies that support research already demand that. Thirty-six of them, including the US National Institutes of Health, all seven UK research councils, and the European Research Council, require work they have financed to be made publicly available (usually through deposition in open-access repositories such as PubMed Central, six months after publication).

But Alma Swan, a consultant for Key Perspectives, which analyses scholarly communications, says the recent flurry of institutional activity has come because university officials are realizing the importance of increasing their institution’s visibility on the internet, and of creating a complete record that can be analysed and compared against other institutions’ outputs or easily entered in national funding competitions. The UK and Australia, which both allocate funding depending on the quality of published research, lead the world in open-access repository policies, Swan notes.

[and, in passing, my congratulations to Paul and Alma for their dedicated efforts in this arena.]

The key point is that Openness on the web is a critical factor in institutions advertising their output. Conventionally a scientist's impact is measured by summing, over all their papers, the impact factor of the journal in which each paper appeared. The impact factor depends on how often the papers in the journal as a whole were cited. This is clearly an extraordinarily blunt metric: some papers are never cited and some are cited 10,000 times (normally in a ritual of copy and paste). The average figure for high-impact journals is probably around 5 (i.e. 5 citations per paper). That figure only appears some years after the journal starts.
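The bluntness of this conventional metric is easy to see in a toy calculation (all names and numbers below are invented for illustration): every paper inherits the journal's average, regardless of whether it was ever cited itself.

```python
# Hypothetical sketch of the conventional impact metric: sum, over a
# scientist's papers, of the impact factor of the journal each appeared in.
papers = [
    {"title": "Paper A", "journal_impact_factor": 5.0, "actual_citations": 0},
    {"title": "Paper B", "journal_impact_factor": 5.0, "actual_citations": 10000},
    {"title": "Paper C", "journal_impact_factor": 2.0, "actual_citations": 3},
]

conventional_score = sum(p["journal_impact_factor"] for p in papers)
true_citations = sum(p["actual_citations"] for p in papers)

# Papers A and B contribute equally (5.0 each) to the conventional score,
# although one was never cited and the other was cited 10,000 times.
```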

In practice many scientists now discover their reading through Google and click on titles that look interesting. That's a download, and most journals have "highly accessed" papers. If you can trust the publisher to report accurate figures (the key word is trust) then it's a more accurate metric. Still not very good, as many readers will flick between papers; for example I clicked through 3 papers in 10 minutes this morning just looking for a particular bit of data. The ones that didn't have what I wanted will count as accesses; motivation cannot be recorded easily by weblogs. (I am sure someone will tell me that yes, the clicks are all monitored and they know what's going on in my brain.)

So discoverability through Google is the primary way that scientists access information and find research papers. Let's think what those scientists actually want. Sometimes they know exactly what they need: the melting point of a particular compound, or how to express a protein. But often they want to read a collection of papers relevant to a particular topic.

And they will find that half the papers they want to read cannot be read because their University does not buy that journal (correction: does not pay an annual rental for the journal). And, if they are in the charitable sector (patient disease groups, for example), they may not be able to read a significant amount of material at all.

Now if a particular University was the world leader in that area it would make sense to collect all their public work in one place where it was publicly visible. It would be the natural place for people to look first. If another University felt this was unfair they would have to try harder to advertise their work. And so on: every step of competition would increase the visibility of the science. To anyone in the world who was interested. Anyone who had access to the Internet, agreed, but without any toll-barriers.

In essence the Universities would become Open Access Disseminators. That's where current Closed-access models fail: they restrict dissemination. There is still an absolute need for peer review, and it's unlikely a University can easily peer-review its own work. (It would be like MPs punishing themselves.) So the current scholarly publishers could and should concentrate on peer review.

But doesn't this sound like an opportunity for a renaissance of University Presses? The costs of publishing have dropped; there are many small journals which run at near zero cost, although larger journals with high rejection rates require paid staff. If we were starting today, in an e-only world, would Universities re-invent for-profit publishing?

Possibly. But it would look very different from what we currently have. And UCL, with many others can show the way.

Posted in Uncategorized | 1 Comment

WebDemocracy – WriteToThem works, but does my MP?

I was sitting here (rather hot) in my voting suit when I got a very timely mail from WriteToThem, the mySociety group which has made major contributions to ensuring that MPs correspond responsibly with their constituents. Remember that MPs get money to pay staff to help with correspondence; they don't have to manage all of it themselves. So...

I wrote to my MP (David Howarth) two weeks ago at the suggestion of my MEP and used WriteToThem to do so. I got an acknowledgement from his office:

Dear Mr Murray-Rust,

Thank you for your email about internet neutrality and for including your letter from Andrew Duff. I will ensure that these are brought to Mr Howarth’s attention as soon as possible.

Kind regards,
Susannah Kerr
Casework Assistant to David Howarth MP

Peter Murray-Rust wrote:

Wednesday 20 May 2009

Dear David Howarth,

The Importance of Internet Neutrality

I have been concerned about the absolute need to keep the Internet
neutral and have argued that this is especially important for Cambridge
with its great emphasis on the new knowledge economies.

I wrote to my MEP, Andrew Duff who reassured me about the primary vote,
but also urged me to pressure the UK government (see para 5 in letter
below (“France has been…).
I look forward to a reply.

Peter Murray-Rust
Reader in Molecular Informatics
University of Cambridge

I haven’t heard anything more. Today I got a mail from WTT asking for my feedback on Mr Howarth’s replies:

Two weeks ago we sent your letter to David Howarth MP, your MP. (For
reference, there’s a copy of your letter at the bottom of this email)

– If you HAVE had a reply (not just an acknowledgement), please click
on the link below:

   http://www.writetothem.com/Y/n4ihhklnwr/y7knmfa235iyq7xvisc

– If you HAVE NOT had any reply at all, OR you have only had an
acknowledgement, please click on the link below:

   http://www.writetothem.com/N/n4ihhklnwr/y7knmfa235iyq7xvisc

If you feel that neither link is suitable, then please do not answer
the questionnaire.

Your feedback will allow us to publish performance tables of the
responsiveness of all the politicians in the UK. The majority of MPs
respond promptly and diligently to the needs and views of their
constituents. They deserve credit and respect for their
conscientiousness.

Likewise, we’re keen to expose the minority of MPs who don’t seem to
give a damn.

The letter you sent to your MP will be deleted from our database within
the next two weeks. Our full privacy policy can be read here:
http://www.writetothem.com/about-qa#personal

Please do not reply to this mail. If you have a question or comment
about the site, please send an email to team@writetothem.com

— the WriteToThem.com team

I clicked on the second link. An acknowledgement is certainly better than nothing at all, but in the Internet age a reply should take less than 1,000,000 seconds (i.e. a response rate above 1 microhertz). I shall probably write to him and say that although we were not voting for UK MPs today I shall take his non-reply into account when voting, and that there are a significant number of Cambridge readers of this blog who are intelligent and can draw their own conclusions.
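For the curious, the arithmetic behind the quip: a period of 1,000,000 seconds is about 11.6 days, which corresponds to a response frequency of 1 microhertz. A throwaway check:

```python
# Convert the 1,000,000-second deadline into days and into a frequency.
period_seconds = 1_000_000
period_days = period_seconds / 86_400        # 86,400 seconds per day
frequency_hz = 1 / period_seconds            # one reply per period
frequency_microhertz = frequency_hz * 1e6    # ~1.0 microhertz

# Replying faster than 1,000,000 s thus means a rate above 1 microhertz.
```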

I think WTT is a wonderful system: the correct mixture of constructive approach and potential criticism of unacceptable failure to reply.

Posted in Uncategorized | 3 Comments

Support the Open Knowledge Foundation

The OKF Turns 5 – And We Need Your Support

June 2nd, 2009

This month the Open Knowledge Foundation is five years old.

Over those last five years we've done much to promote open access to information, from sonnets to stats, genes to geodata, not only in the form of specific projects like Open Shakespeare and Public Domain Works but also in the creation of tools such as KnowledgeForge and the Comprehensive Knowledge Archive Network, standards such as the Open Knowledge Definition, and events such as OKCon, designed to benefit the wider open knowledge community. (More about what we've been up to over just the last year can be found in our latest annual report.)

While we have achieved a lot, we believe we can do much, much more. We are therefore reaching out to our community and asking you to help us take our vision further.

Our aim: at least 100 supporters committed to making regular, ongoing donations of £5 (EUR 6, $7.50) or more a month.

These funds will be essential in expanding and sustaining our work by allowing us to invest in infrastructure and employ modest central support. To pledge yourself as one of those supporters all you need to do is take 30 seconds to sign up to our 100 supporters pledge at:

http://www.pledgebank.org/support-okfn/

And if you want to act on the pledge right now (or make any other kind of donation), please visit: http://www.okfn.org/support/

We are and will remain a not-for-profit organization, built on the work of passionate volunteers, but these additional funds are essential in maintaining and extending our effort. Become a supporter and help us take our work forward!

Posted in Uncategorized | Leave a comment

Democracy is alive – it's politics which is sick

I'm still working out who or what or how to vote for tomorrow. I've had three replies so far; please add yours. Two were from people completely unknown to me; I'd be interested to know how they found it.

Terence Eden wrote:

I think you've stumbled on the conclusion – democracy doesn't need saving; politics does.

People are more engaged than ever in democracy (direct, digital or otherwise) but the political parties have consistently ignored us.

This expresses it well, and it is what I have been trying to describe as webDemocracy. Citizens alert themselves; they do not expect politicians to feed them issues. They are capable of mobilising very rapidly and marshalling arguments and resources.

It has, of course, just as many dangers as the current system, if not more. Single-issue parties are easy to create and in Europe can have a voice. Pro- and anti-life. Capital punishment. Repatriation. Etc. We have to believe in the increasing understanding of an increasingly connected population.

If, for example, you are interested (as I hope readers of this blog are) in changing copyright and patents, then there is the Pirate Party (thanks to Open Access News). Many countries have branches of this. See also Pirate Bay, which is a BitTorrent site supporting mass illegal downloading. These represent the political and the direct-action (civil disobedience) approaches, but both are based on webDemocracy rather than conventional politics.

[I was gently admonished this morning by an acquaintance for using this blog for politics. I do not do this as a matter of course. But where the issues directly affect academia (e.g. copyright, patents and other Opens) then I feel that it is my duty to present the issues.]

Posted in Uncategorized | Leave a comment

Chemical Open Source will win

Chemical software will be Open Source

This statement expresses both a simple truth (Simple Future, see WP) and an aspiration (Coloured Future: "Software shall be free"). The latter is what I have been advocating on this blog: the moral, pragmatic, utilitarian value of Open Source. The former simply states that it will happen. IOW a betting person could lay a wager.

This post is simply to convince you that the simple future is inevitable. I've made this claim before and been taken to task by the Closed Source chemical software manufacturers: of course Open Source can't be as good as us, of course volunteers can't coordinate, of course you don't have the developers. So why am I so confident?

There is a great deal of chemical software: it ranges from quantum mechanics, through molecular mechanics, to docking, property calculations, QSAR, analytical support (instruments, data), and chemical informatics. I'm addressing just the last in detail, and I agree that CompChem may be the slowest to change. But it will.

What are the forces?

  • The expectation in the community that software will be free (gratis, as in beer), and diminished budgets

  • The requirement of science that methodology should be Open and repeatable; the necessity of justifying one's computed conclusions. "Trust us, you have paid us a lot of money" no longer works

  • The Open movement in general

  • The growing realisation (though not the reality yet) that software development should be an activity worthy of publication metrics

  • The increasing complexity of deployed systems, meaning that SME manufacturers simply cannot maintain such a diversity of unique products.

  • And the evangelism of the major information manufacturers (IBM, Google and now Microsoft) that their products will increasingly be Open Source and that they will benefit from this decision.

And there is a particular aspect to Chemoinformatics – the software that supports the management of chemical compounds, reactions and their measured and computed properties:

There have been no new developments in the last decade

What I mean by this is that no new algorithms or information management strategies have come out of commercial chemoinformatics manufacturers. Chemical search, heuristic properties and fingerprints, and molecule docking are solved problems; advance comes from packaging, integration and parameter-tweaking/machine-learning. Only the last adds to science, and since the commercial manufacturers are secretive we can't measure this (and I believe this to be mainly pseudoscience in its practice: you can make extravagant claims without independent assessment). So the advances from the manufacturers have been in engineering (ease of use, deployability, interoperation with third-party software), not in functionality.

So the Open Source community, the Blue Obelisk, is catching up. I believe that OSCAR is already the best chemical language processing tool, that OPSIN will soon be as good as any commercial name2structure parser, and that OSRA will do the same for chemical images.

KNIME and Taverna are becoming de facto workflow standards and will continue to develop. And there are many other OS tools, such as R and Weka, that are being integrated.

And when the Open Source components catch up with their commercial rivals, the community will switch. Not just academia, but pharma and the chemical industry.

Because the growing community around each tool will mean that the tools are better, and the science is better. No commercial company can accurately claim metrics for its software, as there is no current way of measuring this.

So what role is there for the commercial sector?

A different but enormously exciting one. Where the companies provide the integration of Open Source components. Academics are not paid to integrate; companies are. Where the open deployment of components is a service worth paying for. Where the tools start to produce better science and information that can be managed Openly, better than before. How many of us have contributed to ClosedChem property calculators? Probably only those whose system was purchased and then closed. How many of us contribute to ClosedName2Structure as opposed to OPSIN? Who would publish a bug from ClosedTheoChem, whose lawyers will send you a letter the next day (probably revoking your licence)? That’s a true case. And that’s science??

Last time I published claims of this sort I was challenged, and responded, I hope fairly. I obviously cannot review the science in closed source programs, as I have to pay for them and might be sued if I benchmarked them. So it’s up to the commercial sector to justify its existence. If they make a well-argued case I might even change my analysis.

Posted in Uncategorized | 5 Comments

Crowdsourcing will make OPSIN the best name-to-structure program

I received the following wonderful mail yesterday, to which I have replied. (I don’t reveal identities of emailers without permission, but I sometimes quote.) Daniel Lowe has seen my reply and I have included his amendments.

  • Prof Foo reminded me of your call-out for OPSIN vocabulary contributors.

    My colleague Xyzzy Bar and I have been toying with the possibility of using OPSIN and OSRA as part of our KNIME workflows.

    Not being programmers, the only way we can contribute to the development of these tools would be assisting in ‘less skilled’ efforts.

    Could you let us know what you had in mind for vocabulary contributors? If you were thinking of something along the lines of keeping a list of phrases that OPSIN misinterpreted as we came across them, then submitting that list along with corrections, we would be happy to participate.

This is fantastic. Generally the volunteers in Open Source projects hack code but we are increasingly finding systems where many other contributions are equally or more valuable. So I replied

 Excellent.
Our vision – and we were exploring this today – is that Open Source is catching up with the commercial components and will soon overtake them. We see OSCAR as the leading recognizer of chemical entities at present; metrics are hard to come by, as it’s difficult to do controlled tests on closed source, but informal reports confirm this.

We also have reports of at least two software companies which are interested in using OSCAR in their workflows and integration. And if I haven’t told you already, we now have a project with OMII (ex eScience and JISC – Steve Brewer (copied)) where it is being refactored. Your interest will give considerable encouragement. We meet on Thursday.

We also see OPSIN and OSRA overtaking current commercial tools. (We have some open source code which we could contribute to open-source structure reconstruction from images, such as OSRA.) So we’ll concentrate on OPSIN for now.

Daniel Lowe (copied) has been doing great work on OPSIN, which parses IUPAC names into structures. Note that it is very difficult to get certified IUPAC names in the public domain for metrics, so we rely on IUPAC names generated by programs. In practice this is probably OK, as an increasing number of IUPAC names are probably **generated** by programs. We know of 4 programs doing this: FooChem, BarChem, Y2Chem and ZorkChem. Daniel has names generated by all 4. To get a good sample he takes 10000 molecules from Pubchem and their generated names and then analyses them for recall (how many structures are interpreted) and precision (how many of those are correct). I think the following is a reasonable summary (note that the commercial systems have both name generators (e.g. producing FooNames) and interpreters (producing FooStructures)):
* names from the 4 programs are significantly different.
* OPSIN has been developed against FooNames.
* OPSIN has better precision than any other program
* Against FooNames OPSIN scores over 80% recall and 98+% precision. FooStructure is slightly better and BarStructure slightly worse.
* OPSIN is somewhat worse than FooStructure against other program-generated names but significantly better than BarChem
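For contributors gathering metrics, the recall/precision bookkeeping is simple once each test name has a reference structure and a parser output. Here is a minimal sketch in Python, assuming structures are compared as canonical SMILES strings; the names, structures and figures below are toy data for illustration, not from Daniel’s corpus:

```python
def score(results):
    """results: list of (reference_smiles, parsed_smiles_or_None).

    Recall    = names parsed / total names
    Precision = correct parses / names parsed
    """
    total = len(results)
    parsed = [(ref, out) for ref, out in results if out is not None]
    correct = sum(1 for ref, out in parsed if ref == out)
    recall = len(parsed) / total
    precision = correct / len(parsed) if parsed else 0.0
    return recall, precision

# Toy example: 4 names, 3 parsed, 2 of those correct.
results = [
    ("CCO", "CCO"),                   # parsed correctly
    ("c1ccccc1", "C1=CC=CC=C1"),      # parsed, but the strings differ
                                      # (a real comparison needs canonicalisation)
    ("CC(=O)O", "CC(=O)O"),           # parsed correctly
    ("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", None),  # parser gave up
]
recall, precision = score(results)
print(recall, precision)  # 0.75 and ~0.667 on this toy data
```

The string-equality check is the weak point in practice: two different SMILES can denote the same molecule, so real benchmarking would canonicalise both sides with the same toolkit before comparing.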

…OPSIN can be enhanced by:
* adding new syntaxes – e.g. dispiro compounds (needs a programmer)
* adding new rules to existing syntaxes (e.g. different functional classes). This currently requires programming
* adding new vocabulary (e.g. semi-trivial names with numbering). This can be done completely outside the codebase. Thus “7-methyl-guanosine” requires the entry of guanosine and its numbering scheme into a vocabulary file – any dedicated chemist can do this with minimal training.
* testing. This is very valuable: analysing why OPSIN fails on recall shows where it is best to add more effort
* generation of corpora. This is also extremely valuable but quite tricky as we have to avoid copyright.

* documentation and tutorials. Always golden
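To make the vocabulary route concrete: I am not describing OPSIN’s actual resource-file format here, only the kind of information a contributor would have to supply for a trivial name. A hypothetical sketch (the `TrivialName` record, the SMILES string and the locant mapping are all illustrative, not OPSIN’s real data):

```python
from dataclasses import dataclass


@dataclass
class TrivialName:
    """Hypothetical record: what a vocabulary contributor supplies."""
    name: str
    smiles: str       # structure of the parent compound
    numbering: dict   # locant -> atom index within the structure

# A contributor entering "guanosine" supplies the structure once, plus
# its numbering, so that a name like "7-methylguanosine" can attach a
# methyl group at the atom the locant "7" refers to.
guanosine = TrivialName(
    name="guanosine",
    smiles="OCC1OC(n2cnc3c2NC(=N)NC3=O)C(O)C1O",  # illustrative, unchecked
    numbering={"7": 5, "9": 2},                    # illustrative mapping
)


def resolve_locant(entry: TrivialName, locant: str) -> int:
    """Return the atom index a locant points at for this parent."""
    return entry.numbering[locant]


print(resolve_locant(guanosine, "7"))  # 5 on this illustrative entry
```

The point is that none of this needs programming skill: the structure and its numbering scheme are chemistry, and any dedicated chemist can supply them with minimal training.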

We have a student joining us for 3 weeks in the summer and have to have OPSIN in a state where she can make contributions. So we are very highly motivated to get OPSIN into shape where volunteers can crowdsource it.

We don’t need you to be able to hack code, but we do need you to be able to install and run OPSIN and to gather metrics. I’ll talk to Daniel tomorrow and we’ll discuss how we can put together suggestions. BTW we are keen on KNIME and we have close links to that group.

So some questions (in some order of preference, if possible). I am assuming that you have experience in chemistry (I hope that’s not too rude).
* could you install and run OPSIN (a Java program) – you can ask colleagues for help.
* do you have a corpus of names that you are currently interested in? If so, please describe its provenance. To be useful it will need corresponding structures.
* Would you be prepared to analyse OPSIN failures and systematize the reasons why they failed?
* could you enter new rules – these are archetypal structures as SMILES strings and name features.
* could you enter new structures (e.g. guanosine) and their numbering?
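The failure-analysis question above is mostly bookkeeping rather than programming. A sketch of the kind of tally I have in mind, assuming each failed name has been hand-labelled with a reason; all names and categories here are invented for illustration:

```python
from collections import Counter

# Each entry is (failed name, hand-assigned reason) - all invented examples.
failures = [
    ("some dispiro name", "dispiro nomenclature unsupported"),
    ("7-methylguanosine", "trivial parent not in vocabulary"),
    ("some nucleoside name", "trivial parent not in vocabulary"),
    ("some anhydride name", "functional-class rule missing"),
]

# Tally the reasons so effort goes to the biggest buckets first.
tally = Counter(reason for _, reason in failures)
for reason, count in tally.most_common():
    print(f"{count:3d}  {reason}")
```

The output of a tally like this is exactly the "best places to add more effort" analysis: the largest bucket tells us whether the next gain comes from vocabulary, rules, or new syntax support.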

I warn you that this is boring work and the payback is in many small increments. I tend to do this sort of thing while watching the cricket – enter a structure while the bowler walks back – watch the ball – enter several structures while drinks are taken – watch the next ball – enter tens of structures when bad light starts…

And we’d be very interested in what you are doing anyway and where this fits in.


P.

Posted in Uncategorized | Leave a comment