petermr's blog

A Scientist and the Web

 


More on Chem4Word and OpenOffice

Thursday, April 29th, 2010

I have left my microphone behind, so this post is being typed.

I had expected – and am glad – that there would be debate on the release of Chem4Word under an Open Source licence. The latest contribution is from Dr. Roy Schestowitz (http://techrights.org/2010/04/28/really-qualifying-as-foss/), which I quote in full (up to the horizontal rule):

Who can port Chem4Word to OpenOffice.org?

Summary: Chem4Word is an example of Free software which is trapped deep inside Microsoft’s proprietary cage and needs rescuing

From an academic and scientific point of view, Chem4Word’s developer does the right thing by becoming a Free software proponent and choosing the Apache licence for the project (not GPL, which would have been better). The only problem is that Chem4Word helps sell Microsoft Office, which means that any user of Chem4Word (even as Free software) will be pressured to buy a standards-hostile and closed-source office suite. Those who are close to this project are aware of the issue.

This is yet another example where Microsoft is using (as in exploiting) Free software to sell its proprietary software.

Supporting Microsoft software is bad for a variety of reasons, not just because it’s proprietary and standards-hostile. Here for example is a new explanation from Omar, who exemplifies what Microsoft is doing to developing countries where cost matters a lot.

But then the grief doesn’t end here, because the problem will seem even worse if you ponder the fact that most people, around the world, who use computers can barely afford to pay their monthly bills, and that all these people are using pirated software because:

* A) That’s the only software they’ve ever known.

And:

* B) They cannot afford to pay for the annual licensing fee of a genuine copy.

These people have been mass-hypnotized, they’ve been indoctrinated into believing that whatever MS gives them is right, and that MS software is the only software on Earth that actually works. Now, take under consideration that MS is a for-profit organization after all (Actually, MS is a for-nothing-but-profit organization, but ya know), and that sooner or later, MS will start collecting money in all ways possible.

Let us hope that Chem4Word gets extended (or forked) to support Free software further down the stack. It can support all major platforms if it gets ported to office suites such as OpenOffice.org.

“I would love to see all open source innovation happen on top of Windows.”

–Steve Ballmer, Microsoft CEO

I am not out of sympathy with much of this. I have made some of my position clear (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2233) and now add some more thoughts. For those who don’t know me and my group, here is some background.

· I am a passionate and public supporter of Openness – I am on the advisory board of the Open Knowledge Foundation (http://www.okfn.org ), a prime mover in the Panton Principles for Open Data (http://www.pantonprinciples.org) and a founder of the Blue Obelisk Open software/data/standards (http://www.blueobelisk.org ) group in chemistry. I have been outspoken in this area on many occasions and have criticised certain non-Open Access publishers and opponents or obstructers of the free redistribution of scholarly data.

· My group and employer receive support from Microsoft for Chem4Word (I personally do not). I have made it clear to Microsoft that I shall speak my mind during the project and do not feel shackled. I am doing so now.

· I have been critical of Microsoft in the past (e.g. at the time of the Halloween document). I have entered this sponsorship with my eyes open.

· We spent a great deal of time and care drawing up the contract with Microsoft and this is reflected in the Open Source offering that we have now – jointly – delivered.

I am not against commercial companies. I used to work for Glaxo (now GSK) and our institute is sponsored by Unilever. I have lived through the era where IBM dominated the software/hardware market, only to be replaced by Microsoft. I have seen many empires rise and fall and I am optimistic that monopolies in this area carry the seeds of their own decline. Monopolies are generally bad and I worry about Google as much as Microsoft. I believe that the rise of competition, the checks on Microsoft’s actions and their public exposure mean that there is less (apparent) monopoly. If Microsoft really were a monopoly I would probably be more concerned – it may still largely have the desktop but it doesn’t have a monopoly on the browser or on Net content.

Most software is closed – ICT is an exception – a shining and great exception, but unusual. Open software requires some form of incentive – a mixture of time and money in the first instance and largely money for sustainability. I wish it were otherwise, and if this project can generate models in chemistry for sustainable F/OSS software I will be delighted. Bioinformatics is an exception (I may write elsewhere on this), but there is a great deal of public Open funding there. In chemistry the norm is that software is closed, usually sub-standard (with regard to modern software engineering techniques), and diminished by needless competitive duplication. There has been virtually no innovation over the last decade (integration and widget frosting, but no new science). We wish to change this – to create an infrastructure where the community can actually do new things rather than waiting for last-century companies to make minor modifications. We are getting there.

An important part of Chem4Word was to design a new approach to chemical information – one appropriate to this century, using open standards (XML, RDF, REST, etc.). That’s happened and it’s all in the Open – code, data, specifications, etc. That’s available to the community whether or not people use Chem4Word within a Word environment. And to give Microsoft at least some credit, they were early adopters and promoters of XML, and Word uses XML rather than a proprietary format.

Porting to OpenOffice. I would be happy for this to go ahead. It should be regarded as an extension or port rather than a fork, as forks are a last resort – and that is not relevant here, where the authors are supportive. It would help to reinforce the (Open) Chemical Markup Language (CML, in XML) we have developed and to develop the ideas of quality and conformance that are so badly lacking in commercial chemical software. But it needs support and it needs chemists. Open Source chemists are very rare – we struggle to overcome ideas such as “if it’s free it’s inferior”. Whereas in ICT lots of people are supported by their companies (implicitly or explicitly) to contribute to F/OSS, in chemistry no-one is. F/OSS is largely ignored – though there are signs of this changing. The pharma companies are particularly culpable – we know of several who use F/OSS but give no acknowledgement or encouragement.

If there is to be a port to OO it has to be done by chemists, and thus it will effectively be within the Blue Obelisk community as we know of relatively few other F/OSS chemists. As I’ve said, if someone can make this happen we’d be delighted to help. But the barriers are relatively high – it carries no research reward (most F/OSS chemists are in academia or public research) and so is done in marginal time, potentially to the detriment of a career. And I cannot imagine it’s technically straightforward. Much of the Word work has been very intricate and could not have happened without expert knowledge. So my main concerns are that it requires formal support and some very unusual individuals.

OPSIN: why it can become the de facto name2structure

Saturday, May 16th, 2009

In a previous post I reviewed our chemical language processing tools – OSCAR and OPSIN. This post updates progress on OPSIN, the IUPACName2Structure converter.

Why do we need a name2structure converter? It’s because chemists use language to communicate the identities of objects. It’s possible to talk simple chemistry over the phone, whereas it wouldn’t be easy to describe star maps, isotherms, engineering drawings, etc. And, because of this, chemists often abbreviate names – it’s easier to say “mesitylene” than “1,3,5-trimethylbenzene”, or “DDT” instead of “paradichlorodiphenyltrichloromethane” (experts will cringe at the horror of this name, which is seriously non-systematic and could not be worked out by man or machine. There is, however, a lovely limerick based on it).
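
To make “machine interpretation of a name” concrete, here is a minimal sketch in Java of calling a name2structure converter; the class and method names are those exposed by later public OPSIN releases (package uk.ac.cam.ch.wwmm.opsin) and may not match the codebase exactly as it stood at the time of writing:

import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
import uk.ac.cam.ch.wwmm.opsin.OpsinResult;

public class NameToStructureSketch {
    public static void main(String[] args) throws Exception {
        // OPSIN is loaded once; getInstance() reads its grammar and vocabulary files.
        NameToStructure nameToStructure = NameToStructure.getInstance();

        // A systematic name is interpreted into a structure; SMILES is one possible output form.
        OpsinResult result = nameToStructure.parseChemicalName("1,3,5-trimethylbenzene");

        // getSmiles() returns null if the name could not be interpreted.
        System.out.println(result.getSmiles());
    }
}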

The rules for naming compounds are set out by the International Union of Pure and Applied Chemistry (IUPAC). Even if you are not a chemist, have a look at the IUPAC Nomenclature Home Page, which represents years of devoted work by chemists, much of the organization done by Gerry Moss. There are many reasons why the field is complicated:

  • Almost all compounds can be named in many ways. Thus CH3-O-CH3 could be called methyl ether, dimethyl ether, 2-oxa-propane and so on. IUPAC has recommendations for which of these should be used but they are often ignored, and sometimes honoured in the breach. Most practising chemists, unless they routinely patent a lot of compounds, neither know these recommendations nor care.
  • Errors are common. Letters can be elided, brackets missed, and plain mistakes made. How many readers could say accurately what the structure (if any) is of capric chloride, caproic chloride, caproyl chloride, caprilyl chloride, and capriloyl chloride? Don’t be a goat, it matters:

Buy Caprylic Acid Tablets. Stay fit and healthy, naturally. HollandandBarrett.com/CaprylicAcid

AND

Capric Acid Bulk tankers, drums and other sizes call 877-KIC-Bulk for pricing

So nomenclature is a black art. It’s semi-finite in that there are currently a finite number of compounds known (some tens of millions) and a finite set of rules that can be used to generate an infinite set of names. In a similar way there is a finite set of English words that can be used to generate an infinite set of articles. So, in principle, we could encode a finite set of rules – updated every year as IUPAC generates more rules – that would completely interpret chemical name space.

In practice, however, the labour of doing this has been too great for anyone. Even the market leaders in name2structure would not correctly interpret all the examples in the IUPAC rulebook. There’s a very long tail – many rules which apply to only a few compounds – or none – in the 30 million. Not cost-effective at this stage. [There would be a cost-effective way if IUPAC rules were semantically encoded, but that's many years away, if it happens at all.]

Ideally there should be one name2structure converter, sanctioned by IUPAC – just as there is one InChI, sanctioned by IUPAC. In bioscience this would have happened. But in chemistry we have a mess of competing products of very variable quality. They cost money (some are free to academics), have many errors, have no agreed standard of quality, have no believable metrics, and have no way of taking input from the community.

A classic picture of anticommons.

So why are we developing OPSIN? In research terms it’s a “solved problem”. We are frequently told academia shouldn’t do things that the commercial sector does better.

In fact we are doing things better and we are doing language research. The motivations are:

  • generic use of language. Chemistry often uses phrases like “substituted pyridines”. There is no formal way of representing this concept and we are developing languages that provide a grammar. This is hard, it’s research, and it’s valuable for the community, for example in interpreting patents.
  • disambiguation. This is a key problem in NLP and certainly worthy of research. What does “chloroethylbenzene” mean? It’s ambiguous and could be any of 5 structures (ClCCc1ccccc1, CC(Cl)c1ccccc1, Clc1ccccc1CC, Clc1cc(CC)ccc1, Clc1ccc(CC)cc1), one of which has further stereoisomers. Which did the author mean? Can this be deduced from context? OPSIN will indicate whether a structure is ambiguous and in time may even attempt to reason about what was meant.

These are the research reasons. We’ve now been joined by Daniel Lowe, a first-year PhD student supported by Boehringer Ingelheim to do research into machine interpretation of patents containing chemistry. Daniel’s made an excellent start, primarily by extending OPSIN. When he took this over from PeterC it was not a competitive tool.

Now it is.

How do we measure its success? There are no agreed corpora or metrics for chemistry NLP so we have to be careful. The essentials are to be Open and systematic and to invite community buy-in.

In essence Daniel has taken a representative set of 10,000 “formally correct” IUPAC names and analysed them with OPSIN and two commercial programs. (You will appreciate that it is not easy to get funding to buy programs simply to test them, so there are others we cannot use.) At present we find for one corpus progA ~ OPSIN ~ progB and in two others progA > OPSIN > progB (yes, you will be kept guessing).

Treat all metrics with great suspicion, but OPSIN’s recall (i.e. names it translates correctly) is around 80% and it has the lowest error rate (incorrectly translated names) of all the programs (ca. 1%). [You should ask "on what corpus?" – and shortly we'll tell you and Open it.]
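
For a feel of how such numbers are produced, here is a deliberately simple sketch of the bookkeeping: recall corresponds to correct/total and the error rate to wrong/total. The two-name “corpus”, the toy converter and the assumption that both sides use the same canonical structure representation are all invented for illustration – the real evaluation is more careful:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class Name2StructureEvaluation {

    /**
     * gold maps each test name to its reference structure, assumed to be in the same
     * canonical form (e.g. canonical SMILES or InChI) as the converter's output;
     * convert returns null when a name cannot be parsed at all.
     */
    static void evaluate(Map<String, String> gold, Function<String, String> convert) {
        int correct = 0, wrong = 0, unparsed = 0;
        for (Map.Entry<String, String> entry : gold.entrySet()) {
            String predicted = convert.apply(entry.getKey());
            if (predicted == null) {
                unparsed++;                              // no structure produced
            } else if (predicted.equals(entry.getValue())) {
                correct++;                               // contributes to recall
            } else {
                wrong++;                                 // contributes to the error rate
            }
        }
        double total = gold.size();
        System.out.printf("recall %.1f%%  error rate %.1f%%  unparsed %.1f%%%n",
                100 * correct / total, 100 * wrong / total, 100 * unparsed / total);
    }

    public static void main(String[] args) {
        // Invented two-name "corpus" and a deliberately imperfect converter, for illustration only.
        Map<String, String> gold = new LinkedHashMap<>();
        gold.put("methane", "C");
        gold.put("ethanol", "CCO");
        evaluate(gold, name -> name.equals("methane") ? "C" : null);
    }
}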

We believe that the main reason why OPSIN < progA is vocabulary. Adding vocabulary is tedious as there is a very long tail. It’s good to do while watching cricket (as I am doing) but it’s still slow.

So this is the time when we can invite crowdsourcing. Until recently that wasn’t an option, but now OPSIN has a good infrastructure and it’s possible to add vocabulary without having to modify code. Much of OPSIN’s vocabulary is in external files which are fairly easy to modify and which won’t break the system.

OPSIN has, of course, always been Open Source and so – in principle – anyone could modify it. But in practice many OS projects have an incubation period where the infrastructure is being built and it’s very difficult to have an uncontrolled community process. Now we can offer a controlled community process where large numbers of people can make small but useful contributions.

There are two methods of approach, and we’ll start with the first:

  • become a developer on Sourceforge and modify the template files to add vocabulary. Some examples of vocabulary we are missing are carbohydrates, nucleic acid components and amino acids.
  • We should develop an interface that allows users of OPSIN to add vocabulary interactively. Thus if it fails to parse 1,5-dihydroxymanxane, OPSIN would tell the user it didn’t know what manxane was and ask for a structure and locants.

So if you are interested in helping with OPSIN please let us know. Half a dozen vocabulary contributors could make rapid progress.

And when this is done we’ll have a tool that interprets IUPAC names and which, as it is Open, can become a de facto standard.

funding models for software, OSCAR meets OMII

Saturday, May 16th, 2009

In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem – there isn’t a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn’t. I can identify the following other approaches (there may be combinations):

  • Institutional funding. That’s the model that ICE: The Integrated Content Environment uses. The major reason is that the University has a major need for the tool and it’s cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from institutional funding. Thus all the major repository software (DSpace, ePrints, Fedora) and content/courseware systems (Moodle, Sakai) have a large formal member base of institutions paying subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating software. An excellent example of this is George Sheldrick’s SHELX, where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions – ca. 1974 – when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados, there was a single variable A(32768) in which different locations had defined meanings only in George’s head. Add EQUIVALENCE and blank COMMON, and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route – an academic group either licenses the software to a commercial company, sets up a company themselves, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the Blue Obelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it’s now having an important impact on chemical informatics – it encourages innovation and public control of quality. Most of the components are created on marginal costs. It’s why we have taken the view that – at the start – all our software is Open. I’ll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one – a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research – gorillas can be friendly – who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We’ve been working together on Chem4Word, which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing the future of C4W with Lee Dirks, Alex Wade and Tony Hey.
  • Public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this…

So what, why, who and where is OMII?

OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK’s software is free, open source and fully supported.

OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative – ENGAGE – sponsored by JISC – whose role is to help eResearchers directly:

The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE (info@omii.ac.uk) or visit their website (www.engage.ac.uk).

ENGAGE combines the expertise of OMII-UK and the NGS – the UK’s foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. “The response from researchers has been very positive” says Chris Brown, project leader of the interview phase, “we are learning a lot about their perceptions of e-Research and the problems they have faced”. Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed.

The results of the interviews will be reviewed during ENGAGE’s second phase. This phase will identify and publicise the ‘big issues’ that are hindering e-Research adoption, and the ‘big wins’ that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK’s expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. “It’s very early in the interview phase, but we’re already learning that researchers want to be better informed of new developments and are keen for more training and support.” says Chris Brown.

ENGAGE is a JISC-funded project that will collaborate with two other JISC projects – e-IUS and e-Uptake – to further e-Research community engagement within the UK. “To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them” says Neil Chue Hong, OMII-UK’s director, “We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE.”

Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we’ve been working with OMII. We had our first f2f meeting on Thursday, where Neil and two colleagues, Steve and Steve, came up from Southampton (that’s where OMII is centered, although they have projects and colleagues elsewhere). We had a very useful session where OMII took ownership of the process of refactoring OSCAR and also of evangelising it. They’ve gone into OSCAR’s architecture in depth and commented favourably on it. They are picking PeterC’s brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:

  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)

And then there is the evangelism. Part of OMII-ENGAGE’s remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I’ll pass it on to the OMII folks.

… and now OPSIN…

OPSIN and OSCAR – Chemical language processing

Saturday, May 16th, 2009

This blog post is about new developments in our chemical language processors OSCAR and OPSIN and about how OMII (eScience) and we are taking them forward. We also have a JISC project with NaCTeM – CheTA – and I’ll write more about that later.

Many of you will know that we have been interested for several years in the Natural Language Processing (NLP) of chemistry texts. “Text-mining” – the extraction of information from texts – is now commonplace (and will remain so until we move away from PDF as the only means of communication). Our interest has been wider – with Ann Copestake and Simone Teufel in the Computer Laboratory we’ve been trying to get machines to understand the language of chemical discourse – “why was this paper written?”, “what is the author’s relation to others?”, etc.

But to do this we needed language processing tools which were chemistry-specific, and since 2002 we’ve developed the OSCAR and OPSIN tools (see http://sourceforge.net/projects/oscar3-chem). OSCAR was the first, developed initially by Joe Townsend and Chris Waudby through summer studentships from the Royal Society of Chemistry. The first version of OSCAR was developed to check the validity of data in chemical syntheses and has been mounted on the RSC’s website for 5-6 years.

I know from hearsay that this is widely used, though I don’t have any download figures. This software is variously referred to as OSCAR and internally as OSCAR-DATA or OSCAR1. It is a measure of its quality that it has been mounted for more than 5 years and has run with no reported problems and required no maintenance. I continue to emphasize the value of making undergraduates full members of the research and development process, and it is why our group continues to highlight their importance.

You will need some terms now:

  • chemical natural language processing – applying the full power of NLP to chemically oriented text. This includes approaches such as treebanking, where we try to interpret all the possible meanings of a sentence or phrase: “time flies like an arrow” (Marx) or “pretty little girls school”. There are relatively few systems which do this, at least in public.
  • chemical entity recognition. A subset of chemical NLP where the parsers identify words and phrases representing chemical concepts. To do this properly it’s necessary to recognize the precise phrase. Thus “benzene sulfonic acid” represents a single phrase and to interpret it as “benzene” and “sulfonic acid” is wrong. We also recognize phrases to do with reactions, enzymes, apparatus, etc. This is an area where we have put in a lot of work.
  • Chemical name recognition is an important subset of chemical entity recognition. Names can be recognised by at least (a) direct lookup – required for trivial or trade names (“cholesterol”, “panadol”), (b) machine-learning techniques on letter or n-gram frequencies (see the sketch after this list for a flavour of this), and (c) interpretation (below).
  • Chemical name interpretation, e.g. of (IUPAC) names such as 1-chloro-2-methyl-benzene. The International Union of Pure and Applied Chemistry (IUPAC) oversees the rules for naming chemicals, which run to hundreds of pages. It looks algorithmic to code or decode chemical names. It is NOT. Some computer scientists have taken this as a toy language system and been defeated, because it is actually a natural language with rules, exceptions, irregular formations and a great deal of non-semantic vocabulary. It includes combinations (semi-systematic) such as 7-methyl-guanosine, where if you don’t know what guanosine is you can make little progress (but not none; you know there is a methyl group).
  • Information extraction. The (often large-scale) extraction of information from documents. This is never 100% “correct”, partly through lack of vocabulary, partly through variations in language including “errors”, and partly because of ambiguity. We use the terms recall (how many of the known chemical phrases were actually found) and precision (how many of the retrieved phrases were correctly identified as chemical). Note that this requires agreement as to which phrases are chemical and this must be done by humans. This annotated corpus requires much tedious work, and to be useful must be redistributable in the community. Without it any reported metrics on the performance of tools are essentially worthless. There is commercial value in extracting chemical information and so, unfortunately, most metrics in this area are published as marketing figures. Note that the performance of a tool is not absolute but depends critically on the selection of documents on which it is run.
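
To give a flavour of approach (b) above – judging whether a token “looks chemical” from character n-gram statistics – here is a toy sketch. The tiny training set, the add-one smoothing and the scoring are invented for illustration and are far cruder than what OSCAR actually does:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy character-trigram scorer: does a token look more "chemical" than "English"? */
public class NGramChemClassifier {

    private final Map<String, Integer> chemCounts = new HashMap<>();
    private final Map<String, Integer> engCounts = new HashMap<>();
    private int chemTotal = 0;
    private int engTotal = 0;

    private static List<String> trigrams(String word) {
        String padded = "^" + word.toLowerCase() + "$";   // mark word boundaries
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            out.add(padded.substring(i, i + 3));
        }
        return out;
    }

    public void train(String word, boolean isChemical) {
        for (String t : trigrams(word)) {
            if (isChemical) { chemCounts.merge(t, 1, Integer::sum); chemTotal++; }
            else            { engCounts.merge(t, 1, Integer::sum);  engTotal++; }
        }
    }

    /** Average log-likelihood ratio per trigram, with add-one smoothing; higher means more chemical-looking. */
    public double score(String word) {
        List<String> grams = trigrams(word);
        double total = 0;
        for (String t : grams) {
            double pChem = (chemCounts.getOrDefault(t, 0) + 1.0) / (chemTotal + 1.0);
            double pEng  = (engCounts.getOrDefault(t, 0) + 1.0)  / (engTotal + 1.0);
            total += Math.log(pChem / pEng);
        }
        return grams.isEmpty() ? 0 : total / grams.size();
    }

    public static void main(String[] args) {
        NGramChemClassifier classifier = new NGramChemClassifier();
        // Invented, deliberately tiny training set, purely for illustration.
        classifier.train("dihydroxymanxane", true);
        classifier.train("chlorobenzene", true);
        classifier.train("arrow", false);
        classifier.train("school", false);
        // On this toy data the chemistry-like token scores higher than the everyday word.
        System.out.println(classifier.score("trimethylbenzene"));
        System.out.println(classifier.score("window"));
    }
}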

During this process Joe and Chris enhanced OSCAR by adding chemical name recognition using n-grams and Bayesian methods. This gave a tool which was able to recognize and interpret large amounts of the world’s published chemical syntheses. It’s at that stage that we run into the non-technical problems such as publisher firewalls, contracts, copyright and all the defences mounted against the free digital era (but that’s a different post).

The next phase was a collaborative grant between Ann Copestake and Simone Teufel of the Cambridge Computer Laboratory and myself, funded by EPSRC (SciBorg). I re-emphasize that SciBorg is about many aspects of language processing besides information extraction. We were delighted to include publishers as partners: the RSC, the International Union of Crystallography and Nature Publishing Group. All these have contributed corpora, although these are not wholly Open.

In NLP an important aspect is interpreting sentence structure through part-of-speech tagging. Thus “dihydroxymanxane reacts with acetyl chloride” has the structure NounPhrase Verb Preposition NounPhrase. There’s a splendid tool, WordNet, that will interpret natural language components – here is what it does for “acetyl chloride” (identifying it as a Noun). But it fails on “dihydroxymanxane” – not surprising, as my colleague Willie Parker coined the name manxane in 1972 and the dihydroxy derivative is generated semi-systematically. There are an infinite number of chemical names and we need tools to identify and interpret them.

OSCAR was therefore developed further by Peter Corbett to recognise chemical names in text, and our indications are that its methods are not surpassed by any other tool. Remember that results are absolutely dependent on an annotated corpus and on the actual corpora analysed. It’s easy for any tool to get good results on the corpus it’s been trained on and lousy ones for different material. But, on a typical corpus from RSC publications, OSCAR3 scores over 80% combined precision and recall. (Before you brag that your tool can do better, the study also showed that expert chemists only agreed 90% of the time, so that is the upper limit. If chemists cannot agree on something, then machines cannot either.)

OSCAR3 is now widely used. There have been over 2600 downloads from SourceForge (yes, of course OSCAR3 is Open Source). We get little feedback because chemistry is a secretive science but this at least means that there are relatively few bugs. Of course there may also be people who find they can’t install OSCAR3 but don’t contact us. The European Patent Office has used OSCAR3 on over 70,000 patents.

So OSCAR can justify some effort to make it even more usable and that’s why we have approached OMII. See below…

When we first started OSCAR we realised that we needed a name2structure parser if we were going to understand the chemistry. It’s valuable to know that dihydroxymanxane is a chemical, but even better if we know it is 1,5-dihydroxybicyclo[3.3.3]undecane, because chemists can interpret that. So I started by writing a separate tool to interpret chemical names (there weren’t then, and aren’t now, any other Open Source programs to do this). Joe Townsend took over and researched the literature for parsing methods, and handed this over to PeterC at the start of SciBorg. Peter made useful enhancements to this and included it as a subcomponent, OPSIN. Peter deliberately did enough work to interpret common chemical names and included it in the OSCAR processing chain.

I want to be very clear. OPSIN has never been promoted as a tool to compete with commercial name2structure tools (there are 3-4). It was an adjunct in the SciBorg program. If PeterC or I had spent more time increasing its power it would have been at the expense of what the grant was for. It met its given purpose well – to highlight the value of automatic translation and markup of names – and led, in part, to the RSC’s development of Project Prospect, where chemical concepts in publications are semantically marked. From time to time we see anecdotal reports that OPSIN is not up to the standard of commercial tools, and that is used as an argument for poor quality in Open Source projects and – sometimes – the relative inability of academics to do things properly. That’s unfair, but we have to bite our lips.

That’s now massively changing and I believe that in a few months time OSCAR and OPSIN will be seen as a community standard in chemical language processing and chemical entity interpretation. Being Open Source that will lead to increased community effort which has the power to leapfrog some of the commercial offerings. More in the next blog post.

Trust in scientific publishing

Saturday, May 9th, 2009

[Please excuse formatting - reinstalling ICE soon]

Two stories have coincided – both relate to the role of trust in scientific publishing.

The first is when I was rung by Emma Marris, reporting for Nature, last week and asked what I thought of the financial problems in the American Chemical Society. I said that I wasn’t really the right person to ask, as this is not my specialist area, but that it was essential that scientific societies remain a key part of the scientific community. She’s included a quote in this week’s Nature:

http://www.nature.com/news/2009/090506/full/459017a.html

American Chemical Society makes cutbacks to fight financial losses.

Emma Marris

The American Chemical Society (ACS), the world’s biggest scientific society, is feeling the effects of the global economic downturn.

On 28 April, six months after tightening its belt a first notch, the society laid off 56 people, 3% of its employees…. [the rest is Pay-to-read]

I can’t reproduce the article (copyright) but here’s my bit …

Even vocal critics of the society’s opposition to open-access publishing aren’t delighting in its financial woes. Peter Murray Rust of the University of Cambridge, UK, whose blog covers open-access chemical information, says that he wishes the society well. “I have not been a supporter of many of [its] policies,” he says, “but I would say that we absolutely need national scientific societies.”

As Emma says I have been critical of some aspects of the ACS’s public policy, most notably its proactive role in PRISM – a coalition of (a few) leading publishers to discredit Open Access. From Peter Suber’s blog (2007):

[3]   July 2006 – As Nature later reports, Several publishing executives with ACS, Wiley and Elsevier meet with PR operative, Eric Dezenhall, to discuss a plan to defeat open access.  Dezenhall advises the executives to equate Open Access with a reduction in peer review quality.

This and similar actions have led people to question the scientific integrity of the participants.

In the C21 one of the critical commodities is trust. A typical (and misguided) mantra is: “You can’t trust anything in Wikipedia”. So who can, by their nature, be trusted in the scientific arenas? I’ll try the following list and am happy for comments:

  • learned societies (and international scientific unions)

  • universities, national laboratories and government agencies

  • libraries

  • funding bodies including (most) charities

  • (some) regulatory bodies if business is conducted publicly

Scientific societies have a critical role and that’s why I wish to see a healthy and growing involvement of scientific societies in establishing trust. Trust cannot be mandated; it has to be earned. It is hard won and easily lost. In the C21, Openness and democratisation are major tools in speeding up the growth of trust.

I’ve excluded the commercial publishers. There are worthy ones, but there are also ones driven at least partly by the search for revenue at the cost of trust. The following story (http://www.earlham.edu/~peters/fos/2009/05/elsevier-and-merck-published-fake.html) broke recently about Elsevier’s publication – for money paid by Merck – of a fake journal. The “journal” was made to look like a typical medical peer-reviewed journal:

Merck paid an undisclosed sum to Elsevier to produce several volumes of a publication that had the look of a peer-reviewed medical journal, but contained only reprinted or summarized articles–most of which presented data favorable to Merck products–that appeared to act solely as marketing tools with no disclosure of company sponsorship. …

The Australasian Journal of Bone and Joint Medicine, which was published by Exerpta Medica, a division of scientific publishing juggernaut Elsevier, is not indexed in the MEDLINE database, and has no website (not even a defunct one). …

This might well have gone unnoticed in a pre-digital age and it’s clear that the blogosphere is a major tool in detecting unacceptable publication. So – as many have noted – here is a commercial company which has campaigned to rubbish Open Access as “junk science” behaving in a manner which totally destroys any trust in their ethics and practice. I have no option but to say that I now cannot absolutely trust the ethical integrity of every piece of information in Elsevier journals.

The need for Open, trusted, scientific data and discourse is now clear. The scientific societies are well placed to help us make the change from closed paper to open trusted semantic digital. They clearly need a business model that transforms the new qualities into a revenue stream. This will not be easy but it has to be tried – there is no alternative. Some of the modern tools will help – the ability to mashup, aggregate, etc. will lead to new forms of high-quality information that will have monetary value. Certified validated information will lead to productivity gains and may be a valuable commodity.

So this should be a time for scientific societies to look positively to the future rather than fearfully at the receding past.

British Library document on copyright

Saturday, May 9th, 2009

From Ben White of the BL (who sought views from me and others to go into the document). There is a lot that is positive in this and I really hope the Government takes the recommendations seriously in revising the law. [BTW the format of the document itself is strange and rather difficult to read on screen – it looks more like a poster].

Please find attached the British Library’s latest paper on Copyright and Research. http://www.bl.uk/ip/pdf/copyrightresearchreport.pdf

We had an event (see podcast if you have the time at www.bl.uk/ip) this Tuesday to discuss copyright and research – those on the panel included Lynne Brindley, CEO of the British Library, as well as IP and Higher Education Minister David Lammy, Torin Douglas BBC etc. Lots of great people in the audience too of course!

Please spread the word regarding the paper!

Sincerely yours

Ben White

Here is an excerpt:

In a supreme irony, the ease of access enabled by the digital age actually leads to greater access restrictions:

1. Researchers increasingly find a black hole when researching 21st Century material – ironically the material of previous centuries has become easier to access than the websites, word documents and blogs of today, because clearing rights to give access to modern day material can be lengthy and expensive.

  • Currently Google blocks post 1868 material on their Google Books site from users in the European Union because of the longer duration of copyright in the EU. This means that European researchers wanting to read material up to 1923 have to travel to the United States to view material that is freely available there on the web but not in Europe. Much of this material was of course produced by Europeans…

  • Some historical publishers have had to abandon post war social history projects as the rights issues are too complex.

2. Researchers of the future find a black hole when researching late 20th Century history as much of our digital history has decayed and become digitally corrupted.

  • Parts of the British Library’s archive of celebrated photographer, Fay Godwin, may no longer be accessible to researchers when Microsoft and Adobe no longer support Windows XP/Vista and Photoshop (CS3) servers, as the servers are essential for viewing some of her digital photographic collection. Restrictions in copyright law mean that the British Library can do little practically to prevent this.

3. Computer based research techniques become restricted by copyright and contract law. Computer technology has already significantly changed the way in which scientific research is conducted. Scientists increasingly do not read books or journals, but by writing computer programmes search, analyse and extract data from written sources in a technique known as ‘data mining’ or ‘text mining’. Science is propelled forward by access and collaborative reuse of scientific information. It is important that computer based research techniques are allowed for by future copyright law, in the way that in the analogue world we have protected research activity through ‘fair dealing’.

  • Medical researchers write their own computer programmes to search across thousands of digitised articles in their libraries to extract important medical data, such as the relationship between a certain enzyme and the spread of cancer. Despite this, the researcher is not able to share the results of their findings with other scientists as this will contravene the terms of their licence with the database provider, and the relationship between the provider and the university.

It is heartening to see such a positive view being promoted at a national level. Perhaps this is something that individual libraries can help to support and propagate. Hopefully it can give encouragement to those who wish to challenge the unacceptable status quo.

BioIT – Chem4Word

Monday, April 27th, 2009

I’m in Boston for the Bio-IT World Conference & Expo 2009 for two main reasons: an invited talk, “The Chemical Semantic Web” (Computational Chemistry track), and our first public demonstration of the Chem4Word software (research.microsoft.com/en-us/projects/chem4word/). For those who are at the meeting, the first is on Wednesday morning, the second on Tuesday lunchtime.

The C4W demo has been worked on very hard for the last month. There was a dress rehearsal in Redmond at the Microsoft External Research meeting, where it was ready about 5 minutes before the presentation. We took the decision to freeze that functionality and to show it in Boston after the bugs had been ironed out. The discipline of having a fixed deadline (an international meeting) is an excellent way of concentrating minds within a project. Rudy Potenzone is demo-ing the software but I’ve got the demo on my machine as well.

What does Chem4Word do? It’s more important to say what it is.

At one level it’s an add-in that chemists can use to author documents. At the other end it’s a toolkit which can be used to develop the next generation of bench-top chemical software. I owe Rudy some introductory material, so I might as well use this blog to do it.

Chem4Word is an Open Platform for collaborative chemical software development in a .NET environment.

C4W will be transferred to CodePlex (the MS Open software site) and will be available for anyone to help develop, much as in the spirit of the Blue Obelisk. Learning from other Open Source chemistry projects, we have thought closely about sustainability of management.

Chem4Word is an Add-In to Word2007 that creates a semantic authoring tool for chemistry.

Word2007 is a platform that supports semantic authoring. Its use of smartTags allows words and phrases to be linked to a range of document components, including a Gallery and a Navigator.

Chem4Word uses (chemical) Ontologies.

With the new Microsoft Research Ontology Add-In and external ontologies (we use Nico Adams’ ChemAxiom), document components can be managed by a formal ontology. At one level this is a chemical spell-checker, at another a thesaurus, at another a converter between scientific units and at yet another a transformation tool between scientific concepts.

Chem4Word emphasizes semantics by using CML as its exposed data model

Current chemical toolkits require a fixed data model for objects. C4W communicates with CML (and other XML) as its data model. This gives a declarative programming model where there are no side effects. Effectively this is a new programming language for chemistry, both formal and flexible.
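
To show what “CML as the exposed data model” looks like at the XML level, here is a minimal sketch that builds a tiny CML-style molecule with the standard Java DOM API; the element and attribute names follow general CML conventions, but the exact CML-Lite subset that C4W (a .NET add-in) accepts is not reproduced here:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CmlSketch {
    static final String CML_NS = "http://www.xml-cml.org/schema";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // A minimal CML-style molecule (a two-atom C-O fragment) as a plain XML tree.
        Element molecule = doc.createElementNS(CML_NS, "molecule");
        molecule.setAttribute("id", "m1");
        doc.appendChild(molecule);

        Element atomArray = doc.createElementNS(CML_NS, "atomArray");
        molecule.appendChild(atomArray);
        String[][] atoms = {{"a1", "C"}, {"a2", "O"}};
        for (String[] a : atoms) {
            Element atom = doc.createElementNS(CML_NS, "atom");
            atom.setAttribute("id", a[0]);
            atom.setAttribute("elementType", a[1]);
            atomArray.appendChild(atom);
        }

        Element bondArray = doc.createElementNS(CML_NS, "bondArray");
        molecule.appendChild(bondArray);
        Element bond = doc.createElementNS(CML_NS, "bond");
        bond.setAttribute("atomRefs2", "a1 a2");
        bond.setAttribute("order", "S");
        bondArray.appendChild(bond);

        // Serialize to stdout so the generated XML can be inspected.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(System.out));
    }
}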

Chem4Word is modular

The graphics and UI are decoupled from the chemical engine. This means that commands can be issued to the engine from sources other than the UI. The document is also modular – it’s possible to examine the chemistry, the links, the tags all as XML and to build document processors independent of Word.

Chem4Word supports validation

All CML has to conform to a schema (CML-Lite) and can be validated at every stage. The import pipeline takes 4-5 stages with validation and normalization. It is impossible to import or author an invalid file. This is intended as an important contribution to bringing needed quality into chemistry.
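
As a rough illustration of the kind of schema-validation gate described above, here is a sketch using the standard Java XML validation API; the schema and document file names are placeholders, and C4W’s own pipeline is .NET-based, so this only shows the general mechanism:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class CmlValidatorSketch {
    public static void main(String[] args) throws Exception {
        // "cml-lite.xsd" and "molecule.cml" are placeholder file names for this sketch.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("cml-lite.xsd"));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new File("molecule.cml")));
            System.out.println("valid CML - safe to import");
        } catch (SAXException e) {
            // An invalid document is rejected before it reaches the Word document.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}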

Chem4Word integrates text, chemistry and styles

The Word document introduces ChemistryZones: chunks of the document representing chemistry. These are all backed by a CML object which itself can have many components, currently:

  • single molecule

  • compound molecule (salts, hydrates, complexes)

  • formula

  • name

Each of these can be displayed in a chemistry zone, making it possible to change the representation of an object, while preserving the semantics. The Navigator allows the user to select a given zone or to navigate from it.

Current functionality

The current project had to balance functionality, semantics and aesthetics and has put most emphasis on semantics. The primary functionality is currently:

  • manage gallery, navigator and other Word concepts

  • create chemistry zones

  • import CML molecules

  • validate them

  • render them, with different styles in different zones

  • tweak them (move atoms to prettify the molecule)

  • change atoms

We have deliberately not (yet) introduced chemical editing tools, as we wish to get the UI framework correct and validate the semantics. With the large number of molecules now available (e.g. in PubChem) we can convert these to valid CML outside C4W and import them. This means that unless chemists are working with new molecules, C4W will already support many of their authoring needs.

The future

The current project runs for another few months, at the end of which we’ll have a release version. (We shall make the current version available to a few pre-alpha collaborators.) A major emphasis is to create a distribution which is well designed for development, even if that means limiting the initial functionality. We’ll work hard on developing use cases where C4W is useful, especially in the creation of compound documents.

We’ll tell you where this is going after that.

This blog post was authored with ICE + OpenOffice; thanks to Peter Sefton and USQ.

(Note: Just when I thought I had the ICE plugin working, it now fails to post. I think this may be due to firewalls or something else, but I can’t grab the error message as it disappears. So I have to cut and paste. I think that’s why the fonts go wonky)

Three days to save the European Internet

Sunday, April 26th, 2009

Two days ago I had no idea the European Internet was under severe threat, and I’m a European. Part of the problem is that Europe is incredibly complicated and the governance is baroque and bizarre. It uses terms like “acquis communautaire”; admittedly I suffer from Anglophone blindness, but in any language the complexity of terminology and governance is horrendous.

The normal thing most Brits do is ignore it. I have a cosy feeling that continentals are more educated but that’s probably false. So we have a governance process that’s out of control. They pay themselves huge allowances, are regularly corrupt, but as a war baby I reckon that’s a small price to pay for not carpet-bombing civilians. Yes, the UK tabloids regularly bash the Common Agricultural Policy, etc. but…

I was shocked out of my complacency when the issue of Software Patents in Europe arose. I went to UCL (London) to hear Richard Stallman talk on this and was embarrassed to find an American who knew how European government worked. He knew where the power lay – the Council of Ministers (who are unelected), etc. – and he gave us clear instructions as to how best to mobilise.

Now we are at it again. Although I’m an educated citizen of Europe I don’t know how to promote my views best. But one of the great powers of the Web is that it promotes e-democracy. Not only can anyone say what they want but groups can use crowdsourcing to assemble arguments and advocacy. So I know that I can read up rapidly on the issues and know what the best use of my very limited efforts is. (Here I think it’s mainly raising the issues on this blog and writing as an individual to my MEP).

I’ve found Twitter very useful here. 2-3 followers have, in the rather cryptic style of Twitter, pointed out that there are two issues:

  • Net neutrality

  • 3-strikes

Both are evil, but the wisdom seems to be that net (non)neutrality is even more evil. What’s NN? Here’s a helpful site (http://www.savetheinternet.com/=faq). Essentially Net Neutrality is about the infrastructure of the net as provided by companies such as telcos, which by default do not have our interests at heart.

From the site:

What is Network Neutrality?

Network Neutrality — or “Net Neutrality” for short — is the guiding principle that preserves the free and open Internet.

Put simply, Net Neutrality means no discrimination. Net Neutrality prevents Internet providers from blocking, speeding up or slowing down Web content based on its source, ownership or destination.

Net Neutrality is the reason why the Internet has driven economic innovation, democratic participation, and free speech online. It protects the consumer’s right to use any equipment, content, application or service on a non-discriminatory basis without interference from the network provider. With Net Neutrality, the network’s only job is to move data — not choose which data to privilege with higher quality service.

Who wants to get rid of Net Neutrality?

The nation’s largest telephone and cable companies — including AT&T, Verizon, Comcast and Time Warner — want to be Internet gatekeepers, deciding which Web sites go fast or slow and which won’t load at all.

They want to tax content providers to guarantee speedy delivery of their data. They want to discriminate in favor of their own search engines, Internet phone services, and streaming video — while slowing down or blocking their competitors.

These companies have a new vision for the Internet. Instead of an even playing field, they want to reserve express lanes for their own content and services — or those from big corporations that can afford the steep tolls — and leave the rest of us on a winding dirt road.

The big phone and cable companies are spending hundreds of millions of dollars lobbying Congress and the Federal Communications Commission to gut Net Neutrality, putting the future of the Internet at risk.

Isn’t the threat to Net Neutrality just hypothetical?

No. By far the most significant evidence regarding the network owners’ plans to discriminate is their stated intent to do so.

The CEOs of all the largest telecom companies have made clear their intent to build a tiered Internet with faster service for the select few companies willing or able to pay the exorbitant tolls. Network Neutrality advocates are not imagining a doomsday scenario. We are taking the telecom execs at their word.

And you should read more.

Here’s an analogy. I shall start my journey to BioIT on two trains, East Coast Capital Connect (used to be British Rail) and Transport for London (the tube). Each makes up its own rules as to what services operate and what the fare structure is. For example, if I want to travel from Cambridge to London they decide that I cannot have a cheap fare at certain times even though I have a concession. So as a class of citizen I am discriminated against in favour of corporate passengers (customers). That’s Train non-neutrality.

If I travel at the wrong time I incur a penalty. Let’s call that a strike. And let’s assume that a company decides that a recidivist breaker of this rule gets banned from travelling. That’s a per-person decision, and somewhat analogous to the three strikes rule. There may be good reasons for wanting to ban individuals – repeated disorderly behaviour, for example. I don’t know, but I expect there are people banned from rail travel.

So in writing to my MEP I referred him to a summary of the issues – better than trying to explain them myself when I don’t know what’s being voted on, when, and by whom.

I hope he knows.

Open Chemistry Data at NIST

Friday, April 24th, 2009

I had a wonderful mail this morning from Steve Heller …

Peter


I am helping the NIST folks get additional GC/MS EI (electron impact only) mass spectra for their WebBook and mass spec database.
http://webbook.nist.gov/chemistry/
and
http://www.nist.gov/srd/nist1a.htm

The question I have for you is would you be willing to post something on your blog suggesting it would be useful for people to donate their EI MS to the NIST folks. The WebBook is Open Data which is where the spectra would go first/initially. In addition, the spectra would also go into the NIST mass spec database to add to the existing database they provide.

NIST is in the process of setting up an arrangement with the Open Access Chemistry Central folks to do this and I wanted to see if you also would be willing to cooperate/collaborate as well.

Cheers

Steve

PMR: Many of us have known the NIST webbook for many years. It was the first, and for some time the only, openly accessible chemistry resource on the web (outside bio-stuff like PDB). NIST are a US government agency whose role is – in large part – to produce standards (data, specs) for resources in science and engineering. Part of this role is to support US commerce through these activities.

The WebBook has many thousands of entries for compounds. Even if you aren’t a chemist, have a look as it’s an ideal exemplar of how data should be organised. The impressive thing is that it has complete references for all data and also concentrates on error estimation. In many ways it is the gold standard of chemical data. (I agree that things like Landolt-Börnstein are very important, but in the modern web-world monographs costing thousands of dollars are increasingly dated). And it was Steve and colleagues (especially Steve Stein) who got the InChI process started – because they had so much experience in managing data publicly it made sense to promote the InChI identifier for compounds.

(In passing, NIST has also made an important contribution to our understanding of the universe by measuring the fundamental constants to incredible accuracy).

So is NIST in CKAN – the Open Knowledge Foundation’s growing list of packages of Open Data? YES (from http://www.ckan.net/package/read/nist)

About

The NIST Data Gateway provides easy access to NIST scientific and technical data. These data cover a broad range of substances and properties from many different scientific disciplines.

Openness

Much of the material appears to be in the public domain as it is produced by the US Federal Government, but it varies from dataset to dataset.

Note that there is some fuzziness about what is meant by openness here – the NIST pages carry “all rights reserved” and “the right to charge in future”. But Steve’s motivation is clear here and it’s part of the role of OKFN/CKAN to help determine what the rights are.

I’m also interested in the reference to Open Access Chemistry Central. This raises the very important question of where Open Data should be located. The bioscience community has shown that a mixture of (inter)governmental organizations can work extremely well, but this is less clear in chemistry at present. We are in an exploration phase with a number of initiatives trying out models such as PubChem (gov), ChemSpider (independent/commercial), CrystalEye (academic), NIST (gov), Wikipedia Chemistry (independent), NMRShiftDB (academic), Chemistry Central (commercial/publisher), etc. I am sure there will be a need for multiple outlets – the variation in the sites above is too great for any single organization.

What is important is that this is Linked Open Data (LOD), because then it does not matter who exposes it. LOD has a number of requirements, including:

  • Open Data (not just accessible)

  • Semantic infrastructure (e.g. XML/RDF)

  • Identifier systems

  • Appropriate metadata and/or Ontologies

I’ll be talking about this at BioIT next week in Boston (where I shall meet up with Steve). I’ll be blogging more over the next two days.

In Cambridge we have just been funded by JISC to enhance our repository of chemistry data, which will include Mass Spec. I don’t know how much is EI, but our mission is to make the data Open and where this happens then we will certainly send it off to Steve. There’s a certain amount of technology needed but between us I think we could get an excellent public prototype.

More – much more – soon.

This blogpost was prepared with ICE+OpenOffice.

ICE-cold in Toowoomba

Monday, April 20th, 2009

I am here for all too short a time, working with Peter Sefton and colleagues on a number of collaborations on authoring and publishing tools. Peter runs the Australian Digital Futures Institute at the University of Southern Queensland in Toowoomba – a lovely place in the mountains west of Brisbane.

We have a joint project funded by JISC – ICE-Theorem – and I’ll blog later when we have had the demo. This is a great arrangement because we have been able to contract much of the work to Peter’s group. Having now met the current group (and it’s grown since I was last here) I can say that it has a critical mass of committed developers which is very hard to put together in most academic institutions, especially those which depend on “research” output rather than technology. We’ve built up a strong mutual understanding over the last 3 or so years.

We have our differences of approach, but wherever possible we are looking for these to complement each other. Good academic web tools will depend on a mixture of diversity and synergy. That means trying out new ideas but not getting locked into one’s own approach because you want glory or money (the chances are relatively small).

What often happens in the academic content/publishing world is that technology “empires” spring up – managing repositories, courseware, etc. They often mutate into political organisations with large consortia, where the pace is governed not by technology but by the need to satisfy everybody’s interests. At the other end of the spectrum are the geeks – in the best sense – who want to build systems in days.

They often do. And Toowoomba is one of the places where it happens.

Peter has been showing me the Fascinator – it’s a lightweight desktop repository based on Fedora (but that’s exchangeable). We have an apparently similar approach in Jim Downing’s Lensfield. However, we are looking to see how these two complement each other – Peter is document-centered, we are data-centered, and there is enough difference that it makes sense to go forward on both fronts.

But I have to rush …