petermr's blog

Blue Obelisk Award – Christoph Steinbeck of CDK

Posted on September 14, 2006 by pm286

Last night we met at the Thirsty Bear pub in San Francisco. This was the second anniversary of the first BO meeting (in San Diego). There were nine of us, and the membership and programs are growing. People are taking us seriously. It is still extremely hard to get support for Open Source in domains, especially chemistry (though some of us can thank funding bodies for our existence). The market is skewed towards the pharma industry, who have little effective interest in encouraging Open Source, even though they know that the current products are broken and do not interoperate (see earlier blogs). It is an enormous labour of love to create tools which appear to duplicate existing commercial offerings and be ignored. That is what Christoph has borne for several years as guru of CDK (the Chemistry Development Kit) – the mantle having now passed to Egon Willighagen.
CDK contains all the basic chemoinformatics routines for reading and writing molecules, analysing their chemistry by topology and fragments, substructure searching, property calculation, etc. Unlike most commercial offerings the algorithms are open and so available for validation. The BO takes this seriously and is accumulating reference data on which the programs rely and against which they can be tested.
I walked 5 miles round SF looking for blue obelisks in rock shops – few of those anyway and they were completely the wrong sort. So Christoph’s award is neither blue nor an obelisk. It’s purple and pointy. But no doubt it can be refactored, perhaps to a superclass of colourable pointy object, and then subclassed.
Raja took the picture. When I get it I’ll try to remember to post it here.

Posted in "virtual communities", chemistry, open issues

Blue Obelisk Award – Bob Hanson of Jmol

Posted on September 14, 2006 by ojd20

The Blue Obelisk Open Source group has now achieved a critical mass of high quality software, especially in chemoinformatics, chemical text analysis, editing and infrastructure such as markup languages (CML). We are beginning to be taken seriously and more collaborators are joining. The success is built on years of work by a few individuals. Those of you who think Open Source is now “obvious” may not realise that in domains – such as chemistry – it is normally regarded as suboptimal, carried out in “undergraduate projects” (a slur, anyway, as undergraduates have created some of our best materials).
This reputation is blown away by the splendid molecular visualiser Jmol. Indeed Nature Publishing Group has chosen “First Glance in Jmol” (FGiJ) as the tool with which to display proteins in their articles. Quality and Open Source are thereby recognised.
Jmol has been built over several years by many contributors, but always with one guru. They pass on the succession with a “Dr. Who-like” regeneration (indeed this is common to many OS projects). Dan Gezelter, Egon Willighagen, Miguel Howard and now Bob Hanson are among those who have taken this role. All deserve enormous credit.
The Blue Obelisk award is made whenever I have a Blue Obelisk and meet a BO member self-evidently worthy of receiving one. (That is an arbitrary process, rather like software development – the other gurus are not lesser beings, they just didn’t happen to be around).
In the anchor
The award was made after Bob had given a spectacularly visual talk in the Chemistry Department at Cambridge. This is not the Department – it is the Panton Arms Pub.
Bob receives his Blue Obelisk
Bob, Blue Obelisk and a Coke
Jim Downing took the pictures – I think – anyway he isn’t in them.

More BO awards are announced in later posts.

Posted in chemistry, open issues | 2 Comments

Open Data – the time has come

Posted on September 12, 2006 by pm286

The term “Open Data” is now becoming commonly used and we (Blue Obelisk) are trying to define it (our mantra being ODOSOS: Open Data, Open Source, Open Standards). It was not commonly used two years ago, although the concept is general enough to have been important. In the last 12-15 months there has been a lot of use, particularly in the techie web logs and meetings. The idea is potentially very much broader and looks set to become very important.
The earliest references I can find are:
Jim Kent on the human genome. An Open Data Consortium was founded in ca. 2003 seemingly concerned with geospatial data. Simon St. Laurent gave a presentation without date but it looks a few years back. It has a strong XML flavour.
I became concerned about Open Data in ca. 2003-2004, and Henry and I published a Manifesto for Open Chemistry in 2004. I followed these up in 2005 with several mails (example) and presentations to JISC, OAI, STM Publishers, etc., where I used the term “Open Data”.
Late in 2005 SPARC set up an Open Data list with me as moderator.
Science Commons started in Dec 2004.
In 2005 the term started to emerge, possibly independently, in the XML/tech area, as in XTech 2005.
It is now a hot topic among the Tims (Bray and O’Reilly).
There seem to be several related threads:

scientific data deemed to belong to the commons (e.g. the human genome)
infrastructural data essential for scientific endeavour (e.g. GIS)
data published in scientific articles which are factual and therefore not copyrightable
data as opposed to software, and therefore not covered by OS licenses and potentially capable of being misappropriated (this is a very general idea)

I think the current usages are sufficiently close that we should try to bring them together. Comments here would be useful. Maybe a Wikipedia article would help?

Posted in open issues | 8 Comments

Do you read journals, or "use a database"?

Posted on September 11, 2006 by pm286

We had a reception for the Chemical Information Division of the American Chemical Society last night and I spent a considerable time talking with several staff in the publications side of the Society. (I am not attributing personal views in this post, just general impressions). It was illuminating. I’ll try to be objective (it was a party and the drinks were provided by CAS). I met with the staff who were on the other side of our removal of service of ACS journals that I reported last week. Here are some thoughts…
There has been a major shift in how (some) Scientific Publishers see the purpose and practice of scholarly communication. Listening to the words used, “database” has replaced “journal” and “users” has replaced “readers”. I suspect the latter word conflates “purchasing officers” with “readers” into an unhappy anonymous entity. Moreover there is a tension between the publisher and the users – significant content is illegally downloaded and an important role of the publisher is acting as “policeman”, making sure that content is not stolen. Thus our problem last week was a student using Firefox and not understanding (realistically how could he?) that “open all tabs in browser” would put the University of Cambridge in breach of contract. Technically this may be true; however in police-community interactions overzealousness is not always a good strategy. We parted with the observation that “Firefox is a problem”.
Now, I have never advocated breaking or abolishing copyright, but it is clear that this is creating a tension in the publisher/reader community. I’ve been involved in setting up or being on the board of scientific journals and I see their major purpose as enhancing scholarly communication. I’m worried that we are losing sight of this, where journals in non-profit organisations are seen as a way of subsidising other activities of the society. If the publishers see “users” as a group who have a major motive to steal content, I suspect things will get worse.
At some stage we seem to have flipped from a community where publishers interpreted the wishes of the community and served them – for a reasonable fee – to a world where publishers make the rules and police their non-compliance. Did anyone in the reader community:

actually ask for journals to be transformed to databases?
actually ask for content to be limited in time to the duration of a subscription (we used to have physical journals we could take home and even hand down to our descendants or give to needy institutions)?

It worries me that this has happened almost silently. I remember in ca. 1970 (when I was too inexperienced to notice) that authors were asked to transfer copyright to publishers. These requests came from trusted societies – national societies and international unions (At that stage there were essentially no commercial publishers – Pergamon was a few years later). I didn’t think twice about it – but it was one of the biggest mistakes of my scientific life. Are we sleepwalking into something just as serious?
Objectively I have some sympathy with publishers whose content is illegally downloaded – I do believe in copyright. But pragmatically is the way forward to be increasingly draconian with readers (sorry, users)?

Posted in open issues | 7 Comments

ACS Presentation III

Posted on September 11, 2006 by pm286

The presentation – which was in the Cyberinfrastructure session – went well except for one tiny problem – no Internet though I asked for it the day before. Conference Centre in San Francisco – you might think it had default wireless – but no, certainly not in our area.
Some slight movement in some areas but on the whole I sense a major split between (a) pharma + software companies + commercial data bases (CAS, CCDC, etc.) and (b) the next generation of technology and social computing. I was pleased to see that – in a show of hands – ca 35-40% said they had read this blog – even though it’s only 10 days old.
Off to Google on Wed, hopefully with Steve Heller – co-inventor of InChI. Google + InChI has real potential as a disruptive technology…
Do you think Google will have connections to the Internet? Perhaps I should check.
Many thanks to all the Cambridge hackers and the Blue Obelisk group for their help in the demo. It all went well and there is lots of interest in all the technology. Several people are starting to mention Blue Obelisk – it’s a happy choice of name.
P.

Posted in "virtual communities", chemistry

Hamburgers and Cows; The Cognitive Style of PDF

Posted on September 10, 2006 by pm286

PDF is one of the greatest disasters in scientific publishing – why?
I normally give my slides in XHTML rather than Powerpoint and prefix them with the quote which I made up:
“Power corrupts; Powerpoint corrupts absolutely”
I then searched the web and found that Edward Tufte had already thought of it in The Cognitive Style of PowerPoint.
Tufte contends that PP had an important role in the Space Shuttle disaster(s). Tufte’s premise is that PP requires authors to omit critical data and dumb-down thought. I had never thought of PP as actually perverting the way we think, but it is absolutely right.
My attack on PP is complementary – technical rather than political. PP corrupts any semantics in the document completely. Just try to read the saved HTML from a PP (in say Google) and you will be lucky to get anything. PP is probably the most effective destroyer of semantic information yet devised. Tufte urges that authors use Word instead. I will interpret this to mean “any tool that displays conventional compound documents at the required level and without loss”. I therefore choose XHTML (because Word is a pretty good semantic destroyer as well).
So why not just use PDF? It’s universal, it’s beautiful to look at. It’s used for scientific publishing…
NO! PDF is the biggest destroyer of scientific information currently in use.
PDF concentrates on only one thing: reproducing the process of adding printers’ ink to paper. The PDF that scientists use for publications was not promoted by them, but by the scientific publishers. How many scientists wrote to the publishers saying “we would like double column text in PDF”?
The “e publishing revolution” has had the major and very sad effects of:
* transferring the printing bill from the publisher to the reader (almost all scientists seem to print out the papers and annotate them with markers)
* transferring political power to the publishers. It allows the publishers to claim (as the ACS does) that

What is important to realize is that a subscription to an STM journal is no longer […] a subscription; in fact, it is an access fee to a database maintained by the publisher.
[…] one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish. Maintaining an archive, however, costs money.
From “Socialized Science” (ACS[*] commentary on NIH)
RUDY M. BAUM, Editor-in-Chief, C&E News,
September 20 2004 Volume 82, Number 38 p. 7

How many scientists asked the publishers to convert journals into databases? How many asked the publishers to become the guardians of the archive? And to have them switch off access at a moment’s notice (as they did to Cambridge last week)?
There are some minor benefits from ePublishing (Crossref, more rapid access), but it’s a Faustian bargain and we are suffering. PDF has been the devil’s agent in this. It has insidiously transferred control to publishers with the unintended but equally horrific downside of semantic destruction.
Apart from the politics, why is PDF so bad? A question on XML-DEV about how to convert PDF to XML brought the lovely comment from Mike Kay (author of the (OpenSource) Saxon XSLT tool):

>
> Could you please tell me, How we can convert the PDF data
> into Xml file using java? I found a library PDFBox.
>
Converting PDF to XML is a bit like converting hamburgers into cows. You may
be best off printing it and then scanning the result through a decent OCR
package.
Michael Kay
http://www.saxonica.com/
http://lists.xml.org/archives/xml-dev/200607/msg00509.html

So I use XHTML and preserve my semantics. It’s a labour – but it has to be the way forward. I’ll write more on this later and why the browser manufacturers have destroyed semantics as well.
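To make "preserving semantics" concrete, here is a minimal sketch in Python, assuming a well-formed XHTML slide. The slide text and the function name `outline` are made up for illustration and are not taken from any real presentation. Because XHTML is XML, a standard parser can recover the title, headings and bullet points directly, which is the kind of thing that is hard to do with PDF.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# A made-up slide, used only to illustrate the idea.
SLIDE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Open Data</title></head>
  <body>
    <h1>Why Open Data?</h1>
    <ul>
      <li>Facts are not copyrightable</li>
      <li>Machines can read the markup</li>
    </ul>
  </body>
</html>"""


def outline(xhtml_text):
    """Return (title, headings, bullet points) from a well-formed XHTML string."""
    root = ET.fromstring(xhtml_text)
    title = root.findtext("x:head/x:title", namespaces=XHTML_NS)
    headings = [h.text for h in root.iterfind(".//x:h1", XHTML_NS)]
    points = [li.text for li in root.iterfind(".//x:li", XHTML_NS)]
    return title, headings, points


title, headings, points = outline(SLIDE)
print(title)
print(headings)
print(points)
```

The structure the author typed (a title, a heading, two bullets) comes back out as structure, with no OCR and no guessing at layout.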
(Judith M-R tells me there were too many typos in last post, so I shall edit offline, spellcheck and paste. I am still losing edits in WordPress and then finding later they have been saved after I have rewritten them.)

Posted in open issues, XML | 15 Comments

A new Recruit to Open Source

Posted on September 10, 2006 by pm286

We’ve got a new cybercollaborator! This is how things happen in the world of the Blue Obelisk… The exciting thing is that anyone – with hard, careful work and the respect of their peers – can become a highly valuable – and hopefully visible – member. You don’t have to be at a big institution.
I got the following mail yesterday from Beth Ritter-Guth, which she is very happy for me to post:

Dear Dr. Murray-Rust:
I am a doctoral student working on an article about Open Source Science in chemistry as it relates to rhetoric and technical communication. My theory is that tech communicators will have a new role as emerging technologies make scientific data (specifically chemical) more accessible to the general public.
Your work in automation is very interesting, as I believe this is the future of how chemical data will be shared within the chemistry community. I am also interested in how Open Source Science creates collaborative space for chemists to work for the good of the global community (Synaptic Leap and Chemists without Borders). My hope is to share these concepts with technical communicators in hopes that they (as translators) can educate the general public (or other chemists) about OSS in terms of collaboration and automation.
… can you direct me toward resources that will accurately represent OSS in chemistry? I want to make sure that I am including all of the important materials in my research. I have read articles by you and Dr. Rzepa, and I work in collaboration with Jean-Claude Bradley at Drexel. I follow a host of chem blogs, as well.
Since I am not a chemist by trade, I want to make sure that I extract the best articles in the field. I am also working in a completely open format using a wiki (http://bethritterguth.wikispaces.com ), as I truly adopt and support Open Access Scholarship.

I replied to Beth saying that there had been very little scholarship done on OS in chemistry as most people in chemistry think it’s a bunch of idealists and “student hackers”. “If it’s free it can’t be any good” and they ritually pay kilobucks/year for mediocre software.
I mailed the following:

The best place is probably the Blue Obelisk mailing list. Ask for what you want and you should get a considered set of replies. I know that Geoff Hutchison had a back-to-back article with Matt Stahl (Open Eye) taking different views. I am not sure that there are many articles other than Geoff’s. Henry and I are most vocal on Open Data rather than Open Source, although of course we promote it.
The main other set of OS code is comp chem, which includes Abinit, GULP, MOPAC-8. They are active but not religious about OS so they won’t have written anything.
Would you mind if I mentioned your request on my blog?

Posted in "virtual communities", chemistry, open issues | 6 Comments

The Blue Obelisk

Posted on September 8, 2006 by pm286

I’ve promised to write about the Blue Obelisk and I’ve only got a short time before cycling home but at least I need to point to this before the ACS meeting.
Chemoinformatics and much chemical computation is seriously broken. The formats are 30 years old, the producers compete against each other, there are no validated data resources or programs, and no communally agreed knowledge. Each producer sees themselves at the centre of the universe and caters only for their own requirements, leading to a forest of “stovepipes” in the antipattern jargon. There is no sign of positive reaction to the developments on the web. Neighbouring disciplines such as bioinformatics sigh meaningfully and then go ahead and create the Open chemical resources they need. More of this later…
Chemical software used to be free. It wasn’t interoperable, but that is because machines weren’t. Even if you used a single language (FORTRAN) there was a lot of work to transport it. A fine organisation called QCPE (Quantum Chemical Program Exchange) would sell you a distribution for the cost of the mag tape. That’s all changed. First the computational chemistry codes (quantum mechanical), then the chemoinformatics and molecular graphics ones were bought up by warring software companies in the 1980s. I was on the customer side, in pharma, and I’ll write more later. But everything became closed. One company threatened to sue customers if they revealed its file format…
This mess persists. But about 10 years ago a number of small initiatives took place to create Open alternatives – a real labour of love because they were generally not innovating, but playing catchup. They weren’t taken seriously. For the most part they still aren’t. But it’s changing. There is now a critical mass of developers in mainstream chemoinformatics – not enormous, but sufficient to create a usable, useful system. That is growing rapidly. I guess there are over 1 million lines of Java code, and the same in C++. Yes, we have to duplicate codes for platform reasons, but it’s a good thing to have a few alternatives.
We discover each other by cyber-methods – mailing lists, IRCs, etc. The best known of the IRCs is freenode cdk. So people become cyberfriends. Before the ACS meeting in San Diego 2 years ago we decided to meet in Horton Plaza – by the Blue Obelisk. Amusingly there are two so we nearly didn’t manage it. But we did, and the name stuck. I wrote a short summary of our communal aims and aspirations and it’s taken off from there. We’re meeting again in San Francisco next week.
The Blue Obelisk now has its own mailing list and many members including me have blogs. You can find it all at:
http://www.blueobelisk.org
There is also a planetblueobelisk which aggregates the feeds.
I’ll write more about individual components and people as I feel the opportunity.

Posted in "virtual communities", chemistry, open issues

ACS Presentation – Part II

Posted on September 8, 2006 by pm286

The first part of my presentation dealt with the technical issues surrounding semantic chemistry. This page contains predictions – they are general enough that you don’t have to be a chemist to appreciate them. I’ll probably try to cover some in the talk but if not, at least Wendy can write them up.
They are not aggressive polemics, but statements of what I see as the current and the inevitable.
Chemical informatics and information is broken. It’s expensive, lossy, out of date and restrictive. There is virtually no innovation and no obvious understanding of how the web is changing. I don’t think the future Web (“Web 2.0” or whatever the current acronym is) can co-exist with the closed, inward-looking chemoinformatics community which supports the closed world of pharmaceutical research.
Unless current providers of information and software, and purchasers of these services (pharma) change rapidly there will be a split. The new informatics will be characterised by:

biosciences and some sciences adjacent to chemistry (perhaps geosciences)
funders who aggressively promote Open Access and require their grantees to make their output universally available
data providers who wish to build mashups – especially multidisciplinary, combined services, and autonomous processes.
the young-at-heart generation who espouse Wikipedia, folksonomies, and social computing. Expect to see a lot of semi-formal semi-voluntary reviewing of information resources such as PubChem and Wikipedia
a growing Open Source community based on the Blue Obelisk mantra of Open Source, Open Data and Open Standards
publishers with the foresight to see the new opportunities and the value of new products and services

Five years ago I made several predictions about the Semantic Chemical Web. Many have come true in technology (but not always in human uptake). Here are some of the next five:

Wikipedia Chemistry will be more accessed than the Merck Handbook or general chemical textbooks
Students will bring PDAs into lectures (if they even bother to go) and point out when the lecturer makes mistakes
machines will be able to answer some first year chemistry exam questions
machines will roam the Open chemical semantic web mashing data against bio- and geo-sciences.
PubChem will be more accessed than Chemical Abstracts. Universities will cancel their subscriptions to the latter, which will be increasingly oriented to serve the pharma industry
chemical linguistic robots will read Open Chemical papers on behalf of the community and extract data, give guidance on what papers are worth reading, build personal chemical memexes, etc.
mashups of Open crystallographic data will become universal and, except for historical data searches, replace the crystallographic databases.

There are some unpredictable aspects:

will the pharma industry continue in its closed approach to information? If it is to be information-driven it has to develop an open supply chain for multidisciplinary information and services
will the major publishers react positively?
will Google enter chemistry? I’ve been invited to Mountain View next week – very exciting. I expect to get a very different type of audience from the ACS – probably no chemists but many excited young web hackers. Google and the new technology could dramatically change chemical informatics.

Posted in chemistry, open issues, XML

ACS presentation Part I

Posted on September 8, 2006 by pm286

Edward Tufte said in his recent book that one shouldn’t use Powerpoint to present information, but Word. Although I am not a fan of Word (see later posts) I agree with the message. So this is the first part of my talk to the American Chemical Society. Don’t worry – there’s not a lot of chemistry.
The title is something like eChemistry (which would be nice if it actually existed – we are trying to create it). The abstract is irrelevant as it was written 3 months ago and the world has changed so much the abstract is either out of date or so general it doesn’t matter.
First thanks. When you are likely to run into time problems, thank people at the start. Here are some (please let me know if I have missed anyone – it’s easy to do). Almost everyone on this list has hacked something.
Cyberheroes (Mainly Blue Obelisk)
Bob Hanson (Jmol)
Christoph Steinbeck (Cologne)
Egon Willighagen (Cologne)
Tobias Helmus (Cologne)
Stefan Kuhn (Cologne)
Ola Spjuth (Uppsala)
Martin Eklund (Uppsala)
Miguel Howard (Jmol)
Joerg Wegner (Tuebingen/ALTOVA)
Rich Apodaca (Stanford)
Rajarshi Guha
Geoffrey Hutchison (Cornell)
Indiana:
Gary Wiggins
David Wild
Geoff Fox
Marlon Pierce
Symbiote: Henry Rzepa
Cambridge:
Ann Copestake and colleagues
Peter Corbett
Nick Day
Jim Downing
Justin Davies
Richard Moore
Joe Townsend
Alan Tonge
Andrew Walkingshaw
Andrew Walker
Toby White
Sponsors:
DTI, Accelrys, IBM
Unilever
Royal Soc Chemistry, Int Union of Crystallography, Nature Publishing Group
JISC
EPSRC
BBSRC
The current workflow in chemical informatics is broken. A typical scenario is:

nonxmlchain.png

Here we see legacy programs, human activities and legacy data. At each stage a human has to cut and paste stuff, edit it, etc. This causes loss of time, loss of quality and loss of temper. Wouldn’t it be easier if everything was in a consistent interoperable format like this?

Here all the data is in XML with semantic markup. Human input goes seamlessly into programs, databases, etc. The outputs pass between programs, displays, etc. with no semantic loss and no friction. XML ontologies add meaning to all information components. The basic components now exist in enough cases that we can build mashed up systems.
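As a rough illustration of what "no cut and paste" means in practice, here is a minimal Python sketch that reads a small hand-written CML-style molecule and hands atoms and bonds straight to code. The fragment (water) and the helper names `read_molecule` and `formula_counts` are assumptions made for this example; it is not the code used in the demos, and real CML files carry far more information.

```python
import xml.etree.ElementTree as ET

# CML namespace; the fragment below is hand-written for illustration only.
NS = {"cml": "http://www.xml-cml.org/schema"}

CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def read_molecule(xml_text):
    """Return (molecule id, {atom id: element}, [(atom id, atom id), ...])."""
    root = ET.fromstring(xml_text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.iterfind("cml:atomArray/cml:atom", NS)
    }
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.iterfind("cml:bondArray/cml:bond", NS)
    ]
    return root.get("id"), atoms, bonds


def formula_counts(atoms):
    """Count atoms per element, e.g. for building a molecular formula."""
    counts = {}
    for element in atoms.values():
        counts[element] = counts.get(element, 0) + 1
    return counts


mol_id, atoms, bonds = read_molecule(CML)
print(mol_id, formula_counts(atoms), bonds)
```

Every program in the chain can read the same elements and attributes, so nothing has to be retyped or re-guessed between steps.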
I’m going to demonstrate some mashups. Some demos use the Internet, some have been locally crafted. Obviously we can’t demonstrate the 6 month project where we ran 1 million jobs with all interfaces in XML. Here are some of the components:
programs: MOPAC, GAMESS-US, DL_POLY, GULP, SIESTA, CASTEP, METADISE, etc.
editors: Jchempaint, etc.
renderers: Jchempaint, Jmol, JSpecView
Rich Client: Bioclipse
Services: CMLRSS, InChI, OpenBabel
Repository: SPECTRa/DSPACE (CMLCryst, CMLSpect, CMLComp)
Toolkits: CDK, JUMBO, JOELib, CIF2CML, OSCAR3, OSCARDATA
Demonstrations:
Semantic Markup (MACiE)
Simple Mashup: Placeopedia
Chemical Mashup: GoogleInChI (InChI API + Google search API)
Semantic data and linking (clickable graphs and tables in CML). Jmol display
Journal-eating robots: OSCAR-DATA (chemical data)
OSCAR3 (chemical text and names) – mashup with PubChem
Reposition of data (SPECTRa) in institutional repositories
CMLRSS: molecular feeds (on Acta Crystallographica)
Rich client: Bioclipse
I shall try to get through all of these in 21.5 minutes – if the connections are slower I may have to omit some. At the end it should be clear that there is enough technology from the Open Source community to take chemistry into the 21st Century.
The next post will cover descriptions and predictions…
——————————————-
Some of these have static URLs and can be viewed relatively easily and robustly
http://www.placeopedia.com
http://wwmm.ch.cam.ac.uk/cryst/summary/acta/e/2006/07-00/ (static repository of Acta Crystallographica CIFs)
http://www-mitchell.ch.cam.ac.uk/macie/ (MACiE) (100 Entries | M0001 | animate reaction – needs IE)
http://wwmm-svc.ch.cam.ac.uk/wwmm/html/googleinchiserver.html GoogleInChI