Open Knowledge Foundation

After my post on the “tragedy of the lurkers” I thought I would raise it on the Open Knowledge Foundation mailing list. The OKFN is the creation of Rufus Pollock – one of the advantages of living in Cambridge is that it is easy to meet people from many disciplines. Rufus also set up the CB2 cafe meetings I posted about…

The Open Knowledge Foundation exists to address these challenges by promoting the openness of knowledge in all its forms, in the belief that freer access to information will have far-reaching social and commercial benefits. In particular, we

  • Promote the idea of open knowledge, for example by running a series of forums.
  • Instigate and support projects related to the creation and distribution of open knowledge.
  • Campaign against restrictions, both legal and non-legal, on open knowledge.

So I posted a request to see whether there was any economic theory about the problem of lurking around the anticommons. Rufus replied:


iii) (under-provision): under-provision of a good (be it rival or
nonrival). This will usually be due to the free-rider problem, which
arises from imperfect excludability (and imperfect information). Here
the issue is that because it will be possible to benefit from the good
without contributing to its production some people will try to 'free-ride'.
Item (iii) is the classic justification for creating monopoly rights in
knowledge (Intellectual Property). By giving out the right to exclude
(the monopoly) the aim is to force those who would otherwise free-ride
to contribute to the production of the good. Of course in doing so one
incurs two costs:
1. 'under-use' of the good (some people will be excluded who shouldn't
be -- the good, after all, is nonrival, so it is optimal for everyone
to get access to it)
2. 'under-production' as a result of item (ii) -- reuse is reduced below
the optimum
The issue of 'lurkers' that you discuss falls firmly into category
(iii). 'Lurkers' after all are those who use the knowledge you've
created without contributing back. Whether this constitutes a tragedy is
difficult to say. Of course it would be better if more of these people
contributed to the projects they used -- and the lack of contribution
may well be resulting in 'under-production'. However in trying to do
anything about this one is caught in the classic dilemma: in trying to
exclude people (or force them to contribute more) one will reduce usage
and perhaps prevent the full reuse of one's work (and you'll almost
certainly reduce the contributions of those who currently *aren't* lurking).
Nevertheless there is nothing to prevent you exhorting those 'lurkers'
to do more (and particularly where we are talking about 'Big Pharma'
perhaps getting out their cheque books ...)

All of which makes good sense. So no magic remedy from economic theory, but a good indication that we are on the right track… There are subsequent posts…
This afternoon at Science Commons we had an impressive and in-depth analysis from Paul David which re-emphasised the problem of the anticommons. If any economist wants an example of the worst possible anticommons scenario to study, then chemical publications, information and software are a very good place to start. Nonetheless we are starting to see cracks…
P.

Posted in data, open issues | 1 Comment

ASATW: Beth's Blog Notes and virtual communities

Beth Ritter-Guth mailed me yesterday…
As you will see, I have started generating notes about your blog on
>my research wiki. Please feel free to make comments on the wiki or
>through email if you wish to clarify my statements or add to them:
http://bethritterguth.wikispaces.com/PMRblognotes

I asked if I could publicize these and Beth agreed. (She will also correct the spelling mistakes…)
The primary purpose of this is Beth’s research (or scholarship – an underused word). I would hope that readers can give Beth feedback which may be helpful. We all must agree that we must be honest and clear in what we write and not be afraid to critique others and to receive criticism in public. That is part of the Open process.
Of course one of the problems of Internet communities is that they are self-selecting. This means that they can flourish quickly because everyone shares fairly common views. It would be good to have some counter-balance from other communities, but this isn’t easy when they are not involved in e-activities at all!
Anyone with experience of Internet/virtual communities knows that differences of viewpoint, strategy, etc. can become magnified quite rapidly. As an example, the BO has had a heated debate over another Open project, Gallium vs. Kalzium (both Open offerings of the periodic table under UNIX and other systems). I haven’t followed all of it, but it seems as if the projects will be happy to find common and distinct ground, and both are enriched.
So Beth, feel free to critique this blog accurately – your assessors expect no less. I am feeling my way and often have relatively little idea where this is going. Unlike single-topic blogs, especially those that report current issues, this is more free-ranging. I had thought it occasional, but now it seems to have taken hold. The CML blog has stagnated and I need to do something there. I hope there are more comments on ASATW, because an analysis of just my ideas exposes the limitations of my vision, knowledge and typing speed.
For interest I ran something slightly similar for XML-DEV many years ago which I called XML-DEV jewels. Unfortunately the link in there is broken because when I left Nottingham I lost all my digital objects. There may be some copies elsewhere – I’ll write on this later.

Posted in "virtual communities", Uncategorized | 1 Comment

Open Source and the Tragedy of the Lurkers

In my last post (Science Commons and Pasteur’s Quadrant) I pointed readers to the collection of vision papers for next week’s meeting on the Science Commons. I ended it with an implied challenge to the pharmaceutical industry that they were, in effect, parasitic on the Open Source movement. I also used the phrase “Commons” frequently.
Wikipedia describes a common as

In England and Wales, a common (or common land) is a piece of land over which other people—often neighbouring landowners—could exercise one of a number of traditional rights, such as allowing their cattle to graze upon it. The older texts use the word “common” to denote any such right, but more modern usage is to refer to particular rights of common, and to reserve the word “common” for the land over which the rights are exercised. By extension, the term “commons” has come to be applied to other resources which a community has rights or access to.

Again from WP:

The tragedy of the commons is a phrase used to refer to a class of phenomena that involve a conflict for resources between individual interests and the common good. The term derives originally from a parable published by William Forster Lloyd in his 1833 book on population. It was then popularized and extended by Garrett Hardin in his 1968 Science essay “The Tragedy of the Commons”. See also the enclosure of the commons, and its attendant social problems, which may have inspired the content of the parable. The opposite situation to a tragedy of the commons is sometimes referred to as a tragedy of the anticommons.

which itself is described as:

The tragedy of the anticommons is a situation where rational individuals (acting separately) collectively waste a given resource by under-utilizing it. […] This happens when too many individuals have rights of exclusion (such as property rights) in a scarce resource. […]
The term “tragedy of the anticommons” was originally coined by Harvard Law professor Frank Michelman and popularized in 1998 by Michael Heller, a law professor at Columbia Law School. [They] pointed to biomedical research as one of the key areas where competing patent rights actually prevent useful and affordable products from reaching the marketplace.

On the Internet and in the digital age the initial cost of manufacture is high, but the cost of copying is effectively zero, so it can be argued that there is no tragedy of the commons and that re-use has a beneficial effect (e.g. The Comedy of the Commons, a speech by Lawrence Lessig – audio). So why am I upset that pharma companies may use Open Source software and data without contributing to it?
I am not an economist and it could be useful if someone put this in terms of economic theory as for the Commons. At present I am calling it the Tragedy of the Lurkers. WP defines a lurker as:

In Internet culture, a lurker is a person who reads discussions on a message board, newsgroup, chatroom or other interactive system, but rarely participates.

I have been involved in building and helping to run virtual communities for many years, and have a rough figure that 90% of any Internet group are lurkers (BTW this is not a pejorative term, and has nothing to do with lurkers on commons in real life). A listmanager or MOO-wizard will be able to estimate the percentage of lurkers, but other members cannot. A project at SourceForge such as CDK has statistics on downloads and commits but gives no idea of how many people use it.
LAST WEEK CDK WAS 26TH IN THE WHOLE OF SOURCEFORGE (1million projects).
That is competing with vastly successful IT projects in very widespread use. So here we have a very widely used system and complete silence from the user community. That is a tragedy of some sort.
When I speak to people in the pharma industry I frequently hear “we have run a million compounds through InChI”, “we use Openbabel for our file conversions”, “we use CML for …”. Yet the authors and developers of these systems know NOTHING of these activities. There is a hinterland of unknown size that is completely silent.
Given that the cost of replication is zero (and borne by SourceForge and the net), why should this matter? It doesn’t detract from our development activity. If the lurkers stopped we wouldn’t notice anything (except some lower stats on SF). But yes, there is a tragedy.
Open Source developers have a very lonely road. It can be years before anything takes off. So the most important thing is the community. We invest our resources in the expectation of the community developing. There is a natural hope that the users of the goods will, in some way, contribute. There is, of course, no legal requirement, but I think there is a moral one. Contributions need not be code or financial (though these are appreciated) but can be bug reporting, use cases, documentation, and simple moral support. If by a lack of such contribution users make the future development of the good less easy than it would have been, there is a tragedy.
I’ve worked in pharma – it can be very secretive. But I suspect there are many people in pharma who not only use Open Source but have developed material to contribute. Perhaps it is fear we have to overcome…


see also:
Online Consumer Communities: Escaping the Tragedy of the Digital Commons

Posted in "virtual communities", chemistry, open issues | 5 Comments

Science Commons and Pasteur's Quadrant

I’m in Washington, in my favourite guest house in the US, Woodley Park Guest House (near the Zoo). It’s small and we all have breakfast together, which gives a great atmosphere – so much better than the amorphous chain hotels. The only problem is it’s often full…
I’m here for a meeting Creating a Vision for Making Scientific Data Accessible Across Disciplines run by Science Commons

[Science Commons,] a project of the non-profit Creative Commons, is the sponsor and organizer of the Commons of Science Conference. Our goal is to promote innovation in science by lowering the legal and technical costs of the sharing and reuse of scientific work. We remove unnecessary obstacles to scientific collaboration by creating voluntary legal regimes for research and development.

which I expect to be very exciting and useful.

“The Conference is an invitation-only gathering of scientists, policy makers, and commons advocates who are actively interested in designing ways to make access to scientific data more widely available and more transparent across all scientific disciplines. Anyone is welcome to read the Background information, Vision Papers, or browse the list of Conference participants.”

I have been reading the papers on the plane and I am impressed with the clarity and common views that are set out there. For example

Robert Cook-Deegan and Tom Dedeurwaerdere. 2006. The Science Commons in Life Science Research: Structure, Function, and Value of Access to Genetic Diversity.

sets out clearly the battle for the public ownership of genomes, while

Paul A. David. 2004. Towards a Cyberinfrastructure for Enhanced Scientific Collaboration: Providing Its ‘Soft’ Foundations May Be the Hardest Part
argues that although we have made considerable technical progress on cyberinfrastructure / eScience, we risk losing the value unless we can solve the social problems of collaboration such as IPR, liability and contractual working.

Several authors mention Stokes’s book “Pasteur’s Quadrant” (here is the blurb). I hadn’t come across this and it’s worth understanding to start with. In simple terms Stokes argues that there are two axes to research, “pure” and “applied”. Bohr epitomises “pure”, Edison “applied”, and Pasteur both. (The “neither” quadrant ought to be uninhabited…) Again several papers contrast Vannevar Bush’s vision of the value of pure research with the increasing current move into Pasteur’s quadrant.

Read the papers yourself – it’s essential if you are interested in the balance between privatised information and public commons. They note that it was Merck, not the universities, that kept the genome in the public commons – the universities would have supported the “tragedy of the anticommons”, the piecemeal private ownership of myriads of small parcels of intellectual property.
Unfortunately for me most of this debate is centered on biosciences and geosciences. I don’t find many chemists who are concerned about their commons – witness the near-global chemical silence over PubChem.
However I think things are starting to change in the pharmaceutical industry. Some people in it realise that ownership and secrecy of information is not necessarily working. For example in Egon Willighagen’s blog he reviews “Can open-source R&D reinvigorate drug research?” (doi:10.1038/nrd2131 – closed access, so you have to find it yourself) by Bernard Munos. This reviews the Open Source and public-private efforts sponsored by the pharmaceutical industry. It ends:

Other tools such as eMolecules, Jmol or the Chemistry Development Kit are adding powerful chemical search and visualization capabilities to the open-source scientist’s toolbox.

Thank you for the publicity! But perhaps you don’t realise that Jmol and CDK are written by unpaid volunteers with day jobs (such as teaching and PhDs). At present pharma contributes essentially nothing to Open Source chemistry. I can personally list 2000 USD from Merck that we used to add code to Open Babel, but I’m not aware of any other pharma funding for Open Source software or data. Or any political support for the chemical commons (where were you all during PubChem? – I didn’t see a single voice from pharma – but you are all using it now…). Given Merck’s excellent track record on the genome, this looks like an oversight – perhaps this blog will help to bring it to people’s attention 🙂

Posted in chemistry, data, open issues | 3 Comments

Datuments and the ACS Style Guide

I was delighted to receive a special book yesterday:
“The ACS Style Guide”
Effective Communication of Scientific Information
 Anne Coghill and Lorrin Garson.
OUP ISBN-13:978-0-8412-3999-9
It’s an attractively produced hardback volume and I’m torn as to whether I should keep it as pristine as possible or cover it with annotations. I think I’ll do the latter!
The editors did me and Henry Rzepa the honour of contributing a chapter on Markup Languages, which we have called:
“Markup Languages and the Datument”.
In the Foreword Madeleine Jacobs, Executive Director/CEO of ACS writes:
“I fell in love with chemistry when I was 13. I fell in love with writing at the age of four…” and
“The goal of the [guide] is to help authors and editors achieve […] ease and grace in all of their communications”
So the editors asked Henry and me to look ahead and write about style in an environment that is still building itself. Obviously we shall be out of date in some respects very soon, but we have tried to anticipate the closer linkage of machines and humans in science – epitomised by Tim Berners-Lee’s Semantic Web. The scientific publication of the future will soon be very different from what we do now – the younger generation may soon not use pen and paper, and will expect instant multichannel information. Science has to react.
So as a first step Henry and I have coined the term “datument” [1] – a portmanteau of “document” and “data”. This is a single compound (or hyper-) document representing the complete experimental and scientific environment of the researcher or scholar. The first steps are to integrate multiple markup languages (e.g. MathML, XHTML, SVG, + CML, AnIML and ThermoML in chemistry). Each language has an intelligent browser or other user agent which can understand the appropriate part of the document. And this is not just creating something that is visual – an equation might say “integrate me” – a molecule might say “I can give you my molecular weight and you can calculate my logP”. When we have rich clients such as Bioclipse (more later) we shall be able to let our machines read the boring bits of the paper while concentrating on the more complex results. Already our group can read a datument and send it off to calculate additional properties of the molecules. This takes a few minutes so the human can read the text while the machine enhances the data.
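To make the idea of a machine “reading” a datument concrete, here is a minimal sketch: an XHTML fragment carrying an embedded CML molecule, from which a small program extracts the explicit atoms and sums their masses. The fragment, the atomic-mass table and the function name are all illustrative assumptions for this post, not taken from any real datument or from the CML toolkits themselves.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical datument fragment: XHTML prose wrapping a CML molecule.
# The molecule and its atom list are illustrative only (hydrogens are implicit).
DATUMENT = """
<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:cml="http://www.xml-cml.org/schema">
  <p>The solvent was ethanol:</p>
  <cml:molecule id="ethanol">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C"/>
      <cml:atom id="a2" elementType="C"/>
      <cml:atom id="a3" elementType="O"/>
    </cml:atomArray>
  </cml:molecule>
</div>
"""

# Approximate standard atomic masses for the elements used above.
MASS = {"C": 12.011, "O": 15.999}

CML = "{http://www.xml-cml.org/schema}"

def heavy_atom_mass(xml_text: str) -> float:
    """Sum the masses of the explicit (heavy) atoms in every cml:atom element."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for atom in root.iter(CML + "atom"):
        total += MASS[atom.get("elementType")]
    return total

print(round(heavy_atom_mass(DATUMENT), 3))  # → 40.021
```

The point is not the arithmetic but the architecture: because the chemistry is marked up in its own namespace, a chemical user agent can pull the molecule out of the prose without any screen-scraping, exactly as a MathML-aware agent would pull out an equation.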
The previous style guide was published in 1997 and our contribution will look very strange in 2015! I hope that some of the ideas still make sense in that brave future. I gently predict that the Style Guide then will look very different from the book today. But I shall still need to be able to “write on it”!
I’ve been invited to the ACS on Thursday next week and hope to be able to meet some of the other authors. I’ll be taking the guide as my reading on the plane.
[1] This works in IE. It used to work in Firefox. The upgrades have broken it. Since the datument is on the publishers’ site there isn’t much we can do (though perhaps we should take a copy and mend it ourselves). It is so frustrating to have to fight the browsers every few months…
 

Posted in chemistry, general | Leave a comment

Wikipedia: Getting started

Sometime last year I made my first edit to Wikipedia. I was extremely nervous despite many years on the web and having built and run virtual communities. What if I said something stupid? Or broke one of the rules? Since the whole history is recorded I can’t wipe out my mistakes!
I have forgotten exactly what I edited but it was probably changing a little bit of syntax and hoping no one would notice! Nothing went wrong, and I grew in confidence. I found a few things I know about and perhaps added a link. And after a time got to adding new sentences.
These were probably anonymous (i.e. I was only identified by my IP – and that is dynamic). There is no shame in being anonymous and many prolific contributors stay that way. But some register with WP and take up a username – mine is petermr as I have used that for 15 years on the public web. So I am probably easily identifiable 🙂
As soon as you register you get a “Talk” page where messages can be left. The first thing is a Greeting:

Welcome! Hello Petermr, and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Here are a few good links for newcomers:

I hope you enjoy editing here and being a Wikipedian! Please sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. If you have any questions, check out Wikipedia:Where to ask a question or ask me on my talk page. Again, welcome! – UtherSRG (talk) 13:02, 5 December 2005 (UTC)

I already knew how to edit Wikis but even if you get it wrong someone will tidy it up. The main thing that worried me was whether I would be able to fulfil the high standards. So I read the five pillars and they are worth reproducing in full:

Wikipedia is an encyclopedia incorporating elements of general encyclopedias, specialized encyclopedias, and almanacs. Wikipedia is not an indiscriminate collection of information. It is not a trivia collection, a soapbox, a vanity publisher, an experiment in anarchy or democracy, or a web directory. Nor is Wikipedia a collection of source documents, a dictionary, or a newspaper, for these kinds of content should be contributed to the sister projects, Wikisource, Wiktionary, and Wikinews, respectively. Wikipedia is not the place to insert your own opinions, experiences, or arguments — all editors must follow our no original research policy and strive for accuracy.
Wikipedia has a neutral point of view, which means we strive for articles that advocate no single point of view. Sometimes this requires representing multiple points of view; presenting each point of view accurately; providing context for any given point of view, so that readers understand whose view the point represents; and presenting no one point of view as “the truth” or “the best view”. It means citing verifiable, authoritative sources whenever possible, especially on controversial topics. When a conflict arises as to which version is the most neutral, declare a cool-down period and tag the article as disputed; hammer out details on the talk page and follow dispute resolution.
Wikipedia is free content that anyone may edit. All text is available under the GNU Free Documentation License (GFDL) and may be distributed or linked accordingly. Recognize that articles can be changed by anyone and no individual controls any specific article; therefore, any writing you contribute can be mercilessly edited and redistributed at will by the community. Do not submit copyright infringements or works licensed in a way incompatible with the GFDL.
Wikipedia has a code of conduct: Respect your fellow Wikipedians even when you may not agree with them. Be civil. Avoid making personal attacks or sweeping generalizations. Stay cool when the editing gets hot; avoid lame edit wars by following the three-revert rule; remember that there are 1,408,046 articles on the English Wikipedia to work on and discuss. Act in good faith by never disrupting Wikipedia to illustrate a point, and assume good faith on the part of others. Be open, welcoming, and inclusive.
Wikipedia does not have firm rules besides the five general principles elucidated here. Be bold in editing, moving, and modifying articles, because the joy of editing is that although it should be aimed for, perfection isn’t required. And don’t worry about messing up. All prior versions of articles are kept, so there is no way that you can accidentally damage Wikipedia or irretrievably destroy content. But remember — whatever you write here will be preserved for posterity.

So how do these rules relate to creating an “Open Data” entry?
First we should ask whether it is necessary. Frequently we see duplicate entries in WP that zealous editors spot and suggest should be merged or otherwise tidied. For example, OD might be seen as part of Open Access. I don’t believe it is and will defend this view with reasoned arguments and historical references.
Secondly we must strive for a Neutral Point of View. That means that I and others must not use it to promote OD, although we can reasonably list some of our writings if they are substantive to the entry. The entry is not “mine” but “ours”. It would be completely appropriate to collect evidence that there was opposition to Open Data. But the page is NOT a debate between two sides, however carefully reasoned, although it could record such debates if they were deemed sufficiently important.
Soon we’ll create an entry and follow its progress…
P.

Posted in "virtual communities", open issues | 3 Comments

Let's write a Wikipedia article

I have always been enthralled by the idea of a worldwide knowledgebase, and a decade ago Lesley West, Alan Mills and I developed a technology to create a worldwide terminology. The Virtual Hyperglossary (TM) [probably the earliest use of this term] proposed terminological entries with unique identifiers in cascading dictionaries which – in principle – could resolve any term. It was ahead of its time, and although we had several groups who were attracted, the technology did not exist.
Wikipedia (WP) has hit the right place at the right time. The Web is always able to tolerate many failures and WP was not the first attempt at a virtual encyclopedia. But it has the right combination of funding, zeitgeist, and technology. Something like this was bound to happen around now – it has turned out to be WP.
Many academic colleagues pooh-pooh WP as uncurated and capable of corruption. They are shortsighted – in <=2 years’ time I suspect WP will be standard reading in all undergraduate science and technology courses. This year – 2006 – has seen a critical mass of contributors in all subjects (with chemistry, as always, lagging behind the rest). The maths and physics are superb. The chemistry is good (given the current total disdain of almost all of the community). I salute the efforts of the relatively few who have laboured to create many excellent pages. I have predicted that in <= 5 years WP chemistry will be consulted more frequently than standard reference works such as the Merck Index (the recent edition is a massive red paper volume).
So how does WP work? Simply, anyone in the world can contribute and anyone can change what previous authors have contributed. And contributions can be anonymous. So isn’t this just mindless wibble? Unsurprisingly (to me, at least), no. As an example take the first thing that came to my mind: the Gibbs-Duhem relation (like other chemists I struggled with this as an undergraduate).
[Eyeball the article] to get a feel for the scope and quality.
I immediately get a feeling of competence and relevance to what I need. I will read this article with confidence if I need to know about this area of thermodynamics. How can I do this when I know nothing about the people who have written it? Couldn’t it be the delusion of a perpetual motionist? Or some failed undergraduate?
No. The reason is in the history. It was started in 2003, and has over 50 edits. My own experience is that scientific entries are heavily edited until there is acceptable consensus (there is a different approach to contentious issues – e.g. politics). You can see that the frequency of edits is slowing – a good sign that the entry has stabilised. You will see that there are several editors of which one, PAR, has made a large number of edits. PAR’s home page again manifests a high-quality contributor (I have no idea who s/he is). But note, also, many other edits with specialist or niche contributions.
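For readers who, like me, last met it as undergraduates, the relation under discussion can be stated (in standard notation, with $N_i$ the amount of species $i$, $\mu_i$ its chemical potential, $S$ the entropy and $V$ the volume) as:

```latex
\sum_i N_i \, d\mu_i = -S \, dT + V \, dP
```

which follows from differentiating $G = \sum_i \mu_i N_i$ and comparing with $dG = -S\,dT + V\,dP + \sum_i \mu_i\,dN_i$ — exactly the kind of derivation the WP article walks through.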
So I myself have started a few pages (e.g. Molecular Graphics, to which, having been an Officer of the Molecular Graphics Society, I feel I can make a moderately authoritative contribution). There should be no pride in having done this as the work is not “mine” but the community’s. I’ve probably spent a day or two on this as I care about the discipline and its history (which is so easily lost). I have been the substantive contributor, but various people have made contributions to formatting and style which are very useful – this consistency of presentation in WP is one of its great strengths.
In general physical science is often uncontroversial and so it is fairly easy to have a neutral point of view (NPOV). When we come to “Open Data” we shall have to be careful to avoid factionalism and advocacy and to research our sources.

Posted in "virtual communities", open issues | Leave a comment

"Open"

As I mentioned earlier I am about to start a Wikipedia entry on “Open Data”. Lorcan Dempsey noted that this was quite a common way of approaching a communal subject.
So I shall take readers through the process of creating a WP entry and hope to convince the unconvinced that this is a high-quality scholarly activity with appropriate checks and balances.
While I was doing this Beth Ritter-Guth has been creating an analysis of our shared vision for the Blue Obelisk. She has taken the discourse on the Blue Obelisk mailing list and her own discussions with Jean-Claude Bradley and summarised these. It is extremely valuable to have such a summary, as often when new ideas and activities are started the participants are so engrossed in the detail they don’t have time to look at the wider picture. You can find links to the discussion in Jean-Claude’s recent summary.
My current position – and it has changed as a result of the discussion – is that the term “Open” both unites us and causes potential confusion. “Open” has connotations of trust, collaboration, innovation, etc. but because someone espouses “Open X” that doesn’t mean they espouse “Open Y”.
I realised this at a chemical informatics meeting last year. I gave my usual rant about Open Chemistry and the semantic web and then a software salesman talked about their product. He described it as having an “Open API”. [API = application programming interface; the instructions on how to configure the software.] I asked if it was published on the Web and he said no, it was a trade secret. So here “Open” = a manual that paying customers can read (as opposed to a product where customers have no idea how to configure it).
Our discussions on the Blue Obelisk mailing list revolved around the term “Open Source”. We use this in Blue Obelisk to mean “Open Source software” as defined by the Open Source Initiative. [The BO mantra is ODOSOS (Open Data, Open Source, Open Standards).] Naively I assumed that this was the only use of the term “Open Source”. However Jean-Claude uses the term “Open Source Science”, and Beth had assumed that the philosophy behind Open Source software and Open Source science were identical. In fact I (and I suspect most other BO members) had not heard of Open Source Science (example). So I looked this up and found it has been used for about 2 years to mean an approach to science which relies on collaboration and openness at an early stage in the process. Here is Jean-Claude on patents.
It seems reasonable to extend “Open Source” philosophy to other initiatives that share some of the general principles of Open Source computing. However we cannot assume that the actual practice is compatible. Having looked at Wikipedia I find that “Open Source” is so widespread it needs a disambiguation page which lists an amazing number of “Open Source Foo”:

  • Specific products
  • Licensing
  • Society and culture
  • Procedures
  • Organisations
  • Open-source software related
  • Miscellaneous:

  • Open Source, a radio show using open content information gathering methods hosted by Christopher Lydon
  • Open source intelligence, an intelligence gathering discipline based on information collected from open sources, i.e. information available to the general public.

This means that any use of “Open” is likely to be fuzzy and confusing. The “Open Access” movement is broad and supports several major points of view which, though overlapping, have significant differences either in pragmatics or philosophy. Moreover “Open Foo” does not imply “Open Bar”. Thus “Open Access” publications will not by themselves ensure “Open Data”.
More on this later…

Posted in "virtual communities", open issues | 2 Comments

Hamburger House of Horrors (1)

This is an occasional series indebted to Hammer House of Horrors. You don’t need to be a chemist to understand the message.
It’s sparked off by a comment from Totally Synthetic in this blog:

A good deal of the reasoning behind transcription of spectral data in publication is to impart meaning to the spectra. The 1H NMR spectrum of rasfonin, for instance, would be indecipherable to me, but the data written in the publication, transcribed by the author and annotated for every peak, would make (more) sense. It’s great to get an idea what the spectra look like, but more often than not, the actual spectra can be found in the supplementary data as a scan of the original. The combination of these two data sources gives the synthetic chemist everything they need.

Before I get onto the horror, let me make it very clear that Tot. Syn.’s blog is excellent and I’m hoping that he can meet us at the pub on Monday lunchtime. His blog is a model of the future of chemoinformatics and we’d like to bounce some ideas off him.
(I’m also not specifically criticising the authors of the paper – at least not more than all other organic chemists because this supporting information (SI) is typical. I am of course suggesting gently that the process of publishing organic chemical experiments is seriously and universally broken).
The supporting information is a hamburger PDF and this example makes my point excellently. (Please readers, read it – or as much as you can manage – as I need help, especially from anyone who is involved in graphical communication.) It’s a separate document from the original paper and, even though it is on the ACS site, it remarkably seems to be openly viewable. Maybe the ACS will close it sometime, or maybe this exercise shows that Openness enhances downloads.
The SI draws the spectra on their sides! This is a clear indication that they aren’t meant to be read on the screen, but printed out. But the SI is 106 pages long. That’s not unusual – we have seen over 200 pages. I am sure that many organic chemists who want to read it will print it out rather than trying to read it on the screen. The spectra run from pp 36-107 with no navigational aids – if you want to link a compound to its spectrum you have to scroll through the spectra till you find its formula. Some compounds are depicted as chemical formulae on the spectra and some, but not all, contain index numbers (bold in the text).
Let’s assume that you are at a terminal and your lab has used up its paper budget. You scroll down to the infrared spectrum of a compound:
rasfonin0.png
It doesn’t look very promising, so you turn your head 90 degrees to look at it. Not very comfortable. Fortunately there is a tool in Adobe Reader that rotates the page to give:
rasfonin1.png
This is awful. It looks like the spectra I used to collect 30 years ago when the pen plotter was running out of ink (before that we plotted the spectra by hand – it’s good for the soul). The resolution is probably 0.1 or better in the x-direction. I have no idea why it is so awful.
Now we want to look back at the text where the author has made the annotations (there are no annotations on the spectra so we have to skip back 70 pages) to find:
rasfonin2.png
Our helpful Adobe Reader has turned all the pages round, so we have to turn this one back again. And, I suspect, the only realistic way to navigate this is to print it out.
The authors obviously spent a lot of time preparing this SI. The publisher probably calls it a “creative work” – you can claim copyright on creative works. I’d call it a destructive work. It doesn’t actually have a copyright notice, although the ACS has a meta-copyright where they assert copyright over all SI (except one from Henry Rzepa and me).
Now – please help me with the PDF. I have blogged earlier about OSCAR – the data extraction tool that can extract large amounts of information from chemical papers in HTML or even Word. But it doesn’t work with PDF. Is there any way of extracting all the characters from this document? If I try to cut and paste I can only get one page at a time. Yes, I could probably hack something together with PDFBox. But otherwise PDF is an appallingly efficient way of locking up, and therefore destroying, information.
The message is simple:
STOP USING PDF FOR SCIENTIFIC INFORMATION
DO NOT USE PDF FOR DIGITAL CURATION

Posted in chemistry, data, open issues | 7 Comments

GoogleInChI

Two months ago I was invited by Timo Hannay of Nature to a Nature/O’Reilly FooCamp at GooglePlex. Unfortunately I was already booked and Peter Corbett was able to step in. But there was a generic invitation from Leslie Hawthorn (who has just been running the Google Summer of Code) so last week in California I took a day off the ACS meeting to go to Google and offered to give a talk about the potential of Google in Chemistry using InChI.
There are millions of known chemical compounds and they are all distinct. It is very valuable to give each a unique identifier, and until recently this had to be done by an authority (Chemical Abstracts Service, Beilstein, etc.). This is problematic as the numbers are copyrighted and you have to pay to look up the formal link between number and compound. Recently the International Union of Pure and Applied Chemistry has developed an identifier, the InChI, that can be automatically generated from the chemical structure with a free Open Source program. This means that anyone can generate an InChI and the result for a given molecule will always be the same. So if we want to search for a molecule, all we have to do is generate its InChI and see if Google has indexed it.
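As a sketch of the idea: once you have an InChI (the ethanol InChI below is just an illustration, and the URL form is the modern one, not necessarily what we used), building the exact-phrase Google query is a one-liner.

```python
# Sketch: build a Google exact-phrase query URL for an InChI string.
# The InChI below (ethanol) is only an illustration; real InChIs come
# from the free IUPAC InChI program.
from urllib.parse import quote_plus

def google_inchi_url(inchi):
    """Return a Google search URL quoting the InChI as an exact phrase."""
    return "https://www.google.com/search?q=" + quote_plus('"' + inchi + '"')

ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
url = google_inchi_url(ethanol)
```

Because the InChI is canonical, everyone searching for the same molecule generates the same query string – which is why the hits are so precise.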
Yong Zhang in our group set up a server and we were able to show that InChIs could be discovered in Google. Nick Day showed last year that they worked incredibly well. The University of Southampton crystallographer Simon Coles had put 100 compounds on the web and used Nick’s approach to add InChIs. When Nick searched for them using Google he found all 100 and no junk. This must be one of the most accurate searches ever done!
So I was able to present these ideas to Leslie and colleagues and she offered to record the talk on video (ca 55 min) – and they do not retain copyright. Unfortunately when I came to the GoogleInChI demo I discovered that our service had died. A pity, but all the other demos worked.
It was really nice to meet Leslie and colleagues and start planning joint activities. There is a very different attitude there to that in many other companies. They are keen on Open Source and are also looking to provide new services in GoogleBase – perhaps more of that later.
Of course Google is a commercial organisation and not a charity but there is a lot of shared vision – we have different things to contribute to the vision. For example at the eScience meeting I’ve just been at there were many demos including GoogleMaps. It has made a considerable impression.
Who knows – Googlechem?

Posted in chemistry, general | 1 Comment