petermr's blog

A Scientist and the Web


Archive for the ‘general’ Category

Does linking to technorati tags generate spam?

Sunday, September 23rd, 2007

In a recent post (blogs, folksonomies and tagging – get going!) I encouraged the Open Access community to start using blogs and tagging. I specifically pointed to Technorati to illustrate the value and showed that some conferences had huge amounts of traffic and others almost none. I gave several examples and gave links to the technorati summary of the posts under given tags. This was based on a particular URL structure.

On revisiting these sites I find that the lists at Technorati have been drastically altered. The berlin5 one has 11 porno spam links. The method is a fairly recent one – take the content of a genuine post and do some very crude lexical munging of the words and phrases (I get zillions of these each day submitted to the blog comments). Somehow they actually linked to sex sites in Cambridge, so maybe they interpret domain names. So it seems the spammers have found my post yesterday and somewhere generated spam content that is either injected into Technorati or has already been linked. AFAICS the genuine links are still there.

Then I looked at www2007 worrying that I would see the same. But whereas there were 300+ links yesterday to www2007, now there are only 6, all half a year old. Was Technorati spammed, and did it then try to clean up?

If by linking to Technorati I have unwittingly generated spam I apologize, but this can be done in other ways.

I don’t take Technorati counts very seriously – about as seriously as I take ISI citation counts – but it’s a useful way of finding people. But maybe we have to be careful about the exact way we use it. I welcome enlightenment.

blogs, folksonomies and tagging – get going!

Saturday, September 22nd, 2007

At the recent “Berlin 5” meeting on Open Access I noted sadly that I was the only person blogging the meeting. Normally there are many bloggers at the meetings I go to so I (and everyone else) can choose what they blog. At berlin5 I felt it was important to show the way so I hacked some notes together for many of the talks – generally typing scattered phrases during the talks (and with even more typos than normal). As a result I spent more time than I would have liked simply noting some of the presentations. In any case it’s not a very good approach since you don’t know what the speaker is going to say and often run out of puff during dense slides. You know them (“Title XVIII chapter 123 of the EC, says …”). Mind you, it isn’t easy to blog my presentations synchronously either…

I made ca. 15-20 posts about Berlin-5. 2-3 before I went, 2 (so far) afterwards and about 12 during the meeting. Many of the latter are simple shorthand notes of speakers with little or no comment. So for example I have copied as many of Alma Swan’s words as possible to give those not there an idea. I can’t type well or fast so it’s limited. And there are no links.

I’m writing this in the hope that librarians, funders and policy makers will be more adventurous and start their own blogs. An increasing number of slides at berlin5 mentioned blogs, wikis, folksonomies, etc. The best way to understand these is to DO them, not read other people’s.

There are of course some top-class blogs from staff working for publishers – Nature and PLoS lead the way. They actually tell us how people in the organization think, work, interact. (Contrast the more formalised magazine-like blogs on some publishers which are often written by third parties, sometimes recruited from the blogosphere). And there are some excellent librarian blogs. But I am sure there is a niche for “DGXIII inmate”, “bewildered at RCUK/STFC”, etc. In Open Access we need more than just Peter Suber, Stevan Harnad commenting. They have clear formats and agendas which need complementing. There is a huge need for investigative blogging to reveal the spread and the problems with OA.
The digital library needs metadata and in C21 much of this should be done elsewhere than the library. Two main methods are text-mining and tagging (folksonomies). Here I’ll look at the latter.

If you have just set up a blog, no one will know about it. It can be quite dispiriting. There are many legitimate ways to advertise, including tagging. There are sites such as Technorati which visit all blogs (ca. 100,000,000 exist) and index and link to them.
One thing that Technorati looks for is the tags in a blog.
If you write a blog you can add tags which give an idea of the content. Tags are common in many systems such as and Connotea where communities expect other members to use tags to find similar contributions. There is NO controlled vocabulary – you could use anything (though it’s best to stick to ANSI alphanumerics). If you don’t understand social computing, this is a good place to start. It doesn’t matter what you do – it won’t break anything. And there is no “right” or “wrong” way to do things – it is whether it works. So for this meeting I chose “berlin5″. It’s natural and I assumed that of the 100-200 delegates that others would choose something similar. Let’s assume they choose “berlin open access” (you can have multiple tags, of course).
In a formal metadata system this is a nightmare, but in the blogosphere it’s trivial. If twenty people read both blogs one of them will probably post a comment “Petermr is using berlin5 – why don’t you add that as well” (or the other way round). So the two of us start to converge. No one tells us to – it’s just obviously a good thing to do.

So here is the list of posts about berlin5 (there are 18). There are 3 which are nothing to do with OA but they are easily ignored as they are old.




As I say, it’s a pity that there isn’t anyone else (although we needn’t have finished).
Let’s look at a more distant meeting – electronic theses and dissertations at Uppsala. If you follow:

You’ll find 17 posts, mainly by me but not all:

ETD Policies, Strategies and Initiatives in…

Das, Anup Kumar and Sen, B. K. and Dutta, Chaitali (2007) ETD Policies, Strategies and Initiatives in India: a Critical Appraisal. In Proceedings 10th International Symposium on Electronic Theses and Dissertations (ETD2007), Uppsala, Sweden.

So now I have made two important contacts – the authors of the article and also EPrints for LIS. That’s just because we both used etd2007 in our posts.

But now let’s look at the really hectic end of the scale, www2007:

303 posts! and although the meeting was 5 months ago, posts mentioning it are still coming in, such as Yet another semantic tagging application in Jakoblog — Das Weblog von Jakob Voß

Moreover, because I have added this link to my blog, Jakoblog will get notified. Technorati keeps count of how often every blog mentions others. E-LIS has 251 other blogs which link to it; I have about 120 (“the authority”), Jakoblog has 37. If I put Jakoblog on my blogroll it would increase to 38. (A popular aggregator/multiple-author blog like ScienceBlogs has nearly 10,000; Bora’s Blog around the clock has 700; Dorothea Salo’s Caveat Lector has 199; my colleague Andrew Walkingshaw’s Brighten the Corners has 28.) Of course these numbers are about as useful as citation statistics!
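As a rough sketch of what such a count involves, the snippet below tallies how many distinct blogs link to each blog. It is only an illustration: how Technorati actually computes its figure is not described here, and the blog names and links in the example are made up.

```python
from collections import defaultdict


def authority(links):
    """Count the distinct blogs linking to each blog.

    `links` is an iterable of (source_blog, target_blog) pairs.
    Repeated links from the same source count once, and self-links
    are ignored (both are assumptions of this sketch).
    """
    inbound = defaultdict(set)
    for source, target in links:
        if source != target:
            inbound[target].add(source)
    return {blog: len(sources) for blog, sources in inbound.items()}


# Hypothetical link data, for illustration only.
links = [
    ("petermr", "jakoblog"),
    ("e-lis", "jakoblog"),
    ("petermr", "jakoblog"),  # a second link from the same blog
    ("jakoblog", "e-lis"),
]
counts = authority(links)
print(counts["jakoblog"])  # 2
```

Adding one more blog that links to a target raises its count by exactly one, which is the blogroll effect described above.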

The serious message is that if you want to go out and get noticed in the blogosphere you have to get noticed! Tagging is a good way of finding out who is thinking along the same lines as you. Then link to them. They’ll often link back. Aggregators will include all of you, and so on.

So, OA colleagues – and hopefully OD colleagues as well – get out there! Yes, you will reach some people via conventional scholarly publications. But your publications will be noticed much more if they are blogged. Das, Sen and Dutta should get some more readers because I have blogged it. They’ll get me anyway, and that’s because E-LIS blogged it. And so it grows…

Stochastic hyperslide at MKM2007

Friday, June 29th, 2007

I have just given my presentation at Mathematical Knowledge Management 2007 for which I wrote an abstract about 2-3 months ago: Mathematics and scientific markup. I knew that in the intervening time I would find something new to get excited about – and this has happened – I have added the excitement of the lc-semanticweb. Of course the technology and community have developed since then.

As many of you know I rail against Powerpoint as a prime destroyer of semantic content. Powerpoint also constrains the presenter to a linear mode – yes you can skip a few slides and maybe even hide them, but it’s not easy to flip about. And it’s a poor launch platform for interactive demos.

I’ve done my slides in XHTML+SVG, believing this is the right way to remain true to my campaign for XML. (I’ll do Powerpoint when it’s necessary for business purposes – e.g. to integrate with colleagues, but that’s about it). This worked for a bit but soon hit problems of scale. I started addressing that with XSLT to add menus to the presentation. In fact I started with the wrong technology (for some bizarre reason I chose it to be Windows specific) and have now simply changed to XHTML.

I have over 12000 XHTML slides. (Before you get the wrong idea, many of these are scraped – so 3000+ from one example of OSCAR3.) But nonetheless there are very many. I want to be able to reassemble them for each talk, and I want the technology to be as simple as possible – ideally none. (The efforts I have used in the past have all been broken by browser “upgrades” – a synonym for disasters).

Some ideas are:

  • use a database and craft metadata for each slide
  • use something like Spotlight or local Google

but these don’t assemble the talk. So at present I have about 100 directories (maybe with trivial subdirectories) and 5-20 slides per directory. I make the talk by selecting directories which may have some general bearing in the talk – perhaps 20-30. Admittedly it takes memory to work out what is likely to be in each folder but I have to work hard at a talk and the time is well spent. I then asterisk those directories which I HAVE to present (i.e. if I get to 5 mins before the end and haven’t mentioned them, I break off and visit them). I prepare demos (such as BIOCLIPSE, OSCAR1, GoogleInChI, Blue Obelisk GreaseMonkey) and visits to the WWW (when the organizers have provided it – e.g. the ACS hardly ever does even when I ask in advance – it makes little sense to have sessions about the Web when you can’t get there).
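The directory-and-asterisk scheme described above could be sketched in a few lines of Python. This is only an illustration under assumptions, not the tooling actually used: the directory names, the `.xhtml`/`.html` suffixes and the function names are all hypothetical.

```python
import tempfile
from pathlib import Path

SLIDE_SUFFIXES = {".xhtml", ".html"}  # assumed file extensions


def assemble_talk(root, chosen, starred=()):
    """Map each chosen directory to its slides and a 'must present' flag."""
    root = Path(root)
    starred = set(starred)
    talk = {}
    for name in chosen:
        slides = sorted(
            p.name for p in (root / name).glob("*") if p.suffix in SLIDE_SUFFIXES
        )
        talk[name] = {"starred": name in starred, "slides": slides}
    return talk


def still_to_visit(talk, visited):
    """Starred directories not yet shown, in the order they were chosen."""
    return [
        name
        for name, info in talk.items()
        if info["starred"] and name not in visited
    ]


# Build a tiny throwaway slide tree (hypothetical names) and try it out.
with tempfile.TemporaryDirectory() as tmp:
    for name, count in [("oscar3", 2), ("cml", 1), ("inchi", 1)]:
        folder = Path(tmp) / name
        folder.mkdir()
        for i in range(count):
            (folder / f"slide{i}.xhtml").write_text("<html/>")
    talk = assemble_talk(tmp, ["cml", "oscar3", "inchi"], starred=["oscar3", "inchi"])
    pending = still_to_visit(talk, visited={"cml", "oscar3"})

print(pending)  # ['inchi']
```

Calling `still_to_visit` five minutes before the end gives the list of asterisked directories that still have to be shown; nothing here fixes the order in which the rest are visited.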

So I prepared this for today’s talk to the MKM. A very nice audience to present to as they understand all about semantic content, namespaces, XML, dictionaries – so none of that has to be explained. I said my hyperslide would be stochastic – I didn’t know what slides I would present and in what order. The demos might break.

They did. BIOCLIPSE hung on Jmol rotation (although I got to demo Jmol later). However I am sure the audience appreciated the value – we’d seen Eclipse being used for theorem proving, etc. GreaseMonkey worked yesterday, but failed today. Now I have reinstalled it and it works great. GoogleInChI failed (is the Google API finally broken?). But OSCAR1 and OSCAR3 worked – and the links out to Pubchem and the chemical blogosphere. And the polymer builder, although I didn’t have time to explain exactly how it was a symbol manipulator. And I certainly covered less than half of what I might have said. But at least the hyperslide approach means I never overrun – as you can stop when you need to.

There are downsides. It’s difficult to keep a record (that’s why videos are useful). And Powerpoint does have the merit of acting as a document container. I’ve tried both S3 and Slidy but neither helps you to assemble talks.

The only complete way to make slides available is to put them under SVN on WWMM. I can copy the directories to a pen drive. But none of this is a record of which slides were visited in which order and what was said.

I’d be interested in whether anyone else is mad enough to create new ways of managing their slides? And whether they have any ideas. At present I’m almost motivated to try Javascript, but the last time I did that – 5 years ago – everything broke within a year.

Ownership/Copyright of Comments and Email

Tuesday, December 19th, 2006
  1. David Goodman raises a question in response to this post:
    December 18th, 2006 at 11:33 pm: excerpted and adapted from my Dec 18 posting on the SPARC-OA list: As an example, relevant to this very posting, I quote from this blog:
    “You own the copyright in your comments: but you also agree to license your comments under the Creative Commons Attribution-NonCommercial license.”
    When you act as a publisher, you do not license commercial use.

    David Goodman,

When we set this series of blogs up we thought it was important to consider the ownership of comments. I’ve been in positions before when I have been unable to distribute a collaborative work because we didn’t address this issue. So we took a stab and picked a license for the comments. I’m quite happy to be convinced that “commercial use” would be better, but we omitted it because we thought it might put off some readers (we thought that even requiring a license might cause problems).

So, readers (and commenters) what is your view? Would you be happy with a license that allows commercial exploitation? Are any of you put off by the current license? (If you are, and don’t want to use the comments, please mail me.).

David – I do not understand “you” in the last sentence. Does it mean:

“When PMR acts as a publisher, PMR does not license commercial use” (i.e. a statement of fact) or

“When anyone acts as a publisher, that person does not license commercial use” (i.e. a metasyntactic variable applying to all publishers) (and perhaps “does not” implies “should not” or “cannot”).
or something else…

David’s other comment refers to the SPARC Open Data list which has had a discussion about copyright on email. David writes (Mailing List Message #85):

If you mean quoting, that is protected fair use. When it is for the
purpose of criticism, and especially scholarly criticism,
as it usually is on these lists, this can be very encompassing.

If you mean reposting without permission on a different list, I
think you will find that most of the posters report
that they have permission--the usual statement is
"reposted with permission from the X list";
I know that I have always asked. I think that Heather
has always asked.

Once or twice the original poster did not want a larger or different
audience, and in those cases I did not repost.

And, as Heather says, posting or quoting from a private letter is
totally unethical without explicit permission, print or email. I have
not seen anyone I know do it.

So this quote (almost all David's mail except for the address) is quoted here under fair use.

The penultimate point is important. I have been embarrassed in the past because someone took one of my posts to a fairly small list (A) and reposted it in a much larger list (B) with the wrapper “from Peter Murray-Rust”. This post caused harm, because it appeared that I had deliberately posted it in list B where its content and attribution was seen as aggressive while on list A it was normal discourse in the community. Unfortunately the person (an open source evangelist) made a habit of this and I took to including something like “please do not repost without explicit statement that this was not done by the original author”. I don’t know the answer to this.

A typical scenario might be:

I post here something that says “I think it would be a good idea if all publishers added InChIs to their chemistry” . This is fine on this blog. If, however, the post was copied to the Computational Chemistry List (where the person copied my material) and attributed to me, then those members might see me as spamming, uncritical, intrusive, etc. If he had written (I found this quote on PeterMR’s blog …) I couldn’t complain.
So metadata and transclusion matter. But we don’t want any more hard stuff before breakfast…

Sometimes when someone writes to me I extract part and comment:

“A correspondent asked me:

“When should I use cml:property and when should I use cml:Parameter? Here is some code to highlight the problem… ”

[code included in quote]

and I'm replying in public because I think many people will be interested."

I hope that this does not break confidentiality and fair use. Otherwise I will have to write a lot more emails.

Impact Factors! Hirsch, Erdős and Pauling

Friday, December 8th, 2006

Having spent 2 hours tidying CML Schema over a flaky CVS connection to sourceforge, I need some relaxation. So, after my disillusionment with the accuracy of citation metrics, I was spooking around Wikipedia and came across the h-index (suggested in 2005 by Jorge E. Hirsch of the University of California, San Diego). This is rather similar to Zipf’s law – so essential in understanding informatics. The h-index is defined as:

A scientist has index h if h of his/her Np papers have at least h citations each, and the other (Np – h) papers have at most h citations each.

WP continues:

In other words, a scholar with an index of h has published h papers with at least h citations each. Thus, the H-index is the result of the balance between the number of publications and the average citations per publication. The index is designed to improve upon simpler measures such as the total number of citations or publications, to distinguish truly influential scientists from those who simply publish many papers. The index is also not affected by single papers that have many citations. The index works properly only for comparing scientists working in the same field; citation conventions differ widely among different fields.

So if a scientist has (say) 10 papers with citations:

200, 15, 12, 8, 5, 4, 2, 1, 0, 0

they have an h-index of 5 (5 papers have >=5 citations). The 200 citations are no more powerful than 20 would be for the first paper. If we have to have citation analysis this might be a good approach (since we have little idea how the actual numbers are obtained or who is using what) and the parametric approach allows for this. (I have yet to find how a “citation” is defined). (BTW if it matters I score ca 14 on Google Scholar – Feynman is quoted at 23, Hawking at 68 – don’t take it too seriously – Galois scores 2).
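The definition is easy to check by hand, but a small sketch makes it concrete. The Python below is simply one way to compute the index from a list of citation counts (the function name is mine); it says nothing about how Google Scholar or anyone else gathers the counts.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


example = [200, 15, 12, 8, 5, 4, 2, 1, 0, 0]
print(h_index(example))              # 5
print(h_index([20] + example[1:]))   # 5: 20 citations count no less than 200
```

The second call replaces the 200-citation paper with a 20-citation one and the index does not move, which is the insensitivity to single heavily cited papers noted above.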
Anyway, now for some light vanity. In the links to h-index was the Erdős number. This is named after the legendary Hungarian mathematician who was prolific both in the number of his papers and his collaborators. The number is defined as:

In order to be assigned an Erdős number, an author must co-write a mathematical paper with an author with a finite Erdős number. Paul Erdős has an Erdős number of zero. If the lowest Erdős number of a coauthor is X, then the author’s Erdős number is X + 1.


Erdős wrote around 1500 mathematical articles in his lifetime, mostly co-written. He had 504 direct collaborators; these are the people with Erdős number 1. The people who have collaborated with them (but not with Erdős himself) have an Erdős number of 2 (6,984 people), those who have collaborated with people who have an Erdős number of 2 (but not with Erdős or anyone with an Erdős number of 1) have an Erdős number of 3, and so forth. A person with no such coauthorship chain connecting to Erdős has an undefined (or infinite) Erdős number.

So might I have a finite Erdős number? We’ve all heard how small the world is - (Six degrees of separation and Small-world network).
But very unlikely. I have to have written a mathematical paper with someone with a finite Erdős number. So I was browsing through the Erdős 4 numbers and suddenly saw Linus Pauling (Now of course I have to take WP on trust that this is a genuine entry – I can imagine the debate over Erdős numbers can be quite detailed). So could I make a chain which links to Linus Pauling (I’ve had the honour to meet him)?

Well, I searched Google scholar for L Pauling (there is also Peter Pauling, his son). And I reckoned there might be a crystallographic chain that connected me. The best I can do is:

  • Pauling + Vernon Schomaker
  • Schomaker + Jack Dunitz
  • Jack Dunitz + PeterMR

That would give me an Erdős number of 7. But, unfortunately, although my paper with Jack was mathematical (cokernels of crystallographic point groups) the Schomaker-Dunitz papers were on electron diffraction (cyclobutane, etc.) and the Pauling-Schomaker papers included the splendid title:
The Use of Punched Cards in Molecular Structure Determinations I. Crystal Structure Calculations

So, reluctantly, unless I can find another chain of mathematical papers I don’t have a finite Erdős number.

But I do have a finite Pauling number – currently 3. (I doubt I can get it lower). And since Pauling is generally acknowledged as the greatest chemist of the twentieth century, why don’t we start a Pauling number?
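Both kinds of number are just shortest-path lengths in a coauthorship graph, so a breadth-first search finds them. The sketch below uses only the three coauthor pairs listed above and is an illustration: it ignores the rule that Erdős numbers count mathematical papers only, and the function names are mine.

```python
from collections import deque


def build_graph(pairs):
    """Build an undirected coauthorship graph from (author, author) pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def collaboration_distance(graph, source, target):
    """Fewest coauthorship links from source to target, or None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        person, dist = queue.popleft()
        for other in graph.get(person, ()):
            if other == target:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None  # no chain: an undefined (infinite) number


# The chain from this post, names spelled as above.
pairs = [("Pauling", "Schomaker"), ("Schomaker", "Dunitz"), ("Dunitz", "PeterMR")]
graph = build_graph(pairs)
print(collaboration_distance(graph, "Pauling", "PeterMR"))  # 3
```

With a fuller list of coauthor pairs the same search would show whether a shorter chain exists.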

(Oh – FWIW Erdős has an h-number of 54 on Google and Pauling 39. But don’t take these too seriously).

The most cited chemistry articles??

Thursday, December 7th, 2006

In my last post (Assessed by Robots and citation Quiz) I argued that our careers are now in the hands of the publishing industry – they provide the numerical metrics and based on this the funders decide whether we keep our jobs. So I thought I’d look at how to improve my citations. I typed something like “most cited papers chemistry” into a well-known search engine and got something like this result. (Now we’ve just been out for our Christmas lunch and now I have got back the results aren’t the same as beforehand – so take everything with a pinch of salt…) Anyway in the first case I went to CAS Spotlight which announces:

CAS, the world’s leader in providing chemical information is now highlighting the most cited documents. The “Chemistry” category identifies the most highly cited chemistry documents appearing in the 1999-2005 published literature and appearing in journals covered by CAS.

CAS provides this information as a free service to the scientific community.

I went to the Journal articles (2005) button and got:

The following records identify the top ten, most cited journal articles appearing in documents published in 2005.

Title; Author/Affiliation; Source

1. Density-functional thermochemistry. III. The role of exact exchange
   Becke, Axel D.; Dep. Chem., Queen’s Univ., Kingston, ON, K7L 3N6, Can.
   J. Chem. Phys.

2. Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density
   Lee, Chengteh; Yang, Weitao; et al.; Dep. Chem., Univ. North Carolina, Chapel Hill, NC, 27514, USA
   Phys. Rev. B: Condens. Matter

3. Density-functional exchange-energy approximation with correct asymptotic behavior
   Becke, A. D.; Dep. Chem., Queen’s Univ., Kingston, ON, K7L 3N6, Can.
   Phys. Rev. A: Gen. Phys.

4. Generalized gradient approximation made simple
   Perdew, John P.; Burke, Kieron; et al.; Dep. Phys. Quantum Theory Group, Tulane Univ., New Orleans, LA, 70118, USA
   Phys. Rev. Lett.

5. The Protein Data Bank
   Berman, Helen M.; Westbrook, John; et al.; Research Collaboratory for Structural Bioinformatics (RCSB), Rutgers University, Piscataway, NJ, 08854-8087, USA
   Nucleic Acids Res.

6. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen
   Dunning, Thom H., Jr.; Chem. Div., Argonne Natl. Lab., Argonne, IL, 60439, USA
   J. Chem. Phys.

7. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set
   Kresse, G.; Furthmueller, J.; Inst. Theor. Phys., Technische Univ. Wien, Vienna, A-1040, Austria
   Phys. Rev. B: Condens. Matter

8. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells
   Elbashir, Sayda M.; Harborth, Jens; et al.; Dep. of Cellular Biochem., Max-Planck-Inst. for Biophys. Chem., Gottingen, D-37077, Germany
   Nature (London, U. K.)

9. Ordered mesoporous molecular sieves synthesized by a liquid-crystal template mechanism
   Kresge, C. T.; Leonowicz, M. E.; et al.; Paulsboro Res. Lab., Mobil Res. and Dev. Corp., Paulsboro, NJ, 08066, USA
   Nature (London)

10. General atomic and molecular electronic structure system
   Schmidt, Michael W.; Baldridge, Kim K.; et al.; Dep. Chem., Iowa State Univ., Ames, IA, 50011-0311, USA
   J. Comput. Chem.

Most Cited Journal Articles – Chemistry, CAS Science Spotlight

(Note – I am sure this is part of a page that is copyright ACS so I am claiming fair use without asking permission. And I shall be complimentary – so please don’t cut me off). Now… have a look and decide what is common to all of these. Read the abstracts if it helps (I didn’t read the articles as only the abstracts are Openly accessible). That’s what I asked you in the last post.

Yes – they are all about techniques. So my world domination strategy was based on creating things that people want to use, not providing scientific results. (You can, of course, argue that a database or a basis set or a functional is a scientific result, but the citers are using it as a tool).

I reran the search after lunch. I thought the results would be the same but maybe Google, or the lunch or my fingers were different. At top of the bunch now comes Elsevier:

Access key papers as recognised by CAS Science Spotlight
CAS Science Spotlight is a free web service that identifies the most requested research publications as reflected by requests for full text via their online services. Additionally, the most cited chemistry-related research publications as reflected by the more than 100 million citations found in the journals, patents, conference proceedings and other sources covered by CAS are identified.

Elsevier is proud to be the publishers of # 1 and # 2 most requested chemistry papers in 2005*, as recognised by CAS Science Spotlight.
You are invited to access these and other highly-valued articles by clicking on the paper title.

# 1

The following Elsevier article was the #2 most requested ‘chemistry and related science’ article for 1Q05 and the #1 most requested for 2Q05 and 3Q05:
Title: A useful bicyclic topological decapeptide template for solution-phase combinatorial synthesis of tetrapodal libraries
Published: Tetrahedron Letters, pp7261-7263, vol.42:41, (2001)
If you do not have access to this article on ScienceDirect, click here

# 2

The following Elsevier article was the #1 most requested ‘chemistry and related science’ article for 4Q04 and 1Q05 and #2 most requested for 2Q05 and 3Q05:
Title: Convenient synthesis of human calcitonin and its methionine sulfoxide derivative
Published: Bioorganic & Medicinal Chemistry Letters, pp2237-2240, vol.12:16 (2002)
If you do not have access to this article on ScienceDirect, click here

* Tet. Lett. article, #2, #1, #1 most requested for first three quarters of 2005
BMCL article, #1, #2, #2 most requested for first three quarters of 2005
Final quarter and cumulative year data not yet released by CAS.

The following Elsevier article was the #1 most requested chemistry article for 2002 and 2003:
Title: Glucan synthesis. Part VI. Total synthesis of cyclomaltohexaose
Published: Carbohydrate Research, pp277-296, vol.164 (1987)
If you do not have access to this article on ScienceDirect, click here

The following Elsevier article was the #1 most cited ‘chemistry and related science’ article for 2004 and 2003 and the #2 most cited for 1999, 2000, 2001 and 2002:
Title: A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding
Published: Analytical Biochemistry, pp248-254, vol.72:1-2 (1976)
To access this article, click here

(I didn’t ask their permission to quote this either).

Well I am mystified. There is no correlation between the types of paper given here and the ones earlier. They are not only not the same papers, but they aren’t even on similar topics.

I have probably made a simple mistake. (I think it’s the same CAS Spotlight and the same year.) Elsevier uses slightly different words: “most requested chemistry papers in 2005*” (my italics) and also “MOST REQUESTED ‘CHEMISTRY AND RELATED SCIENCE’ PAPER ON CAS”. So maybe there are two completely different lists. Or maybe there is a different selection criterion. Or a subset.

But imagine you are a busy provost/dean and have to decide whether to close the theoretical section of your chemistry department or the organic (of course you may be thinking of both…). The theoreticians will point to the CAS page, the synthetists to the Elsevier page. And I am sure there are others.

So the real skill in the next decade will not be doing science, but choosing and manipulating the metrics. I suppose it is an advance from HEFCE’s last idea which was to measure research income.

5 Years of Open Babel

Sunday, November 26th, 2006

I’ve mentioned Geoff Hutchison and Open Babel here before in the context of the Blue Obelisk awards. Open Babel is an Open Source “universal adapter” (see below). So it’s nice to report his announcement of 5 Years of Open Babel from the mailing list. To quote:

I’d like to take the opportunity to outline a bit of what’s happening with Open Babel right now and what 2007 might bring. Last year, we released version 2.0, representing a full stable release. Since then, we’ve released two updates to fix bugs, thanks in part to many user reports and contributions. There are contributed binary copies for Windows, Mac OS X, and a range of Linux distributions.

So what is Open Babel? Basically it’s one of those universal adapters such as you get in airports.


(Thanks to Wikipedia – it’s so liberating to be able to paste pictures without worrying about copyright). This adapter can take 1 input (US) and transform to 2 different outputs (UK and European). Note that this transforms the mechanical format, not the voltage. (It’s a bit similar to transforming the syntax, but not the semantics). There are smarter adapters – some can manage 4 inputs and 4 outputs. But they feel as if they may fall to bits any time.

Open Babel is a lot more powerful than that! It can manage 70 formats. And also carry out some semantic conversion. It’s not a Swiss army knife – CDK and JOELib are more like that. It does a single job – syntactic and semantic conversion – and it does it through Open voluntary labour.

It is critical to highlight how important Open Babel and other Blue Obelisk activities are to the pharma industry. I’ve highlighted this before. I expect that Open Babel is used by every pharma company in the world. I estimate that in direct costs, staff etc. the pharma industry spends several billion (that’s a US billion = 10^9) USD per year on chemical informatics of some sort. (For example we can guess the amount spent on CAS, Beilstein, and chemoinformatics software). None of this goes on Open Babel.

That’s not quite true. Last year we had support for a summer student from Merck (Nick England) who added some exciting routines to Open Babel. Not directly costed – let’s say ca. 5000 USD. So thank you Merck. And also thanks to MDL for a summer student to write a CML Reader in Java.

But that’s about it. In IT there is a huge industry investment (direct and implied) in things like Apache, Eclipse, etc. But in chemistry nothing. It’s pure free-riding by the pharma industry.

Now there is no moral argument here. We write these systems for a variety of motivations and they are fulfilling (though non-hackers have NO IDEA how much effort actually goes in.) NO IDEA. I have spent the weekend trying to refactor my molecule builder, and the gear wheels are spread out across the floor. Nothing is working. I promised my colleagues it would be ready for tomorrow. We’ll see. This post is a welcome relief.

So, dear pharma industry, if you read this – think about what you owe Geoff and the rest of us. It doesn’t just happen. It isn’t easy. The refactoring is desperate. We know you are shy – people in pharma don’t like to come out into the daylight, so if you mail me I’ll keep it confidential. Or you can post an anonymous comment to the blog. I will have no idea who you are. At the very least add a post that says something like “Thank you Geoff from an anonymous person in pharma who has found Open Babel useful”. That sort of message is highly motivational.

(Chemical) Images in blogs

Wednesday, November 15th, 2006

I am following up a post where I suggested we could provide a service for drawing molecules in blogs. One problem is how to incorporate them into the post.

(I’m still working on this post, so don’t believe it all)

When I create images in my posts I have to:

  • create the image somehow (draw it, cut and paste, etc.)
  • save it on my filesystem
  • “Upload” it to WordPress using a rather clunky uploader.

So when I painted TotallySynthetic’s web stats from his blog I simply cut and pasted them into a pixel editor, trimmed them, saved to disk, uploaded, etc. Can this be made easier?

One possibility is that I can link to other images. So here is one of Peter Corbett’s latest posts. The first image is referenced by:

<img xsrc=""/>

If I want to link to this I can paste this URL into my post which gives you this:

This is a HOTLINK. It is easy for me, but there are problems. Every time I load the image (by opening the browser) it accesses Peter’s blog. If I get 1000 human hits a day, Peter will get 1000 hits per day (assuming the images are downloaded). Also if Peter’s server is offline, you won’t see the image. But it’s simple.

However I can copy this image into WordPress. It’s no more difficult (and no easier) than uploading an image from disk. You simply load the URL into the file browser text field and it will copy the image into WordPress’s local image store. Now it looks like this:

If we create semantically useful titles for the images it should be possible to do some fun things.
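
To make the difference between hotlinking and copying concrete, here is a minimal Python sketch. It is not how WordPress does it; the function name `copy_image_locally`, the `store_dir` folder and the hash-prefixed file names are all my own assumptions for illustration. The point is only that the remote server is contacted once, when the copy is made, rather than every time a reader loads the page.

```python
import hashlib
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url):
    """Download the raw bytes at url (one request to the remote server)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def copy_image_locally(url, store_dir, fetch=fetch_bytes):
    """Copy a remote image into store_dir and return the local path.

    Hypothetical helper: the remote server is only contacted if we do
    not already hold a copy, so later page views cost it nothing.
    """
    os.makedirs(store_dir, exist_ok=True)
    # Keep the original file name, prefixed with a short hash of the URL
    # so two different images called "image.png" do not collide.
    name = os.path.basename(urlparse(url).path) or "image"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    local_path = os.path.join(store_dir, digest + "-" + name)
    if not os.path.exists(local_path):
        data = fetch(url)
        with open(local_path, "wb") as handle:
            handle.write(data)
    return local_path
```

A post would then point its image at the returned local path instead of the original URL. The `fetch` argument is there only so the download step can be swapped out, for example when trying the function without a network connection.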

There are some issues here. As Peter pointed out, Creative Commons gives rights over the content but not necessarily the right to link to the server on which it is mounted. You can easily see denial of service problems here. I’d welcome any ideas on whether this is going in a useful direction.


Mystery Molecule and Jack Dunitz on Fluorine

Monday, November 13th, 2006

Jack Dunitz (one of the greatest chemical crystallographers) visited our lab today. I had told people beforehand that I would ask him what the mystery molecule was and prophesied that he would get it immediately. He did.

This gives me a chance to record the enormous personal debt I owe to Jack with whom I spent a year in Zurich. He is deeply loved by the many people who have passed through his lab. He now works almost exclusively with theoretical tools rather than equipment and today told us about Fluorine – or more precisely organofluorine compounds. Substituting hydrogen by fluorine in hydrocarbons (aliphatic or aromatic) makes almost no difference to physical properties (except density), but despite their similar properties the fluorocarbons and hydrocarbons don’t mix. In fact perfluoro butane in butane has one of the highest activity coefficients (10). But even the molecular volume is almost unaltered. In some directions the fluorine is actually smaller than the hydrogen.

So there are still many simple observations in chemistry that we don’t understand. One of Jack’s most engaging, made with Gautam Desiraju, was that over the whole of reported chemical space there are more compounds with an even number of carbon atoms than odd (Nature closed access reference) – see the report in New Scientist where he is quoted as:

“It’s much more intriguing if you don’t offer an explanation,”

I’ll leave you with the puzzle he greeted me with when I started in Zurich. “If a golf ball is hit with an infinitely massive golf club moving with velocity V, what will be the velocity of the ball after the collision?” The answer is simple and logical.

Blogs as scholarly record? Should we reposit them?

Monday, November 6th, 2006

Blogs are increasingly becoming the grey literature of our time, and at least some may need preservation. I use this blog for many semi-reputable activities – an open notebook of thoughts, a means of presenting talks, and snapshots of activities of value to me. This blog, and others in the Blue Obelisk, are being used by Beth Ritter-Guth as a resource for her rhetorical work. This, at least, demands an element of preservation.

So simple questions:

  • should they be reposited in the institutional repository?
  • if so, how? (zipped at regular intervals? presumably not after every comment?)

This may appear trivial, but it isn’t. Having had logins at several institutions (Glaxo, Daresbury, BioMOO, Nottingham, Birkbeck, Cambridge) in the last 10 years I have lost significant amounts of my digital scholarship with each move. I have to resort to fragments floating around in random webcaches – it’s remarkable how long they survive…
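
As a rough idea of what “zipped at regular intervals” could mean in practice, here is a minimal Python sketch. It assumes the blog’s posts have already been exported as files into a folder; the function name `snapshot_blog` and the `blog-<timestamp>.zip` naming are hypothetical, and getting the result into an actual institutional repository is not covered.

```python
import os
import zipfile
from datetime import datetime, timezone


def snapshot_blog(export_dir, archive_dir, now=None):
    """Zip every file under export_dir into one timestamped archive.

    Assumes archive_dir is NOT inside export_dir; otherwise the
    snapshot would try to include itself.
    """
    now = now or datetime.now(timezone.utc)
    os.makedirs(archive_dir, exist_ok=True)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    archive_path = os.path.join(archive_dir, "blog-" + stamp + ".zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(export_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to export_dir so the archive
                # unpacks the same way wherever it ends up.
                archive.write(full, os.path.relpath(full, export_dir))
    return archive_path
```

Run from a scheduler once a day or once a week, this would leave a series of dated snapshots rather than one archive per comment.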

(I now have the spellchecker… :-)