petermr's blog

Mystery Molecule – more clues…

Posted on November 8, 2006 by pm286

As you will have seen from the comments, Peter Corbett knows what it is – and quite rightly isn’t saying. But no other response. What does an author do when all the readers are lurkers? Here’s the first strategy – lure them on. So here’s a clue to keep you interested.
This is the unit cell of the crystal with the axes drawn correctly and precisely. I shan’t give them exactly as that would allow you to look them up (more on that later).

However if you are adventurous you can measure them and deconstruct the perspective and probably get quite good ratios.
The picture is, of course, thanks to Jmol – a Blue Obelisk member.

Posted in chemistry | 4 Comments

Pubchem and Thomson – two cheers

Posted on November 8, 2006 by pm286

Noted in Peter Suber’s Blog:

Additional 2.2 Million Structures Now Searchable in Freely Available Database
Thomson Scientific, … provider of information solutions to the worldwide research and business communities, today announced the deposit of 2.2 million chemical structures from Thomson Pharma into PubChem, the freely accessible database that provides information on the biological activities of small molecules. PubChem was developed by the National Center for Biotechnology Information at the US National Institutes of Health (NIH) to help biomedical researchers identify chemical structures with the potential to treat diseases. The addition of Thomson Pharma’s extensive collection of biologically active and pharmacologically active structures, derived from worldwide patent and literature sources, significantly enhances the value of this research tool to the scientific and medical community.
Thomson Pharma (thomsonpharma.com) is the comprehensive pharmaceutical information solution that covers the entire drug discovery and development pipeline. PubChem searchers with a Thomson Pharma subscription can link directly from PubChem to more detailed information on their structure of interest including:

— activities reported for that compound

— drug reports citing that compound

— synthetic methods and critical reaction data

— patents, journals and news stories that feature the compound

— synonyms and trade names

— related salts and isomers

“My colleagues and I are very pleased to welcome the addition of Thomson Pharma chemical structures to PubChem,” said Steve Bryant, Director, PubChem, National Institutes of Health. “This will allow our users to cross-reference information in PubChem, such as biological testing results from the NIH Molecular Libraries program, with the further rich sources of information provided by Thomson Pharma.”

This is partially good news. It means that Pubchem is becoming the de facto standard for chemical indexing. We can rapidly, and increasingly reliably, find the compounds we want without hassle, delay, subscription, licenses and non-re-use (all the devils of the anticommons). More below.
But the devil of the details states:

PubChem searchers with a Thomson Pharma subscription can link directly from PubChem to more detailed information on their structure of interest including … (my italics)

This appears to mean that only the structures, not the data, are freely accessible. Don’t get me wrong, the structures are very valuable in themselves (although I don’t know whether these are 2.2 million NEW structures – if so I’ll be happily surprised). Some certainly will be.
In a twentieth century morality and legality the data are Thomson’s – won by the sweat of the brow. But in the twenty first century – which some of us inhabit – note where the data come from…

— activities reported for that compound (mainly in scientific journal articles – PMR)
— drug reports citing that compound (probably appearing in the public domain – PMR)
— synthetic methods and critical reaction data (originally published in scientific journals – PMR)
— patents, journals and news stories that feature the compound (patent information is generally made Open by the patent offices – PMR)

So to a large extent this is OUR data. Its re-use is technically (though not socially) straightforward. I have been approached by patent offices who would like to make their outputs re-usable with CML. Drug reports (e.g. from regulatory authorities such as WHO, whom I have worked with) should be Open. The primary literature could be made Open to robot indexes if the political will was there. No cheer for the closed data.
But one cheer for the forces of Openness driven by commercialism. I guess that Thomson have done this because they sense a market opportunity. When Pubchem becomes the leading chemical index – just as Google is the leading free-text index – then everyone will want to be linked therefrom. It’s not altruism – just good business. And others will follow. But I hope their data is open.
So the other cheer is for Pubchem as index. There is, of course, quite a lot of data in Pubchem and Medline, but increasingly Pubchem is becoming a linkbase. And that is just what we want. If we can persuade all journals to make their published compound data available we have an Open chemical data system. This semantic publishing is not difficult – it just needs a different business model. If it’s compelling enough for Thomson to link their data to Pubchem, what about chemistry articles? Well, no surprise, Nature already started this with Nature Chemical Biology. I am sure it’s a good move – I bet they get more clicks or whatever excites their business people.
So a simple prediction. In 5 years’ time (and that is ludicrously conservative) the majority of scientific journals will be linked to by Pubchem. There are good semantic tools (like Peter Corbett’s OSCAR3) that will help to take the drudgery out of conversion. So where’s the problem?
Did I mention before that chemists are conservative?

Posted in chemistry, open issues | 4 Comments

Mystery Molecule!

Posted on November 7, 2006 by pm286

This is a detective story. If you know the answer, please don’t reveal it (though I’d be pleased if you announced that you know it). (Anyone remember when Psycho came out? Hitchcock made the audience promise not to tell).
Many years ago I used to synthesize compounds. Not very well, and not very safely. That’s another story. My main interest was structural chemistry – specifically small molecule crystallography (proteins were rare then). I did this reaction (I am not telling you what went in, but it was in the area of coordination chemistry with organic ligands). One particular reaction didn’t give what I expected but instead a very few, rather pretty, rather small (< 0.5 mm) dark orange crystals appeared. Too few for a chemical analysis (you had to burn them then) – no mass spec – not enough for an NMR or even an IR. But crystalline. So I thought I would find out by XRay crystallography.
So I put it on the XRay camera and measured the cell dimensions. It had a molecular mass of about 250. I was excited. A rather unusual spacegroup (I shall withhold this from you, readers, like a good author). And then suddenly I realised what it was.
I expect that by now all experienced chemical crystallographers will know what the material was. It is probably the most inadvertently studied molecule in the world. I have seen several re-determinations of this crystal structure in crystallographic repositories like the one we are building in the SPECTRa project. And now, when it is common to mount-and-shoot (since a crystal structure can be determined in an hour or two) I suspect it happens once a week somewhere in the world.
So by now you either know what I am talking about or, I suspect, are completely bemused. If you are in the latter category I might release some more clues later. But I’d like to be confirmed in my suspicion that crystallographers will know the answer.

Posted in chemistry | 8 Comments

"Open Data" on Wikipedia. Bloat and NOR?

Posted on November 7, 2006 by pm286

Latest report on the Open Data entry on Wikipedia – we are starting to get contributions. Remember always that WP is ours, not mine. And that it’s an encyclopedia, not a platform. This post is just to show how things develop – it’s not judgemental in any way.
I added a section on Relation to Other Open Activities. This was not specifically about Open Data but things that were similar but different or at least distinct. So Open Access is not Open Data although the BOAI implies part of it (that data in fulltext should be Open); however it says little or nothing about non-fulltext. Open Source may have similar ideals but is not really about data. And so on. After a robust real-life discussion with Rufus Pollock (founder of the Open Knowledge Foundation) I included a short link to the OKFN (making it clear that I believe that OK is not the same as OD).
Shortly afterwards Jean-Claude Bradley added an entry to Open Notebook Science in WP and edited the OD entry to point to it. Then a Wikipedian tagged the Open Notebook Science as a neologism. This is tough, but I think fair. I believe that J-C has an important, courageous, approach and I support him. However the actual term is only a month or two old and so is probably unsuitable for WP. It also comes close to NOR (no original research) which deprecates the development of new ideas on Wikipedia.
Because of NOR I have tried to cut down any personal ideas in the OD entry. Obviously since I have started it there is a lot of emphasis from me, but I have tried to address the objective aspects (history, definition, current usage, etc.). I have tried to keep the material related to the actual term “Open Data” or at least the co-occurrence of Data and Open in the same sentence. I have mentioned my own use of the term and given references (as I must), but not elaborated any details.
After J-C was tagged, he moved the definition of Open Notebook Science to OD. At the same time a significant amount of extra definitive material on Open Knowledge was added. At this stage the “Relation to Other Open Activities” was becoming larger than most other sections, and could invite more contributions, perhaps violating NOR.
So I thought it was a good idea to prune this section. I reiterate that I am not a special editor and that anyone can re-edit this. The Open Data page (like all pages) has a Talk page, so I left messages with my thoughts there.
I’ll keep you up to date with progress. I’m hoping that some of the Open Data mailing list will start helping with definitions.

Posted in "virtual communities", open issues | 1 Comment

Blogs as scholarly record? Should we reposit them?

Posted on November 6, 2006 by pm286

Blogs are increasingly becoming the grey literature of our time, and at least some may need preservation. I use this blog for many semi-reputable activities – an open notebook of thoughts – a means of presenting talks and snapshots of activities of value to me. This blog, and others in the Blue Obelisk, are being used by Beth Ritter-Guth as a resource for her rhetorical work. This, at least, demands an element of preservation.
So simple questions:

should they be reposited in the institutional repository?
if so, how? (zipped at regular intervals? presumably not after every comment?)

This may appear trivial, but it isn’t. Having had logins at several institutions (Glaxo, Daresbury, BioMOO, Nottingham, Birkbeck, Cambridge) in the last 10 years I have lost significant amounts of my digital scholarship with each move. I have to resort to fragments floating around in random webcaches – it’s remarkable how long they survive…
(I now have the spellchecker… 🙂

Posted in general | 3 Comments

Open Electronic Theses – should be simple…

Posted on November 4, 2006 by pm286

Theses are one of the most concentrated and valuable ways that science is published. Yet they could be so much more valuable. There are a few hurdles to overcome…

From Peter Suber’s blog:

Effective today, the University of Tasmania will mandate electronic submission of theses and dissertations. The new policy is simplicity itself: in addition to submitting two bound, printed copies (as before), candidates must submit one electronic copy.

Comment [from PeterS] . Kudos to Tasmania and congratulations to Arthur Sale, the mover behind the new policy. This little change can have big consequences because (as I argued in a July 2006 article), for theses and dissertations, achieving mandatory electronic submission is the hardest part of achieving OA:
In principle, universities could require electronic submission of the dissertation without requiring deposit in the institutional repository. They could also require deposit in the repository without requiring OA. But in practice, most universities don’t draw these distinctions. Most universities that encourage or require electronic submission also encourage or require OA. What’s remarkable is that for theses and dissertations, OA is not the hard step. The hard step is encouraging or requiring electronic submission. For dissertations that are born digital and submitted in digital form, OA is pretty much the default. I needn’t tell you that this is not at all the case with journal literature.

I agree with all the positive things Peter says, but I also need Open Data – the ability to re-use the data in the thesis without further permission. I believe that a large amount of chemistry (and other science, but my main activity is chemistry) is locked up in theses and never published. I’m sure this is mainly inertia, with some lack of courage and vision as well.
However at the Open Scholarship meeting in Glasgow I specifically asked for some exemplars of chemistry eTheses. I got a lot of response – and in many countries theses seem to be published routinely in electronic form under some form of (implicit) Open license. However in the UK theses seem to be restricted by additional rights and regulations imposed by the universities, and all seem to be different. So I have the impression – and it’s only an impression – that although there are electronic theses they are not necessarily OA.
Sadly, of course, while almost all theses are created electronically, most undergo the cow to hamburger destruction. I heard yesterday of a student measuring spectra with a ruler, when the original data were digital…
I think theses are a great opportunity to show the value of reposition. I know many cases where the author is among the first to request data from the repository – since they have lost their own digital records. A few cases of this sort start to make sense even in a conservative community like chemistry.
So let’s start demanding that we all deposit theses Openly – if only for the benefit of the student and supervisor!

Posted in open issues | 1 Comment

WordPress – help!

Posted on November 2, 2006 by pm286

This blog is now about 2 months old and I’ve made 70 posts. I have done all my editing with WordPress, the software that publishes this blog. Some things work well, others are driving me wild (and actually stopping me blogging). Any suggestions would be welcome (except to stop blogging…)
In its native form WordPress does not catch Spam. I therefore scrolled through 30 spams/day and manually removed them. I was then pointed to free WordPress plugins, including Akismet. This provides a spam checker and – after registering – I installed the plugin. It has caught all the spams since then – thank you Akismet.
WordPress does not have a spellchecker – you may have noticed! So I downloaded Firefox 2.0 which says it does. There is a check box “Tools | Options | Check my spelling as I type”. Exactly what I want. I checked it. It doesn’t work. At least it doesn’t perform the operation “Check my spelling as I type and inform me of the result”. HELP!
WordPress has an HTML editor (button called HTML). It brings up a window with the HTML source. I can edit this and “Update” which transmits my HTML changes to the edit window. I think it does. It also submits a lot of changes I didn’t request which completely screw up everything. It’s worse than that – the normal edit window displays formatting and indents. It is WYSICUTOWIP. What you see is completely unrelated to what is published. HELP!
I recently suggested that it was a good idea to put InChIs in “alt” attributes for images. It is. The only way you can do this is using the HTML editor. And guess what? It is then creative with the rest of your text. It took me 20 mins to publish 4 chemical structures with alt tags. You might say – create your article elsewhere and simply drop it into the editor. I tried. WordPress usually garbles it. The problem today (where my italics screwed up all the posts in Planet Blue Obelisk) is because the editor had inserted an empty em tag (I won’t try to reproduce it with angle brackets in case I screw this post). For non-HTML experts, an empty em creates a zero-length piece of text with no characters and makes the non-existent characters italic. If this were a null-op that would be just about OK, but it isn’t. It is interpreted as “turn all subsequent characters italic, even those in other people’s blogs where this post is inserted”. Fun, but not widely appreciated. HELP
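For anyone who wants to check a post for this problem before publishing, here is a minimal sketch in Python using only the standard library's html.parser. It counts em elements that contain no text and no child tags. The class and function names are my own invention for illustration, and treating whitespace-only content as "empty" is an assumption; this is not something WordPress itself provides.

```python
from html.parser import HTMLParser


class EmptyEmFinder(HTMLParser):
    """Counts <em> elements that hold no text and no child elements."""

    def __init__(self):
        super().__init__()
        self.empty_count = 0
        # One flag per currently open <em>: True once it has any content.
        self._open_ems = []

    def _mark_content(self):
        # Anything seen inside open <em> elements counts as their content.
        self._open_ems = [True] * len(self._open_ems)

    def handle_starttag(self, tag, attrs):
        self._mark_content()
        if tag == "em":
            self._open_ems.append(False)

    def handle_data(self, data):
        # Assumption: whitespace-only text is treated as "no content".
        if data.strip():
            self._mark_content()

    def handle_endtag(self, tag):
        if tag == "em" and self._open_ems:
            had_content = self._open_ems.pop()
            if not had_content:
                self.empty_count += 1


def count_empty_em(html):
    """Return how many empty <em> elements appear in the HTML fragment."""
    finder = EmptyEmFinder()
    finder.feed(html)
    finder.close()
    return finder.empty_count


if __name__ == "__main__":
    print(count_empty_em("<p>Some <em></em> text and <em>real</em> italics</p>"))
```

A count above zero means the fragment is worth inspecting by hand before it goes anywhere near an aggregator.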
So how do I write CML and XML in my CML Blog? I can’t at present, so the CML Blog is stalled. I have installed a “code” plugin for WordPress but we are not sure whether it does anything. HELP
Please don’t think I’m ungrateful to WordPress – I’m not. And I’m not whingeing. It’s freely available and I thank the creators. I write Open software and I know how difficult it is and how you often get dispiriting messages. (I once got a (gratuitous) message from someone in a chemical software company “CML is the most mangled format I have ever come across”. Considering their own format is binary and abstruse in the extreme I return the compliment. They had failed to recognise the difference between a proprietary format and a modern markup language architecture. But I’ll rant on chemical software companies another day).
So in essence blogs with text are OK. Quoting, indents, so – so. Images can be a pain. Code is a nightmare. The management apparatus is good.

Posted in general | 4 Comments

"Open Data" Wikipedia NPOV, three revert, etc.

Posted on November 2, 2006 by pm286

The “Open Data” article on WP has already had useful attention from Wikipedians. Some minor typo corrections (and many WPians devote much energy to this, including developing bots). My reference numbering was a mess since I didn’t know WP had a reference tool, and this was edited by Gurch to create a page with automatic references.
A nice thing about WP is that everyone who has registered has a “Talk” page, so I left a thankyou message on Gurch’s page. Gurch replied on my page

No problem. I was reading your blog post here (I have a habit of reading Wikipedia-related blog posts just to get an idea of Wikipdia’s reputation) and thought I’d check the article to see if anything else needed doing

I hope this gives an idea of the immediacy of the collaboration that exists on WP.
As I have said already, this is not MY page. It’s important to start a page with enough for others to add to, but not enough to preclude NPOV (neutral point of view). This is a very important point on WP where contributors must avoid bias (non-NPOV). This is quite difficult for Open Data as almost all the discussion is from advocates. A neutral page will report arguments for and against a controversial issue and try to be objective. Open Access has seen its share of minor hyperbole – here is an example of a revert:

One motivation for authors to make their articles openly accessible is research impact factor. Since Lawrence’s (methodologically weak) cross-sectional study (with no adjustment for confounders) first suggested the Open Access citation impact advantage…

was reverted to:

One motivation for authors to make their articles openly accessible is research impact factor. Since Lawrence’s landmark study first suggested the Open Access citation impact advantage …

The NPOV sometimes gets to the stage where someone challenges the neutrality. This happened in the last day or two in Open Source, where “Mikeblas” added a tag stating that the page was not neutral. This tag was soon removed, only to be re-replaced. This is an “edit war” and could go on for ever, destroying the page. WP has several effective mechanisms to solve this. Each entry has a Talk page, and here is the discussion on this point:

NPOV:
There’s nothing negative to say about open source? — Mikeblas 02:52, 17 October 2006 (UTC)

No, there’s just nothing that anyone’s written on the wiki page. Why do you ask? DMacks 03:46, 17 October 2006 (UTC)

I’m asking because Wikipedia needs to be NPOV, and this article certainly isn’t, since no critiques of open source are mentioned. I’ve marked the article NPOV for this reason. — Mikeblas 14:59, 28 October 2006 (UTC)

oh let’s troll open source people lol ^___^ — and two POV does not somehow magically make NPOV, unless you’re the mainstream media covering the U.S. elections. The article, as it stands, is unbiased towards either side. Go away, troll. Tag removed. —70.108.92.221 18:34, 29 October 2006 (UTC)

I’ve replaed the tag as the article remains POV. There’s no coverage of the problems in the open source community, nor any discussion of the negatives in the practice. Your ad hominem attack doesn’t convince me that the article is balanced. Here’s a couple of references to help us get started: [1] [2] — Mikeblas 22:26, 29 October 2006 (UTC)

Lack of negativity does not make an article POV. I’m removing the NPOV tag. Please feel free to add criticisms to think article though. —Pengo ^{talk · contribs} 12:11, 1 November 2006 (UTC)

Well at present the POV tag is off. If it goes back on, then an official WP process comes into effect: the three revert rule

Do not revert any single page in whole or in part more than three times in 24 hours, except in the case of obvious, simple vandalism.
(Or else an administrator may block your account.)
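The rule as quoted is easy to state as a check over timestamps. The following is only a sketch of that arithmetic in Python: the function name and parameters are made up for illustration, it looks at one editor's reverts on one page, and it ignores the vandalism exception, which needs human judgement.

```python
from datetime import datetime, timedelta


def exceeds_three_reverts(revert_times, limit=3, window=timedelta(hours=24)):
    """Return True if more than `limit` reverts fall inside any rolling window.

    `revert_times` holds datetimes of one editor's reverts on a single page.
    """
    times = sorted(revert_times)
    for i, start in enumerate(times):
        # Count the reverts from this one onwards that are within the window.
        in_window = [t for t in times[i:] if t - start < window]
        if len(in_window) > limit:
            return True
    return False


if __name__ == "__main__":
    first = datetime(2006, 11, 1, 12, 0)
    reverts = [first + timedelta(hours=h) for h in (0, 2, 5, 9)]
    print(exceeds_three_reverts(reverts))
```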

Another rule of WP is No Original Research.

Articles may not contain any unpublished arguments, ideas, data, or theories; or any unpublished analysis or synthesis of published arguments, ideas, data, or theories that serves to advance a position.

I was slightly worried that “Open Data” might be seen as my promotion of a personal crusade, but research on the Internet has convinced me that it is now a widely used term, of considerable importance so I am relaxed about this.
WP is necessarily an experiment in virtual democracy and from what I have seen over a year or two works pretty well.

Posted in "virtual communities", open issues | 3 Comments

Creating "Open Data" on Wikipedia

Posted on October 30, 2006 by pm286

In an earlier post I mentioned that I was going to start an article on “Open Data” on Wikipedia. This is a blow-by-blow account (a few technical details are omitted).

1. Do not be afraid. (I used to be afraid, but there is no need). Nobody has ever made me feel unwanted (there have been robust discussions, but none angry). So I go to WP and search for “Open Data”. (The capitalisation may be important – always start with a capital if you can). There is no such page, but it offers one a start:

No page with that title exists.

You can create this page or request it.

So by clicking on the first link you get a fresh page headed “Open Data”.

2. You don’t have to write the whole page at once. Even the title and a few words is enough. This is a “stub” and others can add to it or modify it. Let’s see how far we get before nightfall.
3. Read some WP pages to get a sense of style. Good ones have an introductory paragraph and then the body, made of several labelled sections. Then there are usually links to other WP pages, and finally references. WP is very insistent on good references. Of course you don’t have to have them all at once.
4. You can always backtrack if you get something wrong. WP saves everything. No-one minds if there are 100 versions of something. Indeed if you are on a flaky connection you may wish to make a number of small changes and save after each.
5. You don’t have to let people know who you are. I do, but I don’t expect it from others. You can use an IP or (if registered) an alias, or a real name. I use “petermr” which is not difficult to crack (use Google). If you are registered then add a signature (~~~~ is the magic) which links to your own WP Talk page.

Since all the edits are recorded, I can simply link to them!

Page creation. Note that WP has formatted the page nicely (I simply have to remember that there is no need for a title, that each section has ==foo== and to add {{stub}} at the bottom). The links go in [[..]], e.g. [[Open Access]]. I’ve saved it after a paragraph as it’s dinner time. Any visitor will see that it’s a new page (stub) – they may or may not feel they want to add something. It’s not my page, it’s our page. However it’s probably good manners to wait a day or so before editing a new page.
Here is the history of the edits. Notice that a Wikipedian has already spotted the article and removed the “stub” before me!
Here is the latest version. It’s not “mine” – it’s ours. Anyone can edit it – I am sure at least there are typos and many other links and references can be added.
the real final version tonight.
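The markup conventions mentioned above (==foo== for section headings, [[..]] for links, {{stub}} at the bottom) are simple enough to assemble by hand, but here is a small Python sketch that puts them together. The function name, the sample text and the "See also" heading with bulleted links are my own illustrative choices, not a record of what went into the actual page.

```python
def build_stub(intro, sections, links):
    """Assemble wikitext for a minimal stub page.

    intro:    opening paragraph (the page title itself is not repeated)
    sections: list of (heading, body) pairs, each rendered as ==heading==
    links:    titles of other pages, rendered as [[Title]] links
    """
    lines = [intro, ""]
    for heading, body in sections:
        lines += [f"=={heading}==", body, ""]
    if links:
        # Assumed layout: related pages collected under a "See also" heading.
        lines.append("==See also==")
        lines += [f"* [[{title}]]" for title in links]
        lines.append("")
    # The stub marker goes at the bottom of the page.
    lines.append("{{stub}}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_stub(
        "Open Data is a term for data that anyone may re-use.",
        [("History", "To be written.")],
        ["Open Access"],
    ))
```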

Posted in open issues | 1 Comment

Enjoyable version control with the Tortoise

Posted on October 30, 2006 by pm286

Update, merge, test, add, update, commit… that’s what I do when working in a communal software project. Update, merge, test, add, update, commit…
So I’ve written some additional tests for JUMBO. Now I have to save them and share them with my collaborators (anyone who wants to collaborate). When I look at my directory on Windows it looks like:

As you can see some of the files are green and some are orange. The orange ones are those I have changed since “last time”. Last time of what? The last time I used the Tortoise:

This cheerful fellow (developed initially by Francis – who I first met on the Cambridge geek punt convention – maybe more later) sits on top of CVS. Now CVS is a wonderful tool – about 15 years old, I think – which manages the versions of your documents. You (or someone else) set up a CVS server and every so often you check your current files against the server. It tells you when they are out of sync and whether you need to upload a file TO the server or download someone else’s update FROM the server. (Remember we work as collaborators!).
CVS is a great concept but pretty hairy for normal mortals, with some fairly esoteric runes to remember. It’s sufficiently forbidding almost to scare you away. Tortoise makes it trivially easy by adding itself to your normal file browser. All you do is right-click and follow simple commands.
The first is Update. Has anyone changed the files on the server since I last used CVS? If so it will download them to my directory and merge them. (Don’t worry about being overwritten, CVS remembers all versions). Update also tells me which files I have edited.

(The pink ones are new). Since there were one or two serverside changes I now have to re-test the system (the changes might break my code). In this case the test passes. Sometimes it requires a lot of work to merge the changes – that’s normally a good thing because someone else is working with you. Now let’s re-update in case the server has changed. No? Did I create any new files? I check with “Add contents” – yes, I did. (This is where I most frequently goof up – unless I send these files to the server my collaborators won’t see them. And their systems will then fail). So I add the files, and then “Commit”.
Now everyone in the project can update from my latest changes…
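For readers who prefer the command line to the Tortoise menus, the same cycle can be written out as plain cvs commands. The Python sketch below only builds the command lists and prints them; it does not run anything. The cvs subcommands (update, add, commit -m) are standard, but the function name, the file name and the test command are placeholders I have assumed for illustration.

```python
def cvs_cycle(new_files, message, test_command):
    """Return the update / test / add / update / commit steps as argv lists.

    Merging is not a separate command: `cvs update` merges server-side
    changes into the working copy as it downloads them.
    """
    steps = [["cvs", "update"]]          # fetch and merge others' changes
    steps.append(list(test_command))     # re-test after the merge
    if new_files:
        steps.append(["cvs", "add", *new_files])  # register brand-new files
    steps.append(["cvs", "update"])      # has the server changed meanwhile?
    steps.append(["cvs", "commit", "-m", message])
    return steps


if __name__ == "__main__":
    # "ant test" and the file name are hypothetical stand-ins.
    for step in cvs_cycle(["FooTest.java"], "Added tests", ["ant", "test"]):
        print(" ".join(step))
```

Forgetting the add step is the goof described above: the commit succeeds, but collaborators never receive the new files.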
Some of you will have thought “what happens if two people make changes to the same file at the same time?” CVS will try to merge the changes. If they are in different parts of the file it’s probably OK. If not there will be a conflict. This is a bit of a pain, and normally involves reverting to the older version and then agreeing between the two people what changes each wanted to make and why. In many projects – such as CDK – the active developers keep a chat room open so they can send messages like “I need to change X – does anyone mind?” Much better than technical mechanisms.
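To make the "different parts of the file" idea concrete, here is a rough Python sketch using the standard difflib module. It compares each person's version against the common older version and reports a conflict when the changed line ranges overlap or touch. This illustrates the principle only; it is not how CVS itself is implemented, and the function names are invented for the example.

```python
from difflib import SequenceMatcher


def changed_regions(base, edited):
    """Return (start, end) line ranges of `base` that `edited` modified."""
    matcher = SequenceMatcher(a=base, b=edited, autojunk=False)
    return [
        (i1, i2)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def edits_conflict(base, mine, theirs):
    """True if the two sets of changes overlap or touch in the base file.

    Treating adjacent changes as a conflict is a deliberately cautious
    choice made for this sketch.
    """
    for a_start, a_end in changed_regions(base, mine):
        for b_start, b_end in changed_regions(base, theirs):
            if a_start <= b_end and b_start <= a_end:
                return True
    return False


if __name__ == "__main__":
    base = ["line one", "line two", "line three", "line four"]
    mine = ["LINE ONE", "line two", "line three", "line four"]
    theirs = ["line one", "line two", "line three", "LINE FOUR"]
    print(edits_conflict(base, mine, theirs))
```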
Here’s the final commit:

CVS is now being superseded by Subversion (SVN) which is even easier (and also has a Tortoise overlay). We use SVN locally, but I haven’t got round to changing it at SourceForge for JUMBO…
CVS or SVN is also very good for dealing with other documents (if they are in ASCII). Take a look. You’ll find you are working in a community, not alone…
(Oh – and why am I using Windows? Don’t ask)