Who's going to FOO 2007?

I was about to post about an idea – “Who is going to W3C2007, Xtech2007 …”, when Jean-Claude Bradley beat me to it.

Going to Science Foo Camp

I just got an invitation to attend Science Foo Camp in August 07, a unique meeting organized by Nature, O’Reilly and Google. Based on what I heard from last year’s attendees this will be an amazing opportunity to bounce ideas around.
I’d like to hear more from others who are going or who attended last year.

As before, we will be inviting around 200 people who are doing particularly interesting work in a wide range of scientific disciplines, as well as in areas of technology and culture that influence, and are influenced by, science. The aim is to encourage cross-fertilization of ideas, creating a unique opportunity to explore topics that transcend traditional boundaries. Of course, senior colleagues from Nature, O’Reilly, and Google will also be present.

Well I’m also going, J-C!
But I am going to a lot of meetings and I thought there might be a way of pre-meeting on the Web. Maybe this is already answered – if so please let me know – but if not…
Could we use del.icio.us – by tagging the registration page (or some other page) of the conference with a simple set of tags? Like “arriving saturday”, “not yet made arrangements for sunday dinner”, “would like to meet anyone interested in scientific XML”, and so on. Also I might say “am hoping to spend a few days in Bay area before FOO camp” – and see if anyone would like to meet up pre-meeting.
So here are 3 of my upcoming conferences:

Posted in Uncategorized | 8 Comments

experiment and theory – the liberation of data and source

Antony Williams on the ChemSpider blog has paid tribute to NMRShiftDB. I have copied this in full and comment below on how theory and experiment test each other:

Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community

Posted by: Antony Williams in Quality and Content

I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that the commercial software vendors and Open Source/Access communities can coexist and have mutual admiration? I believe so!

After over 18 months of work (after-hours style…much like blogging) I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR, and a 360-page review article is finally submitted – “Computer-Assisted Structure Verification and Elucidation Tools in NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work, as well as that of many contributors, in the domain of CASE (Computer-Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of MW > 1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in the recent critique by Wolfgang Robien of the quality of the data, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to confirm is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data – over 200,000 individual chemical shifts – I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien. But our estimate after analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, that these data are free to download and utilize even for the development of derivative algorithms, and that such systems can work. The addition or improvement of rigorous checking algorithms in NMRSHIFTDB is the next natural step; flagging suspect data to the submitter will have them check and validate the quality of their input, and will catch many errors during the submission process.

So, my compliments to Christoph and the team. The quality is excellent: there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with Robien’s – likely we saw the same things, since they were very obvious. These errors should not detract from the effort: with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem: with >10 million structures there are errors…lots of them. But it’s very useful all the same!
======
So thank you Antony for this analysis. This represents best practice in the prediction of chemical properties (a sketch of the comparison step follows the list):
  • find a set of molecules for which nmr spectra have been measured and which are available in machine-parsable digital form (i.e. molecules in CML or legacy formats, spectra in CMLSpect or JCAMP-DX, not hamburger PDF)
  • compute the NMR spectrum for each molecule. ACDLabs have one of the best programs for doing this – which I think is based on lookup, heuristics and machine learning techniques. It’s also possible to use fundamental simulations such as ab initio quantum mechanics.
  • compare the observed and predicted values.
  • attempt to rationalise any discrepancies in terms of experiment, theory, or both. If this is possible then it may be possible to refine either of these methods.
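To make the comparison step concrete, here is a minimal sketch in Python, assuming we already have paired observed and predicted shifts (in ppm) for each assigned atom. The function name and the 5 ppm outlier threshold are illustrative, not part of any published protocol.

```python
# A minimal sketch of the compare-and-flag step, assuming observed and
# predicted shifts are already paired per assigned atom. The threshold
# is illustrative only.
import math

def compare_shifts(pairs, outlier_ppm=5.0):
    """pairs: list of (atom_id, observed, predicted) tuples, shifts in ppm."""
    deviations = [(atom, obs - pred) for atom, obs, pred in pairs]
    rmsd = math.sqrt(sum(d * d for _, d in deviations) / len(deviations))
    # Atoms whose deviation exceeds the threshold are the candidates for
    # errors in EITHER experiment OR theory - step 4 of the protocol.
    outliers = [(atom, d) for atom, d in deviations if abs(d) > outlier_ppm]
    return rmsd, outliers

rmsd, outliers = compare_shifts([("C1", 128.5, 128.9), ("C2", 35.2, 61.0)])
print(f"RMSD = {rmsd:.2f} ppm; flag for inspection: {outliers}")
```

The point of the outlier list is exactly the rationalisation step: each flagged atom is a small, tractable question (“is the assignment wrong, the measurement wrong, or the prediction wrong?”) rather than an anonymous contribution to a global error statistic.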

In favourable cases we find that experiment and theory agree (within known limits). Each side of the equation then validates the other. This gives us great confidence in creating general protocols. Thus, from the account above, ACDLabs can predict NMR spectra within published limits and get it right 99.9% of the time.
Joe Townsend has done the same for crystallography – we are in the process of writing this up but basically we get similar results. We create a protocol to identify poor experimental data and filter them out, and then compare the rest with predictions from quantum mechanics. We also get > 99.9% agreement. As a result we are confident that the data collected in Nick Day’s CrystalEye will be highly valuable.
So the message is simple. If you wish to predict properties, you must follow the steps above AND

  • publish your methodology and protocols
  • publish your test data set before and after filtering
  • publish the agreement
  • use this to give confidence limits to your predictive method.

Only then can you reasonably announce to the world that you have a useful method.
For example I am listening to Bobby Glen giving a talk about how we predict solubility of organic compounds in our Centre. The data in the literature can be awful – some measurements differ by a factor of 1000. Yet many groups have developed predictive methods based on them and these methods are widely deployed. So Bobby’s group is going back to the bench to measure solubility by new and better methods – but I’ll let them tell the story.
Unfortunately this is all too rare in chemoinformatics. Many publications report predictive methods, but the data, protocols, algorithms, software and analysis methods are often neither reported nor publicly available. The process is not repeatable outside the organisation that created the methodology. Unless the software is open source it is fundamentally impossible to verify algorithms – reports in traditional publications are far too brief to allow fully tested implementations. So, although for the NMR study the data were open, the program was closed source, which means that only the authors can investigate any discrepancies.
Oh, and have I mentioned in other blogs that > 99.9% of NMR spectra are not available in Open machine-parsable form? Because publishers copyright them and try to sell them back. Because chemists do not see the value of preserving their own data. Because manufacturers have binary formats.
Well, the chemical blogosphere is going to change these attitudes. We are going to liberate data. We’ll start with crystallography and then move on to spectra. We’re not going to reveal all the methods in case people try to block us. But we’re confident it will work.

Posted in data | Leave a comment

Chemical Blogosphere

For those who denigrate the blogosphere I reply that the chemical blogosphere is an excellent example of a coherent, productive, communal social organism. The members find their ecological niche, and where necessary feed off each others’ electronic secretions. ChemBark (Paul) is an insightful reviewer of it, and here he compares the various members to (US) television programs or other media. (Egon and I are compared to CNBC – not something I consciously watch in the US though I recognise their logo).
Conventional Media Harvests the Blogosphere…Again

Posted by Paul on April 30th, 2007

In a bittersweet turn of events, the chemical blogosphere is losing Carmen and her blog, She Blinded Me with Science. The upside is that she landed a job at C&EN, so we’ll get to enjoy her stuff in bigger doses. I’m guessing the pay is better, too.

For those keeping score, this marks the third time that a chemistry magazine has lured a popular blogger into its ranks. Carmen follows in the footsteps of Derek Lowe and Dylan Stiles, who now pen monthly columns for RSC’s Chemistry World.

It’s definitely fitting, because of all the chemical blogs out there, Carmen’s is the one most like C&EN (note: this is a compliment, not an insult). Her posts are meaty, often read like mini-feature articles, are always on topic (science/chemistry), and have a serious/professional tone. Finally, she avoids making negative posts (no bloggarific mud-slinging) and doesn’t use swear words. In short, it’s no surprise that C&EN snapped her up.
In contrast, if I were to get a job at C&EN and submit anything like the stuff written here, Rudy Baum would defenestrate my computer and me along with it. I’ve actually heard ChemBark called The O’Reilly Factor of chemistry blogs. Can you believe that? While I was initially revolted by that thought—I often find myself in disagreement with Billy—a deeper analysis has led me to the conclusion that the comparison is valid. After all, we both frequently address hot-button issues, we don’t shy away from sharing our opinions, we occasionally come off as smug jerks, and a significant percentage of our audience is composed of people who hate our guts. That said, I think he is more of a loudmouth and that I make more sense than he does.
Anyway, all of this ridiculousness got me thinking about what television shows are the most analogous to other chemistry blogs out there. Here’s what I decided:
The Chem Blog — The Daily Show with Jon Stewart — Good information is presented with an editorial spin in a humorous manner.
Carbon-Based Curiosities — Live with Regis and Kelly — A healthy balance of research news and assorted fluff, although Excimer swears more than Regis.
Totally Synthetic — NFL Primetime — Paul D. breaks down total syntheses like Ron Jaworski breaks down game film. ChemDraw = the Telestrator.
In the Pipeline — Money magazine — OK, I’m cheating here; I can’t think of a good TV analogue of Derek’s blog, which is a mix of technical news, industry news, lab research, and human interest stories. In this regard, I think it’s like an issue of Money magazine for drug discovery instead of financial matters.
Chemical Musings — Countdown — Like Milo’s blog, this British game show is filled with puzzles and includes more banter than your typical American game show. At least, it did when the great Richard Whiteley was host.
Sceptical Chymist — The Sports Reporters — This show put a camera in front of a panel of four reporters (here: editors) and allowed them to talk about whatever newsworthy issues they pleased. It was definitely more professional and thoughtful than most sports shows.
Lamentations on Chemistry — Andy Rooney’s Segments on 60 Minutes — Thoughts from a grizzled chemistry veteran.
Post Doc Propter Doc — Dilbert — Her blog is about her life in lab, and it often sounds like a sit-com.
Org Prep Daily — The Essence of Emeril — BAMMM!! Recipes with flair.
Chemical Forums — Washington Journal (C-SPAN’s morning show) — The focus of both is on audience participation.
The Chemical Informatics Crowd (e.g., Peter Murray-Rust and Egon) — CNBC Network — Much like CNBC covers the financial world inside and out, the chemical informatics community has taken to blogging en masse, making that subject the deepest explored issue in the chemical blogosphere.
The Half-Decent Pharmaceutical Chemistry Blog, Curious Wavefunction, Whistling in the Wind — The National Public Radio lineup — There are a lot of blogs that resemble the programming of NPR, in that they are quality shows that never get the attention they deserve.
It’s difficult to pigeonhole blogs into TV equivalents, so some of the above comparisons ring more true than others. To any of you who feel offended, just remember that none of these is worse than being called another Bill O’Reilly.
Posted in blueobelisk, semanticWeb | 4 Comments

Latest Blue Obelisk Greasemonkey

Noel O’Boyle posts:

Add quotes from PostGenomic and Chemical Blogspace to journal

Greasemonkey is a Firefox extension that allows you to rewrite the HTML of a webpage on-the-fly. Pedro Beltrão was the first to think of adding a link to journal Table of Contents pages whenever a particular paper had been reviewed on PostGenomic.com. I extended Pedro’s script to include a clickable pop up of the actual blog post as described by Egon.
I have just released a new version, described on the Blue Obelisk wiki and available from User scripts. This incorporates comments from both Postgenomic and Chemical Blogspace, although you can use the menu to choose just one or the other.
Feedback is welcome. In particular, what journals would people like to see added? Currently, only the following websites are included, although others may work if you add them (please let me know if they do):
  • http://pubs*.acs.org/*
  • http://www.rsc.org/*
  • http://www*.interscience.wiley.com/*
  • http://www.nature.com/*
  • http://*.oxfordjournals.org/* (Added 01/May/07)
  • http://*.plosjournals.org/*
  • http://www.pnas.org/*

Here’s the obligatory screenshot showing a recent issue of Nature containing quotes from both Chemical Blogspace and Postgenomic:

===================
This is very exciting. It makes your browser a semantic lens for a whole host of journals. The chemical blogosphere is becoming the primary place where the chemical literature is reviewed as soon as it is published (or sometimes before!).

Posted in Uncategorized | 3 Comments

OAI-ORE

I am delighted to be able to write about OAI-ORE – on whose advisory board I am. I also had the chance recently to have a long discussion with two of the people driving it, Carl Lagoze and Herbert Van de Sompel, who write:
We thought you might be interested in the presentation about OAI-ORE given at the recent OAI5 Workshop which took place at CERN, Geneva, Switzerland.
The presentation gives an insight regarding the problem domain in which ORE operates, and in the evolving thinking regarding potential solutions.
The presentation was recorded on video and is available for both streaming and download at:
We would also like to mention that Michael Nelson did similar presentations on behalf of the ORE effort at the recent CNI Task Force meeting and at the DLF Forum:
Greetings
Carl & Herbert
Herbert Van de Sompel
Digital Library Research & Prototyping
Los Alamos National Laboratory, Research Library
tel. +1 505 667 1267
ORE will be an extremely important development. Essentially OAI-PMH described the metadata that accompanied objects in a repository, and it works fine as long as the object and the metadata are essentially the same – such as depositing PDFs. But for compound objects (for example an HTML file with several images) OAI-PMH is not able to cope – it cannot describe the relationships of the components. Which is why, for example, the only feasible way of repositing HTML files is to zip them.
ORE can describe compound objects, and will also be able to describe compound objects that intersect. Although ORE is formally not dependent on the objects being in repositories, the motivation – OAI – means that is where the drive comes from.
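To make “compound object” concrete, here is a minimal sketch of describing an HTML page together with its images as a single aggregation. It uses Python’s rdflib; the ore:aggregates property comes from the evolving ORE drafts and the URIs are invented, so treat this as illustrative, not as a normative serialization.

```python
# A minimal sketch of a compound-object description, assuming the draft
# ORE idea of an "aggregation" that lists its parts. Namespace, property
# and URIs are illustrative of the evolving model, not normative.
from rdflib import Graph, Namespace, URIRef

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)

aggregation = URIRef("http://example.org/article/aggregation")  # hypothetical
# The compound object: one HTML file plus the images it embeds.
for part in ("http://example.org/article/index.html",
             "http://example.org/article/fig1.png",
             "http://example.org/article/fig2.png"):
    g.add((aggregation, ORE.aggregates, URIRef(part)))

print(g.serialize(format="turtle"))
```

The value over OAI-PMH is visible even in this toy: the relationship between the HTML and its images is stated explicitly, rather than being implied by a zip file.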
My advice so far has been – keep it simple (e.g. I have to be able to understand it) – get early implementations out – and make sure the whole community knows what is going on. After all, every repository will have to be able to implement it, and that includes filing systems and directories on servers.
Then no-one will have any excuses for not putting their stuff in repositories.
Posted in Uncategorized | Leave a comment

Repositories or Lists of Open Molecules

I am looking for lists (or repositories) of small molecules with connection tables (or machine-parsable molecular structures) which are Open. By Open I mean that anyone can, in principle, download, copy or clone part or all of the site, re-use the information and redistribute it without reference to the original site. At present I am aware of:

  • Pubchem (10 million+, superset of many Open datasets including NCI. I use this term to subsume everything at nih.gov)
  • ChEBI (> 25 000 terms collected at EBI, not all with connection tables)
  • MSD (ligands in Protein structures, collected at EBI > 5000)
  • WWMM (250,000 calculated structures from the NCI database). Reposited in DSpace.
  • Crystallographic Open Database – crystal structures collected from the literature or donated. Soon to be complemented with CrystalEye. This should give nearly 100,000 crystal structures.
  • The BlueObelisk Data Repository (BODR). A collection of critical information collected by BO volunteers primarily as reference data for (Open) software. (includes non-molecular stuff like elemental properties). BODR is widely distributed on Gnome and other Open Source distros.

I’ve almost certainly missed some so please let us know. There should be a substantial amount of (molecules x attributes) in the collection and ideally it should provide complementary information to the above list. The connection tables (InChI, CML, SMILES, MOL) (or 3D coordinates) of the molecules should be easily accessible (i.e. not requiring resolving identifiers).
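As a footnote on “easily accessible”: in practice this means the connection tables interconvert mechanically. A minimal sketch using the pybel bindings to Open Babel (itself a Blue Obelisk project); the molecule is just an example:

```python
# A minimal sketch of interconverting connection-table formats with the
# Open Babel "pybel" bindings (a Blue Obelisk project). The molecule is
# just an example; SMILES/MOL/CML/InChI inputs work the same way.
from openbabel import pybel  # in older releases: "import pybel"

mol = pybel.readstring("smi", "c1ccccc1O")   # phenol, as SMILES
print(mol.write("inchi").strip())            # InChI identifier
print(mol.write("can").strip())              # canonical SMILES
print(mol.write("mol"), end="")              # MDL MOL block
```

A site whose structures can be pumped through a pipeline like this without manual intervention meets the criterion; one that requires a human to resolve identifiers page by page does not.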
Note that there are many useful sites which offer free queries on databases (e.g. NIST Webbook) but these do not fit the criterion of Openness – could they be downloaded in toto, cloned and redistributed (with attribution, of course)?
If we don’t already have it, perhaps there should be a page on the Blue Obelisk Wiki.

Posted in Uncategorized | 6 Comments

The SWORD is mightier than the pen

I am just relaxing in a hotel in Redmond, WA, US after two weeks’ very hard work, instead of going downtown in Seattle and shopping (which I hate unless it is for Obelisks). So the posts are going in all directions today. Incidentally a post seems to take at least an hour, which is why I don’t post every day. I am seriously considering audio blogs (podcasts) and would welcome advice.
Part of my environment now is “digital libraries” – at least I get to interact with librarians, information scientists, etc. There is a real buzz of change in the air. One feature of that is the Institutional Repository (IR) – a digital place where people in an institution (usually a University) put digital things. That’s as close as I can get to the commonality of vision. Why are people doing this – why are they NOT doing this? What do they want to put in? I’ll probably blog about this later.
Caveat Lector is an intriguing blog from a librarian (Dorothea), an ex-publisher, who describes herself as

Sure, I’m a geek. But I’m not a gadget geek or a Lib2 geek or even a web geek (more than incidentally). I’m not even really a markup geek any more; I sling XML now and then, but as a side requirement of my real work rather than the focus of my professional attention. What I am is a problem-solving geek. I have a problem with a technology (say, hm, I dunno, DSpace?), I beat the living daylights out of it with the nearest handy rock until either it does what I want or I decide that the problem needs a better tool than a rock and give up (complaining bitterly afterwards, of course).

She sounds like the sort of person that we desperately need – an XML-slinger who works in libraries and information science. We collaborate with Jeremy Frey (Southampton – CombeChem, eCrystals, etc.) and yesterday we agreed that at the top of our shopping list we needed information scientists embedded in chemistry (and other scientific) departments. That’s the role of the modern “library”.
Anyway there has been a lot of debate about repositories and how to get stuff into them. As a scientist I know that this requires two conditions:

  • There has to be an overwhelming motivation for the scientist (like losing their job if they don’t do it) and…
  • … it has to be trivially easy

At present neither of these is true. Recently JISC has announced a project, SWORD, to help people (authors? librarians?) reposit material. That is a great advance. Jim Downing, who advises me on much of this, thinks it’s important. And I am supportive of what JISC is doing with repositories (readers should allow for the fact that they are supporting us!).
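To show why “trivially easy” matters, here is a minimal sketch of what a SWORD-style deposit could look like from a script: one authenticated HTTP POST of a package to a repository collection URL. The endpoint, credentials and headers are hypothetical; the real SWORD profile is AtomPub-based and defines its own packaging conventions.

```python
# A minimal sketch of an AtomPub-flavoured deposit, of the kind a SWORD
# endpoint could accept. Endpoint, credentials and headers are
# hypothetical; the real profile defines its own packaging headers.
import requests

DEPOSIT_URL = "https://repository.example.org/sword/collection/chemistry"  # hypothetical

with open("dataset.zip", "rb") as f:
    response = requests.post(
        DEPOSIT_URL,
        data=f,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=dataset.zip",
        },
        auth=("depositor", "secret"),  # hypothetical credentials
    )

# A successful AtomPub-style deposit returns 201 Created plus an entry
# describing where the item now lives.
print(response.status_code)
print(response.text[:500])
```

Dorothea’s “Archive It!” button, quoted below, is exactly this made invisible to the user: one unmediated POST, with metadata prepopulated by the tool that produced the content.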
So let Caveat Lector tell it her way…

Well, hot damn

I ask for middleware, and lo, there is middleware!

(No, I don’t think there’s a direct cause-and-effect relationship there. Even on my worst days I’m not that arrogant! Just shows that I’m not the only person with that particular train of thought.)
Bring on the SWORD, y’all. I’ll wield that baby, you betcha.

Dies Martis, 24 Aprili 2007

Repository middleware

I did a lot of IR marketing this week, despite my perfect awareness that IR marketing doesn’t work. For a tactic that doesn’t work, I did manage to come away with some contacts, and it appears that the IR made its way into some heads, and that’s all good.

But if marketing doesn’t work, what does?
Here’s the problem I’ve got: there’s a ton of material that’s IR-ready floating around, but I can’t get at it. My nose is mashed up to the window of other people’s hard drives, web servers, workflow silos, and collaboration tools. I want the stuff that comes out of those arenas. I just have no way to grab it.
Here’s the problem everybody else has got: they need the curation, preservation, and “put this important content somewhere safe (but otherwise out of my hair)” tools that an IR theoretically provides, but they don’t need the hassle of extra deposit steps. They need an “Archive It!” button. They just have no way to build one… if they even know about the IR to begin with.
I need middleware, and I need it badly.
I don’t think DSpace or EPrints developers should be directly considering building the kinds of tools that Peter Murray-Rust is talking about. We’re the wrong people for the job (we can’t even do versioning!), and the job is being done elsewhere by others anyway, because faculty want and need these tools, and IT is finally listening. (I have direct evidence of that from my own job, but I need to keep fairly quiet about it because work is ongoing. You’ll just have to trust me.)
What DSpace and EPrints developers should be considering is how to hook IRs up to the firehose of research products those other tools are producing. By my one-horse back-of-the-napkin calculations, that means an ingest API (no, not a command-line batch import tool, an API!) that is configurable enough to authorize certain tools for unmediated deposit and then prepopulate metadata fields with what those tools “know” about their content and the people who use them.
It’s a tall order, but I dearly hope it’s not impossible, because I want to get my IR’s ingest pipe connected to that firehose.

One caveat (small C). If you build it they may not come. That’s the challenge. If they do come, and it’s a good SWORD, they’ll be less likely to go away. That’s why we have to work on all areas of encouraging people to capture and reposit digital artifacts.

Posted in Uncategorized | 1 Comment

BioMOO and BlueObelisk Cemetery

Jean-Claude Bradley posts in Nature Island Review

Joanna Scott just wrote a nice little review of what is going on at Nature Island (slurl) on Nature’s Nascent blog since her return from the American Chemical Society meeting in Chicago.
The Blue Obelisk Cemetery, where I give my students quiz races on Fridays was featured (only possible through help from Beth and Eloise – thanks again!). Another fun place is Mary Anne Clark’s biological cell that you can enter and float amongst the mitochondria.
Nature Island has really become a very interesting place to hang out, meet smart people and learn and share.

A slurl is a Second Life URL – an address in the emerging virtual world. The Blue Obelisk community has built a cemetery of Obelisks – see the picture in J-C’s link. The name sounds a bit gloomy, but the sun seems to be shining – and no doubt the obelisks celebrate something positive.
I have been a great believer in virtual worlds. Pioneers in this were British Telecom / University of Essex (MUD) and Xerox PARC (Pavel Curtis) with the MOO. Many MOOs were built with LambdaMOO software (not an intuitive system!) including BioMOO for bioscientists.
BioMOO is no more – it flourished from ca 1993 to ca 1998. It was years ahead of its time – as was Diversity University. But the digital record is inexorably decaying and this is a tragedy. I regard BioMOO as at least part of the current subjective unconscious of the collective scientific web and a first generation of what we are now seeing in scientific Second Life. Here are some bitshards (electronic potsherds) unearthed from Google:
BioMOO
an overview captured in 1997 but all the links are 404. (The Virtual School of Natural Sciences was my creation in Diversity University).

2-4.2 The BioMoo

a summary by TECFA on the value of BioMOO for education. Again most links are 404.

BioMOO PPS97-98

A record of the meetings in BioMOO of the Principles Of Protein Structure course which Alan Mills and I kicked off in 1995. “ClareS” is Clare Sansom who took over.

[DESCRIPTION] BioMOO announce VR web interface

a description of how to use BioMOO in 1995 (VR = virtual reality)

BioMOO meeting, PPS Base, 14th Mar ’96 17:00 GMT

a transcript of a tutorial on protein structure at which petermr was present.

Diversity University, 29.09.1996

with link to “tour of BioMOO by petermr”

Analyse d’un Mud: le BioMoo

A description of BioMOO including a map. BioMOO (Gustavo Glusman and others) developed a rich system of images to paint a picture of BioMOO. Our course was held in the VSNS-PPS classroom and some of the students decorated the virtual walls with interactive 3D objects in RasMOL.
So – Second Life – it’s all been done before. But before the world was ready for it. Because it’s technically hard and takes a lot of infrastructure. So perhaps all these ideas of the 1990s have had to wait for the Cloud, the Web, Second Life etc. to impact on everyday life. So now it is almost costless to try them out.
I am delighted to see a garden in which flowers of sorts that I cannot imagine will grow.
But the tragedy is that we have lost much of the digital garden of the first generation. Are we in danger of losing the record of the BlueObelisk Cemetery? Can we record it for posterity? I think we probably can.

Posted in Uncategorized | 2 Comments

Another puzzle

I have blogged about the broken state of chemical information and the lack of semantics in current commercial chemical software and drawing tools. This is exemplified by the amount of incorrect structures received by publishers (and I have talked to several).
What is the most common chemical substance received by a chemical publisher (or at least the Royal Society of Chemistry, thanks to Colin Batchelor)? It’s not water, or a common solvent. It’s an error.
If you KNOW – i.e. I or Colin or someone else has told you – please don’t answer. But otherwise please guess.

Posted in Uncategorized | 2 Comments

Pubchem Pigeons and Parrots

I’ve spent some of the last two days talking with Steve Bryant who runs The PubChem Project. People often think (and I’ve been guilty of this) that there is a lot of junk in Pubchem (although the proportion is very low). That’s not really accurate – it would be better to say that there are a lot of name-to-structure links that most people would regard as inaccurate.
The primary problem is names (and naming is one of the critical challenges of the digital age). Lewis Carroll recognised the central role of names – in Alice’s Adventures in Wonderland Alice’s neck had grown extremely long and …

… a large pigeon had flown into her face, and was beating her violently with its wings.

`Serpent!’ screamed the Pigeon.
`I’m not a serpent!’ said Alice indignantly. `Let me alone!’
[…]
`And just as I’d taken the highest tree in the wood,’ continued the Pigeon, raising its voice to a shriek, `and just as I was thinking I should be free of them at last, they must needs come wriggling down from the sky! Ugh, Serpent!’
`But I’m not a serpent, I tell you!’ said Alice. `I’m a–I’m a–‘
`Well! what are you?’ said the Pigeon. `I can see you’re trying to invent something!’
`I–I’m a little girl,’ said Alice, rather doubtfully, as she remembered the number of changes she had gone through that day.
`A likely story indeed!’ said the Pigeon in a tone of the deepest contempt. `I’ve seen a good many little girls in my time, but never one with such a neck as that! No, no! You’re a serpent; and there’s no use denying it. I suppose you’ll be telling me next that you never tasted an egg!’
`I have tasted eggs, certainly,’ said Alice, who was a very truthful child; `but little girls eat eggs quite as much as serpents do, you know.’
`I don’t believe it,’ said the Pigeon; `but if they do, why then they’re a kind of serpent, that’s all I can say.’

This exemplifies a fundamental problem of naming – the pigeon uses phenotypes and Alice uses genotypes, and Alice’s phenotype is inconsistent with her genotype.
I’ll try to create an analogy and then map it onto Pubchem. Mr Python sells musical animals such as parrots and mouse organs. He has a number of suppliers who send animals (and occasionally collections of animals) which are labelled. Mr Python is not an ornithologist, but he has bought a molecular biology kit and can sequence the DNA of the things he is sent. He uses the names the suppliers send and his own internal numbering system based on the DNA of the thing he is sent (let’s assume no intra-species variation in the DNA) running from C1…. He also has a cataloguing system for everything S1…
Supplier 1 sends a live specimen labelled “Norwegian blue parrot” – its DNA is labelled C1 and its catalog is S1.
He now gets:

  • A “white mouse” C2 S2
  • An “african grey parrot” C3 S3
  • A box of assorted animals labelled “animal organ”. Mr Python cannot extract DNA and the label is * S4
  • Another parrot labelled “norwegian blue” and with DNA consistent with C1. He labels this C1 S5

So far there is no problem. But now he gets:

  • a bird called “oslo beauty” with DNA C1. He labels this as C1 S6.

Now when anyone asks for a “norwegian blue” or an “oslo beauty” it will retrieve catalog entries with DNA of C1. Mr Python starts to regard these names as synonyms. When asking for this, most customers ask for “norwegian blue” and this becomes the preferred name.
But now the confusion starts:

  • he gets a picture of a parrot. Since he is not an art gallery he does not accept this into the collection.
  • he gets a parrot labelled “norwegian blue” which does not look very perky. He puts this into his collection as S7. Try as he might, he can’t get any DNA out of it. It is, in fact, a stuffed parrot. So he complains to the supplier – the rest is history – but the entry still stands in the record – “norwegian blue” – S7. He does not offer this to his customers.
  • he gets a parrot labelled “norwegian blue” whose DNA corresponds to C3. There is a name collision, but Mr Python is completely ignorant of parrot names and it goes in his collection as “norwegian blue” C3 S8.
  • He gets another bird labelled “parrot” with DNA C3. This is labelled as “parrot” C3 S9

Mr Python has kept an accurate record of what he is sent. He is deliberately impartial about which name belongs to which DNA. He does care about selling dead birds and does not offer them in his catalog. But for all the rest he simply shows what names have been associated with what DNA. If someone asks for “norwegian blue” they get offered both C1 S1 and C3 S8. A request for “parrot” gets C3 S9.
Pubchem has a similar approach. Here the role of the DNA is replaced by the chemical connection table or “structure”. Steve and his colleagues check any sample offered to see if it has a connection table (CT). If this CT is already in Pubchem, they use its number. If not, they create a new entry for the new connection table. If there are name(s) associated with the sample, these names are associated with the CT, no matter how apparently “incorrect” they are. The information provided by the supplier goes in exactly as it is sent.
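Mr Python’s ledger is easy to state in code. Here is a minimal sketch of the “record names exactly as supplied, key everything on the connection table” model; all names and identifiers are invented for illustration, and PubChem’s real CID/SID machinery is of course far richer.

```python
# A minimal sketch of Mr Python's bookkeeping, which is loosely the
# PubChem model: the connection table (here a stand-in string) gets a
# compound ID, every deposit gets a substance ID, and supplied names are
# recorded as-is, however "wrong" they look. All IDs are invented.
from collections import defaultdict
from itertools import count

compound_ids = {}                # canonical CT -> C-number
names_for = defaultdict(set)     # C-number -> every name ever supplied
substances = []                  # one entry per deposit: (S-number, name, C-number)
next_c, next_s = count(1), count(1)

def deposit(name, connection_table):
    """Register a deposit exactly as supplied; never judge the name."""
    if connection_table not in compound_ids:
        compound_ids[connection_table] = f"C{next(next_c)}"
    cid = compound_ids[connection_table]
    names_for[cid].add(name)
    sid = f"S{next(next_s)}"
    substances.append((sid, name, cid))
    return sid, cid

deposit("norwegian blue parrot", "CT-parrot-1")
deposit("oslo beauty", "CT-parrot-1")      # same CT: a synonym is born
deposit("norwegian blue", "CT-parrot-3")   # colliding name, faithfully kept
print(names_for["C1"])  # {'norwegian blue parrot', 'oslo beauty'}
```

Note what the code does NOT do: it never rejects or corrects a name. That single design decision is the source of both PubChem’s fidelity and the “absurd” links described next.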
This leads to name-structure links which appear absurd to most of us. But we should remember that there have been some very strange names in the past. You might expect that “salts of lemon” is citric acid or a citrate, but actually it refers to Potassium Hydrogen Oxalate (Pubchem entry). So Pubchem records this accurately.
The real problem is with chemical suppliers. The level of accuracy and checking is often extremely poor. Nick Day did a study on one compound (staurosporine) – which has only one canonical structure (determined by X-ray crystallography) but for which 19 variants exist in the literature (some are simply “wrong”, others are “fuzzy” – omitting stereochemistry, etc.).
Formally the only way to resolve a naming problem is to convene a committee of the great and the good in the domain, have them debate endlessly and finally come out with a recommendation. That’s what the International Union of Pure and Applied Chemistry does. It’s a highly respected body, but such processes are slow. It took many years to agree on the name of element 104 (even though none of it exists in nature) because of US-Soviet politics. Often the names are never used in practice. We all talk of “water” – H2O – but that is anglophone and the systematic international name is “oxidane”. But how many of you use that!
So Pubchem records every name+structure association it has. For the name “methane” it finds CH4 as the first structure, because CH4 is the structure most commonly associated with that name. Pubchem generates “preferred” names and structures by a popularity algorithm (the details of which are, perhaps deliberately, not obvious).
It also finds 15 other structures, including things like isotopically labelled methane, and also some “wrong” structures which have “methane” in the name but are not methane, nor do they have “methane” as a name. So this is a feature of the search algorithm – name searching is a hard problem – and as Pubchem continues to develop its software these false hits will reduce. (There is no “absolutely correct” when you are dealing with linguistics.)
Let’s look at the names for CID 297. The preferred name is methane – it’s at the top of the list. But we also find “carbon”. Methane is not carbon, so why is “carbon” a name for methane?
It’s because chemists often omit the hydrogen atoms when drawing or writing chemical structures. So in some contexts “C” is interpreted as CH4 (remember carbon has a valency of 4) and O as H2O.
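The convention is mechanical: a bare element symbol soaks up hydrogens until its usual valency is satisfied. A toy sketch (the valency table is deliberately tiny, and real chemistry has charges, radicals and variable valencies that this ignores):

```python
# A toy sketch of the implicit-hydrogen convention: a bare atom symbol
# picks up enough hydrogens to satisfy its usual valency. The table
# covers only a few common elements and ignores charges and radicals.
USUAL_VALENCY = {"C": 4, "N": 3, "O": 2, "S": 2, "H": 1}

def implicit_formula(symbol, explicit_bonds=0):
    """Expand a bare atom symbol to its hydrogen-saturated formula."""
    h = USUAL_VALENCY[symbol] - explicit_bonds
    return symbol if h <= 0 else (symbol + "H" if h == 1 else f"{symbol}H{h}")

print(implicit_formula("C"))  # CH4 - why a bare "C" can mean methane
print(implicit_formula("O"))  # OH2 - i.e. water, written symbol-first
```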
The fundamental problem is that chemoinformatics and chemical information are broken and the main purveyors do not care. Chemical software tools will emit structures which are semantic rubbish, suppliers of chemical compounds mislabel their bottles, and databases of chemical information are closed and compete against each other in a commercial market. The semantics are necessarily inconsistent.
So Pubchem faithfully reflects the broken nature of chemical information. It cannot mend it – there are only ca. 20 people – and anyway the commercial chemical information world prefers to work with a broken system.
But could social computing change it? Like Wikipedia has? I talked with Steve about this. He said social computing had been tried on sequences and it hadn’t worked. Comments tended to be “I published that first but it has got X’s citation on it”.
I think chemistry is different. And I think we could do it almost effortlessly – rather like the Internet Movie Database. Here every participant can vote for popularity or tomatoes. A Greasemonkey-like system could allow us to flag “unuseful names” or to vote for the preferred names and structures. And this doesn’t have to be done on PubChem – it could be a standoff site addressed through a Greasemonkey script.
Martin Walker tells us there are 23,000 pages in Wikipedia on chemistry. If we link those to Pubchem – and Steve is happy for this to happen – then Wikipedia becomes a communal standoff annotation tool for Pubchem. Of course there are > 10 million compounds in Pubchem, but many of them are probably uncontroversial – they probably come from a single supplier and are rarely accessed. Wikipedia addresses the most important ones.

Posted in Uncategorized | 7 Comments