I've spent some of the last two days talking with Steve Bryant, who runs The PubChem Project. People often think (and I've been guilty of this) that there is a lot of junk in Pubchem (although the proportion is very low). That's not really accurate - it would be better to say that there are a lot of name-to-structure links that most people would regard as inaccurate.
The primary problem is names (and naming is one of the critical challenges of the digital age). Lewis Carroll recognised the central roles of names - in Alice's Adventures in Wonderland Alice's neck had grown extremely long and ...
... a large pigeon had flown into her face, and was beating her violently with its wings.
`Serpent!' screamed the Pigeon.
`I'm not a serpent!' said Alice indignantly. `Let me alone!'
`And just as I'd taken the highest tree in the wood,' continued the Pigeon, raising its voice to a shriek, `and just as I was thinking I should be free of them at last, they must needs come wriggling down from the sky! Ugh, Serpent!'
`But I'm not a serpent, I tell you!' said Alice. `I'm a--I'm a--'
`Well! what are you?' said the Pigeon. `I can see you're trying to invent something!'
`I--I'm a little girl,' said Alice, rather doubtfully, as she remembered the number of changes she had gone through that day.
`A likely story indeed!' said the Pigeon in a tone of the deepest contempt. `I've seen a good many little girls in my time, but never one with such a neck as that! No, no! You're a serpent; and there's no use denying it. I suppose you'll be telling me next that you never tasted an egg!'
`I have tasted eggs, certainly,' said Alice, who was a very truthful child; `but little girls eat eggs quite as much as serpents do, you know.'
`I don't believe it,' said the Pigeon; `but if they do, why then they're a kind of serpent, that's all I can say.'
This exemplifies a fundamental problem of naming - the pigeon uses phenotypes and Alice uses genotypes, and Alice's phenotype is inconsistent with her genotype.
I'll try to create an analogy and then map it onto Pubchem. Mr Python sells musical animals such as parrots and mouse organs. He has a number of suppliers who send animals (and occasionally collections of animals) which are labelled. Mr Python is not an ornithologist, but he has bought a molecular biology kit and can sequence the DNA of the things he is sent. He uses the names the suppliers send and his own internal numbering system based on the DNA of the thing he is sent (let's assume no intra-species variation in the DNA) running from C1.... He also has a cataloguing system for everything S1...
Supplier 1 sends a live specimen labelled "Norwegian blue parrot" - its DNA is labelled C1 and its catalog is S1.
He now gets:
- A "white mouse" C2 S2
- An "african grey parrot" C3 S3
- A box of assorted animals labelled "animal organ". Mr Python cannot extract DNA and the label is * S4
- Another parrot labelled "norwegian blue" and with DNA consistent with C1. He labels this C1 S5
So far there is no problem. But now he gets:
- a bird called "oslo beauty" with DNA C1. He labels this as C1 S6.
Now when anyone asks for a "norwegian blue" or an "oslo beauty" it will retrieve catalog entries with DNA of C1. Mr Python starts to regard these names as synonyms. Most customers who want this bird ask for "norwegian blue", and this becomes the preferred name.
But now the confusion starts:
- he gets a picture of a parrot. Since he is not an art gallery he does not accept this into the collection.
- he gets a parrot labelled "norwegian blue" which does not look very perky. He puts this into his collection as S7. Try as he can he can't get any DNA out of this. It is, in fact, a stuffed parrot. So he complains to the supplier - the rest is history - but the entry still stands in the record - "norwegian blue" - S7. He does not offer this to his customers.
- he gets a parrot labelled "norwegian blue" whose DNA corresponds to C3. There is a name collision, but Mr Python is completely ignorant of parrot names and it goes in his collection as "norwegian blue" C3 S8.
- He gets another bird labelled "parrot" with DNA C3. This is labelled as "parrot" C3 S9
Mr Python has kept an accurate record of what he is sent. He is deliberately impartial about which name belongs to which DNA. He does care about selling dead birds and does not offer them in his catalog. But for all the rest he simply shows what names have been associated with what DNA. If someone asks for "norwegian blue" they get offered both the C1 S1 and C3 S8. A request for "parrot" gets C3 S9.
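Mr Python's record-keeping can be sketched in a few lines of Python. This is only an illustration of the analogy: the `Entry` class, the `offered` flag and the two helper functions are names invented here, and the lookup assumes exact, case-insensitive matching of labels. Under that assumption S1's longer label "Norwegian blue parrot" counts as a different name, so the C1 hit for "norwegian blue" comes from S5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    sid: str             # catalogue number: S1, S2, ...
    label: str           # name exactly as the supplier wrote it
    dna: Optional[str]   # DNA-based number (C1, C2, ...), None if no DNA was obtained
    offered: bool = True # False for things Mr Python will not sell


CATALOGUE = [
    Entry("S1", "Norwegian blue parrot", "C1"),
    Entry("S2", "white mouse", "C2"),
    Entry("S3", "african grey parrot", "C3"),
    Entry("S4", "animal organ", None),
    Entry("S5", "norwegian blue", "C1"),
    Entry("S6", "oslo beauty", "C1"),
    Entry("S7", "norwegian blue", None, offered=False),  # the stuffed parrot
    Entry("S8", "norwegian blue", "C3"),                 # the name collision
    Entry("S9", "parrot", "C3"),
]


def lookup(name):
    """Return (dna, sid) for every offered entry carrying exactly this label."""
    wanted = name.strip().lower()
    return [(e.dna, e.sid) for e in CATALOGUE
            if e.offered and e.label.lower() == wanted]


def names_for(dna):
    """Return every label that any supplier has attached to this DNA."""
    return sorted({e.label.lower() for e in CATALOGUE if e.dna == dna})


print(lookup("norwegian blue"))
print(names_for("C1"))
```

Nothing in the sketch decides which name is "right". It only reports which labels have arrived attached to which DNA.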
Pubchem has a similar approach. Here the role of the DNA is replaced by the chemical connection table or "structure". Steve and his colleagues check any sample offered to see if it has a connection table (CT). If this CT is already in Pubchem, they use its number. If not they create a new entry for the new connection table. If there are name(s) associated with the sample, these names are associated with the CT, no matter how apparently "incorrect" they are. The information provided by the supplier goes in exactly as it is sent.
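The deposition rule described above (reuse the number if the connection table is already known, otherwise create a new entry, and attach the supplier's names unchanged) might look roughly like this. It is a sketch of the logic only, not Pubchem's actual code: the `Registry` class is hypothetical, an opaque string stands in for a real connection table, and the numbers it hands out are not real Pubchem identifiers.

```python
class Registry:
    """Toy register keyed on connection tables (here just opaque strings)."""

    def __init__(self):
        self._id_by_ct = {}     # connection table -> structure number
        self._names_by_id = {}  # structure number -> names, as deposited

    def deposit(self, connection_table, names=()):
        """Register a sample and return the number of its structure."""
        number = self._id_by_ct.get(connection_table)
        if number is None:
            # New connection table: create a new entry with the next number.
            number = len(self._id_by_ct) + 1
            self._id_by_ct[connection_table] = number
            self._names_by_id[number] = []
        # Names go in exactly as sent, however "incorrect" they look.
        self._names_by_id[number].extend(names)
        return number

    def names(self, number):
        return list(self._names_by_id[number])


registry = Registry()
first = registry.deposit("ct-A", ["methane"])
second = registry.deposit("ct-B", ["water"])
again = registry.deposit("ct-A", ["carbon"])  # same structure, different name

print(first == again, registry.names(first))
```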
This leads to name-structure links which appear absurd to most of us. But we should remember that there have been some very strange names in the past. You might expect that "salts of lemon" is citric acid or a citrate, but actually it refers to Potassium Hydrogen Oxalate (Pubchem entry). So Pubchem records this accurately.
The real problem is with chemical suppliers. The level of accuracy and checking is often extremely poor. Nick Day did a study on one compound (staurosporine) - which has only one canonical structure (determined by X-ray crystallography) but for which 19 variants exist in the literature (some are simply "wrong", others are "fuzzy" - omitting stereochemistry, etc.).
Formally the only way to resolve a naming problem is to convene a committee of the great and the good in the domain, have them debate endlessly and finally come out with a recommendation. That's what the International Union of Pure and Applied Chemistry does. It's a highly respected body, but such processes are slow. It took many years to agree on the name of element 104 (even though none of it exists in nature) because of US-Soviet politics. Often the names are never used in practice. We all talk of "water" - H2O - but that is anglophone and the systematic international name is "oxidane". But how many of you use that?
So Pubchem records every name+structure association it has. For the name "Methane" it finds CH4 as the first structure. That's because the preferred structure for "methane" is CH4. Pubchem generates "preferred" names and structures by a popularity algorithm (details of which are ?deliberately? not obvious).
It also finds 15 other structures, including things like isotopically labelled methane, and also some "wrong structures" which have "methane" somewhere in the name but are not methane, nor do they have "methane" as a name. So this is a feature of the search algorithm - name searching is a hard problem - and as Pubchem continues to develop its software the number of such hits should fall. (There is no "absolutely correct" when you are dealing with linguistics.)
Let's look at the names for CID 297. The preferred name is methane - it's at the top of the list. But we also find "carbon". Methane is not carbon, so why is "carbon" a name for methane?
It's because chemists often omit the hydrogen atoms when drawing or writing chemical structures. So in some contexts "C" is interpreted as CH4 (remember carbon has a valency of 4) and O as H2O.
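The implicit-hydrogen convention is easy to show in code. The sketch below assumes a deliberately tiny table of usual valences for neutral atoms, and the function name is made up for illustration. Real toolkits handle charges, aromaticity and multiple valence states, none of which is attempted here.

```python
# Usual valences for a few neutral atoms (illustrative, far from complete).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(symbol, explicit_bonds=0):
    """Hydrogens a reader silently adds to an atom drawn with this many bonds."""
    return max(DEFAULT_VALENCE[symbol] - explicit_bonds, 0)


# A bare "C" is read as CH4 and a bare "O" as H2O.
print(implicit_hydrogens("C"))  # 4
print(implicit_hydrogens("O"))  # 2
```

A tool that fills in hydrogens this way will read a deposited "C" as methane, which is one way the name "carbon" ends up attached to CH4.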
The fundamental problem is that chemoinformatics and chemical information are fundamentally broken and the main purveyors do not care. Chemical software tools will emit structures which are semantic rubbish, suppliers of chemical compounds mislabel their bottles, databases of chemical information are closed and compete against each other in a commercial market. The semantics are necessarily inconsistent.
So Pubchem faithfully reflects the broken nature of chemical information. It cannot mend it - there are only ca. 20 people - and anyway the commercial chemical information world prefers to work with a broken system.
But could social computing change it? Like Wikipedia has? I talked with Steve about this. He said social computing had been tried on sequences and it hadn't worked. Comments tended to be "I published that first but it has got X's citation on it".
I think chemistry is different. And I think we could do it almost effortlessly - rather like the Internet Movie Database. Here every participant can vote for popularity or tomatoes. A greasemonkey-like system could allow us to flag "unuseful names" or to vote for the preferred names and structures. And this doesn't have to be done on PubChem - it could be a standoff site addressed through a greasemonkey script.
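As a rough sketch of what such a standoff voting store could hold, assume votes are kept separately from Pubchem and keyed only by compound number and name. The `NameVotes` class and its methods are hypothetical, and a real site would also need user accounts, persistence and some protection against ballot stuffing.

```python
from collections import Counter, defaultdict


class NameVotes:
    """Standoff store of community votes on name-to-structure links."""

    def __init__(self):
        # compound number -> Counter mapping each name to its net votes
        self._votes = defaultdict(Counter)

    def vote(self, cid, name, delta=1):
        """Record +1 (useful) or -1 (unuseful) for a name on a compound."""
        self._votes[cid][name] += delta

    def preferred(self, cid):
        """Name with the most net votes (alphabetical on ties), or None."""
        counts = self._votes.get(cid)
        if not counts:
            return None
        return max(sorted(counts), key=lambda name: counts[name])

    def flagged(self, cid):
        """Names whose net vote is negative, i.e. judged unuseful."""
        counts = self._votes.get(cid, {})
        return sorted(name for name, net in counts.items() if net < 0)


votes = NameVotes()
votes.vote(297, "methane")
votes.vote(297, "methane")
votes.vote(297, "carbon", -1)

print(votes.preferred(297), votes.flagged(297))
```

Because the store refers to Pubchem only by number, the names Pubchem records stay exactly as deposited. The votes are a separate annotation layer on top.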
Martin Walker tells us there are 23000 pages in Wikipedia on chemistry. If we link those to Pubchem - and Steve is happy for this to happen - then Wikipedia becomes a communal standoff annotation tool for Pubchem. Of course there are more than 10 million compounds in Pubchem, but many of them are probably uncontroversial - they probably come from a single supplier and are rarely accessed. Wikipedia addresses the most important ones.