At the risk of boring readers who already understand the issue of names, metadata, recursive annotations and versions, let me do this discussion to death.
I reiterate. A name by itself is neither right nor wrong. It is possible that the syntax might determine whether it is a name or not, but that’s all. “green feathery compound” was a name we gave to a lead compound at Glaxo. (It wasn’t a very good lead).
A connection table obeys some syntax, but otherwise is neither right nor wrong. COOOOOOC is a valid SMILES. So is CO(O)(O)OC. So is [H][U]Ge. They are unlikely to exist, but they are still valid by the syntax rules.
The name “water” is associated with the compound with formula H2O. So is “wasserstoff”. So is “oxidane”.
I have made three statements. Some of you will assert that I am “wrong”. Some will say the name for H2O is “wasser”. Others might say that “wasserstoff” should be associated with the compound H2.
ChemSpiderMan:
Peter, your original posting was on how long it took to copy a structure (well, that’s my interpretation anyways).
If someone copies a structure and misses a fragment of the structure, by default isn’t it wrong? If someone draws Taxol and reverses a stereocenter by accident, is that Taxol anymore? I don’t think so…I think it’s a poor copy, and wrong.
PMR: So if someone takes my statement “wasserstoff is H2O” and mistypes it as “wasserstoff is H2”, then Chemspider asserts they are wrong because they have made a typo. But they have made a statement with which most (German) chemists might agree.
This is the type of quality issue that is essential to have tracked.
PMR: If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.
Maybe I misinterpreted your point about right and wrong structures?
PMR: I think you have. Read Alice through the Looking Glass. Carroll understood the issue very well.
PMR: There is no absolute way of deciding what the name of a compound is. There are authorities who make meta-statements. Thus IUPAC states that various chemical structures have various names. Chemical Abstracts states that various chemical structures have various names. Suppose they differ? Chemspiderman makes the final decision!?
No. The only “absolute” is if there are real-world consequences. If I state that I have sold compound X and it is safe, and someone else says that actually compound X is something else, then we have a court case. The lawyers will argue that Chemical Abstracts is more important than IUPAC. Or vice versa. And I go to jail because I got the wrong name. But I am neither right nor wrong; I have simply made a statement which conflicts with one made by a real-world authority who can send me to jail.
The only modern way to do this is with constant annotation, including versioning. This is what sites like Sourceforge and Wikipedia provide. And the latter has a form of cybergovernance which can never be absolute, even as Plato’s is not absolute, but it’s good enough for me.
So Chemspider is fundamentally and almost irretrievably broken because it does not have metadata. It deals with absolutes, while the modern world deals with assertions. And the technology – RDF/OWL – has now arrived to support assertions.
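To make the difference between absolutes and assertions concrete, here is a minimal sketch in plain Python rather than real RDF/OWL tooling. The `Assertion` class, its field names and the `values_by_source` helper are all hypothetical, invented for illustration. The point is only that every statement carries its source and version, that an assertion can itself be the subject of another assertion, and that nothing in the record says "right" or "wrong".

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Assertion:
    """One statement made by one source; nothing here marks it right or wrong."""
    subject: Union[str, "Assertion"]  # a name, or an earlier assertion being annotated
    predicate: str
    value: str
    source: str
    version: int = 1


def values_by_source(store, subject, predicate):
    """Collect what each source says, keeping the highest version per source."""
    latest = {}
    for a in store:
        if a.subject == subject and a.predicate == predicate:
            if a.source not in latest or a.version > latest[a.source].version:
                latest[a.source] = a
    return {source: a.value for source, a in latest.items()}


pmr_v1 = Assertion("wasserstoff", "denotes", "H2O", source="PMR")
store = [
    pmr_v1,
    Assertion("wasserstoff", "denotes", "H2", source="a German chemist"),
    # A recursive annotation: a statement about the first statement.
    Assertion(pmr_v1, "comment", "possible typo for H2", source="a later annotator"),
]

print(values_by_source(store, "wasserstoff", "denotes"))
```

Both statements about "wasserstoff" survive side by side, each tied to whoever made it. Deciding between them is left to the reader, or to whichever authority the reader chooses to trust.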
I agree wholeheartedly with Peter. I think the emphasis on curation is misplaced. It isn’t that the issues are unimportant; it is that many of the curation questions are essentially scientific questions, and saying you are going to solve them by curation leads to a situation where these scientific decisions are made in ways that don’t become part of the database record. I think at this point in time the emphasis should be on an architecture that allows multiple structure/name claims from multiple sources to be compared, and then on ways to annotate and track the discussions that arise as consensus is reached on the inconsistencies. This is exactly what scientists do, and I don’t see why we should build databases that hide these issues from the people who use them.
To be sure, most inconsistencies will be the equivalent of typos, but even these problems are ill served by the current setup. I can tell you that at least 3 or 4 other groups have contributed structures to PubChem that were originally copied from the DTP database and hence are not really independent data sources. If there was a problem in a DTP-generated structure, do we look at PubChem and say that, because a number of independent groups have the same structure, there is a consensus developing on an alternative structure? Not if we knew that all the structures originally came from the same source.
Setting up databases that assume the only information needed for a structure is the connection table, coordinates, and whether the representation is right or wrong throws away all the information about where the structure came from, how it was manipulated, what evidence was claimed in support of it, etc. It doesn’t even allow us to know what correction to make. We just had an example where someone found a name for a DTP compound that didn’t match the structure at all. It would be pretty easy to substitute the correct structure for that name, but in this case that would be exactly the wrong thing to do. The primary data from the supplier was the structure, and it was a DTP mistake that associated an incorrect name with that structure.
In my opinion the most pressing need is for an awareness of the importance of these data items and for an architecture that captures and makes use of them. I think PubChem has made a good start and I would very much like to encourage that, rather than use its limited resources to try to be the final (and hidden) arbiter of what is right and wrong.
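The point about copied records not being independent sources can be sketched in a few lines of Python. Everything here is hypothetical: the `Claim` class, the `copied_from` field, the depositor names other than DTP, and the placeholder structure labels are invented for illustration, and the sketch assumes one claim per depositor. It is not how PubChem stores anything; it only shows what becomes possible once provenance is kept with the record.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    depositor: str
    name: str
    structure: str                     # placeholder label, not a real connection table
    copied_from: Optional[str] = None  # depositor this record was copied from, if any


def root_source(depositor, by_depositor):
    """Follow copied_from links back to the depositor that originated the record."""
    seen = {depositor}
    while True:
        parent = by_depositor[depositor].copied_from
        if parent is None or parent in seen:
            return depositor           # original record, or a copy loop
        if parent not in by_depositor:
            return parent              # origin is named but not in our collection
        seen.add(parent)
        depositor = parent


def independent_support(claims):
    """Count distinct originating sources per structure, not raw depositions."""
    by_depositor = {c.depositor: c for c in claims}
    roots = defaultdict(set)
    for c in claims:
        roots[c.structure].add(root_source(c.depositor, by_depositor))
    return {structure: len(sources) for structure, sources in roots.items()}


claims = [
    Claim("DTP", "compound X", "structure-A"),
    Claim("Group 1", "compound X", "structure-A", copied_from="DTP"),
    Claim("Group 2", "compound X", "structure-A", copied_from="DTP"),
    Claim("Group 3", "compound X", "structure-A", copied_from="Group 2"),
    Claim("Group 4", "compound X", "structure-B"),
]

print(independent_support(claims))
```

A raw count would show four depositions for structure-A against one for structure-B. With the copy chain recorded, each structure turns out to have exactly one independent source, which is a very different basis for any talk of consensus.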