The Obelisk SMILES

We are delighted that Craig James has suggested making the molecular format SMILES an Open activity. Egon Willighagen writes:

08:03 28/09/2007, Egon Willighagen,
Craig James wants to make SMILES an open standard, and this has been received with much enthusiasm. SMILES (Simplified molecular input line entry specification) is a de facto standard in chemoinformatics, but the specification is not overly clear, which Craig wants to address. The draft is CC-licensed and will be discussed on the new Blue Obelisk blueobelisk-smiles mailing list.
Illustrative is my confusion about the sp2 hybridized atoms, which use lower case element symbols in SMILES. Very often this is seen as indicating aromaticity. I have written up the arguments supporting both views in the CDK wiki. I held the position that lower case elements indicated sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent discussion, however, stirred up the discussion once more (which led to the aforementioned wiki page).
You can imagine my excitement when I looked up the meaning in the new draft. It states: The formal meaning of a lowercase “aromatic” element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms.

PMR: This is excellent. The problem with specifications is that it is VERY difficult to describe them so that independent groups can interpret them consistently. I spent some years helping with the XML effort and apparently simple ideas could cause huge debates. (e.g. namespaces…) It’s well known that some constructs in computer languages, such as
int i = 6;
int j = i++ * i++;
i = i++;
cause enormous confusion. What are the results? (Try to work it out, then try it out, and then find the “right answer”; your compiler may surprise you.) [*]
Back to chemistry. Almost all formats have been proprietary. That means that there is unlikely to be much useful interactive public help from the originators, and the only check is likely to be a binary executable. When I joined the pharma industry and started trying to get some standards, one software company threatened to sue anyone who published their molecular file formats. It’s slightly better now, but IMO the responsibility for the current appalling situation lies with the pharma industry, which has had no effective interest in standardising anything and is now paying the price. (It can only survive by using information, and until it makes this standard and largely free it won’t.)
That’s a major reason for developing CML (Chemical Markup Language). CML is open, and uses open standards (XML). It’s much larger than SMILES, and there are places where it is defined less well than we would like, but at least it’s open and that can happen.
SMILES is very widely used. Creating an open standard will take more effort than might appear. The “aromatic” or “lower case” concept is extremely difficult to define. I don’t understand the definition:
The formal meaning of a lowercase “aromatic” element in a SMILES string is that the atom is in the sp2 electronic state.
I don’t believe that SMILES has anything to do with electronic states and I think it should simply be a means for counting atoms, formal bonds and electrons. Is there a difference between Cn(C)C and CN(C)C ? The first represents a planar transition state of trimethylamine, the second a pyramidal ground state.
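To make the "counting" view concrete, here is a minimal Python sketch. It is not a SMILES parser and is not taken from the draft specification or from any toolkit: the helper `count_atoms` is hypothetical, handles only unbracketed atoms of the organic subset, and ignores bonds, branches, ring closures and charges. It simply treats a lowercase symbol as the same element as its uppercase form.

```python
import re
from collections import Counter

# Hypothetical helper for illustration only. It recognises unbracketed
# organic-subset atoms and skips every other SMILES character.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def count_atoms(smiles: str) -> Counter:
    """Count element symbols, treating lowercase and uppercase alike."""
    return Counter(token.capitalize() for token in ATOM.findall(smiles))

planar = count_atoms("Cn(C)C")      # lowercase n
pyramidal = count_atoms("CN(C)C")   # uppercase N

print(dict(planar))
print(planar == pyramidal)
```

Under this reading both strings describe three carbons and one nitrogen, so any difference between them has to come from whatever extra meaning the specification attaches to the lowercase letter.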
But the positive point is that I have the chance to put this view and others have the chance to support it, modify it or challenge it. Just like Wikipedia, the Blue Obelisk uses the court of public opinion. And we have the exciting position that a “Web 2.0” community is now about to lead the chemoinformatics world.
Maybe the pharma industry will take us seriously. And, wonder of wonders, might actually come into the open, say so, and offer some support.
[*] Actually both statements are undefined (in C, at least) and may give different answers on different compilers.

Posted in blueobelisk, chemistry, open issues | 2 Comments

What's in a name? hexanoic acid still smells of goats

In a recent post I said – rather crudely – that there was no absolute way of understanding chemical names. I have been (rightly) taken to task for imprecision:

ChemSpiderMan Says:
September 25th, 2007 at 5:04 am
I’m not sure what you mean by the comment “Because there is no absolute way of assigning names to structures.” Systematic naming is exactly that….IUPAC Naming, CAS Naming. Well defined rules. Now, are they exhaustive across all forms of chemistry..surely not…inorganics, organometallics, polymers while challenging do have nomenclature standards too while some believe they don’t. Of course chemical structure classes change…there were no rules of fullerenes before they were synthesized. But, in general there IS an absolute way of assigning the names to structures. Maybe I misinterpreted your

PMR: This is true, in principle, for certain classes of compounds (mainly organic). BTW many chemical (informatics) folk are arrogant enough to assume that there is nothing in the world except organic chemistry. There are many chemicals which aren’t organic. The Wikipedians have a lot of problems in deciding how to assign a name to something because they use names as both descriptions and addresses. Naming is hard. Very hard. It’s been said that there are only two hard problems in computer science and naming is one of them. Here are some substances that cannot be represented by a formal name, only by lookup:
calcite / aragonite
Bakelite
invert sugar
and, of course, there are trivial names, such as Diazonamide A. Why use that rather than the systematic name? Because when it was first discovered they didn’t know what it was. It seems they still don’t. Or at least some people don’t. The name relates not to a connection table but to a sample with associated properties such as composition, melting point, NMR, etc. which serve to identify, but not always elucidate.
Trivial names are convenient. Therefore we need an Open (not just free) set of chemical names.
I’ve just remembered. We’ve got several: Pubchem, Wikipedia, ChEBI. Set up respectively by biologists, volunteers, biologists. For the service of chemists. They might even get interested in helping them grow.

Posted in chemistry, open issues | 5 Comments

Structures that InChI and SMILES can't represent

Even in organic chemistry there are lots of structures that cannot be represented by InChIs and currently cannot be communicated without structure diagrams. I’ve gone randomly to Beilstein Journal of Organic Chemistry (as it’s Open Access) and found three consecutive abstracts. They contain ideas of variable locants, spatial arrangements, non-atomic species (balls), reactions, ion pairs, organometallic coordination. It would be an act of scientific barbarism to copyright anything below.

Novel base catalysed rearrangement of sultone oximes to 1,2-benzisoxazole-3-methane sulfonate derivatives
Veera Reddy Arava, Udaya Bhaskara Rao Siripalli, Vaishali Nadkarni, Rajendiran Chinnapillai
Beilstein Journal of Organic Chemistry 2007, 3:20 (8 June 2007)

m-Iodosylbenzoic acid – a convenient recyclable reagent for highly efficient aromatic iodinations
Andreas Kirschning, Mekhman S Yusubov, Roza Y Yusubova, Ki-Whan Chi, Joo Y Park
Beilstein Journal of Organic Chemistry 2007, 3:19 (4 June 2007)

A convenient catalyst system for microwave accelerated cross-coupling of a range of aryl boronic acids with aryl chlorides
Matthew L Clarke, Marcia B France, Jose A Fuentes, Edward J Milton, Geoffrey J Roff
Beilstein Journal of Organic Chemistry 2007, 3:18 (30 May 2007)

FWIW: CML can manage much of the uncertainty above, but although it is a work of breathtaking beauty it also shouldn’t be copyrighted.

Posted in chemistry, open issues | 1 Comment

Grazie!

I made the sweeping assertion at Berlin5 that no-one other than me was blogging (I asked for a show of hands), and am delighted to be proved wrong:

Paolo Gardois Says:
September 24th, 2007 at 3:30 pm
Firstly, compliments for your presentation, it was great!!
Secondly, I was at Berlin 5, blogging the meeting, so you should feel a little less sad… :-) Our blog is in Italian, but if you want to take a look: http://unitosbd.wordpress.com .
Please let us know what you think…

PMR: Paolo has done a great job and blogged every session.

(Translated from the Italian.) Peter Murray Rust, a Cambridge scientist and blogger, continues with a paper on making research data public, starting from the example of data on pollution and global warming. After the unmissable Tufte quotation (“Power Corrupts. PowerPoint Corrupts Absolutely”), Rust goes on to outline the omnipresence of copyright, both in the tables and graphs published within scientific articles by commercial publishers and in databases (e.g. ACS).

Now, one aspect of open access concerns free access to information, but another concerns the possibility of reusing data. The two are not always linked, and this is a problem for scientists, who generate new knowledge by literally manipulating and reconfiguring publicly available data. One solution is to attach an explicit licence governing the use of the data (e.g. Science Commons).
Doctoral theses should also be released under similar licensing models (see the Harvard initiative).
On another front, there are also technical difficulties in extracting data (formulae, etc.) from publications in order to reuse them. So it is not only a problem of copyright but also one of formats. Raw data therefore need to be published in standard formats in public repositories, and in parallel text mining tools (automatic extraction of data from text files) need to be developed; XML of course, not PDF, which destroys science :-)
One example of such tools, useful for the semantic annotation of chemistry articles, is OSCAR3.
In any case, irony aside, what was said here about closed formats echoes what I wrote yesterday about closed documents as a now obsolete form of publishing knowledge. Data are often more interesting than the conclusions drawn from them, because they allow discussion and alternative interpretations: “locking” them exclusively inside PDF and PowerPoint is certainly a mistake, and another face of the same problem. The concepts from which a line of reasoning can start here are: open access (cultural, legal, economic and professional aspects), open source (IT and production aspects), open standards (access, reuse, reconfiguration).

PMR: Good, accurate report, like the others. I am interested to see that Italian uses many English terms directly: “copyright”, “repository”, “open access”, “text mining”. I don’t want to seem like an anglophone imperialist, but in the Internet age it can be useful to know that we are using the same terms for the same concept. Of course copyright will be country-dependent in precise meaning.

Posted in berlin5 | 1 Comment

Truth or beauty, continued

Continuing our discussion on whether a chemical structure diagram is copyrightable.

  1. Steven Bachrach Says:
    September 24th, 2007 at 9:27 pm
    Peter,
    I have to take exception to some of your claims. The chemical formula drawing is not the only way of communicating the compound. In fact there are really much better ways of doing this, though not necessarily the best for a human to readily read. The InChI or SMILES or 3-D coordinates really capture more information in a more reusable and less likely to be error-ridden way (especially the 3-D coordinates). The chemical formula drawing is not even unique, as we have seen in your examples.
    Furthermore, Totally Synthetic had many arbitrary decisions to make in how to represent the structure. I have modified this structure in a few simple ways to make this point:
    Note that I have changed the orientation of the terminal isopropyl/OH groups and the way the amide connects to ring A. With regard to ring B, the wedges here are actually NOT how it has to be. Note that the carbons of ring B are not stereocenters. The structure is drawn to try to indicate that ring B sort of π-stacks above ring E. This may or may not be true. Furthermore, the oxygen of the ring could in fact be pointing backwards. In my representation, I decided not to indicate any of this 3-D relationship.
    Now I am not claiming that my structure is better than the original. My claim, however, is that Totally Synthetic made some creative decisions in making this presentation, and thus it should be protected.

modFig.gif (Steven Bachrach’s modified structure drawing, http://hackberry.chem.trinity.edu/SMB/modFig.gif; image not available)
PMR: “With regard to ring B, the wedges here are actually NOT how it has to be. Note that the carbons of ring B are not stereocenters.” I don’t know whether this is true or not. Looking at the 3D structure it seems to me that there are two isomers (not conformers) where TS’s wedges show one. It may, however, be that the ring is sufficiently flexible that they interconvert rapidly enough not to be isomers.
More generally, however, there are many reasons why structural diagrams are essential. The diagram above is numbered. The numbers are essential to understand much of the data (spectral assignments, reactivity, etc.). They cannot be held in SMILES, or InChI or, indeed, in anything in common use other than CML (which has support for many sorts of annotation). Neither InChIs nor SMILES are any use for most organometallic compounds, polymers, intermolecular compounds, supermolecules, nanotubes, polymer beads, etc. Many of these things don’t have useful 3D coordinates – we are working on polymers and have developed Polymer Markup Language – and we can generate 3D coordinates very nicely, but not many others can.
So I contend that for much of chemistry diagrams are the only method of primary communication. It’s actually part of the problem of involving machines. How do we get these things into a formal system without losing information?
That’s hard enough without the publishers’ lawyers suing us.

Posted in chemistry, open issues | 1 Comment

Semantic web : the scream!

I have just blogged Paul Miller’s Talis Community Licence and realised that – I think – I used to get a feed from his/Talis blog. So I put it in the Feedreader and found a whole lot of posts on the semantic web (or Semantic Web). Now I had been battling with SPARQL for a day or two trying to make a query with real numbers, e.g.
FILTER (?foo < "1.23"^^xsd:float)
I could NOT get it to work. Finally my colleague Diana Stewart tracked it down to the fact that in some places in RDF you are allowed to use prefixes and in some places you can't. It's almost completely arbitrary. It's not in the simple tutorials. RDF is a place where if you make an inspired guess you will be wrong. The syntax (wrong: the 3-4 syntaxes) is all over the shop. XML, N3, Turtle, ??? It makes me scream.
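One place where this bites, which may or may not be the exact case Diana found, is RDF/XML: element names may be written with a prefix, but the value of an `rdf:datatype` attribute has to be a full IRI, so `rdf:datatype="xsd:float"` does not mean what it appears to. The Python sketch below only illustrates that expansion step. The function names and the `ex:foo` property are hypothetical; the two namespace IRIs are the standard W3C ones.

```python
# Assumed prefix table for illustration; extend as needed.
PREFIXES = {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'xsd:float' into a full IRI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local

def typed_literal_rdfxml(prop: str, value: str, datatype: str) -> str:
    """Write one typed-literal property as an RDF/XML fragment.

    The element name keeps its prefix; the datatype is expanded to a full IRI.
    """
    return f'<{prop} rdf:datatype="{expand(datatype)}">{value}</{prop}>'

print(typed_literal_rdfxml("ex:foo", "1.23", "xsd:float"))
```

If the stored literal carries the expanded datatype, a numeric comparison in a FILTER at least has a chance of seeing the value as a float and not as a plain string.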
So I was pleased to see that Jeni Tennison (one of the early evangelists of XSLT, coming up with some magic tricks – and goodness you need all the tricks you can get with XSLT) had the same reaction:

Posted in "virtual communities", semanticWeb | 1 Comment

Diazonamide : The Blue CrystalEye Greasemonkey lends a hand

There is some doubt about what the structure of diazonamide A is, because there is no absolute way of assigning names to structures. We only agree what aspirin is because everyone has been assigning the same structure to it for 100+ years. Many people are careless with names and even more are careless with structure diagrams. Indeed there seems to be a minor industry in drawing some structures wrongly. A year or two back, when Nick Day was pioneering the use of InChI, he used “staurosporine” as an example. He found lots of structure diagrams and I think there were 19 (sic) different diagrams. Some were frankly “wrong”. Others missed out the stereochemistry, others had other problems. And some of these were from suppliers’ sites (i.e. “labels on bottles”).
So how can we be sure? It needs an authority – but which one? Staurosporine is a (potential?) drug, so… WHO drugs? British National Formulary? US National Pharmacopeia? Chemical Abstracts? Beilstein? All of these are pay-to-view. So I cannot look them up (remember I am at home, simulating an interested person, such as a patient). Ah! Pubchem… with 16 entries, and several variations of stereochemistry. Wikipedia has a nice picture … but this is about diazonamide…
On TotSynth’s post there’s a link to the latest paper (DOI: 10.1021/ja0744448). And following this I find:
diazonamide3.png
PMR: The Blue Eye is NOT part of the abstract – it shows that the Blue Obelisk Greasemonkey has found a CrystalEye entry which looks like this:
diazonamide4.png
and here you can see the actual stereochemistry of the diazonamide nucleus (it’s not exactly the title compound) so there is virtually no doubt. The diagram on the right is calculated from the 3D coordinates and the layout is through CDK – note the stereo wedges and hatches.
So now I know what some of the stereo is. And because PNAS have made the text Open I can read how it relates to TS’s structure. The CDK may not be 100% beautiful, but it should be true (Cue some reader finding it’s wrong and a bug in JUMBO, but that’s what Open science is about). And you can always pay Chemical Abstracts 6.20 USD to check whether you have got it “right”.
So install the Blue Obelisk Grease Monkey (blog post) in your Firefox browser and Open your Eyes to a whole new world of truth and beauty.

Posted in blueobelisk, chemistry, data | 4 Comments

Beauty is truth, truth beauty – and copyrightable?

In (Finding chemical structures – InChIs et al., an amusement) I explored the varied approaches to drawing structures and the problems of representing them. I commented that Totally Synthetic’s diagrams were not only the most unambiguous but also the most beautiful. I now regret having done that, as Steve Bachrach has argued that this makes them copyrightable.

  1. Steven Bachrach Says:
    September 24th, 2007 at 5:23 pm
    This discussion demonstrates why I believe that “structure drawing” falls within the domain of materials that can be copyrighted. What is the difference in all of these representations (let’s agree not to worry about what might be a different stereochemistry at one center)? In the eye of the viewer some of these are “ugly” and some are more aesthetically pleasing. I might argue that Totally Synthetic’s representation is not only clear, it might even be called “pleasing”.
    Since it is aesthetics (beauty, clarity, etc) that differentiates these drawings, the creator’s choices in how to display this molecule were critical creative acts. It seems to me that this defines work that should fall under copyright protection. The fact that all of these representations refer to the same underlying chemistry does not diminish in the least the creativity involved to create the last structure, and perhaps a total lack of creativity in producing the PubChem drawing.
    So it seems to me that one should be careful in the re-use of structure drawings. To me these “drawings” are not data, while the connection table, for example, is data.
    Steven

PMR: Steve is a strong evangelist for OA and we are on the same side, but I don’t think arguing for copyrighting chemical structure diagrams is helpful. Let’s take the analogy of mathematical equations. I could argue that

e^x = \sum_{n=0}^\infty {x^n \over n!} = \lim_{n \to \infty}\left(\frac{1}{0!} + \frac{x}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}\right).
(taken from Wikipedia on Euler) was beautiful, while
ex = Σ(&infty;n=0)xn/n! was ugly.

Let’s assume that Euler had access to TeX and had published his formula in a journal belonging to the strong-copyright school of thought and that a student has cut and pasted the formula to explain the summation. The publisher could then claim “the formula is beautifully typeset so you must take it down, or retype it”. Poor Euler is dead so hasn’t any say.
In the same way the chemical formula diagram is the ONLY means of communicating the structure. In TotSynth’s case the wedge bonds in ring B are not to make it pretty but to emphasize what the compound actually is
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39?,40-/m0/s1
while the PNAS structure tries to do the same but in a much uglier and gritty fashion. There is a real likelihood of confusion as to what the structure actually is:

  1. DrZZ Says:
    September 24th, 2007 at 12:26 pm
    Interesting stuff. Let me add some additional points. One of the structures in PubChem comes from us (DTP/NCI). If you look at the compound record you get the mess you included above. If you navigate to the substance record (click on the CID, on that page look for the Substance: 1 link, when that hit comes up click on the SID, and on that page change the drop down choice for Compound Displayed from PubChem to deposited) you see a much more sensible 2D drawing. In a quick look, I think the difference between the two structures in PubChem is that one of the stereocenters in the NCI deposited structure is unspecified in the other structure. The NCI structure was submitted in 1997 by one of the authors of the original isolation paper. As the structure correction was published in 2001, it is almost certain that the NCI structure contains the original error. I say almost because we have no audit trail in our internal database for structures (at least not easily visible to me). An NSC has a structure, period. There is some possibility that the structure was fixed, but that just overwrites the previous structure. It just reinforces my view that it is extremely important to treat the structure of a substance as one more data point, subject to varied and possibly conflicting values.

diazonamide2.png
It’s very easy to get it wrong.
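A machine can at least report what an identifier does and does not commit to. The Python sketch below takes the InChI quoted above and pulls out two layers: the formula and the tetrahedral stereo layer (`/t`). The helper names are hypothetical and the parsing is deliberately naive; it assumes, as the InChI documentation appears to say, that a `?` after an atom number marks a centre whose configuration is not specified.

```python
import re

# The InChI string quoted in this post, split across lines for readability.
INCHI = (
    "InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19"
    "(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)"
    "47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)"
    "30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,"
    "(H,44,51)(H,45,50)/t23-,27-,30-,39?,40-/m0/s1"
)

def layers(inchi: str) -> dict:
    """Split an InChI on '/' and key each later layer by its prefix letter."""
    version, formula, *rest = inchi.split("/")
    result = {"version": version, "formula": formula}
    for layer in rest:
        result[layer[0]] = layer[1:]
    return result

def formula_counts(formula: str) -> dict:
    """Count atoms in a simple formula such as 'C40H34Cl2N6O6'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return {element: int(number or 1) for element, number in pairs}

def stereo_marks(t_layer: str) -> dict:
    """Map atom number to its mark ('+', '-' or '?') in the /t layer."""
    return {int(item[:-1]): item[-1] for item in t_layer.split(",")}

parts = layers(INCHI)
marks = stereo_marks(parts["t"])
print(formula_counts(parts["formula"]))
print([atom for atom, mark in marks.items() if mark == "?"])
```

Run on the string above, this lists atom 39 as the one centre left open, which is exactly the kind of difference that is easy to miss when comparing two drawings by eye.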
Now if I cut and paste the diagrams and say “this one shows clearly that the ring is sticking up” I might help avoid the wrong compound being given to the wrong patient somewhere down the line. (This is not hypothetical – these are possible drugs). Steve, can you justify a publisher saying – “we’ll send the lawyers after you for posting copyright chemical structures”?
Because if so, the C21 will be enormously impoverished. So yes, it’s beautiful, but NO, it mustn’t be copyrighted by publishers.

Posted in chemistry, open issues | 2 Comments

Talis licence for Open Data

I used to think Open Data was simple – “facts are not copyrightable” and everything follows. Now I am wiser and realise that data are complex and need a lot of attention – fast. So it’s very valuable to see groups who are addressing the problem. Here is Paul Miller of Talis (who convened a WWW2007 session on Open Data)

18:11 24/09/2007, Paul Miller, Nodalities
In the world of creative works, notions espoused by Lawrence Lessig and others over a number of years are becoming increasingly well understood. A Creative Commons license, for example, is recognised as giving the holder of rights an ability to prospectively grant certain permissions rather than limit use of their work by expecting all comers to request these permissions, again and again. Those rights are not cast aside, removing all opportunities to protect your work, your name, or your potential revenue stream. Rather, you are provided with a means to explicitly declare that your work may be used and reused by others in certain ways without their needing to request permission. Any other use is not forbidden; those uses must simply be negotiated in the ‘normal’ way… a normal way that also applied to those uses covered by Creative Commons licenses before the advent of those licenses.
Creative Commons licenses are an extension of copyright law, as enshrined in the legal frameworks of various jurisdictions internationally. As such, it doesn’t really work terribly well for a lot of (scientific, business, whatever) data… but the absence of anything better has led people to try slapping Creative Commons licenses of various types on data that they wish to share. It will be interesting to see what happens, the first time one of those licenses needs to be upheld via a court!
At Talis, we have an interest in seeing large bodies of structured data available for use. Through the Talis Platform, we offer one means whereby such data may be stored, used, aggregated and mined, although we clearly recognise that similar data may very well also be required in similar contexts.
Recognising that contributors of such data need to be reassured as to the uses to which we – and others – may put their hard work, we spent some time a couple of years ago drafting something then called the Talis Community Licence. This draft licence is based upon protections enshrined in European Law, and has been used ‘in anger’ for a while to cover contributions of millions of records to one particular application on the Talis Platform.
There has been plenty of talk around ‘open data‘ here on Nodalities, and on our sister blog Panlibus. See, for example, this recent post from Rob Styles. There were also fascinating discussions at the WWW2007 conference earlier this year.
Despite interest in open (or ‘linked‘) data, licenses to provide protection (and, of course, to explicitly encourage reuse) are few and far between. Amongst zealous early adopters, there does seem to be a tendency to either (mis)use a Creative Commons license, to say nothing whatsoever, or to cast their data into the public domain. None of these strategies are fit for application to business-critical data.
Building upon our original work on the TCL, we recently provided funding to lawyers Jordan Hatcher and Charlotte Waelde. They were tasked with validating the principles behind the license, developing an effective expression of those principles that could be applied beyond the database-aware shores of Europe, and working with us to identify a suitable home in which this new licence could be hosted, nurtured, and carried forward for the benefit of stakeholders far outside Talis.
Today, Jordan posted the latest draft of this license (now going by the name ‘Open Data Commons‘), some rationale, and pointers to various ways in which he – and we – are seeking input and further validation.
As my colleague Rob (again!) has argued, curators of data need an option on the permissions continuum between free-for-all and locked down. The Open Data Commons, née Talis Community Licence, offers that option.
Take a look. Think about how you would use it. Consider what sort of administrative framework you would want behind such a license. Join the conversation.

PMR: First of all, many thanks for funding legal work on Open Data. Whatever else, we have to remain within the legal framework or we court disaster at a later stage.
There will not be a single approach to this, any more than there is a single Open Source licence. Motivations vary and, even more importantly, data is more varied than software. I know of two other efforts: Science Commons (in Cambridge US), springing from CC, and the Open Knowledge Foundation, set up by the tireless Rufus Pollock (in Cambridge UK), who invited me to be on the board. We honour this by using the OKFN “Open Data” on our own CrystalEye. I expect that people will choose different licences to emphasize different policies. (For example, I currently use Artistic as my software licence as I don’t want the name JUMBO to be misused for derivative works which are not compliant. I might well use BSD elsewhere, and so on.)
As Paul says, please converse.

Posted in open issues, www2007 | 1 Comment

CDK's Diazonamide and general thoughts on Openness

Noel O’Blog has suggested that I should use Rajarshi Guha’s CDK service to lay out the Diazonamide structure (see my post Finding chemical structures – InChIs et al., an amusement)

  1. baoilleach Says:
    September 24th, 2007 at 7:59 am
    For the record, you can compare with CDK’s SMILES to 2D at:
    http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#sdg

PMR: so here it is:
cdk.png
PMR: I think it’s correct. Interpretable. I’d put it on the same level as the Daylight one. One message is that it is difficult for software to lay out structures with a 10-ring nucleus.
The point is that CDK is Open Source and can therefore be enhanced by the community. Daylight’s software, and whatever Pubchem uses (?Cactus?, ?Openeye?), are not. CDK is joint leader, and we can improve it.
A complementary approach is to start making collections of human-drawn images. The intelligible Chemspider image was hand-drawn by the PNAS authors – I don’t know how it got to Chemspider. (Personally I think it’s pretty awful – I do not like stereo bonds which are rectangular rather than wedges. Why do people use them? You only have to scale the image to corrupt this info.) So we need an Open collection of chemical structures.
This is not technically difficult but is lathered with copyright madness. Can I reproduce a chemical structure from Nature without permission? I’ve asked but they haven’t got back to me. Can I reproduce a chemical structure diagram from Wiley? I’ve asked but… … they haven’t got back to me.
It has to be fully Open. Every structure diagram has to be copyright-free and accompanied by metadata that gives provenance and alternative descriptions (names, InChIs, etc.). Is there anywhere that has chemical images that I can download that fulfils all these permissions?
I’ve found one (sorry for the layout). Here’s taxol:

Paclitaxel
β-(benzoylamino)-α-hydroxy-,6,12b-bis
(acetyloxy)-12-(benzoyloxy)-2a,3,4,4a,
5,6,9,10,11,12,12a,12b-dodecahydro-4,11-
dihydroxy-4a,8,13,13-tetramethyl-5-oxo-
7,11-methano-1H-cyclodeca(3,4)benz(1,2-b)
oxet-9-ylester,(2aR-(2a-α,4-β,4a-β,6-β,9-α
(α-R*,β-S*),11-α,12-α,12a-α,2b-α))-
benzenepropanoic acid

And there’s lots of data with it that looks like this:
taxol.png
I’ll leave you to guess where this is. Clues: It’s Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside.

Posted in Uncategorized | 8 Comments