Archive for the ‘chemistry’ Category
Open Notebook NMR – Henry’s improved protocol
Monday, October 22nd, 2007
you can see there is a small but significant improvement. We believe that when the data errors are filtered out the improvement will be much clearer and more obviously valuable.
We’ll let Henry tell you what he’s done when it’s relevant.
WE ACKNOWLEDGE JOE TOWNSEND’S PLOTTING SOFTWARE WHICH WHEN DISPLAYED AS SVG (LATER THIS WEEK) ALLOWS EXCITING THINGS.
Open Notebook NMR: Anticipated errors
Monday, October 22nd, 2007
- mis-reported solvent (the shifts are solvent dependent and the calculation tries to simulate this)
- variable calibration of NMR instrument (e.g. giving rise to origin shifts)
- impure compound. The sample may contain substance(s) which give rise to appreciable peaks not belonging to the title compound
- wrong compound assigned to spectrum (i.e. error in bookkeeping or drawing error)
- machine parameters (phasing, folding, field strength, etc.) varied incorrectly or reported incorrectly
- transcription errors in spectrum or peaks.
- misassignment of peaks to inappropriate atoms
- broad peaks with considerable variance leading to misreporting of mean (unlikely with 13C)
- errors in applying theory of NMR or its interpretation
- noise (including random noise and mains spikes).
- human editing of spectra including fraud
- mis-calculation of offset (e.g. from isotropic tensor to observed shift)
- garbling of the assignment of peaks to atoms (bug)
- corruption of connection tables (especially in adding hydrogen atoms)
- mismapping of atoms between input and output of calculation (we assume atoms come out in the order they go in – bug)
- incorrect generation of input (bug)
- program bugs in reading input and main calculation. For example we found a really nasty bug with GAMESS – if the line overflowed 80 characters the atom was reported but not included in the calculation.
- incorrect transformation of output to CMLSpect
- theoretical model has limitations (Henry will comment)
- Oversimplified chemical model. There are several common problems:
  - only one conformer is calculated
  - symmetry is not well treated
  - tautomerism is ignored
  - isomerism (e.g. ring-chain) is ignored
  - other chemical effects (Antony mentions micelles, etc.)
- inconsistent results from different machine architectures
- errors in processing and displaying the results
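One of the bugs listed above, the GAMESS input line that overflowed 80 characters, is the kind of error a simple check before submission can catch. The following is a minimal sketch of such a check, not part of the protocol described here; the function name and the sample input are hypothetical, and only the 80-character limit comes from the text.

```python
MAX_COLUMNS = 80  # limit taken from the GAMESS bug described in the list above


def overlong_lines(input_text, limit=MAX_COLUMNS):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(input_text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    # Hypothetical input fragment: the middle line is deliberately too long.
    sample = "\n".join([" $DATA", "C 6.0 " + "0.0 " * 20, " $END"])
    for number, length in overlong_lines(sample):
        print(f"line {number}: {length} characters exceeds {MAX_COLUMNS}")
```

Running a check like this on every generated input file would flag the problem before the calculation silently drops an atom.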
- high precision, high accuracy
- high precision, low accuracy (hopefully allowing identification of systematic error)
- low precision, high accuracy (maybe due to noise, though this is unlikely here)
- low precision, low accuracy (these may allow us to identify problems with various sources such as authors, machines or protocols).
Adding semantic markup with InChI
Monday, October 15th, 2007
- Richard [from RSC] Says: October 15th, 2007 at 10:17 am [...]
- Sitting here as a publisher, we don’t have half the power you suggest – we have to satisfy authors (we want the best) as well as readers (to give them the best service), and making submission as easy as possible is an absolute requirement. Demanding InChIs from authors isn’t a realistic option yet – we can show the advantages of this information via the enhanced HTML and work towards it, but compulsion’s an attractive but ultimately futile option. The more you push, the worse data you’ll get. The publishers aren’t the problem – it’s because the possibilities of processing and reusing this information have only comparatively recently been apparent, and frankly because most people want to read the text and look at the pictures. As authors and readers are encouraged to look beyond print/PDF it’ll happen, but keeping the data within the publishing process is a community issue rather than a publisher one. We’d love it of course!
Lunch with Egon Willighagen
We had lunch yesterday with Egon Willighagen, who in his spare time runs the Chemical Blog space, now situated at http://cb.openmolecules.net/ (running on postgenomic code).
The chat over lunch was pretty good; it turns out that Egon’s favorite molecule might be ascorbic acid. One of the topics that really animated Egon was how to link molecules to academic papers. By this I mean, for example, if you do a search in Google, or in some dedicated search engine, for a molecule, how does your search engine know which papers deal with this molecule? There are a couple of problems with solving this. One is that different fields use different terminology for molecules, especially as the molecules become large, so a plain text search for the name will not get all of the papers that you might be interested in. Another is that papers don’t have semantic markup of molecules.
One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.
Peter has a writeup on using InChI in blogs, and if every chemical that appeared everywhere was somehow marked up with its InChI, or the article referring to it tagged with them, then the findability problem would be solved by simple string searching.
OK great, well what’s the problem? For a start there is an alternative system, SMILES (Simplified Molecular Input Line Entry System), a markdown for molecules if you like. There is a very good description of the syntax here and the KinasePro blog has a short comment on how many people use SMILES vs InChI. The bottom line is that more people use SMILES, but it seems easier to search Google with InChI.
I’m not a chemist, but it seems from my naive standpoint that the SMILES syntax is closer to the text description of chemistry that we know from school, whereas the InChI system is more rigorous; it requires one further step of abstraction. It reminds me of the difference between LaTeX for math and MathML. MathML is a hell of a lot easier to write a parser for than LaTeX, as LaTeX can be quite expressive; however, no one writes raw MathML. Scientists are lazy and that extra step of abstraction might be the reason why SMILES seems to be used more frequently at the moment.
Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button. Sounds reasonable, seems easy, but there are problems with this approach. I have heard people say a few times: you are Nature, you can make authors do anything in order to get a paper published, so why not get them to do x? Well, for a start, that’s an editorial decision, but even so, making more demands on scientists may not be the best decision when the process of publication is already pretty fraught and stressful. Even if we did this, what would that gain? A small selection of the literature would be marked up, but the vast majority of journals in the area would need to follow suit in order to gain full coverage. Of course an argument that we should not do x because other people are not doing x is not what I am getting at here, but rather that this cannot be seen to be a final solution to the problem. Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one.
Practically all journals allow supplementary information and I am sure all of them would accept InChI as supplementary information.
Even then one is still left with the vast existing corpora of papers that are already published. Egon points out that no one uses the literature in this area from 50 years ago, as modern techniques have advanced so far that this literature is functionally of little use. The implication here is that 50 years in the future we will only need to go back as far as today’s papers. Even so there has to be a value in seeing the evolution of an idea from insertion into the literature right through to where it has led today, and Egon agreed with this.
So what can we do now to help make connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. Tools like this can begin to fill in the semantic blanks, both for papers from the past and for the current literature. Egon has now created RDF pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page. Egon’s pages check Connotea, and pull from Connotea co-tags of InChI tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon’s pages. We got quite excited about this idea yesterday and are certainly going to discuss this further. It’s a small start, but a start nonetheless.
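The "simple string searching" idea in the post above can be sketched in a few lines. This is an illustration only: it assumes each paper's text or tag list carries whitespace-delimited InChI strings, the paper identifiers and texts are invented, and the two InChI strings (ethanol and benzene) are used purely as examples. It is not how Connotea or openmolecules.net work internally.

```python
import re
from collections import defaultdict

# Loose pattern: the "InChI=" prefix followed by any run of non-whitespace.
INCHI_PATTERN = re.compile(r"InChI=\S+")

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"


def build_index(papers):
    """Map each InChI string found in the texts to the set of paper ids."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for inchi in INCHI_PATTERN.findall(text):
            index[inchi].add(paper_id)
    return index


if __name__ == "__main__":
    # Hypothetical tagged papers; ids and wording are invented.
    papers = {
        "paper-A": f"Solvent effects in {ETHANOL} mixtures",
        "paper-B": f"Ring currents in {BENZENE} revisited",
        "paper-C": f"{BENZENE} dissolved in {ETHANOL}",
    }
    index = build_index(papers)
    print(sorted(index[BENZENE]))
```

Because the identifier is the same string whatever name a field uses for the molecule, an exact-match lookup is enough; no chemistry-aware search is needed once the markup exists.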
OPSIN/OSCAR: you + us = we; please help
Sunday, October 14th, 2007
- names with connection tables (especially if not in Pubchem)
- ontologies (e.g. similar to ChEBI)
- tutorials
- regular expressions for specific tasks (though Peter often has better approaches)
- insight into document structuring (e.g. the variability and commonality of organization in theses)
- corpora (especially annotated). These are very valuable but should not be approached casually
- acronyms
- numbering systems, both arbitrary and algorithmic
- journal- and thesis-specific regular expressions
- stereochemistry (even R- and S- cannot be parsed at present)
- bridged ring systems (e.g. bicyclo[3.2.1]octane). (I wrote an algorithm for this at one stage but it’s not in OPSIN)
- fused rings. (This can be fairly hairy, but partial lookup can help a lot)
- specialised vocabularies (e.g. saccharides, nucleic acids).
- To develop a natural-language oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.
- To use this language as a basis for robust and extensible extraction of information from scientific texts.
- To model scientific argumentation and citation purpose in order to support novel modes of information access.
- To demonstrate the applicability of this infrastructure in a real-world eScience environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.
- use all the Open strategies (SVN, Eclipse, JUnit, Maven, etc.)
- use open source components where possible (PDFBox, Lucene, JUMBO, CDK, etc.)
- explore how to modularise them where necessary (e.g. CDK)
- separate components and communicate through XML or RDF (e.g. OSCAR3 -> OSCAR-DATA can be almost completely decoupled)
- devise a framework based on standard approaches (e.g. Eclipse)
- actively design and highlight and explain extension points. This is essential if others are to add code and resources in parallel
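As an example of the kind of small, testable contribution the lists above ask for, here is a sketch that reads a bridged bicyclic name such as bicyclo[3.2.1]octane with a regular expression. It is not OPSIN code and not the algorithm mentioned above. It assumes the simplest unsubstituted form of the name, a small hypothetical lookup table of alkane stems, and the von Baeyer convention that the ring atoms are the two bridgeheads plus the atoms counted in the three bridges.

```python
import re

# Simplest von Baeyer form, e.g. "bicyclo[3.2.1]octane": three bridge lengths.
BICYCLO = re.compile(r"^bicyclo\[(\d+)\.(\d+)\.(\d+)\]([a-z]+)$")

# Hypothetical lookup of alkane stems to carbon counts (illustration only).
STEM_CARBONS = {
    "pentane": 5, "hexane": 6, "heptane": 7,
    "octane": 8, "nonane": 9, "decane": 10,
}


def parse_bicyclo(name):
    """Return (bridges, ring_atoms) or raise ValueError if the name is inconsistent."""
    match = BICYCLO.match(name.strip().lower())
    if match is None:
        raise ValueError(f"not a simple bicyclo name: {name!r}")
    bridges = tuple(int(group) for group in match.groups()[:3])
    stem = match.group(4)
    if stem not in STEM_CARBONS:
        raise ValueError(f"unknown stem: {stem!r}")
    ring_atoms = sum(bridges) + 2  # three bridges plus the two bridgehead atoms
    if STEM_CARBONS[stem] != ring_atoms:
        raise ValueError(f"{name!r}: bridges give {ring_atoms} atoms, stem says {STEM_CARBONS[stem]}")
    return bridges, ring_atoms


if __name__ == "__main__":
    print(parse_bicyclo("bicyclo[3.2.1]octane"))
```

A real parser would also have to handle locants, substituents, heteroatoms and polycyclic descriptors; the consistency check between the bridge descriptor and the stem is simply a cheap way to catch transcription errors.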
My outrage against “Open Access Publisher” continues
Sunday, October 14th, 2007
CSM: We have already extracted 10s of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly but we have been asked to remove thousands of articles indexed according to what we believe is a standard search engine policy from the ChemRefer index. During our conversation today with the publisher the conversion of chemical names to chemical structures to provide a structure searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.
By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
- ChemSpiderMan Says: October 13th, 2007 at 8:31 pm
- Peter, I will not announce the publisher at present because I made a commitment to not do so until we had a mutually agreeable blog posting for our users and accurately representing the conversation and agreements between us. I have an urge to co-exist in the world with publishers since they put a lot of value into the world. With the changes going on in Open Access figuring out how to co-exist is very necessary. I hope we can get the information out shortly. It is possible we have mis-stepped but more likely that there is a policy issue with spidering policy that needs addressing by the publisher.
FROG – not just Free, but Open (PRODRG take note)
Saturday, October 13th, 2007
Recently, researchers at the French research institutes INSERM and CNRS developed an online service for converting SMILES string to 3D conformers: “FRee Online druG 3D conformation generator (Frog)”. A description of this service was published in T. Bohme Leite, D. Gomes, M.A. Miteva, J. Chomilier, B.O. Villoutreix and P. Tufféry. Nucleic Acids Research, 2007, 35, W568-W572:
Frog is an on-line service aimed at generating 3D conformations for drug-like compounds starting from their 1D or 2D descriptions. Given the atomic constitution of the molecules and connectivity information, Frog can identify the different unambiguous isomers corresponding to each compound, and generate single or multiple low-to-medium energy 3D conformations, using an assembly process that does not presently consider ring flexibility. Tests show that Frog is able to generate bioactive conformations close to those observed in crystallographic complexes.
On behalf of the OpenBabel project, I am pleased to announce that Dr. Bruno Villoutreix (INSERM, University of Paris 5) and Dr. Pierre Tufféry (INSERM, University of Paris 7) have generously donated their code to OpenBabel. This code will be incorporated into OpenBabel under the GPL in the coming months, making fast and accurate SMILES-to-3D conformer generation available to the open source community for the first time. The absence of an open source 3D conformer generation algorithm has increasingly become a problem in recent years due to the popularity of SMILES strings for the description of molecular information. Fortunately, this problem has now been solved. Thanks again to all those involved in the development and release of this code. For further information on Frog, please contact the corresponding author of the Frog paper.
Q: I’m in a non-academic (i.e. commercial) environment. Can I use this server for free?
A: You are free to try a few test compounds (up to 5) – then request a license, as explained here
Q: I would like to use PRODRG locally, where can I download a copy?
A: PRODRG executables are available under license, as explained here.
Q: I would like to run a database (> 50 compounds) of small molecules on your server. Is that OK?
A: Please E-mail me first.
PMR: This shows all the worst aspects of “free” but not open services. You can try a few examples only (Wiley generously allowed me to inspect 3 spectra out of their collection of 500,000. Chemspider allows download of 100 molecules out of ca 10,000,000). The license is C20 – you have to fill in a form and then you get your personal copy of the software (presumably in binary – binary is a timebomb waiting to self-destruct on the next OS upgrade). Can I convert 100 compounds? By default NO. Maybe PRODRG should contact Wellcome and agree a method for making it Open. PRODRG was, of course, supported before the Wellcome insistence on Open Access, so maybe they can do something retrospectively.
Back to FROG. FROG is Free. FROG is Open. If you don’t understand what this means, here is a simple translation. The FROG code has been made available to the human race, without requiring payment now or in the future. ANYONE can have access to the source code. ANYONE can compile it, and ANYONE can run it. You don’t have to email the authors. You don’t have to request permission. You don’t have to tell anyone you are doing it. If you modify it you have to make your modifications available. You have to make clear what the original authors wrote and what you wrote. You have to give credit to the original authors. (There’s a bit more, but that’s the essence).
There is also a free service from FROG (“FRee Online druG 3D conformation generator (Frog)”). Try it. Here’s penicillin (actually 6-aminopenicillanic acid):
C12C(N)C(=O)N1C(C(=O)O)C(C)(C)S2
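Before pasting a SMILES string like the one above into a conformer service, a few cheap sanity checks are possible without any chemistry toolkit. The sketch below is not a SMILES parser and has nothing to do with how Frog or OpenBabel read SMILES. It assumes a bracket-free string with single-digit ring closures, and simply counts heavy atoms and looks for unpaired ring-closure digits and unbalanced parentheses.

```python
import re
from collections import Counter

# Bracket-free subset: two-letter halogens first, then one-letter atoms
# (upper case aliphatic, lower case aromatic).
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count heavy atoms in a bracket-free SMILES string."""
    if "[" in smiles or "]" in smiles:
        raise ValueError("bracket atoms are outside this sketch")
    return Counter(token.capitalize() for token in ATOM.findall(smiles))


def unpaired_ring_closures(smiles):
    """Return ring-closure digits that do not occur an even number of times."""
    digits = Counter(ch for ch in smiles if ch.isdigit())
    return sorted(digit for digit, count in digits.items() if count % 2)


def parentheses_balanced(smiles):
    """Check that every branch opened with '(' is closed with ')'."""
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0


if __name__ == "__main__":
    smiles = "C12C(N)C(=O)N1C(C(=O)O)C(C)(C)S2"
    print(dict(heavy_atom_counts(smiles)))
    print(unpaired_ring_closures(smiles), parentheses_balanced(smiles))
```

For the string above the heavy-atom count comes out as C8 N2 O3 S, which agrees with the molecular formula of 6-aminopenicillanic acid (C8H12N2O3S), and both ring closures are paired.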

Chemspider and “Open Chemistry Web”
Saturday, October 13th, 2007
Welcome to Open Chemistry Web
Posted by: will in Project
As the latest addition to ChemSpider’s services, ChemRefer is specialised in text-indexing and it is now focused (and soon to be integrated with the main ChemSpider search) on providing access to chemistry related information and building a structure centric community for chemists. I originally created the ChemRefer service to allow chemists to have a search engine to perform text-based searches of freely accessible chemistry articles. When I saw what ChemSpider was trying to achieve I joined their advisory group to assist their efforts. With time it was clear that a closer relationship would benefit both parties. Now, ChemRefer and ChemSpider are merged together and we have an opportunity to produce a FREE search engine which will allow users to input structural and textual queries into one search interface. Any ideas, comments on any sources you would like us to index or any features you would wish this service to have are most welcome. This blog is a parallel blog to the ChemSpider Blog and ChemSpider News so that we can discuss the ins and outs of text indexing of the chemistry literature. At a time when there is a great deal of openly available literature and data in this arena, it is time there was an openly available service with the cheminformatics and text indexing capabilities to search this effectively. We want to play a role in making that happen. We look forward to dialoguing with you. Please add Open Chemistry Web to your Blog Reader…
PMR: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open.
Chemrefer could disappear tomorrow. Only if the data, and the source code are made Openly available under licence can they be called Open.
Can chemical structures be right or wrong?
Tuesday, October 2nd, 2007
- ChemSpider Blog » Blog Archive » Dictionary Lookups and Optical Structure Recognition Versus Structure Drawing. Which is Less Error Prone? Says: October 2nd, 2007 at 5:48 am
- […] Liquidcarbon has put up a recent blog posting about the speed by which he/she can draw structures in ChemDraw and asked for challengers. PMR has commented in Chemical SpeedDrawing. The challenge is outlined below… […]
- ChemSpiderMan Says: October 2nd, 2007 at 6:21 am
- Peter, I think the structure of discodermolide is wrong…this is where a look-up in a reference dictionary is necessary…and I think we both support that effort. But it MUST be curated. It IS correct on Wikipedia but drawn incorrectly by liquidcarbon and everyone afterwards… It is why I favor the scan and convert software for this…there is the version from Marc Nicklaus’ lab but I must admit that my present bias is to use CLiDE (http://www.simbiosys.ca/clide/index.html) because it can be batched and because the results appear to be so far ahead of the Open Source code at present. We do not have time to work on the Open Source support at present as ChemSpider is very distracting and we are focused on potentially using the batch processing for extracting novel structures from Open Access articles. I put a detailed blog posting about this at: http://www.chemspider.com/blog/?p=180
The chemical blogosphere cares
Monday, October 1st, 2007
PMR: A few comments. Yes, I didn’t include a tag – but as I have said before the blogosphere rapidly converges. I sympathize with Egon that I don’t particularly like pointing to bad articles. However when the robots start refereeing journals – as they will in our project – they don’t have sentiments and if they find bad data they will expose it without a qualm. Of course we will have to check they “hardly ever” make mistakes (no one is perfect). And, of course, if you publish in Open Access journals there is no place to hide.
Peter is writing up a 1FTE grant proposal for someone to work on the question of how automatic agents and, more interestingly, the blogosphere are changing, no improving, the dissemination of scientific literature. He wants our input. To make his work easy, I’ll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote, related to the four themes Peter identifies.
The blogosphere oversees all major Open discussion
The blogosphere cares about data
- Open Text Mining Interface and Bioclipse
- USPTO considers open source software prior art
- New InChI software beta: license issues resolved and InChIKey
- SMILES to become an Open Standard
Important bad science cannot hide
I do not feel much like pointing to bad scientific articles, but want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs respectively. (Peter, if you need more in depth statistics, just let me know…)
Two examples where I discuss not-bad-at-all scientific literature:
Open Notebook Science
I regularly blog about the chemoinformatics research I do in my blog. A few examples from the last half year:
Update: after comments I have removed one link, which I need to confirm first.
- Uncertainty in NMR based 3D protein models
- re: ACS RSS feeds are messed up
- Molecules in Wikipedia without InChIs
I submit a Nature article to Nature Precedings
Monday, October 1st, 2007
- pre-printing it in Nature Precedings. Anyone is allowed to do this – the submissions are vetted before being released into view (and I suspect are primarily to make sure they are in scope for scientific discourse, not that they reach a given standard of excitement). I have done this (it’s not necessary for the actual Nature submission process) and it will doubtless appear at some stage. If so I will note it on this blog if you haven’t found it already.
- Asking that the final manuscript be available under Creative-Commons. I have suggested CC-BY (this term was unknown to the Nature permissions office although it’s part of Nature Precedings, which are licensed under CC). They are going to return to me about CC-BY and I have also suggested SPARC author-addendum. Let’s see.
- Using images without restrictive copyright. I have therefore chosen 2 from Wikipedia (which uses GPL), one from CrystalEye whose data is Open Data and where I am the author, one from Jean-Claude Bradley’s SecondLife snapshots, and one from our screenshot of OSCAR3. There is no need to seek permission for any of these. However the Nature copyright office still feels it has to write to Wikipedia for permission. What? They are FREE, OPEN as in free beer. No permission required. How do I explain what the words on the GPL mean?
- reducing the citation count to zero. I provide one link to the blogosphere which then links to the rest of the blogosphere and I provide 2 other links to other information. For the rest I use copious links to Wikipedia, which should increasingly replace ritualised citation of methods, algorithms, fundamental work, etc. Of course this isn’t applicable to most current scientific publications, but it’s worth considering whether the reader is disadvantaged. I doubt it.
- posting into the institutional repository. I know that Southampton have got this to “one click”, but my previous experience with DSpace suggests I shall take a few more. I’ll time it.


