petermr's blog

A Scientist and the Web

 

Archive for the ‘chemistry’ Category

Open Notebook NMR – Henry’s improved protocol

Monday, October 22nd, 2007

When we first started this (nearly a month ago), Henry suggested a protocol for calculating the chemical shifts. Nick tooled up for this and had to overcome several technical problems on job submission, etc. (A typical example: the order of arguments in a Condor machine filter seems to matter; anyway, shuffling them fixed the problem. This took at least a week out of our elapsed lives. This is not really chemistry, but it’s part of eScience, just as analysing solvents from different suppliers is part of chemistry.) Here’s the first graph Nick got (you’ve seen it already):

nmr1.PNG

During this period Henry improved his protocol, but we continued to use the old one until the jobs finished. Then we re-ran them all with the new one (details will come later):
nmr2.PNG

You can see there is a small but significant improvement. We believe that when the data errors are filtered out the improvement will be much clearer and more obviously valuable.
We’ll let Henry tell you what he’s done when it’s relevant.

WE ACKNOWLEDGE JOE TOWNSEND’S PLOTTING SOFTWARE WHICH WHEN DISPLAYED AS SVG (LATER THIS WEEK) ALLOWS EXCITING THINGS.

Open Notebook NMR: Anticipated errors

Monday, October 22nd, 2007

Nick and I sat down this morning and thought about what possible errors might arise in the “data” or “experimental” axis and also on the “predicted” axis. Some of these may overlap with Antony’s suggestions but they are independent.

An “experimental” error is one that is independent of the prediction of the effect. There are some grey areas (in the compound), but we have come up with:

  • mis-reported solvent (the shifts are solvent dependent and the calculation tries to simulate this)
  • variable calibration of NMR instrument (e.g. giving rise to origin shifts)
  • impure compound. The sample may contain substance(s) which give rise to appreciable peaks not belonging to the title compound
  • wrong compound assigned to spectrum (i.e. error in bookkeeping or drawing error)
  • machine parameters (phasing, folding, field strength, etc.) varied incorrectly or reported incorrectly
  • transcription errors in spectrum or peaks.
  • misassignment of peaks to inappropriate atoms
  • broad peaks with considerable variance leading to misreporting of mean (unlikely with 13C)
  • errors in applying theory of NMR or its interpretation
  • noise (including random noise and mains spikes).
  • human editing of spectra including fraud

A “prediction” error is independent of the reported value for the shift. Some are theoretical, some are computer “bugs”. These include:

  • mis-calculation of offset (e.g. from isotropic tensor to observed shift)
  • garbling of the assignment of peaks to atoms (bug)
  • corruption of connection tables (especially in adding hydrogen atoms)
  • mismapping of atoms between input and output of calculation (we assume atoms come out in the order they go in – bug)
  • incorrect generation of input (bug)
  • program bugs in reading input and main calculation. For example we found a really nasty bug with GAMESS – if the line overflowed 80 characters the atom was reported but not included in the calculation.
  • incorrect transformation of output to CMLSpect
  • theoretical model has limitations (Henry will comment)
  • Oversimplified chemical model. There are several common problems:
  1. only one conformer is calculated
  2. symmetry is not well treated
  3. tautomerism is ignored
  4. isomerism (e.g. ring-chain) is ignored
  5. other chemical effects (Antony mentions micelles, etc.)
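
The GAMESS line-overflow bug mentioned above is the kind of thing that can be caught before a job is submitted. Here is a minimal sketch, assuming the input is available as plain text and that the limit is exactly 80 characters per line; the function name and the sample input are made up for illustration and are not part of our actual pipeline:

```python
MAX_COLUMNS = 80  # assumed limit, taken from the overflow described above


def overlong_lines(text, limit=MAX_COLUMNS):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Hypothetical input: the second line is 81 characters long.
sample = "short line\n" + "X" * 81 + "\nanother short line\n"
for number, length in overlong_lines(sample):
    print(f"line {number} has {length} characters (limit {MAX_COLUMNS})")
```

A check like this does not fix anything, but it turns a silent omission in the calculation into a visible warning at input-generation time.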

There are also potential bugs in the computational side:

  • inconsistent results from different machine architectures
  • errors in processing and displaying the results

So, we look forward to sharing this with Christoph tomorrow. Nick has prepared a range of display tools, including a filter for the errors within structures. Ideally the calculated value (y) should relate to the observed one (x) by:

y = x + eps

where eps is normally distributed. In practice we expect that we shall find

y = x + c + eps

where c varies between entries and reflects the errors in origins and solvents. We don’t know what the magnitude will be. We don’t see any need at present for

y = m*x + c + eps
where there is empirical scaling.
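
To make the three models concrete, here is a minimal sketch of how c and the spread of eps could be estimated for one entry, with a least-squares slope computed only as a check that m stays close to 1. The shift values are invented for illustration (they are not from our data set) and the function names are hypothetical:

```python
from statistics import mean, stdev


def fit_offset(observed, calculated):
    """Fit y = x + c + eps for one entry: c is the mean of (y - x)."""
    residuals = [y - x for x, y in zip(observed, calculated)]
    c = mean(residuals)
    spread = stdev(residuals) if len(residuals) > 1 else 0.0
    return c, spread


def fit_line(observed, calculated):
    """Ordinary least squares for y = m*x + c, to see whether m is close to 1."""
    x_bar, y_bar = mean(observed), mean(calculated)
    sxx = sum((x - x_bar) ** 2 for x in observed)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(observed, calculated))
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Hypothetical 13C shifts (ppm) for a single entry; not real data.
observed = [14.1, 22.7, 31.9, 128.5, 170.2]
calculated = [15.0, 23.9, 32.8, 129.6, 171.3]

c, spread = fit_offset(observed, calculated)
m, intercept = fit_line(observed, calculated)
print(f"c = {c:.2f} ppm, spread of eps = {spread:.2f} ppm, slope m = {m:.3f}")
```

If m comes out close to 1 across entries, the offset-only model is enough; a slope that drifts away from 1 would be the signal that empirical scaling is needed after all.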

The intra-compound comparison will highlight entries with the following features:

  • high precision, high accuracy
  • high precision, low accuracy (hopefully allowing identification of systematic error)
  • low precision, high accuracy (maybe due to noise, though this is unlikely here)
  • low precision, low accuracy (these may allow us to identify problems with various sources such as authors, machines or protocols).
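
The four categories above can be sketched as a simple classification, taking the mean error within an entry as a measure of accuracy and the standard deviation of the errors as a measure of precision. The thresholds below are arbitrary placeholders (we do not yet know what sensible values are) and the shifts are invented:

```python
from statistics import mean, stdev

# Hypothetical thresholds in ppm; real values would have to come from the data.
ACCURACY_LIMIT = 2.0   # largest acceptable |mean error|
PRECISION_LIMIT = 1.0  # largest acceptable spread of the errors


def classify(observed, calculated):
    """Label one entry by the precision and accuracy of its calculated shifts."""
    errors = [y - x for x, y in zip(observed, calculated)]
    accurate = abs(mean(errors)) <= ACCURACY_LIMIT
    precise = stdev(errors) <= PRECISION_LIMIT
    return (
        f"{'high' if precise else 'low'} precision, "
        f"{'high' if accurate else 'low'} accuracy"
    )


# Invented examples: a constant offset of about 5 ppm, then a scattered entry.
print(classify([10.0, 20.0, 30.0], [15.0, 25.1, 34.9]))
print(classify([10.0, 20.0, 30.0], [7.0, 20.0, 33.5]))
```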

Adding semantic markup with InChI

Monday, October 15th, 2007

If we could require all authors to provide machine-readable chemical structures in their chemistry articles the quality of chemistry would increase dramatically and immediately. We could create Open databases immediately, that were machine-searchable (just like crystalEye). No-one doubts that, but who is prepared to make it work?

  1. Richard [from RSC] Says:
    October 15th, 2007 at 10:17 am e[...]
    Sitting here as a publisher, we don’t have half the power you suggest – we have to satisfy authors (we want the best) as well as readers (to give them the best service), and making submission as easy as possible is an absolute requirement. Demanding InChIs from authors isn’t a realistic option yet – we can show the advantages of this information via the enhanced HTML and work towards it, but compulsion’s an attractive but ultimately futile option. The more you push, the worse data you’ll get. The publishers aren’t the problem – it’s because the possibilities of processing and reusing this information have only comparatively recently been apparent, and frankly because most people want to read the text and look at the pictures. As authors and readers are encouraged to look beyond print/PDF it’ll happen, but keeping the data within the publishing process is a community issue rather than a publisher one. We’d love it of course!

PMR: Well, who supports it? From Nascent at nature.com

Lunch with Egon Willighagen

We had lunch yesterday with Egon Willighagen who in his spare time runs the Chemical Blog space, now situated at http://cb.openmolecules.net/ (running on postgenomic code).
The chat over lunch was pretty good; it turns out that Egon’s favorite molecule might be Ascorbic acid. One of the topics that really animated Egon was how to link molecules to academic papers. By this I mean for example if you do a search in google, or in some dedicated search engine, for a molecule, how does your search engine know which papers deal with this molecule. There are a couple of problems with solving this. One is that many different fields use different terminology for molecules, especially as the molecules become large, so a plain text search for the name will not get all of the papers that you might be interested in; also papers don’t have semantic markup of molecules.
One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.
Peter has a writeup on using InChI in blogs, and if every chemical that appeared everywhere was somehow marked up with its InChI, or the article referring to it tagged with them then the findability problem would be solved by simple string searching.
OK great, well what’s the problem? For a start there is an alternative system SMILES (which is a Simplified Molecular Input Line Entry System), a markdown for molecules if you like. There is a very good description of the syntax here and the KinasePro blog has a short comment on how many people use SMILES vs InChI. The bottom line is that more people use SMILES, but it seems easier to search Google with InChI. I’m not a chemist, but it seems from my naive stand point that the SMILES syntax seems closer to the text description of chemistry that we know from school, whereas the InChI system is more rigorous, it requires one further step of abstraction. It reminds me of the difference between LaTeX for math and MathML. MathML is a hell of a lot easier to write a parser for than LaTeX, as LaTeX can be quite expressive, however no one writes raw MathML. Scientists are lazy and that extra step of abstraction might be the reason why SMILES seems to be used more frequently at the moment.
Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button. Sounds reasonable, seems easy, but there are problems with this approach. I have heard a few times people say, you are Nature, you can make authors do anything in order to get a paper published so why not get them to do x. Well, for a start, that’s an editorial decision, but even so, making more demands on scientists may not be the best decision when the process of publication is already pretty fraught and stressful. Even if we did this what would that gain? A small selection of the literature would be marked up, but the vast majority of journals in the area would need to follow suit in order to gain full coverage. Of course an argument that we should not do x because other people are not doing x is not what I am getting at here, but rather that this cannot be seen to be a final solution to the problem. Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChI as supplementary information.
Even then one is still left with the vast existing corpora of papers that are already published. Egon points out that no one uses the literature in this area from 50 years ago, as modern techniques have advanced so far that this literature is functionally of little use. The implication here is that 50 years in the future we will only need to go back as far as today’s papers. Even so there has to be a value in seeing the evolution of an idea from insertion into the literature right through to where it has led today, and Egon agreed with this.
So what can we do now to help making connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. Tools like this can begin to fill in the semantic blanks, both for papers from the past and for the current literature. Egon has now created rdf pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page. Egon’s pages check Connotea, and pull from Connotea co-tags of InChI tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon’s pages. We got quite excited about this idea yesterday and are certainly going to discuss this further. It’s a small start, but a start nonetheless.
PMR: and Richard replies:
Getting InChIs out from the chemical drawing is easily done now, but I don’t think there will be a realistic way to get them into the authoring process until the tools offer a robust way to keep the InChIs in the right place (and validated). Certainly it’s not a burden we could currently expect of the majority of authors, which is why RSC Project Prospect relies on a combination of text mining and input by skilled technical editors. It’s quite hard to do in practice, but it’s worth it when you see the results which won us the ALPSP/Charlesworth Publishing Innovation award this year. The InChIKey should help to promote acceptance and use as Tony suggests, along with common treatment of these standards across publishers.

PMR: so there seems to be a can-do in biology that is missing in chemistry. So let’s float a revolutionary idea for capturing biology in articles. Let’s start with protein sequences. Now these are complex molecules with lots of atoms, so we’ll make them simpler. We’ll call one group of atoms “A”, another “C” and so on to “Y”. We’ll just use 20 letters.

This will be very very difficult for biochemists. They haven’t had nearly as long as chemists to learn informatics and their molecules are much larger. Insulin has hundreds of atoms – ten times larger than most common molecules and it’s one of the smallest. But even so too long to fit on the page:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED

LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

so we’ve had to break it in the middle.

Unfortunately that’s much too difficult for a biochemist to include in a paper and anyway it’s meaningless. You can’t understand it.

Well it was worth a try. It could actually revolutionise biology and perhaps create something we could call bio-informatics (similar to chemo-informatics).

OPSIN/OSCAR: you + us = we; please help

Sunday, October 14th, 2007

I’m exploring how you and we may be able to work to improve OSCAR and OPSIN. Even if you aren’t interested in chemical names, you may find the general principles useful.

One of the drawbacks of full Open Source and Open Access is that you don’t always get feedback on whether what you are doing is appreciated. There is a small measure of downloads, accesses, and so on but you can only guess at the motivation and followup. When we go to meetings people (often in industry) say things like: “Oh we use OSCAR for identifying compounds and it’s very useful”, “We started using InChI and it’s saved us lots of money”, etc. But unless someone actually tells us what they are doing and what they want we don’t know and can only guess.

Sometimes we get letters of support, and some explore whether they can offer help. I may be able to expand later, but here are a number of positive things you can do:

(a). Simply tell us that you are using OPSIN and use some words that show us why. If possible we can publish this anonymously “we regularly use OSCAR and have integrated it into our Y process”. That gives motivation to us to continue in the dark hours of the night, to refactor, to document, etc.

(b). Contribute in-kind data. I’ll expand “how” later, but the sorts of things we would like are:

  • names with connection tables (especially if not in Pubchem)
  • ontologies (e.g. similar to ChEBI)
  • tutorials
  • regular expressions for specific tasks (though Peter often has better approaches)
  • insight into document structuring (e.g. the variability and commonality of organization in theses)
  • corpora (especially annotated). These are very valuable but should not be approached casually
  • acronyms
  • numbering systems, both arbitrary and algorithmic
  • journal- and thesis-specific regular expressions

(c) contribute in-kind code. You can, of course, do this anyway through Sourceforge, but it is best to coordinate. Major areas that OPSIN requires are:

  • stereochemistry (even R- and S- cannot be parsed at present)
  • bridged ring systems (e.g. bicyclo[3.2.1]octane). (I wrote an algorithm for this at one stage but it’s not in OPSIN)
  • fused rings. (This can be fairly hairy, but partial lookup can help a lot)
  • specialised vocabularies (e.g. saccharides, nucleic acids).

(d) support the community.

(e) Invite one of us to talk with your organization.

(f) financial support. This can range from a summer student (a few thousand USD) to larger structured projects, perhaps involving other partners and initiatives (JISC, FP7, NSF, etc.). One of the enormous attractions of Openness is that you immediately get all the benefits of everyone else involved. (If, in our work with RSC, Nature and IUCr we had insisted on all IP being held in silos we would never have had OSCAR – it’s that simple).

SciBorg and OPSIN
It’s valuable to understand where OSCAR fits into the larger picture. In 2004 Ann Copestake and Simone Teufel (from the Computer Laboratory) together with Andy Parker (Cambridge eScience) and myself bid for and got an EPSRC grant which we called “SciBorg”. It includes the very real and positive collaboration of Nature, Int. Union of Crystallography, and the Royal Society of Chemistry. The objectives are:

  1. To develop a natural-language oriented markup language which enables the tight integration of partial information from a wide variety of language processing tools, while being compatible with GRID and Web protocols and having a sound logical basis consistent with Semantic Web standards.

  2. To use this language as a basis for robust and extensible extraction of information from scientific texts.

  3. To model scientific argumentation and citation purpose in order to support novel modes of information access.

  4. To demonstrate the applicability of this infrastructure in a real-world eScience environment by developing technology for Information Extraction and ontology construction applied to Chemistry texts.

The project is larger and more visionary than simply extracting chemical names from text. I like to use the phrase “machine-understanding of scientific literature” – but may be pulled up on over-stressing “understanding”. But the idea is to use a variety of techniques to understand the deep structure of the language – at sentence level, paragraph, and document structure. To go beyond the linguistic form to infer motivation “why is this citation important?”, “is this paper challenging conventional views”.

To do this we need to understand chemical language, and that is where OSCAR3 comes in. It works out the role of chemical words and phrases so that more powerful tools can interpret the larger context. So “methane” is a noun (CM) while “methyl” is an adjective (CJ) and “methylated” is a reaction/verb. (You might mention “methylated spirits”, but exceptions and ambiguity are all part of the fun of parsing human-generated language). And OSCAR gives these decisions a probability (“P450 demethylates caffeine” is more likely to be a verb than “methylated spirit”). But in the grander scheme of things SciBorg does not usually need to know what the actual compounds are to work out the deep structure of the language. (Obviously as we advance there is a chance to add validity and inference – thus “C-14 demethylation of lanosterol” is meaningful whereas “C-15…” is not – but we can’t do it all yet).

The point is that Peter is only partially working on OSCAR. Along the way he has developed a lot of clever tricks and this account does not do him justice. But we now need to take OSCAR/OPSIN forward on a broader front and this post addresses some of this.
OSCAR and OPSIN have become complex. Peter Corbett inherited bleeding-edge code and data which at that time had bits from all sorts of authors (Joe Townsend, Fraser Norton, Sam Adams, Chris Waudby, Richard Marsh, James Bell, Vanessa de Sousa (who wrote “Nessie”), Justin Davies, and PeterMR). It covered document structure, import of legacy, regular expressions, data checking, name2structure, lexicons and other bits and pieces.

The original OSCAR (sometimes OSCAR-1 or even OSCAR-2) was primarily based on regular expressions and data checking. OSCAR-3 continues to identify this part but does not support the OSCAR-2 GUI, nor do the regexes fit all journals and theses – they were aimed at RSC articles. So Justin and Richard created OSCAR-DATA which consumes the data section output of OSCAR3 and then applies rules and presents it. We have a sustainability path for that. More later.

Meanwhile OSCAR3 is being used for a wider range of applications than simply journal articles, in particular patents, internal documents and theses. So we have been able to get funding from Unilever for David and Lezan, who joins us on Monday and from JISC for Alan and Diana. That gives us a critical mass in the more direct support for OSCAR3 functionality.

The first thing is refactoring. No-one likes refactoring, but it feels good when you’ve done it. Jim is designing the overall approach and he is able to do this in a way I can only marvel at. It uses design patterns, testing (of course) but combined with lightweight services (REST, Web 2.0) and JFDI pragmatism. We’ll expose more of this later but some of the themes are likely to be:

  • use all the Open strategies (SVN, Eclipse, JUnit, Maven, etc.)
  • use open source components where possible (PDFBox, Lucene, JUMBO, CDK, etc.)
  • explore how to modularise them where necessary (e.g. CDK)
  • separate components and communicate through XML or RDF (e.g. OSCAR3 -> OSCAR-DATA can be almost completely decoupled)
  • devise a framework based on standard approaches (e.g. Eclipse)
  • actively design and highlight and explain extension points. This is essential if others are to add code and resources in parallel

So we have to do this for ourselves.

If the community can give us clear indications of what they can contribute then this will help us to identify the extensions.

My outrage against “Open Access Publisher” continues

Sunday, October 14th, 2007
[Peter Suber, I'd be grateful if you could comment on what it is legal to index without publishers' permission. And what it is reasonable to expect from someone who labels themselves an Open Access publisher.]
In my post Outrage: Repurposing Open Access material is allowed without explicit permission I blogged the account from Chemspiderman where an Open Access publisher had forbidden the indexing of their material. To remind you of the completely unacceptable position I repeat it, add CSM’s comments and then my own…

CSM: We have already extracted 10s of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly but we have been asked to remove thousands of articles indexed according to what we believe is a standard search engine policy from the ChemRefer index. During our conversation today with the publisher the conversion of chemical names to chemical structures to provide a structure searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.

PMR: This makes no sense at all. As I understand it Chemrefer indexes Open Access chemistry articles (I did a brief search and verified that there were no articles from ACS, RSC, Wiley, Elsevier and all those other publishers who help scientific communication by closing information). So I assume the target journals or papers are labelled Open Access in some way.
Open access is epitomised by the BBB declarations – here is the Budapest Open Access Initiative one:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

PMR: I am a simple scientist and this is simple – or I used to think so. It was an algorithm that allowed an author to grant rights to everyone in the world to do specific things without having to require further permission.
So what in the world is happening above? The articles are either Open Access or they are not. If they are not, then they had better not be labelled Open Access. If they are, then they cannot and should not and for goodness’ sake should not WANT to restrict any repurposing.
WHO IS THE PUBLISHER. WE HAVE TO KNOW. I have a good idea, but it would be quite improper to say.
Because if the report above is true, it’s outrageous.
CSM replies:
  1. ChemSpiderMan Says:
    October 13th, 2007 at 8:31 pm
    Peter, I will not announce the publisher at present because I made a commitment to not do so until we had a mutually agreeable blog posting for our users and accurately representing the conversation and agreements between us. I have an urge to co-exist in the world with publishers since they put a lot of value into the world. With the changes going on in Open Access figuring out how to co-exist is very necessary. I hope we can get the information out shortly. It is possible we have mis-stepped but more likely that there is a policy issue with spidering policy that needs addressing by the publisher.

PMR: Publishers do not make the rules. They think they do, but they do not. If they call themselves Open Access then they are expected to abide by the rules. Their licence must be OA compatible or they will be taken off the list of approved journals. There is and will be increasing pressure from the community to make sure publishers conform. Calling yourself Open Access when you are not is fraudulent (I assume this publisher is taking money from people).

Any proper Open Access publisher who – mistakenly or implicitly – fails to remove permission barriers will remove them once the error is pointed out. It is LEGAL to index articles from ANY publisher. You do not have to ask the publishers’ permission to create an index. You do not have to ask their permission to re-use facts. Facts are not copyright. Elsevier made this clear on this blog. Molecules are not copyright. Molecule names are not copyright. The time of sunset is not copyright. It is LEGAL to extract these from any article. It is totally unacceptable that any publisher, let alone an “Open Access” one should disallow indexing. And to insist that you remove the index is preposterous.

What now worries me greatly is that you (Chemspiderman) are giving in to this extortion. It gives the worst possible message – allowing a publisher to make up non-existent rules to which you have to kow-tow. Don’t do it. Don’t negotiate. There is nothing to negotiate about.

There is only one aspect where you might be in error – the actual spider. If this site wishes to forbid spidering it should have a “robots.txt” file. That indicates which files and directories can be spidered, but this is ONLY to avoid inconvenience to the server. If, for example, I wish to click manually through the whole site while watching the rugby, robots.txt cannot stop me.
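
For anyone writing a spider, checking robots.txt takes a few lines. Here is a minimal sketch using Python's standard urllib.robotparser; the rules, the "MyIndexer" user-agent name and the journal.example.org URLs are all made up for illustration, and a real spider would fetch the site's own robots.txt rather than parse an inline string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download the site's file.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

article = "http://journal.example.org/articles/1.html"
draft = "http://journal.example.org/private/draft.html"

for url in (article, draft):
    verdict = "may be spidered" if parser.can_fetch("MyIndexer", url) else "is off limits to robots"
    print(f"{url} {verdict}")
```

Note that this only tells a well-behaved robot where it is welcome. It says nothing about what may be done with the facts once the pages have been read.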

I am now worried that you will muddy the waters by suggesting to publishers that they have rights that they do not have. And that when I come to index the material – and re-purpose the facts as we have done with crystalEye – the publisher might object.

I do not know who the publisher is – I have a fair idea. Unlike Chemspiderman I have no hesitation in publishing correspondence and I see no reason why the publisher should have any cloak of anonymity. But I have a strategy for dealing with them

That I SHALL keep secret.

FROG – not just Free, but Open (PRODRG take note)

Saturday, October 13th, 2007

I am delighted to announce and praise FROG, a free service for creating 3D molecular structures from connection tables. This is yet another module in the list of Open chemistry offerings that will ultimately give the tools to the chemistry community that they need and – possibly – deserve. I comment later.

15:48 12/10/2007, baoilleach, cheminformatics, openbabel, Noel O’Blog


Recently, researchers at the French research institutes INSERM and CNRS developed an online service for converting SMILES string to 3D conformers: “FRee Online druG 3D conformation generator (Frog)”. A description of this service was published in T. Bohme Leite, D. Gomes, M.A. Miteva, J. Chomilier, B.O. Villoutreix and P. Tufféry. Nucleic Acids Research, 2007, 35, W568-W572:

Frog is an on-line service aimed at generating 3D conformations for drug-like compounds starting from their 1D or 2D descriptions. Given the atomic constitution of the molecules and connectivity information, Frog can identify the different unambiguous isomers corresponding to each compound, and generate single or multiple low-to-medium energy 3D conformations, using an assembly process that does not presently consider ring flexibility. Tests show that Frog is able to generate bioactive conformations close to those observed in crystallographic complexes.

On behalf of the OpenBabel project, I am pleased to announce that Dr. Bruno Villoutreix (INSERM, University of Paris 5) and Dr. Pierre Tufféry (INSERM, University of Paris 7) have generously donated their code to OpenBabel. This code will be incorporated into OpenBabel under the GPL in the coming months, making fast and accurate SMILES-to-3D conformer generation available to the open source community for the first time.

The absence of an open source 3D conformer generation algorithm has increasingly become a problem in recent years due to the popularity of SMILES strings for the description of molecular information. Fortunately, this problem has now been solved. Thanks again to all those involved in the development and release of this code.

For further information on Frog, please contact the corresponding author of the Frog paper.

Image credit: gottcha78

PMR: Before commenting on FROG, I’ll praise gottcha78 on flickr who has made her beautiful photograph freely re-usable through creative commons. This will increasingly marginalize the mindless publishers who claim copyright on images they haven’t taken. Let’s hope it won’t be long before science is liberated from this madness.

PMR: FROG is exciting not just because of its functionality, but also because it’s OPEN SOURCE. It will be distributed under Open Babel, which uses the GPL license.

Why does this matter? Surely there have been free services like this before. What about PRODRG? (This is a service which has been going for at least 5 years, I think).
I make it clear that I have no quarrel with the authors and maintainers of PRODRG. However they ARE supported by Wellcome, who have blazed a trail for Open Access. Perhaps this is an opportunity to do the same for Open Source and Open Services.

  • Q: I’m in a non-academic (i.e. commercial) environment. Can I use this server for free?
    A: You are free to try a few test compounds (up to 5) – then request a license, as explained here

  • Q: I would like to use PRODRG locally, where can I download a copy?
    A: PRODRG executables are available under license, as explained here.

  • Q: I would like to run a database (> 50 compounds) of small molecules on your server. Is that OK?
    A: Please E-mail me first.

    PMR: This shows all the worst aspects of “free” but not open services. You can try a few examples only. (Wiley generously allowed me to inspect 3 spectra out of their collection of 500,000. Chemspider allows download of 100 molecules out of ca. 10,000,000.) The license is C20 – you have to fill in a form and then you get your personal copy of the software (presumably in binary – binary is a timebomb waiting to self-destruct on the next OS upgrade). Can I convert 100 compounds? By default NO.

    Maybe PRODRG should contact Wellcome and agree a method for making it Open. PRODRG was, of course, supported before the Wellcome insistence on Open Access, so maybe they can do something retrospectively.

    Back to FROG. FROG Is Free. FROG is Open. If you don’t understand what this means, here is a simple translation.

    The FROG code has been made available to the human race, without requiring payment now or in the future. ANYONE can have access to the source code. ANYONE can compile it, and ANYONE can run it. You don’t have to email the authors. You don’t have to request permission. You don’t have to tell anyone you are doing it. If you modify it you have to make your modifications available. You have to make clear what the original authors wrote and what you wrote. You have to give credit to the original authors. (There’s a bit more, but that’s the essence).

    There is also a free service from FROG (“FRee Online druG 3D conformation generator (Frog)”). Try it. Here’s penicillin (actually 6-aminopenicillanic acid):

    C12C(N)C(=O)N1C(C(=O)O)C(C)(C)S2
    Go to the FROG site and paste it in. Non-chemists can also do this! Breaks the ice at parties! You should get something like:
    frog2.png
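Before pasting a SMILES string into a service like this, it can be worth a quick sanity check. The following is only a minimal sketch and not part of FROG or Open Babel: the function names are made up for illustration, and the regular expression handles only simple, unbracketed SMILES written with upper-case organic-subset symbols (no aromatic lower-case atoms, no charges, no isotopes). It counts the heavy atoms and checks that every ring-closure digit is used an even number of times.

```python
import re
from collections import Counter

# The SMILES string quoted in the post (6-aminopenicillanic acid).
SMILES = "C12C(N)C(=O)N1C(C(=O)O)C(C)(C)S2"

def heavy_atom_counts(smiles):
    """Count heavy atoms in a simple SMILES string.

    Assumption: only unbracketed, upper-case organic-subset symbols occur.
    Two-letter symbols are matched first so that "Cl" is not read as "C".
    """
    return Counter(re.findall(r"Cl|Br|[BCNOPSFI]", smiles))

def ring_closures_balanced(smiles):
    """True if every single-digit ring-closure label is opened and closed."""
    digits = Counter(ch for ch in smiles if ch.isdigit())
    return all(n % 2 == 0 for n in digits.values())

counts = heavy_atom_counts(SMILES)
print(dict(counts))
print(ring_closures_balanced(SMILES))
```

The heavy-atom counts should agree with the molecular formula of 6-aminopenicillanic acid, C8H12N2O3S (hydrogens are implicit in this SMILES and are not counted). A real toolkit would of course do this properly; the point is only that the string is simple enough to check by machine.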

    Who owns this image? YOU do. (Unless you donate it to a commercial publisher).
    Are there restrictions on the FROG service? I don’t think so, but the service is completely separate from the code. Could they start charging in the future? Yes. Could they restrict access? Yes. Could they limit the usage? Yes. (I don’t think they will…)

    Could I take the FROG code and run an Open service? Yes! Could I then close it and start charging money? Yes! Isn’t that immoral? No.

    BECAUSE!!! We – the human race – now have the code. If something happens to the FROG service, we can duplicate it.

    That’s why we are developing Open services at UCC. It’s not always trivial to distribute Open services (it’s harder than Open code). But we are thinking hard how to do it. Making the code Open is the major first step.

    Chemspider and “Open Chemistry Web”

    Saturday, October 13th, 2007

    Recently the Chemspider company has announced an “Open Chemistry Web” which in my opinion misuses the word “Open”. Before I start I’ll review my relationships and attitude to Chemspider and Chemspiderman to try to clear the air.

    Chemspider.com and its associates are commercial organizations which have aggregated a large number of chemical connection tables and have started by calculating their properties and extracting literature references, which they make freely accessible but not Open. The freedom is for an unspecified timescale; you cannot download significant amounts of the data and you cannot re-use it without permission. Initially I was concerned about the complete lack of quality in these calculations and said so – I believe there has been some improvement in quality but I do not check and do not intend to do so. I do not follow Chemspider regularly but they appear to have added the ability for anyone to add annotations and curation. I have serious concerns about the lack of thought given to metadata and I do not expect Chemspider to be able to scale or to compete against modern approaches.
    Chemspider also encourages Uploading Spectra Onto ChemSpider. These spectra by default all belong to Chemspider. They are not Open. If you can convince the world at large to donate IPR to you for free, you deserve some form of congratulations for sheer bravado. Note that even if you upload data and metadata you are not allowed to download it (there is a limit of 100 structures). We have ca. 250,000 calculations on molecules and 130,000 crystal structures which Chemspider have suggested we upload to them. I’m not yet sure why we should do this.
    Chemrefer appears to allow searching of Open chemistry articles by keyword. Unexceptional, but why shouldn’t we simply use Pubchem? AFAIK it will index all these journals.

    The IPR model of Chemspider seems clear. No data, metadata or author contributions are Open. That allows them, at some stage in the future, to close some or all of the site and to charge for data and services, as with eMolecules and their tie-up with Wiley (Wiley and eMolecules: unacceptable; an explanation would be welcome). I predict this will happen within 5 years (unless Chemspider fails to survive in its current form). So all the authors who are contributing metadata are, in effect, donating IP to Chemspider. I have no moral objection to this – it just seems retrograde when we have Open collections of molecules such as PubChem and our own crystalEye. But a number of my friends in the Open Chemistry area are on the Chemspider advisory board, so I must be missing something. Perhaps they can show how donating IP to a commercial closed company advances the cause of Open Chemistry.

    And I applaud Chemspiderman’s efforts to clean up chemistry. Sometimes this gets muddled with the association with a commercial organisation based on possessing chemical IP, so sometimes my messages have been less than generous, and I have apologized.
    I am not anti-capitalist – I do not attack companies per se. But I do attack people who use the word “Open” incorrectly and to promote themselves. I have done this when publishers come up with “Open Access” offerings which appear to be less than satisfactory (see “open access products” at Nature obscures the debate, Why Open Access metrics are necessary) and for which the community has to pay. “Open” is now used by commercial organisations in the same way as “healthy” – please feel good about us and our activities as we use the word “Open”. We know it’s meaningless, but it makes us look good.

    Well, it isn’t meaningless. A number of people are trying carefully to describe what is meant by Open access, Open Data, Open source and Open Services. And when others use it to mean something less, I take exception. If nothing else it makes our job much harder.

    So:

    Welcome to Open Chemistry Web

    Posted by: will in Project

    As the latest addition to ChemSpider’s services, ChemRefer is specialised in text-indexing and it is now focused (and soon to be integrated with the main ChemSpider search) on providing access to chemistry related information and building a structure centric community for chemists. I originally created the ChemRefer service to allow chemists to have a search engine to perform text-based searches of freely accessible chemistry articles. When I saw what ChemSpider was trying to achieve I joined their advisory group to assist their efforts. With time it was clear that a closer relationship would benefit both parties. Now, ChemRefer and ChemSpider are merged together and we have an opportunity to produce a FREE search engine which will allow users to input structural and textual queries into one search interface. Any ideas, comments on any sources you would like us to index or any features you would wish this service to have are most welcome.

    This blog is a parallel blog to the ChemSpider Blog and ChemSpider News so that we can discuss the ins and outs of text indexing of the chemistry literature. At a time when there is a great deal of openly available literature and data in this arena, it is time there was an openly available service with the cheminformatics and text indexing capabilities to search this effectively. We want to play a role in making that happen. We look forward to dialoguing with you. Please add Open Chemistry Web to your Blog Reader…

    PMR: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open.

    Chemrefer could disappear tomorrow. Only if the data and the source code are made Openly available under licence can they be called Open.

    Can chemical structures be right or wrong?

    Tuesday, October 2nd, 2007

    Chemspiderman has commented…

    1. ChemSpider Blog » Blog Archive » Dictionary Lookups and Optical Structure Recognition Versus Structure Drawing. Which is Less Error Prone? Says:
      October 2nd, 2007 at 5:48 am
      […] Luqidcarbon has put up a recent blog posting about the speed by which he/she can draw structures in ChemDraw and asked for challengers. PRM has commented in Chemical SpeedDrawing. The challenge is outlined below… […]
    2. ChemSpiderMan Says:
      October 2nd, 2007 at 6:21 am
      Peter, I think the structure of discodermolide is wrong…this is where a look-up in a reference dictionary is necessary…and I think we both support that effort. But it MUST be curated. it IS correct on Wikipedia but drawn incorrectly by liquidcarbon and everyone afterwards…

      It is why I favor the scan and convert software for this…there is the version from Marc Nicklaus’ lab but I must admit that my present bias is to use CLiDE (http://www.simbiosys.ca/clide/index.html) because it can be batched and because the results appear to be so far ahead of the Open Source code at present. We do not have time to work on the Open Source support at present as ChemSpider is very distracting and we are focused on potentially using the batch processing for extracting novel structures from Open Access articles.

      I put a detailed blog posting about this at: http://www.chemspider.com/blog/?p=180

    PMR: I have already posted on this blog that – in general – chemical structures are not right or wrong. They may be associated with other information and the chemical community as a whole decides that this association is useful or counterproductive. Please read the argument carefully.
    If, for example, I write CH5, is this structure wrong? It violates the valency rule, after all. No. It’s not wrong, it just can’t be found in a bottle in most labs. It can be found in mass specs and interstellar space. There is an arrogance in the chemical informatics community that assumes the only discipline that matters is synthetic organic chemistry. In general no chemical structure that obeys the algebra is wrong. (The algebra says things like “no fractional charges on molecules” (although there can be on crystal cells) and “if A is bonded to B then B is bonded to A”.)
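The two rules quoted above are easy to state in code. This is only a minimal sketch under stated assumptions: the function name and the dictionary representation of a connection table are invented for illustration, and the check deliberately knows nothing about valency, which is the point of the argument.

```python
def obeys_algebra(bonds, charge):
    """Check the two rules quoted in the post.

    bonds  : dict mapping an atom index to the set of atom indices it is
             bonded to (a hypothetical, minimal connection table)
    charge : total charge on the molecule
    """
    # "if A is bonded to B then B is bonded to A"
    symmetric = all(
        a in bonds.get(b, set())
        for a, neighbours in bonds.items()
        for b in neighbours
    )
    # "no fractional charges on molecules"
    integral_charge = float(charge).is_integer()
    return symmetric and integral_charge

# CH5 as written in the post: carbon (index 0) bonded to five hydrogens (1..5).
# The charge of 0 is an assumption; the post does not state one.
ch5 = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(obeys_algebra(ch5, 0))
```

CH5 passes, because five bonds to carbon break a valency convention and not the algebra. A table in which atom 0 lists atom 1 as a neighbour but atom 1 does not list atom 0, or a molecule given a charge of 0.5, would fail.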

    There are unacceptable uses in mainstream C19 organic chemistry, such as carbon with 5 “valencies”. Such structures may be deemed “wrong” by organic chemists. It was clear that when Chemspider was set up the support for inorganic compounds was almost non-existent – I pointed this out and I think the position is improved somewhat. But I don’t have time to check – I expect there are many compounds represented by discrete “connection tables” which in my view are far worse chemical sins. But I am turning my attention elsewhere.

    So “Peter, I think the structure of discodermolide is wrong”. No. I think this means “liquidcarbon has drawn a structure with which s/he has associated the name ‘discodermolide’, and Chemspiderman thinks this association is incompatible with current usage.” OK. Discodermolide is a substance of relatively minor importance compared to penicillin G and THC. It has 103 hits in Pubmed, compared with 30,000 for taxol. Maybe it will become famous one day. Until then I don’t really care that liquidcarbon may have got it “wrong”.

    What I do care about is that we develop a community process – not regulated by a closed commercial company or a closed learned society division – that allows us to converge towards a cluster of agreed names at any point in time. In some cases this is easy – I think we all agree what Pen-G is – in some cases this is a question of removing known errors – and Wikipedia is great for this. (BTW I made a correction to the structure of Acetyl-CoA in Wikipedia, and the wikichemists agree the structure is now “correct” – but this is a natural part of using WP and I do these things every other day).

    Pubchem has got it right. It simply records what name a human or organization has attached to a connection table, and gives the reference. That is all it needs to do. We then, as a community, need to evolve a Web 2.0 mechanism for annotation that allows us to find the “right” structure rapidly.
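To make the idea concrete, here is a minimal sketch of that kind of record keeping. It is not Pubchem’s data model: the record type, the function name and all the values below are hypothetical placeholders. Each record says only “this source attached this name to this structure”, and the tally shows where the sources agree and where they don’t.

```python
from collections import Counter, namedtuple

# One assertion: a source attached a name to a structure. Nothing is "right"
# or "wrong" at this level; it is just a record with a reference.
NameAssertion = namedtuple("NameAssertion", ["name", "structure", "source"])

def structures_for(name, assertions):
    """Tally the structures that sources have attached to `name`,
    most frequently asserted first."""
    tally = Counter(
        a.structure for a in assertions if a.name.lower() == name.lower()
    )
    return tally.most_common()

# Hypothetical placeholder records; a real structure would be a connection
# table or identifier and a real source would be a citation.
records = [
    NameAssertion("discodermolide", "structure-A", "source-1"),
    NameAssertion("discodermolide", "structure-A", "source-2"),
    NameAssertion("discodermolide", "structure-B", "source-3"),
]
print(structures_for("Discodermolide", records))
```

The annotation layer then has something to work on: the community can see which structure most sources attach to a name, who dissents, and where each assertion came from.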

    That’s the sort of thing we shall soon start to be doing with the peer-reviewed literature – if our grant gets funded. Social computing to create consensus on data and names. All Open. All in public view. Versioned. With metadata. And until the chemical “databases” adopt C21 metadata they are largely useless in the C21. Pubchem understands this. And ChEBI, and some Blue Obelisk efforts. No-one else seems to have got the point.

    The chemical blogosphere cares

    Monday, October 1st, 2007

    Wow! I posted a request yesterday (sic) for supporting material for our proposal to JISC for a person to support the blogosphere as a major resource for increasing the quality of published chemistry. I have had valuable contributions from 4 people already and now Egon has created a wonderful summary – just the right length. We’ll either include it as it stands or point to it from the proposal – depends on space. [Recall that Egon's Chemical Blogspace blog aggregates the whole of chemical blogspace.]

    17:01 01/10/2007, Egon Willighagen, chemical blogspace, pmrgrantproposal, chem-bla-ics
    Peter is writing up a 1FTE grant proposal for someone to work on the question how automatic agents and, more interestingly, the blogosphere are changing, no improving, the dissemination of scientific literature. He wants our input. To make his work easy, I’ll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote, related to the four themes Peter identifies.

    The blogosphere oversees all major Open discussion

    The blogosphere cares about data

    Important bad science cannot hide
    I do not feel much like pointing to bad scientific articles, but want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs respectively. (Peter, if you need more in depth statistics, just let me know…)

    Two examples where I discuss not-bad-at-all scientific literature:

    Open Notebook Science
    I regularly blog about the chemoinformatics research I do in my blog. A few examples from the last half year:

    Update: after comments I have removed one link, which I need to confirm first.

    PMR: A few comments. Yes, I didn’t include a tag – but as I have said before the blogosphere rapidly converges. I sympathize with Egon: I don’t particularly like pointing to bad articles either. However when the robots start refereeing journals – as they will in our project – they don’t have sentiments and if they find bad data they will expose it without a qualm. Of course we will have to check they “hardly ever” make mistakes (no one is perfect). And, of course, if you publish in Open Access journals there is no place to hide.

    I submit a Nature article to Nature Precedings

    Monday, October 1st, 2007

    I have been invited by the editors of Nature to submit a review/commentary article, currently on the theme of “Open Chemistry”. This is currently under the title “Horizons” though the actual format may change before publication. I wrote the article two weeks ago and it has been entered into the editorial process – I’m assuming that means it will appear in due course, the current date being ca. January 2008.

    I have taken the opportunity of testing 4-5 aspects of the publication process:

    • pre-printing it in Nature Precedings. Anyone is allowed to do this – the submissions are vetted before being released into view (primarily, I suspect, to make sure they are in scope for scientific discourse, not that they reach a given standard of excitement). I have done this (it’s not necessary for the actual Nature submission process) and it will doubtless appear at some stage. If so I will note it on this blog if you haven’t found it already.
    • Asking that the final manuscript be available under Creative-Commons. I have suggested CC-BY (this term was unknown to the Nature permissions office although it’s part of Nature Precedings, which are licensed under CC). They are going to get back to me about CC-BY and I have also suggested SPARC author-addendum. Let’s see.
    • Using images without restrictive copyright. I have therefore chosen 2 from Wikipedia (which uses GPL), one from CrystalEye whose data is Open Data and where I am the author, one from Jean-Claude Bradley’s SecondLife snapshots, and one from our screenshot of OSCAR3. There is no need to seek permission for any of these. However the Nature copyright office still feels it has to write to Wikipedia for permission. What? They are FREE, OPEN as in free beer. No permission required. How do I explain what the words on the GPL mean?
    • reducing the citation count to zero. I provide one link to the blogosphere which then links to the rest of the blogosphere and I provide 2 other links to other information. For the rest I use copious links to Wikipedia, which should increasingly replace ritualised citation of methods, algorithms, fundamental work, etc. Of course this isn’t applicable to most current scientific publications, but it’s worth considering whether the reader is disadvantaged. I doubt it.
    • posting into the institutional repository. I know that Southampton have got this to “one click”, but my previous experience with DSpace suggests I shall take a few more. I’ll time it.

    I’ll also try to see how many clicks and links I get from wherever. Nature have a voting system but I don’t know whether they release download info on Precedings.

    It will be quite fun to see how much the manuscript changes during the process. Even if I can’t re-use the final version perhaps I can mount a delta.

    NOTE: The manuscript is now public: http://precedings.nature.com/documents/1200/version/1