petermr's blog

The thing about Wikipedia is that it only works in practice. In theory, it can never work.

Posted on October 14, 2007 by pm286

A correspondent asked my opinion about Freebase:

This blog entry may be of interest, about Freebase a collaborative database project which may or may not be open. Are you familiar with it?

http://brianna.modernthings.org/article/20/freebase-wikipedia-and-the-right-to-fork

Also see http://www.nytimes.com/2007/03/09/technology/09data.html?ex=1331096400&en=a87d4f61e6052888&ei=5090&partner=rssuserland&emc=rss

I must admit to being rather suspicious about anything “Inc” being behind a project like this, but it certainly seems to be something to watch.

PMR: I’m familiar with some of the Freebase folks – having had drinks at WWW2007 – but I have not used it. However I believe in the concept and in a future post will detail how we are (thinking of) making crystalEye available under Freebase and other such offerings. Here, however, I catch up on a lovely phrase in the first URL…

Freebase, Wikipedia and the right to fork

Two nights ago I went to the first Freebase user meeting outside the US. (You can tell I’m setting myself up for a, “I was there when…”)
[image snipped]
It was organised by Kirrily Robert, who’s taken enough with her “new crack habit” to set up a specialised blog just for it.
So, what is Freebase? It claims to be a “database of everything”. There are several points of comparison with Wikipedia. Where Wikipedia is an “encyclopedia”, Freebase wants to be “everything”. It is far more structured than Wikipedia (which anyone who’s ever wrangled with an esoteric template might appreciate). Like Wikipedia, it’s a free content project: data derived from Wikipedia is GFDL (natch) and everything else is CC-BY. They have a very excellent and well-documented API — they’re not afraid to share. Bring on the mash-ups!
There are several more differences worth discussing. Currently, Freebase is alpha and invitation-only for write permission (ie an account). No worries, give it time.
[… details of Freebase snipped …]
So, interesting to see what will happen there. It’s Wiki[p|m]edia that convinced me (and taught me) about the absolutely vital right to fork. That is an incredible freedom which is vastly underappreciated by the journalists who are generally impressed with Wikipedia’s “freeness” (meaning no ads, or free access). And as a project leader, any kind of project, that is what keeps you on your toes. Maybe it is a good benchmark for deciding if you want to be a contributor to a particular project. If management gets too heavy, you can keep them in line by threatening to exercise your right to fork. Yeah!
Back to Freebase… another related, interesting aspect will be watching the development of their community and how it will be managed. Where Wikipedia was pretty grass-roots, it seems like Freebase is top-heavy, for the moment at least. Letting go, giving up control and trusting the unwashed masses is a very difficult psychological moment for anyone (who’s not a Wikimedian). Trying to get those same unwashed masses to behave themselves is a whole other kettle of fish. When I first contemplated this for Freebase two nights ago I was filled with cynicism, until I remembered… The thing about Wikipedia is that it only works in practice. In theory, it can never work.
I should make that my mantra. Every time I get cynical about something, think about that idea again. It only works in practice.

PMR: This is a beautiful description of the spirit (and it is spirit) of “Web 2.0”. (I am not religious about this phrase but it is a useful placeholder.) I think the word mantra is apt. Only by completely releasing everything do you gain the world. If you put yourself at the centre you are not Web 2.0 and you will not prosper.
I learnt this about 5 years ago with JUMBO – my CML browser – and Jmol. JUMBO was Open Source, but I wanted it to do everything – 2D editing, 3D building, 3D display, 2D display, file handling, spectra, compchem, etc. And it DID all of these. It had the functionality of Bioclipse 1.0. But it used Swing (sick) and it sucked. Really badly.
So I realised that by giving up “my” possession of the 3D graphics space to Jmol, I would benefit far more than if I tried to keep everything to myself. I threw out the 3D display and bolted in Jmol – at that stage I think the Dr. Who of Jmol was Egon – it was pre-Miguel.
And so, when we Open Source chemistry hackers met in San Diego 2? 3? years ago it was natural to form the Blue Obelisk. The name chose itself – there were 2, and Geoff Hutchison spent a long time waiting at the Lesser Blue Obelisk. It’s in the spirit of TimBL’s mantra at WWW2007 – “Just Do It”. Since WP reminds us this is Nike’s slogan, perhaps JFDI is better.
So the skill – which is easy to state, but doesn’t guarantee any outcome – includes: Do the bits you are best at. Do not care about ownership. Free them completely. You must care about freedom. Find others who add the complementary bits. Form a community.
So, for example we are exploring Freebase for chemistry and starting with crystalEye. We have 130,000 molecules with superb data and metadata. Freebase, when we learn how to use it, allows us to query on any of the metadata. Indeed it should be possible to mash this with huge numbers of other resources…
Will the chemists be interested? A few enlightened ones. But that is all it takes to start viral forks.
P.
The only non-freedom is the requirement to protect freedom.

Posted in data, open issues | 4 Comments

OPSIN: why we need it and why we shouldn't

Posted on October 14, 2007 by pm286

NOTE: even if you are not a chemist, this should be worth reading…
Rich Apodaca evangelises the value of JRuby as a simple way of glueing together the Java offerings in our Open toolkit. Here he shows how easy it is to use OPSIN for converting chemical names to structures. It’s worth reading in full if you are a chemical hacker (but I’ll omit the code here…). After that I make some comments on OPSIN. Note that OPSIN is commonly distributed with the OSCAR3 package and the names are sometimes confused; OPSIN is a standalone tool for name2structure.

JRuby for Cheminformatics: Parsing IUPAC Nomenclature with OPSIN

Recent articles have discussed the use of JRuby for cheminformatics. We’ve seen how to parse SMILES strings, and read or write InChIs. In this article, we’ll see how easy it is to parse IUPAC nomenclature from JRuby using Peter Corbett’s OPSIN library.
Installation

After installing JRuby, simply download the OPSIN jarfile and copy it to your JRuby lib directory. You’re done.

A Simple Library

We can write a simple library to convert an IUPAC name into a CML document:
[details snipped; see JRuby for Cheminformatics: Parsing IUPAC Nomenclature with OPSIN]
This simple Ruby library has parsed the name ‘4-iodobenzoic acid’ and has returned a string containing the CML representation for the molecule. If we had wanted the read_name method to return a traversable XML object model, we could have enabled that as well.

Conclusions

One of the objections raised whenever the issue of “new” programming languages comes up, regardless of their merit, is the age-old refrain “Yeah, but where’s the software?” With JRuby, we bypass this question altogether. We can leverage the full scope of the massive Java development effort over the last ten years, which includes several excellent cheminformatics libraries. With virtually no effort, we have a working cheminformatics platform based on a widely-used, versatile and dynamic object-oriented scripting language. Future articles will discuss extensions to this platform and some applications.

PMR: Absolutely true. I don’t do a lot of glueware so I don’t use these languages, but my colleagues do.
OPSIN… I am sure Peter Corbett will add more, but the roots of OPSIN are about 5 years ago when we started the OSCAR work. OSCAR started as the Experimental Data Checker sponsored by the Royal Society of Chemistry, and after the first year it became clear that much of the data checking could not happen unless we knew the chemical structure. You can’t check that a compound has the right NMR spectrum if you don’t know how the atoms are connected.
Current chemical publishing and publishers destroy huge amounts of useful chemical information.
(Reread this sentence if you don’t believe it).
If you doubt it, visit the RSC’s Project Prospect (which is based on our collaboration over OSCAR). Find an example paper. (Most of the papers are closed, but there are a few open examples (e.g.) so you can see how it works). By default you will see a normal HTML page. However if you switch on the prospect enhancements, and select “Show Compounds” you should see the compounds highlighted

PMR: now I haven’t a clue what cycloproparadicicol is. I don’t even know how to pronounce it as it’s a made-up name. It’s not very interesting, in fact, as there are only 4 references in Pubmed and all to synthetic studies. It’s not interesting enough to be in Pubchem.
What Project Prospect does is to add this information in the paper. Click the link and you get a box like this which tells you what the molecule actually is:

PMR: The irony is that the author knew all this when the paper was submitted but that there is no way of conveying it to the publisher. Publishers have no mechanism for authors to submit semantic info. And apart from these first starts at the RSC none of them have shown the slightest public interest during the last 10 years since e-publications became routine.
A simple analogy. When a food manufacturer gets wheat they throw out all the brown stuff that is good for you and produce white bread. It’s worse for you, the eater, but easier for them to produce. PDF is similar to white bread. Pretty but horrible and bad for your semantics. So Prospect is adding back the goodness that the publishing process takes out.
The problem is therefore that when robots read papers they don’t know the connection tables (e.g. the “SMILES” or “InChI” representation of the molecules above). So the only way to get the structure (in most cases) is to try to interpret the name. “4-iodo-benzoic-acid” is relatively easy to work out; cycloproparadicicol is impossible without a lookup table and there isn’t one. (I expect CAS has the structure but it will cost you 6 USD for each lookup).
If authors added InChIs and SMILES to their papers, many of the problems would be solved. The world would save thousands of person-years of work. So why, over 10 years, haven’t we started doing it?
Mainly cock-up – I have only seen 1-2 publishers who have the slightest idea what is going on in Web 2.0 for example. Nature is one. But a small amount of deliberate opposition. I’ll let you see where that idea leads.
Until a brighter day dawns, with semantic papers, we’ll need OPSIN to get around the appalling lack of products that the current chemical publishers provide.

Posted in open issues | 4 Comments

My outrage against "Open Access Publisher" continues

Posted on October 14, 2007 by pm286

[Peter Suber, I’d be grateful if you could comment on what it is legal to index without publishers’ permission. And what it is reasonable to expect from someone who labels themselves an Open Access publisher.]

In my post Outrage: Repurposing Open Access material is allowed without explicit permission I blogged the account from Chemspiderman where an Open Access publisher had forbidden the indexing of their material. To remind you of the completely unacceptable position I repeat it, add CSM’s comments and then my own…

CSM: We have already extracted 10s of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly but we have been asked to remove thousands of articles indexed according to what we believe is a standard search engine policy from the ChemRefer index. During our conversation today with the publisher the conversion of chemical names to chemical structures to provide a structure searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.

PMR: This makes no sense at all. As I understand it Chemrefer indexes Open Access chemistry articles (I did a brief search and verified that there were no articles from ACS, RSC, Wiley, Elsevier and all those other publishers who help scientific communication by closing information). So I assume the target journals or papers are labelled Open Access in some way.

Open access is epitomised by the BBB declarations – here is the Budapest Open Access Initiative one:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

PMR: I am a simple scientist and this is simple – or I used to think so. It was an algorithm that allowed an author to grant rights to everyone in the world to do specific things without having to require further permission.

What in the world is happening above? The articles are either Open Access or they are not. If they are not, then they had better not be labelled Open Access. If they are, then they cannot and should not and for goodness’ sake should not WANT to restrict any repurposing.

WHO IS THE PUBLISHER. WE HAVE TO KNOW. I have a good idea, but it would be quite improper to say.

Because if the report above is true, it’s outrageous.

CSM replies:

ChemSpiderMan Says:
October 13th, 2007 at 8:31 pm
Peter, I will not announce the publisher at present because I made a commitment to not do so until we had a mutually agreeable blog posting for our users and accurately representing the conversation and agreements between us. I have an urge to co-exist in the world with publishers since they put a lot of value into the world. With the changes going on in Open Access figuring out how to co-exist is very necessary. I hope we can get the information out shortly. It is possible we have mis-stepped but more likely that there is a policy issue with spidering policy that needs addressing by the publisher.

PMR: Publishers do not make the rules. They think they do, but they do not. If they call themselves Open Access then they are expected to abide by the rules. Their licence must be OA compatible or they will be taken off the list of approved journals. There is and will be increasing pressure from the community to make sure publishers conform. Calling yourself Open Access when you are not is fraudulent (I assume this publisher is taking money from people).
Any proper Open Access publisher who – mistakenly or implicitly – fails to remove permission barriers will remove them once the error is pointed out. It is LEGAL to index articles from ANY publisher. You do not have to ask the publishers’ permission to create an index. You do not have to ask their permission to re-use facts. Facts are not copyright. Elsevier made this clear on this blog. Molecules are not copyright. Molecule names are not copyright. The time of sunset is not copyright. It is LEGAL to extract these from any article. It is totally unacceptable that any publisher, let alone an “Open Access” one should disallow indexing. And to insist that you remove the index is preposterous.
What now worries me greatly is that you (Chemspiderman) are giving in to this extortion. It gives the worst possible message – allowing a publisher to make up non-existent rules to which you have to kow-tow. Don’t do it. Don’t negotiate. There is nothing to negotiate about.
There is only one aspect where you might be in error – the actual spider. If this site wishes to forbid spidering it should have a “robots.txt” file. That indicates which files and directories can be spidered, but this is ONLY to avoid inconvenience to the server. If, for example, I wish to click manually through the whole site while watching the rugby, robots.txt cannot stop me.
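For readers who have not met robots.txt, here is a minimal sketch of how a polite spider might consult one before fetching a page. It uses Python's standard urllib.robotparser; the robots.txt content, the site journal.example.org and the agent name ExampleIndexer are all made up for illustration and are not any real publisher's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary journal site (an assumption,
# not taken from any real publisher).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs on the imaginary site.
article_url = "https://journal.example.org/articles/2007/123.html"
search_url = "https://journal.example.org/search/results.html"

# A well-behaved spider asks before each fetch.
article_allowed = parser.can_fetch("ExampleIndexer", article_url)
search_allowed = parser.can_fetch("ExampleIndexer", search_url)

print("article page allowed:", article_allowed)
print("search page allowed:", search_allowed)
```

Note that this only governs what an automated spider should request. It says nothing about what may be done with facts once a page has been read.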
I am now worried that you will muddy the waters by suggesting to publishers that they have rights that they do not have. And that when I come to index the material – and re-purpose the facts as we have done with crystalEye – the publisher might object.
I do not know who the publisher is – I have a fair idea. Unlike Chemspiderman I have no hesitation in publishing correspondence and I see no reason why the publisher should have any cloak of anonymity. But I have a strategy for dealing with them
That I SHALL keep secret.

Posted in chemistry, open issues | 5 Comments

Open NMR

Posted on October 13, 2007 by pm286

As I have already blogged (WWMM calculation of spectra) we are hoping to provide Jean-Claude Bradley and others an Open service to calculate NMR spectra from structure. This needs a lot of software components and a lot of glueware. With the release of FROG – not just Free, but Open – yet another problem is solved, but we aren’t there quite yet.
The calculation of spectra from NMRShiftDB is automatic because, AND ONLY BECAUSE, Christoph and Stefan have used CMLSpect to represent the data. CMLSpect allows:

connection table
atom labels
3D coordinates
spectra
spectral peaks
assignment of peaks to atoms

All these (except the raw spectra) are required for the calculation. Actually the connection table can be dispensed with if the hydrogen atoms are given explicitly – as they should ALWAYS be. (Implicit hydrogens have probably cost the human race thousands of wasted years through errors. There is now NO excuse for not including hydrogen atoms explicitly in files. Size of files? Rubbish. All the hydrogens in a year’s global chemistry are worth 1 day of astronomical simulation).
So with NMRShiftDB we have the simple process:

read NMRShiftDB file
add hydrogens with coordinates (JUMBO does this)
transform to Gaussian input (XSLT makes this automatic)
run job (Condor makes this automatic)
analyze results (i.e. compare calculated and observed – Nick Day’s software is making this automatic)

With the normal chemical environment this is messier

read mol file
submit to FROG to generate 3D coordinates. Hope it hasn’t changed the order of atoms
convert mol file to CML
read list of peaks in some legacy format (?Excel)
try to match peaks to atoms for assignment (probably have to rely on atom ordering)
create peakList in CMLSpect. How?
combine peakList with molecule in CML
transform to Gaussian input (as above) and then it’s plain sailing

The problems arise because:

hydrogens are a problem
mol files (and all other files than CML) do not have atom labels
there is no Open tool for assigning peaks to atoms
relying on atom ordering is a recipe for disaster and extremely difficult to debug
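To see why atom ordering is so treacherous, here is a small sketch comparing peak assignment by position in the file with assignment by atom label. The atom labels, descriptions and shift values are invented for illustration only; the "reordered" list stands in for what any 3D builder or file converter might do to the atom order.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, str]  # (label, human-readable description)

# Hypothetical three-atom fragment; the labels a1..a3 are the only stable identifiers.
original: List[Atom] = [
    ("a1", "aromatic CH"),
    ("a2", "methyl CH3"),
    ("a3", "carbonyl C"),
]

# Pretend a conversion step wrote the same atoms out in a different order.
reordered: List[Atom] = [original[2], original[0], original[1]]

# Illustrative shifts (ppm), keyed first by position, then by label.
peaks_by_index: Dict[int, float] = {0: 128.5, 1: 21.3, 2: 170.2}
peaks_by_label: Dict[str, float] = {"a1": 128.5, "a2": 21.3, "a3": 170.2}


def assign_by_index(atoms: List[Atom], peaks: Dict[int, float]) -> Dict[str, float]:
    """Assign each peak to whatever atom happens to sit at that position."""
    return {atoms[i][1]: shift for i, shift in peaks.items()}


def assign_by_label(atoms: List[Atom], peaks: Dict[str, float]) -> Dict[str, float]:
    """Assign each peak to the atom carrying the matching label."""
    description = {label: desc for label, desc in atoms}
    return {description[label]: shift for label, shift in peaks.items()}


correct = assign_by_index(original, peaks_by_index)
broken = assign_by_index(reordered, peaks_by_index)   # silently wrong
robust = assign_by_label(reordered, peaks_by_label)   # still right

print("by index after reordering:", broken)
print("by label after reordering:", robust)
```

The index-based version raises no error after reordering; it simply attaches every shift to the wrong atom, which is exactly why this is so hard to debug.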

So what is clear is that we need a tool to couple JSpecView to a molecule in CML. The output, at least, has to be in CML because there is no other way of linking atoms to peaks.
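As a rough idea of what such output might look like, the sketch below writes a molecule and a peak list into one XML document and then reads the assignments back. The element and attribute names (molecule, atomArray, atom, spectrum, peakList, peak, atomRefs, xValue) follow my recollection of CML and CMLSpect and should be checked against the schema; namespaces are omitted, and the methanol atoms and shift values are illustrative, not measured data.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical methanol-like molecule: explicit hydrogens, every atom labelled.
ATOMS = [("a1", "C"), ("a2", "O"), ("a3", "H"), ("a4", "H"), ("a5", "H"), ("a6", "H")]

# Illustrative peaks: (peak id, shift in ppm, labels of the atoms assigned to it).
PEAKS = [("p1", 3.4, ["a3", "a4", "a5"]), ("p2", 2.0, ["a6"])]


def build_cml() -> str:
    """Serialise molecule and peak list together, linking peaks to atoms by label."""
    cml = ET.Element("cml")
    molecule = ET.SubElement(cml, "molecule", id="m1")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in ATOMS:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    spectrum = ET.SubElement(cml, "spectrum", id="s1")
    peak_list = ET.SubElement(spectrum, "peakList")
    for peak_id, shift, atom_ids in PEAKS:
        ET.SubElement(
            peak_list, "peak",
            id=peak_id, xValue=str(shift), atomRefs=" ".join(atom_ids),
        )
    return ET.tostring(cml, encoding="unicode")


def peaks_to_elements(cml_text: str) -> Dict[str, List[str]]:
    """Resolve each peak's atomRefs back to the element types of those atoms."""
    root = ET.fromstring(cml_text)
    element_of = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    return {
        p.get("id"): [element_of[ref] for ref in p.get("atomRefs").split()]
        for p in root.iter("peak")
    }


document = build_cml()
assignments = peaks_to_elements(document)
print(document)
print(assignments)
```

Because each peak names its atoms by label, the link survives any reordering of the atom list.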
This should be seen as one of the great (but achievable) challenges of the Blue Obelisk movement. When we get it, it will transform the way that graduate students record their peak assignment and publish their papers and THESES!

Posted in blueobelisk, nmr, open issues, open notebook science, programming for scientists, theses | 6 Comments

Outrage: Repurposing Open Access material is allowed without explicit permission

Posted on October 13, 2007 by pm286

In One Day I’ll Have Lunch with Egon Willighagen Too… Chemspiderman wrote

“So what can we do now to help making connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them. “
AW> Over at ChemSpider we are working with Will Griffiths who developed ChemRefer . We have already extracted 10s of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly but we have been asked to remove thousands of articles indexed according to what we believe is a standard search engine policy from the ChemRefer index. During our conversation today with the publisher the conversion of chemical names to chemical structures to provide a structure searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.

Open access is epitomised by the BBB declarations – here is the Budapest Open Access Initiative one:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

WHO IS THE PUBLISHER. WE HAVE TO KNOW. I have a good idea, but it would be quite improper to say.

Because if the report above is true, it’s outrageous.

Posted in open issues | 1 Comment

FROG – not just Free, but Open (PRODRG take note)

Posted on October 13, 2007 by pm286

I am delighted to announce and praise FROG, a free service for creating 3D molecular structures from connection tables. This is yet another module in the list of Open chemistry offerings that will ultimately give the tools to the chemistry community that they need and – possibly – deserve. I comment later.

ANN: Frog donates code to OpenBabel for SMILES to 3D conversion

15:48 12/10/2007, baoilleach, cheminformatics, openbabel, Noel O’Blog

Recently, researchers at the French research institutes INSERM and CNRS developed an online service for converting SMILES strings to 3D conformers: “FRee Online druG 3D conformation generator (Frog)”. A description of this service was published in T. Bohme Leite, D. Gomes, M.A. Miteva, J. Chomilier, B.O. Villoutreix and P. Tufféry. Nucleic Acids Research, 2007, 35, W568-W572:

Frog is an on-line service aimed at generating 3D conformations for drug-like compounds starting from their 1D or 2D descriptions. Given the atomic constitution of the molecules and connectivity information, Frog can identify the different unambiguous isomers corresponding to each compound, and generate single or multiple low-to-medium energy 3D conformations, using an assembly process that does not presently consider ring flexibility. Tests show that Frog is able to generate bioactive conformations close to those observed in crystallographic complexes.

On behalf of the OpenBabel project, I am pleased to announce that Dr. Bruno Villoutreix (INSERM, University of Paris 5) and Dr. Pierre Tufféry (INSERM, University of Paris 7) have generously donated their code to OpenBabel. This code will be incorporated into OpenBabel under the GPL in the coming months, making fast and accurate SMILES-to-3D conformer generation available to the open source community for the first time.
The absence of an open source 3D conformer generation algorithm has increasingly become a problem in recent years due to the popularity of SMILES strings for the description of molecular information. Fortunately, this problem has now been solved. Thanks again to all those involved in the development and release of this code.
For further information on Frog, please contact the corresponding author of the Frog paper.

Image credit: gottcha78
PMR: Before commenting on FROG, I’ll praise gottcha78 on flickr who has made her beautiful photograph freely re-usable through creative commons. This will increasingly marginalize the mindless publishers who claim copyright on images they haven’t taken. Let’s hope it won’t be long before science is liberated from this madness.
PMR: FROG is exciting not just because of its functionality, but also because it’s OPEN SOURCE. It will be distributed under Open Babel, which uses the GPL license.
Why does this matter? Surely there have been free services like this before. What about PRODRG? (This is a service which has been going for at least 5 years, I think).
I make it clear that I have no quarrel with the authors and maintainers of PRODRG. However they ARE supported by Wellcome, who have blazed a trail for Open Access. Perhaps this is an opportunity to do the same for Open Source and Open Services.

Q: I’m in a non-academic (i.e. commercial) environment. Can I use this server for free?
A: You are free to try a few test compounds (up to 5) – then request a license, as explained here

Q: I would like to use PRODRG locally, where can I download a copy?
A: PRODRG executables are available under license, as explained here.

Q: I would like to run a database (> 50 compounds) of small molecules on your server. Is that OK?
A: Please E-mail me first.

PMR: This shows all the worst aspects of “free” but not open services. You can try a few examples only. (Wiley generously allowed me to inspect 3 spectra out of their collection of 500,000. Chemspider allows download of 100 molecules out of ca. 10,000,000.) The license is C20 – you have to fill in a form and then you get your personal copy of the software (presumably in binary – binary is a timebomb waiting to self-destruct on the next OS upgrade). Can I convert 100 compounds? By default NO.
Maybe PRODRG should contact Wellcome and agree a method for making it Open. PRODRG was, of course, supported before the Wellcome insistence on Open Access, so maybe they can do something retrospectively.
Back to FROG. FROG is Free. FROG is Open. If you don’t understand what this means, here is a simple translation.
The FROG code has been made available to the human race, without requiring payment now or in the future. ANYONE can have access to the source code. ANYONE can compile it, and ANYONE can run it. You don’t have to email the authors. You don’t have to request permission. You don’t have to tell anyone you are doing it. If you modify it you have to make your modifications available. You have to make clear what the original authors wrote and what you wrote. You have to give credit to the original authors. (There’s a bit more, but that’s the essence).
There is also a free service from FROG (“FRee Online druG 3D conformation generator (Frog)”). Try it. Here’s penicillin (actually 6-aminopenicillanic acid):

C12C(N)C(=O)N1C(C(=O)O)C(C)(C)S2

Go to the FROG site and paste it in. Non-chemists can also do this! Breaks the ice at parties! You should get something like:

Who owns this image? YOU do. (Unless you donate it to a commercial publisher).
Are there restrictions on the FROG service? I don’t think so, but the service is completely separate from the code. Could they start charging in the future? Yes. Could they restrict access? Yes. Could they limit the usage? Yes. (I don’t think they will…)
Could I take the FROG code and run an Open service? Yes! Could I then close it and start charging money? Yes! Isn’t that immoral? No.
BECAUSE!!! We – the human race – now have the code. If something happens to the FROG service, we can duplicate it.
That’s why we are developing Open services at UCC. It’s not always trivial to distribute Open services (it’s harder than Open code). But we are thinking hard how to do it. Making the code Open is the major first step.

Posted in chemistry, open issues | 1 Comment

Chemspider and "Open Chemistry Web"

Posted on October 13, 2007 by pm286

Recently the Chemspider company has announced an “Open Chemistry Web” which in my opinion misuses the word “Open”. Before I start I’ll review my relationships and attitude to Chemspider and Chemspiderman to try to clear the air.
Chemspider.com and its associates are commercial organizations which have aggregated a large number of chemical connection tables and have started by calculating their properties and extracting literature references which they make freely accessible but not Open. The freedom is for an unspecified timescale and you cannot download significant amounts of the data and you cannot re-use it without permission. Initially I was concerned about the complete lack of quality in these calculations and said so – I believe there has been some improvement in quality but I do not check and do not intend to do so. I do not follow Chemspider regularly but they appear to have added the ability for anyone to add annotations and curation. I have serious concerns about the lack of thought given to metadata and I do not expect Chemspider to be able to scale or to compete against modern approaches.
Chemspider also encourages Uploading Spectra Onto ChemSpider. These spectra by default all belong to Chemspider. They are not Open. If you can convince the world at large to donate IPR to you for free, you deserve some form of congratulations for sheer bravado. Note that even if you upload data and metadata you are not allowed to download it (there is a limit of 100 structures). We have ca. 250,000 calculations on molecules and 130,000 crystal structures which Chemspider have suggested we upload to them. I’m not yet sure why we should do this.
Chemrefer appears to allow searching of Open chemistry articles by keyword. Unexceptional, but why shouldn’t we simply use Pubchem? AFAIK it will index all these journals.
The IPR model of Chemspider seems clear. No data, metadata and author contributions are Open. That allows them, at some stage in the future to close some or all of the site and to charge for data and services and – like eMolecules and their tie-up with Wiley (Wiley and eMolecules: unacceptable; an explanation would be welcome) – I predict this will happen within 5 years (unless Chemspider fails to survive in its current form). So all the authors who are contributing metadata are, in effect, donating IP to Chemspider. I have no moral objection to this – it just seems retrograde when we have Open collections of molecules such as PubChem and our own crystalEye. But a number of my friends in the Open Chemistry area are on the Chemspider advisory board, so I must be missing something. Perhaps they can show how donating IP to a commercial closed company advances the cause of Open Chemistry.
And I applaud Chemspiderman’s efforts to clean up chemistry. Sometimes this gets muddled with the association with a commercial organisation based on possessing chemical IP, so sometimes my messages have been less than generous, and I have apologized.
I am not anti-capitalist – I do not attack companies per se. But I do attack people who use the word “Open” incorrectly and to promote themselves. I have done this when publishers come up with “Open Access” offerings which appear to be less than satisfactory ( see “open access products” at Nature obscures the debate, Why Open Access metrics are necessary) and for which the community has to pay. “Open” is now used by commercial organisations in the same way as “healthy” – please feel good about us and our activities as we use the word “Open”. We know it’s meaningless, but it makes us look good.
Well, it isn’t meaningless. A number of people are trying carefully to describe what is meant by Open access, Open Data, Open source and Open Services. And when others use it to mean something less, I take exception. If nothing else it makes our job much harder.
So:

Welcome to Open Chemistry Web
Posted by: will in Project
As the latest addition to ChemSpider’s services, ChemRefer is specialised in text-indexing and it is now focused (and soon to be integrated with the main ChemSpider search) on providing access to chemistry related information and building a structure centric community for chemists. I originally created the ChemRefer service to allow chemists to have a search engine to perform text-based searches of freely accessible chemistry articles. When I saw what ChemSpider was trying to achieve I joined their advisory group to assist their efforts. With time it was clear that a closer relationship would benefit both parties. Now, ChemRefer and ChemSpider are merged together and we have an opportunity to produce a FREE search engine which will allow users to input structural and textual queries into one search interface. Any ideas, comments on any sources you would like us to index or any features you would wish this service to have are most welcome.
This blog is a parallel blog to the ChemSpider Blog and ChemSpider News so that we can discuss the ins and outs of text indexing of the chemistry literature. At a time when there is a great deal of openly available literature and data in this arena, it is time there was an openly available service with the cheminformatics and text indexing capabilities to search this effectively. We want to play a role in making that happen. We look forward to dialoguing with you. Please add Open Chemistry Web to your Blog Reader…

PMR: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open.
Chemrefer could disappear tomorrow. Only if the data, and the source code are made Openly available under licence can they be called Open.

Posted in chemistry, open issues | 9 Comments

Open Notebook Science

Posted on October 12, 2007 by pm286

To clarify … this is a joint project between Henry and us. Until now the emphasis was here, because it was useful for Nick to run some jobs as part of his eScience PhD.
Nick has already done about 500 calculations from NMRShiftDB. When he has done a first analysis (i.e. to make sure we are actually calculating what we think) he’ll put them up on read-only static pages. We will decide whether any need re-running but in general that should be the end of the first phase. Nick is now starting to write up his PhD and will not, in general, be running any more jobs.
At the same time Henry will mount a Wiki where people can comment on individual compounds and spectra. Henry will be the centre for communal discussion of the methodology. There will also be a blog for the project (i.e. it will move off this one) possibly run by Henry. The balance between static pages, blogs and wikis will be challenging and may need active management by us all.
Assuming the methodology is OK, we’ll continue to run the next batches from NMRShiftDB until they get too large to run in (say) 2 days. Don’t know when that will be but perhaps ca. 5000 compounds == 5000 days => 2 months. They will appear on the HTML pages as they are run.
If the methodology needs changing, timescales are unpredictable.
Hope to have something to show in a few days.

Posted in open notebook science | Leave a comment

"How can we persuade ACS to change?"

Posted on October 12, 2007 by pm286

David Wild Says:
October 11th, 2007 at 1:41 pm
So does anyone have any ideas on how we can persuade ACS to change? I’m not sure the usual forms of protest would work. We can obviously resign our memberships, but that will have no sizeable impact and will just make ACS conference attendance more expensive. We could refuse to publish in ACS journals or to attend ACS conferences but that would also likely have minimal impact (unless we boycotted JCIM en masse) and could hurt our careers. More positively, we could try to raise awareness in the chemistry community of the benefits of open access in our own respective spheres, and encourage chemists to send letters to ACS leadership expressing their concern and desire for open access.

PMR: I am not a US citizen, do not work in US academia and am not a member of ACS so you may ask why I am responding. Partly because of that – I can claim a certain measure of detachment and non-involvement. To reiterate – I have been invited to many ACS meetings, know many of the staff and have been invited to talk to the ACS publishing staff (on technical matters) and at that level the ACS is what you would expect – hardworking staff, large and impersonal bureaucracy, conservative but wishing to be aware of the likely future.
But the last 2-3 years have seen concerns about the integrity of the ACS – integrity of purpose and integrity of practice. I am not taking some of the hearsay as fact, but the very existence of these widespread allegations would require a responsible scientific society to be immediately and seriously concerned. The allegations of Paul Thacker ( “This explains a lot”) and ACS insider ( A sad day for ACS) are independent and – if true – very serious.
The ACS is not a national society, it is effectively an international one. In crystallography, world governance (and a considerable amount of leadership) is provided by the International Union of Crystallography (IUCr). Chemistry is much more diffuse and while there is the International Union of Pure and Applied Chemistry (IUPAC) ~~it has much less effective governance~~ it exerts much less effective governance over the world community of chemists. (They have recently done me the honour of electing me as a fellow). However, ultimately, it is the international body to which all national chemical bodies (should) belong and could, I imagine, decide that any member society was behaving inappropriately. For example many young scientists (including some who work with me) would see the ACS national meetings as the appropriate place to announce their work rather than IUPAC.

Being a de facto international society has responsibilities, not least of which is to give leadership within the chemical community and also in the wider world. For example our planet is in danger, and chemistry is a necessary (if not sufficient) discipline in saving it. An international society must give truthful, balanced, verifiable, and independent advice to the world. If there is any hint that it is hiding information, pursuing the interests of a lobby, or failing to meet accepted standards of evidence it is unfit to lead us.
The ACS is unique in its make-up and to understand it I am going to explore 3-4 areas.
(a) Non-profit journal and serial publisher. As Alma Swan has made clear commercial publishers have a responsibility to their shareholders to make as much money as possible, subject only to legality and general human morality (e.g. non-exploitation of humans, even if legal). To do this a commercial publisher may employ a salesforce which is rewarded for its efforts by bonuses. This has been a problem when selling to a library community which has clung onto the old-style notion of a publisher as providing a service to the community for money.
Those days are gone. The problem is that it necessarily rubs off onto the society sector, which see their “rivals” getting richer and skewing the market. So there is a natural tendency to emulate successful practices. However this can immediately start going against the interest of the community (e.g. by creating journals which are unnecessary but which could raise revenue).
There is clearly an element of this in ACS journal publishing. The ACS (rightly) has a distinguished history and has a near-monopoly position in parts of academic chemistry. Promotion can depend on how many JACS papers someone has published. There can be a tendency to exploit this monopoly and I would regard it as unethical (I have no evidence it happens) if a college had to purchase journals from a given society to get certification from that society for their courses.
I do not know whether other societies suffer the same pressures from the pseudo-commercial world – it could be useful to know if it were a general phenomenon. In the current case, however, I would hope that the ACS has an ethics committee to monitor potential abuses in this area.
(b) Chemical Abstracts Service. This is a division of the ACS ~~and~~ that makes it effectively unique among learned societies. CAS is ca. 100 years old and has done a magnificent job of extracting chemical data from the literature. It employs over 1000 staff and I estimate its turnover to be ca 500-1000 million USD/year. This is similar to the GDP of a very small country. Although CAS is a division of the ACS there is a large degree of devolution and it is commonly seen as quasi-independent. I do not believe figures are published but it gives the air of being profitable and has earned a near monopoly position both in academia and chemical industry. In particular CAS is seen by the pharmaceutical industry as being the gold standard for chemical information both in numbers of compounds and quality. Being a monopoly allows CAS to set its own prices and many academic departments find the annual fees extremely difficult to raise. CAS is still a must-have, but it is increasingly not seen as a service to the chemical community but a highly priced commercial product.
There are many smaller colleges who cannot afford CAS products and I would question whether it is in the spirit of a learned society to disadvantage many chemists throughout the world.
In 2004-5, however, CAS felt threatened by the NIH’s Pubchem project (from WP):

Since the inception of National Center for Biotechnology Information‘s open access PubChem chemical compound database initiative, ACS has actively lobbied NCBI and its supervising agencies to stop the database development effort. ACS markets its own subscription- and pay-based Chemical Abstracts Service. In a May 23, 2005, press-release, the ACS stated:

The ACS believes strongly that the Federal Government should not seek to become a taxpayer supported publisher. By collecting, organizing, and disseminating small molecule information whose creation it has not funded and which duplicates CAS services, NIH has started ominously, down the path to unfettered scientific publishing…

PMR: At this stage Pubchem had ca. 11 staff, compared to the ACS’s 1000+. However this marks the time at which ACS became publicly active in lobbying against Open Data and Open Access. If my memory still works it was slightly earlier that ACS made it clear that it would not support Open Access – it ran a meeting to which I and others were invited and where both sides of the questions were presented. At that stage, however, the OA issue appeared to be the normal conservative position of any large publisher and no different – say – from Wiley or Royal Society of Chemistry. Pubchem changed all that.
The Pubchem argument was very public and could be summarised as: “ACS sees Pubchem as a threat to CAS (not publications); The way to prevent it is to lobby state and central government; to make our arguments more convincing we label NIH as socialist.”
The biologists were incensed, as were several provosts, ARL etc. However the chemical community was completely unaware and uninterested. Probably no more than 5-10 mainstream chemists made a public fuss including Henry Rzepa, Steve Heller and me. We managed to get it highlighted in Nature and several other organs and ultimately won the day (though the ACS deludedly regards themselves as having won).
I will write later about why I think CAS’s days are more limited than they suspect.
I suspect that it was the Pubchem affair that triggered the current ACS management into its current campaign. Once having railed against the NIH and lobbied government, it becomes easier to spend the society’s money on lobbyists, lawyers and the rest.
I do not know the current ACS management but whatever else they seem to have completely lost sight of what a learned society is about. They have spent money on absurd legal cases (suing Google over the word “Scholar”) and wasted money on Dezenhall. Even if these were reasonable things to do (and for a learned society they are not), the management now exudes a mixture of arrogance, loss of reality, and self-centeredness.
None of these should be seen in a learned society. We desperately need an ACS.
I hope to write more on the issue of truth and integrity and will return to what Paul Thacker alleges.

Posted in open issues | 3 Comments

Advice on Open Notebook Technology

Posted on October 11, 2007 by pm286

Cameron Neylon (How best to do the open notebook thing…a nice specific example) has responded helpfully at great length on collaborative technology for our Open NMR Calculations…

How best to do the open notebook thing…a nice specific example

Peter Murray-Rust is going to take an Open Notebook Science approach to a project on checking whether NMR spectra match up with the molecules they are asserted to represent. The question he poses is how best to organise this. The form of an open notebook seems to be a theme at the moment with both discussions between myself and Jean-Claude Bradley (see also the ONS session at SFLO and associated comments) as well as an initiative on OpenWetWare to develop their Wiki notebook platform with more features. There are many ideas around at the moment so Peter’s question is a good specific example to think about.
As I understand Peter’s project the plan is as follows;

Obtain NMR spectra from a public database and carry out a high level QM calculation to see whether this appears consistent with the molecule that the spectra is supposed to represent.

PMR: Yes. We are also tooling up so that individuals can submit spectra. Anyone can do this, but initially they should discuss it just to agree on the technology. All spectra and molecules will be Open so it won’t suit all applications.

Expose the results of this analysis in a useful form.

PMR: Yes. Rather like crystalEye where we have one page per compound.

Identify and prioritise examples where the spectrum appears to be ‘wrong’. The spectrum could be misassigned, the actual molecule could be wrong, or the calculation could be wrong.

PMR: We’ll create plots for ALL molecules and spectra. However it may not always be possible to identify what is “wrong”. Thus a bad TMS value (e.g. if the solvent is wrong) will shift all the values. So we may give a revised line (y = x –> y = x + c).
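As a rough illustration of the revised line y = x + c, here is a minimal Python sketch. It assumes (this is not stated above) that x is the calculated shift and y the observed shift for each peak; with the slope fixed at 1, the least-squares value of c is simply the mean difference. The function names and the numbers are made up for illustration and are not from the project.

```python
from statistics import mean


def fit_offset(calculated, observed):
    """Least-squares c for observed = calculated + c (slope fixed at 1).

    With the slope fixed, the best-fit constant is the mean of the
    per-peak differences (observed - calculated).
    """
    if len(calculated) != len(observed) or not calculated:
        raise ValueError("need two non-empty lists of equal length")
    return mean(o - c for c, o in zip(calculated, observed))


def residuals(calculated, observed, offset=0.0):
    """Per-peak differences left over after applying the offset."""
    return [o - (c + offset) for c, o in zip(calculated, observed)]


# Illustrative shifts in ppm (invented values, not real data).
calculated = [10.0, 25.0, 40.0, 128.0]
observed = [12.0, 27.1, 41.9, 130.0]

c = fit_offset(calculated, observed)
before = residuals(calculated, observed)
after = residuals(calculated, observed, offset=c)
```

If every peak is displaced by roughly the same amount, the residuals after correction should be much smaller than before, which is the signature of a referencing problem rather than a wrong structure.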

Obtain feedback on the ‘wrong’ cases and attempt to correct them through a process of discussion and refinement

PMR: Yes. It may not be trivial to correct them – we shan’t have a chemical editor in the Wiki, so it may be an idea to have a molecule upload. However the details often bite hard.

So there are several requirements. The raw data needs to be presented in a coherent and organised fashion. Specific examples need to be ‘pushed out’ or ‘alerted’ so that knowledgeable and interested people are made aware and can comment and (and this is separate from commenting) further detailed discussion is enabled and recorded for the record.

PMR: Yes. We’ll probably do this by RMS deviation and we could colour the table of contents or something similar. It may not be easy to make generic corrections over several thousand files. (Hang on – the files are in CML so it’s trivial).
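A minimal sketch of what ranking by RMS deviation might look like, in Python. The data layout (a dict of compound id to calculated and observed shift lists), the colour names and the 3.0 ppm threshold are all assumptions made for illustration; nothing here is the project's actual code or cut-off.

```python
import math


def rmsd(calculated, observed):
    """Root-mean-square deviation between paired shift lists."""
    if len(calculated) != len(observed) or not calculated:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((c - o) ** 2 for c, o in zip(calculated, observed))
    return math.sqrt(squared / len(calculated))


def rank_by_rmsd(compounds):
    """Return (compound_id, rmsd) pairs, worst agreement first."""
    scores = {cid: rmsd(calc, obs) for cid, (calc, obs) in compounds.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def colour_for(score, threshold=3.0):
    """Pick a table-of-contents colour; the threshold is hypothetical."""
    return "red" if score > threshold else "green"


# Invented example entries keyed by hypothetical compound ids.
compounds = {
    "mol-a": ([10.0, 20.0], [10.5, 19.5]),
    "mol-b": ([10.0, 20.0], [14.0, 24.0]),
}

ranking = rank_by_rmsd(compounds)
colours = {cid: colour_for(score) for cid, score in ranking}
```

Sorting worst-first means the entries most likely to need human attention appear at the top of any generated index.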

In addition there are the usual requirements for a notebook or a scientific record. The raw data must remain inviolate and any modifications must be recorded along with the process that generated the data. There will also presumably be a requirement to record thought processes and realisations as the process goes forward.

PMR: We’ll have to rely on the Wiki technology.

My suggestion is as follows:

The raw data is generated by a computational and repetitive process so I imagine it is highly structured. I would use a template web page, possibly sitting within a Wiki but not editable, to expose these. This would include details of what was run and how and when. This would be machine generated as part of the analysis. Obviously appropriate tagging will play an important role in allowing people to parse this data.

PMR: Yes. I am not yet sure how to insert machine-generated pages into a Wiki and we’ll value help here. The pages will certainly NOT be editable. Any refinement of the protocol or correction will generate a NEW job, not overwrite the last one.

A blog to provide two things. An informal running commentary of what is going on, what the current thought processes are, and what is being run and ‘alerts’ of specific examples which are interesting (or ‘wrong’). This is largely human generated, although the ‘alerts’ could be automated.

PMR: I think we are clearly going to have a new blog. What I’m not clear is how we post comments from the blog to the Wiki and alert the Wiki from the blog.

A wiki to enable discussion of specific examples and detailed comparisons by outside and inside observers. As Peter suggests in his draft paper, specific groups, both functional and academic, may show up as problems but predicting these in advance is challenging. A wiki provides a free form way of letting people identify and collate these. It may be appropriate to (automatically or manually) post comments from the blog into the wiki (which would also provide reliable time stamps and histories, not available in most standard blog engines).

So my answer to Peter’s question which might have been paraphrased as ‘Which engine is the best to use?’ is all of them. They all provide functionality that is important for the project as I understand it but none of them provide enough functionality on their own. An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

PMR: I am not worried about where the notebook is (though it could be difficult to “lift it up” by a single root).

This doubling up mirrors current practise both in Jean-Claude’s group where the UsefulChem wiki is the core notebook but the Blog is used for high level discussion. Similarly I am moving towards using this Blog for higher level discussion of results but the chemtools blog as more of a data repository. At Southampton we are thinking about the notion of ‘publishing’ from the Blog to a Wiki once a protocol or set of results is sufficiently established as Step 1 on the way to the paper.

PMR: Sounds reasonable

Finally a throw away suggestion. Peter, if you want to get a lot of spectra with a lot of associated molecules, without any concerns about publisher copyrights, then consider opening this up as a service for graduate students to check their NMR assignments. I bet you get inundated…

PMR: We have no lack of vision here! The SPECTRa project has created the technology, and this can be tested through J-CB’s molecules. I’m not too worried about overload – a thesis has ca 200 molecules and we can do this in less than a day. So if we had 300 theses in a year that would be 60,000 molecules. And, of course, since it’s Open it could be done elsewhere. The main problem is that the data HAVE to be open – the students will have to expose their molecules before the thesis is published.
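The load estimate above can be checked with a few lines of Python. The three input figures come from the paragraph; treating "less than a day" as an upper bound of one day per thesis, and processing theses one after another on the same resources, are assumptions made for this sketch.

```python
# Figures quoted in the text.
MOLECULES_PER_THESIS = 200
THESES_PER_YEAR = 300

# Assumption: "less than a day" per thesis is taken as an upper bound of 1 day.
DAYS_PER_THESIS_UPPER_BOUND = 1

total_molecules = MOLECULES_PER_THESIS * THESES_PER_YEAR
compute_days_upper_bound = THESES_PER_YEAR * DAYS_PER_THESIS_UPPER_BOUND

print(f"{total_molecules} molecules per year")
print(f"at most {compute_days_upper_bound} days of computing per year")
```

On these assumptions a year of theses fits within a year of computing on one installation, and since the work is Open it could be spread across other sites.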

Posted in Uncategorized | 1 Comment

