Blog as presentation

Any of you on the chemoinformatics circuit knows Wendy Warr – she is present at all meetings, knows everybody and is tireless in communicating news to the community, formally and informally. She has also been the highly respected and tireless editor of the Journal of Chemical Information and Modeling (sic), once JCICS and before that J.Chem.Doc. Wendy writes up all informatics talks at the ACS and other meetings. She naturally asks people for their “slides” – normally in powerpoint form.
My problem has been that I don’t have powerpoint slides (I don’t believe in them and in a later post I’ll tell you why). Up till now I have created my material in a single vast hyperslide (using XHTML) where there are ca 1000 individual slides and I create a rough serial menu of what I might cover. In principle I could click on slide 1, then slide 2, etc until slide 125. By then the audience would have gone home – so I select slides that I think the audience might like and show them.
When Wendy asks for the slides I am embarrassed because I can’t easily give a nice single chunk of information. This is partly because the W3C still hasn’t managed to convince browser manufacturers to “save as hyperdocument”. Handing over a directory structure of 100 files and 100 subdirectories and saying “I showed some of these” also isn’t much use. So normally Wendy is reduced to something like “PMR spoke in an animated fashion on markuplanguages/openAccess/whatever and showed some demos”. This isn’t much use to readers.
So I’m trying something new. I will write beforehand what I am going to say. I may say bits of it, and I won’t say others. But at least it should make a story, if not a historical record. Then I might post something afterwards that says “I said something different from what I said I was going to say”.
So the next few posts are parts of what I might say at the ACS. Hopefully they are of some general interest in science and information as well.

Posted in general | Leave a comment

Mashups and CB2

There’s a regular monthly meeting in Cambridge on Thursdays in the Internet cafe called CB2. Organised with enormous energy by Rufus Pollock – see the Open Knowledge Foundation link. Mostly geeks, we have a common theme of wanting to liberate knowledge. Typical domains are legal, where access to reports in the public domain is effectively controlled by commercial suppliers, maps which are controlled (in the UK) by the Ordnance Survey and my own interest in chemistry. So we talk about bottom up approaches to liberation – what can we get going that has dynamism but is built on modern lightweight software engineering approaches?
A key approach is the mash-up – a fairly new term for me. A mashup is the linking of information from two or more accessible data resources. An excellent example is Placeopedia. This is a really simple, brilliant mashup. It links Google maps to Wikipedia articles. Have a look before you read on.
So any place in Wikipedia can be linked to its place on Google Maps. Anyone can add a new place just by browsing the map and clicking a link to the Wikipedia entry.
Tonight I learned how lightweight it is… It’s simply a link to the Google Maps API and to the Wikipedia API. All Placeopedia is is a table of locations against WP entries. Stunningly clever. And it emphasizes the power of mashups – link between two separate data sources and you get a completely new information resource.
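To make the “table of locations against WP entries” idea concrete, here is a minimal sketch in Python. It is not Placeopedia’s actual code: the `Place` record, the `mashup_links` helper and the map URL pattern are assumptions made purely for illustration, and the coordinates are approximate.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Place:
    """One row of a hypothetical mashup table: a Wikipedia title plus a location."""
    wikipedia_title: str
    latitude: float
    longitude: float


def mashup_links(place):
    """Build the two links that a single table row ties together."""
    title = place.wikipedia_title.replace(" ", "_")
    return {
        "wikipedia": "https://en.wikipedia.org/wiki/" + quote(title),
        # Assumed URL pattern for a map centred on a coordinate pair.
        "map": f"https://www.google.com/maps?q={place.latitude},{place.longitude}",
    }


# Illustrative rows only; the coordinates are approximate.
PLACES = [
    Place("Cambridge", 52.205, 0.119),
    Place("Southampton", 50.903, -1.404),
]

for place in PLACES:
    print(mashup_links(place))
```

The point of the sketch is that the mashup itself holds almost no data: each row is just a key into one source and a key into the other.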
The Placeopedia mashup came from the MySociety folks – a group of developers who describe their mission as follows: “mySociety builds websites which give people simple, tangible benefits in the civic and community aspects of their lives.” For example in the UK it’s difficult to get hold of some government data – including the formal proceedings. They effectively liberated government proceedings in TheyWorkForYou.
The UK government has actually funded some activities in mashups – see Rufus’ OKF blog.
Unfortunately we have to have access to data sources before we can create mashups. In chemistry there were virtually no data sources before PubChem. Now that PubChem has survived the troubles of last year we can start to create mashups – and we’ll be showing some quite soon – based on InChIs. As we liberate more data sources in chemistry, expect some really exciting things to happen.
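As a rough sketch of what an InChI-based mashup could look like, the Python below joins two independent tables on their shared InChI strings. The two tables are hypothetical stand-ins for real data sources, the `mashup` function is an illustrative name, and the InChIs are written in the present-day standard form.

```python
# Two independent, hypothetical data sources keyed by InChI.
# The InChI strings identify these simple molecules; the tables
# themselves are made-up samples standing in for real resources.
NAMES_BY_INCHI = {
    "InChI=1S/H2O/h1H2": "water",
    "InChI=1S/CH4/h1H4": "methane",
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "ethanol",
}

BOILING_POINT_C_BY_INCHI = {
    "InChI=1S/H2O/h1H2": 100.0,
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": 78.4,
    "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H": 80.1,
}


def mashup(names, properties):
    """Join two sources on the InChI keys they have in common."""
    shared = names.keys() & properties.keys()
    return {inchi: (names[inchi], properties[inchi]) for inchi in sorted(shared)}


for inchi, (name, boiling_point) in mashup(NAMES_BY_INCHI, BOILING_POINT_C_BY_INCHI).items():
    print(f"{name}: {boiling_point} C  [{inchi}]")
```

Only compounds present in both sources appear in the result, which is why each newly liberated data source makes the join more useful.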

Posted in "virtual communities", chemistry, open issues | 3 Comments

Open Source, Open Data and the science commons

In this post I contend that the chemical information cycle is broken – to the detriment of the chemical and general commons. I’ll explain what that means.

Robert Terry, Wellcome Trust, is widely known for his advocacy of Open Access. As many of you know, from next month if you are funded by Wellcome you MUST make your publications Openly accessible. If your publisher doesn’t allow this, that’s your problem. Robert created a diagrammatic view of why Closed Access deprives the scientific commons –
See slide 6
Essentially his argument is that funders support scientists to do research. The results of this work are then given (i.e. copyright assigned) to publishers who get peer-review donated by the scientific community and then restrict the dissemination to readers who are able and prepared to pay. The wealth flow (which includes money, informatics goods, and services) is a net drain FROM the funders TO the shareholders of the publishers.

I have paraphrased this slide (I have missed out the role of libraries as I want to develop the model for software and databases; and I have added readers and reviewers – they are of course all the same people) as:

pubcycle1.PNG

Robert then showed the benefits of Open Access. I can’t immediately point to his slide but my version hopefully does it justice:

pubcycle2.png

The diagram has changed with the green arrow showing the flow of goods back to the commons. The cycle is complete: funders support science; science is published into the commons; the commons can be seen by the funders who can demonstrate the value of their contribution; and the new goods inspire the next generation of science.

Can we apply the same sort of logic to software and data in science? Again we need a cycle or the producers end up subsidizing other parts of the chain. In bioscience this can work. Although there is a considerable problem in any science in supporting data and technology there is direct funding for databases and software. I have drawn 2 cycles – one for software, the other for data. The funders support science with a partial provision for the development of tools to support it. They require that the tools and the data are made available to the community. In this way the cycles are closed and there is a flow of goods back to the commons. Because of the central role of data in modern science, funders may also directly support databases. This is not easy, and it’s expensive but it still seems to happen. In any case the data are Open.

biocycle.png

My key contention is that these communal resources give rise to innovation in both the science and the technology. For example there is exciting research into the semantic web in life sciences because there are data on which to experiment and develop methods.

In contrast the flow in chemistry is broken. I have omitted the funders from the diagram but there are very few projects where major software or data have been mandated as Open by the funders. I’d be delighted to have examples. In practice almost all software is commercial and unresponsive to the needs of the science commons. The major market for both software and data is the pharmaceutical industry which pays billions to major information suppliers. This biases the flow so that only crumbs return to the commons. It’s actually worse than zero because if a commercial offering exists there is no motivation to build one in the Commons. So innovation is stifled.

chemcycle.png

Rich Apodaca in his Blue Obelisk post had a nice quote from the editor of J. Chem. Inf. Comp. Sci. in 1984 urging that chemoinformatics be a reproducible scientific discipline. Unfortunately this is impossible with the software and data models we now have. I’ll post later on the sad state of chemoinformatics practice and why it can’t be properly peer-reviewed.

P.

Posted in chemistry, open issues | 5 Comments

Open Data, Open Science. Closed Data…

(I have been fighting the blogging software – on several occasions it has published a blank post. So please excuse these bits of the “learning curve”. I shall now write my posts in an editor and paste them. This is a repost in case you got a garbled one earlier).
I am speaking at the ACS on Sunday on the general theme of eChemistry – the application of eScience – Grid – cyberinfrastructure to chemistry. Unfortunately that’s fairly simple – outside the Blue Obelisk community (more of that later) and a very few early adopters there is very little. By eChemistry I mean more than simply compiling in-house data and running programs – I mean semantically enriched chemistry that machines can help to process. By contrast there are huge and exciting developments in bioscience, geoscience and many others. So I’ll be asking why this is.
The single fundamental requirement in eScience is that there is shared data. Ideally this should be semantic, and that’s a challenge, but at least it should be there and shared. In chemistry there is virtually none. What there is has almost all come from bioscience (e.g. NCI and PubChem) and some of the US government agencies. However mainstream chemistry is totally uninterested in sharing chemical data and when it needs it expects to have to pay private sector providers. As a result innovation in eChemistry and chemoinformatics is stifled – more of this in later posts.
This is exemplified by a question from John Irwin on the Indiana CHMINF-L list (I doubt it has been archived yet – when it is perhaps someone could add the link). John has compiled a wonderful list of compounds (ZINC) from a wide variety of sources such as chemical suppliers and made it available to PubChem – as a result of this and similar efforts PubChem has ca 5 million compounds (information, not physical samples). He quite reasonably asks whether we can do the same for chemical reactions. Read on…

At 02:14 07/09/2006, John J. Irwin wrote:
JOHN.. Dear CHMINF-L Gurus
PETER..This is a very exciting question, John.
PETER..We have been developing the technology to do this and many of the components are now available. It won’t give you 100% recall or precision, but it could get a lot. However I suspect that if the technology is deployed we will have the lawyers after us immediately because it dares to actually read full-text papers automatically and that is not allowed, except by a few journals such as Molecules and Beilstein New Journal.
JOHN..I’d like to know whether a particular compound, or more generally, a particular ring system / scaffold, has ever been reported synthesized. I’d like to pass e.g. a SMARTS pattern and get back a package including the literature citation or patent identifier, and perhaps an XML structure containing reagents and reaction conditions. I’d like to do this millions, possibly billions of times to build a database of “been there, done that” scaffold space.
PETER..I shall be presenting the technology at ACS on Sunday – at least as much as I can get into 30 mins. It is early days, and some of the steps are not yet well developed.
JOHN..Is anything approximating this possible? In particular, can you direct me to how to script queries (e.g. SMARTS match on the product of a reaction) to CAS?
PETER..The ideal situation would be if all publishers put connection tables and reaction in full semantic form in their paper. This is technically possible – it’s just a question of will and getting a new business model. Then you would simply set a robot to read all the molecules and reactions from every published paper and aggregate the content. The search technology is widely available.
PETER..There are two tiny difficulties.
* the journals do not encode structures and reactions in a meaningful machine-interpretable form. There are some slight signs they are interested in doing so – if so we have all the technology (when I say “we” I mean the Blue Obelisk group in general)
* you aren’t allowed to spider the full text. Probably.
PETER..What we have done – and what I shall be reporting – is to spider all the published crystal structures from journals that allow this. We haven’t spidered the ACS because they stamp copyright on the factual data deposited as supplemental and no-one except me and Henry has challenged this. Personally I regard this as illegal and certainly unacceptable but while communities like CHMINF-L accept this there is not a lot that 2 individuals can do other than make a fuss. But the crystal structures deposited with the RSC and Acta Cryst are freely extractable and we now have ca 50,000. Moreover we have done exactly what you want and extracted all the fragments from them. This means that perhaps 100,000 chemical fragments are browsable without additional software (Obviously we use InChI). So IF we had the structures from synthetic papers the problem would be solved.
PETER..Note, of course, that this does not just give comprehensive coverage of the modern literature, it gives immediate comprehensive coverage of the modern literature. Our robots can report a new structure within 5 minutes of it being published.
PETER..The holy grail is semantic chemical publishing – what can we do before then? We have to use full-text. Unfortunately there is no Open software that can interpret chemical diagrams. I think it would be great to have some – it’s not trivial and you won’t get 100%. And even then the problem doesn’t finish as it can be very difficult to link the graphical schemes to the compounds – e.g. what numbers in a scheme relate to the compound identifier rather than an atom label, quantifier, etc. Graphical reaction schemes and Markush structures in current chemical publishing are often very effective chemical obfuscation tools. I hope to be able to show some small steps to de-obfuscation.
PETER..So before the grail arrives we have to get the structures out somehow, without a connection table. Peter Corbett is addressing this here through OSCAR, which translates names to structure. It runs at over 50% and
could. OSCAR will read a paper and where possible create a complete connection table. Obviously it’s only as good as the authors’ naming and when they have got that wrong – and it happens – OSCAR will get the wrong structure. But it is a step forward.
PETER..Reactions are more tricky. This is because chemists write in unnecessarily convoluted language:
“To a solution of X was added 3 g of Y”, which is equivalent to “To my dog was donated a bone by me” (instead of “I gave my dog a bone” which is the sensible way). If we wrote “I added 3 g of Y to X” current grammars could parse it but this absurd mandation of the passive makes it a lot harder and we have to write a passive chemical grammar. But when we have cracked it, then we should be able to extract reactions from full-text.
JOHN..Thanks, and sorry if this question seems naive.
PETER..It’s a perfectly sensible question and very exciting, but be prepared for disinterest and opposition from most of the community. We’ve been collaborating with Indiana on the use of a distributed OSCAR system and there are lots of areas where other people could help as long as they don’t mind working with Open Source.
JOHN..John Irwin
UCSF Pharmaceutical Chemistry
PETER..If you are at SF we must meet. Apart from my talk on Sunday I’m pretty free other than a Blue Obelisk beer evening meeting on Tuesday – I am sure you’d be very welcome.
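
As a rough illustration of the passive-grammar point in the exchange above, here is a minimal Python sketch that recognises just one passive construction and rewrites it in the active voice. It is not OSCAR and not a real chemical grammar: the pattern, the short unit list and the function names are assumptions chosen for this example only.

```python
import re

# Hypothetical pattern for a single passive construction:
#   "To a solution of X was added <amount> <unit> of Y"
PASSIVE_ADDITION = re.compile(
    r"^To (?:a |the )?(?:solution of )?(?P<target>.+?) "
    r"(?:was|were) added "
    r"(?P<amount>\d+(?:\.\d+)?) ?(?P<unit>mg|g|kg|mL|L|mmol|mol) of "
    r"(?P<reagent>.+?)\.?$"
)


def parse_passive_addition(sentence):
    """Return a structured 'add' step, or None if the sentence does not fit."""
    match = PASSIVE_ADDITION.match(sentence.strip())
    if match is None:
        return None
    return {
        "action": "add",
        "reagent": match.group("reagent"),
        "amount": float(match.group("amount")),
        "unit": match.group("unit"),
        "target": match.group("target"),
    }


def to_active(step):
    """Render a parsed step in the active voice."""
    return f"I added {step['amount']:g} {step['unit']} of {step['reagent']} to {step['target']}"


step = parse_passive_addition("To a solution of X was added 3 g of Y")
print(step)
print(to_active(step))
```

A single regular expression like this covers only one sentence shape, which is exactly why a proper passive grammar is needed for real experimental sections.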

Note: SPARC set up a list on Open Data for which I am the moderator. Technical difficulties meant I haven’t been able to do much there. The business of integrating the technical moderation into my email system was just too complicated for me. Maybe this is a good time to rekindle my involvement.

Posted in general, open issues | 5 Comments

Blog; Alma; ACS

I am still working out how the blogging software works – I lost the last post… Also formatting code, XML, etc seems to be hairy. So forgive some of the early stuff. Also Jim showed me today I had to moderate comments, so apologies to anyone who thought I had inhibited their post. Everything posted will appear except blatant spam.
Very pleasant visit from Alma Swan – guru and expert in Open Access. We actually talked about Open Data – how data in scientific publications can be marked up semantically, published, archived and reused. We are doing a lot of this at present – see reply to Jean-Claude Bradley.
I’m talking on Sunday at the American Chemical Society on “eChemistry”. eScience – the Grid – cyberinfrastructure – has a lot of interest and support in almost all disciplines – physics, bioscience, medical, geoscience, astronomy. But not chemistry. Why not? I’ll be exploring these ideas in future posts.
The blog is, I hope, an ideal mechanism for recording thoughts and getting peer-review on them in a way you can’t do in formal publications where if you haven’t done an experiment/calculation you often can’t publish. I will blog a number of ideas that I want to explore at the ACS meeting so there will be a permanent record. One of my problems is that the media I use – hypertext and Java – aren’t easily archivable so some talks don’t have records. With the blog I hope there will be some words to record.

Posted in general | Leave a comment

Tenderbutton – A chemist's blog

I was pointed today to a really impressive blog:
http://blog.tenderbutton.com
The author describes himself as:
My name is Dylan Stiles and I work in the Trost lab at sunny Stanford University. I’m engaged in the total synthesis of two natural products: spirotryprostatin B and …
… and I guess he is a postgraduate research student.  The blog is technically first class, both in its chemistry and its presentation.
I was particularly impressed by the discussion on journal overpricing.
http://blog.tenderbutton.com/?p=153
Although there are a small number of facetious replies there is a good critical mass of discussion, and I guess most of those contributing are also young researchers. One of them discusses Donald Knuth and the Journal Of Algorithms in some depth so I suspect there is a strong Stanford connection.
Dylan enthuses about Wikipedia and has contributed to the chemistry there.
P.

Posted in chemistry, open issues | 1 Comment

Open Molecular Information

Last week we had a young doctor friend staying with us and because he was interested in infection the conversation turned to MRSA. If you don’t know what this is, look it up in Wikipedia under MRSA (this hyperlink should work). Under the entry it stated:

Vancomycin and teicoplanin are glycopeptide antibiotics used to treat MRSA infections

I know a fair amount about vancomycin, not least because one of my colleagues Dudley Williams was a pioneer and there is a physical molecular model in the entry hall. But I had never heard of teicoplanin. (I am not afraid to admit ignorance – I am ignorant of almost everything). So what is it?
Before I give my adventures, I’ll give an overview of a typical current process for chemical searching. This is taken from the CHMINF-L list, a highly respected forum for chemical librarians and informaticians run by Gary Wiggins from Indiana University. A list member wanted to know the structure of coenzyme A. I’ll summarise the discussion (you can read it in full in the archives):

From:         Meghan Lafferty
Subject:      Structure of coenzyme A?

Hello,
I have a faculty member who wants to make sure that she has the
correct structure of coenzyme A. When she looks it up in PubChem, the
compounds she finds list 10 related structures with Same,
Connectivity (i.e., "The molecules in this group have the same
regular chemical connectivity, ignoring isotopes and
stereochemistry."). I don't know the significance of the ranking, but
the first 5 hits of a text search for coenzyme a have the CID of 87642.
A search for coenzyme A in SciFinder (using Locate, Substance
Identifier) brings up 1 hit (CAS RN 85-61-0). I got 7 hits in a
search in Beilstein (Substance Identification, Chemical Name); the
information on all 7 was not very extensive. The CAS RNs in the
records (of the 2 or 3 that included them) were 85-61-0 (same as the
SciFinder one) and 31416-98-5 which appears to be for L-Coenzyme A.
I'm inclined to tell the faculty member that the 85-61-0 is the
correct one, but I'm not entirely sure that it's true. Can anyone
shed any light on this?
Thanks!
Meghan
_____________________________________
Meghan Lafferty
Chemistry & Chemical Engineering Librarian
Science & Engineering Library
University of Minnesota
108 Walter Library
Minneapolis MN 55455

(Note: I have included institutions to emphasize the quality of the correspondents. Non-chemists need to know that:

  • Coenzyme A is a fundamental biochemical in almost all organisms and will form part of any biochemistry degree. It is therefore not a rare or contentious substance.
  • PubChem is the NIH’s Open collection of chemical and biological information related to their Molecular Libraries initiative. It contains information (not samples) of about 5 million compounds. The information is not peer-reviewed and PubChem gratefully accepts contributions of information from many sources including suppliers, publishers, researchers.
  • SciFinder is a tool/service created by Chemical Abstracts Service. I do not regularly use it but my colleagues do, after debate as to whether they could afford it (I do not know prices but it costs a lot). I believe it contains about 25 million compounds though many of those are biological sequences.
  • Beilstein is a commercial supplier of chemical information and has, I believe, about 6 million compounds and associated properties. Again, since I don’t use it, I can’t give figures.
  • The CAS-RN is a unique ID for each chemical substance created by Chemical Abstracts on which they claim copyright. It is very widely used as a universal identifier and many sites (but not PubChem) will list the CAS number. Whether this has been agreed with CAS in individual cases is not normally known.
  • PubChem and CAS were in dispute last year, with CAS lobbying the US congress to limit the activities of PubChem.
  • Note also that the answer is not immediately clear (this is not unusual in chemistry as there are some subtle qualifiers).
  • PubChem is free. CAS charges $6.00 to non-subscribers for the information above. Beilstein will also charge.)

Next:

From:         Dana Roth <[log in to unmask]>
 
Meghan: The Merck Index (#2491) gives a structural diagram.
Dana L. Roth
Millikan Library / Caltech 1-32

(The Merck Index was for many years a large physical reference volume giving structures and properties. I do not use it myself and assume it is now on CDROM or offered online in institutions. I assume it costs money).

Next:

From:         Meghan Lafferty <[log in to unmask]>
Dana,
Thanks. It looks like the same one as in SciFinder (same CAS RN).
Meghan

Next:

From:         "Poynter, Michael" <[log in to unmask]>
 
Hi Meghan,
FYI - Science of Synthesis refers to Acetyl Coenzyme A (and gives a
structural diagram) here:
Seela, F.; Ramzaeva, N.; Rosemeyer, H., in Science of Synthesis, 16
(2003), p.945
DOI: 10.1055/tcsos-016(2006.1)-01192
Michael Poynter,
Thieme New York

(Note: Science of Synthesis is a large series of reviews of chemical reactions published by the commercial publisher Thieme. AFAIK the information is not Open).

From:         Jacob Zabicky <[log in to unmask]>
Subject:      Re: Structure of coenzyme A?

Dear Colleagues,
After trying in WOS SCI the query "ti=coenzyme a and ti=structure"
namely, articles carrying also "coenzyme a" (not necessarily because
of the split words) and "structure", the search ended with  196 hits
over the 1965-today period. Not an unwieldily number for direct
examination. The following recent  entries (from 2000 onwards) have a
chance of carrying the information (nothing to say about "acetyl
coenzyme A" and similar compounds for reconfirmation):
Shirakawa T, Takahashi Y, Wada K, et al.
Identification of variant molecules of Bacillus thermoproteolyticus
ferredoxin: Crystal structure reveals bound coenzyme A and an
unexpected [3Fe-4S] cluster associated with a canonical [4Fe-4S]
ligand motif
BIOCHEMISTRY 44 (37): 12402-12410 SEP 20 2005
(3 other references snipped)
(Note: AFAIK none of these articles are Open - i.e. it costs money to read them 
and you may not even get the answer) 

Next:
From: “E. Connie Powell”
Hello Meghan
Search the NCBI web site and select the Books database. Enter a search for coenzyme A. From the result select the book Biochemistry by J. M. Berg 5th edition. Select the figures tab. The second figure (figure 14.16) is the structure of coenzyme a.
Good luck E. Connie Powell
Evelyn Constance Powell
Physical and Chemical Sciences Librarian
Folsom Library Rensselaer Polytechnic Institute 110 8th Street Troy, NY
(Note: I hadn’t heard of NCBI books on line – thank you Connie – and I’m impressed. This book carries a date of 2002 so it’s up to date as far as the query is concerned.)
I now try two of my own resources:

  • ChEBI. This is an Open resource run by the European Bioinformatics Institute which publishes a taxonomy of chemical substances of interest to bioscience. I search for “coenzyme A” and immediately get what I want – in machine-readable form. (I could get machine readable info from CAS and Beilstein if I paid). There is a great deal of useful information here as well.
  • Wikipedia. This resource is much-maligned as being inaccurate, created by amateurs, unsuitable for any scientist, etc. I believe that it is the future and that it will rapidly replace many reference works. (I’ll discuss in a later article my own ideas how this might happen in chemistry). So I go to Coenzyme A and find 2D and 3D structural diagrams as well as useful information about the compound.

Now… in WP I compare the structure with some of the others and I think one of the atoms has a different stereochemistry from, say, ChEBI (if true, this is serious). I don’t actually know which is right (or whether I have made a mistake). I could say “Wikipedia is probably wrong as it’s created by non-experts so I’ll ignore it” OR I can leave a note on the WP Talk page saying “I think the stereochemistry may be wrong – see my blog”. I’m optimistic that that note will be picked up by the Wikichemists and between them they will research the literature to confirm or correct the structure.
So what is the message? Firstly it’s not always trivial searching for chemical names and structures as there can be variants under the same name. There was some confusion in the discussion on the list between “coenzyme A” and “acetyl coenzyme A”. And many of the diagrams in PubChem and elsewhere don’t give stereochemistry. But assume I am an intelligent person who does not have an immediate institutional subscription to expensive chemical resources (e.g. I am travelling). The chemical community can offer me nothing useful unless I pay for it, and some of those are impossible outside an institution. The biological community gives me 3 free resources, two of which can be seen as quality controlled and the other as almost comprehensive. I am confident that, with social computing, the quality control will be added to PubChem so that the bioscientists will have created a high-quality chemical information resource.
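The “same connectivity, different stereochemistry” situation can be sketched with InChI strings, whose layers are separated by “/”. The Python below is a simplification and not how PubChem or ChEBI compare entries: it simply drops the layers whose prefixes mark stereochemistry or isotopes, and the function names are illustrative. The example uses L- and D-alanine rather than coenzyme A to keep the strings short.

```python
# Layer prefixes dropped for a connectivity-only comparison (a simplification):
# b = double-bond stereo, t/m/s = tetrahedral stereo, i = isotopes.
STEREO_AND_ISOTOPE_PREFIXES = ("b", "t", "m", "s", "i")


def connectivity_only(inchi):
    """Strip stereo and isotope layers, keeping formula, connections and hydrogens."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers
            if not layer.startswith(STEREO_AND_ISOTOPE_PREFIXES)]
    return "/".join([head, *kept])


def same_connectivity(inchi_a, inchi_b):
    """True when two InChIs agree once stereochemistry and isotopes are ignored."""
    return connectivity_only(inchi_a) == connectivity_only(inchi_b)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(L_ALANINE == D_ALANINE)                   # full identifiers differ
print(same_connectivity(L_ALANINE, D_ALANINE))  # connectivity agrees
```

Two entries that pass `same_connectivity` but fail the full comparison are exactly the cases worth flagging for a human to check against the literature.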
Back to teicoplanin
My first visit was to PubChem. If you go there and type in “teicoplanin” you get only one entry – and that is a mixture of two compounds – the one shown is not teicoplanin. So off to ChEBI… no entry there … and to Wikipedia, which has a significant entry though without the chemical structure. I search the literature and find a link to a report on the Royal Society of Chemistry’s pages. This is Openly Accessible but I assume is copyright so I cannot re-use the structural diagram – I leave a link instead. At some stage a Wikipedian will add a structure, I’m sure. Then I will be able to point my doctor friend at Wikipedia so he can find out what the chemical formula of his drug is…
P.

Posted in chemistry, open issues | 7 Comments

Is Openness "ethically flawed"?

This is the first substantive post in this blog. To help you navigate I have categorised them – this one is “Open Issues”. Other categories are “XML”, “programming for scientists” and “virtual communities”. This may help you select just the topics you want.
I have been interested in Openness for many years, and believe that knowledge and science can now only flourish in an Open environment. I believe that closed commercial interests (publishers, aggregators, software developers and industrial customers such as the pharmaceutical industry) stifle innovation in information-driven science. IMO that is why biosciences, with an Open ethic, are about 10 years ahead of the chemical sciences in their use of information.
I hope these posts will not be unbalanced rants. I have campaigned for many years for Openness so it is sometimes possible to have misty vision. My aim is not to create divisions but to show positive ways forward. I work closely with many of those organisations on whom I comment, such as the Royal Society of Chemistry and the American Chemical Society (who have invited me to talk next week at their annual meeting). I was particularly encouraged by a meeting of the STM publishers last year in Frankfurt where I presented the problem of Open Data – I was prepared for a tepid or critical reaction – in fact I had many positive comments and offers of future collaboration.
As an example we have two separate visits from the Royal Society of Chemistry staff this week aimed at developing new publishing technology. Chemists may know of their sponsorship of the Experimental Data Checker, colloquially “OSCAR” (http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/index.asp). OSCAR can read a complete chemical paper in a few seconds and analyse the data for errors. It picks up those that have been missed by the author, reviewer and technical editor and there are almost always some in every paper! FWIW OSCAR (Open Source Chemical Analysis and Retrieval) was written by 4 undergraduates and if you are interested (and have access to chemistry articles – which are almost all closed) it’s well worth trying out. And the RSC are funding 2 students this summer to develop the next generation of OSCAR based on XML.
But as I am pushing for radically new ways of doing things, my stance will sometimes be strong, as in the current post where I take issue with Peter Gregory’s comments on Open Access publishing in chemistry. There are very few Open Access journals in chemistry and PeterG was commenting on the launch of Chemistry Central from the BMC stable (reviewed in Peter Suber’s excellent OA blog:
http://www.earlham.edu/~peters/fos/2006_08_20_fosblogarchive.html
“More on OA Central and Chemistry Central”).
PeterS extracts PeterG’s comments in the same issue:
“Royal Society of Chemistry lashes out”

But the Royal Society of Chemistry’s director of publishing, Peter Gregory, disagrees. ‘We have absolutely no interest shown from our editorial board members, or our authors, for open access publishing,’ he said.

Gregory believes that the open access author-pays model is ‘ethically flawed’, because it raises the risk that substandard science could be widely circulated without being subjected to more rigorous peer review. This could be particularly problematic in chemistry, where rapid, open access publication could be used to establish priority ahead of more time-consuming patent applications from rival groups, he added.

PeterS then continues in his incisive style to show the flaws in PeterG’s argument.
My campaign is for Openness in:

  • Access. I am least vocal on this, leaving it to established champions such as PeterS, SPARC, Stevan Harnad, Steve Heller and many others. However I support the formation of Open Access journals in chemistry and would endeavour to publish there if appropriate journals exist. (Before Chemistry Central there were no Open journals that supported chemoinformatics).
  • Source. Without openness of code it is difficult for academic groups to distribute and enhance software. Some groups manage some innovation in some areas (e.g. quantum mechanics codes) but in informatics the lack of Openness is a serious problem.
  • Data. I believe that scientific data belongs to the commons, not to publishers or secondary aggregators, which is why I supported the continuation of PubChem last year in its struggle against Chemical Abstracts.
  • Standards. Science is bedevilled by lack of interoperability, often promoted by software companies and instrument manufacturers to create lock-in and closed markets. That is why Henry Rzepa and I have developed Chemical Markup Language as a core technology for interoperability and why we are members of the Blue Obelisk movement.

It is a major challenge to get these ideas accepted in any community (especially chemistry) and I’m happy to take this on. I’m prepared to be called foolish and unrealistic, to encounter prophecies of failure, and to be ignored by the mainstream of the discipline. But I don’t like being called unethical.
I strive hard to be ethical. I try to honour publishers’ copyright even when I fundamentally disagree. I do not post my own papers on the web as I am forbidden to do so by almost all publishers in chemistry (this is why I applaud Open Access chemistry journals and will submit papers). I have issues with primary publishers (such as the ACS) and secondary aggregators (such as the CCDC) who add copyright statements to primary scientific data. I regard this as counter to copyright practice and law as I believe that authors’ moral rights and the freedom of factual information cannot be overridden by publishers. This is not an oversight by the publisher – as far as we know Henry Rzepa and I are the only authors to have published supplemental (factual) data in an ACS journal without surrendering copyright – and we understand this was not a right, but a one-off privilege.
I have also been publicly criticised on two occasions as being immoral in publishing Open Source programs in chemistry. The argument of the critics is that Openness undercuts responsible developers and destroys their market, leading to loss of support for science and poor quality code. This may or may not be true, but I do not see it as immoral. Similarly PeterG argues that only pay-to-read publishers can create and protect a high quality scientific record. Neither of my descriptions of Openness is, in my view, unethical.
In any case the facts do not bear this out. Open source code is gaining ground in science – for example Nature Publishing Group has selected the open Jmol (a Blue Obelisk member) as its tool for displaying protein structures. And I contend that conventional publishing is not effective in preserving the scientific record.

  • There is an increasing trend for publishers to lease electronic copies of the record, rather than sell them. This means that the average scientist – who may move institutions frequently – cannot carry around their copies of the journal. As an instance of ephemerality the publisher can switch off access to a journal at a moment’s notice. This happened to us yesterday – one of our students was reading a number of papers in an ACS journal and had bookmarked them. His browser then tried to open them all at once. The publishers’ software immediately (within 2-3 seconds and with no dialogue) interpreted this as an attempt to steal content and cut off the whole of the University of Cambridge web cache until further notice. We gather this is not an isolated occurrence. It is difficult for me, therefore, to regard pay-to-read electronic publishers as impartial creators and guardians of the archive of science.
  • Apart from the “full-text” the act of scientific publishing is extremely destructive of the scientific record. We have much anecdotal evidence that most scientific data (80+%) supporting primary publications are lost for ever. Many publishers do not support supplemental (factual) data and those that do, do not support its capture in semantic form (PDFs destroy information very effectively). True, we are exploring with several publishers how to tackle this, but they can currently make no strong ethical claims for current practice.
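
To make the point about semantic capture concrete, here is a minimal Python sketch. The element and attribute names (compound, property, name, units) and the values are invented purely for illustration; they are not real CML and not taken from any publisher's format. The sketch only shows that when data are kept as markup, a program can recover each value together with its name and units without guessing, which is what is lost when the same numbers are flattened into a PDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration; NOT real CML syntax.
SEMANTIC = """
<compound id="c1">
  <property name="meltingPoint" units="celsius">123.5</property>
  <property name="yield" units="percent">82</property>
</compound>
"""


def read_properties(xml_text):
    """Return {property name: (numeric value, units)} from the markup."""
    root = ET.fromstring(xml_text)
    return {
        p.get("name"): (float(p.text), p.get("units"))
        for p in root.iter("property")
    }


properties = read_properties(SEMANTIC)
for name, (value, units) in properties.items():
    print(name, value, units)
```

A reader (human or robot) of the flattened text would have to infer which number is the melting point and what its units are; here that information travels with the data.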

So I contend that Open Access, Data, Source and Standards are not unethical. There will have to be new – and untested – business models for scientific information. Some won’t work. But the whole impetus of the current web with mashups and REST will inevitably change the face of science, so we should start preparing. There is nothing intrinsically laudable in publishing scientific material that looks visually the same as it did 120 years ago.
This blog is intended to promote constructive discussion so we welcome your comments. I shall attempt to be fair and – unlike one well-known Open Access forum – not routinely criticise any posting with which I disagree. I have even been known to change my point of view in response to careful argument supported by facts.
P.

Posted in open issues | 8 Comments

Welcome!

Welcome to the petermr blog! This is one of a series of blogs
from scientists in the Unilever Centre for Molecular Informatics at
Cambridge. I’ll indicate some of the others on my blogroll. For
now, just note that there is another blog specifically dedicated to
Chemical Markup Language (CML) and I’ll be contributing a lot to that as
well.
This blog will cover a wide range of topics that are mushrooming
on today’s web and which will change the practice of science. Areas
which I expect to blog frequently are:

  • The relationship between human readable material (“full text”)
    and scientific data. Henry Rzepa and I have coined the term datument
    for the synthesis of these, especially using XML technology. The
    scientific publication in its current form is inspired by 19th Century
    printing technology and “electronic publications” merely encourage
    outdated ways of communication. Web inspired technologies should
    revolutionize scientific communication. A particular interest is the
    development of the “robotic amanuensis” for scientists – personal
    software which can help individuals read and publish information
    effectively.
  • Open data, open source, open access, open knowledge. Unless we
    have free access to the primary outputs of science we are denied the
    opportunity to develop new ideas in informatics-driven science. I have
    argued publicly that primary scientific data belong to the scientific
    commons and that they must be free. A corollary is that the output of
    funded science is not just full-text but the complete supporting
    information environment of the experiments.
  • “programming for scientists”. Modern scientists are enhanced
    by “information prosthesis” – the ability to receive and repurpose
    information. If they are able to “program”, they have greater
    expressive power. Many of the future skills will not be with
    conventional programming languages but the tools emerging from the
    explosion of social and technical operations in today’s web. I’ll be
    learning from my colleagues and trying to give readers and contributors
    a flavour of what is now possible.
  • markup languages in (physical) science. These are the
    handmaidens of the goals above. Currently there are a few main
    approaches for content: MathML, GML (geography), Scalable Vector
    Graphics, Chemical Markup Language, AnIML (analytical chemistry),
    ThermoML (thermochemistry). There are many obvious gaps and I’ll suggest
    guidelines for any person or group interested in building a language.
  • creation and management of virtual communities. I’ve been involved with creating and nurturing communities for the last 15 years including
    BioMOO, the Virtual School of Natural Sciences, XML-DEV, and now the Blue Obelisk. I also believe strongly in
    Wikipedia and related efforts. I’ll review the features of successful communities and the
    guidelines for growth.

We welcome anyone as a poster but require them to register (to
prevent spam). We honour copyright, but ask that posters make their
contributions available under Creative Commons. This allows the posters
to retain their moral rights, but allows us to re-use the blog
(including their contributions) for other purposes if required (e.g. it
might be revised for supporting information, tutorials, etc.). We will
always attribute posters.
Technical note: I can edit this blog (e.g. if I make typos or get something wrong) but no-one else can. If you post a comment, we don’t think that anyone can change it. So be careful!
Please let us know your ideas.
P.

Posted in general | 2 Comments