Monthly Archives: September 2006

Datuments and the ACS Style Guide

I was delighted to receive a special book yesterday:

“The ACS Style Guide”

Effective Communication of Scientific Information

 Anne Coghill and Lorrin Garson.

OUP ISBN-13:978-0-8412-3999-9

It’s an attractively produced hardback volume and I’m torn as to whether I should keep it as pristine as possible or cover it with annotations. I think I’ll do the latter!

The editors did me and Henry Rzepa the honour of contributing a chapter on Markup Languages, which we have called:

“Markup Languages and the Datument”.

In the Foreword Madeleine Jacobs, Executive Director/CEO of ACS writes:

“I fell in love with chemistry when I was 13. I fell in love with writing at the age of four…” and

“The goal of the [guide] is to help authors and editors achieve [...] ease and grace in all of their communications”

So the editors asked Henry and me to look ahead and write about style in an environment that is still building itself. Obviously we shall be out of date in some respects very soon, but we have tried to anticipate the closer linkage of machines and humans in science – epitomised by Tim Berners-Lee’s Semantic Web. The scientific publication of the future will soon be very different from what we do now - the younger generation may soon not use pen and paper and expect instant multichannel information. Science has to react.

So as a first step Henry and I have coined the term “datument” [1] – a portmanteau of “document” and “data”. This is a single compound (or hyper-) document representing the complete experimental and scientific environment of the researcher or scholar. The first steps are to integrate multiple markup languages (e.g. MathML, XHTML, SVG, + CML, AnIML and ThermoML in chemistry). Each language has an intelligent browser or other user agent which can understand the appropriate part of the document. And this is not just creating something that is visual – an equation might say “integrate me” – a molecule might say “I can give you my molecular weight and you can calculate my logP”. When we have rich clients such as Bioclipse (more later) we shall be able to let our machines read the boring bits of the paper while concentrating on the more complex results. Already our group can read a datument and send it off to calculate additional properties of the molecules. This takes a few minutes so the human can read the text while the machine enhances the data.
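
As a rough illustration of the idea that each part of a datument can be handed to something that understands it, here is a minimal Python sketch. It is not our group's actual software: the tiny document, the `ATOMIC_MASS` table and the function names are all made up for the example, and only the three namespace URIs are the real published ones for XHTML, MathML and CML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URIs for the three languages mixed in this toy datument.
XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"
CML = "http://www.xml-cml.org/schema"

# A made-up compound document: prose, an equation and a molecule together.
DATUMENT = f"""
<html xmlns="{XHTML}">
  <body>
    <p>Water is described by the molecule below.</p>
    <math xmlns="{MATHML}"><mi>x</mi><mo>+</mo><mn>1</mn></math>
    <molecule xmlns="{CML}" id="water">
      <atomArray>
        <atom id="a1" elementType="O"/>
        <atom id="a2" elementType="H"/>
        <atom id="a3" elementType="H"/>
      </atomArray>
    </molecule>
  </body>
</html>
"""

# Assumed, rounded atomic masses; a real tool would use a full table.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def count_by_namespace(root):
    """Count elements per markup language, as a dispatcher might."""
    return Counter(namespace_of(el) for el in root.iter())


def molecular_weights(root):
    """Let each CML molecule 'give its molecular weight' from its atoms."""
    weights = {}
    for mol in root.iter(f"{{{CML}}}molecule"):
        atoms = mol.iter(f"{{{CML}}}atom")
        weights[mol.get("id")] = sum(
            ATOMIC_MASS[atom.get("elementType")] for atom in atoms
        )
    return weights


root = ET.fromstring(DATUMENT)
counts = count_by_namespace(root)
weights = molecular_weights(root)
print(counts[XHTML], counts[MATHML], counts[CML])
print(weights)
```

The point is only that the same file is both readable text and computable data; a real user agent would render the XHTML, typeset or evaluate the MathML, and pass the CML to a chemistry toolkit.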

The previous style guide was published in 1997 and our contribution will look very strange in 2015! I hope that some of the ideas still make sense in that brave future. I gently predict that the Style Guide then will look very different from the book today. But I shall still need to be able to “write on it”!

I’ve been invited to the ACS on Thursday next week and hope to be able to meet some of the other authors. I’ll be taking the guide as my reading on the plane.

[1] This works in IE. It used to work in Firefox. The upgrades have broken it. Since the datument is on the publishers’ site there isn’t much we can do (though perhaps we should take a copy and mend it ourselves). It is so frustrating to have to fight the browsers every few months…

 

Wikipedia: Getting started

Sometime last year I made my first edit to Wikipedia. I was extremely nervous despite many years on the web and having built and run virtual communities. What if I said something stupid? Or broke one of the rules? Since the whole history is recorded I can’t wipe out my mistakes!

I have forgotten exactly what I edited but it was probably changing a little bit of syntax and hoping no one would notice! Nothing went wrong, and I grew in confidence. I found a few things I know about and perhaps added a link. And after a time got to adding new sentences.

These were probably anonymous (i.e. I was only identified by my IP – and that is dynamic). There is no shame in being anonymous and many prolific contributors stay that way. But some register with WP and take up a username – mine is petermr as I have used that for 15 years on the public web. So I am probably easily identifiable :-)

As soon as you register you get a “Talk” page where messages can be left. The first thing is a Greeting:

Welcome! Hello Petermr, and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Here are a few good links for newcomers:

I hope you enjoy editing here and being a Wikipedian! Please sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. If you have any questions, check out Wikipedia:Where to ask a question or ask me on my talk page. Again, welcome! – UtherSRG (talk) 13:02, 5 December 2005 (UTC)

I already knew how to edit Wikis but even if you get it wrong someone will tidy it up. The main thing that worried me was whether I would be able to fulfil the high standards. So I read the five pillars and they are worth reproducing in full:

Wikipedia is an encyclopedia incorporating elements of general encyclopedias, specialized encyclopedias, and almanacs. Wikipedia is not an indiscriminate collection of information. It is not a trivia collection, a soapbox, a vanity publisher, an experiment in anarchy or democracy, or a web directory. Nor is Wikipedia a collection of source documents, a dictionary, or a newspaper, for these kinds of content should be contributed to the sister projects, Wikisource, Wiktionary, and Wikinews, respectively. Wikipedia is not the place to insert your own opinions, experiences, or arguments — all editors must follow our no original research policy and strive for accuracy.

Wikipedia has a neutral point of view, which means we strive for articles that advocate no single point of view. Sometimes this requires representing multiple points of view; presenting each point of view accurately; providing context for any given point of view, so that readers understand whose view the point represents; and presenting no one point of view as “the truth” or “the best view”. It means citing verifiable, authoritative sources whenever possible, especially on controversial topics. When a conflict arises as to which version is the most neutral, declare a cool-down period and tag the article as disputed; hammer out details on the talk page and follow dispute resolution.

Wikipedia is free content that anyone may edit. All text is available under the GNU Free Documentation License (GFDL) and may be distributed or linked accordingly. Recognize that articles can be changed by anyone and no individual controls any specific article; therefore, any writing you contribute can be mercilessly edited and redistributed at will by the community. Do not submit copyright infringements or works licensed in a way incompatible with the GFDL.

Wikipedia has a code of conduct: Respect your fellow Wikipedians even when you may not agree with them. Be civil. Avoid making personal attacks or sweeping generalizations. Stay cool when the editing gets hot; avoid lame edit wars by following the three-revert rule; remember that there are 1,408,046 articles on the English Wikipedia to work on and discuss. Act in good faith by never disrupting Wikipedia to illustrate a point, and assume good faith on the part of others. Be open, welcoming, and inclusive.

Wikipedia does not have firm rules besides the five general principles elucidated here. Be bold in editing, moving, and modifying articles, because the joy of editing is that although it should be aimed for, perfection isn’t required. And don’t worry about messing up. All prior versions of articles are kept, so there is no way that you can accidentally damage Wikipedia or irretrievably destroy content. But remember — whatever you write here will be preserved for posterity.

So how do these rules relate to creating an “Open Data” entry?

First we should ask whether it is necessary. Frequently we see duplicate entries in WP that zealous editors spot and suggest should be merged or otherwise tidied. For example, OD might be seen as part of Open Access. I don’t believe it is and will defend this view with reasoned arguments and historical references.

Secondly we must strive for Neutral Point of View. That means that I and others must not use it to promote OD although we can reasonably list some of our writings if they are substantive to the entry. The entry is not “mine” but “ours”. It would be completely appropriate to collect evidence that there was opposition to Open Data. But the page is NOT a debate between two sides, however carefully reasoned, although it could record such debates if they were deemed to be sufficiently important.

Soon we’ll create an entry and follow its progress…

P.

Let’s write a Wikipedia article

I have always been enthralled by the idea of a worldwide knowledgebase and a decade ago Lesley West, Alan Mills and I developed a technology to create a worldwide terminology. The Virtual Hyperglossary (TM) [probably the earliest use of this term] proposed terminological entries with unique identifiers in cascading dictionaries which – in principle – could resolve any term. It was ahead of its time and although we had several groups who were attracted, the technology did not exist.

Wikipedia (WP) has hit the right place at the right time. The Web is always able to tolerate many failures and WP was not the first attempt at a virtual encyclopedia. But it has the right combination of funding, zeitgeist, and technology. Something like this was bound to happen around now – it has turned out to be WP.

Many academic colleagues pooh-pooh WP as uncurated and capable of corruption. They are shortsighted – in <=2 years’ time I suspect WP will be standard reading in all undergraduate science and technology courses. This year – 2006 – has seen a critical mass of contributors in all subjects (with chemistry, as always, lagging behind the rest). The maths and physics are superb. The chemistry is good (given the current total disdain of almost all of the community). I salute the efforts of the relatively few who have laboured to create many excellent pages. I have predicted that in <= 5 years WP chemistry will be consulted more frequently than standard reference works such as the Merck Index (the recent edition is a massive red paper volume).

So how does WP work? Simply, anyone in the world can contribute and anyone can change what previous authors have contributed. And contributions can be anonymous. So isn’t this just mindless wibble? Unsurprisingly (to me, at least), no. As an example take the first thing that came to my mind: the Gibbs-Duhem relation (like other chemists I struggled with this as an undergraduate).

[Eyeball the article] to get a feel for the scope and quality.

I immediately get a feeling of competence and relevance to what I need. I will read this article with confidence if I need to know about this area of thermodynamics. How can I do this when I know nothing about the people who have written it? Couldn’t it be the delusion of a perpetual motionist? Or some failed undergraduate?

No. The reason is in the history. It was started in 2003, and has over 50 edits. My own experience is that scientific entries are heavily edited until there is acceptable consensus (there is a different approach to contentious issues – e.g. politics). You can see that the frequency of edits is slowing – a good sign that the entry has stabilised. You will see that there are several editors of which one, PAR, has made a large number of edits. PAR’s home page again manifests a high-quality contributor (I have no idea who s/he is). But note, also, many other edits with specialist or niche contributions.

So I myself have started a few pages (e.g. Molecular Graphics, to which having been an Officer of the Molecular Graphics Society I feel I can make a moderately authoritative contribution). There should be no pride in having done this as the work is not “mine” but the community’s. I’ve probably spent a day or two on this as I care about the discipline and its history (which is so easily lost). I have been the substantive contributor but various people have made contributions to formatting and style which are very useful – this consistency of presentation in WP is one of its great strengths.

In general physical science is often uncontroversial and so it is fairly easy to have a neutral point of view (NPOV). When we come to “Open Data” we shall have to be careful to avoid factionalism and advocacy and to research our sources.

“Open”

As I mentioned earlier I am about to start a Wikipedia entry on “Open Data”. Lorcan Dempsey noted that this was quite a common way of approaching a communal subject.
So I shall take readers through the process of creating a WP entry and hope to convince the unconvinced that this is a high-quality scholarly activity with appropriate checks and balances.
While I was doing this Beth Ritter-Guth has been creating an analysis of our shared vision for the Blue Obelisk. She has taken the discourse on the Blue Obelisk mailing list and her own discussions with Jean-Claude Bradley and summarised these. It is extremely valuable to have such a summary as often when new ideas and activities are started the participants are so engrossed in the detail they don’t have time to look at the wider picture. You can find links to the discussion in Jean-Claude’s recent summary.
My current position – and it has changed as a result of the discussion – is that the term “Open” both unites us and causes potential confusion. “Open” has connotations of trust, collaboration, innovation, etc. but because someone espouses “Open X” that doesn’t mean they espouse “Open Y”.

I realised this at a chemical informatics meeting last year. I gave my usual rant about Open Chemistry and the semantic web and then a software salesman talked about their product. He described it as having an “Open API”. [API = application programming interface; the instructions on how to configure the software]. I asked if it was published on the Web and he said no, it was a trade secret. So here “Open” = a manual that paying customers can read (as opposed to a product where customers have no idea how to configure it).

Our discussions on the Blue Obelisk mailing list revolved around the term “Open Source”. We use this in Blue Obelisk to mean “Open Source software” as defined by the Open Source Initiative. [The BO mantra is ODOSOS (Open Data, Open Source, Open Standards).] Naively I assumed that this was the only use of the term “Open Source”. However Jean-Claude uses the term “Open Source Science” and Beth had assumed that this means that the philosophy behind Open Source software and Open Source science were identical. In fact I (and I suspect most other BO members) have not heard of Open Source Science (example). So I looked this up and found it has been used for about 2 years to mean an approach to science which relies on collaboration and openness at an early stage in the process. Here is Jean-Claude on patents.
It seems reasonable to extend “Open Source” philosophy to other initiatives that share some of the general principles of Open Source computing. However we cannot assume that the actual practice is compatible. Having looked at Wikipedia I find that “Open Source” is so widespread it needs a disambiguation page which lists an amazing number of “Open Source Foo”:

Specific products

Licensing

Society and culture

Procedures

Organisations

Open-source software related:

Miscellaneous

  • Open Source, a radio show using open content information gathering methods hosted by Christopher Lydon
  • Open source intelligence, an intelligence gathering discipline based on information collected from open sources, i.e. information available to the general public.

This means that any use of “Open” is likely to be fuzzy and confusing. The “Open Access” movement is broad and supports several major points of view which, though overlapping, have significant differences either in pragmatics or philosophy. Moreover “Open Foo” does not imply “Open Bar”. Thus “Open Access” publications will not by themselves ensure “Open Data”.

More on this later…

Hamburger House of Horrors (1)

This is an occasional series indebted to Hammer House of Horrors. You don’t need to be a chemist to understand the message.
It’s sparked off by a comment from Totally Synthetic in this blog:

A good deal of the reasoning behind transcription of spectral data in publication is to impart meaning to the spectra. The 1H NMR spectrum of rasfonin, for instance, would be indeciferable to me, but the data written in the publication, transribed by the author and annoted for every peak would make (more) sense. It’s great to get an idea what the spectra look like, but more often than not, the actual spectra can be found in the supplementory data as a scan of the original. The combination of these two data sources gives the synthetic chemist everything they need.

Before I get onto the horror, let me make it very clear that Tot. Syn’s blog is excellent and I’m hoping that he can meet us at the Pub on Monday lunch. His blog is a model of the future of chemoinformatics and we’d like to bounce some ideas off him.

(I’m also not specifically criticising the authors of the paper – at least not more than all other organic chemists because this supporting information (SI) is typical. I am of course suggesting gently that the process of publishing organic chemical experiments is seriously and universally broken).
The supporting information is a hamburger PDF and this example excellently makes my point. (Please readers, read it – or as much as you can manage – as I need help. Especially from anyone who is involved in graphical communication). It’s a separate document from the original paper and even though on the ACS site remarkably seems to be openly viewable. Maybe the ACS will close it sometime or maybe this exercise shows that Openness enhances downloads.

The SI draws the spectra on their sides! This is a clear indication that they aren’t meant to be read on the screen, but printed out. But the SI is 106 pages long. That’s not unusual – we have seen over 200 pages. I am sure that many organic chemists who want to read it will print it out rather than trying to read it on the screen. The spectra run from pp 36-107 with no navigational aids – if you want to link a compound to its spectrum you have to scroll through the spectra till you find its formula. Some compounds are depicted as chemical formulae on the spectra and some, but not all, contain index numbers (bold in the text).

Let’s assume that you are at a terminal and your lab has used up its paper bill. You scroll down to the infrared spectrum of a compound:

rasfonin0.png

It doesn’t look very promising, so I turn my head 90 degrees to look at it. Not very comfortable. So there is a tool on Adobe reader that rotates the page to give:

rasfonin1.png

This is awful. It looks like the spectra I used to collect 30 years ago when the pen plotter was running out (before that we plotted the spectra by hand – it’s good for the soul). The resolution is probably 0.1 or better in the x-direction. I have no idea why it is so awful.

Now we want to look back to the text where the author has made the annotations (there are no annotations on the spectra so we have to skip back 70 pages) to find:

rasfonin2.png

Our helpful Adobe reader has turned all the pages round, so we have to turn this one back again. And, I suspect, the only real way to navigate this is to print it out.

The authors obviously spent a lot of time preparing this SI. The publisher probably calls it a “creative work” – you can claim copyright on creative works. I’d call it a destructive work. It doesn’t actually have a copyright notice, although the ACS has a meta-copyright where they assert copyright over all SI (except one from Henry Rzepa and me).

Now – please help me with the PDF. I have blogged earlier about OSCAR - the data extraction tool that can extract massive information from chemical papers in HTML or even Word. But it doesn’t work with PDF. Is there any way of extracting all the characters from this document? If I try to cut and paste I can only get one page at a time. Yes, I could probably hack something like PDFBox. But otherwise PDF is an appallingly efficient way of locking up and therefore destroying information.

The message is simple:

STOP USING PDF FOR SCIENTIFIC INFORMATION

DO NOT USE PDF FOR DIGITAL CURATION

GoogleInChI

Two months ago I was invited by Timo Hannay of Nature to a Nature/O’Reilly FooCamp at GooglePlex. Unfortunately I was already booked and Peter Corbett was able to step in. But there was a generic invitation from Leslie Hawthorn (who has just been running the Google Summer of Code) so last week in California I took a day off the ACS meeting to go to Google and offered to give a talk about the potential of Google in Chemistry using InChI.
There are millions of known chemical compounds and they are all distinct. It’s very valuable to give each a unique identifier and until recently this had to be done by an authority (Chemical Abstracts Service, Beilstein, etc.). This is problematic as the numbers are copyright and you have to pay to look up the formal link between number and compound. Recently the International Union of Pure and Applied Chemistry has developed an identifier, InChI, that can be automatically generated from the chemical structure with a free OpenSource program. This means that anyone can generate an InChI and the result for a given molecule will always be the same. So if we want to search for a molecule, all we have to do is generate its InChI and see if Google has indexed it.
Yong Zhang in our group set up a server and we were able to show that they could be discovered in Google. Nick Day last year showed that they worked incredibly well. The University of Southampton crystallographer Simon Coles had put 100 compounds on the web and used Nick’s approach to add InChIs. When Nick searched for them using Google he found all 100 and no junk. This must be one of the most accurate searches ever done!
So I was able to present these ideas to Leslie and colleagues and she offered to record this on video (ca 55 min) – and they do not retain copyright. Unfortunately when I came to the GoogleInChI demo I discovered that our service had died. A pity, but all the other demos worked.
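
The search side of the GoogleInChI idea can be sketched in a few lines of Python. This is not the service Yong Zhang set up; the function name is invented, the search URL form is an assumption, and the InChI string (standard InChI for ethanol) is typed in by hand rather than generated, since generating one needs the IUPAC program or a chemistry toolkit.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def inchi_search_url(inchi, base="https://www.google.com/search"):
    """Build an exact-phrase web search URL for an InChI string.

    `base` is an assumed search endpoint; any engine that accepts a
    `q` parameter could be substituted.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string beginning with 'InChI='")
    # Double quotes ask the engine to match the identifier as one phrase.
    query = '"' + inchi + '"'
    return base + "?" + urlencode({"q": query})


# Standard InChI for ethanol, entered by hand for the example.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

url = inchi_search_url(ETHANOL)
# Decode the URL again to confirm the identifier survives the round trip.
recovered = parse_qs(urlparse(url).query)["q"][0]
print(url)
print(recovered)
```

Because the identifier is derived from the structure alone, two people who have never met will build the same query for the same molecule, which is what makes an exact-phrase search so precise.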

It was really nice to meet Leslie and colleagues and start planning joint activities. There is a very different attitude to that in many other companies. They are keen on Open Source and also looking to provide new services in GoogleBase – perhaps more of that later.

Of course Google is a commercial organisation and not a charity but there is a lot of shared vision – we have different things to contribute to the vision. For example at the eScience meeting I’ve just been at there were many demos including GoogleMaps. It has made a considerable impression.

Who knows – Googlechem?

Chemistry, Chess and Computers

Sometime in the 1970s the Amer. Chem. Soc. published a review of Computers in Chemistry (I cannot remember the date or title and I’ve lost my copy) and it has remained an inspiration ever since. In it was summarised the work of the Stanford (DENDRAL, CONGEN) and Harvard (LHASA) groups on the applications of artificial intelligence to chemistry (structure elucidation and organic synthesis). Both have heavy elements of problem-solving, coupled with pattern recognition. The systems effectively contained:

* a knowledge base of chemistry

* a set of heuristics (rules)

* formal deterministic procedures (e.g. tree searches).
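
To make the three components concrete, here is a deliberately tiny Python sketch. It is emphatically not how DENDRAL, CONGEN or LHASA worked internally: the "compounds" are placeholder names, the knowledge base has three entries, and the single heuristic is invented for the example. It only shows how a knowledge base, a rule of thumb and a deterministic tree search fit together.

```python
# Knowledge base (hypothetical): each compound maps to alternative
# sets of precursors from which it could be made.
KNOWLEDGE_BASE = {
    "target": [("intermediate_a", "reagent_x"), ("intermediate_b",)],
    "intermediate_a": [("starting_1", "starting_2")],
    "intermediate_b": [("unavailable_1",)],
}

# Compounds assumed to be on the shelf already.
AVAILABLE = {"starting_1", "starting_2", "reagent_x"}


def heuristic(precursors):
    """Rule of thumb: try first the option with fewest missing precursors."""
    return sum(1 for p in precursors if p not in AVAILABLE)


def find_route(compound, depth=0, max_depth=5):
    """Depth-first tree search for a route back to available compounds.

    Returns a list of (product, precursors) steps in the order they
    would be carried out, or None if no route is found.
    """
    if compound in AVAILABLE:
        return []
    if depth >= max_depth or compound not in KNOWLEDGE_BASE:
        return None
    for precursors in sorted(KNOWLEDGE_BASE[compound], key=heuristic):
        steps = []
        for precursor in precursors:
            sub_route = find_route(precursor, depth + 1, max_depth)
            if sub_route is None:
                break  # this option fails; try the next alternative
            steps.extend(sub_route)
        else:
            return steps + [(compound, precursors)]
    return None


route = find_route("target")
dead_end = find_route("intermediate_b")
print(route)
print(dead_end)
```

Everything interesting in a real system lives in the size and quality of the knowledge base and the rules; the search itself is the easy part.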

The accomplishment was remarkable. The systems worked. They weren’t as good as a professional synthetic chemist, but in small areas they were better than me. It seemed obvious to me that with sufficient work on all components, but especially the knowledge base these systems would be able to do organic chemistry at the level of all except the best in the field. Certainly I expected that with the passage of 30 years the chemist/machine combination would be common. (Admittedly I sometimes believed too much hype about AI – now that I work in one branch (language processing) I know how difficult it is).

At the same time very similar work started to be done on chess. Again, when the first programs came out I could easily beat them (and I am a weak player). But gradually they improved and now they can beat essentially all humans.

It seemed to me that chemistry and chess would be quite similar. They are formal systems, too complex for brute force, and where a knowledgebase is essential. In chess all significant games have been captured in a database, and a large number of endgames have been exhaustively worked out. What is interesting is that the chess grandmasters have formed a symbiosis with computer programmers and machines and are still exploring what aspects machines can and cannot do. (I’m not an expert here and comments would be welcome).

By contrast there has been no significant work on chemistry and AI in, perhaps, 15 years. When I was in the pharma industry my boss used to speak of “another outbreak of Lhasa fever” (sic) – meaning that someone had suggested that machine synthesis should be explored. The Lhasa organisation has effectively stopped supplying synthesis methodology and turned to toxicology prediction (albeit highly valuable).

So I feel considerable sadness. I am sure that if synthetic chemists had embraced computers in the same way as chess players we would be significantly better off. This is, of course, an act of faith but it’s borne out by the knowledge revolution taking place in many disciplines. The bioscientists are eagerly exploring the S/semantic W/web with formal ontologies and reasoning – another approach to “AI”.

I’ve just been at the UK eScience (cyberinfrastructure) meeting for 3 days. (I’ll probably hark back in future posts). One keynote was given by Stephen Emmott (Director, Eur. Sci. Programme, Microsoft Research, Cambridge). Stephen talked about 2020 and gave a vision when computing could be based on biology – where molecular computers have already been injected into cells. Microsoft is hiring bioscientists who are also computer-able (i.e. they can make their ideas happen through code, rather than requiring comput/er/ational scientists to write the code for them). He stressed that he did not want a mixture of computer scientists and biologists, he wanted scientists with a mixture of computing and biology. Since his future involves molecules, maybe he’s also hiring chemist/computerScientists…

But we are actively discouraging the sort of work envisioned by Lederberg and Corey 30 years ago. There are exceptions – I spent 3 hours with my colleague Steve Ley discussing how we can bring modern informatics into synthetic chemistry. I am sure that our biggest problem is the lack of an immediate Open global knowledge base in chemistry. It’s all there on paper, but to get it into a machine is a mighty task. It will need new methods of computing – including social computing – and I’ll explore these ideas systematically in this blog. We might even achieve something with your help.

So I am pleased to see the quality of the chemical blogs, even if Tenderbutton is retiring. With lightweight mashup-like approaches we may be able to use the new approaches to informatics that are being developed in social computing. Biology has control of its knowledgebase – it had to fight to keep it in the genome information wars – but it’s vibrant and innovative. Chemistry has surrendered its knowledgebase to commercial and quasi-commercial interests who point in the direction of pharma rather than the information revolution. I will show in a week or two how we might be able to start regaining some of it.

P.

The cost of decaying scientific data

My colleague John Davies, who provides a crystallographic service for the department, has estimated that the data for 80% of crystal structures (in any chemistry department) never leave the laboratory. They are locally archived, perhaps on CDROM, perhaps on a local or departmental machine. With the passage of time – changes in staff, organisation, machines – information decays and it is likely that crystallographic data will be systematically lost.

Recently a number of UK groups have been funded by JISC – the Joint Information Systems Committee – to research the development of digital repositories. Three groups have been collaborating in chemistry, with a strong emphasis on crystallography and spectroscopy. This involves all aspects – building software, designing metadata specs, and understanding the way chemists work and think. We have found that the social aspects are at least as important as the technical – I won’t elaborate here yet as these will be reported at:

An eBank / R4L / SPECTRa Joint Consultation Workshop.
Digital repositories supporting eResearch: exploring the eCrystals Federation Model

Why is it important to archive the data? Isn’t normal academic publication (including theses) sufficient? Isn’t it very costly and a waste of money that could be spent on proper research?
Well, the crystallographic community has archived its data for many years and research on this data alone has given rise to hundreds or even thousands of papers datamining this resource. Without this chemistry would be very much poorer as we would have little in the way of molecular or crystal structure systematics.

So what is the cost of the unpublished data? To carry out the structures at commercial rates would be about USD 1500-5000 for the size of structures currently published. Let’s assume a laboratory does 500 structures a year and if we assume that full economic costs are half the commercial (this is just a guess) – we are looking at half a million dollars per year to do crystal structures in a chemistry department. (I suspect the numbers are on the low side – I’d be interested in comments).
Allowing that there has been some publication of some of the material as comments in chemical papers I suspect that the information from quite a high proportion of the structures is never published in any form. How easy is it to find information in current theses, especially if you don’t know it’s there?
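
For anyone who wants to check or vary my guesses, the arithmetic behind the "half a million dollars" figure can be written out as a short Python sketch. The inputs are exactly the numbers quoted above (500 structures a year, USD 1500-5000 commercial rate, full economic cost assumed to be half of that); the function name is just for illustration.

```python
def annual_cost_range(structures_per_year, commercial_low, commercial_high,
                      cost_fraction):
    """Return (low, high) yearly cost in USD for a department's structures.

    cost_fraction is the guessed ratio of full economic cost to the
    commercial rate (0.5 in the text above).
    """
    low = structures_per_year * commercial_low * cost_fraction
    high = structures_per_year * commercial_high * cost_fraction
    return low, high


# The figures quoted in the post; all of them are rough estimates.
low, high = annual_cost_range(
    structures_per_year=500,
    commercial_low=1500,
    commercial_high=5000,
    cost_fraction=0.5,
)
print(f"USD {low:,.0f} to USD {high:,.0f} per year")
```

With these inputs the range runs from USD 375,000 to USD 1,250,000 a year, so "half a million" sits towards the bottom of it, which fits my suspicion that the numbers are on the low side.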

I think I would be safe in saying that worldwide hundreds of millions of dollars’ worth of crystallographic data is lost each year. For spectra and synthetic chemistry it will be at least 10 times greater. Many synthetic chemists say they are interested in failed reactions – and these are almost never published!
If funders are aware of this they should be concerned about the loss. Funders are increasingly being proactive in requiring funded research to be Openly accessible. The Wellcome Trust is among the strongest proponents:

Robert Terry on Open Access

and a quote

The Trust provides additional funding to cover the
costs relating to article-processing charges levied by
publishers who support this model.
• Approximately 1% of the research grant budget
would cover costs of open access publishing

Moderatorial

A recent anonymous comment on this blog read

In that case, perhaps you should have parted with the observation “ACS is a problem”.
:-) , but partly serious.

I think the tone of this is out of keeping with this blog and I am therefore writing a “moderatorial”. This was a term I used (I doubt it was a neologism) when Henry and I ran the XML-DEV list. A Moderatorial (example) was to guide the list, but not constrain it. Although this is not a list, anyone can post a comment and I will automatically post it whether or not I agree with the sentiment.

However I wish to avoid flame wars and ad hominem remarks and outline my own philosophy on this blog.

I try to post statements which are accurate and not unnecessarily emotive. I do not completely have a strict Wikipedian-like Neutral Point of View (NPOV) in my posts and use the list for advocacy. However I do not wish the comments to be one-sided and invite a range of views – the result might indeed be neutral. I take as an example the excellent blog from Peter Suber – he is analytical and incisive. A typical example read:

(From ACS press release)

In October, American Chemical Society journal authors will have the option of paying to immediately provide free online access to their articles on the society’s website. Authors will also be able to post electronic copies of their sponsored articles on personal websites and institutional repositories. Fees for the program will range from $1,000 to $3,000 per paper, depending on whether the author is an ACS member or is affiliated with an institution that subscribes to ACS journals.

Comments (from PeterS).

(2) See my (PeterS) nine questions for hybrid journal programs, just published on Sunday. Of the nine, the ACS announcements give good and welcome answers to two: it will let authors deposit articles in repositories independent of ACS and it will not retreat on its green self-archiving policy. It gives unwelcome answers to two more: it will not let participating authors retain copyright and it does not promise to reduce its subscription prices in proportion to author uptake. (Hence, it plans to use the “double charge” business model.) It leaves us uncertain on the remainder: Will it let participating authors use OA-friendly licenses? Will it waive fees in cases of economic hardship? Will it force authors to pay the fee if they want to comply with a prior funding contract mandating deposit in an OA repository? Will it lay page charges on top of the new AuthorChoice fee?

(3) The ACS has been a bitter opponent of OA through PubChem and FRPAA. But I don’t believe it ever opposed the very idea of charging author-side fees to support the costs of a peer-reviewed journal, as some other hybrid journal publishers did before adopting the hybrid model.


This is a style I strive to emulate. PeterS has a position of advocacy (Open Access through various models) but reports accurately and without ad hominem arguments.

In the present case it is clear that the devil is in the details. Whether I welcome or criticize the ACS hybrid policy depends on whether it enhances the free use of data. It sounds dubious from PeterS’s report, but hopefully there will be more clarity from all parties.

In the case of control of published data – my fundamental position is that scientific data belongs to the commons and that there is good legal and moral precedent for this. The stronger this basis, the stronger the case. Open Access is complex and, I believe, changing so that entrenched positions are not always helpful. Although I wish for total Open Access I am prepared to work with publishers operating different models. My engagement is dedicated to trying to make scientific data Open.

I have frequently been asked to speak at the ACS meetings and have accepted. My advocacy for Open Data is robust but hopefully not personal. People and organisations are flexible. Thus, for example, I gave a talk at ACS last year in the Open Access session. There were presentations for and against Open Access and (in my opinion) the Open ones were better presented and more compelling. But I still listened carefully to all arguments. My own presentation was a demonstration of the power of data and the value of Opening it. As a result Pieter Borman invited me to talk at the annual meeting of the STM publishers in Frankfurt. I went with some doubt as to whether my arguments would be taken on board – but I had a good audience – and I heard (though I can’t find details) that STM publishers have recommended that scientific data should be copyright free (confirmation is welcomed).

So I don’t take entrenched positions about people and organisations, but about issues. The Firefox/downloading episode is a problem – I have highlighted it – and hope that the factual analysis makes a useful contribution. It might not change policy directly but it should help to avoid misunderstandings.

Finally therefore I shall directly accept all non-spam comments, but reserve the right to issue moderatorials if I feel the comments might ignite flames.

P.

OSCAR reviews a journal

In the last post I described OSCAR, which can review and extract chemical data from published articles. Here is how I used it to review the Beilstein Journal of Organic Chemistry.

The BJOC, unlike most other chemistry journals, encourages readers’ comments, so I thought OSCAR would like to add some. Since I did this on a Saturday none of the comments have been moderated (or at least none have appeared). I first added comments to the journal announcement about what I intended to do, and gave links to the OSCAR home page. I then started at the first paper and found the “Additional File 1” which contains a pointer to the chemical data. (The process seems overly convoluted, and I have commented on this.) I first downloaded OSCAR (the adventurous among you can try this and the following), started it (click the jar file), opened the BJOC (Word) file with the data, selected all of it and pasted it into OSCAR.

This is a very well presented file (and worthy of the authors’ organisations – GSK and Leeds) – not all chemical manuscripts are as well prepared. OSCAR reveals only two errors, both missing commas. (These are more important than they sound, as we rely on them for the parsing.) Typical results can be seen in the previous post. I therefore added this to the comments section for the paper. I assume the comments will appear in a day or two. I don’t know whether the authors will be automatically informed – I expect so – and whether the deposited data can then be corrected either by authors or editorial staff. If so, this is a real mechanism for cleaning up the current literature. Of course if the authors use OSCAR in future they will get a clean sheet!
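To show why a missing comma matters to a parser, here is a minimal Python sketch. It is not OSCAR's actual rule set: the peak-list layout (a shift followed by a bracketed assignment, with peaks separated by commas) and the function name find_missing_commas are assumptions made purely for illustration.

```python
import re

def find_missing_commas(peak_list):
    """Return the offsets of closing brackets that are followed directly by
    the next number with no separating comma.

    Assumes (hypothetically) a peak list such as
    '7.26 (1H, s), 3.45 (2H, t)' where peaks are comma-separated.
    """
    # A ')' followed only by whitespace and then a digit suggests that the
    # comma between two peaks has been left out.
    return [m.start() for m in re.finditer(r"\)\s+(?=\d)", peak_list)]


if __name__ == "__main__":
    data = "7.26 (1H, s), 3.45 (2H, t) 1.20 (3H, t)"
    for offset in find_missing_commas(data):
        print("possible missing comma after position", offset)
```

A parser that splits peaks on commas would merge the second and third peaks in this example into one unreadable entry, which is why such a small slip is worth reporting.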

I then applied OSCAR to all the papers in the Journal that contained chemical synthetic data – about 27. There is no standard place for the data – sometimes they occur in free text and sometimes in “Additional File n” (this name is not very helpful and I have suggested it should be changed to something with chemical semantics). I commented on the variability in navigation which made it difficult for me (and very difficult for OSCAR if it wished to review the journal systematically). OSCAR discovered several important errors – for example a chemical formula was wrong (this matters) – and made many suggestions about style improvements. (I did not comment on these as OSCAR’s rules don’t yet include BJOC policy.) I also noted that some papers didn’t include data. I did not comment on the chemistry at all – its merit or its correctness – as I am not a specialist except on data. But perhaps this will stimulate expert readers to do so in future.
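One way a machine can catch a wrong formula is to compare it with another number reported for the same compound. The Python sketch below is an illustration only and does not describe how OSCAR does it: the function names, the small table of average atomic masses, the restriction to simple formulae without brackets, and the 0.5 tolerance are all assumptions.

```python
import re

# Average atomic masses (g/mol) for a few common elements; extend as needed.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007,
               "O": 15.999, "S": 32.06, "Cl": 35.45}

def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into {element: count}.
    Brackets, charges and isotopes are not handled."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError("cannot parse formula: " + formula)
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def molar_mass(formula):
    """Sum the average atomic masses; raises KeyError for unknown elements."""
    return sum(ATOMIC_MASS[el] * n for el, n in parse_formula(formula).items())

def formula_matches_mass(formula, reported_mass, tolerance=0.5):
    """True if the mass computed from the formula agrees with the reported one."""
    return abs(molar_mass(formula) - reported_mass) <= tolerance


if __name__ == "__main__":
    # Glucose with a correct and an incorrect formula.
    print(formula_matches_mass("C6H12O6", 180.16))
    print(formula_matches_mass("C6H12O5", 180.16))
```

A check of this kind only flags an inconsistency between two reported values. Deciding which of them is wrong still needs the author or an editor.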

OSCAR raised concerns in almost all papers – ranging from punctuation to incorrect formulae. I stress that this is common in ALL chemistry papers – and should not be used to measure BJOC against others. They all need cleaning up.

I made additional comments on the accessibility of crystallographic data – these were not added as supplemental data and I argue strongly that they should be. I’ll write later about this.

I am hoping this will be seen as positive critiquing – it would be in compsci or crystallography. Certainly the adoption of data standards will make an enormous impact on the standard and re-usability of chemistry.

(Note: Our two summer students this year – Richard Moore and Justin Davies, again financed by RSC – have been refactoring OSCAR. We call this OSCAR-Data. OSCAR-Data uses OPSIN (OSCAR3) and accepts several input formats (SciXML, HTML), converts them into CML and then applies a set of custom rules (which could be publisher-specific).)