petermr's blog

A Scientist and the Web


Archive for June, 2010

When scientific models fail

Thursday, June 17th, 2010

Typed and scraped into Arcturus

From what I see so far, Climate Change Research involves a lot of analysis of data sets. I don’t know what the ratio of actual measurement to analysis is. I don’t know how often models are tested against experiment or against observed values.

Here’s a scientist concerned about an area of data analysis where there is great flexibility in choosing models, parameters and methods, and little check against reality. I’ll leave it hidden for a little while where this was published. It’s in a closed access publication which costs about 30 USD for 2 days’ access, so I’m going to copy the abstract (which I may) and some sentences from the body, for which I will claim fair use. I promise to give the reference later to be fair to the publisher (maybe their sales will increase as a result of my promotion). I’ll hide some key terms (XYZ is a common approach/philosophy) to add to the mystery.

A general feeling of disillusionment with XYZ has settled across the modeling community in recent years. Most practitioners seem to agree that XYZ has not fulfilled the expectations set for its ability to predict [...]. Among the possible reasons that have been proposed recently for this disappointment are chance correlation, rough response surfaces, incorrect functional forms, and overtraining. Undoubtedly, each of these plays an important role in the lack of predictivity seen in most XYZ models. Likely to be just as important is the role of the fallacy cum hoc ergo propter hoc in the poor prediction seen with many XYZ models. By embracing fallacy along with an over reliance on statistical inference, it may well be that the manner in which XYZ is practiced is more responsible for its lack of success than any other innate cause.

Sound familiar? Here are some more sentences excerpted from the text…

However, not much has truly changed, and most in the field continue to be frustrated and disappointed: why do XYZ models continue to yield significant prediction errors?

How could it be that we consistently arrive at wrong models? With the near infinite number of [parameters] coupled with incredibly flexible machine learning algorithms, perhaps the question really should be why do we expect anything else. XYZ has devolved into a perfectly practiced art of logical fallacy. Cum hoc ergo propter hoc (with this, therefore because of this) is the logical fallacy in which we assign causality to correlated variables. …

Rarely, if ever, are any designed experiments presented to test or challenge the interpretation of the [parameters]. Occasionally, the model will be tested against a set of [data] unmeasured during the development of the model. …

In short, XYZ disappoints because we have largely exchanged the tools of the scientific method in favor of a statistical sledgehammer. Statistical methodologies should be a tool of XYZ but instead have often replaced the craftsman tools of our trade: rational thought, controlled experiments, and personal observation.

With such an infinite array of descriptions possible, each of which can be coupled with any of a myriad of statistical methods, the number of equivalent solutions is typically fairly substantial. Each of these equivalent solutions, however, represents a hypothesis regarding the underlying [scientific] phenomenon. It may be that each largely encodes the same basic hypothesis but only in subtly different ways. Alternatively, it may be that many of the hypotheses are distinctly different from one another in a meaningful, perhaps unclear, physical way. …

XYZ suffers from the number and complexity of hypotheses that modern computing can generate. The lack of interpretability of many [parameters] only further confounds XYZ. We can generate so many hypotheses, … that the process of careful hypothesis testing so critical to scientific understanding has been circumvented in favor of blind validation tests with low resulting information content. XYZ disappoints so often, not only because the response surface is not smooth but because we have embraced the fallacy that correlation begets causation. By not following through with careful, designed, hypothesis testing we have allowed scientific thinking to be co-opted by statistics and arbitrarily defined fitness functions. Statistics must serve science as a tool; statistics cannot replace scientific rationality, experimental design, and personal observation.
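The chance-correlation and overtraining complaints above are easy to demonstrate for yourself. The sketch below (my own illustration, not from the paper) fits a linear model to pure noise using more random “descriptors” than observations: the training fit is essentially perfect, while prediction on fresh noise is worthless.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, n_descriptors = 20, 20, 50  # more descriptors than samples

# Pure noise: the "activity" y has no relationship whatever to the descriptors X
X_train = rng.normal(size=(n_train, n_descriptors))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_descriptors))
y_test = rng.normal(size=n_test)

# Least-squares fit; the system is underdetermined, so it interpolates the noise
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(y, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print("training R^2:", r2(y_train, X_train @ coef))  # essentially 1.0: a "perfect" model
print("test R^2:    ", r2(y_test, X_test @ coef))    # no predictivity at all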

Open Data: Climate Change research and Chemoinformatics

Thursday, June 17th, 2010

Typed into Arcturus

The more that I read about Climate Change Research the more I find similarities with one of the domains in which I might be called an “expert” – Chem[o]info[r]matics (the o/r are often elided). I’m on the editorial board of J Cheminformatics ( ), which is published by Biomed Central, almost all of whose journals are Open Access. (I would not sit on the board of a closed access journal and have publicly resigned from one.) I’ll explain later what Cheminformatics is, but I’ll start with a graph that appeared in a peer-reviewed journal in the subject. It’s a serious, respected journal and the editors seriously reviewed this submission, which makes a serious point. So as not to spoil your voyage of discovery I won’t tell you where it is, and I’d ask any chemoinformatics readers of this blog not to blurt out the message. Trust me, it’s relevant to cheminformatics and relevant to Climate Change. No, I haven’t pasted the wrong graph by mistake.

Please comment on this graph as naturally and lucidly as you see fit. There is no “right answer”.

AMI: Can you help us build a virtual chemistry laboratory in Second Life?

Thursday, June 17th, 2010

Typed into Arcturus

We have a JISC-funded project, AMI, under the Virtual Research Environment Rapid Innovation scheme, and we are getting into full swing. The idea of the project is to create an “intelligent fume cupboard” [fume hood] that can record simple data (events, materials, etc.) and answer questions. We are thinking big and have already developed a speech interface (using Chem4Word). It’s an experiment in that we don’t know exactly what we are going to do, but we have a lab full of inexpensive sensors and transducers (IR, ultrasound, RFID, barcodes, thermo, video, etc.). Many of these will automatically capture events such as humans coming to AMI.

Last week we talked with Steve Ley, one of the world’s great synthetic organic chemists, and he suggested we should think of avatars – as in the recent movie. That’s a great idea, and after a little while I thought of Second Life, where there is a usable avatar technology. I’ve always been interested in this and in the early 90s helped to build a virtual environment at BioMOO, unfortunately now defunct.

The Blue Obelisk (Open Data, Open Source, Open Standards) in chemistry has built a SL environment for chemistry, and Jean-Claude Bradley and Andrew Lang have made impressive strides and developed a community, built round teaching and citizen science research (measuring solubilities) and also malaria medicinal chemistry. So we have technology and community and energy.

This post is to appeal to anyone who is interested to join in. Anyone can do this, as there are all sorts of skills that are valuable in building virtual communities. High-school chemistry is useful but not required; so is scripting in SL. Constant energy, dedication and the ability to work in a largely unstructured community are valuable.

If you are interested, visit the Blue Obelisk ( ) and join the list ( ). I am an optimist and see it as possible to create a growing community round this project.

Open Data: My apologia

Wednesday, June 16th, 2010

Typed into Arcturus

My blog on the RI meeting has been blogged at

which is factual and fair. I should make it clear that I am not putting anyone on a spectrum (“sceptics”, “nice guys”, “cheats”) etc. I went to a meeting I knew almost nothing about and came back saddened and concerned about an apparent priesthood. This has been confirmed by various public emails and blogs which show that there is concern in the community about this issue.

A comment on the blog above reads:

“I had no idea that this “FOI battle” had been going on for several years and that nothing had been done to try to solve the problem”.

Wikipedia says “Peter Murray-Rust campaigns for open data, particularly in science”.

Now that would be nice if Peter was to make a start on an open data campaign in climate science as he seems to be several years behind.

June 15, 2010 | martyn

And this is a good reason for me to make my apologia for Open Data and why I am active in an area I know little about.

I have no problem about “being several years behind” – I would expect nothing less. Ignorance is not a crime – we are all ignorant of almost everything. [Arguing from known ignorance is less excusable.]

I have spent a lunchtime hour flicking through blogs I have been pointed to (e.g. ). There are many issues but my only comment will be that there is a range of views on how easy it is to preserve data. Some posters express surprise that all data is not preserved for ever, others that historically it has been very difficult to preserve it. My own view is that it depends on the motivation, the tools and the funding. Any missing component leads to data being lost.

So what is Open Data and why am I talking about a discipline (Climate) I don’t know much about? I got involved in Open Data about 5 years ago when I was enraged by publishers who sprayed copyright notices over factual data and who were less than enthusiastic about addressing any problem to do with data. The term “Open Data” was almost unknown then and while I am not the first to put the two words together they were sufficiently rare that I started a Wikipedia page ( ) – [BTW this needs updating].

Since then I have been invited to speak on Open Data at a number of meetings (often Open Access or library meetings), met with many editors and publishers, and most recently worked with the Open Knowledge Foundation and Science Commons, resulting in the Panton Principles. Most recently Biomed Central honoured us by presenting Open Data prizes and asking us to judge and award them.

I have also worked with the JISC in the general area of Open data and most recently am the PI of a grant award (with OKF, International Union of Crystallography, British Library, Cambridge University Library and PLoS) on “Open Bibliography”. It hasn’t yet started but we’ve made good progress.

My claim to be involved is that there are universal aspects to Openness in science (and usually corresponding benefits). I’ll summarise what I (and, I believe, colleagues in the OKF) would feel able to do in an objective manner:

  • Inspect data resources and determine whether they were fully Open according to the Open Knowledge Definition ( ). It should be possible to do that in most cases without expert knowledge of the domain.
  • Help to provide a label (button) stating that the resource was Open Data.
  • Inspect a bibliography and determine which of the resources pointed to by the bibliography were Open and comment on appropriate aspects.
  • Work with bibliography creators to ensure that the bibliography itself was Open (even if some of the resources to which it pointed were not)

This list is a first pass – please comment. Note that I myself do not intend to create the bibliography of metadata – that would be inappropriate. A bibliography is an important resource which often represents a point of view, and hopefully people in the Climate area have bibliographies (these often emerge when writing theses and reviews). Note that the overall infrastructure of a bibliography, and its Openness, is independent of whether the science is good or flawed, whether the people quoted have a particular viewpoint, or whether they are nice or nasty.

If a resource can be identified as Open, then it can save a great deal of time (and sometimes money) when it is re-used. An Open diagram can be used in a review, book, teaching, etc. without further permission. Data can be mined from it. Text can be quoted from it. These things by themselves can add considerably to the speed and quality of a scientific field.

What if the Open resources are quoted in preference to the Closed ones? That might give a false view of the field? In which case there is a good incentive for making more resources Open.

Here are examples of Openness for resources in climate:

  • The “Keeling Curve” ( ). This carries the licence:
    Own work, from Image:Mauna Loa Carbon Dioxide.png, uploaded in Commons by Nils Simon under licence GFDL & CC-NC-SA; itself created by Robert A. Rohde from NOAA published data and is incorporated into the Global Warming Art project. However, NC is NOT Open – you could not use this in a textbook, create a movie from it, etc.
  • The IPCC’s AR4 Synthesis Report. ( ) “The right of publication in print, electronic and any other form and in any language is reserved by the IPCC. Short extracts from this publication may be reproduced without authorization provided that complete source is clearly indicated. Editorial correspondence and requests to publish, reproduce or translate articles in part or in whole should be addressed to: [IPCC]“. This is NOT Open.


  • Atmos. Chem. Phys., 10, 9-27, 2010
    © Author(s) 2010. This work is distributed
    under the Creative Commons Attribution 3.0 License.

    A comprehensive evaluation of seasonal simulations of ozone in the northeastern US during summers of 2001–2005

    H. Mao1, M. Chen2, J. D. Hegarty1, R. W. Talbot1, J. P. Koermer3, A. M. Thompson4, and M. A. Avery5


    This IS OPEN. The licence (CC-BY) is fully conformant with the OKD. As ACP is an Open Access journal I expect that all publications carry this rubric. (Apologies for the cut-n-paste into Word.)


So it should be possible to annotate any bibliography as to whether the items are Open. I can’t give examples of datasets as I don’t know the field. Certain ones (e.g. works of the US government) may be clearly Open, but many others will be fuzzier.
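To make the annotation idea concrete, here is a trivial sketch of the kind of checker I have in mind. The licence lists are my own illustrative subsets, not an authoritative reading of the Open Knowledge Definition, and a real checker would have to read the full licence text rather than match identifiers.

```python
# Illustrative subsets only -- NOT an authoritative reading of the
# Open Knowledge Definition; a real checker must read the small print.
OKD_CONFORMANT = {"CC-BY", "CC-BY-SA", "CC0", "PDDL", "public-domain"}
KNOWN_CLOSED = {"all-rights-reserved"}

def openness_verdict(licence: str) -> str:
    """Classify a licence identifier as 'Open', 'NOT Open' or 'unknown'."""
    if licence in OKD_CONFORMANT:
        return "Open"
    # Non-commercial or no-derivatives clauses fail the Open Knowledge Definition
    if licence in KNOWN_CLOSED or "-NC" in licence or "-ND" in licence:
        return "NOT Open"
    return "unknown"  # a human must inspect the resource

# The three climate examples from this post, labelled as I read them:
examples = {
    "Keeling Curve (Wikipedia image)": "GFDL & CC-NC-SA",
    "IPCC AR4 Synthesis Report": "all-rights-reserved",
    "Atmos. Chem. Phys. 10, 9-27 (2010)": "CC-BY",
}
for item, licence in examples.items():
    print(f"{item}: {openness_verdict(licence)}")
```

The point of the sketch is the second bullet above: a mechanical verdict, plus an honest “unknown” wherever the licence is not machine-decidable.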




Open: Challenging Priesthoods

Wednesday, June 16th, 2010

Dictated and Scraped into Arcturus

There have been a number of useful comments on my blog posts relating to open data in climate science. I’m conscious that I am walking into an area that I know little about and will defend why I think this is useful. I will also tell you what I am not going to do.

Martin Ackroyd says:

June 15, 2010 at 2:02 pm  (Edit)

I’d suggest that essential reading for anyone interested in these issues are:

Climategate The Crutape Letters by Steven Mosher and Thomas W. Fuller and

The Hockey Stick Illusion;Climategate and the Corruption of Science by A W Montford

It has often been said that the climategate emails were taken out of context. But with the full context, as revealed by Mosher and Fuller, they are utterly damning. And these were emails exchanged between leading authors of the IPCC reports.

Once you understand how the famous IPCC Hockey Stick Graph was based on erroneous statistics and dodgy manipulations of proxy data, as set out in verifiable detail by Montford, you wonder if anything at all from “climate scientists” can be trusted.

Richard J says:

June 15, 2010 at 3:34 pm  (Edit)

Nigel Lawson sat on the House of Lords Select Committee which reviewed Climate Change and its Economics in considerable detail in 2005, taking a wide range of technical submissions. If I recall correctly, they concluded that the scientific and economic aspects were potentially flawed by the overtly politicised nature of the IPCC process.

Lawson’s concerns should not be dismissed lightly. They are shared by many scientists in Earth Science disciplines closely related to climate science, but perhaps not directly reliant on research funding in this field.

David Holland says:

June 15, 2010 at 6:19 pm  (Edit)


As the individual whose Freedom of Information Request resulted in the infamous email “Mike, can you delete any emails you may have had with Keith”, I have to tell you that Fred Pearce is well wide of the mark.

If anyone wants to know why someone would try to procure the deletion of AR4 emails just two days after I asked for them, just ask me for a confidential copy of my submission to the Russell Enquiry and confirm that you will not publish it. It is not on the Enquiry website because Sir Muir’s Enquiry does not have Parliamentary privilege and it is worried about being sued. I guess that also limits what the Enquiry will report.

I think these are useful encapsulations of some of the major issues that came out of the meeting. I shall confine myself to specific areas where I consider that my contribution was useful. There is no point in my acting as an investigative journalist or as a politician, so I shall not concern myself with the past history of the emails and the practice of climate scientists. Nor shall I get into the details of how the hockey stick was produced and whether it is a valid scientific instrument.

I do however, like many people, have expertise which may be valuable in this area. This relates to the general practice of science, whose principles are available to everybody, to the way that knowledge is communicated (again something that anybody has the right to be involved in) and slightly more specifically to some of the statistical processes which appear to have been required in some of the data analysis.

I do not actually intend to get involved in data analysis, but I will argue that I have every right to do so if I wish. In my day job – cheminformatics – I use a range of data analysis and statistical tools which are likely to be highly relevant to the processing and analysis of data in many fields, including climate. For example I have many years of experience in principal components, error analysis, data validation and the validity of statistical fitting (“overfitting”). I am on the editorial board of J. Cheminformatics and many of the issues we deal with appear to be similar to those in other disciplines. I would therefore feel it unreasonable to be told that I could not have access to data in climate research because I might misinterpret it.

Although I only have one evening’s evidence, it appears that Stephen McIntyre, a mining engineer from Canada, wished to analyse the hockey stick data ( ) and was unable to get it. I do not intend to debate the historical accuracy of this – the question is whether he has the right to do so. It is quite reasonable to assume that he had statistical and mathematical tools which were appropriate for this analysis. Put another way, if he were to submit a paper to J. Cheminformatics I would take the content of the paper, and not his background, as the material on which I would make a judgment.

It has been presented that McIntyre was challenging the priesthood of Climate Research and that he was excluded. Whether this is historically accurate is irrelevant to my argument and activity – I had a strong sense of closed ranks at the RI meeting. I sensed that if I asked for data I would not be welcomed and I suspect my current writings may not please everyone.

Science has always been multidisciplinary, but in the Internet age this has been accelerated. It’s possible for “lay people” (we need a better term) to take part in scientific activity. Galaxy Zoo ( ) has shown that “lay people” can author peer-reviewed papers in astronomy. There is absolutely no reason why anyone on the planet cannot, in principle, make contributions to science. Einstein worked in the patent office, Ramanujan in a government accounting office. (But before you all jump in remember that science is very hard, usually boring, work and has to be done carefully and with the right tools).

My concern here is with the cult of priesthood. I had the privilege of hearing Ilaria Capua speak of her campaign to get avian flu viruses published Openly. Until her work there was a culture of closed deposition, and the data were only available to those in the priesthood (and I believe gender may have been an implicit issue). I can’t find a Wikipedia entry (there needs to be one) so have to link to what I wrote ( ).

So what can and should I do to address this? I believe Open Knowledge (Open Data, Open Source, Open Bibliography) is a key activity. It liberates and enables. It is only threatening to indefensible positions. Of course not all data can be made Open, and nor can all code, but there is no reason why bibliography cannot (for example OKF’s CKAN).

More later


For the record – the RI meeting on CRU emails. PM-R ranting [tweeted by Brunella Longo ( )]

Open Data is necessary but not sufficient

Wednesday, June 16th, 2010

Dictated and Scraped into Arcturus

John Wilbanks is Director of Science Commons and a co-author of the Panton Principles. He has responded to my concerns about access to climate change data with the observation that Open Data is not the major problem or solution. I’ll comment at the bottom. I agree with what he says, but I will argue why there is a role for Open Knowledge in this issue.

We’ve spent a lot of time on climate change and open science at Creative Commons. I have a personal interest, as my father is a climate change researcher and was an author on the most recent IPCC report. He and I co-wrote a paper on open innovation in sustainable development earlier this year which was OA, and the references for that paper are a good start for the non-data side of the problem. It’s at


In most cases in climate change science, impacts, and adaptive responses, the hurdles for open science are not intellectual property rights but scientific practices related to confidentiality and protecting one’s own data and models – a different challenge. The current evaluation of the IPCC being done by the InterAcademy Council at the request of the UN is beginning to take a look at how such conventional scientific practices can become a threat to the perceived integrity of science. IP is a footnote in the debate, unlike in OA or in free software or in free culture. Our successes in these spaces have sadly conditioned us to look at “free” legal tools as our hammers, and see the world as a bunch of nails. It’s a great irony actually.

In the case of climate change mitigation, of course, the open science issues are similar to those in other areas of traditional manufactured technology – accentuated by the fact that the main drivers of increases in global GHG emissions are now in the larger developing countries, while the industrialized countries still control a lot of the intellectual property for addressing that problem….


In many ways the “open” debate about data fails to capture the reality of these issues. Making data open, even fully compliant with the Science Commons protocol, is actually far from enough. I hope that we can make these debates nuanced enough that we don’t push “open” as the end game, because I can comply with the protocol, or with Panton, and still have my data be worthless from a scientific perspective. An extreme example would be that I publish PDFs of my data under PDDL, and claim the mantle of “open”. If we as a community push “open” as the goal, and not “useful” as the goal, then we enable that outcome. 


Open climate science, at least as it regards data, is almost never an intellectual property problem. It’s a culture problem, it’s a technology problem (formats, ontologies, standards), and it’s a language problem. It’s a political problem, it’s an incentive problem. Getting rid of the IP is no more than table stakes. And if we don’t deal with the inventions – the technologies that both create climate problems and that promise to mitigate them in adaptation – then we won’t be changing the world the way we want. That’s a big part of why our science work has shifted to focusing significantly on patent licensing and materials transfer…

I completely agree that this is a culture problem. It was the culture of priesthood that hit me – unexpectedly and repeatedly – at the RI meeting. And I do not argue that IP issues are the primary problem. But I wouldn’t call Open Data simply an IP problem. Lack of Open Data is symptomatic of a deeper malaise. And Open Data is catalytic – if people are accustomed to making their data Open they are more likely to make their processes Open. A group that produces Open Data has to think about openness every time they release a data set, every time they publish a paper.

Perhaps an analogy would be laboratory practice. Running a safe and clean laboratory does not in itself make a good scientist. But it emphasizes certain fundamental principles and attitudes such as consideration for co-workers, having procedures in place, adopting discipline.

I’d describe Open Data as a necessary but nowhere near sufficient condition. But it’s also a visible and valuable touchstone. I’ll address this in the next post.


The Open Geospatial Consortium

Tuesday, June 15th, 2010

Typed and Scraped into Arcturus

When I started to blog and mail about Climate Change/Research I knew I was blundering into areas that I knew little about, and that I would discover a great deal of previous and current activity. I have had a wonderful response from Lance McKee of the Open Geospatial Consortium (OGC) [on the OKF open-science mailing list]:

I call your attention to one activity of the Open Geospatial Consortium (OGC): the GEOSS Architecture Implementation Pilot 3 (AIP-3) data sharing activity: .

There are many in the OGC ( ) who share your concerns about climate data. OGC runs a consensus process in which government and private sector organizations collaborate to develop open interfaces and encodings that enable, among other things, sharing of geospatial data, including climate data. I think the OGC is likely to play an important role in the opening up of climate science.

I invite you to look through a presentation in which I gathered my learnings and musings about the importance, feasibility and inevitability of persistent and open publishing of scientific geospatial data: http .


The presentation is well worth reading, including 17 (sic) reasons why data should be open.

It is very valuable to see that the OGC has done so much. I will read what emerges over the next days. It may be that the OKF has a role – it may be that it should be primarily supportive of others.

I have an open mind.

Open Climate Data: I cannot find the Spectrum of Carbon Dioxide without violating Copyright

Tuesday, June 15th, 2010

Typed and Scraped into Arcturus

Here’s an excellent example of the issues in Open Data: a simple, important question from David Jones (who is involved in climate research infrastructure). It’s in response to my last post on Open Data in Climate Research, and it’s an excellent tutorial on the issues. And the result is very depressing.

David Jones says:

June 15, 2010 at 12:42 pm  (Edit)

Perhaps you could kick off with what data you would like to see open.

Data that I would like, that you might have a professional opinion on, is a reference library for the IR spectra of the Kyoto Protocol gasses (CO2 and other greenhouse gasses). I had a look, but I couldn’t find an open archive of IR spectra. Do you know if one exists?

To remind readers – infrared absorption is the reason that greenhouse gases heat up the planet: they absorb infrared radiation and turn it into heat. The heat is trapped in the atmosphere. CO2 is an important greenhouse gas, so its infrared absorption is a key piece of data. Ideally we need physical and chemical properties for all atmospheric components.
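For what an open archive would actually serve: IR spectra are normally exchanged in JCAMP-DX, a simple line-oriented text format. Here is a minimal sketch of reading the XYPOINTS form of such a record. The data points are invented for illustration and are NOT a measured CO2 spectrum (though the strong CO2 absorption does lie near 2349 cm-1).

```python
# Minimal sketch of reading a JCAMP-DX spectrum record.
# The numbers below are invented for illustration only.
SAMPLE = """\
##TITLE=Carbon dioxide (illustrative data only)
##XUNITS=1/CM
##YUNITS=ABSORBANCE
##XYPOINTS=(XY..XY)
2300.0, 0.01
2320.0, 0.45
2349.0, 0.98
2380.0, 0.05
##END=
"""

def read_jcamp_xypoints(text):
    """Return (header dict, list of (x, y) points) from a simple JCAMP-DX record."""
    header, points, in_data = {}, [], False
    for line in text.splitlines():
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            header[key] = value
            in_data = key == "XYPOINTS"  # data lines follow until the next ## record
        elif in_data and line.strip():
            x, y = (float(v) for v in line.split(","))
            points.append((x, y))
    return header, points

header, points = read_jcamp_xypoints(SAMPLE)
peak = max(points, key=lambda p: p[1])
print(header["XUNITS"], "peak at", peak[0])  # prints: 1/CM peak at 2349.0
```

The point is how little machinery an open, machine-readable archive needs – the barrier is legal, not technical.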

I can probably find the spectrum in an undergraduate textbook. If I copied it I would be sued by the publisher and burn in copyright hell. Yes, it’s factual data; yes, it’s important for the future of the planet; and yes, the publisher simply copied it from the author. But copyright is the supreme god and we must worship it. So simply copying known public information from copyright-holders is a legal no-no.

I go to the web, Google for “collections of infrared spectra” and get:

(I have copied this without permission. Wiley is an aggressive publisher which pursued a graduate student, Shelley Batts, for blogging a single graph from one of “their” papers in a critical post. They said it was a mistake and everything was now OK. It’s OK in that copyright still rests completely with Wiley.)

Anyway, we digress. This shows that a SINGLE BOOK of spectra for a SINGLE USER can cost 3000 Euros (that’s about 4000 USD). That shows the scale of the problem we face in chemistry. Now I agree that these spectra were won with the sweat of the brow and so on, but in these days of automatic machines it does not cost 2 USD to publish a copy of a spectrum. This is an example of monopoly, scarcity control and inflated prices. (It may well be that Hummel does something laudable with the money – I have no idea.)

The message is not only that the data are not Open, but that they are enormously expensive.

Let’s try another:

First read the conditions (I have highlighted parts):

Use of Site. Thermo Fisher authorizes you to view, print and download the materials at this Web site (“Site”) only for your personal, non-commercial use, provided that you retain all copyright and other proprietary notices contained in the original materials on any copies of the materials downloaded or printed from the Site. You may not modify the materials at this Site in any way or reproduce or publicly display, perform, or distribute or otherwise use them for any public or commercial purpose. For purposes of these Terms, any use of these materials on any other Web site or networked computer environment for any purpose is prohibited. The materials at this Site are copyrighted and any unauthorized use of any materials at this Site may violate copyright, trademark, and other laws. You agree that you will not disclose, republish, reproduce, or distribute any of the information displayed on or comprising this Site (the “Content”) or make any use of the Content that would allow a third party to have access to the Content. If you breach any of these Terms, your authorization to use this Site automatically terminates and you must immediately destroy any downloaded or printed materials.

Not exactly cuddly. Where’s the data? They say:

The Spectra Online database is a collection of public domain and other data generously contributed from various sources. Please note that Thermo Electron Corporation does not control the reliability or quality of contributions to the Spectra Online database and therefore makes no guarantees or warranties on the usefulness or correctness of the information or data contained therein. Below are links to descriptions of current Spectra Online data collections:
Acorn NMR NUTS DB Searchable Archive
American Academy of Forensic Sciences (AAFS) MSDC Database Agilent MS of VOC’s Library
Boeing Aerospace FT-IR of Lubricants
Caltech Mineral Spectroscopy Server
CCRC Database – GC-EIMS of Partially Methylated Alditol Acetates
EPA Vapor Phase FTIR Library
EPA-AECD Gas Phase FTIR Database of HAPs
FBI FT-IR Fibers Library (Spectrochimica Acta)
David Hopkins NIR Collection
InPhotonics Raman Forensics Library
IUCr CPD Quantitative Phase XRD Round Robin Test Set
Jobin Yvon Raman Spectra of Polymers
LabSphere FT-IR and NIR Spectral Reflectance of Materials
McCreery Raman Library
NIST Chemistry WebBook
Notre Dame Organics Workbook Spectra
Edward Orton FTIR of Solid Phase Synthesis Resins
OMLC – PhotchemCAD Spectra
Pacific Lutheran University – NMR Spectra for Solomons and Fryhle Organic Chemistry, 7th Ed.
Pacific Lutheran University – FTNMR FID Archive
PhotoMetrics Inc. FT-IR Library
RMIT Applied Chemistry MS Library
SPECARB Raman Spectra of Carbohydrates
David Sullivan FT-IR Collection (University of Texas)
TIAFT User Contributed Collection of EI Mass Spectra
UCL Raman Spectroscopic Library of Natural and Synthetic Pigments
Univ. of Northern Colorado – Protein Infrared Database
University of S.C-Aiken UV-Vis of Dyes
USDA Instrumentation Research Lab NIR Library
U.S.G.S. Spectral Library of Minerals
University of the West Indies, Mona JCAMP Archive
Widener University – Dr. Van Bramer’s Spectral Archive

So what we have here is theft from the public domain. A variety of public sources have donated data to Thermo which has stamped them all with such a restrictive contract that I cannot even show you one spectrum. It is extraordinarily easy to steal from the public domain. Just wrap it in frightening legal stuff.

Now you could argue that actually I can take data from this site as it was originally public domain. But not all of it is. And if I am a robot I have no way of deciding which. I read the terrifying legal conditions and my system stackdumps.

This pollution and theft is endemic. We have to Open the Data.

Let’s try a US government site – NIST. It has an excellent set of chemical data – probably the best in the world. Here’s its WebBook with a spectrum ( )


And NIST is a US Government organization so all its works are ipso facto in the Public Domain, right? And so we can publish an Open Collection of Spectra by copying from NIST?

NO. The site says:

Standard Reference Databases are copyrighted by the U.S. Secretary of Commerce on behalf of the United States of America.  All rights reserved.  No part of our database may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without prior permission.

Source: Public Law 90-396, July 11, 1968, The Standard Reference Data Act

Purpose: To provide for the collection, compilation, critical evaluation, publication and sale of standard reference data


Section 6 – …the Secretary may secure copyright and renewal thereof on behalf of the United States as author or proprietor in all or any part of any standard reference data which he prepares or makes available under this Act, and may authorize the reproduction and publication thereof by others.

[The US Government has made an exception for NIST so it can collect money.] So I shall probably go to Guantanamo for publishing the spectrum. I’ll take the risk, but I clearly cannot copy the whole lot.

So, very simply, although some 20 million chemical compounds are known, there are no collections of Open infrared spectra.

As a responsible member of the Open Knowledge Foundation I am not prepared to appropriate material that has been “copyrighted” by others. So my conclusion is:


I hope this statement is wrong.


Open Data in Climate Research?

Tuesday, June 15th, 2010

Dictated and Scraped into Arcturus

Yesterday evening I went to a discussion at the Royal Institution. I’ll first give the abstract of the occasion and then my motivation and conclusions. Please read what I write very carefully, because I am not commenting on the primary science – I am commenting on how the science and its conclusions are, or are not, communicated.

The Climate Files; The battle for the truth about global warming


In November 2009 it emerged that thousands of documents and emails had been stolen from one of the top climate science centres in the world [PMR: The Climatic Research Unit (CRU) at the University of East Anglia (UEA), UK]. The emails appeared to reveal that scientists had twisted research in order to strengthen the case for global warming. With the UN’s climate summit in Copenhagen just days away, the hack could not have happened at a worse time for climate researchers or at a better time for those who reject the scientific consensus on global warming. Yet although the emails sparked a media frenzy, the fact is that just about everything you have heard and read about the University of East Anglia emails is wrong. They are not, as some have claimed, the smoking gun for some great global warming hoax. They do not reveal a sinister conspiracy by scientists to fabricate global warming data.

To coincide with the launch of his new book, The Climate Files, the veteran environment journalist Fred Pearce discusses how the emails raise deeply disturbing questions about the way climate science is conducted, about researchers’ preparedness to block access to climate data and downplay flaws in their research.

This will then be followed by a panel involving Dr Myles Allen (University of Oxford) and Dr Adam Corner (Cardiff University).

Fred Pearce was the main speaker and described in detail his analysis of the emails which had been exposed from UEA. I would agree from his analysis that there is no “smoking gun” and that many of the emails were unfortunate rather than malicious. He was then answered by Drs. Allen and Corner, and there was clearly some disagreement between them and him. The discussion was then opened to the audience (which included scientists, journalists and many others) and a lively and valuable debate took place.

I should make it clear that I am making no comment at the moment as to whether global warming is a reality and if so how important it is. And I am deliberately taking the position of an agnostic because I want to find for myself what the evidence is and how compelling it is. For that, it is important that the information is Open and so it is as a “data libertarian” (a useful phrase which I heard last night) that I attended the meeting.

As a result of the presentations and the discussions within the panel it seemed to me that there was a serious lack of Openness in the Climate Research community. It is important not to judge from just one meeting but given the enormous public reporting and discussion I was disappointed to find that there were still parochial and entrenched attitudes about ownership and use of data.

My superficial analysis is that the CR community has retreated into defensive mode and has not changed its communication methods or interaction with the community. This is perhaps understandable given the hostility and publicity of much of the media coverage and further comment (and UEA has put a ban on staff speaking on the issue). Such bans can backfire, as it is then easier to believe there is something to hide. It may be difficult, but it seems essential to radically overhaul the governance and communication.

On more than one occasion the panel asserted that Climate data should only be analysed by experts and that releasing it more generally would lead to serious misinterpretations. It was also clear that on occasions data had been requested and refused. The reason appeared to be that these requests were not from established climate “experts”. This had led to the Freedom of Information Act (FOI) being used to request Scientific Data from the unit. This had reached such a degree of polarisation that of over 100 requests only 10 had resulted in information being released by the University. I had no idea that this “FOI battle” had been going on for several years and that nothing had been done to try to solve the problem. This in itself should have been a signal that change was necessary – however inconvenient.

We should remember that climate research is not an obscure area of science but something on which governments make major and lasting decisions. It surprised me that there was not an innate culture of making the data and research generally available. The CRU is effectively a publicly funded body (as far as I know there is minimal industrial funding) and I believe there is a natural moral, ethical and political imperative to make the results widely available. The FOI requests should have been seen as a symptom of the problem of not making data available rather than, as it appears, being regarded as irritation from outsiders. Whatever the rights and wrongs, it was a situation with a high probability of ending in public disaster (as it did).

I was sufficiently concerned that I spoke at the end and although I do not have my exact words I said something like the following:

“I am a Chemist and a data libertarian. I am not an expert in climate change but I believe that I could understand and contribute to some parts of climate research (e.g. data analysis and computational science) and I do not accept the need for a priesthood. In my advocacy for publishing Open Data I encounter many fields where scientists and publishers are actively working to make data openly available. The pioneers of genome research and structural biology fought their culture (which included major commercial interests) to ensure that the results of the work were universally available. I see other areas where scientific papers cannot now be published unless the scientists also make their data available at time of publication. Climate research appears to have generated a priesthood which controls the release of information. For a science with global implications this is not acceptable.”

This will not be my last blog post on this issue. I was sparked into action when I heard a talk in Cambridge by Nigel Lawson (Margaret Thatcher’s Chancellor of the Exchequer). Lawson argued (using proof by political assertion) that climate change research was a conspiracy. He has now set up a foundation to challenge the mainstream view (The Global Warming Policy Foundation). However I realized while listening to him that I did not have compelling incontrovertible Scientific Data and arguments that I could use to challenge his views. This is an untenable position for a scientist and so I believe I must educate myself and my fellow scientists about which pieces of information are genuine.

To do this we have to develop a culture of Openness and a number of us discussed the problem at the Open Knowledge Foundation’s OKCon earlier this year. Although much has been written and continues to be written on climate research there is no Open repository of information.

The OKF’s goal is to create or expose Open resources. We are currently thinking about how to do this for climate research. We have to be extremely careful that we do not “take sides” and that our role is strictly limited to identification of Open resources.

Data-intensive Science: The JISC I2S2 project.

Monday, June 14th, 2010

Typed into Arcturus

I’m in Bath for a JISC meeting – the I2S2 meeting. All JISC meetings have acronyms – I2S2 stands for Infrastructure for Integration in Structural Sciences and involves a number of experimentalists working on finding the structures of materials (more on the I2S2 project at its website: ). For example Martin Dove (from Earth Sciences in Cambridge) is looking at how atoms in silicates move, and how this changes the structure of minerals. Since much of the Earth’s crust is made of silicates this is of importance in understanding tectonic movements, exploration for minerals, etc.

Here’s an example of the multidisciplinary nature of science – to find out what happens hundreds of kilometres (10^5 meters) deep in the earth we have to understand how atoms behave at the picometer scale (10^-12 meters). So there is a factor of some 17 powers of ten – and it’s remarkable how often the very small and the very large interact.

Martin collaborates with the Rutherford laboratory near Harwell, run by STFC. Martin uses neutrons to determine how the atoms move and needs a special “facility” (ISIS) to do this. Here ( ) are some of the many projects at ISIS, which include ways of improving mobile phones, medical diagnostics and much more. Science underpins our modern life and however we are to escape from our present plight we must see science at the centre. It’s something that the rest of the world admires in the UK.

ISIS produces DATA. And that’s what the I2S2 project is about. The data are expensive to produce (neutrons are not cheap) and the data are complex. STFC also has a large resource in developing new approaches to information, and Brian Mathews from STFC is therefore also on the project.

This is “large science”. But I2S2 also covers “long tail” science – where lots of science is done by individuals. Simon Coles runs the National Crystallographic Service in Southampton where hundreds of researchers submit their samples and his group “solves the structure” and returns the data. Here the data are likely to be in hundreds of separate packages.

What’s characteristic of these projects is that the data often drive the science. So managing the data is critical. And we’ve just been talking about problems of scale. If we get 10 times more data the problem becomes intrinsically more difficult – it’s not just “buying another disc”. New bugs arise and integration issues become critical.

So I2S2 is looking to see whether there can be a unified approach to managing data. This requires an information model, because only when we understand the model can we create the software and glueware to automate the process. This is not easy even when “most of the experiments are similar”. It needs expert understanding of the domain and a vocabulary (more technically an ontology) for the data and the processes. Moreover it’s not a static process – we often keep refining the processes in transforming and managing data.
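As a toy sketch of what an information model for experimental data might look like (this is not the actual I2S2 model; every name and field here is hypothetical), one could capture instruments, processing stages and provenance in a few Python dataclasses:

```python
from dataclasses import dataclass, field

@dataclass
class Instrument:
    """The facility instrument that produced the data (hypothetical fields)."""
    name: str      # e.g. a particular diffractometer or beamline
    facility: str  # e.g. "ISIS"

@dataclass
class DataSet:
    """One unit of experimental data, carrying its processing history."""
    identifier: str
    instrument: Instrument
    stage: str                                        # "raw", "processed" or "interpreted"
    derived_from: list = field(default_factory=list)  # provenance: ids of parent datasets

# A raw dataset and a processed dataset derived from it
raw = DataSet("expt-001/raw", Instrument("hypothetical-diffractometer", "ISIS"), "raw")
processed = DataSet("expt-001/processed", raw.instrument, "processed",
                    derived_from=[raw.identifier])
print(processed.derived_from)  # ['expt-001/raw']
```

Once even a simple shared model like this exists, software and glueware can follow the `derived_from` chain automatically, which is exactly the kind of automation the project is after.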

And the result of experiment A is often the input for project B. So the process is often shown as cyclic – the research cycle. A key concept is “data reuse” – in this area ideas often build on existing data (which is why I and others keep banging on about publishing data). Here’s a (relatively simple!) diagram for the research cycle in I2S2:

Note the cycle round the outside. Start at the NE corner. Not everyone maps their research in precisely these terms but most do something fairly similar. The data-intensive part is mostly at the bottom. Data are not simple – usually the “raw” data need processing before being interpreted. For example an experiment may collect data as photons (flashes of radiation) and these need integrating locally. Or they need transformation between different mathematical domains (“Fourier transform”). Or they are raw numbers from computer simulations. It’s critical that any transformation is openly inspectable so that the rest of the world does not suspect the authors of “manipulating their data to fit the theories”. That’s one reason why it’s so important to agree on the data transformation process, so that anyone (not just scientists) can see it has been done responsibly.
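To make the “openly inspectable transformation” point concrete, here is a minimal illustration (not actual facility processing code) of a raw-to-processed step using NumPy’s Fourier transform; the simulated signal and all names are invented for the example:

```python
import numpy as np

def to_frequency_domain(signal, dt):
    """Transform a raw time-domain signal into frequency-domain amplitudes.

    Every step is explicit, so the whole pipeline can be inspected and
    re-run by anyone, not just the original experimenters.
    """
    n = len(signal)
    spectrum = np.fft.rfft(signal)      # discrete Fourier transform of real data
    freqs = np.fft.rfftfreq(n, d=dt)    # the matching frequency axis in Hz
    return freqs, np.abs(spectrum) / n  # normalised amplitude spectrum

# Simulated "raw" data: a 50 Hz sine wave sampled at 1 kHz for 1 second
dt = 1e-3
t = np.arange(0, 1, dt)
raw_signal = np.sin(2 * np.pi * 50 * t)

freqs, amps = to_frequency_domain(raw_signal, dt)
print(freqs[np.argmax(amps)])  # peak at 50.0 Hz, as expected
```

Because the transformation is a short, published function rather than a black box, a sceptical reader can rerun it on the raw data and confirm that nothing was manipulated along the way.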

This is a microcosm of science – data is everywhere – and all of these projects will be thinking about how their data can be reliably and automatically processed, because automation gives reproducibility and also saves costs.

So when scientists say they need resources for processing data, trust us – they do.