PP1_0.1: What is Scientific Data?

Typed into Arcturus

apologies for formatting – Word=> WordPress has somehow trashed the paragraphs
This post is a first outline – not even a draft – of a proposed Panton Paper on “What is Scientific Data?”

The Panton Principles have declared that scientific data should be Open. As John Wilbanks put it:

“publicly funded science data should be in the public domain, full stop.”

Before you start saying “well what about human data, endangered species, etc.” I know there are many tricky cases. We cannot escape the fact that “Data is Difficult” and subsequent Panton Papers will address these. But the present post explores what Scientific Data is.

Warning: You will disagree with some of what I say. That’s because the area is complex, because I haven’t thought everything out, and because it differs from field to field. But I hope we can agree on some generalities. Please help to refine this, either through comments on this blog or on the OKF open-science list. And if you know of a similar or better analysis let us know.

I’m going to use “Science” to embrace STM (Scientific, Technical, Medical). Some Arts and Humanities (A&H) work may also be covered by this: archaeology has many STM features, for example, but literary criticism does not. Accept that I shall leave the borderlines fuzzy.

I’ll start with some rough generalities. Scientific data is usually created by a conscious act. It may be possible to extract new scientific results from reading newspapers, but normally the scientist consciously measures, observes or computes scientific data, or obtains data from those who have done this.

I suggest that scientific data is created in two main regimes:
  • Hypothesis-driven science, where a hypothesis is proposed and data collected that can falsify the hypothesis. Frequently this process is reported as one or more experiments.
  • Data-driven science (also Discovery Science), where data is collected and it is then analysed to show patterns either within the dataset or when combined with other data. Data gathering has an honoured history but is usually done with a purpose – random fact collection is rarely valuable. This motivation affects the choice of study and the methodology and should be made public – part of the Open information available to the world.

The data can come from two main sources:
  • Observation and measurement. In some domains observation (e.g. field studies) is still the only method, and in others measurements are carried out by scientists and recorded in notebooks, but increasingly the measurement of data (“raw data”) is through instruments and sensors.
  • Calculation. In many cases physical laws allow direct calculation of observable quantities, and computers now have sufficient power to do this. Computer programs in quantum mechanics, thermodynamics, classical mechanics and many other fields are often capable of showing excellent agreement with experiment, and are much cheaper or can simulate unobservable situations (e.g. inside planets or stars).

The basis of reporting an experiment is in part to allow other scientists to falsify the experiment. A scientist should expect others to try to disprove their work – they may not like it when it happens, but it’s a fundamental rule. Therefore a scientist should agree that when reporting an experiment they should make available all data necessary to repeat the experiment.

(Note that I am separating “data” from “materials”. In an ideal world – and some are trying to create this – the scientist should make available enough material for others to repeat the work. But here I am sticking to data, with the expectation that, if there is sufficient data in sufficient detail about the materials used (chemicals, animals, telescopes, seismometers, etc.), a repeater could, in principle, verify that they had an essentially identical experimental setup.)

When the results of an experiment are published it is usually in a self-contained “journal article”. In principle this article should contain all the data necessary to repeat the experiment. In practice this is very rare, but many domains are trying to achieve it. Many others are not.

This has all been preamble – now to the question of what data is. [Essentially the Panton Paper starts here, with comments to be interspersed between the separators]

  1. “Data” implies accompanying metadata (e.g. precise definitions of quantities, equations of interrelationships, scientific units of measurement, error analysis, etc.) – a minimal sketch follows this list.
  2. In experimental sciences the data is all the information required to repeat the experiment and the resulting data reported from that experiment.
  3. In data-driven sciences the data is the methodology of data collection and the contents of the database at a given time.
  4. In computational science the data is the program used to compute the results, the parameterisation of the program and the results of the calculation.
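To make point 1 concrete, here is a minimal sketch (in Python; the record structure and the values are my own illustration, not any community standard) of a single measured value carrying its metadata with it:

    from dataclasses import dataclass

    @dataclass
    class Measurement:
        """A single datum plus the metadata that makes it reusable."""
        quantity: str       # precise definition of what was measured
        value: float
        unit: str           # scientific unit of measurement
        uncertainty: float  # from the error analysis
        method: str         # how the value was obtained

    # A bare number like 353.4 is not yet scientific data; this is:
    mp = Measurement(
        quantity="melting point of naphthalene",
        value=353.4,
        unit="K",
        uncertainty=0.5,
        method="differential scanning calorimetry",
    )
    print(mp)

Without the unit and the uncertainty the value could not be checked or re-used; that is the difference between a number and a datum.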

What does this mean in practice? The typical journal article falls far short of the ideal but the relevant areas are usually two or more of:
  1. Materials and Methods
  2. Experimental
  3. Results
  4. Supplemental data (or supporting information).

These should all be regarded as complete units of data. There is no scope for any “creative works” – they are all factual reporting of the design, the experiment, the observations and the measurements. (Where data are processed – and this is covered in later papers – this should all be in “Experimental” or “Conclusions”).

Very simply then, all these sections should be regarded as factual data. They should be available in all publications without restriction from subscription barriers or contractual agreements. They should be text- and data-mineable without restriction.

Note that it is common for scientists to report data in many different forms and media. The following is a common subset of material that can be strictly factual:
  1. Text (including interspersed numeric values, names, organisms, chemical formulae, etc.)
  2. Mathematical equations
  3. Tables
  4. Images (cells, stars, animals, etc.)
  5. Drawings of experimental procedures (equipment, workflows)
  6. Graphs (relating variables, e.g. X-Y, scatterplot, histograms, etc.)
  7. Audio recordings
  8. Video recordings

None of these should be copyrightable as “creative works” and all should be made Open.

Debarring any section of the world community from Open availability of these sections is a direct detriment to science.

Note:

I have also argued that all the bibliographic metadata (author, journal, addresses) and the citations should be regarded as Open, but this will be addressed elsewhere.

Open Data: The concept of Panton Papers

Typed and scraped into Arcturus

[Pedantic note: I use “data” as either a singular or plural noun according to the feel of the sentence.]

This and following posts have several purposes. They are to help me get my ideas in order for http://opensciencesummit.com/ where I am giving a 10-minute talk on “The Open Knowledge Foundation” (to which I would now add “and Panton”); to try to address the enormous scope of Open Data; and to prepare the ground for a funded project on Managing Research Data.

The current theme is “Panton Papers”. The idea is that part of the value of the Panton Principles (http://www.pantonprinciples.org/ ) is that the whole document is short and the key points are simply made. But the “Principles” can therefore only address the motivation and the procedures for Open data in a general manner, and many of the problems are in the details. I believe that many of the problems in Open Access (which is simpler than Open Data) arose because not enough communal effort was given to the practice of Open Access and I want to avoid as many OD problems as possible before they occur.

Over the last 2 years (when Open Data has started to become important and discussed) I have seen several potentially difficult areas. I’ll simply list the ones I have thought of here and then outline the idea of the Panton Papers. This discussion is mirrored in part by the OKF open-science discussion list (http://lists.okfn.org/pipermail/open-science/2010-July/thread.html ) and you may wish to subscribe. There’s also a regular working group on open-science. (Almost everything in OKF is Open, but it may take a little while to find out where you want to be!). The issues that I currently have are:

  • What is data? Images? Graphs? Tables? Equations? Accounts of experiments? This is a major problem and almost completely unexplored. Without solving this we are held back 10 years or more in our ability to re-use the primary scientific literature (e.g. by closed-access publishers who claim that factual graphs belong to them).
  • Why should data be open? (and when should it not be?). I’ve put forward ideas in http://en.wikipedia.org/wiki/Open_science_data and http://precedings.nature.com/documents/1526/version/1 . They range from moral, to legal/quasi-legal to utilitarian.
  • Who owns data? This is one of the trickiest areas – there is legal and contractual ownership and there is moral ownership. Generally there is far far too much “ownership” of data.
  • When should data be released? This is a key question (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2473 for an example). Some communities have solved it – most haven’t addressed it and will have to go through the rigour of working out release protocols.
  • How and where should data be exposed? I am strongly of the opinion that we need domain-specific repositories (which could be national or international) and that Institutional Repositories are almost never the best place to expose data (I expect and welcome alternative opinions). The “how” depends on understanding what the data and metadata are and is increasingly dependent on specialist software and information standards. “Archival” is often the wrong word to use.
  • Datamining and textmining. Most authors, publishers, repository owners are unaware of the enormous power of automated analysis of the literature. Some closed access publishers expressly forbid these activities. We have to liberate the right of the scientific community to do this enthusiastically and efficiently.
  • Reproducibility. Science is based on reproducibility – we expect to be able to replicate the “materials and methods” of an experiment and to try to falsify its claims. Physical materials are beyond the immediate discussion (though this may change) but much science is now based on computing. It should be possible to replicate simulations, data cleaning, data analysis, model fitting etc. This is a tricky area. It is difficult (though with virtualization and the cloud it is becoming easier) to reproduce the computing environment. Large or complex data sets are a major problem but must be addressed. This is not without monetary cost. (A sketch of environment capture follows below.)

I may add more.
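On the reproducibility point above: even before tackling large or complex datasets, simply recording the computing environment alongside published results would help a repeater. A minimal sketch (Python; what to record here is my own suggestion, not any agreed standard):

    import json
    import platform
    import sys
    from datetime import datetime, timezone

    def environment_record():
        """Capture basic facts about the computing environment for replication."""
        return {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "python": sys.version,
            "platform": platform.platform(),
            "machine": platform.machine(),
        }

    # Store this record next to every computed result that is published.
    print(json.dumps(environment_record(), indent=2))

Virtualization and the cloud go much further, but even this much metadata would settle many “it does not reproduce” disputes.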

The idea is that each of these is a “Panton Paper”. It may or may not be crafted in Pantonia (the hectare of Cambridge UK containing the Chemistry Department, the OKF headquarters, and the Panton Arms). Everything I now write is mutable.

Each paper will have a toplevel document of similar form to the Panton Principles (i.e. 3–8 ideas, with short explanatory paragraphs). This document will be crafted by the OKF in public view on a wiki or Ether/Piratepad. Anyone can take part. We shall welcome contributions from a wide range of disciplines (in fact this is essential). At some stage version 1.0 of the paper will be frozen and will be formally published. We have an offer from a major publisher to do this and I am hoping we can announce this at Open Science Summit.

The Paper should carry a wide range of links to other essays on Open Data and should carry examples from different disciplines. For example, there is a well-tried and accepted process in many areas of bioscience and astronomy as to what, when and how data get published.

The wiki will be mutable so that changes in policies, and updates to links will be continuous, even after the V1.0 publication. This will also serve as an example of a new type of publication where the static, immutable “paper” is replaced by a reviewed series of time-dependent hyperdocuments.

Over the next few days I will refine this for presentation at OSS.


Open Data: A typical furore over when data should be published

Typed and scraped into Arcturus

The following recent story in the Times Higher Education Supplement (the “mainstream” magazine for HE in the UK) shows why we desperately need a clear basis for discussing data. I’ll comment inline, but initially just to make it clear that the fuss and hyperbole arise because there is no communal framework for understanding and addressing the problem. Also to remind readers of this blog that the UK has a Freedom of Information Act (FoI) which allows any citizen to make a request to a public body (government, local government, universities, public research establishments) for information. It is the law: a reply must be delivered within 20 working days and there are only a few grounds for refusal.

The background is “Climategate” (http://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy ) where the FoI was used repeatedly to try to extract data from the CRU at the University of East Anglia (UK). Ultimately there were email leaks and several public enquiries. I shan’t address the facts or the rights and wrongs other than to note that there was a complete lack (failure) of understanding between the requesters of the information (data on climate research) and those from whom it was requested.

My contention is that some of the problem has arisen because we have no framework for understanding who has a right to what data, and when. The use of a legal instrument (FoI) is inappropriate for scientific communication and serves to highlight the work we need to do to create a framework.

The article is http://www.timeshighereducation.co.uk/story.asp?sectioncode=26&storycode=412475&c=2 and I shall quote some of it and some of the comments (I justify everything as fair use/comment).

A ruling that FoI laws require him to share unpublished data has shocked a researcher […] Michael Baillie began analysing the rings in Irish oak trees more than 30 years ago […]

But three decades on, the FoI laws have been used by a science blogger, Douglas Keenan, to obtain data collected by the emeritus professor of palaeoecology at Queen’s University Belfast over the course of a career investigating catastrophic environmental events.

After a three-year battle to get the university to release the data, some of which are yet to be published by the academic himself, Dr Keenan won a ruling from the Information Commissioner in April that said that Queen’s owned the data and must release it.

PMR: The critical aspects here are “some of which are yet to be published by the academic himself” and “a three-year battle” and “Queen’s owned the data”. Whether the science has been published, whether the timescale is appropriate and “who owns the data” are going to be key questions. I haven’t read more than this article – I suspect the Information Commissioner had little precedent to go by and made an appropriate decision in the circumstances. I personally would dispute that Queen’s owned the data (I don’t think data can normally be owned). But this is the essence of exposing the problems.

The precedent has important implications for academics, raising issues similar to those highlighted in last week’s report by Sir Muir Russell into the so-called Climategate affair at the University of East Anglia.

Until now, researchers have published data at the time of their choosing, through the normal academic channels and in the context of the overall objectives of their work.

The decision in the Queen’s case indicates that any interested party can use FoI laws to request any data belonging to a UK university, whether they form part of an academic’s published work or whether they are still raw.

PMR: In the UK much law is by precedent, so if this was a court case this would set a precedent. I don’t know whether the Commissioner’s statements have the same force. But precedents can be overturned by later judgments. I certainly would not regard this as absolute.

Professor Baillie said the Information Commissioner’s ruling demonstrated how ill-equipped universities were to deal with the dilemmas this posed. “I think the problem is that no one has ever defined university data, either for academics like me or for the university as an institution.”

He said his professional relationship with Queen’s had been formed in a less managerial age, one untroubled by modern demands for public accountability.

“There was nothing in our old academic contracts about data and responsibility for data,” he said, adding that ownership of research data had never been discussed.

“As far as we were concerned, it was our data because the issue of who owned them never arose: the data belonged to the people who made the measurements,” he explained.

PMR: All this is true. It is the uncertainty that causes the problem, and that is why we have to address it.

This attitude is no longer acceptable, according to the Russell report, which criticised UEA for not being open enough with climate change sceptics who requested information under FoI legislation. It said the university had “failed to recognise not only the significance of statutory requirements (under the FoI Act), but also the risks to the reputation of the university and, indeed, to the credibility of UK climate science”.

PMR: Climategate involved data supporting previous publications, and the timescales were considerable. Many requests for data had been made (under FoI) and almost all had been refused. This at least points to a systemic problem over several years which should have been addressed. I don’t think it generalizes.

However, the Information Commissioner’s decision in the Queen’s case raises concerns that third parties could disrupt projects, with FoI requests potentially forcing the early release of data gathered as part of longitudinal research.

PMR: This is a concern but it is manageable if we create the right infrastructure. Timescales for release are now critical.

The Joint Information Systems Committee, which supports the use of IT in UK higher education, has commissioned a consultancy firm to produce guidance on the issue, which is due to be published in September.

“The big lesson is that a lot of the rules governing what people thought were exemptions didn’t stand up to analysis by the Information Commissioner’s office,” said Simon Hodson, programme manager at JISC.

PMR: Yes. We are delighted to be working with JISC who have funded projects where we are bringing greater Openness to data and procedures.

[Professor Baillie] explained: “An FoI request granted by the commissioner leaves the university on the back foot. It also leaves people like me with loads of data that we are still exploiting on the back foot. Clearly each university needs to have a definite policy on the release of research data.”

PMR: Agreed. And that is why we are trying to help formulate policies.

… and now some reader comments …

  • Dorothy Bishop 16 July, 2010

    Data sharing has become standard practice in the field of genetics. Obviously, when this was first raised, researchers were worried about others stealing a march on them, but guidelines have been developed to protect the interests of researchers, and overall the availability of data on the web seems to be of benefit to both researchers and the general community. See:
    http://www.genome.gov/Pages/Research/WellcomeReport0303.pdf

PMR: exactly. Some fields have solved this years ago – it’s important to learn from them.

  • Mary 16 July, 2010

    I’m curious to know how this FoI ruling will apply to cases where university researchers such as myself are industry funded and have non-disclosure agreements in place (as opposed to being funded from government grants or the university itself). Does the FoI Act overrule the non-disclosure agreements signed between the university and external (non-govt) funding sources? This has the potential to see industry funding dry up in a flash…

  • Bill Cooke 16 July, 2010

    Mary has a point. I possess elite-interview data gathered with a promise of confidentiality. It would be very difficult to share any part of this without compromising confidentiality. In the longer run, interviewees are just likely to say “no” to interview requests, are they not?

PMR: I would be amazed if this happened, and if it did there would be an appeal, which would be won. The UK has an element of common sense in its legal system (perhaps not enough…)


  • Ellie Dewar 17 July, 2010

    Government will need to think very carefully before extending its ‘transparency principles’ to research data. Fields benefiting from open data release are those where data is an end-product of the research – these are the exception, not the rule. Making data open takes time and effort. Diverting researchers from work they have been educated at great expense to do, to do something they see as damaging to their own interests, their industry collaborators, and their research goals might not be a great idea. As someone recently said, it is like forcing a sculptor to ‘openly release’ the lump of stone they are working on before it is a sculpture. Leave PIs to decide what to make open and when, at least until they have published.

PMR: premature worry.

  • Shabba 18 July, 2010

    The ESRC and MRC expect their PIs to place data in national repositories, and rightly so: this work is done at public expense. In my field (Health Services Research), research ethics committees rightly restrict access to unanonymized data about people and organizations, but NHS [UK National Health Service] trusts are also required to audit data collection undertaken by universities. We are careful to tell participants in our studies that what they say may be anonymized but that this may not adequately conceal their identity, and further, that anonymization and confidentiality are not the same thing. None of our data is immune from subpoena, and there are many public agencies that can demand to see it (the police and HMRC [HM Revenue and Customs], for example). In this context, FoI is only one possible way in which seemingly private research data is in fact very public. The most important thing is to be forthcoming and sensible about it ~ and definitely not make the bollocks of this kind of thing that happened with the climate researchers at UEA!

PMR: Yes. We must always be prepared to be accountable and sensible.

  • Chris Rusbridge 19 July, 2010

    It’s important to note in this case that the ruling was based on Environmental Information Regulations (EIRs) rather than Freedom of Information. Exceptions in EIR are stricter. Also, the details of the Information Commissioner’s ruling at http://www.ico.gov.uk/uploadre/documents/decisionnotices/2010/fs_50163282.pdf show that procedural issues were part of the problem. This is not all as bleak as it’s painted!

PMR: Exactly so. (Chris has recently retired from running the UK’s Digital Curation Centre)

  • Rodney Breen 20 July, 2010

    In answer to Mary and Bill, the Freedom of Information Act has exemptions to protect data which is collected with a reasonable expectation of confidentiality, and data which is commercially sensitive. Under the Scottish Act, there is specific protection for research data. There is no reason why material for which researchers have a legitimate need for protection should need to be disclosed.

PMR: Exactly. The clearer these issues are made, the less likelihood of problems.

  • Douglas J. Keenan 20 July, 2010

    I am the Douglas Keenan mentioned in the article. The article’s claim that the ICO decision “indicates that any interested party can use FoI laws to request any data belonging to a UK university, whether they form part of an academic’s published work or whether they are still raw” is false. Indeed, the FoI Act (Section 22) states that information is exempt from request if “the information is held by the public authority with a view to its publication, by the authority or any other person, at some future date”. The real situation is that Baillie has had almost all the data for over 30 years, has published many papers based on the data, is now retired, and yet claims the data as his private property – and the ICO rejected Baillie’s claim. The ICO decision is obviously reasonable, and the article is misleading. There is more about what happened, including detailed documentation, on my web site:
    http://www.informath.org/apprise/a3900.htm


  • Mike Baillie 23 July, 2010

    Just to provide a little closure on the issue of Irish tree-ring data and Freedom of Information I would like to point out a flaw in Mr Keenan’s logic (20 July, above).
    Mr Keenan tells active academics that their data is safe because the FoI specifically exempts information held “with a view to its publication, by the authority or any other person, at some future date”. This must imply (given that the tree-ring data had to be released) that there was no intention to publish the tree-ring data by anyone at Queen’s University Belfast; otherwise the data should have been exempted. How did Mr Keenan know that there was no intention to publish? The answer is simple, he stated on his web-site that QUB had closed the tree-ring laboratory. The same tone can be detected in his statement above.
    So to put the record straight, the tree-ring laboratory at QUB is not closed, it is staffed and remains active. Apart from undertaking tree-ring research and offering a commercial dating service for oak samples, it continues to publish. Two books and some 20 single and joint authored papers have been published between January 2005 and June 2010. Other publications are with editors and in the pipeline. The intention has always been for existing staff and emeritus professors to publish all the dated tree-ring data. Yet despite all that, the ICO saw fit to find against QUB and force release of the raw tree-ring research data.
    So the message that comes out of the Belfast experience is that academics do not have the data protection that Mr Keenan says they have. He has personally demonstrated how it is possible for an interested party to abuse FoI laws to extract current research data that was manifestly exempt.

PMR: Again the issues are – data supporting publication – and timescale.

So to sum up, this would not have happened if the principles of data had been clear – whether it can be owned, when it should be released, and how much. Many domains have solved this. Many have not.

But remember that it is not trivial. Each domain is likely to have different views of data and different constraints.



Open Data: why I need the Open Knowledge Foundation

Typed into Arcturus

After a period of silence on this blog (but not on the Open Knowledge Foundation lists) I hope to publish a flurry of ideas on Open Data. There is no doubt that “Open Data” has arrived and there is enormous interest. (By contrast, when I started to investigate it 5 years ago there was nothing). It’s desperately important, more complex than I ever imagined, and it’s critical to address it immediately, responsibly, dispassionately and inclusively. If we manage to set out the concerns now, we may manage to avoid the worst problems that were encountered by the Open Source and later Open Access movements. [They have made enormous progress, and without their footsteps to follow Open Data would fall into many of the same pitfalls. But Open Data is Difficult – a phrase I shall repeat frequently.]

I am putting my faith and energy into the Open Knowledge Foundation (http://www.okfn.org ) – its people and its infrastructure. This is because it’s an organisation which is wide-ranging (it deals with open content of all sorts, open metadata, services, etc.). It has great expertise in legal problems and solutions (where these are necessary) and also in how to find alternative approaches. It’s neutral (apart from urging Openness and developing the infrastructure). It’s very professional, and realises that ideas without implementation have less weight. So there is an impressive range of software and information skills. I am reminded of my favourite motto (from the IETF) – “rough consensus and running code” – one of the greatest productive mantras of our time.

The enthusiasm is palpable. [Today I had a breakfast Skype session with Jonathan Gray (coordinator of OKF) and it’s all about how we can make things happen fast and responsibly.] The OKF works through Working Groups and discussion lists, and so when I had a concern about Open Data I brought it to the OKF and – after a great deal of work – we emerged with the Panton Principles (http://pantonprinciples.org/ ) which have now been translated into several languages by OKF members.

Simply, OKF amplifies the visions of individuals from the almost-impossible to the attainable.

So I am putting some ideas into the OKF melting pot to see what emerges. They are not “my ideas” – ideas have an independent existence and they visit people – the more people the more likely they are to get implemented. The great thing is that the infrastructure connects me with others in the same area of thought and action. The software already exists – I do not have to create it.

So the next blog posts will outline some of the ideas which will help the discussion and implementation of Open Data.

The following post shows why we need rational discussion.


I am so Excited about the Open revolution

Written on a plane from Seattle to SF, typed into Arcturus

I haven’t blogged for a while but there is just so much going on that I haven’t had a moment. I’m now in a plane (Alaska Air) with free wifi. (Unfortunately it’s only a promo for 2 weeks)

But thank you, Seattle, for free wifi at the airport. It makes so much difference. It’s worth it: you get kudos. Whereas with all the money-graspers in most places I feel like a lower-class organism. It’s the sort of thing that makes me tell people to come to Seattle.

So here is a simple list of things that are exploding in my head, which I hope to expound later:

  • Open Knowledge Foundation (okfn.org). It’s unstoppable. I think Open Knowledge will become the “Wikipedia” of the 2010s.
  • Open Data. Panton Principles are being translated into Russian. Thank you everybody. When you come to Cambridge, look us up in Pantonia.
  • Open Bibliography and Open Citations. This has the chance to change the world of academic scholarship. It’s liberation informatics (I’ll talk more later). We can reclaim our scholarship.
  • Open Science (Summit). I’m so excited to be invited to the Open Science Summit in Berkeley at the end of the month. It’s the sixties reborn for Openness. I’ll be waving a flower.
  • Microsoft. A fantastic Research Summit. Microsoft Research are making the transition to an Open supporter of science. They’ve got the vision. Yes, you’ll still have to pay for Word, etc. But they are working towards interoperability.
  • Chem4Word. This is now robust. Expect a V1 release ASAP. Chem4Word is Open. Joe Townsend, and Alex Wade (and the rest of us) have done a fantastic job. It’s a platform for Open science.
  • Visualisation. I have had my head blown away by displays that can cover several orders of magnitude.
  • Interactivity, user interfaces. The world of Natural User Interfaces is coming.
  • Data everywhere. 5 years ago Open Data was never mentioned. Now it’s coming out of the walls. The closed empires will fall.
  • JISC. It’s great to work with them. They breed collaboration.
  • Scifoo. I’m honoured to be re-invited. I hope I can survive the excitement.
  • Citizen science, citizen innovation, citizen democracy. We have the tools and can create them. Let’s use them to liberate our world. The barriers between scientists and everyone are falling down. It’s arrogant to pretend your area is not accessible to others. Galaxy Zoo has shown that.
  • The timescale of innovation has collapsed. 5 years ago I wouldn’t have contemplated doing computer vision. I see OpenCV will be at Scifoo – can’t wait. NLP works. Speech recognition works. Mashups with all sorts of Linked Open Data. It’s starting to stream out of the woodwork.
  • Science Online (London) Sept. Come.

If you haven’t visited OKF (okfn.org) you should. There is so much going on. Rufus is a pan-dimensional hyperbeing keeping it all afloat. Seriously, there are lots of others, and the critical mass and organization is awesome. Join the open-science and open-bibliography lists and you’ll see more contributions from me (which is why the blog has been a bit quiet).

And special thanks to the publishers and libraries who have supported our JISCorama in various projects:

  • Int. Union of Crystallography
  • PLoS
  • BMC
  • RSC
  • BL
  • Cambridge UL


When scientific models fail

Typed and scraped into Arcturus

From what I see so far, Climate Change Research involves a lot of analysis of data sets. I don’t know what the ratio of actual measurement to analysis is. I don’t know how often models are tested against experiment or against observed values.

Here’s a scientist concerned about an area of data analysis where there is great flexibility in choosing models, choosing parameters, choosing methods, and with little check against reality. I’ll leave it hidden for a little while where this was published. It’s in a closed access publication which costs about 30 USD for 2 days’ access, so I’m going to copy the abstract (which I may) and some sentences from the body for which I will claim fair use. I’ll promise to give the reference later to be fair to the publisher (maybe their sales will increase as a result of my promotion). I’ll hide some key terms (XYZ is a common approach/philosophy) to add to the mystery.

A general feeling of disillusionment with XYZ has settled across the modeling community in recent years. Most practitioners seem to agree that XYZ has not fulfilled the expectations set for its ability to predict […]. Among the possible reasons that have been proposed recently for this disappointment are chance correlation, rough response surfaces, incorrect functional forms, and overtraining. Undoubtedly, each of these plays an important role in the lack of predictivity seen in most XYZ models. Likely to be just as important is the role of the fallacy cum hoc ergo propter hoc in the poor prediction seen with many XYZ models. By embracing fallacy along with an over reliance on statistical inference, it may well be that the manner in which XYZ is practiced is more responsible for its lack of success than any other innate cause.

Sound familiar? Here are some more sentences excerpted from the text…

However, not much has truly changed, and most in the field continue to be frustrated and disappointed: why do XYZ models continue to yield significant prediction errors?

How could it be that we consistently arrive at wrong models? With the near infinite number of [parameters] coupled with incredibly flexible machine learning algorithms, perhaps the question really should be why do we expect anything else. XYZ has devolved into a perfectly practiced art of logical fallacy. Cum hoc ergo propter hoc (with this, therefore because of this) is the logical fallacy in which we assign causality to correlated variables. …

Rarely, if ever, are any designed experiments presented to test or challenge the interpretation of the [parameters]. Occasionally, the model will be tested against a set of [data] unmeasured during the development of the model. …

In short, XYZ disappoints because we have largely exchanged the tools of the scientific method in favor of a statistical sledgehammer. Statistical methodologies should be a tool of XYZ but instead have often replaced the craftsman tools of our trade: rational thought, controlled experiments, and personal observation.

With such an infinite array of descriptions possible, each of which can be coupled with any of a myriad of statistical methods, the number of equivalent solutions is typically fairly substantial. Each of these equivalent solutions, however, represents a hypothesis regarding the underlying [scientific] phenomenon. It may be that each largely encodes the same basic hypothesis but only in subtly different ways. Alternatively, it may be that many of the hypotheses are distinctly different from one another in a meaningful, perhaps unclear, physical way. …

XYZ suffers from the number and complexity of hypotheses that modern computing can generate. The lack of interpretability of many [parameters] only further confounds XYZ. We can generate so many hypotheses, … that the process of careful hypothesis testing so critical to scientific understanding has been circumvented in favor of blind validation tests with low resulting information content. XYZ disappoints so often, not only because the response surface is not smooth but because we have embraced the fallacy that correlation begets causation. By not following through with careful, designed, hypothesis testing we have allowed scientific thinking to be co-opted by statistics and arbitrarily defined fitness functions. Statistics must serve science as a tool; statistics cannot replace scientific rationality, experimental design, and personal observation.
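The chance-correlation point in the excerpt is easy to demonstrate. Here is a minimal sketch (Python with NumPy; the sizes are arbitrary, and this is my generic illustration rather than anything from the paper): with more candidate descriptors than observations, a least-squares fit “explains” pure noise perfectly on the training data and fails completely on fresh data.

    import numpy as np

    rng = np.random.default_rng(42)
    n_train, n_test, n_descriptors = 20, 20, 50

    # Purely random "descriptors" and a purely random "property":
    # there is no relationship to find.
    X_train = rng.normal(size=(n_train, n_descriptors))
    y_train = rng.normal(size=n_train)

    # With 50 descriptors and only 20 observations the fit is exact.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    def r2(y, y_hat):
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

    print("training R^2:", r2(y_train, X_train @ coef))  # ~1.0: chance correlation

    X_test = rng.normal(size=(n_test, n_descriptors))
    y_test = rng.normal(size=n_test)
    print("test R^2:", r2(y_test, X_test @ coef))  # near or below 0: no predictivity

Any modelling pipeline flexible enough to do this will find a beautiful correlation in random numbers – which is exactly the fallacy the author describes.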


Open Data: Climate Change research and Chemoinformatics

Typed into Arcturus

The more that I read about Climate Change Research the more I find similarities with one of the domains in which I might be called an “expert” – Chem[o]info[r]matics (the o/r are often elided). I’m on the editorial board of J. Cheminformatics (http://www.jcheminf.com/ ), which is published by BioMed Central, almost all of whose journals are Open Access. (I would not sit on the board of a closed access journal and have publicly resigned from one). I’ll explain later what cheminformatics is, but I’ll start with a graph that appeared in a peer-reviewed journal in the subject. It’s a serious, respected journal and the editors seriously reviewed this submission, which makes a serious point. So as not to spoil your voyage of discovery I won’t tell you where it is, and I’d ask any cheminformatics readers of this blog not to blurt out the message. Trust me, it’s relevant to cheminformatics and relevant to Climate Change. No, I haven’t pasted the wrong graph by mistake.

Please comment on this graph as naturally and lucidly as you see fit. There is no “right answer”.


AMI: Can you help us build a virtual chemistry laboratory in Second Life?

Typed into Arcturus

We have a JISC-funded project, AMI, under the Virtual Research Environment Rapid Innovation scheme, and we are getting into full swing. The idea of the project is to create an “intelligent fume cupboard” [fume hood] – one that can record simple data (events, materials, etc.) and answer questions. We are thinking big and have already developed a speech interface (using Chem4Word). It’s an experiment in that we don’t know exactly what we are going to do, but we have a lab full of inexpensive sensors and transducers (IR, ultrasound, RFID, barcodes, thermo, video, etc.). Many of these will automatically capture events such as humans coming to AMI.

Last week we talked with Steve Ley, one of the world’s great synthetic organic chemists, and he suggested we should think of avatars – as in the recent movie. That’s a great idea, and after a little while I thought of Second Life, where there is a usable avatar technology. I’ve always been interested in this and in the early 90s helped to build a virtual environment at BioMOO, unfortunately deceased.

The Blue Obelisk (Open Data, Open Source, Open Standards) in chemistry has built an SL environment for chemistry, and Jean-Claude Bradley and Andrew Lang have made impressive strides and developed a community, built round teaching and citizen science research (measuring solubilities) and also malaria medicinal chemistry. So we have technology and community and energy.

This post is an appeal to anyone who is interested to join in. Anyone can do this, as there are all sorts of skills that are valuable in building virtual communities. High-school chemistry is useful but not required; scripting in SL is useful but not required. Constant energy, dedication and the ability to work in a largely unstructured community are valuable.

If you are interested, visit the Blue Obelisk (http://blueobelisk.sourceforge.net/wiki/Main_Page ) and join the list (http://www.mail-archive.com/blue-obelisk ). I am an optimist and see it as possible to create a growing community round this project.


Open Data: My apologia

Typed into Arcturus

My blog post on the RI meeting has been blogged at

http://bishophill.squarespace.com/blog/2010/6/15/murray-rust-on-pearce.html?currentPage=2#comments

which is factual and fair. I should make it clear that I am not putting anyone on a spectrum (“sceptics”, “nice guys”, “cheats”, etc.). I went to a meeting I knew almost nothing about and came back saddened and concerned about an apparent priesthood. This has been confirmed by various public emails and blogs which show that there is concern in the community about this issue.

A comment on the blog above reads:

“I had no idea that this “FOI battle” had been going on for several years and that nothing had been done to try to solve the problem”.

Wikipedia says “Peter Murray-Rust campaigns for open data, particularly in science”.

Now that would be nice if Peter was to make a start on an open data campaign in climate science as he seems to be several years behind.

June 15, 2010 | martyn

And this is a good reason for me to make my apologia for Open Data and why I am active in an area I know little about.

I have no problem about “being several years behind” – I would expect nothing less. Ignorance is not a crime – we are all ignorant of almost everything. [Arguing from known ignorance is less excusable.]

I have spent a lunchtime hour flicking through blogs I have been pointed to (e.g. http://climateaudit.org/2010/06/04/losing-glacier-data/#comment-231990 ). There are many issues but my only comment will be that there is a range of views on how easy it is to preserve data. Some posters express surprise that all data is not preserved for ever, others that historically it has been very difficult to preserve it. My own view is that it depends on the motivation, the tools and the funding. Any missing component leads to data being lost.

So what is Open Data and why am I talking about a discipline (Climate) I don’t know much about? I got involved in Open Data about 5 years ago when I was enraged by publishers who sprayed copyright notices over factual data and who were less than enthusiastic about addressing any problem to do with data. The term “Open Data” was almost unknown then and while I am not the first to put the two words together they were sufficiently rare that I started a Wikipedia page (http://en.wikipedia.org/wiki/Open_science_data ) – [BTW this needs updating].

Since then I have been invited to speak on Open Data at a number of meetings (often Open Access or library meetings), met with many editors and publishers, and most recently worked with the Open Knowledge Foundation and Science Commons, resulting in the Panton Principles. Most recently BioMed Central honoured us by presenting Open Data prizes and asking us to judge and award them.

I have also worked with the JISC in the general area of Open Data and most recently am the PI of a grant award (with OKF, International Union of Crystallography, British Library, Cambridge University Library and PLoS) on “Open Bibliography”. It hasn’t yet started but we’ve made good progress.

My claim to be involved is that there are universal aspects to Openness in science (and usually corresponding benefits). I’ll summarise what I (and, I believe, colleagues in the OKF) would feel able to do in an objective manner:

  • Inspect data resources and determine whether they were fully Open according to the Open Knowledge Definition (http://www.opendefinition.org/ ). It should be possible to do that in most cases without expert knowledge of the domain.
  • Help to provide a label (button) stating that the resource was Open Data.
  • Inspect a bibliography and determine which of the resources pointed to by the bibliography were Open and comment on appropriate aspects.
  • Work with bibliography creators to ensure that the bibliography itself was Open (even if some of the resources to which it pointed were not).

This list is a first pass – please comment. Note that I myself do not intend to create the bibliography of metadata – that would be inappropriate. A bibliography is an important resource which often represents a point of view, and hopefully people in the Climate area have bibliographies (these often emerge when writing theses and reviews). Note that the overall infrastructure of a bibliography and its Openness is independent of whether the science is good or flawed, whether the people quoted have a particular viewpoint, or whether they are nice or nasty.

If a resource can be identified as Open, then it can save a great deal of time (and sometimes money) when it is re-used. An Open diagram can be used in a review, book, teaching, etc. without further permission. Data can be mined from it. Text can be quoted from it. These things by themselves can add considerably to the speed and quality of a scientific field.

What if the Open resources are quoted in preference to the Closed ones? Might that give a false view of the field? In which case there is a good incentive for making more resources Open.

Here are examples of Openness for resources in climate:

  • The “Keeling Curve” (http://en.wikipedia.org/wiki/File:Mauna_Loa_Carbon_Dioxide-en.svg ). This carries the licence:
    Own work, from Image:Mauna Loa Carbon Dioxide.png, uploaded in Commons by Nils Simon under licence GFDL & CC-NC-SA; itself created by Robert A. Rohde from NOAA published data and is incorporated into the Global Warming Art project.
    However NC is NOT Open – you could not use this in a textbook, create a movie from it, etc.
  • The IPCC’s AR4 Synthesis report: “The right of publication in print, electronic and any other form and in any language is reserved by the IPCC. Short extracts from this publication may be reproduced without authorization provided that complete source is clearly indicated. Editorial correspondence and requests to publish, reproduce or translate articles in part or in whole should be addressed to: [IPCC]”. This is NOT Open.


  • Atmos. Chem. Phys., 10, 9-27, 2010
    www.atmos-chem-phys.net/10/9/2010/
    doi:10.5194/acp-10-9-2010
    © Author(s) 2010. This work is distributed
    under the Creative Commons Attribution 3.0 License.

    A comprehensive evaluation of seasonal simulations of ozone in the northeastern US during summers of 2001–2005

    H. Mao1, M. Chen2, J. D. Hegarty1, R. W. Talbot1, J. P. Koermer3, A. M. Thompson4, and M. A. Avery5


    This IS OPEN. The licence (CC-BY) is fully conformant with the OKD. As ACP is an Open Access journal I expect that all publications carry this rubric. (Apologies for the cut-n-paste into Word)


So it should be possible to annotate any bibliography as to whether the items are Open. I can’t give examples of datasets as I don’t know the field. Certain ones (e.g. works of the US government) may be clearly Open, but many others will be fuzzier.
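As a sketch of how such annotation might begin (Python; the marker lists are deliberately crude and my own – real OKD conformance needs the full definition at opendefinition.org, not string matching):

    # Crude first pass at annotating licence statements against the
    # Open Knowledge Definition (OKD): NC/ND clauses restrict reuse,
    # so they are not Open; CC-BY and public domain are.
    OPEN_MARKERS = ("cc-by", "creative commons attribution", "public domain", "cc0")
    CLOSED_MARKERS = ("-nc", "non-commercial", "-nd", "no derivatives", "reserved")

    def okd_open(licence: str):
        """Return True/False where the statement is clear, None otherwise."""
        text = licence.lower()
        if any(marker in text for marker in CLOSED_MARKERS):
            return False
        if any(marker in text for marker in OPEN_MARKERS):
            return True
        return None  # needs a human reader

    for example in (
        "Creative Commons Attribution 3.0 License",  # the ACP paper above
        "GFDL & CC-NC-SA",                           # the Keeling Curve image
        "The right of publication ... is reserved",  # the IPCC report
    ):
        print(example, "->", okd_open(example))

The point is the workflow, not the code: every item in a bibliography could carry such an annotation, flagged for human review wherever the licence statement is unclear.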


Open: Challenging Priesthoods

Dictated and Scraped into Arcturus

There have been a number of useful comments on my blog posts relating to open data in climate science. I’m conscious that I am walking into an area that I know little about and will defend why I think this is useful. I will also tell you what I am not going to do.

Martin Ackroyd says:

June 15, 2010 at 2:02 pm

I’d suggest that essential reading for anyone interested in these issues are:

Climategate: The Crutape Letters by Steven Mosher and Thomas W. Fuller, and

The Hockey Stick Illusion: Climategate and the Corruption of Science by A W Montford

It has often been said that the climategate emails were taken out of context. But with the full context, as revealed by Mosher and Fuller, they are utterly damning. And these were emails exchanged between leading authors of the IPCC reports.

Once you understand how the famous IPCC Hockey Stick Graph was based on erroneous statistics and dodgy manipulations of proxy data, as set out in verifiable detail by Montford, you wonder if anything at all from “climate scientists” can be trusted.

Richard J says:

June 15, 2010 at 3:34 pm

Nigel Lawson sat on the House of Lords Select Committee which reviewed Climate Change and its Economics in considerable detail in 2005, taking a wide range of technical submissions. If I recall correctly, they concluded that the scientific and economic aspects were potentially flawed by the overtly politicised nature of the IPCC process.

Lawson’s concerns should not be dismissed lightly. They are shared by many scientists in Earth Science disciplines closely related to climate science, but perhaps not directly reliant on research funding in this field.

David Holland says:

June 15, 2010 at 6:19 pm

Peter,

As the individual whose Freedom of Information Request resulted in the infamous email “Mike, can you delete any emails you may have had with Keith”, I have to tell you that Fred Pearce is well wide of the mark.

If anyone wants to know why someone would try to procure the deletion of AR4 emails just two days after I asked for them, just ask me for a confidential copy of my submission to the Russell Enquiry and confirm that you will not publish it. It is not on the Enquiry website because Sir Muir’s Enquiry does not have Parliamentary privilege and it is worried about being sued. I guess that also limits what the Enquiry will report.

I think these are useful encapsulations of some of the major issues that came out of the meeting. I shall confine myself to specific areas where I consider that my contribution may be useful. There is no point in my acting as an investigative journalist or as a politician, so I shall not concern myself with the past history of the emails and the practice of climate scientists. Nor shall I get into the details of how the hockey stick was produced and whether it is a valid scientific instrument.

I do, however, like many people, have expertise which may be valuable in this area. This relates to the general practice of science, whose principles are available to everybody; to the way that knowledge is communicated (again something that anybody has the right to be involved in); and slightly more specifically to some of the statistical processes which appear to have been required in some of the data analysis.

I do not actually intend to get involved in data analysis, but I will argue that I have every right to do so if I wish. In my day job – cheminformatics – I use a range of data analysis and statistical tools which are likely to be highly relevant to the processing and analysis of data in many fields, including climate. For example I have many years of experience in principal components, error analysis, data validation and the validity of statistical fitting (“overfitting”). I am on the editorial board of J. Cheminformatics and many of the issues we deal with appear to be similar to those in other disciplines. I would therefore feel it unreasonable to be told that I could not have access to data in climate research because I might misinterpret it.

Although I only have one evening’s evidence, it appears that Stephen McIntyre, a mining engineer from Canada, wished to analyse the hockey stick data (http://en.wikipedia.org/wiki/Hockey_stick_controversy ) and was unable to get it. I do not intend to debate the historical accuracy of this – the question is whether he has the right to do so. It is quite reasonable to assume that he had statistical and mathematical tools which were appropriate for this analysis. Put another way, if he were to submit a paper to J. Cheminformatics I would take the content of the paper, and not his background, as the material on which I would make a judgment.

It has been presented that McIntyre was challenging the priesthood of Climate Research and that he was excluded. Whether this is historically accurate is irrelevant to my argument and activity – I had a strong sense of closed ranks at the RI meeting. I sensed that if I asked for data I would not be welcomed and I suspect my current writings may not please everyone.

Science has always been multidisciplinary, but in the Internet age this has been accelerated. It’s possible for “lay people” (we need a better term) to take part in scientific activity. Galaxy Zoo (http://www.galaxyzoo.org/ ) has shown that “lay people” can author peer-reviewed papers in astronomy. There is absolutely no reason why anyone on the planet cannot, in principle, make contributions to science. Einstein worked in the patent office, Ramanujan in a government accounting office. (But before you all jump in remember that science is very hard, usually boring, work and has to be done carefully and with the right tools).

My concern here is with the cult of priesthood. I had the privilege of hearing Ilaria Capua speak of her campaign to get avian flu virus data published Openly. Until her work there was a culture of closed deposition, and the data were only available to those in the priesthood (and I believe gender may have been an implicit issue). I can’t find a Wikipedia entry (there needs to be one) so have to link to

http://www.wired.it/magazine/archivio/2009/02/password/ilaria-capua-la-scienza-open-source-.aspx and what I wrote (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=607 ). And http://www.comminit.com/en/node/221260/293.

So what can and should I do to address this? I believe Open Knowledge (Open Data, Open Source, Open Bibliography) is a key activity. It liberates and enables. It is only threatening to indefensible positions. Of course not all data can be made Open, and nor can all code, but there is no reason why bibliography cannot (for example OKF’s CKAN).

More later


For the record – the RI meeting on CRU emails. PM-R ranting [tweeted by Brunella Longo (http://www.pantarei.it )]
