petermr's blog

A Scientist and the Web

 

PP1_0.1: What is Scientific Data?

Typed into Arcturus

apologies for formatting – Word=> WordPress has somehow trashed the paragraphs

This post is a first outline – not even a draft – of a proposed Panton Paper on “What is Scientific Data?”

The Panton Principles have declared that scientific data should be Open. As John Wilbanks put it:

“publicly funded science data should be in the public domain, full stop.”

Before you start saying “well what about human data, endangered species, etc.” I know there are many tricky cases. We cannot escape the fact that “Data is Difficult” and subsequent Panton Papers will address these. But the present post explores what Scientific Data is.

Warning: You will disagree with some of what I say. That’s because the area is complex, because I haven’t thought everything out, and because it differs from field to field. But I hope we can agree on some generalities. Please help to refine this, either through comments on this blog or on the OKF open-science list. And if you know of a similar or better analysis, let us know.

I’m going to use “Science” to embrace STM (Scientific, Technical, Medical). Some Arts and Humanities (A&H) disciplines may also be covered by this: archaeology, for example, has many STM features, but literary criticism does not. Accept that I shall leave fuzzy borderlines.

I’ll start with some rough generalities. Scientific data is usually created by a conscious act. It may be possible to extract new scientific results from reading newspapers, but normally the scientist consciously measures, observes or computes scientific data, or obtains data from those who have done this.

I suggest that scientific data is created in two main regimes:

  • Hypothesis-driven science, where a hypothesis is proposed and data is collected that can falsify it. Frequently this process is reported as one or more experiments.
  • Data-driven science (also Discovery Science), where data is collected and it is then analysed to show patterns either within the dataset or when combined with other data. Data gathering has an honoured history but is usually done with a purpose – random fact collection is rarely valuable. This motivation affects the choice of study and the methodology and should be made public – part of the Open information available to the world.

The data can come from two main sources:

  • Observation and measurement. In some domains observation (e.g. field studies) is still the only method, and in others measurements are carried out by scientists and recorded in notebooks, but increasingly the measurement of data (“raw data”) is through instruments and sensors.
  • Calculation. In many cases physical laws allow direct calculation of observable quantities and computers have sufficient power. Computer programs in quantum mechanics, thermodynamics, classical mechanics and many other fields are often capable of showing excellent agreement with experiment and are much cheaper or can simulate unobservable situations (e.g. inside planets or stars).

The basis of reporting an experiment is in part to allow other scientists to falsify the experiment. A scientist should expect others to try to disprove their work – they may not like it when it happens, but it’s a fundamental rule. Therefore a scientist should agree that when reporting an experiment they should make available all data necessary to repeat the experiment.

(Note that I am separating “data” from “materials”. In an ideal world – and some are trying to create this – the scientist should make available enough material for others to repeat the work. But here I am sticking to data, with the expectation that if there is sufficient data in sufficient detail about the materials used (chemicals, animals, telescopes, seismometers, etc.), a repeater could, in principle, verify that they had an essentially identical experimental setup.)

When the results of an experiment are published it is usually in a self-contained “journal article”. In principle this article should contain all the data necessary to repeat the experiment. This is very rare but many domains are trying to achieve this. Many others are not.

This has all been preamble – now the question of what is data. [Essentially the Panton Paper starts here with comments to be interspersed between the separators]


  1. “Data” implies accompanying metadata (e.g. precise definitions of quantities, equations of interrelationships, scientific units of measurement, error analysis, etc.)
  2. In experimental sciences the data is all the information required to repeat the experiment and the resulting data reported from that experiment.
  3. In data-driven sciences the data is the methodology of data collection and the contents of the database at a given time.
  4. In computational science the data is the program used to compute the results, the parameterisation of the program and the results of the calculation.
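To make point 1 concrete, here is a minimal Python sketch of a measurement that carries its metadata with it. The class name `Measurement`, its fields, and the check `is_reusable` are all hypothetical, invented for illustration, and the values are made up rather than real measurements. It is not a proposed standard, only one way of showing why a bare number is not yet “data”.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Measurement:
    """A value plus the metadata needed to interpret it (hypothetical schema)."""
    quantity: str       # precise definition of what was measured
    value: float        # the number itself
    unit: str           # scientific unit of measurement
    uncertainty: float  # estimated error, in the same unit as value
    method: str         # how the value was obtained

    def is_reusable(self) -> bool:
        # A record is only reusable if none of the descriptive metadata is empty
        # and the uncertainty is a sensible (non-negative) number.
        return bool(self.quantity and self.unit and self.method) and self.uncertainty >= 0

    def to_json(self) -> str:
        # Serialise to an open, text-based format that anyone can read.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only, not a real measurement.
record = Measurement(
    quantity="melting point of sample A",
    value=405.2,
    unit="K",
    uncertainty=0.5,
    method="capillary tube, heating rate 1 K/min",
)

# The same number with its metadata stripped away fails the check.
stripped = Measurement(quantity="", value=405.2, unit="", uncertainty=0.5, method="")

print(record.to_json())
print(record.is_reusable(), stripped.is_reusable())
```

The number 405.2 appears in both records, but only the first tells a reader what it means and how far to trust it.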

What does this mean in practice? The typical journal article falls far short of the ideal but the relevant areas are usually two or more of:

  1. Materials and Methods
  2. Experimental
  3. Results
  4. Supplemental data (or supporting information).

These should all be regarded as complete units of data. There is no scope for any “creative works” – they are all factual reporting of the design, the experiment, the observations and the measurements. (Where data are processed – and this is covered in later papers – this should all be in “Experimental” or “Conclusions”).

Very simply then, all these sections should be regarded as factual data. They should be available in all publications without restriction from subscription barriers or contractual agreements. They should be text- and data-mineable without restriction.
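As a small illustration of what “text-mineable” can mean in practice, the following Python sketch pulls numeric values and their units out of a sentence written in the style of an “Experimental” section. The sentence is invented for this example, and the tiny list of units in the pattern is an assumption; a real text-mining tool would need a far larger vocabulary and much more careful handling.

```python
import re

# Hypothetical sentence in the style of an "Experimental" section (invented for illustration).
text = "The mixture was heated to 78.5 °C for 30 min and yielded 1.2 g of product."

# A number followed by one of a few known units.
# The unit list here is deliberately tiny and is an assumption of this sketch.
pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|min|g)\b")

# Each match becomes a (value, unit) pair of factual data.
found = [(float(value), unit) for value, unit in pattern.findall(text)]

for value, unit in found:
    print(value, unit)
```

None of this is possible if the section sits behind a subscription barrier or a contract that forbids machine reading.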

Note that it is common for scientists to report data in many different forms and media. The following is a common subset of material that can be strictly factual:

  1. Text (including interspersed numeric values, names, organisms, chemical formulae, etc.)
  2. Mathematical equations
  3. Tables
  4. Images (cells, stars, animals, etc.)
  5. Drawings of experimental procedures (equipment, workflows)
  6. Graphs (relating variables, e.g. X-Y, scatterplot, histograms, etc.)
  7. Audio recordings
  8. Video recordings

None of these should be copyrightable as “creative works” and all should be made Open.

Debarring any section of the world community from Open availability of these sections is a direct detriment to science.


Note:

I have also argued that all the bibliographic metadata (author, journal, addresses) and the citations should be regarded as Open, but this will be addressed elsewhere.

3 Responses to “PP1_0.1: What is Scientific Data?”

  1. [...] Unilever Centre for Molecular Informatics, Cambridge – PP0.1: What is Scientific Data? « petermr’… wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=2475 – view page – cached This post is a first outline – not even a draft – of a proposed Panton Paper on “What is Scientific Data?” Tweets about this link [...]

  2. Claudia Koltzenburg says:

    http://flowingdata.com/2010/07/28/brief-history-of-data-visualization/?utm_source=twitterfeed&utm_medium=twitter “… description of what data is, from a practical point of view” thanks @Khader http://ff.im/ooHDK

  3. steggb says:

    This will include proposals for a limited private copying exception; to widen the exception for noncommercial research, which should also cover both text- and data-mining to the extent permissible.-Markus Lattner
