Typed into Arcturus
Apologies for the formatting: Word => WordPress has somehow trashed the paragraphs.
This post is a first outline (not even a draft) of a proposed Panton Paper on "What is Scientific Data?"
"publicly funded science data should be in the public domain, full stop." Before you start saying "well what about human data, endangered species, etc.", I know there are many tricky cases. We cannot escape the fact that "Data is Difficult", and subsequent Panton Papers will address these cases. But the present post explores what Scientific Data is.
Warning: You will disagree with some of what I say. That's because the area is complex, because I haven't thought everything out, and because it differs from field to field. But I hope we can agree on some generalities. Please help to refine this, either through comments on this blog or on the OKF open-science list. And if you know of a similar or better analysis, let us know.
I'm going to use "Science" to embrace STM (Scientific, Technical, Medical). Some Arts and Humanities (A&H) work may also be covered by this: archaeology, for example, has many STM features, but literary criticism does not. Accept that I shall leave fuzzy borderlines.
I'll start with some rough generalities. Scientific data is usually created by a conscious act. It may be possible to extract new scientific results from reading newspapers, but normally the scientist consciously measures, observes or computes scientific data, or obtains data from those who have done this. I suggest that scientific data is created in two main regimes:
- Hypothesis-driven science, where a hypothesis is proposed and data is collected that can falsify it. Frequently this process is reported as one or more experiments.
- Data-driven science (also Discovery Science), where data is collected and then analysed to show patterns, either within the dataset or when combined with other data. Data gathering has an honoured history but is usually done with a purpose: random fact collection is rarely valuable. This motivation affects the choice of study and the methodology and should be made public, as part of the Open information available to the world.
In either regime, the data is obtained in two main ways:
- Observation and measurement. In some domains observation (e.g. field studies) is still the only method, and in others measurements are carried out by scientists and recorded in notebooks, but increasingly the measurement of data ("raw data") is through instruments and sensors.
- Calculation. In many cases physical laws allow direct calculation of observable quantities, and computers have sufficient power to do it. Computer programs in quantum mechanics, thermodynamics, classical mechanics and many other fields are often capable of showing excellent agreement with experiment, and are much cheaper or can simulate unobservable situations (e.g. inside planets or stars).
The basis of reporting an experiment is in part to allow other scientists to falsify the experiment. A scientist should expect others to try to disprove their work – they may not like it when it happens, but it's a fundamental rule. Therefore a scientist should agree that when reporting an experiment they should make available all data necessary to repeat the experiment.
When the results of an experiment are published it is usually in a self-contained "journal article". In principle this article should contain all the data necessary to repeat the experiment. In practice this is very rare. Many domains are trying to achieve it; many others are not.
This has all been preamble. Now to the question of what data is. [Essentially the Panton Paper starts here, with comments to be interspersed between the separators]
- "Data" implies accompanying metadata (e.g. precise definitions of quantities, equations of interrelationships, scientific units of measurement, error analysis, etc.)
- In experimental sciences the data is all the information required to repeat the experiment and the resulting data reported from that experiment.
- In data-driven sciences the data is the methodology of data collection and the contents of the database at a given time.
- In computational science the data is the program used to compute the results, the parameterisation of the program and the results of the calculation.
In a published article, this data is typically found in:
- Materials and Methods
- Supplemental data (or supporting information)
These should all be regarded as complete units of data. There is no scope for any "creative works" – they are all factual reporting of the design, the experiment, the observations and the measurements. (Where data are processed – and this is covered in later papers – this should all be in "Experimental" or "Conclusions").
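To make the first point in the list above concrete (that "data" implies accompanying metadata), here is a minimal sketch in Python of what a single reported value might carry with it. The class name, field names and the `is_reusable` check are all hypothetical, invented for illustration; they are not drawn from any existing standard or library, and the values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One reported value plus the metadata needed to interpret it (hypothetical)."""
    quantity: str       # precise definition of what was measured or computed
    value: float
    unit: str           # scientific unit of measurement
    uncertainty: float  # error estimate, expressed in the same unit
    method: str         # how the value was obtained (instrument, program, ...)

    def is_reusable(self) -> bool:
        """A bare number is not data: require a definition, a unit and a method."""
        has_context = bool(self.quantity and self.unit and self.method)
        return has_context and self.uncertainty >= 0


# Illustrative values only, not taken from any real experiment.
melting_point = Measurement(
    quantity="melting point at 1 atm",
    value=353.4,
    unit="K",
    uncertainty=0.5,
    method="capillary tube, calibrated thermometer",
)
bare_number = Measurement(quantity="", value=353.4, unit="", uncertainty=0.0, method="")

print(melting_point.is_reusable())
print(bare_number.is_reusable())
```

The point of the sketch is only that the number 353.4 on its own cannot be repeated or falsified by anyone; the definition, unit, error estimate and method are part of the data, not decoration around it.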
Note that it is common for scientists to report data in many different forms and media. The following is a common subset of material that can be strictly factual:
- Text (including interspersed numeric values, names, organisms, chemical formulae, etc.)
- Mathematical equations
- Images (cells, stars, animals, etc.)
- Drawings of experimental procedures (equipment, workflows)
- Graphs (relating variables, e.g. X-Y plots, scatterplots, histograms, etc.)
- Audio recordings
- Video recordings
None of these should be copyrightable as "creative works" and all should be made Open.
I have also argued that all the bibliographic metadata (author, journal, addresses) and the citations should be regarded as Open, but this will be addressed elsewhere.