Data repositories for long-tail science: setting the scene?

I’m assuming we all believe that we need data repositories for science; that there are about 10 different reasons why (not all consistent); that many of us (including communities I am in) are starting to build them. So what should they be like?

I’m talking here and in the future about long-tail science – where there are tens of thousands of separate laboratories with no infrastructural coordination and with a huge variety of interest, support, antagonism, suspicion, excitement, boredom about data. A 6-10 person lab which analyses living organisms, or dead organisms, or properties of materials, or makes chemicals, or photographs meteorites, or correlates economic indicators, or climate models, or nanotubes or… Where the individuals may work together on one or more projects, or where each worker has a separate project. Where there is a mix of postdocs, technical staff, central support, graduate students, interns, undergraduates, visiting staff, public projects, private projects, commercially exploitable material. With 3 month projects, 3 year projects, 3 decade projects. With ground-breaking new science, minor tweaks to existing science, application of science to make new things or develop new processes. Not usually all in the same lab. But often in the same department.

Indeed almost as diverse and heterogeneous as you can imagine. The common theme being that people create new raw stuff through lab experiments, field observations, computer simulations, analysing existing data, or text. We’ll call the non-material part (i.e. the bits and bytes, not the atoms or photons) of the raw stuff “data” (a singular noun). This data is very important to each researcher, but they have generally had no formal training in how to manage this data. They base their “data management policy” on software already on their machine, what their neighbours suggest to them, what they see on the public web, what the student last year submitted for their thesis.

And unfortunately data is often very complicated. So generic data management systems are likely to be either very abstract, or complicated, or both.

So here I’ll try to suggest some views as to how long-tail scientists regard data… I’ll also treat it as a 3-4 year problem – the length of a typical PhD thesis.

At the start data is not an issue. Undergraduate work has designed the environment of an experiment so that you only capture and record a very small amount of stuff. In many PhDs you aren’t expected to start collecting data at this stage. You are meant to be reading and thinking.
You read the literature. In the literature data is a second-class citizen. It’s hidden away, never published. Maybe you read some theses from last year. They have a bit more data in. Usually tables. But it’s still something that you read, rather than interact with. There are also graphs and photographs of stuff. They are self-consistent and make sense (they have to make sense to the examiners).
You learn how to use the equipment, or grow the bugs, or grow the crystals or collect fruit flies or photograph mating bats or whatever. Sometimes this is fun; sometimes it doesn’t work. You’ve been trained to have a lab book (a blue one with 128 pages with hard covers and “University of Hogwarts” on each numbered page.) You’ve been trained (perhaps) to write down your experiment plan. This is generally required if you work with stuff which has legal or mortal consequences if you do it wrong. Hydrogen peroxide can be used for homeland insecurity. In some cases someone has to sign off what you say you are going to do.
Now you do your experiment. You write down – in ballpoint – the date. Then what you are doing, and what happened. And now you have got some stuff. You record it, perhaps as a photograph, perhaps as a spectrum, perhaps in a spreadsheet if it changes with time. Your first data. By this time you are well into your PhD. You’re computer-literate so you have it as a file. But you also have to record it in your lab-book. So, easy – you print it out! Then get some glue and glue it into the book. Now it’s a permanent record of the experiment. [For additional fun, some glues degrade with time, so by the third year all your pasted spectra fall out. Naturally you haven’t labelled which page they were stuck to – why would you? So you have to make an educated guess as to where they used to be.]
Oh, and that file with the spectrum in? You have to give it a name – so “first spectrum” and we’ll put it on the Desktop because then we know where to find it. At least it’s safe.
6 months, and the experiments are doing well. Now your desktop is pretty full, so we’ll make the icons smaller. They are called “first spectrum after second purification” and so forth. You can always remember what page in the lab book each one relates to.
A year on, and you’ve gone to the “Unseen University” to use their new entranceometer. The data is all collected and stored on their machine. You get a paper copy of each experiment for your book. There is no point in taking a copy of the data as it’s in binary and the only processing software is at UU. And you are going back next year so any problems can be sorted then.
Two years on and you are proud of the new system you have devised. Each bit of stuff has a separate name. Something like “carbon-magnesium-high-temperature/1/aug/200atmosphere/version5”. You’ve finished this part of the study and you need a new machine. So you save your data as 2010.zip. Your shiny new machine has 5 times as much diskspace so you copy the file into “old machine”/2010.zip. It’s safe.
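“It’s safe” is an assumption that can be checked at the moment of copying. Here is a minimal sketch in Python, assuming the archive is an ordinary zip file; the function names `sha256_of` and `copy_is_intact` are made up for illustration. It compares checksums of the original and the copy, then asks the standard `zipfile` module to check every member’s CRC.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original: Path, copy: Path) -> bool:
    """True if the copy is byte-identical and every zip member passes its CRC check."""
    if sha256_of(original) != sha256_of(copy):
        return False
    with zipfile.ZipFile(copy) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None
```

This only tells you the zip itself is readable. It says nothing about whether the files inside can be opened without software that lives somewhere else.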
Three years on. Time to start thinking about writing up. The entranceometer stuff has been reported at a meeting and got quite a lot of interest. Your supervisor has started to write a paper (some supervisors write their students’ papers. Others don’t). This is good practice for the thesis. You give him the entranceometer diagrams. The paper is sent off.
And reviewer 2 doesn’t like the diagram. They’ve used a different design of entranceometer and it plots the data on logarithmic axes. They ask you to replot.
What to do? You have the data, so we’ll replot it. Where is it? “old machine”/something_or_other. Finally you find the zip file.
It doesn’t unzip – “auxiliary file missing”. You have no idea what that means. So let’s mail the UU quoting the reference number on the printed diagram. After a week or so no answer so try again. A mail with a binary file “these are the only files we could find”. You unzip it – “imaginary component of Furrier transform missing”. Basically you’re stuffed. You cannot recompute the diagram.
Then you have a bright idea. You could replot the spectra by hand. By measuring every point on the curve you get X and Y. And they want logX. Your mate writes a JavaScript tool that reads points off an image and records the X-Y values. So you can digitize the spectrum by clicking each point. It only takes 2 hours per spectrum. There are 30 altogether. So you can do it in under a week if you spend most of the day working on it…
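The arithmetic behind that kind of tool is small. What follows is a sketch in Python rather than your mate’s actual tool, assuming both printed axes are linear and that you can read off two labelled tick marks on each; the names `pixel_to_value` and `digitize` are hypothetical.

```python
import math


def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Linearly map a pixel coordinate to a data value using two known axis ticks."""
    fraction = (pixel - pixel_a) / (pixel_b - pixel_a)
    return value_a + fraction * (value_b - value_a)


def digitize(clicks, x_ticks, y_ticks):
    """Convert clicked (px, py) points to (log10(x), y) pairs.

    x_ticks and y_ticks are each ((pixel_a, value_a), (pixel_b, value_b)),
    read from two labelled tick marks on a linear axis.
    """
    (xa_px, xa), (xb_px, xb) = x_ticks
    (ya_px, ya), (yb_px, yb) = y_ticks
    points = []
    for px, py in clicks:
        x = pixel_to_value(px, xa_px, xa, xb_px, xb)
        y = pixel_to_value(py, ya_px, ya, yb_px, yb)
        if x <= 0:
            # The logarithm is undefined here, so fail loudly rather than guess.
            raise ValueError(f"cannot take log10 of non-positive x={x}")
        points.append((math.log10(x), y))
    return points
```

The precision of the result is limited by how accurately you can click, which is far coarser than the instrument’s own resolution.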
Now this is not fun, and it’s not science and it’s against health and safety. But it will get the data measured for the paper. And you are now knackered.
Wow – the paper got a “highly accessed”. (You can’t actually read it because you’re now visiting a lab which doesn’t subscribe to that journal. So it will have to wait till you can read your own paper.)
And now the thesis. It’s a bit of a rush because you had to present the results at a conference because the boss said so. But you got a job offer – assuming you finish your thesis.

…

Help, what does this file mean: “first compound second version really latest version 3.1”? What type of data is it (it doesn’t seem to have an extension)? And should you not use “first compound (version 2) first spectrum”? You can’t find the dates because when you copied the files they all got changed to the date of copying, so they all have the same date. So you talk to the second year PhD. “One of the files was run after the machine software changed; which is similar to yours?” “Ah, I have only seen that type.” “Thanks, this must be the later one, I’ll use it.”
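
The lost dates, at least, are avoidable. As a small sketch in Python: the standard library’s `shutil.copy2` attempts to carry a file’s metadata, including its modification time, over to the copy, whereas `shutil.copy` and `shutil.copyfile` leave the copy stamped with the time of copying. The helper names `copy_keeping_dates` and `modified_date` are made up for illustration, and what survives still depends on the operating system and filesystem.

```python
import shutil
from datetime import datetime
from pathlib import Path


def copy_keeping_dates(source: Path, destination: Path) -> Path:
    """Copy a file, asking shutil.copy2 to preserve its modification time."""
    shutil.copy2(source, destination)
    return destination


def modified_date(path: Path) -> datetime:
    """Return the file's last-modified time as a local datetime."""
    return datetime.fromtimestamp(path.stat().st_mtime)
```

Writing the date into the file name or a sidecar note as well would guard against copies made with tools that do not preserve metadata.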

So, as I continue to stress, the person who needs the data management plan is the researcher themselves. Forget preserving for posterity if you cannot preserve for the present. So let’s formulate principle 1:

“the people who need repositories are the people who will want to put data into them”.

The difficult word is “want”. If we can solve that we have a chance of moving forward. If we can’t we shall not succeed.
