cyberscience: Where does the data come from?
[In several previous and future posts I use the tag "cyberscience" - a portmanteau of E-Science
(UK, Europe) and Cyberinfrastructure (US) which emphasizes the international aspect and importance of the discipline.]
Cyberscience is a vision for this century:
The term describes the new research environments that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet. In scientific usage, cyberinfrastructure is a technological solution to the problem of efficiently connecting data, computers, and people with the goal of enabling derivation of novel scientific theories and knowledge.
So we in Cambridge are attempting to derive novel scientific theories and knowledge. We may do much of this in public – on this blog and other media – and we welcome any help from readers.
Many cyberscientists will bemoan the lack of data. Often this is because science is difficult and expensive – if I want a crystal structure at 4K determined by neutron diffraction I have to send the sample to a nuclear reactor. If I want infrared spectral images of parts of the environment I need a satellite with modern detectors. If I want to find a large hadron I need a large hadron collider.
But we are getting a lot better at automating data collection. Almost all instruments are now digital and many disciplines have now put effort into devising metadata and markup procedures. The dramatic, but largely unsung, developments in detectors – sensitivity, size, response, rate, wavelength, cost – make many data trivially cheap. And there is a corresponding increase in quality – if all data are determined from a given detector then it is much easier to judge quality automatically. We’ll explore this in later posts.
So where does data come from? I suggest the following, but would be grateful for more ideas:
- directly from a survey to gather data. This is common in astronomy, environment, particle physics, genomes, social science, economics, etc. Frequently the data are deposited in the hope that someone else will use them. It’s unknown in chemistry (and I would not be optimistic about getting funding). It sometimes happens through government agencies (e.g. NIST) but the results are often not open.
- directly from the published literature. This is uncommon, for several reasons: (a) we haven’t agreed the metadata; (b) scientists don’t see the point; (c) the journals don’t encourage, or even permit, it. However it has been possible in our CrystalEye project, which the next post will highlight. Note, of course, that this is an implicit and valuable way of validating published information.
- retyped from the current literature. Unfortunately this is all too common. It has no advantages, other than that it is often legal to retype facts but not for robots to ingest them as above. It is slow, expensive, error-prone and almost always leads to closed data. It may be argued that the process is critical and thus adds value – in some cases this is true – but most of it is unnecessary: robots can often do a good job of critiquing data.
- output of simulations and other calculations. We do a lot of this – over 1 million calculations on molecules. If you believe the results it’s a wonderful source of data. We’ve been trying to validate the calculations in CrystalEye.
- mashups and meshups. The aggregation of data from any of the 4 sources above into a new work. The aggregation can bridge disciplines, help validation, etc. We are validating computation against crystallography and crystallography against computation. Both win.
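As a toy sketch of that cross-validation idea – this is not the actual CrystalEye code, and the bond lengths, threshold, and quantity are all invented for illustration – one might compare a computed value against the spread of experimental crystallographic values like this:

```python
# Hypothetical sketch: does a computed bond length agree with the
# experimental spread from several (imaginary) crystal structures?
from statistics import mean, stdev

# Illustrative experimental C=O bond lengths in Angstroms (made up).
experimental = [1.221, 1.229, 1.225, 1.218, 1.231]

# An illustrative value from a quantum-chemical calculation (made up).
computed = 1.226

mu, sigma = mean(experimental), stdev(experimental)
z = (computed - mu) / sigma

# Flag the computed value only if it lies far outside the
# experimental distribution (3-sigma is an arbitrary choice here).
consistent = abs(z) < 3.0
print(f"mean={mu:.3f} sd={sigma:.3f} z={z:+.2f} consistent={consistent}")
```

Run over a million calculations and a few hundred thousand structures, the same comparison works in both directions: outliers may indicate a failed calculation or a doubtful experimental determination.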
So given the power of cyberscience why is it still so hard to find data even when they exist? (I except the disciplines which are explicitly data-driven). Here are some ideas:
- The scientists don’t realise how useful their data are. Hopefully public demonstration will overcome this.
- The scientists do realise how useful their data are (and want to protect them). A natural emotion and one that repositories everywhere have to deal with. It’s basically the Prisoners’ Dilemma, where individual benefit competes against communal good. In many disciplines the societies and the publishers have collaborated to require deposition of data regardless of screams from authors, and the community can see the benefit.
- The data are published, but only in human-readable form (“hamburgers”). This is all too common, and we make gentle progress with publishers to try to persuade authors and readers of the value of semantic data.
- The publishers actively resist the re-use of data because they have a knee-jerk reaction against making any content free. They don’t realise data are different (I am not, of course, asserting that I personally agree with copyrighting the non-data, but the argument is accepted by some others). They fear that unless they avidly protect everything they are going to lose out. We need to work ceaselessly to show them that this is misguided and that this century is about sharing, not possessing, scientific content.
- The publishers actively resist the re-use of data because they understand its value and wish to sell us our own data back. This blog will try to clarify the cases where this happens and try to give the publishers a chance to argue their case.
The next post shows how our CrystalEye project is helping to make data available to cyberscience.