The C20th saw the rise and value of scientific data aggregators – organisations that extracted data from the literature, cleaned it, packaged it and offered it for re-use. In some cases they got grants to support this, but most moved to a commercial (or at least non-profit) model where costs were recovered (and profits could be made). In chemistry one of the best-known and most archetypal is “Beilstein”, created in 1881 by Friedrich Konrad Beilstein. Other well-known examples are Chemical Abstracts (run by the ACS) and the Cambridge Crystallographic Database (no direct link with ourselves).
These databases were created by necessary human labour (“the sweat of the brow” in US copyright phraseology). The fees or subscriptions were usually seen as a return for fair endeavour. However the problem of monopoly is always present and some have charged very high prices for their cash cows.
This model is now becoming increasingly untenable. The pressure comes from many sources, including:
- The gift economy (more below)
- Changes in social attitudes (especially among young people)
- The dramatically lower cost of creating data, often near zero
- Unacceptable monopolistic activities by aggregators trying to save their content, leading to public opposition
- The requirement by funders that the complete scientific record is made Openly available
Although I have a personal opinion that data should be free, this post is intended to be objective. I know people in data aggregation activities and I do not wish them ill, but Cassandra-like I have to predict that they must change or die. (Steve Heller is one of the most eloquent evangelists on this theme, showing the inexorable pressure for change).
Few data aggregators show any signs of recognising their impending problems – maybe this post will waken some.
The software community has already seen the rise of Open Source and some prophets suggest that ultimately all software will become free. Part of the motivation has been described as a “gift economy” (Eric Raymond in Homesteading the Noosphere). He argues that the Open Source movement values gifts rather than conventional material wealth. The costs of software creation are low enough and the rewards from the gift economy high enough that the equation balances. In chemistry the Blue Obelisk epitomises free donation.
I argue here that science will increasingly also create a gift economy and that individuals and organisations should be valued not simply (or even) for their integrated citation count and grant income but for what they have freely donated to science.
Data are even more important than software for future success in science. They may be expensive to gather but the costs of further dissemination are near-zero. We have shown (SPECTRa) that data can be published as part of the process of their generation; if funders mandate this, it costs only marginally more to integrate this operation into the normal scientific procedure of experiment, analysis and publication. Through SPECTRa, CrystalEye, WWMM etc. the data are published to the community, and current informatics protocols (metadata, harvesting, etc.) are sufficient to create a virtual database.
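To make that last claim concrete, here is a minimal sketch of the kind of harvesting that makes a virtual database possible. It uses the standard OAI-PMH protocol and Dublin Core metadata; the repository endpoint is hypothetical, and a production harvester would of course need error handling and politeness delays.

```python
# Minimal sketch: harvest metadata records from a repository via
# OAI-PMH to build a "virtual database". The endpoint is hypothetical;
# the OAI-PMH verbs and namespaces are standard.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://repository.example.org/oai"  # hypothetical endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url):
    """Yield (identifier, title) for every oai_dc record, following
    resumption tokens until the repository is exhausted."""
    token = None
    while True:
        if token:
            url = f"{base_url}?verb=ListRecords&resumptionToken={token}"
        else:
            url = f"{base_url}?verb=ListRecords&metadataPrefix=oai_dc"
        tree = ET.parse(urllib.request.urlopen(url))
        for record in tree.iter(OAI + "record"):
            identifier = record.findtext(".//" + OAI + "identifier")
            title = record.findtext(".//" + DC + "title")
            yield identifier, title
        tok = tree.find(".//" + OAI + "resumptionToken")
        if tok is None or not tok.text:
            break
        token = tok.text

for identifier, title in harvest(BASE):
    print(identifier, title)
```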
However few users of the current commercial databases in chemistry would willingly give them up. Why? Because (quite reasonably) chemistry values historical data. The chemists of the C19th were meticulous experimenters and their work is still valuable today. So both aggregators and most “users” see the historical content as critical. The monopoly continues.
But is this really relevant today? And even if it is, is the cost worth it? Why is historical data important?
- patents. I accept that historical data is critical for showing prior art, and we shall need to maintain historical data for patent purposes.
- safety. It may be necessary to search such data widely.
The rest of the arguments are less supportable:
- comprehensiveness. Many scientists have been trained – or become adapted – to the need for a comprehensive review of the literature. But this is illusory. A large amount of chemistry lies in paper theses or other hard-to-access sources and is rarely searched. 80% of crystal structures are never published, so the crystallographic data aggregators are only comprehensive in the narrow sense of formal publication.
- curation and data cleaning. How well are data actually cleaned? Our robots show that most organic chemistry papers contain at least one error. Robots can now add a new dimension to data quality (see the sketch after this list).
- commentary. Some of the aggregators will add synoptic evaluation and commentary. This is enormously expensive. As social computing and annotation develops it will become uneconomic, except for safety-critical and legal-critical processes.
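To make the robotic angle concrete: below is an illustrative sketch (not our actual code) of the kind of consistency check a robot can run over every paper it harvests – here, comparing the molecular mass computed from a reported formula against the value quoted in the text. The atomic masses are standard; the example data are invented.

```python
# Illustrative sketch of a robotic data-quality check: does the
# molecular mass computed from a reported formula agree with the
# value quoted in the paper? Example data are invented.
import re

ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

def formula_mass(formula):
    """Compute the molecular mass of a simple formula like 'C6H12O6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += ATOMIC_MASS[element] * int(count or 1)
    return mass

def check(formula, reported_mass, tolerance=0.01):
    """Return False for a record whose reported mass disagrees
    with its formula, flagging it for human review."""
    return abs(formula_mass(formula) - reported_mass) <= tolerance

# Invented example: glucose, then a deliberately wrong reported mass.
print(check("C6H12O6", 180.16))  # True  - consistent
print(check("C6H12O6", 182.17))  # False - flagged
```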
There will continue to be a need for high-quality data curation in these limited areas. But for most scientific research it is unnecessary.
Most young scientists do not read papers. Allen Renear (UIUC) gave a splendid talk at ACS where he argued that the point of browsing the literature was to avoid reading papers. Seriously. The exploratory phase of modern science values speed, multidisciplinarity, impressions as much as it values formality.
Any hindrance to access destroys the rhythm of exploration. I have seen graduate students give up (in seconds) on any paper to which their institution does not subscribe.
But still there is the historical content. The nagging feeling that the pearl of wisdom could be missed. And that the expensive aggregation is essential. I think this is a myth, and will be recognised as such in the next 5-10 years. I presented this at the ACS meeting (in the symposium honouring Gary Wiggins, of Indiana).
Historically the growth of most databases has been exponential, with a doubling period of – say – a decade. During that time labour costs increase, yet customers expect prices to remain constant in real terms. So already there is a mismatch – leading to compromises in comprehensiveness and up-to-dateness.
(Here I had a diagram but I can’t get it into the blog).
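In lieu of that diagram, here is a rough back-of-envelope sketch of the mismatch; all numbers are invented and purely illustrative.

```python
# Back-of-envelope sketch (numbers illustrative only): content doubles
# every decade while subscription income, constant in real terms, must
# cover curation labour that grows with the content.
records_0 = 1_000_000   # hypothetical database size today
income = 100.0          # index of real-terms subscription income

for decade in range(4):
    records = records_0 * 2 ** decade       # exponential growth
    income_per_record = income / records    # flat income spread thinner
    print(f"decade {decade}: {records:>9,} records, "
          f"income/record index = {income_per_record:.6f}")
```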
In the mid-1990s electronic publications became common and the technical cost of data dropped to near-zero. Many aggregators took advantage of this and moved to an author-submission model for data collection, with major reductions in costs. However, because of the historical material, prices remained constant or increased, to the benefit of the aggregators and the disadvantage of the community.
Initially the proportion of author-contributed data was small, but now it is significant and growing. Thus we have collected ca. 60,000 crystal structures from the literature. This is perhaps only 20-30% of the historically aggregated crystallography, and might be seen as of limited use. However:
- it has immediate currency – the robots aggregate it as soon as it is published.
- the data are robotically curated (see the sketch after this list)
- we can look to the gift economy to grow more rapidly than the historical model. If SPECTRa-T, COD and other gifts provide substantive new data then our Open data will start to outnumber those collected by the aggregators.
- The community can add innovative data management tools, whereas conventional aggregators traditionally are responsible for their own innovation and will be slower.
- The community will develop social models for data annotation
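As an illustration of what “robotically curated” can mean in practice, here is a minimal sketch (not CrystalEye itself) that checks a harvested CIF file for the standard unit-cell items and verifies that each parses as a number. The CIF tags are standard; the file name is hypothetical, and real CIF parsing is considerably more involved.

```python
# Minimal sketch of robotic curation of a harvested crystal structure:
# check that a (simple, single-block) CIF carries the standard
# unit-cell items and that each parses as a number.
REQUIRED = [
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
]

def check_cif(path):
    """Return a list of problems found; an empty list means it passed."""
    values = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[0] in REQUIRED:
                values[parts[0]] = parts[1]
    problems = [f"missing {tag}" for tag in REQUIRED if tag not in values]
    for tag, raw in values.items():
        try:
            # strip the standard uncertainty, e.g. "10.233(4)" -> 10.233
            float(raw.split("(")[0])
        except ValueError:
            problems.append(f"{tag}: unparseable value {raw!r}")
    return problems

print(check_cif("structure.cif"))  # hypothetical file name
```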
The prime barrier to this is restrictive practices and inertia. How long it takes to overcome these depends on the community. But I predict that within 5-10 years many of the current data aggregators in chemistry will have seen their “markets” seriously challenged by the gift economy. It will give better data, more data, better metadata and will be zero-cost. So why not start to embrace it now?