Can Open Data be manipulated?

Chris Rusbridge – who runs the Digital Curation Centre – has raised the question of whether making data Open increases the risk of fraudulent manipulation of content:

Open Data… Open Season?

Peter Murray Rust is an enthusiastic advocate of Open Data (the discussion runs right through his blog, this link is just to one of his articles that is close to the subject). I understand him to want to make science data openly accessible for scientific access and re-use.
PMR: Correct!
It sounds a pretty good thing! Are there significant downsides?
Mags McGinley recently posted in the DCC Blawg about the report “Building the Infrastructure for Data Access and Reuse in Collaborative Research” from the Australian OAK Law project. This report includes a substantial section (Chapter 4) on Current Practices and Attitudes to Data Sharing, which includes 31 examples, many from genomics and related areas. Peter MR wants a very strong definition of Open Access (defined by Peter Suber as BBB, for Budapest, Bethesda and Berlin, which effectively requires no restrictions on reuse, even commercially). Although licences were often not clear, what could be inferred in these 31 cases would generally not fit the BBB definition.
PMR: Although BBB is the most straightforward philosophy for data re-use it is not the only approach. I am promoting it at present because I feel that a large number of scientists create data which they would like to be made available under – say – a CC-BY licence. But I fully accept that in some disciplines re-use of data may have to be governed by additional principles, especially where it has to support regulatory processes or involves human data.
However, buried in the middle of the report is a cautionary tale. Towards the end of Chapter 4 there is a section on the risks of open data in relation to patents, following on from experiences in the Human Genome Project and related projects.

“Claire Driscoll of the NIH describes the dilemma as follows:
It would be theoretically possible for an unscrupulous company or entity to add on a trivial amount of information to the published…data and then attempt to secure ‘parasitic’ patent claims such that all others would be prohibited from using the original public data.”

(The reference given is Claire T Driscoll, ‘NIH data and resource sharing, data release and intellectual property policies for genomics community resource projects’, Expert Opin. Ther. Patents (2005) 15(1), 4.)
The report goes on:

“Consequently, subsequent research projects relied on licensing methods in an attempt to restrict the development of intellectual property in downstream discoveries based on the disclosed data, rather than simply releasing the data into the public domain.”

They then discuss the HapMap (International Haplotype Map) project, which attempted to make data available while restricting the possibilities for parasitic patenting.

“Individual genotypes were made available on the HapMap website, but anyone seeking to use the research data was first required to register via the website and enter into a click-wrap licence for the use of the data. The licence entered into, the International HapMap Project Public Access Licence, was explicitly modeled on the General Public Licence (GPL) used by open source software developers. A central term of the licence related to patents. It allowed users of the HapMap data to file patent applications on associations they uncovered between particular SNP data and disease or disease susceptibility, but the patent had to allow further use of the HapMap data. The licence specifically prohibited licensees from combining the HapMap data with their own in order to seek product patents…”

Checking HapMap, the Project’s Data Release Policy describes the process, but the link to the Click-Wrap agreement says that the data is now open (see also the NIH press release). There were obvious problems, in that the data could not be incorporated into more open databases. The turning point for them seems to have been:

“…advances led the consortium to conclude that the patterns of human genetic variation can readily be determined clearly enough from the primary genotype data to constitute prior art. Thus, in the view of the consortium, derivation of haplotypes and ‘haplotype tag SNPs’ from HapMap data should be considered obvious and thus not patentable. Therefore, the original reasons for imposing the licensing requirement no longer exist and the requirement can be dropped.”

So they do not say that the threat is absent from all such open data releases, only that it was mitigated in this case.
Are there other examples of these kinds of restrictions being imposed? Or of problems ensuing because they have not been imposed, and the data left open? (Note, I’m not at all advocating closed access!)

PMR: Digital curation is hard – it is one of the hard challenges of this century, and it is critical that organisations such as the DCC exist. I don’t have answers in all cases. The following may meet many requirements:

  • the author creates a definitive, signed version of the data and deposits it in an institutional or (possibly better) domain repository. XML has mechanisms for canonicalization and digital signatures (see the sketch after this list).
  • software is used which can compare any version of a document with the definitive version. XML makes this possible.
  • there are domain-specific repositories. This is the hard part – it costs money. However, it is well served in bioscience at present.
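
To make the first two points concrete, here is a minimal sketch in Python. It is not full W3C XML Signature (which needs a library such as xmlsec); it uses lxml’s C14N canonicalization and a detached Ed25519 signature from the cryptography package as a simplified stand-in, and the filenames are illustrative:

    # Sketch: canonicalize an XML data file, sign the canonical form, and
    # later check whether a copy is identical to the definitive version.
    # Simplified stand-in for XML Signature; filenames are illustrative.
    import hashlib
    from lxml import etree
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def canonical_bytes(path):
        # C14N removes insignificant differences (whitespace, attribute
        # order), so the same data always yields the same bytes.
        return etree.tostring(etree.parse(path), method="c14n")

    # The author signs the canonical form of the definitive version.
    author_key = Ed25519PrivateKey.generate()
    definitive = canonical_bytes("dataset.xml")
    signature = author_key.sign(definitive)

    # Anyone holding the author's public key can test a candidate copy:
    # any change to the data alters the canonical bytes and the check fails.
    candidate = canonical_bytes("downloaded_copy.xml")
    try:
        author_key.public_key().verify(signature, candidate)
        print("copy matches the signed definitive version")
    except InvalidSignature:
        print("copy differs; its digest is",
              hashlib.sha256(candidate).hexdigest())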

The question is similar to plagiarism. If the data are available it is easier to manipulate them. But it is also easier to detect any mistakes or fraud.
There is a challenge as to whether it is possible to create a licence which restricts the use of the data but not its dissemination. If, for example, I get the latest version of the fundamental constants (e.g. the speed of light) from NIST, it is not unreasonable that I cannot change these without permission – certainly not if I want to maintain that they came from NIST. So there is a role for certified immutable reference documents under a BBB philosophy, though I think it should be limited.
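As a sketch of how such a certified immutable reference might be consumed in practice: the issuing body publishes a digest alongside the document, and users refuse any copy that does not match. The filename and pinned digest below are hypothetical placeholders, not real NIST values:

    # Sketch: check a local copy of a reference file (e.g. a table of
    # fundamental constants) against a digest published by the issuer.
    # The filename and pinned digest are hypothetical placeholders.
    import hashlib

    PINNED_SHA256 = "0" * 64  # would be the digest published by the issuer

    def verify_reference(path, pinned=PINNED_SHA256):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != pinned:
            raise ValueError(f"{path} does not match the certified reference")
        return True

    # verify_reference("fundamental_constants.xml")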
I faced this problem when I released some of my code under the GPL. I normally use a non-viral licence, but this was a derivative work of a GPL program. I was worried that people might make derivative works of CMLSchema (which defines CML) and thereby corrupt the practice of CML. So I said that anyone can change the code – which they can – but required that they announce that the result could not be considered CML. The GNU software auditors approached me and said that I could not impose this restriction under the GNU licence, so I changed “require” to “request”.
So, in conclusion, Chris – I am not worried, beyond the fact that I think digital curation is extremely hard, that we must support Open Data as much as is feasible, and that we cannot look to the commercial (publishing) sector, which wishes to aggregate and possess. The inability to get closed data, however well “curated” (and I don’t believe it normally is), is far more damaging.


One Response to Can Open Data be manipulated?

  1. Bill says:

    The HapMap license was abandoned because increasing openness rendered it obsolete, even harmful. How is this a problem with Open Data?
    “It would be theoretically possible for an unscrupulous company or entity to add on a trivial amount of information to the published…data and then attempt to secure ‘parasitic’ patent claims such that all others would be prohibited from using the original public data.”
    This has nothing to do with Open Data per se, and everything to do with the ridiculous, moneygrubbing, lawyer-infested world of patents. If that system worked, patents which relied on trivial changes would not be granted in the first place.
