open data: concepts from David Wiley

David Wiley has commented very clearly on the issues involved in licensing content (or putting it in the public domain). This is the first of two posts, with my comments interjected.
By way of background, David seems to be writing in an educational context (i.e. material created for or by instructors and students for the teaching and learning process). There are concepts in his “Four Rs” which probably don’t apply to data, but they are still helpful. At one level one could argue that facts are not copyrightable and simply pursue that line, but “data” is possibly more complex. So while I don’t like the idea of putting licences on data, I think it’s worse to leave them off.
Bits are snipped, but I have included a lot:

Open Education License Draft

If you follow this blog with any regularity you’ll have seen this coming for several weeks now. When I began recommending that people quit using OpenContent licenses and begin using Creative Commons licenses, I said it was one of the hardest things I had ever done. And it was.
Today I take the lid off the next most difficult thing I’ve done. As I describe below, I hate the idea of license proliferation. However, I feel that there are several convincing arguments that we need a new license at this point in the history of open content, and specifically in the history of open education. After providing the arguments and my thoughts below, you’ll find a draft of the first license issued by OpenContent in eight years – the Open Education License.

The Four Rs of Open Content

When I began promoting the idea of open content almost 10 years ago, there were four main types of activity I was interested in promoting (although it took me some time to get to the point where I could articulate them clearly). The four main types of activity enabled by open content can be summarized as “the four Rs”:

  • Reuse – Use the work verbatim, just exactly as you found it
  • Rework – Alter or transform the work so that it better meets your needs
  • Remix – Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute – Share the verbatim work, the reworked work, or the remixed work with others

Notice how each of the first three Rs encompasses those that came before it. Reusing involves copying, displaying, performing, and making other uses of a work just as you found it. Reworking involves altering or transforming content, which one would only do if afterward they would be able to reuse the derivative work. Remixing involves creating a mashup of several works – some of which will be reworked as part of the remixing process – which one would only do if afterward they would be able to reuse the remix. (A “remix” in which no reworking is done is an anthology (a collection of simple reuses) and not particularly interesting for the purposes of this discussion.)

PMR: The word “work” emphasizes that much of what DavidW is talking about is creative. I don’t know whether everyone would apply “work” to data, but let’s try it. I have been using simply “re-use”. I want to use all of David’s 4 R’s but I think it may further confuse if I split my language into another 4 concepts. It may be that we need an overarching term (“reprocessing” is not ideal but it gives the idea).
In the learning objects literature and elsewhere, endless problems have been caused by the fact that people say “reuse” when they actually mean “rework” or “remix,” or some combination of the first three Rs. This is a classic problem of imprecision; of talking fast and loose. Add to this difficulty the fact that each of these three Rs thrives under different conditions, and you’ve got a recipe for general confusion.
For example, take “rework.” This R deals with creating a derivative by altering or adapting a work. Traditionally licenses have tried to strengthen the rework activity through the “copyleft” mechanism. Copyleft is an idea borrowed directly from the world of free or open source software, requiring that derivative works be licensed using the exact same license as the original. This ensures that when derivatives are created from a copylefted open content work, those children and grandchildren works remain open content, licensed using exactly the same license as the original.
[Figure: distribution of copyleft licenses]
However, while copyleft strictly requires that all future generations of derivative works be free and open, copyleft significantly hinders the remix activity. For example, conservative estimates say that there are approximately 40 million creative works that are currently licensed using a Creative Commons license. About half of these use the ShareAlike clause (Creative Commons’ copyleft clause). Of those creative works that use SA, about two thirds (~13 million) use By-NC-SA, while the other third (~7 million) uses By-SA. While statistics on GFDL adoption are harder to come by, because Wikipedia and the other Wikimedia projects use the GFDL we can safely estimate at least 7 million works are licensed using the GFDL (which contains its own copyleft clause). Since half of all CC licensed materials are licensed using a copyleft clause and all GFDL licensed materials are licensed using a copyleft clause, this means that over half of the world’s open content is copylefted. And while the CC and GFDL copyleft clauses guarantee that all derivative works will be “open,” they also guarantee that they can never be used in remixes with the majority of other copylefted works. You can’t remix a GFDL work with a By-NC-SA work when the licenses require that the child be licensed exactly as the parent. Each parent had one and only one license – which license would the derivative use? It’s just not possible to legally remix these materials; copyleft prevents this remixing.
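The reasoning in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a legal test: the `License` class and the `can_remix` and `copyleft_share` helpers are hypothetical names, each license is reduced to a single copyleft flag, and all other terms (NC, attribution, any compatibility provisions in particular license versions) are ignored. The share calculation uses the estimates quoted above and further assumes that the CC and GFDL collections do not overlap and together stand in for all open content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Toy model of a license: a name plus one copyleft flag (an assumption)."""
    name: str
    copyleft: bool  # True if derivatives must carry exactly this license


# Licenses named in the text; the non-copyleft CC By is added for contrast.
GFDL = License("GFDL", copyleft=True)
BY_SA = License("CC By-SA", copyleft=True)
BY_NC_SA = License("CC By-NC-SA", copyleft=True)
BY = License("CC By", copyleft=False)


def copyleft_demands(parents):
    """Distinct licenses that the copylefted parents each insist the child must carry."""
    return {lic.name for lic in parents if lic.copyleft}


def can_remix(parents):
    """In this simplified model a remix is possible only if the copylefted
    parents all demand the same license (or none of them is copylefted)."""
    return len(copyleft_demands(parents)) <= 1


def copyleft_share(cc_total, cc_sharealike, gfdl):
    """Fraction of works that are copylefted, assuming CC and GFDL works are disjoint."""
    return (cc_sharealike + gfdl) / (cc_total + gfdl)


if __name__ == "__main__":
    print(can_remix([GFDL, BY_NC_SA]))  # False: two different copyleft demands
    print(can_remix([BY_SA, BY]))       # True: only one copyleft demand
    # Figures in millions, as quoted: 40 CC works, half of them ShareAlike, 7 GFDL.
    print(f"{copyleft_share(40, 20, 7):.0%}")  # 57%
```

On the quoted figures, 27 of 47 million works are copylefted, which is where “over half” comes from. The GFDL plus By-NC-SA case fails in this model because each parent demands a different license for the child.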
PMR: I am not in favour of copyleft for data. I have no fundamental objection to creating a copyrighted work from data as long as there is significant added value. And copyleft is viral – deliberately. If any item in a system/collection/program etc. is copyleft, then the whole is (at least by the algorithm).
While promoting rework at the expense of remix – in other words, taking the copyleft approach – is fine for software, it is problematic for content and extremely problematic for education. As educators, we are always remixing materials for use in our classrooms both in the “real” world and online. Your mileage may vary, but over my last 15 years of teaching I would estimate that my remixing activities outnumber my reworking activities 10:1 or more. If other teachers are like me in this regard, then, copyleft is a huge problem for open education. Like the American football coach who tries to use his successful offensive and defensive strategies with a European football (or soccer) team, the open source advocate who brings the successful idea of copyleft into the world of open content will eventually be disappointed. The primary activity of the open source software developer is reworking; the primary activity of the open educator is remixing. Different activities require different supporting strategies to be successful.
If we are serious about wanting the freedom to legally and frictionlessly remix educational materials, we have one of two choices: either ignore the OpenCourseWares, Wikipedia, and other copylefted open content of the world (i.e., work only with open content that isn’t copylefted), or forcibly constrain ourselves to one subset of the “open” content universe. Do you see the irony?
PMR: Yes. I would argue that if I get factual information from WP then it cannot carry a copyleft. I need the fundamental physical constants and get them from WP. I don’t think that my data and programs are thereby copyleft. All algorithms are now slightly fuzzy.

About the Copyleft and Attribution Restrictions

Some supporters of copyleft licenses like CC By-SA and the GFDL claim that they give users the ability to use and reuse open content with “no restrictions.” Obviously, requirements for attribution and copylefting of derivatives are very real restrictions that should not be overlooked. While supporters claim that “some restrictions are necessary to protect freedom,” and that requirements for attribution and copylefting fall into this category, both these restrictions can be problematic both practically and philosophically. I’ve spent a significant amount of time above describing why this is the case for the copyleft restriction.
When you contemplate the different cultures and cultural values in the world, it isn’t hard to imagine scenarios in which the requirement for attribution would prevent appropriate uses of open content. One need only contemplate any of the areas of enduring unrest in the world to understand that the requirement to attribute a reuse or rework of content to a Sunni or Shia author, for example, will prevent members of the other group from using the content. Sadly, over a dozen other examples of this kind (Israeli / Palestinian, etc.) could be given. It quickly becomes clear that the requirement to attribute the original author can be a subtle but no less real way of discriminating against persons or groups. (If the accusation of being an instrument of discrimination is not convincing enough to some open source advocates, this situation also puts the seemingly innocuous requirement for attribution at odds with one of the basic premises of the open source definition.) I believe it is absolutely crucial that we do everything we can to live up to the ideals of nondiscrimination expressed in the definition, our institutions, and civilization generally.

PMR: I appreciate the logic, but do not feel that attribution for scientific data can cause problems.

Why Not a Public Domain Dedication?

If the appropriate goal for a license is, as it appears, to make open content available without any restrictions, why not simply dedicate the works in question to the public domain? There are a number of problems with a public domain dedication (like that offered by Creative Commons). First, dedicating a work to the public domain is a significantly more involved process than licensing a work. While Creative Commons is rightly famous for how easy their license selection technology and little green buttons make licensing your work with a CC license, the public domain dedication is much more complicated and includes a number of steps, including making a request for Creative Commons to send you an email regarding your intent to place a work in the public domain. This rigamarole is not the fault of Creative Commons; they have simplified as much as possible the process of putting a work in the public domain in the US.
But secondly, and more importantly, it may be impossible under the law in some jurisdictions to place a work in the public domain. For example, in the EU authors have certain rights that cannot be contracted or licensed away, making it impossible for an author to legally relinquish all rights to a work (or put it in the public domain). Creative Commons also recognizes this problem with the statement that their public domain dedication “may not be valid outside of the United States.” Hence, a public domain dedication is not an internationally viable mechanism for open content.

PMR: Certain data is already clearly in the public domain, such as works of the US government. That doesn’t apply to some collections, such as those made by the National Institute of Standards and Technology (NIST), because there is a special bill allowing them to recover costs. Public domain data should not cause problems, but if we ever get to the stage “the data are PD so we can put a commercial licence on them” then we have a problem that needs addressing quickly.

About the Four Rs and the Four Freedoms

I hate definitions and taxonomies outside the hard sciences. I hate them particularly because I have been involved in the political contests of creating and perpetuating them – specifically, definitions and taxonomies of “learning objects.” Whose definition of learning object is best? Whose taxonomy is best? These are largely meaningless political battles I left behind many years ago.
It therefore surprises no one more than it surprised me that I felt the need to list and explicate the Four Rs, especially in the context of the existing “Four Freedoms.” While the Four Freedoms have their roots in free or open source software, they have been discussed in the context of open content as well. Wikipedia’s Terry Foote summarized the freedoms at our 2005 Open Education Conference as:

  • Freedom to copy
  • Freedom to modify
  • Freedom to redistribute
  • Freedom to redistribute modified versions

Freedom 1 is analogous to the first R, reuse. Freedoms 3 and 4 are analogous to the final R, redistribute. Freedom 2 is either analogous to the second R, rework, or is an amalgamation of the second and third Rs, rework and remix. In either case, the Four Freedoms do not distinguish sufficiently between the rework and remix activities. This leads to the problems described above in which rework is considered and supported at the cost of remix. These are distinct activities that require different environmental conditions.

PMR: Since data do not (I hope) have problems of remixing these seem clear and simple. I would be happy to define “data re-use” as the 4 F’s above.
The Four Freedoms as listed by Freedom Defined also fail to make this distinction:
  • the freedom to use the work and enjoy the benefits of using it
  • the freedom to study the work and to apply knowledge acquired from it
  • the freedom to make and redistribute copies, in whole or in part, of the information or expression
  • the freedom to make changes and improvements, and to distribute derivative works

While the “father knows best” approach of copyleft places only incentive obstacles in the path of would-be creators of derivative works (by stripping them of the ability to choose how to license their derivative works), copyleft places legal obstacles in the path of would-be remixers. This problem is difficult to see through the imprecision of the way the Four Freedoms deal with “modify,” and this is one reason I felt justified in listing and explaining the Four Rs.

PMR: This seems less useful for data. David now offers an Open Education Licence draft. I think it’s not relevant to data, so it’s snipped.
PMR: So I’m going to continue to use the word “re-use”. It includes
  • Freedom to copy
  • Freedom to modify
  • Freedom to redistribute
  • Freedom to redistribute modified versions
but also includes the concepts of input into programs, creation of new data derived algorithmically or stochastically from the data, and aggregation with other data sources. We probably also need something on metadata.