There has been considerable interest in having access to the bulk knowledgebase of CrystalEye – WWMM which contains primary data for over 100,000 crystal structures and probably over 1 million copies of fragments derived from those. We are obviously excited to see the interest and will be talking this morning, and possibly later today at the SPECTRa-T meeting, about the problems of disseminating repositories and knowledgebases where we have some experts in the field.
Firstly I reiterate that CrystalEye is OpenData according to the Open Knowledge Foundation definition. It does not actually carry a licence but uses this as a meta-licence. So it is legally allowed for anyone to take copies of the contents and re-use them, including for commercial purposes. We shall not waver in that.
There have been recent suggestions that to save bandwidth people should make copies of the data and redistribute them on DVD. We would ask you to refrain from doing this for the immediate future for several reasons:
- The architecture of CrystalEye and its dissemination through AtomPP is new. Jim Downing hasn’t even had the chance to explain what his vision for the dissemination is. Please give Jim a chance to explain.
- It is not trivial to take a physical snapshot of dynamic hypermedia. CrystalEye is updated daily, has over 1 million hyperlinks, and contains several distinct meta-views of the knowledge. This cannot be captured in a single process. Therefore any physical copy will involve significant loss of metadata. This loss could be so significant that the copy was effectively broken.
- It seems clear from the last 2-3 days of discussion that different communities want different views of CrystalEye. Some want links to the entries as arranged by the literature, others want it organised by fragments. These are almost completely orthogonal.
- Copying the data and redisseminating it without reference to the originators is, in effect, forking (see below).
- We are critically concerned about versioning and annotations. CrystalEye has effectively nightly versions and it is important that when people use it for data-driven science it is clear that they are referring to PRECISELY the same collection.
- We have thought carefully about the sustainability of CrystalEye and have had discussions with appropriate bodies. These would maintain the Openness, but would look to sustainable processes. I cannot give more details in public.
Please note that many Open resources ask or require that their database not be distributed in toto without their involvement. I think this is true of PubChem – anyone can download individual entries and re-use them, but it is required (and common courtesy) to ask before downloading the whole lot. We have done this, for example, for the names in PubChem which are now part of OSCAR.
Then there are the more intangible aspects. It is appropriate that this is seen as a creation of the authors, their collaborators, the Unilever Centre for Molecular Science and Informatics, and the University of Cambridge. It would be appropriate that these are the first entities that the world should look to if there is to be a physical distribution of some of the resource. At present we see a physical resource as potentially creating as many problems as it solves – whether done by us or not. CrystalEye is much more than the data contained in it – a physical snapshot gives as much indication of this as a series of photographs does of television.
So before distributing the data without our involvement, please let’s discuss the aspects – and at present this blog is the appropriate place. I reiterate that no comments are moderated out.
========================
From WP: Fork: (this relates to software, rather than data but the principles overlap)
In free software, forks often result from a schism over different goals or personality clashes. In a fork, both parties assume nearly identical code bases but typically only the larger group, or that containing the original architect, will retain the full original name and its associated user community. Thus there is a reputation penalty associated with forking. The relationship between the different teams can be cordial (e.g., Ubuntu and Debian), very bitter (X.Org Server and XFree86, or cdrtools and cdrkit) or none to speak of (most branching Linux distributions).
Forks are considered an expression of the freedom made available by free software, but a weakness since they duplicate development efforts and can confuse users over which forked package to use. Developers have the option to collaborate and pool resources with free software, but it is not ensured by free software licenses, only by a commitment to cooperation. That said, many developers will make the effort to put changes into all relevant forks, e.g., amongst the BSDs.[citation needed]
The Cathedral and the Bazaar stated in 1997 [1] that “The most important characteristic of a fork is that it spawns competing projects that cannot later exchange code, splitting the potential developer community.” However, this is not common present usage.
In some cases, a fork can merge back into the original project or replace it. EGCS (the Experimental/Enhanced GNU Compiler System) was a fork from GCC which proved more vital than the original project and was eventually “blessed” as the official GCC project. Some have attempted to invoke this effect deliberately, e.g., Mozilla Firefox was an unofficial project within Mozilla that soon replaced the Mozilla Suite as the focus of development.
On the matter of forking, the Jargon File says:
- “Forking is considered a Bad Thing—not merely because it implies a lot of wasted effort in the future, but because forks tend to be accompanied by a great deal of strife and acrimony between the successor groups over issues of legitimacy, succession, and design direction. There is serious social pressure against forking. As a result, major forks (such as the Gnu-Emacs/XEmacs split, the fissioning of the 386BSD group into three daughter projects, and the short-lived GCC/EGCS split) are rare enough that they are remembered individually in hacker folklore.”
It is easy to declare a fork, but it can require considerable effort to continue independent development and support. As such, forks without adequate resources can soon become inactive, e.g., GoneME, a fork of GNOME by a former developer, which was soon discontinued despite attracting some publicity. Some well-known forks have enjoyed great success, however, such as the X.Org X11 server, a fork from XFree86 which gained widespread support from developers and users and notably sped up X development.
Peter, regarding “Please note that many Open resources ask or require that their database is not distributed in toto without their involvement. I think this is true of PubChem – anyone can download individual entries and re-use but it is required (and common courtesy) to ask before downloading the whole lot.”
Based on what I know, and on my experience of the number of groups who have downloaded PubChem, there is no expectation, formal or informal, to ask their permission to download the data. However, that’s my experience, not necessarily their expectation.
In the case of CrystalEye you have people asking you for the data. The data are OpenData, but your concern over forking now appears to be the problem with sharing them. I wish you luck resolving this so that we can access the data. Otherwise we will initiate our scraping as you suggested and it will fork anyway.
It boils down to the question of how truly “OPEN” are those open data, Peter, when you start expressing concerns about sharing those data, i.e. the discussion about forking.