Derivative use of Open Access works

A number of people have commented on my concern about the re-use of Open Data and suggested that I have put unreasonable restrictions on it. I show two comments and then refer to Klaus Graf who has, I think, put the position very clearly.
Two comments:

  1. ChemSpiderMan Says:
    November 2nd, 2007 at 2:31 pm […] In this case for CrystalEye you have people asking you for the data, they are OpenData but now your concern over forking appears to be the problem with sharing the data. I wish you luck resolving this so that we can access the data. Otherwise we will initiate our scraping as you suggested and it will fork anyway.
  2. Gary Martin Says:
    November 2nd, 2007 at 7:28 pm It boils down to the question of how truly “OPEN” are those open data, Peter, when you start expressing concerns about sharing those data, i.e. the discussion about forking.

PMR: CrystalEye is a highly complex system, not initially designed for re-distribution. It contains probably 3 million files and many hundreds of gigabytes. If each file is spidered courteously (i.e. pausing after each download so as to consume only a single thread) it could take 10 million seconds, which is between 3 and 4 months. During that time the database will have grown by 10-15%, so that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system. Conversely, if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection. So we have to create a sensitive and manageable process.
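For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The file count, the total crawl time and the 10-15% growth are the estimates given above; the per-file delay is simply derived from them and is an assumption, not a measured figure, and the function name is only illustrative.

```python
# Back-of-envelope estimate for a courteous, single-threaded crawl.
# FILES, TOTAL_SECONDS and the growth range are the rough figures quoted
# in the post; nothing here is measured from the real server.
FILES = 3_000_000
TOTAL_SECONDS = 10_000_000
SECONDS_PER_DAY = 86_400


def crawl_days(n_files: int, seconds_per_file: float) -> float:
    """Days needed to fetch n_files one at a time at a fixed cost per file."""
    return n_files * seconds_per_file / SECONDS_PER_DAY


# Implied average cost per file (download plus polite pause).
seconds_per_file = TOTAL_SECONDS / FILES
days = crawl_days(FILES, seconds_per_file)

# Entries added while the crawl is running, at 10% and 15% growth.
new_low = int(FILES * 0.10)
new_high = int(FILES * 0.15)

print(f"about {seconds_per_file:.1f} s per file")
print(f"about {days:.0f} days for the whole crawl")
print(f"{new_low:,} to {new_high:,} entries added in the meantime")
```

On these assumptions the crawl works out at a little over three seconds per file, and the snapshot would already be several hundred thousand entries out of date by the time it finished.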
The data are Open and you can legally do almost anything other than claim you were the progenitor. That’s what Open means. But some of the things you can legally do are antisocial and we are requesting you don’t do them. Failing to respect the “integrity of the work” may not be illegal but it can be regarded as antisocial. The licences do not manage this.
Klaus Graf:

http://www.earlham.edu/~peters/fos/2007/11/whether-or-not-to-allow-derivative.html
I disagree with Peter Suber and agree with PLoS and its position:
The Creative Commons web site explains the meaning of “no derivative works” as follows: “You may not alter, transform, or build upon this work”. This is not open access.
It’s a clear misinterpretation of Budapest when Suber cites the definition as an argument that derivative use isn’t allowed:
The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
To control the integrity is a moral right and has nothing to do with a license formula. It’s the same as the “responsible use of the published work” in the Berlin declaration, which explicitly allows derivative works.
Harnad is denying the need for re-use. Suber has often argued for the reduction of PERMISSION BARRIERS, and his personal position of preferring CC-BY is honest, but his opinion that CC-ND is compatible with BBB and also OA is absolutely disappointing. And it’s false too.

PMR: I agree with Klaus. I believe that PERMISSION BARRIERS must be removed. Whatever the moral arguments about PB, I think there are also utilitarian ones. Open Access and Open Data are sufficiently complex already that differential barriers are counterproductive: they confuse people. There is also enough evidence that many publishers pay lip service to OA by producing overpriced substandard hybrid products. If CC-ND is seen as OA then it is easy for the publishers to claim that any visible document is OA. There must be clear lines and I think CC-BY is where they are.
(And yes, I have asked that my licence on this blog be changed to CC-BY.)


6 Responses to Derivative use of Open Access works

  1. And, as I have said many times before, and I don’t know how to make it any clearer, I don’t want 100s of gigabytes of data. All I want is the file of structures and structure IDs and the construction of the URL to allow us to connect ChemSpider so that when people do a search on ChemSpider they find the structure and a link out to the structure on CrystalEye. I stated this clearly here: http://www.chemspider.com/blog/?p=191 and again on the blog. So, there will be no impact on your server, and a file of 100,000 structures in a zipped SDF file can be sent as an email attachment or made available as an FTP download. It cannot be difficult with the team you have to create an SDF file of structures. It will be a point in time, I understand, but that’s true about PubChem too: they have added more data but we have not downloaded it yet.
    You’ve suggested we scrape the pages… why not participate in the community building I am attempting at ChemSpider and simply generate an SDF file. Please?

  2. pm286 says:

    (1) The data are present as CML and CIF. We do not hold SDFiles and have no automatic means of creating them. Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data. You would also need the cell dimensions and the symmetry operations and SDFiles do not normally hold these. There are many disordered structures where, say, a perchlorate contains 1 Cl and 8 O atoms. Do you really want that? Or it might be 2 Cl and 6 O. We have to do a lot of work before creating an SDFile and it loses information. That is why we use CML which does not lose info.

  3. I thought OpenBabel would convert CML to SDF. Maybe I’m wrong. I haven’t worked with CML since the original implementation into ChemSketch many years ago (and that is way out of date as you know).
    So, we’ll scrape the InChIs directly and ONLY link structures with InChIs from ChemSpider to CrystalEye.

  4. pm286 says:

    (3) CML is Chemical Markup Language and, besides supporting molecules, can hold a wide range of chemical concepts including crystallography, spectra, properties, actions, operators, and so on. OpenBabel is effectively a molecule-based format and I don’t think it has deep support for crystallography. SDFiles do not, as far as I know, support crystallographic concepts. Therefore OpenBabel will only recognise the molecular concepts in a CML file and output these to SDFiles. Since CIF files do not currently hold molecular information the SDFile will probably be empty or contain a count of the atomic species. For example I would expect CaCO3 (calcium carbonate) to be corrupted to CaCO.
    As there is clearly some confusion I will try to explain more expansively on the blog.

  5. Andrew Dalke says:

    “During that time the database will have grown by 10-15% so that that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system.”
    What? Are you saying you don’t have backups for your system? If it goes down and you recover from backups, will pages be broken? I hope not! And if not, then use the backups to generate the distribution. That can’t break the server.
    “if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection”
    FTP and BitTorrent do very good jobs of transferring 100 GB files. I mentioned in another comment that using a system like Amazon’s S3 makes it easy to distribute the data, and costs about US $20 for the bandwidth of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.
    The data distribution site does not need to be on the same machine as your service. That’s a key part of a REST architecture.

  6. Andrew Dalke says:

    “Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data.”
    “Corruption” is a strong word. Why not think of it as the way you wrote in your “Round-trip format conversion” Wikipedia article?
    “When a document in one format is converted to another there is likely to be information loss.”
    Is “information loss” necessarily “corruption”? From my experience in dealing with PDB files, which have some of these crystallographic properties, I think there can be meaningful information despite the information loss, so long as the tools and the users understand that there are limitations in the conversion.
