Theses: why do we force graduate students to corrupt crystallographic information?

In our SPECTRa-T project we are exploring how we can extract data and metadata from chemistry theses. Almost all these documents are now born-digital, i.e. written in a wordprocessor such as Word or TeX rather than typed on carbon paper. So in principle we should be able to include the actual data in the thesis. And occasionally this happens – I’ll give an example later. But all too often the absurd ritual requires the author to retranscribe experimental data into pretty “readable form”. This is a lot of work and often requires special programs to generate the prettiness. Here I show the wasted labour and data corruption required when reporting crystallography.

Posted in data, theses | Leave a comment

CrystalEye repository: technical aspects

There has been some confusion recently (post+comments, post+comments) about copying and redistributing CrystalEye. While some of this relates to the legal, moral and ethical issues, there are major technical aspects that need to be understood. Here are some, without any “Open” issues.

  • CrystalEye is not a database, it is effectively a repository. We are in the process of developing this as a technology to support a number of projects in-house and it is at an early stage. Jim is working actively on how to make the contents of a repository available and at present he believes Atom is the best approach. We are therefore starting to mount Atom feeds for this purpose. Note that even repositories such as DSpace are not good at managing scientific data content – we put 150,000 calculations into the Cambridge DSpace and cannot extract them without writing code to access them. So this is a general aspect of repositories and we are in contact with other repository researchers.
  • CrystalEye is dynamic. It is updated every day. If someone starts spidering the complete content (probably > 1 million “files” – “resources” is a better term) and is courteous (i.e. only uses a single thread with a delay), it probably takes several weeks to spider. By that time the data will be out of sync.
  • CrystalEye is complex, because combining chemistry with crystallography is complex. The raw material of CrystalEye is crystallography, not chemistry. Crystallography detects the position and nature of atoms (actually electrons, but accept that atoms is a better concept here). It does not detect bonds (which are human concepts). It does not, except in the very best experiments, detect charges. So the primary data in CrystalEye is a list of atoms with fractional coordinates (not Cartesians) – a sketch of the conversion to Cartesians, and of crude bond-guessing, follows this list.
  • There are several reasons why a simple list of atoms is not a good representation of chemistry. These include space-group translations, space group symmetry, disorder, special positions, partial occupancy, etc. It is a matter of judgement and heuristics as to whether these are present and how they map onto chemistry. CrystalEye is an experiment in high-throughput chemical heuristics in crystal structures.
  • CrystalEye uses these heuristics to “guess” bonds, bond orders and charges. We think it does a pretty good job, but it’s not perfect (we’d like to know where it fails). This is one of the main reasons for posting CrystalEye – how well does it work?
  • Many – if not most – crystals contain more than one “moiety”. Thus Na2SO4.10H2O contains 2 sodium cations, 1 sulfate anion and 10 water molecules. How we break this up affects what the InChI looks like. In this case we have probably got it right, but in many cases the InChI is a matter of judgement rather than fact. This is because chemistry is varied and complex.
  • Many scientists are interested in the crystal structure – the way the moieties pack together. Others are completely uninterested in this and only wish to know about individual moieties. We have to cater for all sorts of science – organic, metal organic, inorganic, materials science, nanotechnology, etc. All of these disciplines will want something completely different from a crystallographic repository.
  • Nick Day has also created a huge amount of derived data. The most obvious of these are the fragments, where he has split the molecules into “natural” subunits. We expect this to be very useful to people (like Openbabel and FROG and BUSTR3D) who wish to build 3D molecules from smaller fragments. In fact there are more fragments than entries.
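
Purely as an illustration of the kind of processing that sits between the raw crystallography and a chemical view, here is a minimal Python sketch that converts fractional coordinates to Cartesians and then guesses bonds from summed covalent radii. The cell parameters, radii and tolerance are illustrative only – CrystalEye’s real heuristics (symmetry, disorder, special positions, occupancy, charges) are far more involved than this.

```python
import math

# Illustrative covalent radii in Angstrom; a real table covers the whole periodic table.
COVALENT_RADII = {"C": 0.77, "H": 0.37, "O": 0.73, "N": 0.75, "B": 0.82}

def frac_to_cart(cell, frac):
    """Convert fractional coordinates to Cartesian Angstrom.

    cell = (a, b, c, alpha, beta, gamma), angles in degrees.
    Uses the standard crystallographic orthogonalisation matrix.
    """
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    v = math.sqrt(1 - ca*ca - cb*cb - cg*cg + 2*ca*cb*cg)  # cell-volume factor
    xf, yf, zf = frac
    x = a*xf + b*cg*yf + c*cb*zf
    y = b*sg*yf + c*(ca - cb*cg)/sg*zf
    z = c*v/sg*zf
    return (x, y, z)

def guess_bonds(atoms, tolerance=0.4):
    """Crude heuristic: bond two atoms if their distance <= r1 + r2 + tolerance.

    atoms: list of (element, (x, y, z)) in Cartesian Angstrom.
    This is only a distance test, not CrystalEye's actual algorithm.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (el1, p1), (el2, p2) = atoms[i], atoms[j]
            r1 = COVALENT_RADII.get(el1, 0.75)
            r2 = COVALENT_RADII.get(el2, 0.75)
            d = math.dist(p1, p2)
            if d <= r1 + r2 + tolerance:
                bonds.append((i, j, round(d, 3)))
    return bonds
```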

So in extracting data from CrystalEye it is important to consider what the discipline is. I suspect that so far most of the requests have come from the molecular organic community, which often has a focus on drug design. They request “structures”, but there are no structures in CrystalEye, only entries with a variety of derived chemical concepts, some of which may be considered as “structures”. It is important to define precisely what is required before it can be provided.
We wish to make CrystalEye as useful as possible. Please remember, however, that it is the work of a single graduate student (Nick Day) who is now writing up. We are actively continuing to develop repository technology to fit CrystalEye and Jim Downing will be blogging on this. We are also actively talking to 3 groups about sustainability and we are thinking very hard about how to “copy” and “update” a dynamic repository. DSpace, Fedora and ePrints developers and managers will know that this isn’t easy – it’s a topic of research. It isn’t easy for CrystalEye either.
But it should be easy to use CrystalEye as installed here for the applications we have created (browse, bond search, substructure search and RSS feeds). If there are applications and extensions you are interested in we’d be delighted to know. You may have to write the code!

Posted in crystaleye | 1 Comment

Derivative use of Open Access works

A number of people have commented on my concern about the re-use of Open Data and suggested that I have put unreasonable restrictions on it. I show two comments and then refer to Klaus Graf who has, I think, put the position very clearly.
Two comments:

  1. ChemSpiderMan Says:
    November 2nd, 2007 at 2:31 pm: […] In this case for CrystalEye you have people asking you for the data, they are OpenData but now your concern over forking appears to be the problem with sharing the data. I wish you luck resolving this so that we can access the data. Otherwise we will initiate our scraping as you suggested and it will fork anyway.
  2. Gary Martin Says:
    November 2nd, 2007 at 7:28 pm: It boils down to the question of how truly “OPEN” are those open data, Peter, when you start expressing concerns about sharing those data, i.e. the discussion about forking.

PMR: CrystalEye is a highly complex system, not initially designed for re-distribution. It contains probably 3 million files and many hundreds of gigabytes. If each file is spidered courteously (i.e. pausing after each download so as to consume only a single thread) it could take 10 million seconds – nearly four months. During that time the database will have grown by 10-15%, so that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system. Conversely if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection. So we have to create a sensitive and manageable process.
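
To make that timescale concrete, here is a minimal sketch of what a courteous single-threaded harvester looks like – one request at a time with a pause between requests. The URL list and the delay are hypothetical (CrystalEye does not publish a flat index like this); the point is simply the arithmetic: millions of resources times a few seconds each is months of wall-clock time.

```python
import time
import urllib.request

def polite_fetch(urls, delay_seconds=3.0):
    """Fetch resources one at a time, sleeping between requests.

    3 million resources x ~3 s each is ~10^7 seconds, i.e. months,
    during which a dynamic repository will have moved on.
    """
    for url in urls:
        with urllib.request.urlopen(url) as response:
            yield url, response.read()
        time.sleep(delay_seconds)  # be courteous: never hammer the server
```
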
The data are Open and you can legally do almost anything other than claim you were the progenitor. That’s what Open means. But some of the things you can legally do are antisocial and we are requesting you don’t do them. Failing to respect the “integrity of the work” may not be illegal but it can be regarded as antisocial. The licences do not manage this.
Klaus Graf:

http://www.earlham.edu/~peters/fos/2007/11/whether-or-not-to-allow-derivative.html
I disagree with Peter Suber and agree with PLoS and its position:
The Creative Commons web site explains the meaning of “no derivative works” as follows: “You may not alter, transform, or build upon this work”. This is not open access.
It’s a clear misinterpretation of Budapest when Suber cites the definition as an argument that derivative use isn’t allowed:
The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
To control the integrity is a moral right and has nothing to do with a license formula. It’s the same as the “responsible use of the published work” in the Berlin declaration, which explicitly allows derivative works.
Harnad is denying the need for re-use. Suber has often argued for the reduction of PERMISSION BARRIERS and his personal position of preferring CC-BY is honest, but his opinion that CC-ND is compatible with BBB and also OA is absolutely disappointing. And it’s false too.

PMR: I agree with Klaus. I believe that PERMISSION BARRIERS must be removed. Whatever the moral arguments about permission barriers, I think there are also utilitarian ones. Open Access and Open Data are sufficiently complex already that differential barriers are counterproductive – they confuse people. There is also enough evidence that many publishers pay lip service to OA by producing overpriced substandard hybrid products. If CC-ND is seen as OA then it is easy for the publishers to claim that any visible document is OA. There must be clear lines and I think CC-BY is where they are.
(And yes, I have asked that my licence on this blog be changed to CC-BY.)

Posted in open issues | 6 Comments

Open NMR: what metadata do we want?

One of the reasons that CrystalEye works is that the metadata contributed by the authors (and required by the publishers, through IUCr) is superb. Is there general agreement about what metadata should be captured for NMR spectra or shifts? The JCAMP files potentially contain a lot but this depends on their availability – we know of few publishers who accept, let alone publish, JCAMP files. On the assumption that this post reaches some publishers who wish to promote good scientific practice in reporting spectroscopy, what do we want?
NMRShiftDB mirrors the summary material published in the body of chemical papers (and sometimes in supplemental data). A typical recent entry shows:

Details

  • Molecule: 20139616 (nmrshiftdb.cubic.uni-koeln.de_Kolshorn_2007-10-23_02:02:58_0723)
  • Chemical name(s): (5-bromo-2-furyl)methanol
  • Molecular weight: 176.996
  • Number of all rings, size of smallest set of smallest rings: 1, 1
  • CAS-Number: (empty)
  • Molecule keywords: (empty)
  • Type: 13C
  • Assignment Method: 1D shift positions
  • Solvent: Chloroform-D1 (CDCl3)
  • Additional comments: MZ2N-107
  • Spectrum categories: ocmainz inhouse database

The following may be available elsewhere but (for 13C) I would like to see additionally:

  • organization/person depositing data (i.e. not just in filename)
  • date of deposition
  • whether solvent is spiked with reference material (e.g. TMS).
  • method of assignment of peaks (ideally this should be per peak as well as per experiment)
  • known hydrogen counts (this might be from an in-house experiment or a report in the literature).
  • comments
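
As a sketch of what such a record might hold, here is an illustrative data structure carrying the fields listed above. The field names are invented for this post and do not correspond to NMRShiftDB or JCAMP-DX labels.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class CarbonSpectrumMetadata:
    """Illustrative container for the wished-for 13C metadata (names invented)."""
    depositor: str                  # organisation/person depositing, not just a filename
    deposited_on: date              # date of deposition
    solvent: str                    # e.g. "CDCl3"
    reference_spike: Optional[str]  # e.g. "TMS", or None if the solvent is unspiked
    assignment_method: str          # method of assignment, per experiment...
    peak_assignment_methods: dict = field(default_factory=dict)  # ...and, ideally, per peak
    known_hydrogen_counts: dict = field(default_factory=dict)    # atom id -> H count
    comments: str = ""
```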

A lot of this can be gleaned from publications. Here’s a typical rubric:

The 1H NMR (300 MHz) or 13C NMR (75 MHz) and DEPT spectra were recorded in CDCl3 using Bruker 300 MHz or JEOL 60 MHz spectrometers with TMS (0 ppm) as the internal standard.

PMR: but that’s all you get. At least we know that they used real TMS.

Posted in nmr, open issues | 5 Comments

A novel idea: publishers adding value to OA publications

Here’s a post from a little while ago: TA publishers thinking about how to build on OA repositories.

19:19 08/10/2007, Peter Suber, Open Access News: Stephane Goldstein has blogged some notes on the ALPSP conference, Repositories – for better or worse? (London, October 5, 2007). Excerpt:

…This was genuinely interesting, and the question in the title proved to be satisfyingly open. Had such an event been run a couple of years ago, I suspect that the overwhelming consensus on the day would have been ‘for worse’ – but on this occasion, there was a real debate, symptomatic perhaps of changing attitudes.
This is not to say that concern and worry about the development of repositories has vanished among learned societies – far from it. However, the discussion at the workshop underlined a willingness from many quarters to think about new business opportunities. These include the provision of value-added services and enhancements such as text harvesting, dynamic formatting and Web 2.0 applications. There was a suggestion that, on this basis, publishers could actually generate revenue from repository usage, effectively by providing services that can exploit their content. The possible role of publishers in relation to data publication was also evoked, and this elicited much interest.
To my mind, one of the most interesting ideas raised by one of the societies was that ‘publishing’, as an all-embracing term, is an outdated concept. The view was that, from a business modelling perspective, it is better to deconstruct ‘publishing’ into distinct but related processes that form an integral part of the research endeavour: broadly, (i) selection of scholarly material, (ii) validation, and (iii) editing – each of which could in principle be offered as a service, with cost implications, in its own right. Incidentally, this is not dissimilar from the view contained in RCUK’s original draft position statement (June 2005) on access to publication outputs.
All in all, this was a useful afternoon, and I can recommend ALPSP’s programme of such activities (not the first I’ve attended), even for non-publishers….

PeterS: Comment. One of the brightest futures for all stakeholders is for TA publishers to see OA as a business opportunity, not a business killer, or to reconfigure their operations to make it a business opportunity. This is the path of adaptation, not resistance. In this future, publishers accept and even encourage OA to peer-reviewed articles, and make their money by selling enhanced editions of the basic OA texts and selling tools and services to build, and build on, that OA foundation. There have been earlier signs of movement in this direction, from priced access to priced services for adding value to OA literature, but Goldstein’s notes suggest that this meeting was one of the strongest to date.

PMR: This seems to me an obvious market opportunity for science publishers and especially chemistry ones. The pace of innovation in chemistry – apart from the RSC’s Project Prospect, in which our collaboration played an important part – is near zero. Yet the opportunity for enhanced services is enormous. Any publisher taking an active role here will rapidly leapfrog their competition. Spend the investment on things the community wants – not wasting effort on supporting subscriptions and policing access. There’s a huge role for “repositories” – not just of hamburger PDFs but real science. Anyone looking for new opportunities can contact us – we are serious.
Of course it depends on the chemical community actually appreciating enhanced and better chemical publications – if all they worry about is their citation count and not the scientific content we have a problem.

Posted in open issues | Leave a comment

COST and CrystalEye: What's the longest B-C bond?

In my talk to COST I demonstrated CrystalEye – Nick Day’s collection of > 100,000 crystal structures. I like giving live demos – at least it keeps me on my toes – and so I loaded the CrystalEye home page. (You can follow this in your browser if you have Java and SVG.) I then asked the audience for a bond type they were interested in and Hans-Peter Luethi (who is the project owner and with whom we are starting a collaboration) said “Boron – Carbon”. So click on “Bond Lengths” | “B” and you will find all the bonds that boron has been found to make, including a section:

  • B-Br
  • B-C
  • B-Ca

Each bond type links to histograms of the observed lengths; “protocol” means Joe Townsend’s protocol for identifying high-quality structures from the metadata and content. “B-C” “after protocol” will load a histogram that looks like this:

[bc.PNG – histogram of B-C bond lengths]

You’ll notice that there is a wide spread – 1.5 to 1.8 Angstrom – and that the distribution is at least bimodal. I’d welcome comments as to whether this is real. You can click on any bar and it will list all the entries contributing. So we click on the rightmost bin (1.805) and get (apologies for the large size but it’s worth seeing):

[bc1.PNG – the entries contributing to the 1.805 bin, with the first structure displayed]

Nick’s software has chosen the first structure and has also highlighted the bond in question. Nearly everything is clickable – you can browse to the abstract on the publisher’s page and even read the paper if you pay 25 USD. (The abstract doesn’t even mention the crystal structure, let alone that this is a record-breaking B-C bond…). The 2D structure (which includes an icosahedron) is difficult for any automatic layout and CDK has done a reasonable job.
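
For readers who want the mechanics: the histogram behind this demo is conceptually just fixed-width binning of bond lengths, with each bin keeping track of the entries that contribute to it so they can be listed when a bar is clicked. A toy sketch (bin width and limits are illustrative, and this is not Nick’s actual code):

```python
from collections import defaultdict

def bin_bond_lengths(entries, bin_width=0.005, lo=1.5, hi=1.85):
    """Group (entry_id, length) pairs into fixed-width bins (lengths in Angstrom)."""
    bins = defaultdict(list)
    for entry_id, length in entries:
        if lo <= length < hi:
            key = round(lo + bin_width * int((length - lo) / bin_width), 3)
            bins[key].append(entry_id)
    return dict(sorted(bins.items()))

# Clicking the rightmost bar corresponds to looking up the highest key,
# e.g. bins[1.805], which lists the entries with the longest B-C bonds.
```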

Posted in crystaleye | Leave a comment

Open NMR: update and requests for input

The NMR project that Nick Day has been working on for the last month has run its course. We said that it would finish at the end of October so as not to prolong Nick’s writing up. Like all research it has not gone completely smoothly, but it has actually been well on track. Nick will be posting all the material in the next day or so and everyone will have access. This will approximate an Open Notebook and we’ll invite comments later as to whether this is satisfactory. I shan’t fill in numbers in this post.
We set out our expected goals at the start of the project and this has proved extremely valuable. During an exciting period it has helped us stay focussed both in direction and extent. We had not anticipated the interest it would generate and it’s a credit to Nick that he has stayed clear-headed during the process. Here’s what we said we would do and whether we managed it (actual numbers and details will be posted soon):
  • We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). [DONE]
  • We extracted ca 400 spectra (shifts) with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). [DONE – we also used Br]
  • These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. [DONE]
  • The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed – see the sketch after this list. [DONE]
  • Initially the RMS deviation was [large]. This was due to a small number of structures where there appeared to be gross errors of assignment. [CORRECT. There were “wrong structures” and misassignments of peaks were also common.]
  • These were exposed to the community, who agreed that these should be removed. [DONE in part. We are extremely grateful to the community for commenting on general methodology and individual entries. A small number of entries were clearly grossly wrong or mistreated.]
  • The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. [CORRECT. This is true for C-Br and C-Cl systems. We shall invite comments for some other groups.]
  • The main outliers then appeared to be from laboratory AAA, to whom we wrote and who agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. [NOT DONE. The metadata are not included in the CMLSpect file so this would have to be done manually. It is probably not a major contributor to variance. We would also like to have included dates but these are not easily extracted.]
  • The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group, which probably has two conformations. A conformational analysis was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv. [STARTED, and the community can help. We have identified clear conformational effects in some cases, suspected tautomers, and unmodelled solvent effects.]
  • This established a protocol for predicting NMR spectra to 99.3% confidence. [TO BE POSTED. We are confident that this method is applicable to a subset of chemistry and does not rely on fitted parameters. We are working in variance space but may be able to transform to confidence. The treatment of misassignment looks promising.]
  • We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported structures. [NOT YET STARTED. We are hoping to build a submission system and invite the community to contribute. This is not part of Nick’s thesis, though obviously if useful work is done before he finishes the writing he can include it in the discussion. We may do some exemplars when we write the paper. We’d be very grateful for any examples of recent publications where the spectral peaks look reliable and the structure does not.]
[… ideas related to publishers snipped …]
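
For anyone following the arithmetic, the core comparison is very simple: the predicted shift is the calculated isotropic shielding of TMS minus the calculated shielding of the atom, and the quality measure is the RMS deviation from the observed shifts. A minimal sketch with invented numbers (the real shieldings come from the Gaussian calculations):

```python
import math

def predicted_shifts(sigma_tms, sigmas):
    """delta_i = sigma(TMS) - sigma(i): calculated shieldings (ppm) -> predicted shifts (ppm)."""
    return {atom: sigma_tms - s for atom, s in sigmas.items()}

def rmsd(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts (ppm)."""
    diffs = [predicted[a] - observed[a] for a in observed]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Invented numbers, for the arithmetic only:
pred = predicted_shifts(189.7, {"C1": 61.2, "C2": 68.4})   # -> {"C1": 128.5, "C2": 121.3}
print(rmsd(pred, {"C1": 127.9, "C2": 122.0}))
```
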
We started this 3 weeks ago and have effectively finished all computation, tools for display, and much of the analysis. We feel confident in stating that initially most of the variance was due to problems in the “experimental” – i.e. the actual data and its metadata. We identified ca. 11 possible error types (post) and have actually found four of them:

  • wrong compound assigned to spectrum (i.e. error in bookkeeping or drawing error)
  • transcription errors in spectrum or peaks
  • misassignment of peaks to inappropriate atoms
  • human editing of spectra, including fraud

We also found a number of limitations in our model (so far we haven’t found any “bugs”).

  • Theoretical model has limitations. YES – so far the main one appears to be lack of treatment of solvent (e.g. for C=O groups in CHCl3). We anticipated this and we think it shows up.
  • Oversimplified chemical model. There are several common problems:
    1. Only one conformer is calculated. YES – identifiable.
    2. Symmetry is not well treated. No clear exemplar other than conformers.
    3. Tautomerism is ignored. PROBABLY – we invite your comments.
    4. Isomerism (e.g. ring-chain) is ignored. No examples.

So when Nick posts the data – and you should have access to all of it – I suspect that there will be many opportunities to contribute.

Posted in nmr, open issues | 1 Comment

Feature Extraction and Feature authoring

An interesting review: Deepak Singh: The value of feature extraction:

Let’s start with a quote from a talk on Ambient Findability:
  • For every search on cancer.gov, there are over 100 cancer-related searches on public search engines.
  • Of these searches, 70% are on specific types of cancer.

There is another statement of interest in the same talk:

… the ability to find anyone or anything from anywhere at anytime

The above statements bring to mind the subject of context. Let us agree that “data finds the data”. In that case we must also agree that data must be found in the correct context. Don’t believe me? Just ask Jeff Jonas. In my mind, if machines are to do this, semantic markup of some sort is the only way. Extracting information from documents, regardless of format – whether they be text, images or video – is one of the key challenges of our times. In the life sciences, right now, I don’t really know of any ways (if someone knows of any, let me know) that someone can extract the meta-data from an image or a video, correlate it to meta-data in a set of text files and automatically come to a conclusion about the potential context of the two observations. I talked about Persistent Context for the life sciences in the past. Let me steal another of Jeff’s ideas, that of Sequence Neutrality. Essentially, “context engines must constantly be on the lookout for new observations that change earlier assertions – and if a new observation provides such evidence – the invalidated assertions from the past must be remedied.” Context and feature extraction together make a very powerful mix, which can help pharma companies find better, safer drugs faster. This is especially critical in the kind of healthcare environment taking root today, with an emphasis on pharmacovigilance, early safety assessment, etc. If we can continuously update our safety databases based on new data, we are likely to identify adverse events faster, and essentially could carry out constant meta-analysis.
Jon Udell, in a post commenting on Tim O’Reilly’s review of Twine, talks about entity extraction and a Firefox plugin called Gnosis. I had heard about Gnosis before, but only looked at it askance. However, Jon’s post made me take a second look, and all I can say is WOW. Take a look at the screenshot below [PMR: omitted here]. It shows the features that Gnosis extracted from my blog post on pharma futurology. The interesting thing is not the actual results, but the concept. If you could do the Freebase thing, and add additional information which gets stored in a dictionary somewhere, you have that much power available to you.

PMR: And OSCAR does pretty much the same for chemistry. Maybe the way forward is a mashup of domain-specific engines in a single framework. I’d certainly like to see the context added. There is so much experimentation to be done – and like all experiments we have to expect failures as well as successes. But the cost of each is getting less.
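
As a toy illustration of the “mashup of domain-specific engines” idea – emphatically not the real OSCAR – here is a sketch in which two crude extractors are run over the same text and their findings are collected side by side, ready for context to be layered on top:

```python
import re

def chemistry_like(text):
    """Toy stand-in for a chemical entity extractor (NOT the real OSCAR)."""
    return re.findall(r"\b(?:[A-Z][a-z]?\d*){2,}\b", text)   # crude formula-like tokens

def oncology_like(text):
    """Toy stand-in for a biomedical extractor."""
    return re.findall(r"\b\w*carcinoma\w*\b", text, re.IGNORECASE)

def mashup(text, extractors):
    """Run several domain-specific extractors over one text, side by side."""
    return {name: fn(text) for name, fn in extractors.items()}

print(mashup(
    "The adenocarcinoma samples were treated with NaCl and C6H12O6.",
    {"chemistry": chemistry_like, "oncology": oncology_like},
))
```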

But shouldn’t we be getting these sorts of tools to authors as well as readers? That’s one of our next steps.

Posted in semanticWeb | Leave a comment

Gordon calls for Open Something

The case for allowing free access to data collected and held at taxpayers’ expense has received endorsement from the top of the British government. In his speech on civil liberties last week, Gordon Brown, the prime minister, said: “Public information does not belong to government, it belongs to the public on whose behalf government is conducted.”
Brown’s speech acknowledged the power of the web to give access to information about public services. “The availability of real-time data about what is happening on the ground – whether about local policing or local health services – is vital in enabling people to make informed choices about how they use their local services and the standards they expect.”
Brown is also considering opening new parts of the government’s digital archives….
However, the prime minister did not mention how his enthusiasm for allowing citizens access to data squares with the policy of encouraging some publicly-owned bodies to charge for data sets, especially for re-use in web mashups and other products. But Locus, a trade association, welcomed the speech. “Next time we hope he will focus on re-use,” it said.
We agree. Over the past 18 months, our Free Our Data campaign has argued that the government should stop attempting to trade in information, but instead make its unrefined data (except where it threatens privacy or national security) freely available to all comers.
Later this month, an independent review commissioned by the Treasury will report on the costs and benefits of the current “trading fund” model….

PMR: Credit to the Guardian which has campaigned for Open Data – particularly in government areas. British political processes work in mysterious ways (unlike the US which from here seems to be a well-accepted public bearfight in the lobbies). Here the Open Access policy is a “level playing field” – code (I think from Lord Sainsbury) which means that commercial interests and public interest thrash it out with the government acting as a spectator on the touchline. Of course they can always move the goal posts and stonewall on a sticky wicket or kick it into touch when they are stymied.

Posted in open issues | Leave a comment

Open Source and Open Data

In the last post I commented on some of the limitations of licences to ensure Open Data. This post now compares it with Open Source. I am a campaigner for Open Data (see Wikipedia) and on the advisory board of the Open Knowledge Foundation, which has produced an Open Data definition (a metalicence). People who wish to make their data Openly available often use a licence (such as CC-BY) which is compatible with the OKFN definition. These licences serve many valuable purposes – they act as a formal statement of the general desires of the author(s) and they provide legal force in certain ways (e.g. protecting against third-party copyright, etc.). They are, however, blunt instruments.
This problem has been recognised for many years in Open Source, where there are many different licences which try to express not just the general freedoms but also additional freedoms or restrictions. One of the best known is the GNU General Public License (GPL) which – in simple terms – requires all derivative works (even if much larger) to carry the GPL. This has been described as “viral” – it infects every piece of software it is linked with. This was the original motivation and it has been policed by the Free Software Foundation.

The GPL additionally states that a distributor may not impose “further restrictions on the rights granted by the GPL”. This forbids activities such as distribution of the software under a non-disclosure agreement or contract. Distributors under the GPL also grant a license for any of their patents practiced by the software, to practice those patents in GPL software.

PMR: I ran into these concerns when I wrote a Chemical Markup Language converter for Open Babel. OB is issued under a GPL licence, so any code added must automatically carry the GPL. My JUMBO program (from which I converted the code for OB) is Open Source, but issued under the Artistic License. I chose the AL because it allowed some control over the use of the code, while still honouring the OS principles. In particular it states that if someone creates a derivative work independently they must release it under a different name. Because JUMBO is primarily designed to test conformance to the definition of CML I wanted to ensure that derivative versions (which might not conform to CML) were not called “JUMBO”.
In Open Babel I added a paragraph under the licence to say that if anyone edited the code they were required to make it clear to users that this was not necessarily conformant to CML. The FSF audited OB and objected to this statement. I therefore rewrote the “requirement” as a “request” and added it to the in-code documentation. (Because of major rewrites from the OB community it’s no longer in the latest release – I think all my code has been obsoleted – no bad thing!).
It’s clear that a licence only covers certain aspects of re-use and redistribution. There are many legal derivative works that are unacceptable. If someone other than the primary author(s) introduces a bug in a derivative it confuses the community, lowers the apparent quality of the code and increases tensions. If someone writes a lightweight wrapper round a code (legally) and then claims “ownership” of the result, that can be unfair to the original author(s). If someone makes extravagant claims for software that the author(s) do not support, that can cause problems. And so on.
Some of these may be thoughtless and could be prevented by a clear indication of what is “reasonable” and “unreasonable” practice – a set of requests or a policy. In practice this is probably the best way, as if someone is using code unscrupulously (which happens rarely, but does happen) a licence will not protect against it. Most people in the Open Source arena work by the gift economy and value the authors’ contributions. They would always try to contact the author before considering forking the project.
The same considerations and tensions apply to data. However, many people are coming to Open Data from outside the practice of Open Source. They may have encountered Open Access, but this has little to say about the gift economy. Open Access requires a sound business model and there is sufficient ad hoc evidence to show this is possible. Much Open Source is initially more ad hoc and dependent on a shared gift ethic. This ethic is not wholly altruistic – some would claim not at all – but whatever the basis of the ethic there is a high consciousness of it in the community: OS contributions earn karma. I suspect this is less important for many authors of OA – they will do it because they believe their articles will be more widely read, more widely used, or because they have been mandated to do it.
Open Data is relatively new and I believe it urgently requires an ethic. In some cases it will be mandated but in many cases it will have elements of the gift economy. If so this needs to be protected by an awareness in the community of the value of the gift, and that it should not be deliberately or inadvertently mis-used. Some communities – e.g. bioscience – have several years’ experience of (effectively) Open Data and must have encountered some of these problems – misappropriation, analogies to passing off, corruption (of data), etc. It would be useful to have examples, and solutions if any exist.

Posted in open issues | Leave a comment