Monthly Archives: December 2007

Mystery picture

What's this picture?

mystery.JPG

and why might I be interested in it?

(It's not the whole picture, so I claim fair use - I don't know who the copyright holder is. And the clipped space hides a fairly vital clue).

[UPDATE: 2007-12-23:

It's a penguin, drawn by Ernest Shackleton. There's also one by Robert Scott. They were discovered in a basement in the Scott Polar Research Institute, which is just next to the Chemistry lab in Cambridge. There was a TV van there two days ago...

http://ap.google.com/article/ALeqM5iKl5uJqCfIDn9RKK1LZK2JmKTxhwD8TM770G0

and

http://news.bbc.co.uk/1/hi/sci/tech/7154205.stm

and

http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/12/21/npenguin121.xml

P.]

the end of the beginning

I got a series of euphoric messages from fellow OA activists rejoicing at the news that President Bush was "certain" to sign the House appropriations bill. I searched for the message in Peter Suber's blog and found ...

Congress sends revised spending bill, and OA mandate for NIH, to President

This evening the House of Representatives passed an omnibus spending bill containing language requiring the NIH to adopt an OA mandate.  The Senate passed the bill on Tuesday.

Because it cuts spending to the levels President Bush requested, and gives him $70 billion for the war in Iraq and Afghanistan, he is expected to sign it.

The OA mandate for the NIH isn't law yet, but it's very, very close. Watch this space.

PMR: I am watching this space ... and Alma Swan writes:

> The Appropriations Bill, with the language in about the NIH mandate, passed in the US Senate last night. It now *will* be signed off by President Bush.
>
> Heather deserves huge congratulations. This has been virtually a one-woman-led effort, and she has fought the publishers all the way, in every corridor and in every committee room. Now to try to emulate her in Brussels ...

PMR: Absolutely total congratulations to Heather. I don't know enough about whether presidential signatures are deterministic so will wait a few more days before breaking open any bottles.

And we should remember that the struggle continues.

"Now this is not the end. It is not even the beginning of the end. but it is, perhaps, the end of the beginning."

and why should I choose this quotation?

Java: labelled break considered harmful

Readers of my last post may have thought that Eclipse makes refactoring easy. It does - up to a point. I had started to refactor an 800-line module with deeply nested loops - just a matter of extracting the inner loops as methods...

... NO!

When I tried this I got:

"Selection contains branch statement but corresponding branch target is not selected"

???

On closer examination I discovered that the code contained a construct like:

foo:
plunge();
for (int i = 0; i < 1; i++) {
    boggle();
    if (bar) {
        break foo;
    }
}

[Added later: PUBLIC GROVEL. Jim has pointed out that I have misunderstood the break syntax, so the code above is WRONG. At least this shows that I never use labelled break. It should read:

plunge();
foo: for (int i = 0; i < 1; i++) {
    boggle();
    if (bar) {
        break foo;
    }
}

Strikethroughs indicate my earlier misconceptions.]

What's happening here? The code contains a labelled break. When break foo is encountered, control jumps out of the labelled loop. (My earlier, struck-through misconception was that control jumped to the label foo, which could be almost anywhere in the module - and in this case was often before the start of the loop.)

Jumping to arbitrary parts of a module is considered harmful (Go To Statement Considered Harmful). Sun/Java announces:

2.2.6 No More Goto Statements

Java has no goto statement. Studies illustrated that goto is (mis)used more often than not simply "because it's there". Eliminating goto led to a simplification of the language--there are no rules about the effects of a goto into the middle of a for statement, for example. Studies on approximately 100,000 lines of C code determined that roughly 90 percent of the goto statements were used purely to obtain the effect of breaking out of nested loops. As mentioned above, multi-level break and continue remove most of the need for goto statements.
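
To be concrete, here is a minimal sketch (my own illustration, not from the Sun document - the class and variable names are made up) of the legitimate use they describe: a labelled break used only to escape a pair of nested loops, downwards, to the point immediately after the outer loop.

public class NestedBreakDemo {
    public static void main(String[] args) {
        int[][] grid = { { 1, 2 }, { 3, 4 }, { 5, 6 } };
        int target = 4;
        boolean found = false;
        search: for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                if (grid[i][j] == target) {
                    found = true;
                    break search; // exits both loops at once
                }
            }
        }
        // control resumes here, immediately after the loop labelled "search"
        System.out.println("found = " + found);
    }
}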

but surely the code below is a direct replacement for goto.

foo: {                 // some enclosing statement must carry the label foo for this to compile
    while (true) {
        break foo;     // in effect a goto: control jumps to just after the statement labelled foo
    }
}

continue is useful. break out of single level (unlabelled) is useful. break out of multiple loops might just be OK if it was always downwards and always to the point immediately after a loop.

But it isn't.

so - and I am surprised that I can't easily find it on Google:

"labelled break considered harmful"

However, as it is still extremely easy to write code that cannot easily be refactored, I still hold that labelled breaks should be used only when essential.
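
A sketch of the alternative (mine, with hypothetical names): extract the nested loops into their own method and use return instead of a labelled break. The extracted method is then exactly the kind of thing Eclipse can refactor further.

// Hypothetical sketch: return replaces "break search;" so no label is needed,
// and the method can be extracted or refactored freely.
private boolean containsTarget(int[][] grid, int target) {
    for (int i = 0; i < grid.length; i++) {
        for (int j = 0; j < grid[i].length; j++) {
            if (grid[i][j] == target) {
                return true;
            }
        }
    }
    return false;
}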

Refactoring large modules using Eclipse

I have recently had to consider refactoring a piece of Java which had got slightly out of hand - the module was 800 lines long and the if statements so deeply nested that they ran well off the right-hand edge of the page. I will NOT identify where it came from, nor criticize it - I have written much worse in my past (you can do really fun things with computed GOTOs in FORTRAN). But it was and is unmaintainable and we care about that in the Centre.

So I thought that I would sit down with Eclipse in front of the football and refactor it. Eclipse has this really neat Refactor that allows you to select a chunk of code and turn it into a method. For example:

public void add3DStereo() {
    // StereochemistryTool stereochemistryTool = new StereochemistryTool(molecule);
    ConnectionTableTool ct = new ConnectionTableTool(molecule);
    List cyclicBonds = ct.getCyclicBonds();
    List doubleBonds = molecule.getDoubleBonds();
    for (CMLBond bond : doubleBonds) {
        if (!cyclicBonds.contains(bond)) {
            CMLBondStereo bondStereo3 = create3DBondStereo(bond);
            if (bondStereo3 != null) {
                bond.addBondStereo(bondStereo3);
            }
        }
    }
    List chiralAtoms = new StereochemistryTool(molecule).getChiralAtoms();
    for (CMLAtom chiralAtom : chiralAtoms) {
        CMLAtomParity atomParity3 = null;
        atomParity3 = calculateAtomParity(chiralAtom);
        if (atomParity3 != null) {
            chiralAtom.addAtomParity(atomParity3);
        }
    }
}

I now select the first for loop and turn it into a method; and repeat for the second and get:

public void add3DStereo() {
    // StereochemistryTool stereochemistryTool = new StereochemistryTool(molecule);
    ConnectionTableTool ct = new ConnectionTableTool(molecule);
    List cyclicBonds = ct.getCyclicBonds();
    List doubleBonds = molecule.getDoubleBonds();
    addBondStereo(cyclicBonds, doubleBonds);
    List chiralAtoms = new StereochemistryTool(molecule).getChiralAtoms();
    addAtomParity(chiralAtoms);
}

/**
 * @param chiralAtoms
 */
private void addAtomParity(List chiralAtoms) {
    for (CMLAtom chiralAtom : chiralAtoms) {
        CMLAtomParity atomParity3 = null;
        atomParity3 = calculateAtomParity(chiralAtom);
        if (atomParity3 != null) {
            chiralAtom.addAtomParity(atomParity3);
        }
    }
}

/**
 * @param cyclicBonds
 * @param doubleBonds
 */
private void addBondStereo(List cyclicBonds, List doubleBonds) {
    for (CMLBond bond : doubleBonds) {
        if (!cyclicBonds.contains(bond)) {
            CMLBondStereo bondStereo3 = create3DBondStereo(bond);
            if (bondStereo3 != null) {
                bond.addBondStereo(bondStereo3);
            }
        }
    }
}

The whole thing took 30 seconds, including choosing the method names. Eclipse did all the params, documentation, return values - everything.

Try it - it will really fix up many sorts of grotty code...

Mystery Picture

Here is a photograph (untouched, not CGI). When I saw it I went wow! (I knew what it was). I'd be interested to know if anyone (a) KNOWS what it is of (b) can estimate the scale (c) has seen anything like it. If you do know, please post a comment saying so [but please DON'T give the answer]. I plan to release more information daily...

Besides the photo itself there is a serious question. How can you search the web for images like this?

picture0.JPG

and a close-up:

picture1.JPG

[UPDATE - more info: The photograph was taken yesterday by Dr. Judith Murray-Rust.]

[ANSWER: This is, indeed, crystalline water but the scale took us by surprise. The x-axis is ca. 20 cm. This artefact appeared in our bird bath and there appear to be 2 perfect, huge, hexagonal ice crystals (it is possible that they are both sixfold twins, I suppose). The faces are highly planar and specular (we have more pictures).

It is also remarkable that there are two artefacts separated by 10 cm (between centres) which are almost identical. What possible coupling could there be between them - that is the real mystery.]

Happy Holliday - as I might say to Gemma.

Open Data: publishers are the problem

The Chemspider site and blog have been making rapid and valuable progress towards Open Data. This is particularly laudable for a commercial site, where Openness in chemistry is a long way from being a proven business model and is actively resisted by many. Here is a typical tale of frustration - I comment below.
Why We Can’t Publish Scraped CrystalEye Data Yet….And Science Commons Declare a Protocol for Implementing Open Access Data
Previously I blogged about our intention to scrape CrystalEye data and publish onto ChemSpider. The original comments regarding the data on CrystalEye were as follows:

  1. pm286 Says:
    October 26th, 2007 at 7:54 am (1) All data come from Free sources - i.e. visible without a subscription. Some journals (Acta Crystallographica and RSC for example) do not copyright the data. Others like ACS add copyright notices. It is our contention, and Elsevier has agreed for its own material, that facts are not copyrightable. We have therefore extracted and transformed facts and mounted these. Where the original material (CIF) does not carry copyright we mount it on our pages - where it does we do not, but we have the transformed data. In those cases it would be possible to recreate the original CIF data in semantic form, but not the exact typographical layout, which contains meaningless whitespace. I am not aware that ACS or Elsevier have ever made statements of any kind about our Open Data efforts. You may scrape anything, but you must honour the source and the metadata and you should add the Open Data sticker. If you scrape the link (simplest) you may simply point to our site. If you scrape more data you should ensure that the integrity of the data is maintained and that, if it is re-used, the re-used data should still clearly show our metadata.

[PMR: Yesterday's announcement of the CCZero licence could mean that we change from a meta-licence ("Open Data") to an explicit CCZero licence. I will need to read the details. I don't think it changes the arguments below.]

We have already done the work to scrape certain data from the site but have chosen to be extra careful with taking the declaration of Open Data made to all data sources. My primary worry was with the data scraped from the ACS journals. With this caution in mind I sent a letter to the copyright department at ACS as outlined here. In fact I made a couple of phone calls, sent the email about 2 more times and finally managed to talk to a nice gentleman from the ACS copyright department and brought my concerns to light. Since then we have exchanged multiple emails, spoken again on the phone and I have been told that a meeting of minds from both Washington and Ohio was being scheduled to discuss the situation. That’s 2 months after my original email.

Today I received the following email and I am excerpting from it..

“Thank you for your inquiry about the proposed use by ChemSpider of information in the CrystalEye database that has been published within certain ACS journal publications. In light of your query, we are examining the manner in which ACS published material is represented within that database as well as the nature of your proposed use, so that we can respond in an informed manner to your request.

<snip>

If you will be attending the ACS National Meeting in New Orleans, perhaps we could confer with you at that time to discuss our findings and advise you appropriately?

Communicators Name withheld ”

What I thought was a simple question and done with the intention that ChemSpider was safe turns out not to be so simple. It could take until March 2008 to get an answer! At this stage we will not be publishing any of the CrystalEye data without confirmation from each of the publishers that this is allowed. I asked the question previously “Who gets to declare data open or not?” and even received the question “Why even offer the option of closed?” The primary reason is that we have turbulent times ahead of us around such issues of “openness” and until these are navigated I am working to keep ChemSpider “safe”. I am willing to participate, support and contribute to the evangelism of openness but am equally concerned with keeping ChemSpider alive for the close to 3000 users per day now accessing the service.

It was an interesting day to receive this email about a potential FIVE MONTH delay to a decision about Open Data especially now that Science Commons have released a Protocol for Implementing Open Access Data just yesterday. ...

So, while protocols are exposed to the community by Science Commons the challenge of utilizing them now begins… I will be in communication with members of the Science Commons soon to determine how ChemSpider can fit into the model…

PMR: This is, unfortunately, completely typical. Earlier this year I wrote to Tetrahedron (an Elsevier journal) asking if they would consider posting CIFs (crystallographic data):

Request for Open publication of crystallographic data in Elsevier’s Tetrahedron

=========== Open letter to editors of Tetrahedron ==========

Professor L. Ghosez ,
Professor Lin Guo-Qiang ,
Professor T. Lectka ,
Professor S.F. Martin ,
Professor W.B. Motherwell ,
Professor R.J.K. Taylor ,
Professor K. Tomioka

Subj: Request for Open publication of crystallographic data in Tetrahedron
Dear editors,
I have recently been reviewing access to supplemental data in chemistry publications, in particular crystallographic data (“CIFs”). Many publishers (IUCr, RSC, ACS…) expose these on their websites as Open Data (for examples see: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=455). The data are acknowledged not to be copyrightable (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=447) where your colleague Jennifer Jones (copied) has confirmed:

Dear Peter Murray-Rust
Thanks for your email. Data is not copyrighted. If you are reusing the entire presentation of the data, then you have to seek permission, otherwise, you can use the data without seeking our permission.
Yours sincerely
Jennifer Jones
Rights Assistant
Global Rights Department
Elsevier Ltd
PO Box 800
Oxford OX5 1GB
UK
Tel: + 44 (1) 865 843830
Fax: +44 (1) 865 853333
email: j.jones@elsevier.com

Other Elsevier journals such as those publishing thermochemistry (see last blog post) are now actively making the supplemental data Openly available on the journal website. I am therefore asking whether Tetrahedron (and perhaps other Elsevier chemistry journals) might consider publishing their data Openly in this way and would be grateful for your views.

(This is an Open letter (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=456) and I would like to publish your reply so please mark any confidential material as such).

Thank you for considering this

PMR: Five editors - I haven't had the courtesy of a reply. This is not uncommon - I didn't get replies on Open topics from Wiley, Springer (first time round) either. Either journals are not in the habit of replying - they consider ordinary scientists too low in the foodchain to merit consideration (most likely) - or they regard anything Open as a pain and want to slow it by inaction (also most likely). They have their set way of doing things - God ordained in 1972 that the world belongs to the publishers and they don't want to see it change.

Another typical example. I was invited to write an article for Serials Review on Open Data. I asked if I could write my article in HTML and embed my own copyright material, noted as such under an appropriate licence. The editorial office said they would come back to me. It's now past the closing date of the submission. After ca. 6 weeks I got the reply:

Facts and data are not copyrightable but the expression of data is copyrightable. If you wish to use third-party data in a different format within your article, including full acknowledgement to the source of the data, then that would be acceptable. However, if you wish to retain the expression of the data, then you will need to include alternate diagrams within the article.

So I can use the data - IF I can get it. If I can only get a graph then I can't unless I redraw it. Is redrawing a graph a useful activity for science - do I need to answer? The only value is that it adds some random errors to the data (or systematic ones) that would be fun to give as exercises in bad scientific practice for students. "Expression of the data" - i.e. the author's graphs - is not re-usable.

So what's the answer? Currently I use the "ask forgiveness, not permission" mode. And if the "owners" of the data (read "appropriators") send the lawyers and ask for a take-down - make a huge public fuss. As the world did when Shelly Batts "stole" a graph from Wiley (Sued for 10 Data Points). And Wiley backed down. The publishers don't like public fuss.

So a few months ago I would have advised Chemspider "go ahead". But they ran foul of another publisher (I think it was the Royal Society of Chemistry). I never understood the details but Chemspider linked to publicly visible papers (not Open) and were asked to take the links out of the Chemspider database. This doesn't even seem to make sense. I would have thought publishers would like people linking to their papers - maybe it was the metadata.

So I appreciate Chemspider's wish to remain on the correct legal side of the publisher. But [the publishers'] actions destroy scientific data in the current century. Chemistry publishers [OA publishers and IUCr excepted] are actively and passively resisting the re-use of data. They copyright factual data, hide it, require take-downs, refuse to reply to reasonable letters - everything. They are simply in the way between the creator of the data and the consumer.

As I have blogged, we now have an exciting project sponsored by Microsoft on eChemistry. We are going to fill repositories with data. And we are going to get that data ("not copyrightable" - see above) from any source we reasonably can. It will be available to the whole world. It will probably be stamped CCZero. CrystalEye will be in there. We shall, of course, include the source (provenance) and the metadata, as we really care about these. So people will know where it came from.

Why can't the ACS reply "Yes" to Chemspider by return? Does it really make sense for chemistry publishers to be universally seen as Luddites? Because the world will sweep these restrictive practices away, and the business will have moved from the publishers to somewhere in the twenty-first century (the one we are in).

Open Notebook Science and Glueware

Cameron laments the difficulty of creating an Open Notebook system when there is a lot of data:

 

The problem with data…


Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To re-cap our system is that every “item” has its own post. Until now these items have been samples, or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple; Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, that might for instance need re-processing, we could treat it the same way as a product or sample.

[snip...]

 

PMR: How I sympathize! We had a closely related problem with Nick Day's protocol for NMR calculations. There were also other reasons why we didn't do complete Open Notebook, but even if we had wanted to we couldn't, because the whole submission and calculation process is such horrendous glueware. It's difficult enough keeping it under control yourself, let alone exposing the spaghetti to others. So, until the protocol has stabilised (and that's hard when it's perpetual beta), it's very hard to do ONS.

 

And what happens when you change the protocol? The data formats suddenly change. And that will foul up all your possible collaborators. Do you have a duty of care to support any random visitor who wants to use your data? I have to argue "no" at this stage. You may expose what you have, but it's a mess.

 

The only viable solution is to create a workflow - and to tee the output. But as Carole Goble told us at DCC - workflows are HARD. That's why glueware is so messy - if we had cracked the workflow problem we would have eliminated glueware.
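
A minimal sketch of what "teeing the output" could look like (entirely my illustration - the TeeStage class and its names are hypothetical): each stage of the workflow archives its intermediate result as well as passing it on, so the intermediates survive even while the glueware around them keeps changing.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a "teed" workflow stage: the intermediate output is
// written to an archive directory and also returned to the next stage.
public class TeeStage {
    private final Path archiveDir;

    public TeeStage(Path archiveDir) {
        this.archiveDir = archiveDir;
    }

    public String run(String stageName, String input) throws IOException {
        String output = transform(input); // the real work of this stage
        Files.createDirectories(archiveDir);
        // the "tee": keep a copy of the intermediate result
        Files.writeString(archiveDir.resolve(stageName + ".out"), output);
        return output; // ... and pass it on to the next stage
    }

    private String transform(String input) {
        // placeholder for the actual calculation or format conversion
        return input.toUpperCase();
    }
}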

 

The good news is that IF we crack it for a problem, then it should be much much easier to archive, preserve and re-use the output of ONS.

 

What sort of repositories do we want?

 

Open Access and Institutional Repositories: The Future of Scholarly Communications, Academic Commons,


Institutional repositories were the stated topic for a workshop convened in Phoenix, Arizona earlier this year (April 17-19, 2007) by the National Science Foundation (NSF) and the United Kingdom's Joint Information Systems Committee (JISC). While in their report on the workshop, The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship, Bill Arms and Ron Larsen build out a larger landscape of concern, institutional repositories remain a crucial topic, which, without institutional cyberscholarship, will never approach their full potential.

 

PMR: Although I'm going to agree generally with Greg I don't think the stated topic of the workshop was institutional repositories per se. It was digital scholarship, digital libraries and datasets. I would expect to find many datasets outside institutions (witness the bio-databases).

Repositories enable institutions and faculty to offer long-term access to digital objects that have persistent value. They extend the core missions of libraries into the digital environment by providing reliable, scalable, comprehensible, and free access to libraries' holdings for the world as a whole. In some measure, repositories constitute a reaction against those publishers that create monopolies, charging for access to publications on research they have not conducted, funded, or supported. In the long run, many hope faculty will place the results of their scholarship into institutional repositories with open access to all. Libraries could then shift their business model away from paying publishers for exclusive access. When no one has a monopoly on content, the free market should kick in, with commercial entities competing on their ability to provide better access to that freely available content. Business models could include subscription to services and/or advertising.

Repositories offer one model of a sustainable future for libraries, faculty, academic institutions and disciplines. In effect, they reverse the polarity of libraries. Rather than import and aggregate physical content from many sources for local use, as their libraries have traditionally done, universities can, by expanding access to the digital content of their own faculty through repositories, effectively export their faculty's scholarship. The centers of gravity in this new world remain unclear: each academic institution probably cannot maintain the specialized services needed to create digital objects for each academic discipline. A handful of institutions may well emerge as specialist centers for particular areas (as Michael Lesk suggests in his paper here).

The repository movement has, as yet, failed to exert a significant impact upon intellectual life. Libraries have failed to articulate what they can provide and, far more often, have failed to provide repository services of compelling interest. Repository efforts remain fragmented: small, locally customized projects that are not interoperable--insofar as they operate at all. Administrations have failed to show leadership. Happy to complain about exorbitant prices charged by publishers, they have not done the one thing that would lead to serious change: implement a transitional period by the end of which only publications deposited within the institutional repository under an open access license will count for tenure, promotion, and yearly reviews. Of course, senior faculty would object to such action, content with their privileged access to primary sources through expensive subscriptions. Also, publications in prestigious venues (owned and controlled by ruthless publishers) might be lost. Unfortunately, faculty have failed to look beyond their own immediate needs: verbally welcoming initiatives to open our global cultural heritage to the world but not themselves engaging in any meaningful action that will make that happen.

The published NSF/JISC report wisely skips past the repository impasse to describe the broader intellectual environment that we could now develop. Libraries, administrators and faculty can muddle through with variations on proprietary, publisher-centered distribution. However, existing distribution channels cannot support more advanced scholarship: intellectual life increasingly depends upon open access to large bodies of machine actionable data.

The larger picture depicted by the report demands an environment in which open access becomes an essential principle for intellectual life. The more pervasive that principle, the greater the pressure for instruments such as institutional repositories that can provide efficient access to large bodies of machine actionable data over long periods of time. The report's authors summarize as follows the goal of the project around which this workshop was created:

To ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

To accomplish this goal, the report proposes a detailed seven-year plan to push cyberscholarship beyond prototypes and buzzwords, including action under the following rubrics:

  • Infrastructure: to develop and deploy a foundation for scalable, sustainable cyberscholarship
  • Research: to advance cyberscholarship capability through basic and applied research and development
  • Behaviors: to understand and incentivize personal, professional and organizational behaviors
  • Administration: to plan and manage the program at local, national and international levels

For members of the science, technology, engineering, and medical fields, the situation is promising. This report encourages the NSF to take the lead and, even if it does not pursue the particular recommendations advocated here, the NSF does have an Office of Cyberinfrastructure responsible for such issues, and, more importantly, enjoys a budget some twenty times larger than that of the National Endowment for the Humanities. In the United Kingdom, humanists may be reasonably optimistic, since JISC supports all academic disciplines with a healthy budget. Humanists in the US face a much more uncertain future.

PMR: I would agree with Greg that IRs are oversold and underdeliver. I never expected differently. I have never yet located a digital object I wanted in an IR except when I specifically went looking (e.g. for theses). And I went to Soton to see what papers of Stevan's were public and what their metadata were. But I have never found one through Google.

Why is this? The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you'll find: DSpace at Cambridge: NSC383501

But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn't indexed. So if you know how to look for it through the ID, fine. If you don't you won't.

I don't know what the situation is in the humanities, so I looked up the Fitzwilliam (the major museum in Cambridge) newsletter. Searching Google for "The Fitzwilliam Museum Newsletter Winter 2003/2004" found: DSpace at Cambridge: The Fitzwilliam Museum Newsletter 22. But when I looked for the first sentence, "The building phase of The Fitzwilliam Museum Courtyard", Google returned zero hits.

So (unless I'm wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. Jim explained that this was due to the handle system and the use of one level of indirection - Google indexes the metadata but not the data. (I suspect this is true of ePrints - I don't know about Fedora).

If this is true, then repositing at the moment may archive the data but it hides it from public view, except to diligent humans. So people are simply not seeing the benefit of repositing - they don't discover material through simple searches.

So I'm hoping that ORE will change all this. Because we can expose all the data as well as the metadata to search engines. That's one of the many reasons why I'm excited about our molecular repositories (eChemistry) project.

As I said in a previous post, it will change the public face of chemical information. The key word for this post is "public". In others we'll look at "chemical" and "information".

====================

[ans: German. Because the majority of scholarship in the C19 was in German.]

Open Access Data, Open Data Commons PDDL and CCZero

This is great news. We now have a widely agreed protocol for Open Data, channeled through Science Commons but with great input from several sources, including Talis and the Open Knowledge Foundation. Here is the OKFN report (I also got a mail from Paul Miller of Talis without a clear link to a webpage).

 

This means that the vast majority of scientists can simply add CCZero to their data. I shall do this from now on. Although I am sure that there will be edge cases, they shouldn't apply to ANYTHING in chemistry.

Good news for open data: Protocol for Implementing Open Access Data, Open Data Commons PDDL and CCZero

Jonathan Gray, Open Knowledge Foundation Weblog, 17/12/2007, 15:21

Last night Science Commons announced the release of the Protocol for Implementing Open Access Data:

The Protocol is a method for ensuring that scientific databases can be legally integrated with one another. The Protocol is built on the public domain status of data in many countries (including the United States) and provides legal certainty to both data deposit and data use. The protocol is not a license or legal tool in itself, but instead a methodology for a) creating such legal tools and b) marking data already in the public domain for machine-assisted discovery.

As well as working closely with the Open Knowledge Foundation, Talis and Jordan Hatcher, Science Commons have spent the last year consulting widely with international geospatial and biodiversity scientific communities. They’ve also made sure that the protocol is conformant with the Open Knowledge Definition:

We are also pleased to announce that the Open Knowledge Foundation has certified the Protocol as conforming to the Open Knowledge Definition. We think it’s important to avoid legal fragmentation at the early stages, and that one way to avoid that fragmentation is to work with the existing thought leaders like the OKF.

Also, Jordan Hatcher has just released a draft of the Public Domain Dedication & Licence (PDDL) and an accompanying document on open data community norms. This is also conformant with the Open Knowledge Definition:

The current draft PDDL is compliant with the newly released Science Commons draft protocol for the “Open Access Data Mark” and with the Open Knowledge Foundation’s Open Definition.

Furthermore Creative Commons have recently made public a new protocol called CCZero which will be released in January. CCZero will allow people:

(a) ASSERT that a work has no legal restrictions attached to it, OR
(b) WAIVE any rights associated with a work so it has no legal restrictions attached to it,
and
(c) “SIGN” the assertion or waiver.

All of this is fantastic news for open data!

Deepak Singh: Educating people about data ownership



I never got to watch the Bubble 2.0 video (I only heard it on net@nite). Before I could get to see it, it got taken down. Wired talks about the reasons behind the takedown. As a content producer who shares content online and as a scientist who has published papers and a not-so-casual observer of the entire content ownership debate, I am often torn by examples like this one.

What is important for the author? Is it monetary compensation? If content, scientific, media or otherwise is your primary source of income, you can understand why people get a little antsy when someone uses the content without permission. I know too many people, journalists, musicians, etc for whom their creativity is the sole source of income and they are all well meaning, even if they don’t always understand the environment that they operate in.

However, a lot of these issues date back to a world free of Creative Commons, which I believe is celebrating a 5th birthday this weekend. In today’s climate we have choice, so to some extent content owners need to make that choice and then live with their consequences. You can choose to publish your papers in a PLoS journal under a CC license, or you can choose to publish in a closed journal. Obviously, I belong to the open science camp, but I also believe that people have the choice of making decisions. They then must also live with the consequences of those decisions.

What we need is education. When Larry Lessig spoke at the University of Washington recently (I have the full recording if anyone is interested), I asked him a question on this very issue. How many people who upload pictures to flickr really understand the licensing options available to them? How many people understand the pros/cons and implications? Most scientists I know don’t even know what Creative Commons is, Science Commons even less so. On the flip side, do the majority of people wanting to use pictures, etc understand what they can do with media, the proper ways of attribution, etc? I doubt it. Even I am not always sure.

We have a plethora of resources available to us for sharing data, media and information. Scientists have the PLoS and BMC journals. You have resources to share data, documents, pictures, videos, screencasts, etc etc. It is up to us to decide where we put our information and how it is managed. It is also important for everyone to understand and respect those choices. The dialog on what is the best approach to sharing data and the advantages of open data can be discussed as we go along.

PMR: We have to liberate scientific images unless there is a good reason why not. There will continue to be problematic areas when re-use is mis-use. For example CC-BY would allow derivative works including - say - altering the gray scale or the pixels in an image. (I hope no-one would edit in an incorrect scale bar!) And it's important to keep the caption with the image - until we get better metadata packaging. But, in general all scientific images should be stamped CC-BY or SC. Scientific images are different from people's photographs. They are part of the scientific record. And they should NOT belong to the publisher.