Monthly Archives: March 2011

Open Data at the ACS

#acsanaheim

I spoke on Monday at the ACS “Open Data” session on the Panton Principles. I had to leave after my talk because I was speaking in the Education session, so my comments on the other papers are based on hearsay and their abstracts. There were only 4 contributed papers:

  • Mine
  • One on a commercial software project (the only OD reference was apparently “it would be nice to have some Open Data”)
  • The Cambridge Crystallographic Data Centre (CCDC) arguing that data should be curated and charged for, and that this business model had to be maintained. (I should point out that CCDC is the “official” repository of raw data from crystallographic experiments. Half the publishers (Springer, Wiley, Elsevier) do not publish supplemental crystallography, and the authors then donate the data to CCDC. If you want the data you either have to subscribe to the database or can only get a handful of structures (I think 25 out of 500,000). There is no right of re-use.)
  • A paper by the organizer Irina Sens who wasn’t able to come.

In another talk Steve Bachrach reviewed the SOAP report on Open Access. It says – no great surprise – that chemistry is well behind other sciences in OA – estimated at 5 years (I would increase this to 10).

I was told that the ACS supplemental data was now Open. Wow! I was going to jump up and down publicly. There is a JPA (Journal Publishing Agreement) on this: http://pubs.acs.org/userimages/ContentEditor/1285231362937/jpa_user_guide.pdf (11 pp). It (I quote, claiming fair use, as the document is copyright) “is a result of ACS’ ongoing efforts to provide the best possible publishing experience for our authors”. (I note this awful marketing word “experience” creeping into the language.) Here’s some more:

 

The new agreement specifically addresses what authors can do with the different versions of their manuscript—e.g. use in theses and collections, teaching and training, conference presentations, sharing with colleagues, and posting on websites and repositories.

The terms under which these uses can occur are clearly identified to prevent misunderstandings that could jeopardize final publication of a manuscript.

The new agreement clarifies that the transfer of copyright in Supporting Information is nonexclusive. Authors may use or authorize the use of Supporting Information in which they hold copyright for any purpose and in any format.

The new agreement extends key terms of use to an author’s previously published work with ACS—as long as the same conditions of use are met.

Behaviors expected of ACS authors are more fully addressed throughout the agreement.

I haven’t read it all, but these seem small positive steps. But I am more interested in what READERS (an archaic term replaced by “end-user”) can do. A reader is a human OR machine who actually wants to do something with the published material – to have an interactive experience. So, with great excitement, I turned to the conditions of use of ACS supplemental info:

 

Electronic Supporting Information files are available without a subscription to ACS Web Editions. The American Chemical Society holds a copyright ownership interest in any copyrightable Supporting Information. Files available from the ACS website may be downloaded for personal use only. Users are not otherwise permitted to reproduce, republish, redistribute, or sell any Supporting Information from the ACS website, either in whole or in part, in either machine-readable form or any other form without permission from the American Chemical Society. For permission to reproduce, republish and redistribute this material, requesters must process their own requests via the RightsLink permission system. Information about how to use the RightsLink permission system can be found at http://pubs.acs.org/page/copyright/permissions.html.

 

What’s changed? Here’s the same paragraph from about 5 years ago:

 

Electronic Supporting Information files are available without a subscription to ACS Web Editions. All files are copyrighted by the American Chemical Society. Files may be downloaded for personal use; users are not permitted to reproduce, republish, redistribute, or resell any Supporting Information, either in whole or in part, in either machine-readable form or any other form. For permission to reproduce this material, contact the ACS Copyright Office by e-mail at copyright@acs.org or by fax at 202-776-8112.

 

Well, the “end-user experience” is pretty much the same. You can’t do anything without permission. Oh dear – and I was so expectant.

Actually it’s worse. The old version meant a straight dialogue with the ACS – I carried this on over several years without much response. The new version has:

 

The American Chemical Society holds a copyright ownership interest in any copyrightable Supporting Information.

This is so wonderfully fuzzy that it guarantees that you will not get a clear response from the ACS as to what it means. (Well, actually, you won’t get a response anyway. I have had one response in 4 years of trying: “Let’s discuss it at the next ACS meeting”. Not yes, not no, but classic beautiful MUMBLE.)

 

By contrast, two cheers to Chemspider. Chemspider is not an Open resource – it is run by the RSC and the system and content are by default closed. They have collected and contributed data, and some of this is Open. Tony Williams showed that the Open Data items will be stamped with the OKF button.

 

Well done, Chemspiderman.

Open Theses at EURODOC: 2011-04-01; Sleepless in Seattle

#jiscopenbib #opentheses

As part of our JISCOpenBIB project we are running a workshop on Open Theses at EURODOC 2011. “We” is an extended community of volunteers centred around the main JISC project. In that project we have developed an approach to the representation of Open Bibliographic metadata, and now we are extending this to theses.

Why theses?

Because, surprisingly, many theses are not easily discoverable outside their universities. So we are running the workshop to see how much metadata we can collect on European theses. Things like name, university, subject, date, title – standard metadata.

For the workshop we’ll have an Etherpad: http://science.okfnpad.org/Conference-call-Eurodoc-Open-Theses-workshop-20110321 If you haven’t used an Etherpad, just go to the address; you can add your material into the pad. Let us know if you are interested in being involved.

There will be a datasheet for collecting data: https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dHFTNDhJU0xfdGhIT01WeTBMMDZWOGc&hl=en_GB&authkey=CJuy4owB#gid=0

We’ll also be collecting survey data (survey coming online very soon): http://bit.ly/Eurodoc-opentheses-survey

This workshop is not limited to those physically present. I shall be in Seattle, US – sleepless (it’ll be 0300 in the morning there). So all of us can and should participate. I’ll try to add MY thesis data (1967, but I think that counts as European?).

So I’ll blog more info as we create it. But 1300 WEST = 1200 UTC is the time we start – make a note to be involved.

 

 

ScholarlyHTML – ScholarlyChemistry!

#scholarlyhtml #acsanaheim

In this morning’s CINF program http://abstracts.acs.org/chem/241nm/program/divisionindex.php?act=presentations&val=Internet+and+Chemistry:+Social+Networking&ses=Internet+and+Chemistry:+Social+Networking&prog=54108

Alex Clark observed that there was such a mess of mutually incompatible mobile platforms (Apple, Blackberry, Android …) that the solution for chemistry was to adopt HTML5 and Javascript.

Just what we have concluded for Documents!

Let’s build the next generation of chemistry in HTML5! It’s a bit of work, but it will be worth it. I will start hacking CML …

What I shall present at ACS: Chemistry and Social networks on the Internet

#acsanaheim #greenchain #quixotechem

My presentation is diverse and unpredictable. It covers the material collected in the following posts:

 

Green Chain Results: summarized for talk at ACS

Summary of resources and results from the Green Chain Reaction. Collecting them into a blog post is the best way of gathering and probably preserving them.

 

http://greenchain.ch.cam.ac.uk/patents/

..
ben/
flob/
indexes/
jars/
patents.tgz
quixote/
results.tgz
results/
test/

http://greenchain.ch.cam.ac.uk/patents/results/


Listing of “/patents/results”

..
1996/
1998/
1999/
2000/
2000yearTotal.htm
2001/
2002/
2003/
2004/
2005/
2006/
2007/
2008/
2009/
2010/
9999/
complete.htm
dissolveTotal.htm

 

Typical years

Listing of “/patents/results/2000”

..
EPO-2000-02-23/
EPO-2000-03-01/
EPO-2000-04-05/
EPO-2000-06-07/
EPO-2000-07-12/
EPO-2000-08-09/
EPO-2000-09-13/
EPO-2000-10-11/
EPO-2000-11-08/
EPO-2000-12-13/
EPO-2000-12-20/
EPO-2000-12-27/
solventFrequency.htm
yearTotal.htm

Listing of “/patents/results/2008”

..
EPO-2008-01-30/
EPO-2008-02-27/
EPO-2008-03-26/
EPO-2008-04-02/
EPO-2008-04-30/
EPO-2008-05-28/
EPO-2008-06-25/
EPO-2008-07-30/
EPO-2008-08-27/
EPO-2008-09-24/
EPO-2008-10-01/
EPO-2008-11-26/
EPO-2008-12-24/
solventFrequency.htm
yearTotal.htm

 

ACS Talk on the Green Chain Reaction: summary of what we did

This is a collection of all the blog posts that I made on the Green Chain Reaction. They give some idea of the tight deadline and the volunteering. It’s followed by a scrape of the (now-spammed) site…

 


  • #solo2010: The Green Chain Reaction; please get involved! (Sunday, August 8th, 2010)
  • #solo2010: Green Chain Reaction; details and offers of help (Monday, August 9th, 2010)
  • #solo2010: How safe / green is my reaction? Feedback requested (Tuesday, August 10th, 2010)
  • #solo2010: Computing volunteers needed for Green Chain Reaction (Tuesday, August 10th, 2010)
  • #solo2010: Green Chain Reaction – update and more volunteers! (Wednesday, August 11th, 2010)
  • #solo2010: Sharing Data is Good for All of Us (Wednesday, August 11th, 2010)
  • #solo2010: Where can we get Open Chemical Reactions? (Thursday, August 12th, 2010)
  • #solo10: The Green Chain Reaction is becoming a chain reaction! (Thursday, August 12th, 2010)
  • #solo10: Publishers, is your data Openly re-usable – please help us (Thursday, August 12th, 2010)
  • #solo10: Green Chain Reaction is using an Etherpad to collect its thoughts (Thursday, August 12th, 2010)
  • #solo10: Green Chain Reaction – much progress and continued request for help (Saturday, August 14th, 2010)
  • #solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ??? (Saturday, August 14th, 2010)
  • #solo10 GreenChainReaction: update and What is that Chemical Compound? (Sunday, August 15th, 2010)
  • #solo10 GreenChainReaction: Some chemical information is appalling quality, but does anyone care? (Sunday, August 15th, 2010)
  • #solo10 GreenChainReaction: Update and continued request for help (Monday, August 16th, 2010)
  • #solo10 GreenChainReaction: We need more volunteers – and yes, it will be fun! (Tuesday, August 17th, 2010)
  • #solo10 GreenChainReaction: Can you spot how green the reaction is? (Monday, August 16th, 2010)
  • #solo10 GreenChainReaction: A greenness calculator (Monday, August 16th, 2010)
  • #solo10 An introduction to textmining and data extraction (Thursday, August 19th, 2010)
  • #solo10 GreenChainReaction: Almost ready to go, so please volunteer (Wednesday, August 25th, 2010)
  • #solo10 GreenChainReaction: an Open Index of patents (Thursday, August 26th, 2010)
  • #solo10: Green Chain Reaction starts to deliver Open Notebook Science and Open Patents (Thursday, September 2nd, 2010)
  • #solo10: Green Chain Reaction here are the first results! Make your own judgment on greenness (Friday, September 3rd, 2010)
  • #solo10 GreenChainReaction first results and comments (Friday, September 3rd, 2010)
  • #solo10 Immediate reactions and thanks (Monday, September 6th, 2010)
  • #solo10 GreenChainReaction how we take this forward (Monday, September 6th, 2010)
  • #solo10: Green Chain Reaction – cleaning the data and next steps (Friday, September 10th, 2010)

 

History in SCRAPES from the Science Online site

 

General intro

 

You don’t have to be a chemist (or a practising scientist) to be involved in this experiment! The most important thing is that you offer to help. Here are some suggestions:

  • help organize and expand these pages
  • publicize the experiment
  • add your name and possible contribution here
  • help request data-mining permissions
  • read chemistry experiments and extract information manually
  • test and deploy software
  • run text-mining software to extract data from publications
  • organize the retrieved information
  • present the results in an exciting manner
  • create links
  • [add your suggestion]

Ready? Email Peter Murray-Rust and add your name to the contributors and participants below!

One of many offers: “Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.”

Who has volunteered

Current contributors (alphabetical, add links)

Unilever Credits

We must also give great credit to David Jessop (Unilever Centre) for having written the code which we have re-used in this experiment. David worked out how to

  • download patents and understand what comes down
  • restructure the XML into sections and paragraphs

and to Lezan Hawizy (Unilever Centre) who wrote the chemical natural language processing (NLP) tool, ChemicalTagger, which structures the text and identifies the chemical “parts of speech”,

and to Daniel Lowe who has created OPSIN, the best IUPAC chemical name-to-structure converter,

and to Sam Adams who has helped with lots of stuff including Hudson, Pivot and general insight.
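As a flavour of what Daniel’s OPSIN does, here is a minimal sketch of calling it from Java. This is my own illustration rather than project code, and it assumes a reasonably recent OPSIN release on the classpath; method names may differ in older versions.

    import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
    import uk.ac.cam.ch.wwmm.opsin.OpsinResult;

    public class NameToSmilesDemo {
        public static void main(String[] args) {
            // OPSIN converts systematic IUPAC names into chemical structures
            NameToStructure nts = NameToStructure.getInstance();
            OpsinResult result = nts.parseChemicalName("2,4,6-trinitrotoluene");
            if (result.getStatus() == OpsinResult.OPSIN_RESULT_STATUS.SUCCESS) {
                System.out.println(result.getSmiles()); // SMILES string for the parsed structure
            } else {
                System.out.println("Could not parse name: " + result.getMessage());
            }
        }
    }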

ISITOPEN requests

We need people to help with legwork behind requests to IsItOpen.

IsItOpen aims to facilitate enquiries of data holders about the openness of the data they hold – and to record publicly the results of those efforts. It is inspired by What Do They Know? (whatdotheyknow.com), a site which allows citizens to make requests to find out information to which they have a right. It works by identifying the right place to ask, assisting users to make requests, and publicly displaying the results.

The Green Chain Reaction project will be using IsItOpen to ask publishers whether we can data-mine their chemical reactions. We know we can do this with BMC, PLoS, Acta E and a few others, but we need to capture and record their formal replies. In particular it’s valuable to make sure they get courteous replies (even if they say no). So we’d like to start by asking the committed OA publishers, get “yes” from them, and then approach the not-so-committed ones.

How far we’ll get before the meeting we don’t know. We’d like to present this at the meeting and at least be able to show some successes.

CODE

Anyone can contribute or develop software as long as it is Open Source. Initially we start with code from the Unilever Centre.

Code written by members of the Unilever Centre for text-mining, high-throughput computation, semantic web, etc. Most of this has been built under our Hudson continuous integration system and should be reasonably portable, but it has not been widely deployed. Early adopters should be familiar with:

  • Maven 2.0
  • Java 1.6
  • an IDE (we have used Eclipse, IntelliJ and Netbeans).

Code to be tested and deployed (please be gentle – the README hasn’t been written yet):

  • Crystaleye processor. Code to extract entries from Acta Crystallographica E. Please check out and run the tests and report problems. (DanH and MarkW have got this running – thx)
  • test data from ActaE for CrystaleyeProcessor
  • Code to extract patent entries and analyse them. Alpha-release now on Bitbucket; needs customising.

Getting started quickly:

CODE AND DATA FOR HACKERS

Prerequisites: Java, Maven and Mercurial.

If you are unfamiliar with these technologies then check this page for some useful tips and links.

These instructions have been tested on Linux and Windows 7

For analysing papers (output to crystaleyeTestData/html/x/y/z/…):

For patent analysis (output to patentData/EPO-yyyy-MM-dd):

To build executable jar for patent analysis, including all dependencies:

  • mvn -Dmaven.test.skip=true assembly:assembly
  • generates target/patent-analysis-0.0.1-jar-with-dependencies.jar
  • See this page for instructions on running the jar

To obtain new patent catalogs in the patentData folder:

If you want patent analysis to use self-built crystaleye-moieties then perform this command in crystaleye-moieties folder:

  • mvn -Dmaven.test.skip=true install

INSTRUCTIONS

What is going on here?

What you are going to do is download a small program that runs in Java. You almost certainly have Java installed on your computer if you have a web browser. The program reads an instruction file which tells it how to read through a list of patents that relate to chemistry. You will also need to download two further files; instructions are given below.

Why would I want to do this?

This project is attempting to answer a question by getting computers to “read” as many patents as possible, from the recent to the quite old. The question we are asking is “Is chemistry becoming more green in the processes and reagents that it uses?” To do this work we are asking volunteers to become involved by contributing their computing resources to help read the patents. No knowledge of chemistry is necessary!

More generally we are trying to demonstrate the feasibility of collecting information from across a wide range of documents that relate to science, to ask wider questions. The results of this work will be presented at Science Online London 2010 in a few weeks’ time.
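To make the aggregation idea concrete, here is a toy sketch – mine, not the project’s code – of the kind of counting the experiment performs over each patent’s text. The real pipeline uses ChemicalTagger’s NLP rather than naive string matching, and the solvent list here is purely illustrative.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ToySolventCounter {
        // Illustrative list only; the real analysis recognises many more solvents via NLP
        private static final List<String> SOLVENTS =
            Arrays.asList("water", "ethanol", "dichloromethane", "toluene");

        // Count how often each solvent name occurs in one patent's text
        public static Map<String, Integer> countSolvents(String patentText) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String lower = patentText.toLowerCase();
            for (String solvent : SOLVENTS) {
                int n = 0, from = 0;
                while ((from = lower.indexOf(solvent, from)) != -1) {
                    n++;
                    from += solvent.length();
                }
                if (n > 0) counts.put(solvent, n);
            }
            return counts;
        }

        public static void main(String[] args) {
            String text = "The residue was dissolved in ethanol and washed twice with water; water was removed in vacuo.";
            System.out.println(countSolvents(text)); // e.g. {water=2, ethanol=1}
        }
    }

Summing such counts over every patent in a year’s indexes gives exactly the sort of per-year solvent-frequency tables (solventFrequency.htm, yearTotal.htm) listed above.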

Sounds great! How do I do it?

Prerequisites: Java

Instructions for analysing patents:

Latest instructions for the experienced

  1. Please always use the code from Hudson.
  2. Download the latest jar from https://hudson.ch.cam.ac.uk/job/patentanalysis/lastSuccessfulBuild/wwmm$patent-analysis/patent-analysis-0.0.1-jar-with-dependencies.jar which has been lightly tested.
  3. Create a folder named e.g. patentData; this is where the index lives and where the results will be written.
  4. Download http://greenchain.ch.cam.ac.uk/patents/jars/parsePatent.xml into anywhere convenient – yourDir.
  5. Download http://greenchain.ch.cam.ac.uk/patents/jars/uploadWeek.xml to anywhere convenient – yourDir.
  6. Download a random patent catalogue (though pre-1990 may be lacking chemistry patents) from http://greenchain.ch.cam.ac.uk/patents/indexes/ into the patentData folder.
  7. Run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/parsePatent.xml -d <patentData>”
  8. Then run “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar -p <yourDir>/uploadWeek.xml -d <patentData>”

to upload the results.
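If you are processing several indexes, a small wrapper can save retyping the two commands. This is a hypothetical convenience of my own, not part of the distributed tools; it assumes Java 7 or later (for inheritIO()) and that the jar and both instruction files sit in the working directory alongside patentData.

    import java.io.IOException;

    public class RunPatentAnalysis {
        // Runs one step of the analysis by invoking the distributed jar as a child process
        static void step(String instructionFile) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                "java", "-Xmx512m", "-jar", "patent-analysis-0.0.1-jar-with-dependencies.jar",
                "-p", instructionFile, "-d", "patentData");
            pb.inheritIO(); // show the tool's own output in this console
            int exit = pb.start().waitFor();
            if (exit != 0) {
                throw new IOException(instructionFile + " step failed with exit code " + exit);
            }
        }

        public static void main(String[] args) throws Exception {
            step("parsePatent.xml"); // analyse the downloaded index
            step("uploadWeek.xml");  // then upload the results
        }
    }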

THE FOLLOWING SECTION MAY BE OBSOLETE

More detailed instructions for the less confident volunteer (but check filenames against those above):

  1. Downloading the software tools and creating a working directory
    1. Open a browser and paste the following link into your address bar: http://dl.dropbox.com/u/1120779/solo/patentData.zip A download should start automatically. It might take a little while (around 40 seconds for me).
    2. Once you’ve downloaded the zip file, find it (your browser should help you with this) and unzip it. In most cases double-clicking, or right-clicking and selecting “Unzip” or something similar, should do the job.
    3. Check that you have three files in the unzipped folder; they should be called “parsePatent.xml”, “uploadSolvent.xml”, and “patent-analysis-0.0.1-with-dependencies.jar”
    4. Drag the folder to somewhere convenient, like the desktop or your documents folder
  2. Second step – getting a patent index
    1. Point your browser at http://greenchain.ch.cam.ac.uk/patents/indexes This takes you to the main index.
    2. You can select any year. Probably not much point going for ones much before 1990.
    3. Then select an index. It is probably easiest to right-click (or click-hold on a Mac) and choose “Save target as…”. Save the index file into the directory with the tools, somewhere you can remember it. Now you are ready to…
  3. Do the analysis!
    1. Open a terminal window.
      1. Windows: In Start Menu select “Run” and type “cmd” and press return
      2. Mac: Open “Terminal” from Applications -> Utilities
    2. Navigate to your directory.
      1. On Windows or Mac, if the directory is on the desktop, try “cd Desktop/patentData”
    3. In the terminal type the command “java -Xmx512m -jar patent-analysis-0.0.1-jar-with-dependencies.jar parsePatent.xml”
    4. This should then run the extraction. Sit back and enjoy the nice warm feeling. The analysis will take between 10 and 60 minutes depending on how many patents are in the index.
    5. When the program has finished running you are ready to upload the results. At the command prompt type “java -jar patent-analysis-0.0.1-jar-with-dependencies.jar uploadSolvent.xml”
  4. All done! You can now go back to Step 2, pick a different patent index and start again…(you might want to delete all the folders and files that have been created first just to keep things clear and tidy)

History of Internet Social Networks in Chemistry: can we create a collective memory?

#greenchain #acsanaheim #quixotechem #blueobelisk

I am leading off today’s ACS program on the Internet and Social Networking http://abstracts.acs.org/chem/241nm/program/divisionindex.php?act=presentations&val=Internet+and+Chemistry:+Social+Networking&ses=Internet+and+Chemistry:+Social+Networking&prog=54108 . I intend to talk about our social experiment last year at Science Online (the “greenchain reaction”) but this is also a wonderful opportunity to create a collective memory of the last ~20 years. Memory fades (at least mine) and the Internet record decays with ruthless and terrifying speed. So here’s the idea (only made possible because we have good wifi in the centre):

We create an Etherpad (http://okfnpad.org/internetChemistry ) of the timeline of the last 20 years (1990-) and populate it communally, from within the real-life audience and also wider (anyone who sees this post or related tweets). If you have never used an Etherpad before, it’s trivial: just enter your name or alias and start typing. I am going to seed it with some timepoints which I think are critical. I will almost certainly get dates wrong and miss people. So here’s my first pass… current pad scraped…

History of Chemistry on the Internet 1990-

Emphasis on Social networks or seminal technologies (not organizational presence)

Please enter events or technologies or resources that are seminal to the development of social networks either directly (e.g. a wiki) or indirectly (e.g. Rasmol). Commercial entities are welcomed if they contribute to the history (e.g. Chemweb) but please avoid treating this as an opportunity for product placement – e.g. “AcmeChem put its catalog online”.

Many of the dates are certainly wrong – please correct them!

1990:
1991: HTML (TB-L)
1992: BioMOO
1993: Kinemage?; Rasmol; Mosaic
1994: WWW1 at Geneva, demos of Rasmol etc.; Chemical MIME; Chemical Markup Language
1995: Chime?; Hyperactive molecules (HSR et al); Principles of Protein Structure Internet Course
1996: ECTOC-1/2/3; Structure Based Drug Design Course; biomednet?
1997: Chemweb
1998: Chemweb launch of CML
1999:
2000:
2001: First datument (MW, HSR, PMR in RSC)
2002:
2003:
2004: Internet Journal Chemistry; ZINC?
2005: Blue Obelisk
2006:
2007: Chemistry in Second life
2008: Chemspider
2009:
2010: Green Chain Reaction; Quixote
2011: ACS Internet meeting

 

Please visit the pad and contribute – before, during and after the meeting. It’s not meant to be PMR-centric.

Peter Sefton has a great tool for formatting the pad – it will look pretty after that, and it makes a useful static snapshot.

 

Open Data: latest overview. Please comment

#acsanaheim #opendata #crystaleye

I’ve more-or-less put my thoughts together for the session on Open Data. It seems to me that the key question is whether the price we pay for traditional closed data is worth it – not just the monetary cost, but the opportunity cost, particularly in universal access and re-use. I’ve created a list of issues which I’d like you to think about – I have tried to be fair. If you feel strongly, please edit the Etherpad:

Overview

VERY SORRY!! I HAVE TO LEAVE AT END OF TALK AS I AM TALKING IN ANOTHER SESSION

Web-based science relies on Linked Open Data.

Topics

  • Almost no scientific data is effectively published
  • “Almost Open”, “Freely Accessible” is not good enough
  • Open Knowledge Foundation – defines Open and DOES THINGS
  • Individuals and small groups can change the world
    • Wikipedia
    • OpenStreetMap – the Ordnance Survey generates 100 M GBP per year but open maps bring 500 M to the economy
    • What Do They Know? (Web democracy through FOI)
    • Quixote – reclaiming computational chemistry
  • Current publishing models are asymmetric; the author and reader have few rights or influence
  • Software as an agent of political change
  • Web democracy – cf. Wikipedia
  • Bottom-up Web 2.0 (The Blue Obelisk and Quixote)
  • Text and data mining
  • Panton Principles
  • Near-zero cost of robots – crystalEye
  • eTheses

Resources

  • “Open Data” on Wikipedia
  • “Open Data in Science” (Murray-Rust), Nature Precedings (http://precedings.nature.com)
  • Science Commons
  • Open Knowledge Foundation

Recent Blogs

  • http://blogs.ch.cam.ac.uk/pmr/2011/03/28/open-data-what-i-shall-say-at-acs
  • http://blogs.ch.cam.ac.uk/pmr/2011/03/28/draft-panton-paper-on-textmining/

Some fallacies:

  • “You can have SOME of the data” (ACS makes 8000 CAS numbers freely available to Wikipedia)
  • “The data are free for NON-COMMERCIAL use” (see my http://blogs.ch.cam.ac.uk/pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/)
  • “You can always ask permission and we’ll grant it”; PMR: doesn’t scale, doesn’t persist, can’t re-use

The key question: is the price of closed data worth it? Do the benefits outweigh the disadvantages? To help you decide:

issue | closed data | open data
sustainability | supported by income | few proven models
creation of business model | easyish | hard
added human value | often common | possible
support | usually good | depends on community
acceptability | well proven | often suspicious
cost | high; increasing? | marginal
innovation | central authority | fully open
reuse | normally NO | fully OPEN
speed from source | often slow | immediate
mashupability/LODD | very rare | almost universal
reaction to new tech. | often slow | very fast
comprehensiveness | often patchy | potentially v. high
global availability | often very poor | universal

 

I have started an Etherpad at http://okfnpad.org/openClosedData. Please feel free to contribute

BiomedCentral use Open Data buttons in their publications

#opendata #acsanaheim #pantonprinciples

Last night I asked Jan Kuras of BiomedCentral (BMC) whether any of their publications specifically declared their data as Open (OKD-compliant). Here’s his immediate reply:

Hi Peter

 

The following papers at BMC Bioinformatics have Open Data within the Additional Files, Tables:

 

compomics-utilities: an open-source Java library for computational proteomics

Harald Barsnes, Marc Vaudel, Niklaas Colaert, Kenny Helsens, Albert Sickmann, Frode S Berven, Lennart Martens

BMC Bioinformatics 2011, 12:70 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/70

 

A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg, Ulla Stenius

BMC Bioinformatics 2011, 12:69 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/69

 

Comparing genotyping algorithms for Illumina’s Infinium whole-genome SNP BeadChips

Matthew E Ritchie, Ruijie Liu, Benilton S Carvalho, The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene), Rafael A Irizarry

BMC Bioinformatics 2011, 12:68 (8 March 2011)

http://www.biomedcentral.com/1471-2105/12/68

 

So I am delighted to blog this and – at least for me – launch the Open Data button as a viable, respected and used tool for clarifying and asserting Openness.

Let’s follow one of the links:

Comparing genotyping algorithms for Illumina’s Infinium whole-genome SNP BeadChips

Matthew E Ritchie*, Ruijie Liu*, Benilton S Carvalho, The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene) and Rafael A Irizarry (* contributed equally)

Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia

Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia

Department of Oncology, University of Cambridge, CRUK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, North Wolfe Street E3035, Baltimore, MD 21205, USA

BMC Bioinformatics 2011, 12:68. doi:10.1186/1471-2105-12-68

 

Published: 8 March 2011

Additional files

Additional file 1: Supplemental Figures (PDF, 547 KB; viewable with Adobe Acrobat Reader)


And when opened the file contained:

[image: the Open Data button displayed in the supplementary file]

And the BMC explanation of Open Data is consistent with the OKDefinition:

  Brief summary of what Open Access means for the reader:

  Articles with this logo are immediately and permanently available online. Unrestricted use, distribution and reproduction in any medium is permitted, provided the article is properly cited. See our open access charter.

By open data we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. We encourage the use of fully open formats wherever possible.

Anyone is free:

  • to copy, distribute, and display the work;
  • to make derivative works;
  • to make commercial use of the work;

Under the following conditions: Attribution

  • the original author must be given credit;
  • for any reuse or distribution, it must be made clear to others what the license terms of this work are;
  • any of these conditions can be waived if the author gives permission.

Statutory fair use and other rights are in no way affected by the above.

This equates to a CC-BY, not a PDDL/CC0, licence but its effect is to make the data completely Open.

So great kudos to BMC. They have shown the way and it’s now easy for others to follow. Just create a protocol that all scientific data is Open and label it as such. I hope we can see Open Data buttons mushrooming everywhere.

Open Data: what I shall say at ACS

#pantonprinciples #opendata #acsanaheim

I am speaking on Open Data and the Panton Principles at ACS. It’s in a session solely on Open Data. But I suspect I am the only one who is promoting Free-as-in-speech. I get to kick off, so here’s an idea of what I am going to say… I have promised (gently) not to rant. But I cannot help myself if the arguments for Openness are overwhelming.

Note for non-chemists. Chemistry is a very non-Open subject. 95% of the journals are closed, and in the remaining 5% it’s difficult to get people other than zealots to publish. There is almost no Open Data. Publishers defend “their content” avidly and there have been lawsuits. About 0.01% of practising chemists know about or practise Openness (a guess, but probably a reasonable one).

The principles below hold for any science but the examples are chemistry.

Why is Open Data Important?

Data-rich sciences rely on data to:

  • Confirm or disprove experiments. When the data are not published the science cannot be verified or validated
  • Act as yardsticks for other methods (e.g. computational chemistry strives to replicate experiment. Experiment is moderated by theory)
  • Be re-used (mashed up, linked) in novel ways. There are over a thousand papers describing chemistry derived from the reported crystal structure literature

Traditionally only a small amount of data was published. Now, with printing costs irrelevant, it’s technically possible to publish the whole experiment.

Moreover much science reported in text is now processible by machines. I argue http://blogs.ch.cam.ac.uk/pmr/2011/03/28/draft-panton-paper-on-textmining/ that textmining can bring huge amounts of value to science.

What is the problem?

Most data is never published at all. That is partly laziness, partly selfishness, partly lack of technology, and partly the lack of a culture of publishing data.

The data that is published is often published in “textual” form. By default we are forbidden – not for scientific reasons but for legal and commercial ones – to use modern methods of textmining to extract this data. This means that the legal restrictions are holding back science, perhaps by 10 years. The effects of this are:

  • We have much less data and much less variety of content
  • The current quality of data could be much higher (machines do a good job of validation)
  • The efficiency of data creation could be much higher
  • We could detect more fraud or other questionable data
  • Data would be more immediately available (humans have a slower clock cycle)

The downside – which I am sure most closed-access publishers must concur with – is that some publishers cannot support their business model. So the equation is simple:

Support closed-access publishers at the cost of fewer, poorer, later data.

I don’t think anyone can disagree with this trade-off. I make no public judgment here – the choice is yours.

How to proceed?

If all publishers adopt a business model of Open Data content (and this is compatible with closed-access publishing) then we have a step forward. So I and others are asking all publishers to declare that the data in their publications are Open.

What is Open?

Open is free-as-in-speech (libre), not free-as-in-beer (gratis) [Richard Stallman]. Gratis means you can use something but you have no rights. Most of the free services in chemistry are gratis. They can be switched off tomorrow (and frequently have been – I can name many services which were free-to-use and now are not). With gratis material you cannot, as of right:

  • Create a derivative work – this curtails innovation
  • Rely on the material being persistent. This curtails mashups and linking
  • Publish aggregations, compute derivatives…
  • Create new ways of presenting and using the content

Libre allows all of this. Free-as-in-speech is exemplified for knowledge by the Open Knowledge Foundation’s Open Definition (http://www.opendefinition.org/):

The Open Knowledge Definition (OKD) sets out principles to define ‘openness’ in knowledge – that’s any kind of content or data ‘from sonnets to statistics, genes to geodata’. The definition can be summed up in the statement that “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

It’s very simple. Just remember “free to use, free to re-use, free to redistribute”. It is clinically effective at deciding whether something is Open.

So what’s the problem?

The problem is that almost no data is marked as Open. So by default its state is unknown. And in that case the default action has to be to say it’s not Open. You cannot guess, or use algorithms to determine whether something is Open. The only foolproof way is to let someone sue you and lose. That’s because it’s a legal thing, not a moral one. And algorithms and reasonableness and morals don’t work in law.

So the way round this is for content providers (I avoid “owners”) to declare that data are Open. And that is what we are asking YOU to do.

So Why the Panton Principles?

The problem is that it’s not legally trivial to declare something Open. It has taken the OKF and Creative/Science Commons two years to work out the best way, and the best way is to dedicate the data to the public domain. That needs a licence – and we suggest either PDDL or CC0 (not any old CC licence, but CC0). So we met over some years and finally came up with the Panton Principles: http://pantonprinciples.org/

There is no reason why all authors, publishers and funders should not endorse these principles – and some have.

So what’s the problem?

The problem is that many content providers don’t realise that this is a problem. So we’ve built a site to ask them about the Openness of their data: http://www.isitopendata.org/ . Here we can ask questions of content providers and record their answers – in public. This means that we can save them time and hassle by only asking the question once.

The answer is for content providers who wish to make it clear that their data is Open to add a licence to that data. It’s also useful to add a visual indicator such as the OKF’s Open Data button.

And where is this going?

The steps are:

  • To get content providers to consider the importance of Openness
  • To get them to make a considered decision (hopefully Open)
  • To get them to mark the content as Open
  • To get them to spread the idea.

The framework is compelling. The Panton Principles have been successfully applied to bibliography – an important part of science. And they have activated a set of Panton deliberations – discussions (audio/visual) and papers.

We need Open Data for better, quicker, more complete science. That may mean changing the business models. If so, we need to think soon…

 

 
