14 Asian Chemical Congress: Why we need Open Data/Source/Standards

I am talking tomorrow as an invited lecturer at the 14th Asian Chemical Congress in the Cheminformatics section (http://www.14acc.org/speakers.htm#s8). My message is that cheminformatics needs Open resources (as in http://www.opendefinition.org/ ). I am not arguing that everything should be Open, but that everything critical should be. To summarise:

Data should be open. Unless data are Open they cannot be:

Independently validated
Republished
Re-used for derivative works; this is where the innovation comes from
Used as reference sources

Source code should be open to the extent that:

It should be possible to recalculate a model, a set of properties, an analysis independently of closed systems
The algorithm used should be inspectable.

This does not prevent proprietary codes being used for speed, convenience, etc., but they should not be the only way of verifying the work.

Standards, including dictionaries. Where files are used to communicate data, the syntax must be agreed (e.g. OpenSMILES, http://www.opensmiles.org/ ) and the documentation openly visible. Where terms/metadata are used they must be defined and agreed by the community (e.g. http://www.xml-cml.org/dictionary/ ). Modern dictionaries should be semantic (i.e. understandable by machine).
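To make "understandable by machine" concrete, here is a minimal sketch in Python. The dictionary is a hypothetical, hand-written stand-in: its term ids, definitions and units are invented for illustration and are not taken from the CML dictionaries. The only point is that a program, rather than a human, can decide whether every term in a data record is defined and used with the agreed unit.

```python
# Hypothetical mini-dictionary: each term has a definition and a required unit.
# These entries are illustrative only, not taken from any real dictionary.
DICTIONARY = {
    "prop:melting_point": {
        "definition": "Temperature at which the solid melts",
        "unit": "K",
    },
    "prop:molar_mass": {
        "definition": "Mass of one mole of the substance",
        "unit": "g/mol",
    },
}


def undefined_terms(record, dictionary):
    """Return the terms in a record that the dictionary does not define."""
    return sorted(term for term in record if term not in dictionary)


def check_units(record, dictionary):
    """Return defined terms whose unit differs from the dictionary's unit."""
    return sorted(
        term
        for term, (value, unit) in record.items()
        if term in dictionary and dictionary[term]["unit"] != unit
    )


# A data record maps each term to a (value, unit) pair.
record = {
    "prop:melting_point": (273.15, "K"),
    "prop:molar_mass": (18.015, "kg/mol"),  # wrong unit on purpose
    "prop:colour": ("colourless", None),    # not in the dictionary
}

print(undefined_terms(record, DICTIONARY))
print(check_units(record, DICTIONARY))
```

A real community dictionary would of course be published at a stable address and carry far more than a unit, but even this much lets a robot reject a file whose terms nobody has defined.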

Chemistry, and cheminformatics even more, has very little in any of these areas. InChI is one of the few exceptions. Openness is being driven by funders, regulators, some government agencies and (from the bottom up) the Blue Obelisk (http://sourceforge.net/apps/mediawiki/blueobelisk/index.php?title=Main_Page ).

Without Open Data/Source/Standards, computational/data-driven science is not reproducible.

Many areas in science, especially bioscience, are driven by the vision of the Semantic Web and Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ) and its graph of linked datasets (http://en.wikipedia.org/wiki/File:Lod-datasets_2010-09-22_colored.png ). There is very little chemistry here, because very little is Open. Even KEGG will disappear because it is becoming closed.
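The idea behind linking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ex:` identifiers, the property names and the "reaction" are made-up placeholders, not terms from any real vocabulary, and a real system would use RDF tooling rather than plain lists. Two collections published independently become one graph simply because they use the same identifier for water.

```python
# Two independently published sets of (subject, predicate, object) triples.
# The "ex:" identifiers and property names are illustrative placeholders.
chemistry = [
    ("ex:water", "ex:inchi", "InChI=1S/H2O/h1H2"),
    ("ex:water", "ex:label", "water"),
]
biology = [
    ("ex:reaction42", "ex:hasParticipant", "ex:water"),
    ("ex:reaction42", "ex:label", "a hypothetical hydrolysis"),
]


def describe(subject, *graphs):
    """Collect every (predicate, object) stated about subject in any graph."""
    return [(p, o) for g in graphs for (s, p, o) in g if s == subject]


def follow(subject, predicate, *graphs):
    """Follow a link from subject and describe whatever it points to."""
    targets = [o for p, o in describe(subject, *graphs) if p == predicate]
    return {t: dict(describe(t, *graphs)) for t in targets}


# The biology data only names the participant; the chemistry data supplies
# its structure. Neither publisher needed to know about the other.
print(follow("ex:reaction42", "ex:hasParticipant", chemistry, biology))
```

If the chemistry collection is closed, the second lookup returns nothing useful, which is exactly why so little chemistry appears in the linked-data graph.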

I am working with the European Bioinformatics Institute on ChEBI (http://en.wikipedia.org/wiki/ChEBI ) and hopefully also on ChEMBL and related data. The bioinformatics community need Open chemical data and they are prepared to work to make it happen. Maybe at some stage the chemical community will see the value of Open knowledgebases. Until then we will continue to generate collections of computational chemistry, crystallography, spectra, and other properties by using machines to extract or generate them.

Here’s some material I presented earlier to the ChEBI group (2011-06-01) …

Web-based science relies on Linked Open Data.

Topics

Vision: machines could publish and “read” current chemical data
Almost no chemical data is effectively published
There are technical and cultural problems
Current publishing models are asymmetric; the author and reader have few rights or influence
“Almost Open”, “Freely Accessible” is not good enough
Individuals and small groups can change the world
- Wikipedia
- OpenStreetMap
- Quixote – reclaiming computational chemistry (http://quixote.wikispot.org/Front_Page and http://quixote.ch.cam.ac.uk/content/ )
- Software as an agent of political change
- Bottom-up Web 2.0 (the Blue Obelisk, http://www.blueobelisk.org , and Quixote)
- Text and data mining
- Automated computation and aggregation of data
- Near-zero cost of robots – crystalEye
- eTheses
- Panton Principles
- Open bibliography
Resources
“Open Data” on Wikipedia (http://en.wikipedia.org/wiki/Open_data )
“Open Data in Science” (Murray-Rust, on Nature Precedings, http://precedings.nature.com/ )
Science Commons (http://www.sciencecommons.org )
Open Knowledge Foundation (http://www.okfn.org )

Recent Blogs
/pmr/2011/03/28/open-data-what-i-shall-say-at-acs
/pmr/2011/03/28/draft-panton-paper-on-textmining/
/pmr/2011/03/28/biomedcentral-use-open-data-buttons-in-their-publications

Some fallacies:
“You can have SOME of the data” (ACS make 8000 CAS numbers freely available to Wikipedia)
“The data are free for NON-COMMERCIAL use” (see my /pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ )
“You can always ask permission and we’ll grant it”; PMR: doesn’t scale, doesn’t persist, can’t re-use

The key question: is the price of closed data worth it? Do the benefits outweigh the disadvantages? To help you decide:

issue	closed data	open data
sustainability	supported by income	few proven models
creation of business model	easyish	hard
added human value	often common	anything possible
support	usually good	depends on community
domain acceptability	well proven	often suspicious
cost	high; increasing?	marginal
innovation	central authority	fully open
reuse	normally NO	fully OPEN
speed from source	often slow	immediate
mashupability/LODD	very rare	almost universal
reaction to new tech.	often slow	very fast
comprehensiveness	very good to patchy	potentially v. high
global availability	often very poor	universal
acceptable to funders	variable; decreasing	very high

In the current talk I shall stress tools for data extraction and creation. In particular:

Open databases (Crystaleye1, http://wwmm.ch.cam.ac.uk/crystaleye ; Crystaleye2, http://crystaleye.ch.cam.ac.uk ; and the Crystallography Open Database). See also http://opencryst.wordpress.com
OSCAR and Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ ) for text-mining (hopefully the Hargreaves report will stimulate text mining)

I hope to show demos of some/all of:

OPSIN (http://opsin.ch.cam.ac.uk )
Crystaleye1
Crystaleye2, Quixote (http://quixote.ch.cam.ac.uk/ )
Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ )
OSCAR Data
Avogadro

And, if you are excited about creating Open Chemistry, here are some tools to help (/pmr/2011/09/04/open-crystallography-how-to-start-it-and-where-should-we-base-it/ ).

2 Responses to 14 Asian Chemical Congress: Why we need Open Data/Source/Standards

Sean says:

September 7, 2011 at 8:44 pm

Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.

- pm286 says:
  
  September 8, 2011 at 10:25 am
  
  >>>Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
  I can’t say. I have made a good case.
  >>>Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.
  It goes back 15 years and I think the recall would be very low.
