14th Asian Chemical Congress: Why we need Open Data/Source/Standards

I am talking tomorrow as an invited lecturer at the 14th Asian Chemical Congress in the Cheminformatics section (http://www.14acc.org/speakers.htm#s8). My message is that cheminformatics needs Open resources (as in http://www.opendefinition.org/ ). I am not arguing that everything should be Open, but that everything critical should be. To summarise:

  • Data should be Open. Unless data are Open they cannot be:
      1. Independently validated
      2. Republished
      3. Re-used for derivative works; this is where the innovation comes from
      4. Used as reference sources
  • Source code should be Open to the extent that:
      1. It should be possible to recalculate a model, a set of properties, or an analysis independently of closed systems
      2. The algorithm used should be inspectable.

    This does not prevent proprietary codes being used for speed, convenience, etc., but they should not be the only way of verifying the work.

  • Standards, including dictionaries. Where files are used to communicate data, the syntax must be agreed (e.g. OpenSMILES, http://www.opensmiles.org/ ) and the documentation openly visible. Where terms/metadata are used they must be defined and agreed by the community (e.g. http://www.xml-cml.org/dictionary/ ). Modern dictionaries should be semantic (i.e. understandable by machine).
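To make "understandable by machine" concrete, here is a minimal sketch of checking a metadata record against an agreed dictionary. The `ex:` term ids, the entry fields and the `validate` helper are all assumptions made up for illustration; they are not taken from the CML dictionary or any other published one.

```python
# Hypothetical dictionary entries: the ids and fields are illustrative only
# and are not copied from any published dictionary.
DICTIONARY = {
    "ex:meltingPoint": {
        "definition": "Temperature at which the solid melts",
        "units": "K",
        "type": float,
    },
    "ex:inchi": {
        "definition": "IUPAC International Chemical Identifier",
        "units": None,
        "type": str,
    },
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means every term is defined and well typed."""
    problems = []
    for term, value in record.items():
        entry = DICTIONARY.get(term)
        if entry is None:
            problems.append(f"undefined term: {term}")
        elif not isinstance(value, entry["type"]):
            problems.append(
                f"wrong type for {term}: expected {entry['type'].__name__}"
            )
    return problems
```

A record that uses only agreed terms passes silently, while a locally invented term such as `ex:mp` is flagged, which is the behaviour a community dictionary is meant to enforce.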

Chemistry, and cheminformatics even more, has very little in any of these areas. InChI is one of the few exceptions. Openness is being driven by funders, regulators, some government agencies and (from the bottom up) the Blue Obelisk (http://sourceforge.net/apps/mediawiki/blueobelisk/index.php?title=Main_Page ).

Without Open Data/Source/Standards, computational/data-driven science is not reproducible.
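As a toy illustration of independent validation, the sketch below recomputes a property (molar mass) from open data and open code and compares it with a reported value. The function names, the tolerance and the restriction to bracket-free formulae are my assumptions for the example; a real check would use a full atomic-weight table and a proper formula parser.

```python
import re

# Abridged standard atomic weights for a few common elements.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C2H6O' (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means one atom of that element.
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


def is_reproduced(formula: str, reported: float, tol: float = 0.01) -> bool:
    """True if the independently recomputed value agrees with the reported one."""
    return abs(molar_mass(formula) - reported) <= tol
```

Anyone can run `is_reproduced("C2H6O", 46.07)` without access to a closed system; that is the property the argument above asks for, whatever the scale of the calculation.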

Many areas in science, especially bioscience, are driven by the vision of the Semantic Web and Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ) and its graph (http://en.wikipedia.org/wiki/File:Lod-datasets_2010-09-22_colored.png ). There is very little chemistry here, because very little is Open. Even KEGG will disappear because it is becoming closed.
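The value of linking can be sketched with two tiny sets of triples from different sources that share an identifier. Everything here is hypothetical: the `ex:` URIs, the predicates and the helper functions are invented for the example, and a real system would use an RDF store and SPARQL rather than Python lists.

```python
# Each triple is (subject, predicate, object). All URIs are made up for illustration.
CHEM_SOURCE = [
    ("ex:compound/water", "ex:formula", "H2O"),
    ("ex:compound/ethanol", "ex:formula", "C2H6O"),
]
BIO_SOURCE = [
    ("ex:enzyme/adh", "ex:actsOn", "ex:compound/ethanol"),
]


def objects(triples, subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def substrates_with_formula(enzyme):
    """Join the two sources on the shared compound URI."""
    merged = CHEM_SOURCE + BIO_SOURCE
    return {
        compound: objects(merged, compound, "ex:formula")
        for compound in objects(merged, enzyme, "ex:actsOn")
    }
```

The join only works because both sources are free to be merged and agree on the compound identifier; if the chemical half is closed, the biological half has nothing to link to.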

I am working with the European Bioinformatics Institute on ChEBI (http://en.wikipedia.org/wiki/ChEBI ) and hopefully also on ChEMBL and related data. The bioinformatics community need Open chemical data and they are prepared to work to make it happen. Maybe at some stage the chemical community will see the value of Open knowledgebases. Until then we will continue to generate collections of computational chemistry, crystallography, spectra, and other properties by using machines to extract or generate them.

Here’s some material I presented earlier to the ChEBI group (2011-06-01) …

Web-based science relies on Linked Open Data.

Topics

issue                      | closed data          | open data
---------------------------|----------------------|--------------------
sustainability             | supported by income  | few proven models
creation of business model | easyish              | hard
added human value          | often common         | anything possible
support                    | usually good         | depends on community
domain acceptability       | well proven          | often suspicious
cost                       | high; increasing?    | marginal
innovation                 | central authority    | fully open
reuse                      | normally NO          | fully OPEN
speed from source          | often slow           | immediate
mashupability/LODD         | very rare            | almost universal
reaction to new tech.      | often slow           | very fast
comprehensiveness          | very good to patchy  | potentially v. high
global availability        | often very poor      | universal
acceptable to funders      | variable; decreasing | very high

In the current talk I shall stress tools for data extraction and creation. In particular:

I hope to show demos of some/all of:

And, if you are excited about creating Open Chemistry, here are some tools to help (/pmr/2011/09/04/open-crystallography-how-to-start-it-and-where-should-we-base-it/ ).

 

 

 


2 Responses to 14th Asian Chemical Congress: Why we need Open Data/Source/Standards

  1. Sean says:

    Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
    Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.

    • pm286 says:

      >>>Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
      I can’t say. I have made a good case.
      >>>Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.
      It goes back 15 years and I think the recall would be very low.
