I am talking tomorrow as an invited lecturer at the 14th Asian Chemical Congress in the Cheminformatics section (http://www.14acc.org/speakers.htm#s8). My message is that Cheminformatics needs Open resources (as in http://www.opendefinition.org/ ). I am not arguing that everything should be Open, but that everything critical should be. To summarise:
- Data should be open. Unless data are Open they cannot be:
  - Independently validated
  - Republished
  - Re-used for derivative works; this is where the innovation comes from
  - Used as reference sources
- Source code should be open to the extent that:
  - It should be possible to recalculate a model, a set of properties, or an analysis independently of closed systems
  - The algorithm used should be inspectable.

  This does not prevent proprietary codes being used for speed, convenience, etc., but they should not be the only way of verifying the work.
- Standards, including dictionaries. Where files are used to communicate data, the syntax must be agreed (e.g. OpenSmiles (http://www.opensmiles.org/ )) and the documentation openly visible. Where terms/metadata are used, they must be defined and agreed by the community (e.g. http://www.xml-cml.org/dictionary/ ). Modern dictionaries should be semantic (i.e. understandable by machine).

Chemistry, and cheminformatics even more, has very little in any of these areas. InChI is one of the few exceptions. Openness is being driven by funders, regulators, some government agencies and (from the bottom-up) the Blue Obelisk (http://sourceforge.net/apps/mediawiki/blueobelisk/index.php?title=Main_Page ).
Without Open Data/Source/Standards, computational/data-driven science is not reproducible.
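To make the point about independent recalculation concrete, here is a minimal sketch of a property that anyone can recompute and inspect: the molar mass of a simple formula. It is only an illustration. The function name `molar_mass` and the small `ATOMIC_WEIGHTS` table are assumptions introduced here, the weights are approximate standard values, and the parser handles nothing beyond element symbols with integer counts.

```python
import re

# Approximate standard atomic weights in g/mol. Only four elements are
# listed to keep the sketch short; a real tool would use an open,
# community-agreed dictionary of elements.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(formula: str) -> float:
    """Return the molar mass of a simple formula such as 'C2H6O'.

    Hypothetical helper: supports element symbols followed by optional
    integer counts only (no brackets, charges or isotopes).
    """
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject input that the pattern did not consume completely.
    if not tokens or "".join(s + n for s, n in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    total = 0.0
    for symbol, count in tokens:
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"no atomic weight for element: {symbol}")
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


if __name__ == "__main__":
    # Ethanol; every step of the calculation is visible above.
    print(f"{molar_mass('C2H6O'):.3f} g/mol")
```

Because both the data (the weights) and the algorithm are visible, a reader who disagrees with the result can see exactly where it came from and rerun it. That is the property that closed systems cannot offer.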
Many areas in science, especially bioscience, are driven by the vision of the Semantic Web and Linked Open Data (http://en.wikipedia.org/wiki/Linked_Data ) and its graph of linked datasets (http://en.wikipedia.org/wiki/File:Lod-datasets_2010-09-22_colored.png ). There is very little chemistry here, because very little is Open. Even KEGG will disappear because it is becoming closed.
I am working with the European Bioinformatics Institute on ChEBI (http://en.wikipedia.org/wiki/ChEBI ) and hopefully also on CHEMBL and related data. The bioinformatics community need Open chemical data and they are prepared to work to make it happen. Maybe at some stage the chemical community will see the value of Open knowledgebases. Until then we will continue to generate collections of computational chemistry, crystallography, spectra, and other properties by using machines to extract or generate them.
Here’s some material I presented earlier to the ChEBI group (2011-06-01) …
Web-based science relies on Linked Open Data.
Topics
- Vision: machines could publish and “read” current chemical data
- Almost no chemical data is effectively published
- There are technical and cultural problems
- Current publishing models are asymmetric; the author and reader have few rights or influence
- “Almost Open”, “Freely Accessible” is not good enough
- Individuals and small groups can change the world
- Wikipedia
- OpenStreetMap
- Quixote – reclaiming computational chemistry (http://quixote.wikispot.org/Front_Page and http://quixote.ch.cam.ac.uk/content/ )
- Software as an agent of political change
- Bottom-up Web 2.0 (The Blue Obelisk (http://www.blueobelisk.org ) and Quixote)
- Text and data mining
- Automated computation and aggregation of data
- Near-zero cost of robots – crystalEye
- eTheses
- Panton Principles
- Open bibliography
Resources
- Wikipedia
- “Open Data” on Wikipedia (http://en.wikipedia.org/wiki/Open_data )
- “Open Data in Science” (Murray-Rust on Nature Precedings (http://precedings.nature.com/ ))
- Science Commons (http://www.sciencecommons.org )
- Open Knowledge Foundation (http://www.okfn.org )
Recent Blogs
- /pmr/2011/03/28/open-data-what-i-shall-say-at-acs
- /pmr/2011/03/28/draft-panton-paper-on-textmining/
Some fallacies:
- “You can have SOME of the data” (ACS make 8000 CAS numbers freely available to Wikipedia)
- “The data are free for NON-COMMERCIAL use” (see my /pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ )
- “You can always ask permission and we’ll grant it”; PMR: doesn’t scale, doesn’t persist, can’t re-use
The key question: is the price of closed data worth it? Do the benefits outweigh the disadvantages? To help you decide:
| issue | closed data | open data |
| --- | --- | --- |
| sustainability | supported by income | few proven models |
| creation of business model | easyish | hard |
| added human value | often common | anything possible |
| support | usually good | depends on community |
| domain acceptability | well proven | often suspicious |
| cost | high; increasing? | marginal |
| innovation | central authority | fully open |
| reuse | normally NO | fully OPEN |
| speed from source | often slow | immediate |
| mashupability/LODD | very rare | almost universal |
| reaction to new tech. | often slow | very fast |
| comprehensiveness | very good to patchy | potentially v. high |
| global availability | often very poor | universal |
| acceptable to funders | variable; decreasing | very high |
In the current talk I shall stress tools for data extraction and creation. In particular:
- Open databases (Crystaleye1 (http://wwmm.ch.cam.ac.uk/crystaleye ), Crystaleye2 (http://crystaleye.ch.cam.ac.uk ) and the Crystallography Open Database). See also http://opencryst.wordpress.com
- OSCAR and Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ ) for text-mining (hopefully the Hargreaves report will stimulate text mining)
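For readers who have not seen text mining in action, the toy sketch below shows the basic idea: a machine reads an experimental sentence and pulls out the amounts. It is emphatically not how OSCAR or Chemical Tagger work; those tools do proper chemical entity recognition and parsing. The pattern `QUANTITY`, the function `extract_quantities` and the example sentence are all assumptions made up for this illustration.

```python
import re

# Toy pattern: a decimal number followed by one of a few units of mass,
# volume or amount. Real text-mining tools use far richer grammars.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mL|L|mmol|mol)\b")


def extract_quantities(text):
    """Return (value, unit) pairs found in a piece of experimental text."""
    return [(float(value), unit) for value, unit in QUANTITY.findall(text)]


if __name__ == "__main__":
    # Invented sentence in the style of an experimental section.
    sentence = (
        "Benzaldehyde (2.5 g, 23.6 mmol) was dissolved in ethanol (25 mL)."
    )
    for value, unit in extract_quantities(sentence):
        print(value, unit)
```

Even something this crude costs almost nothing to run over thousands of documents, which is why the right to mine the text matters so much.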
I hope to show demos of some/all of:
- OPSIN (http://opsin.ch.cam.ac.uk )
- Crystaleye1
- Crystaleye2, Quixote (http://quixote.ch.cam.ac.uk/ )
- Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ )
- OSCAR Data
- Avogadro
And, if you are excited about creating Open Chemistry, here are some tools to help (/pmr/2011/09/04/open-crystallography-how-to-start-it-and-where-should-we-base-it/ ).
Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.
>>>Interesting series of posts relating to the openness of crystallographic data. What is your feeling about the likelihood of the CCDC releasing the data?
I can’t say. I have made a good case.
>>>Although this would be quite painful, I imagine much of the data could be obtained by asking authors directly.
It goes back 15 years and I think the recall would be very low.