I am talking in a few minutes to a group of chemists, other scientists, computational scientists, informatics specialists, IR managers, etc. at ZCAM (computational chemistry) in Zaragoza. This is a very exciting project and we hope not only to talk, but actually to do things today.
Rather than use Powerpoint I blog my materials. A lot is present in previous blog posts, but this adds an overview of what I might say, and some of the materials I might use. What I actually say depends, as always, on what has already been said (and not said) and on the interests of the people present.
My motivation
[With ChEBI, Christoph Steinbeck] Compute properties (spectra, conformations, reactivity) of compounds in the human metabolome.
- Open to all – no central ownership (cf. Wikipedia). Not my project, but OURS
- Very cost-effective with a high potential for success
- A long-tail discipline, with discrete data.
Data Sharing
- Must be driven by scientists (researchers, editors)
- Should be domain-specific
Why share data?
- To promote MY work and receive credit (data citation)
- To save MY work
- To share MY datasets with ME (i.e. look for patterns, correlations)
- To share MY datasets with MY colleagues
- To share MY datasets with the world
- To improve methodology
- To validate science
What are the problems?
- People want to use their results as intellectual capital
- People can sell their data for money
- It takes effort and money
- It challenges established interests (priesthood, market)
- Chemists are more conservative than many disciplines
Why/how will it happen?
- Because individuals (e.g. grad students) find it useful
- Because groups find it useful
- Because journals find it useful enough to mandate
- Because funders require it
- Because developers (e.g. programs) find it useful
What should we do today?
- Make a wish list for compchem data sharing
- What is possible right now?
Resources related to Data Sharing
Recent blogs by PMR
Data repositories
Dryad A repository for data, especially biosciences
How to deposit in Dryad
Figshare Mark Hahnel’s community for sharing figures and data
Dataverse – publicly visible (not fully open) datasets in social science
dataverse deposit-terms
Authors in Figshare
Figures in Figshare
An author in Figshare
Proliferation_of_PBMCs,_expressed_as_stimulation_indices_ in Figshare
Validation of Science requires data
Spectral data should be digital Mestrec blog (NMR software) argues that if the data were digital, potential fraud would have been detected
The PDF of controversial data This data cannot be easily understood by machines so validation is impossible
CML, Quixote and Crystaleye Data sharers
Crystaleye home page
Crystaleye2 data sharer
Quixote data sharer
CML resources (dictionaries and conventions)
COMO Chem Eng knowledgebase
SPARQL query for Crystaleye2
[This will be used interactively with crystaleye2. Try it under SPARQL. It’s very new. If it works, congratulate Sam.
If it fails maybe the server is down, or blame me.]
Report only structures with R values less than 0.02:
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?uri ?rfactor {
?uri cif:refine_ls_r_factor_gt ?rfactor
FILTER (?rfactor < 0.02)
}
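For anyone who would rather script this than type it into the web form, here is a minimal Python sketch of the same query sent over the standard SPARQL Protocol. It is only a sketch under assumptions: the endpoint URL is a placeholder (the real one is not given here), the function names are hypothetical, and it assumes the server can return the standard SPARQL JSON results format.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# ASSUMPTION: placeholder endpoint; substitute the real SPARQL URL once known.
ENDPOINT = "http://example.org/sparql"

# The query from the text, with its closing brace.
QUERY = """
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?uri ?rfactor {
  ?uri cif:refine_ls_r_factor_gt ?rfactor
  FILTER (?rfactor < 0.02)
}
"""


def build_request_url(endpoint, query):
    """Encode the query as a SPARQL Protocol GET request ('query' parameter)."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def run_query(endpoint, query):
    """Send the query and parse the reply as SPARQL JSON results."""
    request = Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def low_r_structures(results, threshold=0.02):
    """Return (uri, rfactor) pairs below the threshold, best R factor first.

    The threshold is re-applied on the client because a store may hold
    R factors as plain strings, in which case the server-side FILTER
    might not compare them numerically.
    """
    rows = []
    for binding in results["results"]["bindings"]:
        rfactor = float(binding["rfactor"]["value"])
        if rfactor < threshold:
            rows.append((binding["uri"]["value"], rfactor))
    return sorted(rows, key=lambda row: row[1])
```

Usage would be something like `low_r_structures(run_query(ENDPOINT, QUERY))`, which needs a live endpoint; the URL building and result filtering can be tried offline.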
Is that SPARQL end point public?
You mean: is there an API/URL? Will have to ask Sam.
Oh, never mind: http://crystaleye.ch.cam.ac.uk/
I really need to have a close look at that software, used for Quixote and now for CrystalEye too… nice work!
It’s very nice. Still very new. It’s at https://bitbucket.org/chempound – suggest you see if you can get it running. Sam has developed a modular approach where different CML conventions can be used.
This is exciting work, sorry I missed the meeting. We had a very interesting session on chemical databases in Frederick, MD too. I met someone interested in sharing his data more widely and will be following up; there is potential to use Quixote there. I need to try to get an instance up and running now that the code is out there.