I am talking in a few minutes to a group of chemists, other scientists, computational scientists, informatics specialists, IR managers, etc. at ZCAM (computational chemistry) in Zaragoza. This is a very exciting project and we hope not only to talk, but actually to do things today.
Rather than use PowerPoint I blog my materials. Much is present in previous blog posts, but this adds an overview of what I might say and some of the materials I might use. What I actually say depends, as always, on what has already been said (and not said) and on the interests of the people present.
My motivation
[With ChEBI, Christoph Steinbeck] Compute properties (spectra, conformations, reactivity) of compounds in the human metabolome.
Quixote…
- Open to all – no central ownership (cf. Wikipedia). Not my project, but OURS
- Very cost-effective with a high potential for success
- A long-tail discipline, with discrete data.
Data Sharing
- Must be driven by scientists (researchers, editors)
- Should be domain-specific
Why share data?
- To promote MY work and receive credit (data citation)
- To save MY work
- To share MY datasets with ME (i.e. look for patterns, correlations)
- To share MY datasets with MY colleagues
- To share MY datasets with the world
- To improve methodology
- To validate science
What are the problems?
- People want to use their results as intellectual capital
- People can sell their data for money
- It takes effort and money
- It challenges established interests (priesthood, market)
- Chemists are more conservative than many disciplines
Why/how will it happen?
- Because individuals (e.g. grad students) find it useful
- Because groups find it useful
- Because journals find it useful enough to mandate
- Because funders require it
- Because developers (e.g. of programs) find it useful
What should we do today?
- Make a wish list for compchem data sharing
- What is possible right now?
Resources related to Data Sharing
Recent blogs by PMR
- criteria-for-datasharers/
- data-repositories-for-long-tail-science-setting-the-scene/
- data-publication-some-replies/
Data repositories
- Dryad: a repository for data, especially biosciences
- 2011/07/28/uk-parliament-report-supports-dryad-and-data-access/
- How to deposit in Dryad
- Figshare: Mark Hahnel’s community for sharing figures and data
- Dataverse: publicly visible (not fully open) datasets in social science
- dataverse deposit-terms
- Authors in Figshare
- Figures in Figshare
- An author in Figshare
- Proliferation_of_PBMCs,_expressed_as_stimulation_indices_ in Figshare
Validation of Science requires data
- http://pubs.acs.org/doi/suppl/10.1021/jo200117p
- Spectral data should be digital: the Mestrec blog (NMR software) argues that if the data had been digital, potential fraud would have been detected
- The PDF of controversial data: this data cannot be easily understood by machines, so validation is impossible
CML, Quixote and Crystaleye Data sharers
- Crystaleye home page
- Crystaleye2 data sharer
- Quixote data sharer
- CML resources (dictionaries and conventions)
- COMO Chem Eng knowledgebase
SPARQL query for Crystaleye2
[This will be used interactively with Crystaleye2. Try it under SPARQL. It’s very new. If it works, congratulate Sam. If it fails, maybe the server is down, or blame me.]
Report only structures with R values less than 0.02:
# Select each structure URI together with its R factor, keeping only R < 0.02
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?uri ?rfactor WHERE {
  ?uri cif:refine_ls_r_factor_gt ?rfactor .
  FILTER (?rfactor < 0.02)
}
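For anyone who would rather send this query from a script than paste it into the web form, here is a minimal Python sketch. It only builds the query text and the request URL; it does not contact any server. The endpoint URL is a placeholder (this post does not give one, so ask Sam for the real address), and the function names `build_query` and `build_request_url` are made up for illustration. It assumes the server accepts the query as a `query` parameter on an HTTP GET, as the SPARQL Protocol describes.

```python
from urllib.parse import urlencode

# Namespace of the CIF dictionary, as used in the query in this post.
CIF_PREFIX = "http://www.xml-cml.org/dictionary/cif/"

# ASSUMPTION: placeholder address only. The real Crystaleye2 endpoint
# is not stated in this post.
ENDPOINT = "http://example.org/sparql"


def build_query(max_r: float) -> str:
    """Return the R-factor query with a configurable cutoff."""
    return (
        f"PREFIX cif: <{CIF_PREFIX}>\n"
        "SELECT ?uri ?rfactor WHERE {\n"
        "  ?uri cif:refine_ls_r_factor_gt ?rfactor .\n"
        f"  FILTER (?rfactor < {max_r})\n"
        "}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """URL-encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Print the URL that a browser or HTTP client could then fetch.
    print(build_request_url(ENDPOINT, build_query(0.02)))
```

Changing the argument to `build_query` gives a stricter or looser cutoff without editing the query text by hand.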
Is that SPARQL endpoint public?
You mean: is there an API/URL? I will have to ask Sam.
Oh, never mind: http://crystaleye.ch.cam.ac.uk/
I really need to have a close look at that software, used for Quixote and now for CrystalEye too… nice work!
It’s very nice. Still very new. It’s at https://bitbucket.org/chempound and I suggest you see if you can get it running. Sam has developed a modular approach where different CML conventions can be used.
This is exciting work; sorry I missed the meeting. We had a very interesting session on chemical databases in Frederick, MD too. I met someone interested in sharing his data more widely and will be following up, as there is potential to use Quixote there. I need to try to get an instance up and running now that the code is out there.