DataSharer principles: tested against GigaScience; partially Open, but not enough

GigaScience tweeted that they were studying my suggested principles for data repositories – which I shall now amend to DataSharers. I’d heard vaguely about GigaScience on the blogosphere but had not paid much attention, as their datasets are large and I am more interested in the long tail. But as they are at least interested in me I will have a look at them. In what follows I am probably simply ignorant, so corrections are welcome.

I am taking my information from: http://www.gigasciencejournal.com/about which seems to have some relation with BioMed Central, though nowhere is this very explicit. The Twitter feed comes from Shenzhen, China and the editorial board is from the BGI (http://en.wikipedia.org/wiki/Beijing_Genomics_Institute). After some time I have now found a press release: http://www.eurekalert.org/pub_releases/2011-07/bc-fde070611.php which is clearer:

BioMed Central and BGI launch a new integrated database and journal, to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data.”

GigaScience, an innovative new journal and integrated database to be launched by BioMed Central in November 2011, has released their first datasets to be given a Digital Object Identifier (DOI). This enables a long-needed way to properly recognize the data producers who have provided an untold number of essential resources to the entire research community. This not only promotes very rapid data release, but also provides easy access, reuse, tracking, and most importantly permanency for such datasets. The journal is being launched by a collaboration between BGI, the world’s largest genomics institute, and open access publisher BioMed Central, a leader in scientific data sharing and open data.

Media Contact
Matt McKay
Head of Public Relations, BioMed Central

PMR: Recommendation. If you are launching a new journal make it clear on the web page what the publishing organization is. It took me 15+ minutes of trawling the web to get these facts. The BGI seems to have commercial interests so I would want to know what the absolute policy on the journal is – who runs it, who runs the data?

So let’s see how the DataSharer principles match up. I’m still refining them. I am taking the view that DataSharers must be completely Open (OKD-compliant, libre). BMC and I share the same operational views on Openness. The final pointer to the BMC licence makes the general principles reasonably clear for articles (but not for data):

Authors of articles published in GigaScience retain the copyright of their articles and are free to reproduce and disseminate their work (for further details, see the BioMed Central copyright and license agreement)

An online open-access open-data journal, we publish ‘big-data’ studies from the entire spectrum of life and biomedical sciences. To achieve our goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources.

PMR: I will be interested to see the links.

GigaScience aims to increase transparency and reproducibility of research, emphasizing data quality and utility over subjective assessments of immediate impact. To enable future access and analyses, we require that all supporting data and source code be publically available and we provide an extensive database and cloud repository that can host associated data, supplementary information and tools.

PMR: This will be interesting. I question “publically available” as I’m not clear what this means in practice. (Note that it often isn’t easy to make all code available if parts of it have been licensed, e.g. from database vendors.)

A unique feature of our database is that important associated datasets can be given DOIs, providing both permanency and an additional citation. Thus GigaScience provides easier access to associated data as well as recognition for data producers.

PMR: Very important, but surely not “unique” – isn’t this what DataCite does?
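
A dataset DOI matters largely because it can resolve to machine-readable citation metadata as well as to a landing page. As a minimal sketch (not GigaScience's or DataCite's own code), the Python below builds a content-negotiation request for that metadata. The resolver address https://doi.org/ and the CSL JSON media type are my assumptions about what DOI resolvers accept; the DOI in the example is a made-up placeholder, and the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # assumption: the general-purpose DOI resolver
CSL_JSON = "application/vnd.citationstyles.csl+json"  # citation metadata media type


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a request asking for citation metadata."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi: str) -> dict:
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(build_metadata_request(doi), timeout=30) as reply:
        return json.load(reply)


if __name__ == "__main__":
    # Hypothetical placeholder DOI, not a real GigaScience dataset.
    request = build_metadata_request("10.1234/example-dataset")
    print(request.full_url, request.get_header("Accept"))
```

If the metadata returned this way carries an explicit licence statement, that would go some way towards the clear labelling I ask for later in this post.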

Open access

All articles published by GigaScience are made freely and permanently accessible online immediately upon publication, without subscription charges or registration barriers. Further information about open access can be found here.

PMR: There is nothing about data – data are different from articles, so this should be addressed specifically.

Indexing services

Following publication in GigaScience, the full-text of each article is deposited immediately and permanently in repositories in e-Depot, the National Library of the Netherlands’ digital archive of electronic publications. GigaScience is included in all major bibliographic databases. A complete list of indexing web services that include BioMed Central’s journals can be found here.

BioMed Central is working closely with Thomson Reuters (ISI) to ensure that citation analysis of articles published in GigaScience will be available.

PMR: It is critical that this indexing metadata is made specifically Open, identified as such and made available to the community. Otherwise BMC is granting third parties ownership over citation data that they can control and resell to the scientific community (as happens with other citations). Make data set citation data OPEN.

Publication and peer-review process

Suitability of research for publication in GigaScience is dependent primarily on the data quality and utility, rather than a subjective assessment of immediate impact. To encourage transparent reporting of scientific research as well as enable future access and analyses, it is a requirement of submission that all supporting data and source code be made available.

PMR: Excellent requirement – it won’t be easy.


Data and materials release

Submission of a manuscript to GigaScience implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes [PMR emphasis]. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article.

PMR: Why ever has the NC restriction been included? It’s inconsistent with everything that has been said before. It’s unenforceable. It goes against all current BMC policies AFAIK. Please, please remove it asap. I cannot regard GigaScience as Open while it remains. See http://blogs.ch.cam.ac.uk/pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ .
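
To make the objection concrete: under the Open Knowledge Definition a licence may require attribution or share-alike, but it may not restrict who uses the work or for what purpose. A rough sketch of that rule for Creative Commons style identifiers follows. The function name and the simple string matching are my own illustration, not an official conformance check.

```python
# Clause codes in Creative Commons style identifiers that fail the Open Definition.
NON_OPEN_CLAUSES = {"NC", "ND"}  # NonCommercial, NoDerivatives


def is_open_licence(identifier: str) -> bool:
    """Return True if a CC-style identifier carries no NC or ND clause.

    Illustrative only: it inspects the hyphen-separated parts of names such
    as "CC-BY-NC" and knows nothing about licences outside that pattern.
    """
    parts = identifier.upper().replace("_", "-").split("-")
    return not NON_OPEN_CLAUSES.intersection(parts)


if __name__ == "__main__":
    for name in ["CC0", "CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-NC-ND"]:
        print(name, "open" if is_open_licence(name) else "not open")
```

By this test, "freely available for non-commercial purposes" falls on the wrong side of the line however generous the rest of the policy is.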

Structures

Crystal structures of organic compounds can be deposited with the Cambridge Crystallographic Data Centre.

PMR: Structures in the CCDC are not Open. Their distribution is controlled by the CCDC and there is no right of re-use. Put them anywhere Open.

PMR: In general I get good vibes about GigaScience. I think they check most, but not all, of my initial principles. However I would like to see Data addressed specifically and consideration given to the Panton Principles for Open Data in Science http://pantonprinciples.org/ including clear labelling.

PMR: If you reply in comments these will be visible to everyone. I will treat them constructively.

UPDATE: Comment from GigaScience to this blog crossed this post.

 


2 Responses to DataSharer principles: tested against GigaScience; partially Open, but not enough

  1. Hi Peter,
    As I say this is work in progress (with a lot to do particularly with the database) so this detailed feedback is extremely useful. The journal is a collaboration between BMC and the BGI, with the external database hopefully producing some useful infrastructure by utilising the large amounts of server space and cloud computing resources of the BGI. As the theme of the journal is “big-data”, we want to maximise data usability and reuse, and so the openness of the data is obviously key here. We are trying to optimise the peer-review process for articles associated with large datasets, so we want to give maximum weight and credit for authors making everything as transparent and reproducible as possible.
    All data associated with articles in the journal will therefore be as Open as possible and we’ll use DOIs to make it searchable and citable (we are indeed working with DataCite on this). We are also using the same infrastructure to make more of BGI datasets publicly available, and we’ve been experimenting with CC0 (see: http://blogs.openaccesscentral.com/blogs/gigablog/entry/notes_from_an_e_coli), although it’s going to take some time to retrospectively get permissions from projects involving external collaborators, and it’s also taking time to make the database more robust and stable by moving it to Hong Kong.
    We are in the process of adding more detailed information about the database onto the site, and your feedback on what can be improved with the journal’s author instructions is very useful as we’ve inherited and built upon policies from a few sources. Your flagging the issue with the NC part is also very valid (and I agree unenforceable anyway), so I’ll work on getting that changed. Some of this is in the standard author instructions from BMC, so I’ll feed back the relevant parts to them. Can you suggest a better substitute for CCDC?
    We’ve obviously still got a bit of work to go, but I’m glad we are part of the way there. Hopefully talk at Science Online London, and thanks again.
    Scott

    • pm286 says:

      Great
      >>>All data associated with articles in the journal will therefore be as Open as possible and we’ll use DOIs to make it searchable and citable (we are indeed working with DataCite on this). We are also using the same infrastructure to make more of BGI datasets publicly available, and we’ve been experimenting with CC0 (see: http://blogs.openaccesscentral.com/blogs/gigablog/entry/notes_from_an_e_coli), although it’s going to take some time to retrospectively get permissions from projects involving external collaborators, and it’s also taking time to make the database more robust and stable by moving it to Hong Kong.
      I appreciate this is non-trivial. Generally the sooner we start the better.
      >>>We are in the process of adding more detailed information about the database onto the site, and your feedback on what can be improved with the journal’s author instructions is very useful as we’ve inherited and built upon policies from a few sources. Your flagging the issue with the NC part is also very valid (and I agree unenforceable anyway), so I’ll work on getting that changed. Some of this is in the standard author instructions from BMC, so I’ll feed back the relevant parts to them. Can you suggest a better substitute for CCDC?
      Yes, the Crystallography Open Database: http://www.crystallography.net/index.php
      P.
