#IUCR2011: Open and Closed publication and Gatekeepers

I shall be explaining (very rapidly) the role of Open, Closed, text and data in my talk on Open Crystallography. I shall use a series of diagrams with consistent semiotics.

Before I start I want to make it clear that scientific publishing is not a zero-cost activity. I am not waving hands here; I am basing this on models that work. There may be major difficulties in changing en masse, but that is not the same as being woolly-headed and naïve. And I know it takes time and I am impatient. However, academic scientific research is a 100–1000 billion USD industry and the money is in there. It is a question of re-orienting it.

Here is the traditional publication model, without data. A piece of fulltext is published and the reader (usually through their library) pays. Individual readers can pay around 40 USD for a single day’s access to one article.

This discussion is orthogonal to peer review (please don’t complicate it by involving PR – ALL the diagrams can have PR or no PR). It is also slanted towards crystallography.

 

The components:

  • The funders fund the authors to do the research and publish fulltexts and data
  • The publisher decides whether and how to publish the paper (Gatekeeper)
  • The publisher uses a paywall to restrict access
  • The reader inputs money, reads fulltexts and data, but has no other say
  • Open means OKDefinition (Open Knowledge Definition) compliant. Anything else is closed

Gatekeepers

Most aspects of publishing are discussed endlessly, but the role of gatekeeper is critical here. The gatekeeper has complete and arbitrary say over what happens. The GK can restrict and filter input and can bar access. In the modern world GKs are frequently

  • In conflict with emerging ideas of web-democracy
  • Conservative
  • Self-serving; i.e. the operation is run for the benefit of gatekeepers and not the authors or the readers

In general, most gatekeepers outlive their usefulness.

 

 

Here the publisher arbitrarily refuses to publish data (no examples from crystallography)

Here the publisher accepts data and publishes it on their website, generally without any restrictions (effectively Open)

The publisher publishes the data to a closed repository run by another Gatekeeper. There may or may not be an agreement between the gatekeepers. The reader has paywalls and permission walls for both components. Nothing is Openly reusable.

The author pays an Open Access charge (normally “Gold OA”) and the publisher makes the text Openly accessible. Note that the publisher is still a gatekeeper in all other respects.

The OA model with data on the publisher’s site

The authors or publishers put the data in a domain-specific repository (e.g. Dryad, Tranche, PDB). Anyone can re-use the data. Funding models are variable (e.g. pay-for-deposit or grant).

The funder pays for the publication (e.g. the Wellcome/HHMI/MPI journal). The funder may have input into gatekeeping.

And finally the models emerging from the web culture

Here the readership, authorship and publishing mechanism merge into a meritocracy. There is still a need for gatekeeping, but decided by the community.

Tomorrow I will post a synopsis of my talk and give links.


#IUCR2011: Open Crystallography

I am speaking on 2011-08-29 on new methods of publishing crystallography including data. I shall prepare my talk as a series of blog posts, not necessarily in the order that they are presented at the meeting.

I am arguing that there should be a concept of Open Crystallography to which crystallographers and other communities (not restricted to scientists) can subscribe. The idea is that published crystallographic information should be Open to everyone. Open as in the Open Knowledge Foundation’s Definition (http://www.opendefinition.org/ ):

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

This is a crystal-clear operational definition – something either conforms completely or it does not conform. There are no intermediate positions.

Crystallography is among the most Open of disciplines. It has conducted its affairs for the benefit of the community and has pioneered the concept of publishing data, especially in the long-tail of science (where millions of data are published independently). It has an Open Access journal (Acta Crystallographica E) with a very modest fee and a very large authorship (1000s of articles per year).

In the days of print journals it was extremely pro-active in requiring the publication of crystallographic supporting data/information alongside text. Indeed if the IUCr had not argued for and won this process there would be many fewer examples of supplementary data published today.

There are several subdisciplines of crystallography. Macromolecular crystallography (proteins, nucleic acids, etc.) is supported by the PDB (http://www.pdb.org/ , the Protein Data Bank), which is effectively Open. People can copy entries, create derivatives, create mashups, reformat, etc. without permission.

My focus here is on chemical crystallography (small molecules). Although all supporting information must be submitted to journals, only some of them publish it visibly and Openly. High-volume publishers with Open supplemental information include:

  • American Chemical Society
  • Int. Union of Crystallography (IUCr)
  • Royal Society of Chemistry
  • Nature

And by default all Gold Open Access publishers (e.g. BMC)

In contrast a number of publishers (Elsevier, Springer (excluding BMC), Wiley/Blackwell) do not publish the supplemental information (or hide it behind paywalls). They send it (or get the author to send it) to the Cambridge Crystallographic Data Centre (CCDC). This information is hidden behind paywalls and permission walls. It is not Open.

There is now a growing groundswell for making small-molecule crystallography completely Open. Some of us have built tools to collect Open CIFs into our own repositories or to accept donations. [See /pmr/2007/12/22/update-on-open-crystallography/ for the historical position.] The largest of these are the Crystallography Open Database (COD) and our own Crystaleye.

Saulius and I met at this meeting and we have completely aligned objectives. We have agreed to use “Open Crystallography” as an umbrella for our efforts, and to exchange data and tools (I will explain this at the meeting). We can immediately donate our data to COD, and while most are duplicates there are clearly a number which are not.

Open Crystallography can follow the Panton Principles. [I have substituted ‘science’ by ‘crystallography’]

By open data in crystallography we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published crystallography should be explicitly placed in the public domain.

Formally, we recommend adopting and acting on the following principles:

  1. Where data or collections of data are published it is critical that they be published with a clear and explicit statement of the wishes and expectations of the publishers with respect to re-use and re-purposing of individual data elements, the whole data collection, and subsets of the collection. This statement should be precise, irrevocable, and based on an appropriate and recognized legal statement in the form of a waiver or license.

    When publishing data make an explicit and robust statement of your wishes.

  2. […] Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.

    Use a recognized waiver or license that is appropriate for data.

  3.  The use of licenses which limit commercial re-use or limit the production of derivative works by excluding use for particular purposes or by specific persons or organizations is STRONGLY discouraged. These licenses make it impossible to effectively integrate and re-purpose datasets and prevent commercial activities that could be used to support data preservation.

    If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.

  4. Furthermore, in science it is STRONGLY recommended that data, especially where publicly funded, be explicitly placed in the public domain via the use of the Public Domain Dedication and Licence or Creative Commons Zero Waiver. This is in keeping with the public funding of much scientific research and the general ethos of sharing and re-use within the scientific community.

    Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

There are probably about 300,000 datasets for small molecules (CIFs) now openly available through COD, Crystaleye and smaller collections. But there are probably about 50,000–150,000 CIFs published electronically but closed behind the CCDC walls. If we can Open these (and I am awaiting the CCDC’s reply) then all small-molecule crystallography becomes Open.

 

 


Talk at Int. Union of Crystallography: I ask for the availability of scientific data

I am giving an invited talk on Monday 2011-08-29 at the IUCr congress (http://www.iucr2011madrid.es/index.php/program/scientific-program ) about our Crystaleye system and more generally about new approaches to publishing science. I’ll be blogging a lot over the next 2 days to get all my ideas into posts.

 

Authors’ raw data is part of the necessary scientific record (i.e. the material required to support the authors’ claims) and there is a large groundswell of opinion that it should be universally published. Some disciplines already do this, and in the long tail of science crystallography leads the field. Every paper MUST be accompanied by the crystallographic data (CIF) and all publishers require authors to make this available. (More later.)

 

Crystaleye (written by Nick Day) is a system that reads the CIFs from publishers’ webpages and aggregates them into a browsable and searchable knowledgebase. So far it has collected over 200,000 different CIFs and covers both organic and inorganic data. It does this for journals published by the IUCr, the Am. Chem. Soc. (ACS) and the Royal Soc. Chem., because they publish the CIFs on their websites. Other publishers, however, do not. They send the data to the Cambridge Crystallographic Data Centre (CCDC, http://www.ccdc.cam.ac.uk). [Note I have no formal connection with the CCDC.]
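
Here is a minimal sketch (in Python – not the actual Crystaleye code, which is Java) of the kind of harvesting step described above: fetch a CIF from a publisher page and pull out a few simple data items. The URL is a hypothetical placeholder, and the toy parser deliberately ignores loop_ blocks and multi-line values.

import requests

# Hypothetical supplementary-data URL; real publisher URLs differ.
cif_url = "https://publisher.example.org/suppdata/structure1.cif"
cif_text = requests.get(cif_url).text

# Single-line CIF items of interest: a tag followed by its value on one line.
wanted = {"_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_refine_ls_R_factor_gt"}

items = {}
for line in cif_text.splitlines():
    parts = line.split(None, 1)
    if len(parts) == 2 and parts[0] in wanted:
        items[parts[0]] = parts[1].strip()

print(items)  # e.g. {'_cell_length_a': '10.123(2)', '_refine_ls_R_factor_gt': '0.0312', ...}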

 

The CCDC is the only place in the world allowed to hold these CIFs, and the CIFs are not on public view. The CCDC has many tens of thousands of them but normally restricts gratis access to very small numbers, usually one at a time, by email request. To get fuller access to the CIFs you have to pay an annual subscription. This subscription also provides added downstream value, which I do not need.

 

I would like the raw data from major publishers such as Wiley, Elsevier and Springer, which to my knowledge are at present held only at the CCDC. I believe that access to them is essential to doing modern science (I will explain why later), so I have asked the director of the CCDC (Dr Colin Groom) for these CIFs. Note that these are the CIFs deposited by the authors, not material enhanced by the CCDC (and of the same kind as those already routinely published by ACS, RSC, IUCr, etc.).

 

Colin and I met yesterday and he agreed to reply to my request. We understand each other’s positions but it will be useful for Monday’s talk to have a formal record and this is the mail I sent to Colin yesterday.

 

Colin,
Several journals (mainly Wiley, Elsevier, Springer, Science) do not publish the authors’ electronic CIFs but instead deposit them in the CCDC. These CIFs are a major part of the primary scientific record and can be used for validation, detection of fraud and error, systematic studies, mashups, etc.

I am asking whether the CCDC is prepared to make all these available as an Open collection (e.g. under CC0) so that the community can have bulk access to these without requiring further permission.

I asked the ICSD earlier this week and they were prepared to do this for the electronic CIFs they have received as part of the deposition process. If CCDC can do the same then the complete record of published electronic data will be available.

I am talking about new approaches to publication on Monday and would like to be able to present the CCDC’s formal position and any future plans on this issue.

I am around most of the time and happy to see if we can meet
 
Many thanks,

Peter

 

I will publish his reply in full and unabridged on this blog and discuss it in my talk on Monday.


Data sharing and Quixote meeting (Zaragoza)

I am talking in a few minutes to a group of chemists, other scientists, computational scientists, informatics specialists, IR managers, etc. at ZCAM (computational chemistry) in Zaragoza. This is a very exciting project and we hope not only to talk, but actually to do things today.

Rather than use Powerpoint I blog my materials. A lot is present in previous blog posts, but this adds an overview of what I might say and some of the materials I might use. What I actually say depends, as always, on what has already been said (and not said) and on the interests of the people present.

My motivation

[With ChEBI, Christoph Steinbeck] Compute properties (spectra, conformations, reactivity) of compounds in the human metabolome.

Quixote…

  • Open to all – no central ownership (cf. Wikipedia). Not my project, but OURS
  • Very cost-effective with a high potential for success
  • A long-tail discipline, with discrete data.

Data Sharing

  • Must be driven by scientists (researchers, editors)
  • Should be domain-specific

Why share data?

  • To promote MY work and receive credit (data citation)
  • To save MY work
  • To share MY datasets with ME (i.e. look for patterns, correlations)
  • To share MY datasets with MY colleagues
  • To share MY datasets with the world
  • To improve methodology
  • To validate science

What are the problems?

  • People want to use their results as intellectual capital
  • People can sell their data for money
  • It takes effort and money
  • It challenges established interests (priesthood, market)
  • Chemists are more conservative than many disciplines

Why/how will it happen?

  • Because individuals (e.g. grad students) find it useful
  • Because groups find it useful
  • Because journals find it useful enough to mandate
  • Because funders require it
  • Because developers (e.g. of programs) find it useful

What should we do today?

  • Make a wish list for compchem data sharing
  • What is possible right now?

Resources related to Data Sharing

Recent blogs by PMR

SPARQL query for Crystaleye2

[This will be used interactively with Crystaleye2. Try it under SPARQL. It’s very new. If it works, congratulate Sam. If it fails, maybe the server is down – or blame me.]

 

Report only structures with R values less than 0.02:

 

PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?uri ?rfactor {
    ?uri cif:refine_ls_r_factor_gt ?rfactor
    FILTER (?rfactor < 0.02)
}
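
If the endpoint is exposed over HTTP, the same query can be run from a script. Here is a minimal sketch in Python using SPARQLWrapper, assuming a hypothetical endpoint address (the real Crystaleye2 URL may differ):

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?uri ?rfactor {
    ?uri cif:refine_ls_r_factor_gt ?rfactor
    FILTER (?rfactor < 0.02)
}
"""

endpoint = SPARQLWrapper("https://crystaleye.example.org/sparql")  # hypothetical endpoint
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Print each structure URI with its R factor.
for row in results["results"]["bindings"]:
    print(row["uri"]["value"], row["rfactor"]["value"])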


Criteria for DataSharers

I am defining DataSharers as an ecology of places to make digital stuff available for sharing. I am still searching for a good term. Digisharer? Digital Sharer? I originally explored this list for repositories so the criteria are slightly different. Interestingly some of the criteria here are now almost trivially obvious, reflecting the innovative and community-based approach of science rather than the unclear world of repositories. So maybe some of the criteria below will dissolve away.

  • A DataSharer should have a clear purpose.
  • People should want to make stuff available through DataSharers.
  • People should want to get stuff via DataSharers.
  • The community running any DataSharer should be clear and their motivation transparent.
  • Most successful DataSharers are started by one or more identifiable people.
  • The DataSharer should generate a wider community that has a sense of ownership. Encourage it to innovate in searching and displaying the contents of the DataSharer.
  • The DataSharers and their ecology should be a dynamic organism. Build for evolution, not stasis.
  • Make the DataSharer successful rapidly. Plan for the “now”, not for future generations.
  • Make everything associated with the DataSharer OPEN and make this explicit.
  • Build DataSharers for machines and humans.
  • Make it clear what the extent of the DataSharer is, and make it iterable (a minimal sketch follows this list).
  • A DataSharer should be cloneable/forkable.
  • Do not rely on traditional metadata strategy.
  • Give depositors massive feedback.
  • Give everything unique IDs.

I’m promoting the idea of domain-specific Sharers, because this is much more likely to create a community. It’s also interesting to see them coupled to journals and traditional publishers. However this requires very careful scrutiny. There are excellent publishers (IUCr, BMC, EGU…) and there are others. Some publishers, both commercial and non-commercial, want to own our data. A DataSharer run by a traditional “owner and controller and reseller” of academic content is likely to need very careful scrutiny. Is the licence really Open? Is the data completely available? Is there a guarantee that it won’t get closed sometime in the future? I find it hard to see how an Open DataSharer could be coupled to a closed-access journal.

 


DataSharer principles: tested against GigaScience; partially Open, but not enough

GigaScience tweeted that they were studying my suggested principles for data repositories – which I shall now amend to DataSharers. I’d heard vaguely about GigaScience on the blogosphere but not paid huge attention, as their datasets are large and I am more interested in the long tail. But as they are at least interested in me I will have a look at them. In what follows I am probably simply ignorant, so corrections are welcome.

I am taking my information from http://www.gigasciencejournal.com/about which seems to have some relation with BioMed Central, though nowhere is this very explicit. The tweetfeed comes from Shenzhen, China and the editorial board is from the BGI (http://en.wikipedia.org/wiki/Beijing_Genomics_Institute ). After some time I have now found a press release, http://www.eurekalert.org/pub_releases/2011-07/bc-fde070611.php , which is clearer:

BioMed Central and BGI launch a new integrated database and journal, to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data.”

GigaScience, an innovative new journal and integrated database to be launched by BioMed Central in November 2011, has released their first datasets to be given a Digital Object Identifier (DOI). This enables a long-needed way to properly recognize the data producers who have provided an untold number of essential resources to the entire research community. This not only promotes very rapid data release, but also provides easy access, reuse, tracking, and most importantly permanency for such datasets. The journal is being launched by a collaboration between BGI, the world’s largest genomics institute, and open access publisher BioMed Central, a leader in scientific data sharing and open data.

Media Contact
Matt McKay
Head of Public Relations, BioMed Central

PMR: Recommendation. If you are launching a new journal make it clear on the web page what the publishing organization is. It took me 15+ minutes of trawling the web to get these facts. The BGI seems to have commercial interests so I would want to know what the absolute policy on the journal is – who runs it, who runs the data?

So let’s see how the DataSharer principles match up. I’m still refining them. I am taking the view that DataSharers must be completely Open (OKD-compliant, libre). BMC and I share the same operational views on Openness. The final pointer to the BMC licence makes the general principles reasonably clear for articles (but not for data).

Authors of articles published in GigaScience retain the copyright of their articles and are free to reproduce and disseminate their work (for further details, see the BioMed Central copyright and license agreement)

An online open-access open-data journal, we publish ‘big-data’ studies from the entire spectrum of life and biomedical sciences. To achieve our goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources.

PMR: I will be interested to see the links.

GigaScience aims to increase transparency and reproducibility of research, emphasizing data quality and utility over subjective assessments of immediate impact. To enable future access and analyses, we require that all supporting data and source code be publically available and we provide an extensive database and cloud repository that can host associated data, supplementary information and tools.

PMR: this will be interesting. I question “publically available” as I’m not clear what this means in practice (note, it often isn’t very easy to make all code available if part of it has been licensed, e.g. from database vendors).

A unique feature of our database is that important associated datasets can be given DOIs, providing both permanency and an additional citation. Thus GigaScience provides easier access to associated data as well as recognition for data producers.

PMR: Very important, but surely not “unique” – isn’t this what DataCite does?

Open access

All articles published by GigaScience are made freely and permanently accessible online immediately upon publication, without subscription charges or registration barriers. Further information about open access can be found here.

PMR: There is nothing about data – data are different from articles, so this should be addressed specifically.

Indexing services

Following publication in GigaScience, the full-text of each article is deposited immediately and permanently in repositories in e-Depot, the National Library of the Netherlands’ digital archive of electronic publications. GigaScience is included in all major bibliographic databases. A complete list of indexing web services that include BioMed Central’s journals can be found here.

BioMed Central is working closely with Thomson Reuters (ISI) to ensure that citation analysis of articles published in GigaScience will be available.

PMR: It is critical that this indexing metadata is made specifically Open, identified as such and made available to the community. Otherwise BMC is granting third parties ownership over citation data that they can control and resell to the scientific community (as happens with other citations). Make data set citation data OPEN.

Publication and peer-review process

Suitability of research for publication in GigaScience is dependent primarily on the data quality and utility, rather than a subjective assessment of immediate impact. To encourage transparent reporting of scientific research as well as enable future access and analyses, it is a requirement of submission that all supporting data and source code be made available.

PMR: Excellent requirement – it won’t be easy.


Data and materials release

Submission of a manuscript to GigaScience implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes
[PMR emphasis]. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article.

PMR: Why ever has the NC clause been included? It’s inconsistent with everything that has been said before. It’s unenforceable. It goes against all current BMC policies AFAIK. Please, please remove it asap. I cannot regard GigaScience as Open while it remains. See /pmr/2010/12/17/why-i-and-you-should-avoid-nc-licences/ .

Structures

Crystal structures of organic compounds can be deposited with the Cambridge Crystallographic Data Centre.

PMR: Structures in the CCDC are not Open. Their distribution is controlled by the CCDC and there is no right of re-use. Put them anywhere Open.

PMR: In general I get good vibes about GigaScience. I think they meet most, but not all, of my initial principles. However I would like to see data addressed specifically and consideration given to the Panton Principles for Open Data in Science (http://pantonprinciples.org/ ), including clear labelling.

PMR: If you reply in comments these will be visible to everyone. I will treat them constructively.

UPDATE: Comment from GigaScience to this blog crossed this post.

 


Talks: Data Journals and DataSharers for science

I have 3-4 important talks to give in the next 2-3 weeks about Openness, Data and how they come together. As I often do I expose the ideas on this blog and hope for feedback. I’m going to highlight the meetings, but first I’m going to explain why IRs are not valuable for what I want to do.

Institutional Repositories – possibly for the last time

I’ve been writing a lot about “Repositories” and I am going to stop writing about them. My 6+ years of involvement with formal “repositories”, essentially Institutional Repositories (IRs), has convinced me that their design and philosophy is completely wrong for modern scientific research, especially data-driven research. They can’t be mended or adapted for science. Where their purposes are clear (and most do not have a single clear purpose) they are designed and used for static book-like and article-like objects such as theses, e-articles and reports. The purposes are usually either preservation or research management for the university’s benefit. They are heavily institution-oriented and so only of interest to depositors and users closely involved with the institution. Nothing wrong with that, but some practitioners and advocates go beyond that and suggest they are general solutions – they are not and they will never be.

I’ve enjoyed being part of this community. I’ve been able to get some funding (JISC, Microsoft, thanks). The developer groups have been imaginative and energetic and there has been a small spinoff in generic software and protocols (e.g. SWORD2). But it’s clear after this time that the idea of a “Repository” – as somewhere you deposit a fixed and final precious digital object unrelated to the rest of the content – is not what we now want in science. The concept of “repository” has never engaged scientists – they either think of “databases” or of systems for sharing community resources. I use software repositories (http://en.wikipedia.org/wiki/Software_repository) but they are very different beasts and have many features that IRs don’t have, such as versioning, automatic installation of software, and distributed repositories.

As a scientist I’ve tried to engage the IR community but with almost no success (other than words). There may have been opportunities for IRs to provide domain-specific solutions, but these haven’t caught the imagination of the IR managers. I have tried to suggest distributed iteration (e.g. for theses), and more generally the provision of born-digital theses (e.g. Word, not PDF). Again no common interest. I’ve tried to start discussion on support for scientists in laboratories (e.g. with distributed versioned data capture). Again no practical interest.

I was disappointed not to get any substantial feedback from my earlier post on Criteria for Data Repositories. I put quite a bit of effort into thinking about it and I got one comment. Clearly this is a community which doesn’t discuss things in public and where there is no sense of electronic community. Maybe occasionally on the DCC-list, but repositories shouldn’t be about curation, they should be about people – which they aren’t. There is, of course, no reason why anyone should reply to me. But it gives the clear message that IRs should not be involved in data – and they should make this clear.

Actually it seems more generally that there is little public discussion of IRs anywhere; not even an active general global mailing list. So I am talking in the wrong direction. It’s saddening how little activity universities have had in the information revolution. IRs seem to be the largest area of university funding in this space – there’s basically no interest in publication, no interest in data. There is no way of communicating with universities even though I am employed by one.

New directions

So from now on I’m going to be addressing the following:

  • Groups of practising scientists, especially those building new tools for information
  • Funders
  • Enlightened scientific publishers (effectively those committed to Open Access/Data and models where the scientist is not just seen as a commodity for creating income)

Rather than use “Repository” I’ll introduce a new term, “Sharer”, as in CodeSharer, DataSharer. I’ll develop these ideas over the next day or so and present them at the meetings.

 

First let me advertise a meeting in Zaragoza (http://grandir.com/EN/debatesessionSTM/ )

Debate session on STM research data management (Zaragoza, Aug 25)
25/08/2011

Next Thu Aug 25 a technical session organised under the auspices of GrandIR will be held in Zaragoza, Spain, dealing with the management of STM research data, an as yet relatively unexplored field in Spain. During the meeting the current state of development of the Quixote Project will also be presented as an example. Quixote is a pioneering initiative for research data management in Quantum Chemistry in which several Spanish researchers are involved.


The meeting will have two sections: The first one will introduce the Quixote project, as well as existing national and international research data management initiatives. The talks will be short (15-20 minutes), with 10 extra minutes for questions. The second and core section of the meeting will be a discussion session, aiming to evaluate the needs of researchers and repository managers regarding data management repositories and tools, and to plan collaborations for creating a research data management infrastructure in Spain as a collection of repositories.

 

I am very appreciative of this – it’s less than a year since we conceived the Quixote system and this is already the second formal meeting about it. More later.

And then a meeting in Madrid (Int. Union of Crystallography, Triennial) 2011-08-29

http://www.iucr.org/__data/iucr/lists/comcifs-l/msg00533.html

In honour of the 20th anniversary of CIF, the upcoming IUCr meeting in Madrid will feature a COMCIFS-sponsored microsymposium entitled “Scientific Data Archiving, Exchange and Retrieval in the 21st Century”. We have three excellent invited speakers, Brian Matthews, Brian McMahon and Peter Murray-Rust. These speakers will discuss various topics drawn from the past, present and future of scientific data exchange and management.

Here I am trying to work out aspects of how a Data Journal ties into a DataSharer. I don’t know what I am going to say – some of it will be controversial and may upset some people. The general emphasis will be that primary scientific data must be universally Open/libre. Since some organisations make their income by selling our data back to us, I think we need some change of thought.

Finally I caught a tweet from GigaScience (a new Data Journal? with a DataSharer?):

Lots of useful advice (especially for us) on what makes a successful repository from @petermurrayrust, #opendata

So this is the direction I should now point in, perhaps. I’ll analyse their web site in a future post – there are some plus and minus things…

Criteria for successful Repositories

[I overwrote this, but have recovered the content from Google’s cache. Thanks to Google]

 

In thinking about data repositories I have been thinking about repositories in general and what makes them successful. I’m going to start by suggesting some principles for success. These are based on the repositories I use and the repositories set up in our group.

  • A repository should have a clear, single purpose. Bitbucket is for repositing and editing code. Stackoverflow is for programming questions. EMSL is for atomic basis functions. Crystaleye is for crystal structures. CKAN is for public metadata. Chemspider is for chemical compounds. Purpose is more important than technology. A repository which tries to do more than one thing will have enormous difficulty. Creating one because everyone else is doing it is most unlikely to be useful.
  • People should want to put stuff in a repository. The want is almost always self-interest, or is driven by actual coercion (employment contract or legal requirements). Voluntary and altruistic expectations don’t work.
  • People should want to get stuff out of the repository. It’s the difference between this and this. [Jim Downing]
  • The Community using the repository should be clear. This can be a person, a research group, an institution, a government agency, a nation, or “the web”. You cannot by default expect the world to be interested in a repository designed for a particular group. A university is not, generally, a cohesive community and does not have a clear purpose.
  • Most successful repositories are started by one or more identifiable people. The repos I have mentioned were all started by identifiable, passionate people. They may change to institutional mode later, but in the formative period it is the drive of individuals that sets the basis for the repository. Committees are not generally visionaries.
  • The repository should generate a community that has a sense of ownership. Typical features are mailing lists and the movement of high-profile users into development roles (usually spare-time). I feel I have a small ownership of Wikipedia and I could have more if I wanted.
  • The repository should be a dynamic organism. Content should be versionable and editable by “anyone” using the repository. Who that is depends on the purpose, and obviously some data cannot be edited for legal or historical reasons. But a repository without history and editing is effectively “dead trees”, not a living Web organism.
  • If you want people outside the “community” to become involved, work to make this happen. The home page of a repository sends a positive or negative message to newcomers as to whether they are welcome. Otherwise plan that this is a repository for the limited community only. Provide examples, software, news, discussion, etc.
  • Make the repository successful rapidly. A repository that grows rapidly will attract new blood, new material, new ideas. Some don’t and you shouldn’t be afraid to let them wither. I am doing that with some of the things I have started while trying to grow others. A successful repository will attract sustainability.
  • Plan for the “now”, not for future generations. There is a modest role for repositories that preserve but they are incompatible with current use. I want to use things now. There is a role for digitization, but it is expensive, slow, layered with legality and generally out of sync with much of internet.
  • Make the rights on the repository universal and completely clear. A repository is either completely Open/libre (OKDefinition), completely gratis (viewable but with no rights), or should be regarded as closed. A repository where some of the items are Open, some visible and some restricted is effectively restricted. Machines are a primary user of repositories. They cannot read and cannot understand licences.
  • Build repositories for machines as well as humans. Assuming that users will navigate by reading metadata and clicking on it restricts the use to humans. The value of data repositories will be that they can be used by machines; if not, they will fail.
  • Make it clear what the extent of the repository is, and make it iterable. It should be easy to find out how many items there are in a repository and get a list of all of them. For an Open repository it should be possible to visit every one automatically (iteration)
  • An open/libre repository should be cloneable/forkable. It should be possible to copy the whole or part of the content of a repository. There may be technical problems with this, but the repository owners should be prepared to help make this possible as far as resources allow. This is the only way of protecting Openness.
  • Encourage the community to innovate in searching and displaying the contents of the repository. If you delegate your search strategy to Bingle, you will get what Bingle provides. If you want to search for any concept or data that Bingle isn’t interested in, you won’t be able to. Wikipedia has flourished in part because people have done clever things with it.
  • Do not rely on traditional metadata strategy. Depositors hate metadata. Full-text provides better metadata than hand-crafted metadata. Machines produce better metadata than people (a machine can, for example, tell whether something is a FORTRAN program, an electron micrograph, a map…). Leave it to machines (a crude sketch follows this list).
  • Build for evolution, not stasis. If your repository looks the same as it did 5 years ago then it is either obviously successful with a vibrant community, or needs tearing down.
  • Give depositors massive feedback. Add download counters. Allow visitors to comment/vote. Because to post material and get zero feedback is a massive turnoff (I know).
  • Create a system of unique identifiers that can be turned into URIs or URLs. It’s really important that every item can be identified independently of its method of storage.
  • Please add your own. This list is not comprehensive.
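
As a crude illustration of the metadata point above – machines can derive useful metadata directly from content – here is a minimal sketch that guesses what kind of object a file is from its first bytes. The heuristics and the file name are purely illustrative.

def guess_kind(path):
    """Guess a coarse object type from the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(512)
    # Common image signatures (PNG, JPEG, and TIFF as used by many micrographs).
    if head.startswith(b"\x89PNG") or head.startswith(b"\xff\xd8"):
        return "image"
    if head[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF image (perhaps an electron micrograph)"
    try:
        text = head.decode("ascii")
    except UnicodeDecodeError:
        return "binary (unknown)"
    # Very rough FORTRAN heuristic based on typical keywords.
    if any(kw in text.upper() for kw in ("SUBROUTINE", "IMPLICIT NONE", "END PROGRAM")):
        return "FORTRAN source (probably)"
    return "text (unknown)"

print(guess_kind("deposit_0001.dat"))  # hypothetical file name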

Having been to Repo Fringe, I had hoped for more feedback and public discussion about repositories. (In writing this I realise that Repo Fringe is effectively about institutional repositories, not repositories in general.) Yes, I’m provocative, and my views are not shared by everyone or even the majority. But in that case say something. I am one of the very few academics who is in any way trying to communicate with the IR community. I am one of the even smaller number of scientists who even know what “repository” means. Institutional repositories have exactly two (UPDATED) things in their favour:

  • They have significant resource commitment from their institutions.
  • They have created an excellent community including some star developers

That is a great deal more than most other repositories have; they rely on building critical mass out of marginal funds and trying as hard as possible to create a community. When an institution funds a repository it presumably has a clear reason for doing it. In most cases I have no idea what that is. It seems increasingly that it is aimed at the management of research. That’s fine, research should be managed, and management takes money. But it bores most academics stiff. It bores the non-academic community unless they are government or funding agencies. If repositories are being used to support the Research Assessment process (REF), say so – and stick to that single purpose. Don’t also use them to try to advertise the University, or pretend they are for the benefit of academics (they generally aren’t), or that they take scholarship to the “public” (they don’t), or that they promote Open Access (they don’t; only hard political slogging does that), or that they are preserving scholarship for future generations. If you want to do any of these things build separate repos. If they don’t have a clear institutional basis – and I can only think of the REF as in that category – build domain or national repos. They are cheaper, and will allow much better discovery.

 

NL has a single repository for the country’s theses. The UK has 100 universities, each with their own limited approach. I can find theses in NL, not in the UK. Why does every university have to have its own approach? (Yes, it’s politics; and yes, every university has statutes going back to …; and yes, every student is the copyright owner and copyright is the new religion so it trumps everything. Did anyone lobby Hargreaves to have a national approach to copyright for theses?) This post is still looking for positive ways forward.

 

If IRs want to be involved with data then they should seek to host domain-specific repositories. I’ve had no take-up on this idea but I keep hoping.

 


Repository Feedback: why one size fits hardly anyone; and an offer

I have got useful feedback from my last post about the design of repositories and it raises some serious issues which will be important in future discussions. [I should say that I get very little feedback, which is a pity as I am actually trying to be constructive. Not in the sense of pretending everything is fine and we just need incremental growth, but of telling it how it is, and how it needs to be.]

There are many motivations for repositories, and I will touch on them later. They range from managing the institution’s business processes to altruistic storage of important digital objects (for a wide variety of processes).

It is impossible for all of these to be properly served by a single repository management approach and a single repository management system. In the world outside academia we build repositories in response to need (often our own). Both the design of the project and the design of the software are geared towards specific goals.

Repositories are hard. [I base this on Carole Goble’s announcement some years ago that “workflows are hard”, after we had burnt six months trying to get Taverna to do something it wasn’t designed to do (chemistry) though everyone thought it could.] I only know DSpace [please don’t tell me that ePrints and Fedora are ipso facto wonderful and the solution to my problems – they aren’t]. Most attempts to design a repository system will either serve a small section of the community, or not serve anything. DSpace falls into the former. It’s basically a twentieth-century approach to managing electronic library objects. It’s heavy on formal metadata, non-existent on community, feedback, etc. If you don’t have these and many other things in the current era then success is due to political or financial power, not excellence of design.

History – about 5 years ago I bought into the idea that DSpace@cam was for sharing useful digital objects with the rest of the world – and preserving them for a year or three. I and colleagues had run 750,000 computational chemistry jobs (based on molecules from the National Cancer Institute, US) and it took us months. We thought it would be useful for others to be spared the effort of repeating that. (We had published that in a closed journal – partly due to circumstances.) BTW there are probably >100 million comp chem jobs run each year and NONE of them are shared. We wanted to change that culture. So Jim Downing wrote a system to put them all into DSpace. We weren’t worried about the REF. We weren’t worried about promoting Green Open Access. We weren’t interested in promoting the reputation of the university (although if it helps, fine). We wanted to store the stuff, 175,000 calculations, somewhere.

So Jim did this. It’s not a particularly clean collection because there are duplicates and crashes. We have subsequently built tools which help to minimize this problem. That’s HARD and tedious. The collection suffers from several major flaws, mainly due to the DSpace software. If I implicitly criticize people, these are generic criticisms of IRs – not specific to Cambridge.

  • It is impossible to search the collection for anything other than crafted metadata – no chemistry or numeric values. But that is what chemists want to do. The titles are meaningless. The authors are meaningless. Most of the traditional metadata is non-existent.
  • It is impossible to iterate over the collection. It is really important that you understand this, because if you do not, your repository is largely useless for science and technology objects. Someone mailed me and asked if I could let him have all the objects in the collection. A very reasonable request. I was delighted to help. Except I couldn’t. Not didn’t want to. Couldn’t. For at least 2 reasons: (a) there was no single list of all the entries and (b) even if there was, it wouldn’t retrieve the data, but only the HTML metadata. That’s because DSpace is basically a metadata repository assuming that searches are by humans and for humans to click on the links. Yes, it has OAI-PMH but that’s not going to help me find my molecules. (People keep telling me that the repositories have OAI-PMH and I should use it. How? By writing a repository crawler?) A sketch after this list shows what an OAI-PMH harvest actually returns.
  • There are no download statistics visible to depositors or users. (See below).
  • There is no way for humans to leave annotations. Such as “Liked this”, “this entry is rubbish” (several are), “are you going to be producing more?”, “I’d like to use them for …”.
  • There is no way for me to help innovate the system other than by moaning on this blog. Which I don’t actually enjoy. There is a wealth of young talent waiting to build and modify information systems and the opportunities are limited because there is a presumed architecture of humans depositing e-objects and humans searching and browsing e-objects.
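
To show what the OAI-PMH route actually delivers, here is a minimal sketch against a hypothetical DSpace OAI-PMH endpoint (the URL is a placeholder). What comes back is Dublin Core metadata – identifiers, titles and so on – not the deposited data files themselves, which is exactly the limitation described above.

import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

base_url = "https://repository.example.org/dspace-oai/request"  # hypothetical endpoint
resp = requests.get(base_url, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

for record in root.iter(OAI + "record"):
    header = record.find(OAI + "header")
    identifier = header.findtext(OAI + "identifier")
    titles = [t.text for t in record.iter(DC + "title")]
    # Only descriptive metadata is returned; the bitstreams (the actual data
    # files) are not part of the OAI-PMH response.
    print(identifier, titles)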

So to the comments:


Steve Hitchcock [A well known guru from Southampton] says: August 17, 2011 at 9:58 am  

>> Peter, It pains me to say it, but you have some valid criticisms of institutional repositories. As a passionate supporter of IRs my view is that IRs broadly exhibiting the problems you identify lack leadership, that is, institutional leadership (it’s in the term, *I*R). As a result many repository managers are left thrashing around to find a purpose and identity for the repository. And yet. If you look closely, you will find exemplary IRs.

PMR: I am reasonably appreciative of the Soton commitment and the amount of content, but I’m not aware of much else. You will have to be specific; I shouldn’t have to look closely – it should be self-evident.

>>These may be typical open access IRs, or untypical but with a clear focus and implementation to match. From experience there are repository teams taking exemplary approaches, but it is not apparent in the resulting repository, yet. They will be rewarded if the institution recognises and backs their efforts.

PMR: This is rather close to vapourware. Five years of non-progress for my requirements have made me take a different tack.

>>The wider repository community might be too slow to recognise, promote and emulate its best, and not always clear on the importance of their mission. Let’s be clear, you recognise the importance of the mission, ultimately the progress of science and wider academic endeavour by correct and proper management and access to all of its critical (now) digital outputs,

PMR: Yes, I do – I am passionate about it.

>> yet your response to current perceived problems is to look for instant solutions without joining it all up.

PMR: They aren’t instant, but they happen within months rather than a decade. And they are solutions. Our extended group has developed Crystaleye (250,000 entries), Chempound/Quixote (thanks to JISC funding) with ca. 40,000 combined crystal and compchem entries, BibSoup (20 million bibliographic records), etc. These all took less than a year before deployment. Yes, they are raw, and yes some of the architecture may later be replaced but they are evolving systems.

>>IRs can offer a joined-up approach to the mission. That is why they were conceived, and we too often and too easily lose sight of that. The way forward is to highlight and learn from the best IRs. That is the way to fix these problems, rather than abandon them.. I

PMR: Unless you give specifics this is no use. I WANT to join up IRs. I have made a practical suggestion on this blog – no interest. I have asked how we do federated search. No replies. I want to retrieve all UK theses. I can’t. (I am told “you can do this through EThOS” – how?)

PMR: I could believe in federated repositories if I saw some examples. They won’t be easy to scale – we have seen this in RDF/LOD. But I actually suspect that very few people actually want to federate repositories – why should they, when part of the purpose of a repo is to compete against other institutions? Federation is most effective when the identity of the components is irrelevant.


Chris Rusbridge [ex-director of the Digital Curation Centre] says: August 17, 2011 at 10:03 am

CR >>Peter, my sincere apologies for not letting you know this before, but you vastly under-estimate the usage of the WWMM collection. The report I wrote on DSpace @ Cambridge belongs to the University Library, but I’m sure they will not mind if I quote one paragraph to you:

“There have been concerns that the huge WWMM collection might be essentially unused. Because of the Handle-like structure of identifiers, it is not easy to count accesses to all the items in a collection. However, thanks to sterling work by the DSpace team, they were able to determine that accesses to the WWMM collection represent approximately 12.5% of total accesses over the 15-month period available. This is a respectably high usage rate, that justifies keeping the collection, and comes despite the fact that 104,000 of the 175,000 items in the collection received no accesses at all during that period (a classic long tail effect).”

One in every eight accesses to the repository was to that collection.

Thanks, Chris.

Firstly, I had no knowledge of this whatsoever and it re-emphasizes that repositories are useless to depositors unless there is feedback. So I have millions of downloads of my entries (do the power-law sums). Where from? Machines? Humans? What for? What are the popular molecules?

I should thank the DSpace team for their work although I had no idea it was going on.

If I had known I would have made some suggestions for improvement. An obvious one would be a simple feedback button, “found this useful”. A chance to mail me. How many potential collaborators might I have had if I had known? (A note of caution – we have had >300,000 downloads of our Chem4Word software and not a single communication other than from those who have technical problems. So I don’t expect much.)

I am now involved in setting up a new generation of molecular repository – Quixote. This will allow the questions above to be answered, and for people to innovate. This is not vapourware – it exists and is being cloned.

So I am grateful to the University for supporting this resource. But it actually makes much more sense for SOME university to support Quixote. It will have far more use than PMRmolecules@Dspace. Because people can deposit their own work. For their own purposes.

So – once again a question to readers…

Is anyone interested in supporting a world repository for computational chemistry?

Or are universities forever tied to only looking after their own? In which case expect to see me and others do it outside the IR system.

RSN.


Criteria for Successful Repositories: Initial thoughts

As part of my analysis of what data repositories should look like, I look here at repos in general. There has been some useful feedback to my latest posts, mainly about Institutional Repositories (IRs), in the comments and on Twitter. Some people agree with me; others have suggested I have got things wrong. I am not against repositories – in fact I am strongly in favour of them. I question the role of many institutional repositories and, as readers know, I argue in favour of domain-specific repositories.

Institutional Repositories have only one thing in common – they are supported by cash and staff provided by the institution – and they are institution-centric. Here for example is Imperial College (http://eprints.imperial.ac.uk/):

Welcome to Spiral, the Digital Repository for research output of Imperial College. Spiral primarily contains full text peer-reviewed versions of journal articles and conference papers produced by academic staff of Imperial College London, as well as PhD theses by students of Imperial College London.

In fact NONE of the theses are visible to people outside Imperial. If I go to “More Information” (http://www3.imperial.ac.uk/library/find/spiral) it’s primarily about how to submit content (obviously only for Imperial people). If I go to the first item in “Chemistry” I find a “journal” which is actually a thesis from Cambridge (http://eprints.imperial.ac.uk/handle/10044/1/6193). One useful feature is the “Top twenty downloads” list (http://www3.imperial.ac.uk/library/find/spiral/toptwenty ).

By contrast my own University’s repository exists for a completely different purpose:

DSpace@Cambridge is the institutional repository of the University of Cambridge. The repository was established in 2003 to facilitate the deposit of digital content of a scholarly or heritage nature, allowing academics and their departments at the University to share and preserve this content in a managed environment.

I have actually uploaded ca. 180,000 items. There is no download indicator so I have no indication that anyone has ever downloaded anything. (Actually I have had 1 email, which shows somebody downloaded something 2 months ago.) This, not surprisingly, is demotivating.

So, as a result, I am not highly motivated to explore Imperial as it is highly Imperial-centric. And I am not very motivated to deposit things in DSpace@Cam (I continue to do so, but out of a sense of duty, rather than because I want to).

By contrast Nature Precedings (run by Nature Publishing Group) is a preprint server, and I have put papers in it, for example http://precedings.nature.com/documents/1526/version/1 (my ill-fated paper on Open Data which got buried behind the Elsevier paywall). People have read the NP offering; 11 voted for it. Now votes are not very scientific, but they give me a slight warm fuzzy feeling. It would be interesting to know the downloads (and I’m slightly surprised I can’t find that). The NP site is nicely presented.

The upshot is:

  • I don’t want to browse the Imperial repository – it makes me feel an outsider
  • I don’t want to upload to DSpace@Cam (it’s tedious and I have no evidence anyone reads it)
  • I do want to upload to Nature Precedings.

So 3 months ago I sent 11-15 papers off to BioMed Central (J. Cheminformatics). I put them all in DSpace@Cam. The reviews have mainly come in and I think all the papers will get published. So I think I’ll upload them also to Nature Precedings and get a feel for what the world thinks of them. I’ll also see whether BMC have a “precedings” – if not, maybe they should.

And I submitted a lot of work last night to another repository – I stayed up till 0100 because of the excitement of doing so. It’s called Bitbucket. It’s how I make sure my code is working, high quality, and available to everyone. The main motive was to increase our collaboration with the European Bioinformatics Institute (Christoph Steinbeck’s group at ChEBI).

There is no reason why IRs should not be able to appeal to sections of the community. But I think very few appeal to any more than very small groups, mainly within the institution. And if the repository is not clear what its purpose is, then I suspect it won’t appeal to anyone.

So I’ll leave you with Ranganathan’s laws, modified for repos (authors now have a role that they did not have before):

  • Repositories are for use (by machines and/or humans)
  • Every entry its reader
  • Every author and every reader their entry
  • Save the time of the reader and the author
  • The repository is an evolving organism

If, for a given community of authors and readers, you can truly answer YES to every law, then you already have a successful repository. Bitbucket has. Stackoverflow has. Wikipedia has. Dryad has. Tranche has. NCBI and EBI have. CKAN has. ArXiv has. Chemspider has. Figshare looks promising. Nature Precedings (3500 entries) continues – I would expect more.

You cannot be everything to everyone and this is where IRs generally fail. If your main purpose is to manage the REF, say so. If it is to store theses and stop the rest of the world seeing them, say so. If it’s to create collections of important digital objects, say so. And don’t do the other things unless you are sure you can make a success of them.


 
