#rds2013 Principles for Managing Research Data

These are thoughts for my 15-minute session at #rds2013. Feel free to comment. I’d particularly like to know of any F/OSS that manages timed slide presentations on Windows so I don’t have to use PowerPoint. I have 900 seconds, including 5 at each end for stepping up and stepping down, so it’s essential to have timed transitions, à la PechaKucha. I shall refuse to be introduced – it’s all in Wikipedia (http://en.wikipedia.org/wiki/Peter_Murray-Rust). The cryptic notes here will be elaborated in detailed blog posts. The order is random and the number of principles will change.

Management of data is a state of mind, not a process or technology. Follow Ranganathan.

  • The world owns the data, not you.

    Use CC0 (see Ross Mounce’s work on licences).

    The data you work with is provided by the universe of things and ideas. It is yours to nurture, refine and evangelize, but not yours to own.

  • You do not fully understand the potential of your data.

    Encourage downstream use. Data increases in value with refinement, subtraction, and addition. Example: The historic observation of a Chinese eclipse has been used to calculate the coefficient of dynamic viscosity of the earth’s mantle.

  • Walled gardens destroy the potential of data and innovation.

    Walled gardens, however benign, control access and seriously limit innovation and re-use. You cannot get all of the data out for Open re-use. Examples: Sciverse, CCDC crystallography, Reaxys, Chemical Abstracts. Now, Mendeley. Will Figshare remain unwalled for long?

    #animalgarden have made a 3.5-minute video (http://vimeo.com/34323486). There won’t be time to show it, but it will exercise all your emotions.

  • Build the memex for data. (http://en.wikipedia.org/wiki/Memex)

    Manage data without noticing. SourceForge/GitHub capture our code with zero effort, because we want to use them, not because we have to. We can do this for data. Turn instruments, laboratories and authoring systems into memexes. If you have to “put it in the repository”, the system has failed.

  • Revere the long tail.

    Most data is in the long tail of science, collected in individual laboratories using unique protocols and strange instruments. This can only be tackled by giving scientists toolkits for informatics and allowing them to build the solutions.

  • Text, data, audio, images, movies are different views of “data” – scientific truth.

    They must all be free. The idea that scientific images, video, audio “belong” to people or institutions must be challenged. They are all CC0.

  • Mentor young people in data and let them mentor you

    Young people have a different, fearless attitude. I’ve seen them attempt the impossible. Sometimes they succeed. (Sophie Kershaw, a doctoral student, has been mentoring Oxonians in how to manage data.)

  • The problems of data are people, not storage or bandwidth.

    A computational chemistry program solves Schrödinger’s equation. If you publish the results in full, the company will send the lawyers.

    I can mine 500,000 reactions from patents (and my colleague Daniel has). Elsevier won’t let me mine any. Nor will ACS. Or the others. These restrictions destroy imaginative thought.

  • Develop Patterns for Data

    Cameron McLean has shown me how the architects have patterns for building. These were adapted to patterns for software. He’s adapting these to research. We don’t yet have patterns for data.

  • Honour Tim Berners-Lee’s 5 stars of Linked Open Data.

    Yes. Open Data, Open standards, Open links and Open minds.

  • Work collaboratively.

    Share tools and ideas. Use hackfests. The library should run hackfests. Not for academics. For everyone. You would be surprised who you get.

  • Computing and Bioscience have got it as right as possible.

    Emulate them. Use their tools. Create communities like theirs.

  • Build your own tools, don’t buy anything.

    “Rough consensus and running code” built the Internet and the web. Build, test, tear down, rebuild. Building teaches you. Buying things numbs your imagination. Renting information is even worse.

  • Get out more.

    Wikipedia was built by non-academics. Academics sneered (and some still do). Wikipedia is the future of scientific information. Steve Coast built OpenStreetMap. Galaxy Zoo brought in hundreds of thousands of citizens. Academia neglects the #scholarly poor – non-academics (everywhere) facing daily paywalls.

  • Campaign for change.

    Read and honour Aaron Swartz. Mail your representatives. Blog. You don’t have to go to jail if enough people protest.

  • Use domain repositories

    Institutional repositories don’t work – for science and for data. We must create our own. Commercial ones will be constraining and controlled.

  • Start bottom-up Communities.

    Wikipedia is a bottom-up community. It creates not only knowledge but models of governance. We’ve created the Blue Obelisk for chemistry.

PMR has been involved in all of the above and will no doubt think of more.



6 Responses to #rds2013 Principles for Managing Research Data

  1. Phil Lord says:

    Last year we published a paper on the notion that tools have to work for authors and do good stuff at the same time: http://www.russet.org.uk/blog/2054. Your point about walled gardens is very true. CrossRef and DOIs, for example, are hard to get into for an independent academic. This is why we created greycite.knowledgeblog.org, which returns metadata (like CrossRef) and provides two-step resolution (PURLs). But, unlike CrossRef, we have no privileged position; we take metadata direct from the webpage, and anyone at all could set up an alternative. The data is all out there.
    So, here is my principle. All repositories should think, right from the start, about how to enable users to move (or copy) their data to another service at any point they want to.

    • pm286 says:

      The problem is that people setting up Institutional Repositories have no consensus on why they are doing it. They’ve spent hundreds of millions and the repos are still largely empty or behind embargo walls. No-one outside academia even knows repos exist, so they don’t use them. And most people in repos think they are a bureaucratic chore.

      • Phil Lord says:

        Well, you and I know exactly why Institutional Repositories have been set up — it’s for the REF.
        I agree with you about the repos being a bureaucratic chore. Most of them don’t have nice APIs, which makes it impossible to code around this chore. I’ve been looking for a repository, institutional or otherwise, where I can archive my blog content. The only thing out there is http://webcitation.org and it’s got a request for funding on the front page (archive.org and the UK Web Archive are not on-demand). I’d love an alternative.
        So, here is my other principle for you. Repositories should manage data that people would like to be managed.

        • pm286 says:

          Thanks – like it, Phil.
          There is a huge need for places where we can put things that we’d like others to get. We’ve got 500,000 chemical reactions.

