Open Notebook Science and Glueware

Cameron laments the difficulty of creating an Open Notebook system when there is a lot of data:

 

The problem with data…


Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To re-cap our system is that every “item” has its own post. Until now these items have been samples, or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple; Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, that might for instance need re-processing, we could treat it the same way as a product or sample.

[snip…]

 

PMR: How I sympathize! We had a closely related problem with Nick Day's protocol for NMR calculations. There were also other reasons why we didn't do complete Open Notebook, but even if we had wanted to we couldn't, because the whole submission and calculation process is such horrendous glueware. It's difficult enough keeping it under control yourself, let alone exposing the spaghetti to others. So, until the protocol has stabilised (and that's hard when it's perpetual beta), it's very hard to do ONS.

 

And what happens when you change the protocol? The data formats suddenly change, and that will foul things up for all your potential collaborators. Do you have a duty of care to support any random visitor who wants to use your data? I have to argue "no" at this stage. You may expose what you have, but it's a mess.

 

The only viable solution is to create a workflow – and to tee the output. But as Carole Goble told us at the DCC – workflows are HARD. That's why glueware is so messy – if we had cracked the workflow problem we would have eliminated glueware.
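As an illustration only – not Nick's actual protocol, and with every name invented – a "teed" workflow step can be sketched in a few lines of Python: each intermediate result is archived with a timestamp as well as being passed downstream, so even when the protocol changes there is at least a dated, self-describing trail that collaborators could inspect.

```python
# A minimal sketch of "teeing" a workflow step: every intermediate
# result is both archived (for open exposure) and passed to the next
# stage. All function, file and directory names are illustrative.
import json
import pathlib
from datetime import datetime, timezone

ARCHIVE = pathlib.Path("notebook_archive")
ARCHIVE.mkdir(exist_ok=True)

def tee(step_name, payload):
    """Archive the payload with minimal provenance, then return it unchanged."""
    record = {
        "step": step_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    (ARCHIVE / f"{step_name}.json").write_text(json.dumps(record, indent=2))
    return payload

def prepare_input(raw):        # stand-in for a real preparation step
    return {"molecule": raw.strip()}

def run_calculation(job):      # stand-in for the calculation itself
    return {"molecule": job["molecule"], "shift_ppm": [1.2, 3.4]}

# The pipeline: each stage is teed before the next one runs.
job = tee("prepare", prepare_input("CCO"))
result = tee("calculate", run_calculation(job))
print(result)
```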

 

The good news is that IF we crack it for a problem, then it should be much much easier to archive, preserve and re-use the output of ONS.

 

Posted in open notebook science | Tagged , , | Leave a comment

What sort of repositories do we want?

 

Open Access and Institutional Repositories: The Future of Scholarly Communications, Academic Commons


Institutional repositories were the stated topic for a workshop convened in Phoenix, Arizona earlier this year (April 17-19, 2007) by the National Science Foundation (NSF) and the United Kingdom’s Joint Information Systems Committee (JISC). While in their report on the workshop, The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship, Bill Arms and Ron Larsen build out a larger landscape of concern, institutional repositories remain a crucial topic, which, without institutional cyberscholarship, will never approach their full potential.

 

PMR: Although I'm going to agree generally with Greg, I don't think the stated topic of the workshop was institutional repositories per se. It was digital scholarship, digital libraries and datasets. I would expect to find many datasets outside institutions (witness the bio-databases).

Repositories enable institutions and faculty to offer long-term access to digital objects that have persistent value. They extend the core missions of libraries into the digital environment by providing reliable, scalable, comprehensible, and free access to libraries’ holdings for the world as a whole. In some measure, repositories constitute a reaction against those publishers that create monopolies, charging for access to publications on research they have not conducted, funded, or supported. In the long run, many hope faculty will place the results of their scholarship into institutional repositories with open access to all. Libraries could then shift their business model away from paying publishers for exclusive access. When no one has a monopoly on content, the free market should kick in, with commercial entities competing on their ability to provide better access to that freely available content. Business models could include subscription to services and/or advertising.

Repositories offer one model of a sustainable future for libraries, faculty, academic institutions and disciplines. In effect, they reverse the polarity of libraries. Rather than import and aggregate physical content from many sources for local use, as their libraries have traditionally done, universities can, by expanding access to the digital content of their own faculty through repositories, effectively export their faculty’s scholarship. The centers of gravity in this new world remain unclear: each academic institution probably cannot maintain the specialized services needed to create digital objects for each academic discipline. A handful of institutions may well emerge as specialist centers for particular areas (as Michael Lesk suggests in his paper here).
The repository movement has, as yet, failed to exert a significant impact upon intellectual life. Libraries have failed to articulate what they can provide and, far more often, have failed to provide repository services of compelling interest. Repository efforts remain fragmented: small, locally customized projects that are not interoperable–insofar as they operate at all. Administrations have failed to show leadership. Happy to complain about exorbitant prices charged by publishers, they have not done the one thing that would lead to serious change: implement a transitional period by the end of which only publications deposited within the institutional repository under an open access license will count for tenure, promotion, and yearly reviews. Of course, senior faculty would object to such action, content with their privileged access to primary sources through expensive subscriptions. Also, publications in prestigious venues (owned and controlled by ruthless publishers) might be lost. Unfortunately, faculty have failed to look beyond their own immediate needs: verbally welcoming initiatives to open our global cultural heritage to the world but not themselves engaging in any meaningful action that will make that happen.
The published NSF/JISC report wisely skips past the repository impasse to describe the broader intellectual environment that we could now develop. Libraries, administrators and faculty can muddle through with variations on proprietary, publisher-centered distribution. However, existing distribution channels cannot support more advanced scholarship: intellectual life increasingly depends upon open access to large bodies of machine actionable data.
The larger picture depicted by the report demands an environment in which open access becomes an essential principle for intellectual life. The more pervasive that principle, the greater the pressure for instruments such as institutional repositories that can provide efficient access to large bodies of machine actionable data over long periods of time. The report's authors summarize as follows the goal of the project around which this workshop was created:

To ensure that all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

To accomplish this goal, the report proposes a detailed seven-year plan to push cyberscholarship beyond prototypes and buzzwords, including action under the following rubrics:

  • Infrastructure: to develop and deploy a foundation for scalable, sustainable cyberscholarship
  • Research: to advance cyberscholarship capability through basic and applied research and development
  • Behaviors: to understand and incentivize personal, professional and organizational behaviors
  • Administration: to plan and manage the program at local, national and international levels

For members of the science, technology, engineering, and medical fields, the situation is promising. This report encourages the NSF to take the lead and, even if it does not pursue the particular recommendations advocated here, the NSF does have an Office of Cyberinfrastructure responsible for such issues, and, more importantly, enjoys a budget some twenty times larger than that of the National Endowment for the Humanities. In the United Kingdom, humanists may be reasonably optimistic, since JISC supports all academic disciplines with a healthy budget. Humanists in the US face a much more uncertain future.

PMR: I would agree with Greg that IRs are oversold and underdeliver. I never expected differently. I have never yet located a digital object I wanted in an IR except when I specifically went looking (e.g. for theses). And I went to Soton to see what papers of Stevan's were public and what their metadata were. But I have never found one through Google.

Why is this? The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you'll find: DSpace at Cambridge: NSC383501
But the actual data itself (some of which is textual metadata) is not accessible to search engines, so it isn't indexed. So if you know how to look for it through the ID, fine. If you don't, you won't.
I don't know what the situation is in the humanities, so I looked up the Fitzwilliam (the major museum in Cambridge) newsletter. I searched for "The Fitzwilliam Museum Newsletter Winter 2003/2004" in Google and found: DSpace at Cambridge: The Fitzwilliam Museum Newsletter 22. But when I searched for the first sentence, "The building phase of The Fitzwilliam Museum Courtyard", Google returned zero hits.
So (unless I’m wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. Jim explained that this was due to the handle system and the use of one level of indirection – Google indexes the metadata but not the data. (I suspect this is true of ePrints – I don’t know about Fedora).
If this is true, then repositing at the moment may archive the data but it hides it from public view except to diligent humans. So people are simply not seeing the benefit of repositing – they don't discover material through simple searches.
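One illustrative workaround – emphatically not something DSpace offered at the time, and with invented URLs throughout – is to publish a sitemap that lists the full-text and bitstream URLs alongside the metadata "splash" pages, so that a crawler has a direct route to the data itself rather than stopping at the handle.

```python
# Illustrative only: a tiny sitemap generator that lists bitstream URLs
# alongside the metadata pages, giving crawlers a path to the data.
# The repository URLs are invented examples.
from xml.etree.ElementTree import Element, SubElement, tostring

items = [
    {
        "splash": "https://repo.example.org/handle/1810/12345",
        "bitstreams": [
            "https://repo.example.org/bitstream/1810/12345/nsc383501.cml",
            "https://repo.example.org/bitstream/1810/12345/nsc383501.txt",
        ],
    },
]

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for item in items:
    for url in [item["splash"], *item["bitstreams"]]:
        loc = SubElement(SubElement(urlset, "url"), "loc")
        loc.text = url

print(tostring(urlset, encoding="unicode"))
```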
So I’m hoping that ORE will change all this. Because we can expose all the data as well as the metadata to search engines. That’s one of the many reasons why I’m excited about our molecular repositories (eChemistry) project.
As I said in a previous post, it will change the public face of chemical information. The key word for this post is “public”. In others we’ll look at “chemical” and “information”.
====================
[ans: German. Because the majority of scholarship in the C19 was in German.]

Posted in repositories | 6 Comments

Open Access Data, Open Data Commons PDDL and CCZero

This is great news. We now have a widely agreed protocol for Open Data, channeled through Science Commons but with great input from several sources including Talis and the Open Knowledge Foundation. Here is the OKFN report (I also got a mail from Paul Miller of Talis without a clear link to a webpage).

 

This means that the vast majority of scientists can simply add CCZero to their data. I shall do this from now on. Although I am sure that there will be edge cases, they shouldn't apply to ANYTHING in chemistry.

Good news for open data: Protocol for Implementing Open Access Data, Open Data Commons PDDL and CCZero

Jonathan Gray, Open Knowledge Foundation Weblog, 17 December 2007

Last night Science Commons announced the release of the Protocol for Implementing Open Access Data:

The Protocol is a method for ensuring that scientific databases can be legally integrated with one another. The Protocol is built on the public domain status of data in many countries (including the United States) and provides legal certainty to both data deposit and data use. The protocol is not a license or legal tool in itself, but instead a methodology for a) creating such legal tools and b) marking data already in the public domain for machine-assisted discovery.

As well as working closely with the Open Knowledge Foundation, Talis and Jordan Hatcher, Science Commons have spent the last year consulting widely with international geospatial and biodiversity scientific communities. They’ve also made sure that the protocol is conformant with the Open Knowledge Definition:

We are also pleased to announce that the Open Knowledge Foundation has certified the Protocol as conforming to the Open Knowledge Definition. We think it’s important to avoid legal fragmentation at the early stages, and that one way to avoid that fragmentation is to work with the existing thought leaders like the OKF.

Also, Jordan Hatcher has just released a draft of the Public Domain Dedication & Licence (PDDL) and an accompanying document on open data community norms. This is also conformant with the Open Knowledge Definition:

The current draft PDDL is compliant with the newly released Science Commons draft protocol for the “Open Access Data Mark” and with the Open Knowledge Foundation’s Open Definition.

Furthermore Creative Commons have recently made public a new protocol called CCZero which will be released in January. CCZero will allow people:

(a) ASSERT that a work has no legal restrictions attached to it, OR
(b) WAIVE any rights associated with a work so it has no legal restrictions attached to it,
and
(c) “SIGN” the assertion or waiver.

All of this is fantastic news for open data!
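PMR: To make "marking data for machine-assisted discovery" concrete, here is a minimal sketch (using rdflib in Python) of attaching a public-domain waiver to a dataset as RDF. The dataset URI is invented, and the CC0 URI is shown purely as an example of a waiver identifier – the formal CCZero release is still to come.

```python
# A sketch of marking a dataset for machine-assisted discovery: attach a
# public-domain waiver statement as RDF. The dataset URI is invented; the
# CC0 URI is used only as an example identifier for a waiver.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCMITYPE = Namespace("http://purl.org/dc/dcmitype/")

dataset = URIRef("http://example.org/data/nmr-shifts-2007")
waiver = URIRef("http://creativecommons.org/publicdomain/zero/1.0/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.add((dataset, RDF.type, DCMITYPE.Dataset))
g.add((dataset, DCTERMS.title, Literal("Calculated NMR chemical shifts")))
g.add((dataset, DCTERMS.license, waiver))  # the public-domain waiver
print(g.serialize(format="turtle"))
```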

Posted in data, open issues | Tagged | Leave a comment

Deepak Singh: Educating people about data ownership



I never got to watch the Bubble 2.0 video (I only heard it on net@nite). Before I could get to see it, it got taken down. Wired talks about the reasons behind the takedown. As a content producer who shares content online and as a scientist who has published papers and a not-so-casual observer of the entire content ownership debate, I am often torn by examples like this one.
What is important for the author? Is it monetary compensation? If content, scientific, media or otherwise is your primary source of income, you can understand why people get a little antsy when someone uses the content without permission. I know too many people, journalists, musicians, etc for whom their creativity is the sole source of income and they are all well meaning, even if they don’t always understand the environment that they operate in.
However, a lot of these issues date back to a world free of Creative Commons, which I believe is celebrating a 5th birthday this weekend. In today’s climate we have choice, so to some extent content owners need to make that choice and then live with their consequences. You can choose to publish your papers in a PLoS journal under a CC license, or you can choose to publish in a closed journal. Obviously, I belong to the open science camp, but I also believe that people have the choice of making decisions. They then must also live with the consequences of those decisions.
What we need is education. When Larry Lessig spoke at the University of Washington recently (I have the full recording if anyone is interested), I asked him a question on this very issue. How many people who upload pictures to flickr really understand the licensing options available to them? How many people understand the pros/cons and implications? Most scientists I know don’t even know what Creative Commons is, Science Commons even less so. On the flip side, do the majority of people wanting to use pictures, etc understand what they can do with media, the proper ways of attribution, etc? I doubt it. Even I am not always sure.
We have a plethora of resources available to us for sharing data, media and information. Scientists have the PLoS and BMC journals. You have resources to share data, documents, pictures, videos, screencasts, etc etc. It is up to us to decide where we put our information and how it is managed. It is also important for everyone to understand and respect those choices. The dialog on what is the best approach to sharing data and the advantages of open data can be discussed as we go along.

PMR: We have to liberate scientific images unless there is a good reason why not. There will continue to be problematic areas when re-use is mis-use. For example CC-BY would allow derivative works including – say – altering the gray scale or the pixels in an image. (I hope no-one would edit in an incorrect scale bar!) And it’s important to keep the caption with the image – until we get better metadata packaging. But, in general all scientific images should be stamped CC-BY or SC. Scientific images are different from people’s photographs. They are part of the scientific record. And they should NOT belong to the publisher.

Posted in open issues | Tagged | 1 Comment

Tools, Frameworks and Applications

I lamented the lack of public interest shown by the pharma companies in Open Source software and Geoff Hutchison commented:

When the Blue Obelisk met in San Francisco, we all heard from a pharma rep [PMR: was there a pharma rep in the Blue Obelisk meeting?] that applications get more attention than toolkits. So Jmol, Bioclipse, and hopefully Avogadro can get more attention than Open Babel, CDK, or JOELib.

PMR: We have different names for things, but I am assuming that an application is something that is shrink-wrapped, installs at the click of a button, has a user manual and does one or a small number of discrete actions (although applications bloat rapidly). Examples are rich renderers (hence Jmol) and editors (hence Avogadro). Then I'd see components, where my ideal is a unix tool such as sed, ls or grep, with a small number of very well defined inputs and outputs. And finally a framework, where several components can be configured in different ways by the user. I'd classify Bioclipse and OSCAR as frameworks, OpenBabel as a component and CDK, JOELib and JUMBO as component libraries. Stitching components together requires glueware, which may be explicit (perl, python, ruby, java, etc.) or embedded in a workflow framework (Taverna, KNIME, Kepler, Pipeline Pilot). Glueware is difficult and expensive and often unique to the institution, and I've talked to several people over the last week who have confirmed that one size does not fit all, and some problems aren't fitted by any.
Using that terminology, an application does a small number of things in a repeatable manner. Applications are not easy to reconfigure, or, put differently, it is difficult to build an application which can be easily reconfigured. So increasingly the creativity is moving to those who can create "applications" from components, rather than relying on software manufacturers to do their thinking for them. If the pharma industry is to use knowledge technologies to help in its discovery it will have to get used to gluing together different data sources and programmatic components. And hopefully at that stage it will discover what the Blue Obelisk can provide.
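As a caricature of what that glueware typically looks like – a few lines of script lashing a command-line component to some local post-processing – here is a sketch. The converter command (an Open Babel-style obabel call) and the file names are placeholders for whatever a given group actually has installed.

```python
# A caricature of "glueware": chain an external command-line converter to
# a trivial local post-processing step. Command and file names are
# placeholders, not a recommended pipeline.
import subprocess
import sys

def convert(infile, outfile, converter="obabel"):
    """Call an external structure converter; assumes it is on the PATH."""
    result = subprocess.run(
        [converter, infile, "-O", outfile],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        sys.exit(f"conversion failed: {result.stderr}")
    return outfile

def count_records(path):
    """Stand-in for a 'component' that consumes the output (SDF records end with $$$$)."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith("$$$$"))

if __name__ == "__main__":
    sdf = convert("input.smi", "output.sdf")
    print(count_records(sdf), "structures converted")
```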

Posted in Uncategorized | Tagged | 5 Comments

Why oh why oh why….? Digital uncuration


My colleague Nico Adams has written at great and useful length (Why oh why oh why….?) about the appalling state of data capture, dissemination, preservation and curation. He describes how he found some very valuable data, only to discover that the packaging was a nested PDF. This is a monstrosity so awful that we probably need a special name. I have borrowed the aphorism that turning a PDF into XML is like turning a hamburger into a cow. (Having now worked in depth with PDFs, that analogy is slightly unfair to hamburgers.) So perhaps this is a triple-decker flamed whopper. The point is that within the PDF are more little PDFs, like Russian dolls. Wonderful material for the semantic web. The spectra – originally perfectly usable ASCII numbers – had, of course, been transformed into useless PDF hamburgers.

Read the post. Here are some snippets, and then I add some comments. Nico continues to lament how it should be so straightforward to capture the data, archive it and re-use it.

And that's when it hit me really hard…..what a waste this is….all these spectra, reported and accessible to the world in this format. I mean I would love to get my grubby mittens on this data…..if there were some polymer property data there too, it could potentially be a wonderful dataset for mining, structure-property relationships, the lot. But of course, this is not going to happen if I can only get at the data in the form of tiny spectral graphs in pdf…..there is just so little one can do with that. What I would really need is the digital raw data…preferably in some open format, which I can download, look at and work on. But because I cannot get to the data and do stuff with it, it is potentially lost.
[…] So who is to blame? The scientist for being completely unthinking and publishing his data as graphs in pdf? Adobe for messing with pdf? The publishers for using pdf and new formats and keeping the data inaccessible unless I as a user use their technology standard (quite apart from needing to subscribe to the journal etc.)? The chemical community at large for not having evolved mechanisms and processes to make it easy for researchers to distribute this data? The science infrastructure (e.g. libraries, learned societies etc.) for not providing necessary infrastructure to deal with data capture and distribution? Well, maybe everybody….a little….
Let’s start with ourselves, the scientists. Certainly, when I was a classically working synthetic chemist, data didn’t matter.  […] – even the not-so-standard chemists, namely the combinatorial and high throughput ones, which generate more data and should therefore be interested, often fall into this trap: the powerful combination of Word and Excel as datastores. And who can blame them- for them, too, data is a means, not an end. They are interested in the knowledge they can extract from the data…and once extracted, the data becomes secondary.
Now you might argue that it is not a chemist's job to worry about data, it's his job to do chemistry and make compounds (I know..it's a myopic view of chemistry but let's stick to it for now). And yes, that is a defensible point, though I think that certainly with the increasingly routine use of high-throughput and combinatorial techniques, that is becoming less defensible. Chemists need to realise that the data they produce has value beyond the immediate research project for which it was produced. Furthermore, it has usually been generated at great cost and effort and should be treated as a scarce resource. Apart from everything else, data produced through public funding is a public good and produced in the public interest. So I think chemists have to start thinking about data….and it won't come easy to them. And one way of doing this, is of course, to get them where it hurts most: the money. So the recent BBSRC data policy initiative seems to me to be a step in the right direction:

BBSRC has launched its new data sharing policy, setting out expected standards of data sharing for BBSRC-supported researchers. The policy states that BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for examination and use. In turn, BBSRC will provide support and funding to facilitate appropriate data sharing activities […]

This is a good step in the right direction (provided it is also policed!!) and one can only hope the EPSRC, which, as far as I know, does not have a formal policy at the moment, will follow suit.

PMR: You're right, they don't. Astrid Wissenburg of the ESRC (Economic and Social Research Council) reviewed this. The EPSRC leaves everything to individual researchers. The implication is that the final product of research is a "paper" rather than a set of scholarly objects. And, of course, the creation and dissemination of scholarly papers is controlled by the publishers. In the UK this is the "level playing field" approach.

There’s another thing though. Educating people about data needs to be part of the curriculum starting with the undergraduate chemistry syllabus. And the few remaining chemical informaticians of the world need to get out of their server rooms and into the labs. [PMR: Yes]
If you think that organic chemists are bad in not wanting to have to do anything with informatics, well, informaticians are usually even worse in not wanting to have anything to do with flasks. And it makes me hopping mad when I hear that “this is not on the critical path”. Chemical informatics only makes sense in combination with experiments and it is the informaticians here that should lead the way and show the world just how successful a combination of laboratory and computing can be. It is that, which will educate the next generation of students and make them computer and data literate.
You might also argue, of course, that it should be a researcher's institution that takes care of data produced by the research organisation. Which then brings us on to institutional repositories. Well, the trouble here is: can an institution really produce the tools for archiving and dissemination? What a strange question, you will say. Is not Cambridge involved in DSpace and SPECTRa etc.? Yes. The point, though, is that scientific data is incredibly varied and new data with new data models gets produced all the time. Will institutional repositories really be able to evolve quickly enough to accommodate all this, or will they be limited to well-established data and information models because they typically operate a "one software fits all" model?
I may not be the best qualified person to judge this, but having worked in a number of large institutions in the past and observed the speed at which they evolve, there is nothing that leads me to believe that institutions and centralized software systems will be able to evolve rapidly enough. Jim, in a recent post, already alludes to something like this and makes reference to a post by Clifford Lynch, who defines institutional repositories as

a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members

[…]
Now the question here is what's in it for the individual researcher? Not much, I would say. Sure, some will care about data preservation, some will acknowledge the fact that publicly funded research is a public good and the researcher therefore has a duty of care towards the product, and will therefore care. But as discussed above: the point of generating data is ultimately getting the next publication and finishing the project…..what happens beyond that is often irrelevant to the researcher that generates the data, which means there is no point in expending much effort and maybe even climbing a learning curve to be able to archive it and disseminate it in any way other than by sticking it into pdf. Ultimately, the data disappears in the long tail. So if there is no obvious carrot, then maybe a stick? Well, the stick will live and die with, for example, the funders. Having a data policy is great, but it also needs to be policed and enforced. The funders can either do this themselves, or in the case of public money, might even conceivably hand this to a national audit office. And I think the broad gist of this discussion also applies to learned societies etc. So now we are back at the researcher. And back to needing to educate the researcher and the student…..and therefore ultimately back again at chemoinformaticians having to leave their server rooms and touch a flask…..
So how about the publishers? Haven't they traditionally filled this role? Yes they have, but of course they are now trying to harvest the data themselves and re-sell it to us in the form of data- and knowledge bases (see Wiley's Chemgate and eMolecules, for example). For that reason alone it seems utterly undesirable to have a commercial publisher continue to fill that role. If the publisher is an open access publisher, then getting at the data is not a concern, but the data format is….a publisher is just as much an institution as a library, and whether they will be able to be nimble enough to cope with constantly evolving data needs and models is doubtful. Which means we would be back to the generic "one software for all" model.
Which, at least to me, seems bad. The same, sadly, applies to learned societies.
Hmmm…the longer I think about it, the more I come to the conclusion that the lab chemists or the departments will have to do it themselves, assisted and educated by the chemoinformaticians and their own institutions, setting up small-scale, dedicated and light-weight repositories. The institutions will have to make a commitment to ensure long-term preservation, inter-linking and interoperability between repositories evolved by individual researchers or departments. And funders, finally, well funders will not only have to have a data policy like the BBSRC, but they will also have to police it and, in Jim's words, "keep the scientists honest".

PMR: This is a brilliant analysis and maps directly onto our discussions for the last 3 days. When Simon Coles, Liz Lyon and I presented yesterday we also stressed the idea of  embedded informaticians – i.e. wearing white coats. And IMO the sustainability has to come from heads of departments – we spend zillions of dollars collecting high quality data and either bin it directly or let it decay.

 

It's not easy. There isn't an obvious career path or funding, and it depends on traditions in countries. In the UK Informatics often means Computer Science, whereas in the US it means Library-and-Information-Science (LIS).

 

So if we need the spectrum of a polystyrene – where do we look? In the garbage bin … or worse.

Posted in Uncategorized | Tagged | 6 Comments

The unacceptable state of Chemoinformatics

Egon Willighagen has criticized a number of aspects of chemoinformatics (I don't blame Individuals in Commercial Chemoinformatics). This post was sparked off by an announcement that company A had agreed with company B that A would use B's software. This touched a raw nerve in Egon, who asked – why is this such a big deal? I understand Egon's position – I have had 25+ years of these marketing announcements – X and Y announce some joint activity – and I'm now inured to them. But he's right – this is not news – it's marketing.
Ultimately I blame the pharma companies – they have abrogated any sense of scientific or other rigour in chemoinformatics yet they effectively dominate the subject. Work in chemoinformatics is slanted towards what will appeal to pharma (and the appropriate publication mechanisms) rather than advancing our understanding of the science. The average publication relies on unobtainable datasets, closed bespoke software, undisclosed methodology and therefore unrepeatable science. It’s not surprising that we see the subject contracting – witness the closure of Cologne and the failure to fund the NIH Chemoinformatics program.
The pharma industry actively works against Open processes. Their exploratory science is in crisis. They would benefit from ODOSOS – Open Data (e.g. in basic chemistry, chemoinformatics, biological activity, ADME, pharmacology, etc.), Open Source and Open Standards. Yet they hide from sight, while actually using considerable amounts of Open Source software and data. While this is not illegal, it's certainly questionable whether an industry should take so much without returning anything – if nothing else on simple utilitarian grounds.
I'm feeling personally abused, having been let down by a pharma company who approached me saying they were interested in Open Source software, wanted me to visit, and then let me down. It wasted my time and left me out of pocket. It's not the first time. The Blue Obelisk now has good products – Open Babel, CDK, Jmol, CML, etc. We know they are used in companies. It would be nice to feel this was valued. They could afford to feed something back.

Posted in Uncategorized | Tagged | 4 Comments

HypoScience

I have blogged earlier (cyberscience: Changing the business model for access to data) on the lack of access to data in cyberscience – there may be a data deluge in some areas but there is a drought in many others. I used the neologism “hypopublication” to describe the fragmentation and suboptimal publication of much of the research and suggested…

However the Internet has the power to pull together this fragmentation if the following conditions are met:

  • the data are fully Open and exposed. There must be no cost, no impediment to access, no registration (even if free), no forms to fill in.
  • the data must conform to a published standard and the software to manage that standard must be Openly available (almost necessarily Open Source). The metadata should be Open.
  • the exposing sites must be robot-friendly (and in return the robots should be courteous – a sketch of a courteous robot follows this list).
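For the third condition, a "courteous" robot is easy to sketch: it honours robots.txt and pauses between requests. The target site and paths below are invented examples, not a real data service.

```python
# A sketch of a "courteous" robot: it checks robots.txt and rate-limits
# itself. The target site is an invented example.
import time
import urllib.request
import urllib.robotparser

BASE = "https://data.example.org"
USER_AGENT = "polite-science-bot/0.1"
DELAY_SECONDS = 5  # be gentle with the exposing site

rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

for path in ["/spectra/1.jdx", "/spectra/2.jdx"]:
    if not rp.can_fetch(USER_AGENT, BASE + path):
        print("skipping (disallowed):", path)
        continue
    req = urllib.request.Request(BASE + path, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    print("fetched", path, len(data), "bytes")
    time.sleep(DELAY_SECONDS)
```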

Carole Goble asked me about datuments and in reply I mentioned hypopublication. She liked the word and coined "HypoScience" in her talk. Again this meant fragmented work, though she also used the word "mediocre" – which I never intended (unless it was applied to the publishing industry).
Anyway, today at the invitation of JISC and Mellon we are trying to work out where resources would be best targeted to increase awareness and effectiveness of digital curation. Liz Lyon, Simon Coles and I are taking the case study of crystallography and extending it to science in general. But we exclude big science – telescopes, colliders, satellites, etc., where the communities have a full understanding of the problem – and concentrate on laboratories, with chemicals, small furry creatures, individual notebooks, white coats, etc. That's where preservation (and often publication) is worst and where JISC and Mellon can make the most impact.

Posted in Uncategorized | Tagged | Leave a comment

Digital Curation Day – end

The DCC finished with the customary summing-up – state-of-the… – presentation by Cliff Lynch. Cliff's talks are always entertaining – no visuals, so you actually have to listen to the words. [That's not a bad thing. I'm thinking of doing all my presentation today – see later – on a flip chart, but it's a small audience. Of course I could use P8w*p£&nt.] Literal notes from Cliff's talk:
Can we afford to keep changing standards?

What is acceptable data loss?
What is acceptable for recomputability? Should we recompute on demand? Could be swung by workflows?
Cost of reobservability?
BOINC for distributed computing. Gets buy-in from the public. Sony have put support in the PlayStation. Sony is prepackaging protein folding.
Where is the aggregate resource for preservation going to come from?
How are we to set up organizations to do it?
May divide world into public objects funded by government OR into research universities and cultural memory organizations.
Talk of “business models” may be a way of avoiding facing up to real costs and sustainability.
Need business model that maps onto lifecycle of objects

So in conclusion I think that the issues are well recognized by the participants. Some are doing huge and exciting things. The well organized communities are probably well served – they have to concentrate on the bits and the media. But the scattered communities are much worse off.
Some very positive things for me personally:

  • ORE will fly. Jane Hunter has actually built ORE into SCOPE (I should have mentioned that). That's very important because any system has to be seen to be implementable. Jim Downing blogs this (Roundup 14th Dec P.S.). The eChemistry project will use ORE as the fundamental architecture (a sketch of an ORE aggregation follows this list). I described it as "changing the public face of chemical information". Perhaps it would be more accurate to say: "creating the public face of chemical information" – i.e. we have a new paradigm where centralised databases give way to scientist-based collections.
  • SPECTRa is being adopted. Everything in repository-world comes with the warning that you have to pay for the glue – there is not, and will not be, a click-and-go approach. So it's great to see that UIUC has budgeted for the glue.
  • Huge and vibrant contingent from Australia. Australia really has got its act together – they are a community where everyone works together. I'm told that this is partly because the government mandates and funds (sounds like "socialized repositories" :-)). I'm going to Oz next February so it was great to meet so many – Margaret Henty, Andy Treloar, Jane Hunter…
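For the first bullet, here is a sketch (using rdflib in Python) of the kind of OAI-ORE resource map I mean: one aggregation pulling together a molecule's structure file and a spectrum. The ORE vocabulary URI is the published one; every other URI is an invented example.

```python
# A sketch of an OAI-ORE resource map for a "scientist-based collection":
# one aggregation gathering a structure file and a spectrum. Only the ORE
# vocabulary namespace is real; all item URIs are invented.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

rem = URIRef("http://example.org/molecules/nsc383501/rem")          # resource map
agg = URIRef("http://example.org/molecules/nsc383501/aggregation")  # aggregation
structure = URIRef("http://example.org/molecules/nsc383501/structure.cml")
spectrum = URIRef("http://example.org/molecules/nsc383501/nmr.jdx")

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, structure))
g.add((agg, ORE.aggregates, spectrum))
print(g.serialize(format="turtle"))
```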

Anyway, thanks for a great meeting. A huge communal sympathy for ChrisR – we all signed a book with parchment pages – I can predict that it at least will be preserved.

Posted in Uncategorized | Tagged | Leave a comment

Digital Curation Day 2 – Jane Kwok and SCOPE

For me the highlight of the late morning session was Jane Hunter and colleagues (Queensland) describing their SCOPE system for managing Compound Document Objects (CDOs). Jane is a materials scientist turned informaticist (hope that's fair) and we'd already been partners on a grant application (didn't get it, but …) – so this was the first time we'd met.
Jane did a double act with her colleague Kwok Cheung and inter alia gave a demo of their SCOPE system – which manages CDOs while preserving an impressive metadata structure in RDF. Some bullets:
Increasing pressure to share and publish data while maintaining competitiveness.
Main problem lack of simple tools for recording, publishing, standards.
What is the incentive to curate and deposit? What granularity? concern for IP and ownership
Current problems with traditional systems – little semantic relationship, little provenance, little selectivity, interactivity or flexibility, and often fixed rendering and interfaces.
no multilevel access. either all open or all restricted
usually hardwired presentation
Capture scientific provenance through RDF (and can capture events in physical and digital domain)
Compound Digital Objects – variable semantics, media, etc.
Typed relationships within the CDOs. (this is critical)
CDO has Identifier URI – can sit in different places.
SCOPE has a simplified tool for authoring these objects. Can create provenance graphs. Infer types as much as possible. RSS notification. Comes with a graphical provenance explorer.
Have developed an ontology for Relationships. Rules for inferences are coded in SWRL and run with Algernon. (PMR: Note, must look into SWRL seriously – we should/could use it).
Jane had picked up on the idea of datuments that Henry Rzepa and I coined a little while back and has suggested we expand these with semantics. Basically a datument is a compound document in XML, often with several namespaces and a separate rendering approach. She suggests a semantic datument – presumably XML with embedded RDF. Sounds great.
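A minimal sketch of what a datument skeleton might look like – human-readable XHTML with a machine-readable CML fragment embedded under its own namespace (RDF could be embedded in the same way) – is given below; the chemistry is a toy example, not data from the talk.

```python
# A minimal datument skeleton: XHTML prose with an embedded CML fragment,
# each under its own namespace. The molecule is a toy example.
from xml.etree.ElementTree import Element, SubElement, tostring

XHTML = "http://www.w3.org/1999/xhtml"
CML = "http://www.xml-cml.org/schema"

html = Element(f"{{{XHTML}}}html")
body = SubElement(html, f"{{{XHTML}}}body")
p = SubElement(body, f"{{{XHTML}}}p")
p.text = "Ethanol, characterised by NMR."

molecule = SubElement(body, f"{{{CML}}}molecule", id="ethanol")
atoms = SubElement(molecule, f"{{{CML}}}atomArray")
SubElement(atoms, f"{{{CML}}}atom", elementType="C", id="a1")
SubElement(atoms, f"{{{CML}}}atom", elementType="O", id="a2")

print(tostring(html, encoding="unicode"))
```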

Posted in Uncategorized | Leave a comment