Why oh why oh why….? Digital uncuration


My colleague Nico Adams has written at great and useful length (Why oh why oh why….? ) about the appalling state of data capture, dissemination, preservation and curation. He describes how he found some very valuable data only to discover that the packaging was a nested PDF. This is a monstrosity so awful that we probably need a special name. I have borrowed the aphorism that turning a PDF into XML is like turning a hamburger into a cow. (Having now worked in depth with PDFs, that analogy is slightly unfair to hamburgers.) So perhaps this is a triple-decker flamed whopper. The point is that within the PDF are more little PDFs, like Russian dolls. Wonderful material for the semantic web. The spectra – originally perfectly usable ASCII numbers – had, of course, been transformed into useless PDF hamburgers.

Read the post. Here are some snippets, and then I add some comments. Nico continues to lament how it should be so straightforward to capture the data, archive it and re-use it.

And that’s when it hit me really hard…..what a waste this is….all these spectra, reported and accessible to the world in this format. I mean I would love to get my grubby mittens on this data…..if there were some polymer property data there too, it could potentially be a wonderful dataset for mining, structure-property relationships, the lot. But of course, this is not going to happen if I can only get at the data in the form of tiny spectral graphs in pdf…..there is just so little one can do with that. What I would really need is the digital raw data…preferably in some open format, which I can download, look at and work on. But because I cannot get to the data and do stuff with it, it is potentially lost.
[…] So who is to blame? The scientist for being completely unthinking and publishing his data as graphs in pdf? Adobe for messing with pdf? The publishers for using pdf and new formats and keeping the data inaccessible unless I as a user use their technology standard (quite apart from needing to subscribe to the journal etc.)? The chemical community at large for not having evolved mechanisms and processes to make it easy for researchers to distribute this data? The science infrastructure (e.g. libraries, learned societies etc.) for not providing necessary infrastructure to deal with data capture and distribution? Well, maybe everybody….a little….
Let’s start with ourselves, the scientists. Certainly, when I was a classically working synthetic chemist, data didn’t matter.  […] – even the not-so-standard chemists, namely the combinatorial and high-throughput ones, who generate more data and should therefore be interested, often fall into this trap: the powerful combination of Word and Excel as datastores. And who can blame them – for them, too, data is a means, not an end. They are interested in the knowledge they can extract from the data…and once extracted, the data becomes secondary.
Now you might argue that it is not a chemist’s job to worry about data, it’s his job to do chemistry and make compounds (I know…it’s a myopic view of chemistry but let’s stick to it for now). And yes, that is a defensible point, though I think that certainly with the increasingly routine use of high-throughput and combinatorial techniques, it is becoming less defensible. Chemists need to realise that the data they produce has value beyond the immediate research project for which it was produced. Furthermore, it has usually been generated at great cost and effort and should be treated as a scarce resource. Apart from everything else, data produced through public funding is a public good and produced in the public interest. So I think chemists have to start thinking about data….and it won’t come easy to them. And one way of doing this, is of course, to get them where it hurts most: the money. So the recent BBSRC data policy initiative seems to me to be a step in the right direction:

BBSRC has launched its new data sharing policy, setting out expected standards of data sharing for BBSRC-supported researchers. The policy states that BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for examination and use. In turn, BBSRC will provide support and funding to facilitate appropriate data sharing activities […]

This is a good step in the right direction (provided it is also policed!!) and one can only hope the EPSRC, which, as far as I know, does not have a formal policy at the moment, will follow suit.

PMR: You’re right. They don’t. Astrid Wissenburg of the ESRC (Economic and Social Research Council) reviewed this. The EPSRC leaves everything to individual researchers. The implication is that the final product of research is a “paper” rather than a set of scholarly objects. And, of course, the creation and dissemination of scholarly papers is controlled by the publishers. In the UK this is the “level playing field” approach.

There’s another thing though. Educating people about data needs to be part of the curriculum starting with the undergraduate chemistry syllabus. And the few remaining chemical informaticians of the world need to get out of their server rooms and into the labs. [PMR: Yes]
If you think that organic chemists are bad in not wanting to have anything to do with informatics, well, informaticians are usually even worse in not wanting to have anything to do with flasks. And it makes me hopping mad when I hear that “this is not on the critical path”. Chemical informatics only makes sense in combination with experiments and it is the informaticians here that should lead the way and show the world just how successful a combination of laboratory and computing can be. It is that which will educate the next generation of students and make them computer and data literate.
You might also argue, of course, that it should be a researcher’s institution that takes care of data produced by the research organisation. Which then brings us on to institutional repositories. Well, the trouble here is: can an institution really produce the tools for archiving and dissemination? What a strange question, you will say. Is not Cambridge involved in DSpace and SPECTRa etc.? Yes. The point, though, is that scientific data is incredibly varied and new data with new data models gets produced all the time. Will institutional repositories really be able to evolve quickly enough to accommodate all this or will they be limited to well-established data and information models because they typically operate a “one software fits all” model?
I may not be the best qualified person to judge this, but having worked in a number of large institutions in the past and observed the speed at which they evolve, there is nothing that leads me to believe that institutions and centralized software systems will be able to evolve rapidly enough. Jim, in a recent post, already alludes to something like this and makes reference to a post by Clifford Lynch, who defines institutional repositories as

“a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”

[…]
Now the question here is: what’s in it for the individual researcher? Not much, I would say. Sure, some will care about data preservation, some will acknowledge the fact that publicly funded research is a public good and the researcher therefore has a duty of care towards the product and will therefore care. But as discussed above: the point of generating data is ultimately getting the next publication and finishing the project…..what happens beyond that is often irrelevant to the researcher who generates the data, which means there is no point in expending much effort and maybe even climbing a learning curve to be able to archive it and disseminate it in any way other than by sticking it into pdf. Ultimately, the data disappears in the long tail. So if there is no obvious carrot, then maybe a stick? Well, the stick will live and die with, for example, the funders. Having a data policy is great, but it also needs to be policed and enforced. The funders can either do this themselves, or in the case of public money, might even conceivably hand this to a national audit office. And I think the broad gist of this discussion also applies to learned societies etc. So now we are back at the researcher. And back to needing to educate the researcher and the student…..and therefore ultimately back again at chemoinformaticians having to leave their server rooms and touch a flask…..
So how about the publishers? Haven’t they traditionally filled this role? Yes, they have, but they are now trying to harvest the data themselves and re-sell it to us in the form of data- and knowledge bases (see Wiley’s Chemgate and eMolecules, for example). For that reason alone it seems utterly undesirable to have a commercial publisher continue to fill that role. If the publisher is an open access publisher, then getting at the data is not a concern, but the data format is….a publisher is just as much an institution as a library and whether they will be able to be nimble enough to cope with constantly evolving data needs and models is doubtful. Which means we would be back to the generic “one software for all” model.
Which, at least to me, seems bad. The same, sadly, applies to learned societies.
Hmmm…the longer I think about it, the more I come to the conclusion that the lab chemists or the departments will have to do it themselves, assisted and educated by the chemoinformaticians and their own institutions, setting up small-scale, dedicated and lightweight repositories. The institutions will have to make a commitment to ensure long-term preservation, inter-linking and interoperability between repositories evolved by individual researchers or departments. And funders, finally, well, funders will not only have to have a data policy like the BBSRC, but they will also have to police it and, in Jim’s words, “keep the scientists honest”.

PMR: This is a brilliant analysis and maps directly onto our discussions over the last 3 days. When Simon Coles, Liz Lyon and I presented yesterday we also stressed the idea of embedded informaticians – i.e. wearing white coats. And IMO the sustainability has to come from heads of departments – we spend zillions of dollars collecting high-quality data and either bin it directly or let it decay.

 

It’s not easy. There isn’t an obvious career path or funding, and it depends on national traditions. In the UK “Informatics” often means Computer Science, whereas in the US it means Library and Information Science (LIS).

 

So if we need the spectrum of a polystyrene – where do we look? In the garbage bin … or worse.


6 Responses to Why oh why oh why….? Digital uncuration

  1. Rich Apodaca says:

    The reason for this problem is obvious: nobody has built an easy-to-use and cheap tool to manage, maintain, and publish spectral data. We see PDFs everywhere because they are, rightly or wrongly, being used as that tool.
    The solution is “obvious”:
    * Build the Tool *
    Technically speaking, the world could have started blogging as early as 1994 or even sooner, but it didn’t. It took the emergence of software that made it cheap and easy to create a blog. In retrospect, it’s so obvious.
    There’s nothing surprising here.
    Raising awareness will not solve the problem. Publisher mandates will not solve the problem. And the tutoring of chemists by cheminformaticians will not solve the problem.
    Only building the tool will solve the problem.

  2. pm286 says:

    (1) Thanks Rich.
    I don’t agree. The tools are there. Almost all spectrometer manufacturers allow JCAMP files to be emitted. You just have to press the right button. JCAMPs have valuable metadata and can be transformed into other forms (e.g. CML), viewed (with JSpecView) etc. We do all this with our SPECTRa system.
    The crystallographers use CIF. CIF and JCAMP are at the same level. The crystallographers send their CIFs to the publisher. The publisher mounts them on the web page. The tools are email and zip. The same works for JCAMP, only no-one does it. It’s not a question of tools but business practices.
    I can’t see what tools we require that we haven’t already got. If you think there is a killer app here, please let us all know. If you want all spectroscopy departments linked to repositories, it needs per-site glueware to link the tools. That’s what we are trying to solve.
    But there are sufficient tools if people want to deposit spectra.
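    To give a concrete (if much simplified) idea of what “valuable metadata” means here: a JCAMP-DX file is plain text made of labelled data records such as ##TITLE= and ##DATA TYPE=, so a few lines of script can read them. The sketch below is purely illustrative – it is not the SPECTRa code, and the file name is invented:

    # Minimal, illustrative reader for JCAMP-DX labelled data records.
    # Not SPECTRa; just shows that the metadata is ordinary text.
    def read_jcamp_metadata(path):
        metadata = {}
        with open(path, encoding="ascii", errors="replace") as f:
            for line in f:
                line = line.strip()
                if line.startswith("##") and "=" in line:
                    label, value = line[2:].split("=", 1)
                    metadata[label.strip().upper()] = value.strip()
        return metadata

    meta = read_jcamp_metadata("polystyrene_ir.jdx")  # hypothetical file name
    print(meta.get("TITLE"), meta.get("DATA TYPE"), meta.get("XUNITS"))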

  3. Nico Adams says:

    Rich, with all due respect, I disagree.
    Tools alone are not enough. I have in the past set up tools for my old lab, which would allow the capture of all sorts of machine-generated experimental data into a database by clicking precisely one button. The machine would extract all the metadata; the user didn’t need to lift a finger anymore….the tool was trivially easy to use. And yet, it was almost impossible to convince our chemists to press that button. No tool is any good if it does not get used: and even making something trivially easy to use and cheap is no guarantee of use.
    Rather, there needs to be an immediate perceived value. And in the case of any database or repository, there is hardly ever an immediate perceived benefit. Yes, you can reason with, for example, a PhD student and say: Look, submit these spectra now and when it comes to writing your thesis you will be able to pull all the material you need much better and much more quickly etc. But if you say that to a first-year PhD, the immediate disincentive (having to do work) far outweighs the potential benefit, which is far down the line. Tools alone are not enough. Yes, sure, they need to be trivially easy and cheap to use – that’s almost a given. But also, they need to become part of a chemist’s natural workflow – you use them because you can accomplish something with them… and they need to feel as natural as opening your email client, or firing up a stirrer-hotplate. And an institutional repository certainly looks and feels and, I suspect, in most cases, is an afterthought. By the way, Jim has just announced that SPECTRa has been released – so the tools are being built and are starting to become available.
    My point, though, was far wider than this. Tools, particularly institutional tools, are almost by necessity generic solutions, because they are being built by an institution. And so they will cover a large number of needs but not all of them; and even more, science evolves new data models constantly – and I do not believe that an institution is fast enough/nimble enough to adapt to research scientists’ needs quickly enough. So the point was that yes, we need to build the tools and to use them, but the individual researcher or department needs to do that – not the institution. Once the researcher has built the tool – in our case the repository – it’s the institution’s job to look after long-term preservation, interoperability etc. But for that to happen, we first require that researchers are aware that there is a problem with pdf and the Word/Excel combination. To generate that awareness requires tuition by chemoinformaticians. Then we can start building the tools. And then we will also have a much better chance that the tools will get used: if a researcher has to build them then there is a real ownership of the tool by the researcher, as opposed to some administrator or never-seen chemoinformatician coming along and saying: Look, I have built you a tool, now use it.
    Tools alone, I fear, are not enough.
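    (To make the “one button” idea concrete: a capture step in that spirit can be as little as sweeping an instrument drop folder and filing each spectrum plus its metadata in a database. The sketch below is only a rough illustration, not the tool I actually built – the folder name, file extension and table layout are all invented.)

    # Illustrative "one button" capture step: sweep a drop folder of JCAMP-DX
    # files and record each file plus a little metadata in a local database.
    # Not the original tool; paths and schema are invented.
    import sqlite3
    from pathlib import Path

    DROP_FOLDER = Path("instrument_output")   # invented folder name
    DB_FILE = "lab_spectra.db"                # invented database name

    def capture_all():
        con = sqlite3.connect(DB_FILE)
        con.execute("CREATE TABLE IF NOT EXISTS spectra "
                    "(filename TEXT PRIMARY KEY, title TEXT, data_type TEXT, raw TEXT)")
        for path in DROP_FOLDER.glob("*.jdx"):
            text = path.read_text(errors="replace")
            meta = {}
            for line in text.splitlines():
                if line.startswith("##") and "=" in line:
                    label, value = line[2:].split("=", 1)
                    meta[label.strip().upper()] = value.strip()
            con.execute("INSERT OR REPLACE INTO spectra VALUES (?, ?, ?, ?)",
                        (path.name, meta.get("TITLE"), meta.get("DATA TYPE"), text))
        con.commit()
        con.close()

    capture_all()   # the whole "one button"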

  4. Rich Apodaca says:

    Peter and Nico,
    Very interesting discussion – thanks for your thoughts.
    If the tools I’m talking about already exist, then I’d propose a friendly challenge.
    Write an article illustrating how simple and easy it is for an organic chemist of ordinary computer skill to use your tool(s) of choice to build, manage, and share a collection of IR or NMR spectra.
    “Ordinary computer skill” doesn’t include the use of a compiler, XML, mysql, or the command line ;-).

  5. pm286 says:

    (4)
    GOTO spectroscopist
    GIVE sample
    WAIT
    REQUEST “JCAMP” output
    if (haveMemoryStick) {
        copy
    } else {
        GIVE spectroscopist emailAddress
    }
    RETURN to lab
    ATTACH JCAMP to manuscriptSubmission
    WAIT till manuscript accepted
    REQUEST publisher to mount JCAMP on website
    No tools required. All is business process
    HOWEVER
    REQUEST refused by publisher
    NO-LIKE attachments
    NO-LIKE trouble
    NO-LIKE anything-but-PDF
    REPEAT until publisher changes mind or sky falls in

  6. Steven Bachrach says:

    I think that Peter has pretty much nailed this argument.
    The reason why CIFs are deposited and (potentially) re-used is that the editors of the crystallography journals require these data files as part of the journal article. For some reason, crystallographers established early on a database of structures (CCSD) and the community recognized the value of this database and so made it mandatory that new structures be deposited there – and the CIF became the standard.
    What will change the landscape of other chemical data is when editorial boards require deposition of spectral data, computed structures, kinetics results, etc. Without that carrot (or is it a stick?) your average research chemist will simply find it much easier to continue to author papers in the same way he/she has been doing it for the past 25 years. Get the spectra as a picture (i.e. a gif file) or, even worse, as simply a list of chemical shifts and splitting patterns for NMR, paste it into the MS Word document, submit that, have it turned into a pdf and that’s that.
    Steve
