One thing I find remarkable about many data projects is how much effort goes into developing a shiny front-end for the material. Now I’m not knocking shiny front-ends; they’re important for providing a way for many users to get at the material (and very useful for demonstrating to funders where all the money went). But shiny front-ends (SFEs from now on) do have various drawbacks:
- They often take over completely and start acting as a restriction on the way you can get data out of the system. (A classic example of this is the Millennium Development Goals website, which has lots of shiny ajax that actually makes it really hard to grab all of the data out of the system — please, please just give me a plain old csv file and a plain old url.)
- Even if the SFE doesn’t actually get in the way, they do take money away from the central job of getting the data out there in a simple form, and …
- They tend to date rapidly. Think what a website designed five years ago looks like today (hello css). Then think about what will happen to that nifty ajax+css work you’ve just done. By contrast ascii text, csv files and plain old sql dumps (at least if done with some respect for the ascii standard) don’t date — they remain forever in style.
- They reflect an interface-centric, rather than data-centric, point of view. This is wrong. Many interfaces can be written to that data (and not just a web one) and it is likely (if not certain) that a better interface will be written by someone else (albeit perhaps with some delay). Furthermore the data can be used for many other purposes than read-only access. To summarize: the data is primary, the interface secondary.
- Taking this issue further, for many projects, because the interface is taken as primary, the data does not get released until the interface has been developed. This can cause significant delay in getting access to that data.
When such points are made people often reply: “But you don’t want the data raw, in all its complexity. We need to clean it up and present it for you.” To which we should reply:
“No, we want the data raw, and we want the data now”
PMR: I have sympathy for this motivation and we’ve seen a fair amount of traffic on this blog basically saying “Give us the data now in a form we know and understand”. And it’s taking longer than we thought because the data aren’t in a form that is easily dealt with. Not because we have built an SFE, but because the data “just grew”. It’s probably fair to say that making material available in HTML helps to promote the demand for the data but may hinder the structure.
What we have learnt is that each person has an implicit information environment which may be difficult to transport. It isn’t just one giant CSV file; it’s lots of submit scripts, scraping tools, etc. It is currently easier to ask Nick Day to run jobs in his environment than to abstract them into mine. It will have to be done, but it’s harder than we budgeted for.
So one of the benefits of Open Data – albeit painful – is that when you are asked for it, it helps if it’s structured well. That structure doesn’t evolve naturally – it has to be thought about. There is actually no raw data. There are chunks of data mixed with metadata (often implicit) and tied together with strings of process glue. We’ll know better next time.
The best way to deal with this issue is to put the data in an Exhibit. You can have the data in a spreadsheet and the user can request output in a variety of well-structured formats. For larger data sets, there’s a separate project.
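A rough sketch of what that might look like, assuming Exhibit’s JSON data format (an object with an “items” array) and purely hypothetical filenames:

```python
import csv
import json

# Hypothetical filenames: a CSV export of the spreadsheet in, Exhibit-style JSON out.
CSV_IN = "dataset.csv"
JSON_OUT = "dataset.json"

with open(CSV_IN, newline="", encoding="utf-8") as f:
    items = list(csv.DictReader(f))

# Exhibit reads a JSON object whose "items" array holds one record per row;
# give each record a "label", here taken from the first column of the spreadsheet.
if items:
    label_key = next(iter(items[0]))
    for item in items:
        item.setdefault("label", item[label_key])

with open(JSON_OUT, "w", encoding="utf-8") as f:
    json.dump({"items": items}, f, indent=2)
```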
(1) Thanks – good idea. We’ll consider this seriously
Peter, you’re raising a really good question here (as is Jim in the accompanying post) which I didn’t really address properly. My question to you here would be: what’s the simplest way you could provide *bulk* access (that is, in addition to the web presentation you already do)? Would a simple db dump + tarball of any data directories (plus a simple README.txt/INSTALL.txt/LICENSE.txt) be sufficient, and if so how costly would that be?
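To make the question concrete, the sort of dump being asked about needn’t be much more than the following sketch (the paths, database name and the choice of pg_dump are all hypothetical):

```python
import os
import subprocess
import tarfile
import time

# Hypothetical locations: the live data directory, the database, and where
# the public dump should end up.
DATA_DIR = "/srv/project/data"
DB_NAME = "projectdb"
STAMP = time.strftime("%Y-%m-%d")
DUMP_SQL = f"/tmp/{DB_NAME}-{STAMP}.sql"
OUT_TAR = f"/srv/dumps/project-{STAMP}.tar.gz"

# 1. Plain SQL dump of the database (pg_dump here; mysqldump would do equally well).
with open(DUMP_SQL, "w") as sql_file:
    subprocess.run(["pg_dump", DB_NAME], stdout=sql_file, check=True)

# 2. One tarball containing the SQL dump, the raw data directories, and the
#    small text files that tell a stranger what they have just downloaded.
with tarfile.open(OUT_TAR, "w:gz") as tar:
    tar.add(DUMP_SQL, arcname=os.path.basename(DUMP_SQL))
    tar.add(DATA_DIR, arcname="data")
    for name in ("README.txt", "INSTALL.txt", "LICENSE.txt"):
        tar.add(os.path.join("/srv/project", name), arcname=name)
```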
(3) Would a simple db dump + tarball of any data directories (plus a simple README.txt/INSTALL.txt/LICENSE.txt) be sufficient, and if so how costly would that be?
Unfortunately it’s not simple. That’s the problem. There are probably 5 million files and ca. 10-50 GB. Also it gets updated daily, so we have to make sure it preserves link integrity. And I have a nasty feeling that some of the links may be hardcoded.
That’s why we are developing MOLECULAR REPOSITORIES!
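For what it’s worth, the link-integrity worry could at least be surveyed with a small script along these lines; this is a sketch only, with a hypothetical dump path and site host, that flags hardcoded links back to the live site and relative links whose target file is missing:

```python
import os
import re

# Hypothetical dump root and live-site host.
ROOT = "/srv/crystaleye-dump"
LIVE_HOST = "data.example.org"
LINK_RE = re.compile(r'(?:href|src)="([^"]+)"', re.IGNORECASE)

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        if not name.endswith((".html", ".htm")):
            continue
        page = os.path.join(dirpath, name)
        with open(page, encoding="utf-8", errors="replace") as f:
            html = f.read()
        for link in LINK_RE.findall(html):
            if LIVE_HOST in link:
                # Hardcoded link back to the live site: will break inside a dump.
                print(f"hardcoded: {page} -> {link}")
            elif not link.startswith(("http://", "https://", "mailto:", "#")):
                # Relative link: check that the target actually exists in the dump.
                target = os.path.normpath(
                    os.path.join(dirpath, link.split("#")[0].split("?")[0]))
                if not os.path.exists(target):
                    print(f"broken:    {page} -> {link}")
```

(A regex scan is crude next to a real HTML parser, but it is enough to get a feel for how many hardcoded links there are.)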
Whether it would be sufficient is subjective, and the point of my post on the subject. I’d be more than happy with Atom feeds! I can talk about the practicalities of providing the dump files.
CrystalEye has no db to dump, which simplifies things somewhat. The full directory structure occupies 56 GB, and would probably compress to around 8 GB. Over half of that is CML, and a fair chunk of what remains is CIF (i.e. the HTML doesn’t occupy much, so if we wanted to provide all of this there’s no profit for us in extracting the CML). Andrew Dalke reckoned it would cost around $180 p.a. to provide this through S3; I think that’s a good estimate, and it’ll probably be less if we can keep the search bots off it. The main cost would be the person-time preparing the files. We’re also not particularly keen to keep on doing this – providing incremental batches would not be as easy as the first set.
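One way incremental batches might be approached, as a sketch only: assume file modification times are a good-enough proxy for “new since the last dump” (the paths are hypothetical throughout):

```python
import os
import tarfile
import time

# Hypothetical paths: the live archive, a timestamp file recording the last
# dump, and the dated incremental tarball to produce.
ROOT = "/srv/crystaleye"
STATE_FILE = "/srv/dumps/last-dump-time"
OUT_TAR = time.strftime("/srv/dumps/crystaleye-incr-%Y-%m-%d.tar.bz2")

try:
    with open(STATE_FILE) as f:
        last_dump = float(f.read().strip())
except FileNotFoundError:
    last_dump = 0.0  # no previous dump recorded: this batch becomes a full dump

started = time.time()
with tarfile.open(OUT_TAR, "w:bz2") as tar:
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only files touched since the last dump go into this batch.
            if os.path.getmtime(path) > last_dump:
                tar.add(path, arcname=os.path.relpath(path, ROOT))

with open(STATE_FILE, "w") as f:
    f.write(str(started))
```

A modification-time approach like this says nothing about files that were deleted or moved between batches, which is part of why incremental dumps are harder than the first full set.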