CrystalEye and repositories: distribution and integrity

Andrew Dalke has raised two useful issues and I will address them separately. The first is about integrity of a repository (I will start using that word rather than database).

  1. Andrew Dalke Says:
    November 4th, 2007 at 2:19 am
    “During that time the database will have grown by 10-15% so that that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system.”
    What? Are you saying you don’t have backups for your system? If it goes down and you recover from backups, will pages be broken? I hope not! And if not, then use the backups to generate the distribution. That can’t break the server.

PMR: Yes, we have backups, but that addresses the integrity of the filesystem at a given point in time, not the integrity of links in a hypermedia system. For example, if you are uploading a set of web pages to a server, and it is backed up in the middle of that, and you revert to that backup, the filesystem will be correct as of that time but some of the hyperlinks may point to files that were uploaded after the backup point. That’s a difficult problem, and unless you operate with bounded object sets or standoff linkbases it’s not soluble. At present CrystalEye does not use linkbases.
We are looking into lightweight repository technology for molecules and Jim will probably be writing about this elsewhere.
Now imagine that a spider starts to download the entries and “finishes” a month later. During that time several thousand new entries will have been added. The spider then extracts the bond length histograms. These will point to the old and new entries and our site will honour the links. But the spider will not have downloaded the new entries. So the histograms on the spider’s site will have hundreds of thousands of broken links. Users of the spider’s histograms may get the impression that our site is broken when it is not. That is a simple example of not being able to honour the integrity of the work.
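The spider scenario can be illustrated with a minimal sketch. The entry identifiers and the `broken_links` helper are hypothetical, not part of CrystalEye; the point is only that a derived page built after the crawl can reference entries the mirror never fetched.

```python
def broken_links(histogram_links, mirrored_entries):
    """Return histogram links that point at entries the mirror never downloaded."""
    return sorted(set(histogram_links) - set(mirrored_entries))


# Hypothetical ids: the spider mirrored three entries before it "finished".
mirrored = {"entry-001", "entry-002", "entry-003"}

# The histogram was extracted later, after two new entries had been added.
histogram_links = ["entry-001", "entry-003", "entry-004", "entry-005"]

missing = broken_links(histogram_links, mirrored)
print(missing)  # links that resolve on the origin site but not on the mirror
```

On the origin site every one of these links resolves; only the mirror's copy is inconsistent.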

  1. if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection
    ftp and bittorrent do very good jobs of transferring 100 GB files. I mentioned in another comment that using a system like Amazon’s S3 makes it easy to distribute the data, and costs about US $20 in bandwidth for a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.
  2. The data distribution site does not need to be on the same machine as your service. That’s a key part of a ReST architecture.

PMR: Thanks for the suggestion. This would be OK if we made a snapshot of CrystalEye every year. (Even then it’s hard work to produce a complete distribution that honours integrity.) But we want to keep users updated and feel that Atom feeds (which we have already started) are the better way. Jim’s repository should be able to provide the mechanism for regular snapshots.


One Response to CrystalEye and repositories: distribution and integrity

  1. Andrew Dalke says:

    I think it’s a poor idea to limit the idea of “backup” to mean restoring the “integrity of the filesystem.” I meant being able to create a backup of your repository so you have a valid, recoverable and usable snapshot. A filesystem backup is probably not sufficient without some thought and planning because it does not have the right transactional requirements. Eg, if my app saves to two files in order to generate a backup, and the filesystem backup occurs after the first file is written but before the second then there’s going to be a problem.
    It does sound like if you get a crash while data is being uploaded, and you restore from that backup, then there will be corruption in CrystalEye. That is, “some of the hyperlinks may point to files that were uploaded after the backup point.”
    If data is withdrawn from CrystalEye (eg, a flaw was found in the original data, or there were legal complications causing it to be removed), will that lead to similar problems?
    I’m rather fond of the old-fashioned “databank” term from the PDB, but it’s not a fashionable term and most would look askance at it, over “repository.”
    If you provided versioned records, so that users can retrieve historical forms of a record, then the solution is easy. Ask the spiders to first download a list of URLs which are valid for a given moment in the repository. They can then use those URLs to fetch the desired records, knowing that they will all be internally consistent. Doing this requires some architectural changes which might not be easy, so not necessarily useful to what you have now. I suspect that’s what you mean by “lightweight repository technology”. I’ve been thinking it would be interesting to try something like git, Mercurial or BZR as the source of that technology, but I wouldn’t want to commit to it without a lot of testing.
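The manifest idea in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not a description of CrystalEye or of any existing repository software: the `VersionedRepo` class, the `/records/<id>?rev=<n>` URL shape, and the single global revision counter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRepo:
    """Toy in-memory repository that keeps every historical form of a record."""
    revision: int = 0
    # record id -> list of (revision, content), oldest first
    history: dict = field(default_factory=dict)

    def put(self, record_id, content):
        """Store a new version of a record and return the new repository revision."""
        self.revision += 1
        self.history.setdefault(record_id, []).append((self.revision, content))
        return self.revision

    def manifest(self, at_revision):
        """List versioned URLs for every record as it stood at at_revision."""
        urls = []
        for record_id, versions in sorted(self.history.items()):
            valid = [rev for rev, _ in versions if rev <= at_revision]
            if valid:
                urls.append(f"/records/{record_id}?rev={valid[-1]}")
        return urls

    def get(self, record_id, rev):
        """Return the content of one specific historical version."""
        for r, content in self.history[record_id]:
            if r == rev:
                return content
        raise KeyError((record_id, rev))


repo = VersionedRepo()
repo.put("a", "a-v1")
repo.put("b", "b-v1")
snapshot = repo.revision      # the spider asks for a manifest at this moment
repo.put("a", "a-v2")         # later changes do not affect that manifest
repo.put("c", "c-v1")
print(repo.manifest(snapshot))
```

A spider that fetches only the URLs in the manifest gets a set of records that is internally consistent for that revision, however long the download takes.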
    BTW, what’s a “linkbase”? In the XLink spec it’s “[d]ocuments containing collections of inbound and third-party links.” There are only two Google hits for “standoff linkbase[s]”, and both are from you (one is this posting) without definition, so I cannot respond effectively. I don’t see how issues of internal data integrity have anything to do with needing a linkbase. If all of the data was in some single ACID-compliant DBMS then it’s a well-solved problem.
    The solution I sketched above does solve this problem and it uses a list of versioned URLs so it might be a “linkbase”. But I can think of other possible solutions, like passing some sort of version token in the HTTP request, so as to retrieve the correct version of a record, or a service which takes the record id and that token and returns the specific document. That would be less ReSTy, but still a solution.
    As for your example of spidering a histogram, that depends on how the spider is implemented. If the 3rd party site which maintains some view of CrystalEye receives a request like that for something it doesn’t know about, it might at that moment query the primary CrystalEye repository and see if it actually is present. In that case the other site acts like a cache. It might do the same with normal pages, and use standard HTTP cache rules to check and freshen its copy so the user is less likely to be affected by version skew.
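The cache-like behaviour described above might look like the sketch below. The `Mirror` class and the `fetch_from_origin` callable are hypothetical stand-ins; a real mirror would issue HTTP requests and also revalidate stored copies under the standard cache rules, which this sketch leaves out.

```python
class Mirror:
    """Third-party copy that falls back to the primary repository on a miss."""

    def __init__(self, fetch_from_origin):
        self.local = {}                           # records already downloaded
        self.fetch_from_origin = fetch_from_origin

    def get(self, record_id):
        if record_id not in self.local:
            # Unknown here: ask the primary repository before reporting a broken link.
            content = self.fetch_from_origin(record_id)
            if content is None:
                return None                       # genuinely absent upstream too
            self.local[record_id] = content
        return self.local[record_id]


# Hypothetical origin, modelled as a dict lookup instead of an HTTP GET.
origin = {"entry-004": "record added after the crawl"}
mirror = Mirror(origin.get)
print(mirror.get("entry-004"))   # fetched on demand, then kept locally
print(mirror.get("entry-999"))   # None: missing on the origin as well
```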
    There’s a question of tradeoff here. How much “corruption”, using your term, is the user willing to accept in order to get more or different functionality? It seems your view is you think there should be zero, no, null chance of corruption, which I think is laudable for a data provider. But evidence in existing use patterns suggests that people don’t mind the occasional error. I use Google despite knowing that it’s not necessarily up to date and can link to removed pages, or pages hidden behind a registration wall.
    Regarding a 100GB data set, the context was in response to the statement that “downloading that is likely to break the server and the connection.” I pointed out that this statement isn’t necessarily true, along with examples of possible solutions. That’s a different question than if such a distribution is appropriate, which is what I think you replied to here.
    However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.
