Andrew Dalke has raised two useful issues and I will address them separately. The first is about integrity of a repository (I will start using that word rather than database).
PMR: Yes, we have backups, but they address the integrity of the filesystem at a given point in time, not the integrity of links in a hypermedia system. For example, suppose you are uploading a set of web pages to a server, the server is backed up in the middle of the upload, and you later revert to that backup. The filesystem will be correct as of that time, but some of the hyperlinks may point to files that were uploaded after the backup point. That’s a difficult problem, and unless you operate with bounded object sets or standoff linkbases it’s not soluble. At present CrystalEye does not use linkbases.
We are looking into lightweight repository technology for molecules and Jim will probably be writing about this elsewhere.
Now imagine that a spider starts to download the entries and “finishes” a month later. During that time several thousand new entries will have been added. The spider then extracts the bond length histograms. These will point to the old and new entries and our site will honour the links. But the spider will not have downloaded the new entries. So the histograms on the spider’s site will have hundreds of thousands of broken links. Users of the spider’s histograms may get the impression that our site is broken when it is not. That is a simple example of not being able to honour the integrity of the work.
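The spider scenario above can be illustrated with a small check. This is only a sketch under stated assumptions: the entry identifiers and histogram names are hypothetical, and it does not reflect how CrystalEye actually stores its links. It simply compares the entries a spider holds with the entries its histograms point to.

```python
def find_broken_links(snapshot_entries, histogram_links):
    """Return, per histogram, the linked entry ids missing from the snapshot.

    snapshot_entries: iterable of entry ids the spider actually downloaded.
    histogram_links:  dict mapping a histogram name to the entry ids it links to.
    """
    have = set(snapshot_entries)
    broken = {}
    for histogram, targets in histogram_links.items():
        missing = sorted(t for t in targets if t not in have)
        if missing:
            broken[histogram] = missing
    return broken


# Hypothetical data: "entry-003" was added after the spider passed that point.
snapshot = ["entry-001", "entry-002"]
links = {
    "C-C bond lengths": ["entry-001", "entry-003"],
    "C-N bond lengths": ["entry-002"],
}
broken = find_broken_links(snapshot, links)
print(broken)
```

Any non-empty result means the redistributed copy contains links that resolve on the original site but not within the copy itself.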
if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection
ftp and bittorrent do very good jobs of transferring 100 GB files. I mentioned in another comment that using a system like Amazon’s S3 makes it easy to distribute the data, and costs about US $20 in bandwidth for a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place. The data distribution site does not need to be on the same machine as your service. That’s a key part of a ReST architecture.
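The point about needing multiple files can be sketched as follows. This is a minimal illustration, taking the 5GB per-file cap and the 100GB total from the comment above as given; the function name and the decimal definition of a gigabyte are assumptions, and no Amazon API is used.

```python
def plan_chunks(total_bytes, max_chunk_bytes):
    """Split a file of total_bytes into (offset, length) pieces,
    none larger than max_chunk_bytes."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("sizes must be non-negative, chunk size positive")
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk_bytes, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


GB = 10**9  # assumption: decimal gigabytes
parts = plan_chunks(100 * GB, 5 * GB)
print(len(parts))
```

Each (offset, length) pair identifies a byte range of the archive that could be written out and uploaded as its own file.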
PMR: Thanks for the suggestion. This would be OK if we made a snapshot of CrystalEye every year. (Even then it’s hard work to produce a complete distribution that honours integrity.) But we want to keep users updated and feel that Atom feeds (which we have already started) are the better way. Jim’s repository should be able to provide the mechanism for regular snapshots.
November 4th, 2007 at 2:19 am
During that time the database will have grown by 10-15% so that that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system.
What? Are you saying you don’t have backups for your system? If it goes down and you recover from backups, will pages be broken? I hope not! And if not, then use the backups to generate the distribution. That can’t break the server.