CrystalEye and repositories: Jim explains the why and how of Atom

Since Atom may not be familiar to everyone, Jim Downing has written two expositions on his blog. These explain his reasoning on why a series of medium-sized chunks is a better way to support downloading CrystalEye than one or two giant files. Note that he is working on making some Java code available to help with the download – this should do the caching and remember where you left off. If you have technical questions I suggest you leave them on Jim’s blog; if you want to help the project in general, use my blog. If you want to hurry the process along by mailing Jim, please refrain. He works very well on occasional beers (he is a brewing aficionado).

Using the CrystalEye Atom feed – November 5th, 2007
Incremental harvesting is done by [the same mechanism], but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: –

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good (see the sketch below).
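
For readers who want to see what those conditional requests look like in code, here is a minimal sketch in Java using the standard HttpURLConnection. The feed URL and the load/save helpers are illustrative placeholders, not part of the real CrystalEye harvester; the point is simply how If-Modified-Since and If-None-Match produce the 304 response that tells you the increment is finished.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetch {
        public static void main(String[] args) throws IOException {
            // Illustrative URL - substitute the real CrystalEye feed address.
            URL feed = new URL("http://example.org/crystaleye/feed.xml");

            // Validators remembered from the previous harvest (hypothetical helpers).
            String lastETag = loadSavedETag();        // e.g. "\"abc123\"", or null on a first run
            long lastHarvest = loadSavedTimestamp();  // millis since epoch, or 0 on a first run

            HttpURLConnection conn = (HttpURLConnection) feed.openConnection();
            if (lastETag != null) {
                conn.setRequestProperty("If-None-Match", lastETag);
            }
            if (lastHarvest > 0) {
                conn.setIfModifiedSince(lastHarvest); // sends the If-Modified-Since header
            }

            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED) { // 304: nothing new
                System.out.println("Up to date - stop the incremental harvest here.");
            } else if (status == HttpURLConnection.HTTP_OK) {
                // Read and process the feed, then remember the new validators.
                saveETag(conn.getHeaderField("ETag"));
                saveTimestamp(conn.getLastModified());
            }
        }

        // Persistence stubs - a real harvester would keep these in a small state file.
        static String loadSavedETag() { return null; }
        static long loadSavedTimestamp() { return 0L; }
        static void saveETag(String etag) { }
        static void saveTimestamp(long millis) { }
    }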

Implementing a harvester

Atom archiving is easy to code to in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using
java -jar crystaleye-harvester.jar [directory to stick data in]
Letting this rip for a full harvest will take a while, and will take up ~10GB of disk space (although less bandwidth, since the content is transferred compressed).

Being a friendly client

First and foremost, please do not multi-thread your requests. [PMR emphasis]
Please put a little delay in between requests. A few hundred milliseconds should be enough; the sample harvester uses 500ms, which should be as much as we need.
If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
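
Putting those three points together, here is a minimal sketch of a friendly, single-threaded fetch loop. The list of URLs and the process() placeholder are assumptions for illustration; the gzip handling and the 500ms pause follow the guidance above.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.zip.GZIPInputStream;

    public class FriendlyFetcher {
        private static final long DELAY_MS = 500; // pause between requests, per the guidance above

        // Fetch each URL in turn - deliberately single-threaded, one request at a time.
        public static void fetchAll(Iterable<String> urls) throws IOException, InterruptedException {
            for (String u : urls) {
                URLConnection conn = new URL(u).openConnection();
                conn.setRequestProperty("Accept-Encoding", "gzip,deflate");

                InputStream in = conn.getInputStream();
                if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                    in = new GZIPInputStream(in); // server sent compressed content
                }
                try {
                    process(in); // placeholder: parse/save the entry
                } finally {
                    in.close();
                }

                Thread.sleep(DELAY_MS); // be polite between requests
            }
        }

        // Placeholder - a real harvester would write the CML/CIF to disk here.
        static void process(InputStream in) throws IOException {
            while (in.read() != -1) { /* drain */ }
        }
    }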

PMR: Many people don’t think of things like this when they write spiders. Jim’s software should do all the thinking for you.

OJD: Atom History vs Large Archive Files – November 5th, 2007
Incremental harvest is a requirement for data repositories, and the “web-way” is to do it through the uniform interface (HTTP), and connected resources.
We don’t have the resources to provide DVDs of content for everyone who wants the data. Or turning that around – we hope more people will want the data than we have the resources to provide for. This isn’t about the cost of a DVD, or the cost of postage; it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.
I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now): –

AD: However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and costs about US $20 in bandwidth for a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

OJD: Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulties in implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders to get this data from the publishers in order to make it better available!
PMR: The clear point is that this is more work than it looks from the outside. Part of the point is that we are trying to improve the process of spidering the scientific web. Nick spent a LOT of time writing spiders – one publisher changed their website design several times over the last two years. If publishers were as friendly to spiders as we are trying to be, the process would be much easier. So if the work with Atom is successful, forward-looking publishers might like to implement it.

ChemSpiderMan Says:
November 5th, 2007 at 4:33 pm
Thanks to Egon for asking his question. As I have been expressing, all I also want to do is “- link to CrystalEye entries”. We will work to support the Atom feeds…
“- link to CrystalEye entries (assuming the CML contains a CrystalEye identifier). [PMR: The Atom feed contains the link – see above]”
all that is left now is access to the backfile of entries.

PMR: Again to clarify this. The links to the entries are already published in the Atom feed. Each link points to an entry from which (a) the InChI and (b) the CompleteCML are available. The access is there. It’s purely a question of the software (which anyone can write, but Jim is addressing for those who want it customised).
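
To make that concrete, here is a minimal sketch of pulling the per-entry links out of an Atom feed with the standard StAX parser. The URL is an illustrative placeholder, and the code simply prints every link rel/href pair inside each entry; which link leads to the InChI or the CompleteCML is determined by the feed itself rather than assumed here.

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class FeedLinks {
        public static void main(String[] args) throws Exception {
            // Illustrative URL - substitute the real CrystalEye Atom feed address.
            InputStream in = new URL("http://example.org/crystaleye/feed.xml").openStream();
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);

            boolean inEntry = false;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("entry".equals(r.getLocalName())) {
                        inEntry = true;
                    } else if (inEntry && "link".equals(r.getLocalName())) {
                        // Each entry's links point at its resources (HTML page, CML, etc.).
                        String rel = r.getAttributeValue(null, "rel");
                        String href = r.getAttributeValue(null, "href");
                        System.out.println((rel == null ? "alternate" : rel) + " -> " + href);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "entry".equals(r.getLocalName())) {
                    inEntry = false;
                }
            }
            r.close();
            in.close();
        }
    }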


3 Responses to CrystalEye and repositories: Jim explains the why and how of Atom

  1. Andrew Dalke says:

    I can’t figure out how to email Jim, so leaving a message here.
    Consider also using Squid as a reverse proxy. If you follow the various HTTP requirements (entity tags, cache tags, etc.) then you’ll be able to off-load a lot of the semi-static pages to it. There’s also support for caches to update each other, but I have zero experience with it. Then again, I’ve little experience with Squid.
    You’re using Apache, so another thing to consider is lighttpd.
    You may also be interested in reading some of the case studies of developing large web servers. There’s the “YouTube Scalability” presentation (available from Google Video). For something closer in size to what you’re dealing with, see the description of the scalability problems for the online community network eins.de. If you haven’t heard of it, memcached is another useful tool to have up your sleeve.
    My strong belief is that anything in chemical informatics or crystallography deals with small data sets. There’s an old saying that “If you fit in L1 cache, you ain’t supercomputing” (attributed to Krste Asanovic). If the data set fits on an iPod nano, it ain’t a big data set. The biggest nano is 8GB while your data set is about 10GB, so while it’s not small, it’s not hefty. I see that as a relief. It means I don’t have to invent new tools just to solve my data management problem.
    BitTorrent had to add its own error detection code because when transferring hundreds of terabytes the TCP/IP error detection isn’t good enough. I personally don’t want to go there.

  2. pm286 says:

    (1) Thanks Andrew

  3. Jim Downing says:

    Andrew, thanks for the comments. For reference, you can mail me on ojd20 AT cam.ac.uk (I’ll stick that in my blog template), or you could leave comments on my blog, which I pick up more easily than ones left here.
    Regarding the scalability stuff, thanks for the references. The theme of the ones I’ve seen before is that the web way works!
    CrystalEye is a baked site, which was initially an ‘organic decision’ (i.e. the easiest way for Nick to do it), but since i) the size of the generated pages is smaller than the base data and ii) once built, the pages will change only if someone improves the site, it’s a good design and one I think we’ll stick with.
    It has also made it really very easy to add caching headers (already done by Apache), content encoding (easy in Apache) and backup (rsync), which isn’t always the case with a dynamic site.
    Lighttpd is definitely something we’ll look at if the threading issue becomes a problem. Jetty also has a non-blocking mode that would be interesting to look at if Apache starts to suffer.
    You hit the nail on the head: This isn’t a big dataset as they go, and although it grows steadily, neither the size nor the growth should be problematic, either for us or for people who want all / part of the data.
    Hopefully in the next couple of days I’ll get time to develop the harvester further to allow gradual build of the whole dataset (at the moment it requires an uninterrupted execution on the first go), and we’ve been discussing ways of enabling clients to easily discover resources relevant to their particular interest without having to download the CML.
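
One possible shape for that gradual build – purely a sketch of the checkpointing idea, not a description of how Jim’s harvester does or will do it – is to record the last entry (or archive page) successfully processed in a small state file, so an interrupted run can pick up where it left off.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    public class Checkpoint {
        private final File stateFile;

        public Checkpoint(File stateFile) {
            this.stateFile = stateFile;
        }

        // Returns the ID of the last entry fully processed, or null on a fresh start.
        public String lastProcessed() throws IOException {
            if (!stateFile.exists()) {
                return null;
            }
            BufferedReader r = new BufferedReader(new FileReader(stateFile));
            try {
                return r.readLine();
            } finally {
                r.close();
            }
        }

        // Called once each entry is safely on disk, so a crash loses at most one entry.
        public void record(String entryId) throws IOException {
            FileWriter w = new FileWriter(stateFile); // overwrite with the newest ID
            try {
                w.write(entryId);
            } finally {
                w.close();
            }
        }
    }

A harvester built around this would walk the archive exactly as before, skip entries up to and including lastProcessed(), and call record() after each successful download.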
