CrystalEye and repositories: distribution and integrity (cont)

Continuing our discussion of how to disseminate CrystalEye without too much effort or breaking too much. In reading this, please remember that CrystalEye – like many bioscience databases – was created as a research project. It had the following aims:

  • to see if high-quality scientific data could be extracted robotically from the current literature
  • to test the usefulness and quality of QM methods in high-throughput computation of solid-state properties (e.g. by MOPAC)
  • to explore the construction of lightweight “repositories”. This was not a requirement at the start but emerged during the latter half of the project.

Like most of our exploratory projects, we use the filesystem to manage the data. The filesystem is actually extremely good: it’s roughly 50 years old, it’s understood by everyone, and it’s easy to write programs that iterate over it. We never envisaged that we would be asked to share the work, so we are now having to address new concerns.
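To make “iterate over it” concrete, here is a minimal sketch of the kind of walk we mean; the root path and layout are illustrative, not the real repository’s:

```python
import os

# Count files by extension in a CrystalEye-like tree.
# The root path "/data/crystaleye" is a hypothetical example.
counts = {}
for root, dirs, files in os.walk("/data/crystaleye"):
    for name in files:
        ext = os.path.splitext(name)[1].lower()
        counts[ext] = counts.get(ext, 0) + 1

for ext, n in sorted(counts.items()):
    print(ext or "(no extension)", n)
```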
Relational and similar databases address the problems of integrity and copying – even so it’s not trivial, and the details may be vendor-dependent. So we don’t expect this to be trivial for us either. On the other hand we are not running a nuclear reactor, so a few broken links are not the end of the world. But we’d like to limit them as much as possible.

Andrew Dalke Says: I think it’s a poor idea to limit the idea of “backup” to mean restoring the “integrity of the filesystem.” I meant being able to create a backup of your repository so you have a valid, recoverable and usable snapshot. A filesystem backup is probably not sufficient without some thought and planning because it does not have the right transactional requirements. E.g., if my app saves to two files, and the filesystem backup occurs after the first file is written but before the second, then there’s going to be a problem. [PMR: AGREED]
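One common mitigation on a plain filesystem – not a full transaction, just a much smaller window – is to stage every file for an entry and then publish them with a single atomic rename. A minimal sketch, with hypothetical paths and layout:

```python
import os
import tempfile

def publish_entry(entry_id, files):
    """Write all files for one entry into a staging directory, then
    expose them with a single atomic rename, so a filesystem backup
    never sees a half-written entry. Paths here are hypothetical."""
    staging = tempfile.mkdtemp(dir="/data/crystaleye/.staging")
    for name, content in files.items():
        with open(os.path.join(staging, name), "wb") as f:
            f.write(content)
    # os.rename is atomic on POSIX when source and target live on the
    # same filesystem: the entry appears all at once or not at all.
    os.rename(staging, os.path.join("/data/crystaleye/entries", entry_id))
```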

It does sound like if you get a crash while data is being uploaded, and you restore from that backup, then there will be corruption in CrystalEye. That is, “some of the hyperlinks may point to files that were uploaded after the backup point.” [PMR: AGREED]

If data is withdrawn from CrystalEye (eg, a flaw was found in the original data, or there were legal complications causing it to be removed), will that lead to similar problems? [PMR: PROBABLY]
[…]
If you provided versioned records, so that users can retrieve historical forms of a record, then the solution is easy. Ask the spiders to first download a list of URLs which are valid for a given moment in the repository. They can then use those URLs to fetch the desired records, knowing that they will all be internally consistent. Doing this requires some architectural changes which might not be easy, so not necessarily useful to what you have now. I suspect that’s what you mean by “lightweight repository technology”. I’ve been thinking it would be interesting to try something like git, Mercurial or BZR as the source of that technology, but I wouldn’t want to commit to it without a lot of testing.        [PMR: Accurate analysis and thanks for the list]
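To make the scheme concrete, a spider following it might look like this sketch; the manifest URL and its one-URL-per-line format are invented for illustration:

```python
import urllib.request

# Hypothetical manifest service: returns one versioned URL per line,
# all consistent with a single moment in the repository's history.
MANIFEST = "http://example.org/crystaleye/manifest?at=2007-11-01"

with urllib.request.urlopen(MANIFEST) as resp:
    urls = resp.read().decode("utf-8").splitlines()

for url in urls:
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
        # ... store `data` locally; the whole crawl is internally
        # consistent even if the live repository changes meanwhile ...
```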

BTW, what’s a “linkbase”? In the XLink spec it’s “[d]ocuments containing collections of inbound and third-party links.” There are only two Google hits for “standoff linkbase[s]”, and both are from you (one is this posting) without definition, so I cannot respond effectively. I don’t see how issues of internal data integrity had anything to do with needing a linkbase. If all of the data were in some single ACID-compliant DBMS then it would be a well-solved problem.

PMR: In the early days of “XML-Link” there was a lot of experience in hypermedia – Microcosm, Hyper-G, etc. – which relied on bounded object sets (the system had to know what documents it owned, and all contributors had to register with the system).

The solution I sketched above does solve this problem and it uses a list of versioned URLs so it might be a “linkbase”. But I can think of other possible solutions, like passing some sort of version token in the HTTP request, so as to retrieve the correct version of a record, or a service which takes the record id and that token and returns the specific document. That would be less ReSTy, but still a solution.
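That token-passing alternative might look something like the following; the URL, record id and header name are all invented for illustration:

```python
import urllib.request

# Hypothetical: a record id plus a version token select one historical
# document, instead of encoding the version in the URL itself.
req = urllib.request.Request(
    "http://example.org/crystaleye/record/some-entry-id",
    headers={"X-CrystalEye-Version": "snapshot-2007-11-01"},
)
with urllib.request.urlopen(req) as resp:
    record = resp.read()
```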

PMR: Generally agreed. However it’s still at the stage of being a research project in distributing repositories and it’s not something we have planned to do or are resourced to do. But there may be some simple approaches.

As for your example of spidering a histogram, that depends on how the spider is implemented. If the 3rd party site which maintains some view of CrystalEye receives a request like that for something it doesn’t know about, it might at that moment query the primary CrystalEye repository and see if it actually is present. In that case the other site acts like a cache. It might do the same with normal pages, and use standard HTTP cache rules to check and freshen its copy so the user is less likely to be affected by version skew.
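“Standard HTTP cache rules” here means revalidating with a conditional GET; a minimal sketch using an ETag (the URL is hypothetical):

```python
import urllib.error
import urllib.request

url = "http://example.org/crystaleye/entry/some-entry-id"  # hypothetical
cached_etag = '"abc123"'  # validator remembered from the last fetch

req = urllib.request.Request(url, headers={"If-None-Match": cached_etag})
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()                 # 200: refresh the cached copy
        cached_etag = resp.headers.get("ETag")
except urllib.error.HTTPError as e:
    if e.code != 304:                      # 304: cached copy is still fresh
        raise
```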

PMR: The 3rd party is out of our control. That’s why we are building Atom feeds, which solve some of the problems. It means implementing a client-side tool to manage the caching, and Jim will address this.
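As a very rough sketch of what such a client-side tool might do (not what Jim will actually build), the third-party feedparser library already handles conditional fetches of Atom feeds; the feed URL is hypothetical:

```python
import feedparser  # third-party: pip install feedparser

FEED = "http://example.org/crystaleye/feed.atom"  # hypothetical URL

# First poll: remember the validators the server hands back.
d = feedparser.parse(FEED)
etag, modified = d.get("etag"), d.get("modified")

# Later polls: send them back, so an unchanged feed costs almost nothing.
d = feedparser.parse(FEED, etag=etag, modified=modified)
if getattr(d, "status", None) == 304:
    print("feed unchanged")
else:
    for entry in d.entries:
        print(entry.get("title"), entry.get("link"))
```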

There’s a question of tradeoff here. How much “corruption”, using your term, is the user willing to accept in order to get more or different functionality? It seems your view is you think there should be zero, no, null chance of corruption, which I think is laudable for a data provider. But evidence in existing use patterns suggests that people don’t mind the occasional error. I use Google despite knowing that it’s not necessarily up to date and can link to removed pages, or pages hidden behind a registration wall. [PMR: see intro] […]

However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

PMR: I don’t have any problems in general. The PDB and Swiss-Prot (with which I’m most familiar) are collections of flat files, so they are easy to zip and download. CrystalEye has a more complex structure – not too bad, but still complex. It has (at least):

  • table of contents
  • entry pages
  • CIFs
  • raw CMLs
  • complete CMLs
  • moieties
  • fragments
  • images for many of these
  • feeds
  • histograms
  • indexes

It would be easy to zip the entry pages, but these would have no images and the links would all be broken. We could zip all the CIFs (except those from publishers who have copyrighted them), but then people would complain that they couldn’t read CIFs. So we could zip all the CMLs – and that’s probably the best start. But it means no indexes, no tables of contents, no 2D images, no histograms, no fragments and no moieties. It will be a very poor shadow of CrystalEye.
And if people are happy with that we’ll think about how to provide versions. No promises.
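If we do go down the zip-the-CMLs route, the bulk step itself is simple; a minimal sketch, where the root path and the file suffix are guesses at a layout rather than the real CrystalEye conventions:

```python
import os
import zipfile

# Walk the tree and collect the complete CML files into one archive.
# ROOT and the ".complete.cml.xml" suffix are hypothetical examples.
ROOT = "/data/crystaleye/entries"
with zipfile.ZipFile("crystaleye-cml.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk(ROOT):
        for name in files:
            if name.endswith(".complete.cml.xml"):
                path = os.path.join(root, name)
                zf.write(path, arcname=os.path.relpath(path, ROOT))
```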


One Response to CrystalEye and repositories: distribution and integrity (cont)

  1. Depending on the differences between the RAW and COMPLETE CMLs, or maybe the CIF files, I would be interested in one or the other. I am not interested in HTML pages (TOC, indices), images, feeds, histograms, etc., as those are things my copy would generate itself.
    The data corpus of CrystalEye is what I would like to download. These CML files are a poor shadow of CrystalEye only in terms of website functionality. My interest would not be in setting up CrystalEye2, but in having access to the data to:
    – link to CrystalEye entries (assuming the CMLs contain a CrystalEye identifier)
    – derive properties myself
    – test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet?)
    – detect uncommon geometries
    – …
    (Not that I have concrete plans, because metabolomics is taking up most of my time now… though I am anxious to start playing with the AtomCML feeds 🙂)
