ATOMic crystals

How do we disseminate our CrystalEye data? If we use one large file, even zipped, it will run into gigabytes, and it can’t easily be updated. Jim Downing has started to set up AtomPP feeds for disseminating it. Geoff Hutchison asks:

  1. Geoff Hutchison says:
    October 29th, 2007 at 10:03 pm
    Peter, as I mentioned to you earlier, I think many of us are looking for the open data in CrystalEye, particularly fragments. Surely there’s an easier and more efficient way to get the data than AtomPP feeds. Will you have periodic dumps, so that, say, I get this quarter’s crystal structures and can then use the AtomPP feeds to pull just the new entries?

I think this depends on where someone starts. If they are a regular user of CrystalEye then AtomPP would seem to be the best approach: it means you don’t have to remember when to download or how large the chunks are. Is it also a simple way to get the historical material when starting out? Jim, perhaps you can help here.


7 Responses to ATOMic crystals

  1. Geoff, I am with you that there needs to be an easier way to access the CrystalEye data. Specifically, all I want to do is deposit ONLY the structures and structure IDs into ChemSpider and link back to CrystalEye. I don’t want to take on hosting the crystal data: Nick/Peter/others(?) have done such a good job with CrystalEye that I don’t want to even try to replicate it. Peter preferred not to provide an SDF file to us and asked us to scrape it.
    We are getting ready to scrape it, but I have concerns about how Open the data is. By that I mean: have the publishers supported/granted the Open declaration of their data? I have detailed this here…
    Interestingly, days after sending my email to the JACS editorial staff I have received no response. So... I still don’t know.

  2. Rajarshi says:

    Wouldn’t it be possible to provide a current dump (say, up until this month) and after that provide monthly incremental dumps (i.e., just the new structures added that month)? This is much like what PubChem does.

  3. Peter, if the data runs into the gigabytes in its current form, then you’re looking at at least 3 people here who want to go pull your historical data. That’s some number of GB over your AtomPP feeds. I’m OK with that, but you might want to check with some network folks before we end up inadvertently giving you a denial of service as we pull from your feed!
    I’d personally be OK if you send me a DVD through the mail. It’s probably faster. I’d even be happy to pay for the cost of the media.

  4. Geoff…well there’s a simple idea! I’ll cover media cost and shipping too…I’d rather do that than scrape the data.

  5. Andrew Dalke says:

    If you’re worried about bandwidth use, consider a service like Amazon’s S3. Using them as an example, storage is US$0.15/GB/month + $0.10/GB transfer in + $0.18/GB transfer out (or less). Assuming it fits on a DVD (

  6. Andrew Dalke says:

    (Because of the lack of preview, I didn’t see that I used an unescaped < in my previous comment. Here’s take two.)
    If you’re worried about bandwidth use, consider a service like Amazon’s S3. Using them as an example, storage is US$0.15/GB/month + $0.10/GB transfer in + $0.18/GB transfer out (or less). Assuming it fits on a DVD (<10GB) and 100 people want the entire data set, that’s about $1.50/month + $180 in bandwidth fees.
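As a rough check of the arithmetic in this comment, the sketch below recomputes the figures from the quoted 2007 rates. The function name `s3_estimate` and its parameters are made up for illustration; the 10 GB size and the 100 downloads are the comment's own assumptions, not measured values. It also reports the one-off upload charge implied by the quoted transfer-in rate, which the comment leaves out.

```python
def s3_estimate(size_gb, downloads,
                storage_rate=0.15,       # US$/GB/month, as quoted in the comment
                transfer_in_rate=0.10,   # US$/GB uploaded, as quoted in the comment
                transfer_out_rate=0.18): # US$/GB downloaded, as quoted in the comment
    """Return (monthly storage, one-off upload, total download) costs in US$."""
    monthly_storage = size_gb * storage_rate
    upload_once = size_gb * transfer_in_rate
    download_total = size_gb * downloads * transfer_out_rate
    return monthly_storage, upload_once, download_total


# Hypothetical scenario from the comment: a 10 GB data set fetched in full by 100 people.
storage, upload, download = s3_estimate(size_gb=10, downloads=100)
print(f"storage per month: ${storage:.2f}")
print(f"one-off upload:    ${upload:.2f}")
print(f"downloads:         ${download:.2f}")
```

With these inputs the storage and download figures agree with the $1.50/month and $180 quoted in the comment; uploading the data once would add about $1 at the quoted rate.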

  7. Jim Downing says:

    Hi all,
    sorry I’m a little late to this discussion. Some quick points:
    * Bandwidth may be an issue, but probably won’t be: search engines soak up tens of GB of transfer a month from WWMM. An extra 10 or 15GB as a one-off to allow collaborators to do real science is good value by comparison! (Thanks for the tip about S3 though, Andrew; it’s definitely worth considering.)
    * Peter has mislabelled this as APP; it’s just Atom syndication with a standard archiving extension.
    * I’ll be publishing instructions on using the Atom feed for harvesting, along with a simple Java harvester, on my blog, hopefully tomorrow.
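Jim’s harvester is not reproduced here. As a stand-in, the sketch below shows one way a client might walk an archived Atom feed, assuming the "standard archiving extension" is the feed paging and archiving scheme of RFC 5005, in which each document points to the previous archive document through a link with rel="prev-archive". The function names, the `fetch` callback, the `known_ids` parameter, and the assumption that entries are listed newest first are all illustrative and not taken from CrystalEye.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def fetch_http(url):
    """Download one feed document and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(start_url, fetch=fetch_http, known_ids=frozenset()):
    """Yield entry ids, walking back through rel="prev-archive" links.

    Assumes entries are listed newest first, so the walk stops as soon as
    it meets an id in known_ids (for example, ids loaded from an earlier dump).
    """
    visited = set()  # guards against archive links that loop
    url = start_url
    while url and url not in visited:
        visited.add(url)
        root = ET.fromstring(fetch(url))
        for entry in root.findall(ATOM + "entry"):
            entry_id = entry.findtext(ATOM + "id")
            if entry_id in known_ids:
                return  # caught up with what we already hold
            yield entry_id
        # Find the link to the next-older archive document, if any.
        href = next(
            (link.get("href") for link in root.findall(ATOM + "link")
             if link.get("rel") == "prev-archive"),
            None,
        )
        url = urljoin(url, href) if href else None
```

Under these assumptions, a newcomer would call `harvest` with an empty `known_ids` to collect the full history, while someone who already holds a periodic dump would pass the ids from that dump and receive only the newer entries, which is the combination Geoff asks about.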
