CrystalEye: using the harvester

Jim Downing has written a harvester for CrystalEye. I thought I would have a try and see if I could iterate through all the entries and extract the temperature of the experiment. This is where XML really starts to show its value over legacy formats. Jim’s iterator reads each entry and copies it to a file; I decided to read the entry as an XML document, search for the temperature using XPath (which is what XOM's query() method evaluates) and announce it. It’s simple enough that I thought I could do it while watching Liverpool (I used to live on Merseyside). Unfortunately (or fortunately) the torrent of goals distracted me, so it had to wait till today.
The temperature is described in the IUCr dictionary and held in CML as (example):
<scalar dictRef="iucr:_cell_measurement_temperature">293.0</scalar>
So this is trivially locatable with an XPath query (using local-name() and @dictRef):
// iterate through all entries
for (DataEntry de : doc.getDataEnclosures()) {
    if (downloaded >= maxHarvest) {
        return downloaded;
    }
    InputStream in = null;
    try {
        in = get(de.url);
        // standard XOM XML parsing; take the root element of the parsed document
        Element rootElement = new Builder().build(in).getRootElement();
        // standard XPath query; local-name() avoids having to bind the CML namespace
        Nodes nodes = rootElement.query(
            ".//*[local-name()='scalar'" +
            " and @dictRef='iucr:_cell_measurement_temperature']");
        // if there is a temperature, extract the value
        String temp = (nodes.size() == 0) ? "no temp given" : nodes.get(0).getValue();
        System.out.println("temperature for " + rootElement.getAttributeValue("id") + ": " + temp);
        downloaded++;
    } catch (Exception e) {
        // report the failure and carry on with the next entry
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(in);
    }
}
and here’s the output:
1625 [main] DEBUG uk.ac.cam.ch.wwmm.crystaleye.client.Harvester - Getting http://wwmm.ch.cam.ac.uk/crystaleye/summary/rsc/ob/2007/22/data/b712503h/b712503hsup1_pob0401m/b712503hsup1_pob0401m.complete.cml.xml
temperature for rsc_ob_2007_22_b712503hsup1_pob0401m: 115.0
2297 [main] DEBUG uk.ac.cam.ch.wwmm.crystaleye.client.Harvester - Getting http://wwmm.ch.cam.ac.uk/crystaleye/summary/rsc/ob/2007/22/data/b710487a/b710487asup1_ljf130/b710487asup1_ljf130.complete.cml.xml
temperature for rsc_ob_2007_22_b710487asup1_ljf130: 150.0

etc.
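For anyone who wants to try the query on a single document without running the harvester, here is a minimal sketch. It is an assumption-laden stand-in rather than Jim's code: it uses the XPath API that ships with the JDK instead of XOM, the class name TemperatureQuery is made up for illustration, and the sample entry is a hypothetical, heavily trimmed CML document, not a real CrystalEye file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the name is not part of the CrystalEye harvester.
final class TemperatureQuery {

    // Same expression as in the harvester loop; local-name() ignores the CML namespace.
    static final String TEMPERATURE_XPATH =
        ".//*[local-name()='scalar' and @dictRef='iucr:_cell_measurement_temperature']";

    /** Returns the text of the first matching scalar, or "no temp given" if there is none. */
    static String findTemperature(String cml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so that local-name() behaves as expected
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(TEMPERATURE_XPATH, doc.getDocumentElement(), XPathConstants.NODESET);
        return nodes.getLength() == 0 ? "no temp given" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample entry, trimmed to the one element of interest.
        String sample =
            "<cml xmlns='http://www.xml-cml.org/schema' id='example_entry'>"
          + "<scalar dictRef='iucr:_cell_measurement_temperature'>293.0</scalar>"
          + "</cml>";
        System.out.println("temperature: " + findTemperature(sample));
    }
}
```

The only part carried over from the harvester is the XPath expression itself; everything around it could equally be written with XOM as in the loop above.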
It will take the best part of a day to iterate through all the entries, but remember that CrystalEye is not a database. We are converting it to RDF (and anyone interested can also do this), at which point it can be searched in a trivial amount of time and with much more complex questions. (Remember that CrystalEye was not originally designed as a public resource.) Until then, anyone who wishes to use CrystalEye heavily would do best to download the entries and build their own index.
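As a rough idea of what "build their own index" could look like once the entries are on local disk, here is a sketch that walks a directory of downloaded files and builds an in-memory map from entry id to temperature. It is not the RDF conversion mentioned above, just a simple map. The class name LocalTemperatureIndex, the assumption that files end in .cml.xml, and the use of the JDK's XPath API in place of XOM are all mine.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class; not part of the CrystalEye harvester.
final class LocalTemperatureIndex {

    static final String TEMPERATURE_XPATH =
        ".//*[local-name()='scalar' and @dictRef='iucr:_cell_measurement_temperature']";

    /** Maps the root element's id to the temperature text for every *.cml.xml file under dir. */
    static Map<String, String> build(Path dir) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        XPathExpression query = XPathFactory.newInstance().newXPath().compile(TEMPERATURE_XPATH);

        // Collect the files first so the directory stream is closed before parsing starts.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk
                .filter(p -> Files.isRegularFile(p)
                          && p.getFileName().toString().endsWith(".cml.xml"))
                .sorted()
                .collect(Collectors.toList());
        }

        Map<String, String> index = new TreeMap<>();
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                Element root = builder.parse(in).getDocumentElement();
                NodeList nodes = (NodeList) query.evaluate(root, XPathConstants.NODESET);
                if (nodes.getLength() > 0) {
                    // Entries without a temperature are simply left out of the index.
                    index.put(root.getAttribute("id"), nodes.item(0).getTextContent());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: directory holding the downloaded entries (the layout is an assumption)
        build(Paths.get(args[0]))
            .forEach((id, temp) -> System.out.println(id + "\t" + temp));
    }
}
```

Once the files are local, a pass like this involves no network traffic, so it can be rerun for each new question until something better (such as the RDF version) is available.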
[Note: I will continue to try to format the code – WordPress makes it very difficult]


One Response to CrystalEye: using the harvester

  1. Jim Downing says:

    If one is going to be doing much work like this, it’s probably preferable to get a full download and iterate over the files. I’m going to improve the harvester to make it easier to get all the data.
