A long post on CrystalEye by Antony Williams. I comment on points that still need answering:
06:19 06/05/2008, in a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.
“Data are normally produced for a particular purpose and to reuse them for another costs money. I’ll exemplify this by taking CrystalEye data – about 120,000 crystal structures and 1 million molecular fragments – which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up – it’s a tribute to his work that CrystalEye runs without attention for months on end).
AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.
PMR: We do not monitor the number of searches. The maintenance is almost completely about journal TOC pages. The only reference to CIFs is usually on these HTML TOCs which are designed to look good for humans (i.e. with lots of pictures advertising the publisher) but are awful for machines. Every so often a journal changes its TOCs and the CrystalEye robot breaks. It would be nice if publishers thought about the machine age and the Semantic Web sometimes.
One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.
PMR: We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) do a good job of disambiguating by cell dimensions but this is not foolproof and indeed no method is. We hope to develop better methods over the summer. The main duplication comes from the Crystallography Open Database which has about 45,000 structures. It is released periodically. Quite a number of the structures have syntactic problems and we do our best to fix them. So we really don’t know how many unique structures there are. Note also that the TOC in CrystalEye does not point to COD as it doesn’t have the appropriate structure.
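For readers unfamiliar with disambiguation by cell dimensions, here is a minimal sketch of the general idea. It is not Nick's actual method, which is not described in this post. The function names, the tolerances and the sample cells are all assumptions made up for illustration. Two entries are treated as probable duplicates when their cell lengths and angles agree within a tolerance.

```python
# Illustrative sketch only: names, tolerances and data are assumptions,
# not the CrystalEye implementation.

def same_cell(cell_a, cell_b, length_tol=0.01, angle_tol=0.1):
    """Compare two cells given as (a, b, c, alpha, beta, gamma).

    Lengths are in angstroms, angles in degrees.
    """
    lengths_ok = all(abs(x - y) <= length_tol
                     for x, y in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(x - y) <= angle_tol
                    for x, y in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok


def group_by_cell(entries):
    """Group (entry_id, cell) pairs whose cells match the first member of a group."""
    groups = []
    for entry_id, cell in entries:
        for group in groups:
            if same_cell(group[0][1], cell):
                group.append((entry_id, cell))
                break
        else:
            # No existing group matched, so this entry starts a new one.
            groups.append([(entry_id, cell)])
    return groups


if __name__ == "__main__":
    sample = [
        ("entry-1", (5.431, 7.102, 12.345, 90.0, 101.25, 90.0)),
        ("entry-2", (5.432, 7.101, 12.347, 90.0, 101.27, 90.0)),
        ("entry-3", (8.910, 8.910, 15.220, 90.0, 90.0, 120.0)),
    ]
    for group in group_by_cell(sample):
        print([entry_id for entry_id, _ in group])
```

A check like this shows why the approach cannot be foolproof: the same crystal reported in a different axis setting would be missed, and two different compounds with nearly identical cells would be merged.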
The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.
AJW> The team has done a good job in putting the site together. The Jmol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.
PMR: Andrew Walkingshaw has converted a subset of the data (in all the entries) to RDF and he is demonstrating it at XTech (Dublin) this week. I blogged earlier about his mashup with Google Earth.
Antony and I have had several discussions about CrystalEye – basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).
AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.
This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches, such as using standards like XML and RDF, can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think ChemSpider has used this.
AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.
PMR: There IS no entire file of structures. It has never been created and won’t be. That’s not because we want to make life difficult. It would take a month and we haven’t got a month. We believe we have a better way which Jim created for you in November (CrystalEye and repositories: Jim explains the why and how of Atom) – we did it for you… and we’d appreciate feedback.
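To make the Atom approach concrete, here is a minimal sketch of how a client might walk a paged feed and collect an identifier and a URL for each entry. The feed content, ids and URLs below are invented for illustration, and the assumption that pages are chained with a rel="next" link is mine. The layout of the real CrystalEye feeds should be taken from Jim's explanation, not from this example.

```python
# Illustrative sketch only: the feed pages, ids and URLs are hypothetical.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Two made-up feed pages, keyed by their (made-up) URLs.
PAGES = {
    "http://example.org/feed/1": """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="next" href="http://example.org/feed/2"/>
  <entry>
    <id>urn:example:structure-1</id>
    <link rel="alternate" href="http://example.org/structure/1"/>
  </entry>
  <entry>
    <id>urn:example:structure-2</id>
    <link href="http://example.org/structure/2"/>
  </entry>
</feed>""",
    "http://example.org/feed/2": """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>urn:example:structure-3</id>
    <link rel="alternate" href="http://example.org/structure/3"/>
  </entry>
</feed>""",
}


def read_page(xml_text):
    """Return ([(entry_id, entry_url), ...], next_page_url_or_None)."""
    feed = ET.fromstring(xml_text)
    next_url = None
    # Only feed-level links are direct children of <feed>.
    for link in feed.findall(ATOM_NS + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
    records = []
    for entry in feed.findall(ATOM_NS + "entry"):
        entry_id = entry.findtext(ATOM_NS + "id")
        entry_url = None
        for link in entry.findall(ATOM_NS + "link"):
            # Atom treats a link with no rel attribute as rel="alternate".
            if link.get("rel", "alternate") == "alternate":
                entry_url = link.get("href")
                break
        records.append((entry_id, entry_url))
    return records, next_url


def harvest(fetch, start_url):
    """Follow rel="next" links from start_url; fetch(url) returns page XML."""
    records = []
    seen = set()
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)
        page_records, url = read_page(fetch(url))
        records.extend(page_records)
    return records


if __name__ == "__main__":
    for entry_id, entry_url in harvest(PAGES.__getitem__, "http://example.org/feed/1"):
        print(entry_id, entry_url)
```

Here `fetch` is just a dictionary lookup so the sketch runs without a network. A real client would pass a function that performs an HTTP GET and returns the response body.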
[…] it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to consider a mutually acceptable business proposition and it could probably be done by hiring a summer student.
AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChIStrings and URLs for the CrystalEye dataset. They already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of OpenData why not put CrystalEye there?
PMR: The only thing stopping us putting them in PubChem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon.
When it’s finished it will be in RDF/CML.
FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.
AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.
PMR: Chicken and egg… 🙂 You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of text-mining.
And we shall have one or two announcements soon…