Chemspider (Antony Williams) has asked on his blog for feedback on CrystalEye and I shall respond in some detail. I shall try to exclude any personal judgment and not make statements about the value of the process. In essence it will be the factual material that I would write were I asked to review it for a scientific journal, but I shall omit any judgment. I’m spending time on this review because it may be useful in making clear parts of our approach to chemical ontology.
Firstly some background. I have blogged about the genesis of CrystalEye, which was solely to support our research. It happened that soon after it appeared there were requests for the data. CrystalEye was NOT created with the idea that it was a redistributable. It had no unique ID system and was also heavily hyperlinked, with some of the links hardcoded.
The data is Openly available and can, in principle, be completely downloaded with wget or other spidering tools. There is no technical restriction, though we would ask that people contact us beforehand or use a sensitive robot. However, the site is updated daily and so it is extremely difficult to take a snapshot which has referential integrity – by the time that someone has finished spidering, the data has changed considerably.
The data are complex. They consist of ca 100,000 entries, several million entries of derived data (fragments, etc.) and statistics (bondlengths). Much of this is difficult to redistribute in principle.
It is critical to realise that one of the primary parts of the research was the preservation of semantics. Thus all CIF data has been transformed into CML, and we believe this is almost lossless. Therefore the CML pages (which are Openly downloadable) are the primary semantic resource.
At this stage we (mainly Jim and Nick) provided limited redistributability through the following mechanisms:
- RSS feeds on new entries. This is easy to implement and, for example, I get a daily feed of some tens of new entries. That’s available to anyone with a feedreader. So the simplest way to get entries is to subscribe to the RSS feed.
- A tool – which Jim wrote specially – for downloading the “backlog”. This tool is publicly available and should be capable of downloading the complete set of entries (but not the derived data)
This is the only mechanism we can produce for avoiding semantic loss and corruption.
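As a rough illustration of the feed route described above, the sketch below pulls the entry links out of an RSS 2.0 document using only the Python standard library. The sample feed, its URLs and the function name `entry_links` are hypothetical; the actual layout of the CrystalEye feeds is not described in this post and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed. The element names follow plain RSS 2.0;
# the real CrystalEye feed may use a different layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example structure feed</title>
    <item>
      <title>Entry A</title>
      <link>http://example.org/entries/a.cml</link>
    </item>
    <item>
      <title>Entry B</title>
      <link>http://example.org/entries/b.cml</link>
    </item>
  </channel>
</rss>"""


def entry_links(feed_xml):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:  # skip items that carry no link at all
            pairs.append((title, link))
    return pairs


for title, link in entry_links(SAMPLE_FEED):
    print(title, link)
```

A feedreader does essentially this on a schedule; following each link to the CML page is what keeps the semantics intact.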
We were, however, asked if we could provide the data in SDF (MDL molfile format). This necessarily involves massive semantic loss as the format cannot hold crystallography and CrystalEye is a crystallographic site. Moreover many crystal structures do not fit into the molfile concept at all – diamond and graphite are clear examples. Many – perhaps most – crystals consist of two or more components (“moieties” in CIF). The identity of these cannot be preserved in molfiles.
We were therefore reluctant to convert the data to molfiles and/or InChI for these and many other reasons. Put simply, they are likely to lead to massive loss and corruption.
We excluded many CrystalEye entries from what we sent to Chemspider.
- Only molecular crystals with one molecule in the crystallochemical unit.
- No purely inorganic structures and I believe no metal-organic ones
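The two exclusion rules above could be expressed as a simple predicate. This is only a sketch under stated assumptions: the `Entry` record, its field names, the short `METALS` list and the "contains carbon" test are all hypothetical stand-ins, not the code that was actually used to build the subset.

```python
from dataclasses import dataclass

# Hypothetical record; these field names are illustrative,
# not CrystalEye's actual schema.
@dataclass
class Entry:
    entry_id: str
    moiety_count: int    # number of distinct molecules (CIF "moieties")
    elements: frozenset  # element symbols present in the structure


# Deliberately small, illustrative metal list; a real filter needs a full one.
METALS = frozenset({"Li", "Na", "K", "Mg", "Ca", "Fe", "Co", "Ni", "Cu", "Zn", "Pd", "Pt"})


def is_exportable(entry):
    """Apply the two exclusion rules from the list above."""
    single_molecule = entry.moiety_count == 1
    # Crude proxy for "not purely inorganic"; it cannot recognise
    # extended structures such as diamond, which are not molecular at all.
    has_carbon = "C" in entry.elements
    has_metal = bool(entry.elements & METALS)
    return single_molecule and has_carbon and not has_metal


entries = [
    Entry("organic-1", 1, frozenset({"C", "H", "O"})),
    Entry("solvate-1", 2, frozenset({"C", "H", "O"})),
    Entry("salt-1", 1, frozenset({"Na", "Cl"})),
    Entry("complex-1", 1, frozenset({"C", "H", "N", "Cu"})),
]
kept = [e.entry_id for e in entries if is_exportable(e)]
print(kept)
```

Even in this toy form the filter discards most kinds of entry, which is the point: what was sent is a narrow subset.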
Moreover what was provided was essential link information, not data. Essentially these were connectionTable-URL pairs.
Since CrystalEye was not developed for export we have not checked the validity of the connection tables or their suitability for linking. This is non-trivial (and in many cases not possible with the data we have – it would require extraction of data from full-text). We therefore make no guarantee about the suitability of the InChI or the connection table for any purpose. As examples we add information on bond order, charges and stereochemistry. This requires heuristics which we know to fail in certain cases. We do not have any metrics on this, though you will see later that it happens.
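To make the shape of the connectionTable-URL pairs concrete, here is a minimal sketch of how a receiving site might index them. The InChI strings are used purely as opaque keys standing in for a connection table, the URLs are placeholders, and `build_link_index` is a hypothetical helper, not part of CrystalEye or ChemSpider.

```python
from collections import defaultdict

# Placeholder pairs: an identifier derived from the connection table,
# and the URL of the entry it came from.
PAIRS = [
    ("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", "http://example.org/entries/1"),
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "http://example.org/entries/2"),
    ("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", "http://example.org/entries/3"),
]


def build_link_index(pairs):
    """Group entry URLs by identifier; one identifier may map to several entries."""
    index = defaultdict(list)
    for identifier, url in pairs:
        index[identifier].append(url)
    return dict(index)


index = build_link_index(PAIRS)
for identifier, urls in index.items():
    print(identifier, len(urls))
```

Because the identifiers come from unvalidated heuristics, such an index can only be as reliable as the connection tables behind it; a wrong bond order or charge would silently attach an entry to the wrong key.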
ChemSpiderMan Says: June 22nd, 2008 at 12:43 am
Peter: Regarding “Jim Downing has put in considerable work to create a subset of CrystalEye for Chemspider who now wish use to review their site:” and “I have stressed several times that Nick is writing his thesis and has no time to review commercial sites – however I will do so sometime in the next few posts”.
Thanks to Jim for the considerable work, and I have already acknowledged it in an email to all of you.
It would be good to receive your feedback on the deposition of CrystalEye onto ChemSpider but you are under no obligation to do so. But since ChemSpider is the first site other than CrystalEye itself (I believe this to be true) to host your data, I thought you might want to look at it. If not, that’s fine too.
Your modus operandi in regards to ChemSpider is to not provide direct feedback to us regarding any issues but rather to blog about any issues. My earlier email regarding providing feedback was a request to provide feedback directly to us on any obvious issues so we could resolve them. I understand your preference is to simply blog about them so I will monitor your site for the comments instead. I look forward to your comments.
PMR: You have made this point several times. I use my blog because (a) I know how it works (b) can add images to it (c) post long posts which IMO are not suitable for comments on other blogs (d) have a readership to which I address wider questions (e) know that you monitor this blog. In general I regard blog comments as useful for shortish comments which do not normally need large replies – this may be unusual but it’s how I treat this blog.
CS: To Nick- good luck with the thesis and apologies if the request to check out CrystalEye on ChemSpider was a distraction. I believe one of the outcomes of CrystalEye is Open Data that could proliferate to other sites ESPECIALLY now that Jim has done the hard work to create the dataset for download. I would hope that having your work more highly exposed would be good for you and you would have more bragging rights in your thesis in regards to your contribution to the domain of crystallography and Open Data since over 5000 people per day frequenting ChemSpider could end up over on CrystalEye as a result of the connection. I think this is good for the project personally. I look forward to seeing the data shown up on PubChem, eMolecules and other sites shortly.
Either way, of “referred sites” driving traffic to ChemSpider what is interesting to observe is that http://wwmm.ch.cam.ac.uk is now in the top 10 of referring sites so the benefit is mutual. Thanks!
PMR: Although it’s nice to have people visit CrystalEye – and this is not meant unkindly – the primary thing we have to focus on is whether its creation can be justified as a scientific resource which is worthy of being included in a thesis and whether the data is fit for purpose. We do not, in fact, have any idea of how many hits we get – this carries no weight with those who assess our science. Only publication and grants matter. Most of everything else is a distraction. If we end up with a new architecture for depositing data that could be of considerable interest to funders.
In the next post I shall give an objective analysis of the CrystalEye links in Chemspider.