petermr's blog

A Scientist and the Web

 

Archive for the ‘crystaleye’ Category

Automatic assignment of charges by JUMBO

Thursday, January 31st, 2008

Egon has spotted a bug in our code for assignment of charges to atoms:

Why chemistry-rich RSS feeds matter… data minging,

The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services.

Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-κ2N,O)copper(II)).

Done? Checked it? You saw the problem, right? Good.

The charges in the structure are indeed wrong. There are two challenges…

  • for structures with more than one moiety (isolated fragment) it is formally impossible to know the charges if the author doesn’t give them. The authors can give them in _chemical_formula_moiety, but these are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges.)
  • In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

Summary page for crystal structure from DataBlock I in CIF xu2383sup1 from article xu2383 in issue 2008/01-00 of Acta Crystallographica, Section E.


 

The molecule itself is neutral. The easiest way is not to put any charges on at all; anything else is uncomfortable. We can have + charges on the N’s, which is natural, but then there are 2 – charges on the Cu. That’s formally correct, but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds, but that’s not happy either. And this is simple compared with many metal structures.

What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges and that is what has happened. We’ll try to fix it.
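To make the failure mode concrete, here is a toy, self-contained Java sketch of the bookkeeping. This is not the JUMBO code; the five-atom “moiety” and the charge decisions are invented purely to show where the total goes wrong when the reattachment step forgets to compensate.

import java.util.ArrayList;
import java.util.List;

// Toy illustration (not JUMBO): detach the metal, let the light-atom machinery
// assign formal charges, then reattach the metal. If the reattachment step
// forgets to rebalance the charges, the moiety ends up with the wrong total.
public class ChargeBookkeepingSketch {
    static class Atom {
        final String element;
        int formalCharge;
        Atom(String element) { this.element = element; }
    }

    public static void main(String[] args) {
        // grossly simplified "neutral" Cu/2N/2O fragment
        List<Atom> moiety = new ArrayList<>(List.of(
                new Atom("Cu"), new Atom("N"), new Atom("N"),
                new Atom("O"), new Atom("O")));

        // 1. dissociate the metal
        List<Atom> metals = new ArrayList<>();
        for (Atom a : new ArrayList<>(moiety)) {
            if (a.element.equals("Cu")) { metals.add(a); moiety.remove(a); }
        }

        // 2. aromaticity/charge perception on the light atoms:
        //    suppose it decides the two coordinating nitrogens are N+
        for (Atom a : moiety) {
            if (a.element.equals("N")) a.formalCharge = +1;
        }

        // 3. add the metal back. The bug: nothing compensates (e.g. Cu 2-),
        //    so the supposedly neutral moiety now sums to +2.
        moiety.addAll(metals);

        int total = moiety.stream().mapToInt(a -> a.formalCharge).sum();
        System.out.println("total formal charge = " + total); // prints 2, not 0
    }
}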

 

But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…

 

Chemistry Repositories

Wednesday, January 30th, 2008

Richard Van Noorden – writing in the RSC’s Chemistry World – has described the eChemistry repository project, Microsoft ventures into open access chemistry. This is very topical as Jim Downing, Jeremy Frey, Simon Coles and I are off to join the US members of the project at the weekend. It’s exciting, challenging, but eminently feasible. So what are the new ideas?

The main theme is repositories. Rather a fuzzy term and therefore valuable as a welcoming and comforting idea. Some of the things that repositories should encourage are:

  • ease of putting things in. It doesn’t require a priesthood (as so many relational databases do). You should be able to put in a wide range of things – theses, molecules, spectra, blogs, etc. You shouldn’t have to worry about datatypes, VARCHARs, third normal form, etc.
  • it should also be easy to get things out. That means a simple, understandable structure to the repository, and being able to find the vocabulary used to describe the objects.
  • flexibility. Web 2.0 teaches us that people will do things in different ways. Should a spectrum contain a molecule or should a molecule contain a spectrum? Some say one, some the other. So we have to support both. Sometimes required information is not available, so it must be omitted and that shouldn’t break the system.
  • interoperability. If there are several repositories built by independent groups it should be possible for one lot to find out what the others have done without mailing them. And the machines should be able to work this out. That’s hard but not impossible.
  • avoid preplanning. RDBs suffer from having to have a schema before you put data in. Repositories can describe a basic minimum and then we can work out later how to ingest or extract.
  • power is more important than performance (at least for me). I’d rather take many minutes to find something difficult than not be able to do it at all. When I started on relational databases for molecules it took all night to do a simple join. So everything is relative…

The core of the project is ORE – Object Re-use and Exchange (ORE Specification and User Guide). A lot of work has gone into this and it has been implemented at alpha, so we know it works. ORE is quite a meaty spec, but Jim understands it. Basically the repositories can be described in RDF, and some subgraphs (or additional ones) are “named graphs” (e.g. Named Graphs / Semantic Web Interest Group) which are used to describe the subsets of data that you may be interested in. There are quite strong constraints on naming conventions and you need to be well up with basic RDF. But then we can expect the power of the triple stores to start retrieving information in a flexible way. (As an example Andrew Walkingshaw has extracted 10 million triples from CrystalEye and shown that these can be rapidly searched for bibliographic and other info.) Adding chemistry will be more challenging and I’m not sure how this integrates with RDF – but this is a research project. Maybe we’ll precompute a number of indexes. And, in principle, RDF can be used to search substructures, but I suspect it will be a little slow to start with.

But maybe not… in which case we shall have made a very useful transition.
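As a flavour of the kind of question a triple store makes easy, here is a minimal sketch using the Apache Jena API. The file name crystaleye.nt and the Dublin Core predicate are placeholders rather than the real CrystalEye vocabulary, so treat it as an illustration of the approach, not documentation.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Sketch: load a dump of triples and ask a bibliographic question with SPARQL.
// "crystaleye.nt" and the dc:source predicate are assumptions for illustration.
public class TripleStoreSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:crystaleye.nt", "N-TRIPLES"); // hypothetical dump of entry metadata

        String sparql =
            "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
            "SELECT ?entry ?journal WHERE { ?entry dc:source ?journal } LIMIT 10";

        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("entry") + " appeared in " + row.get("journal"));
            }
        }
    }
}

The point is that once the triples exist, a question like “all entries from a given journal” becomes one query rather than a bespoke index.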

Microsoft eChemistry Project and molecular repositories

Thursday, December 13th, 2007

Some of you may have picked up from – e.g. the Open Grid Forum – that Microsoft (Tony Hey, Lee Dirks, Savas Parastatidis) have been collaborating with Carl Lagoze (Cornell) and Herbert van de Sompel (LANL) on bringing together Chemistry and OAI-ORE – the next generation of interoperable repository software. We are delighted that Microsoft has now agreed to fund this project and when Carl, Lee, Simon Coles (Soton) and I had lunch yesterday Lee said I could publicly blog this. (There are contractual details to be settled on various sites).

In brief – Tony Hey was the architect of the UK eScience program and then moved to Microsoft Redmond where he has been developing approaches to Open Science (not sure if this is the correct term but it gives the idea) – for example it includes Open Access and permits/encourages Open Source in the project. Carl and Herbert developed the OAI-PMH protocol for repositories which allows exposure of metadata for harvesters. They have now developed ORE – Object Re-use and Exchange – which sees the future as composed of a large number of interoperating repositories rather than monolithic databases (I am on the advisory board of ORE).

There are 7-8 partners in the program – MS, PubChem, Cornell, LANL, Lee Giles (PSU), Soton, Indiana and Cambridge. This is a really exciting development as we shall be able to create a number of well-populated molecular repositories with heterogeneous content (everything from crystallography to Wikipedia chemicals, for example). One that we are currently developing is an RDF/CML-based repository of common chemicals – perhaps 5000 – which could serve as an amanuensis for the bench chemist or undergraduate needing reference material. CrystalEye will be in there as well and we shall also be “scraping” (ugly word) any material we can legally access. In this way we can hope to see the concept of the World Wide Molecular Matrix start to emerge. Chemistry eTheses can also be reposited – we are starting to hear of universities that have mandated open theses.

Chemical substructure searching across repositories will be an exciting challenge but we have a number of ideas.

We shall have openings here so if you are interested let us know.

More later, but to reiterate our thanks to Tony and colleagues.

SPECTRa tools released

Friday, November 30th, 2007

The SPECTRa tools allow chemists (perhaps group or departmental analytical spectroscopy groups) to submit their data (spectra, crystal structures, compchem) to a repository.

From Jim Downing SPECTRa released

Now that a number of niggling bugs have been ironed out, we’ve released a stable version of the SPECTRa tools.

There are prebuilt binaries for spectra-filetool (command line tool helpful for performing batch validation, metadata extraction and conversion of JCAMP-DX, MDL Mol and CIF files), and spectra-sub (flexible web application for depositing chemistry data in repositories). The source code is available from the spectra-chem package, or from Subversion. All of these are available from the spectra-chem SourceForge site.

Mavenites can obtain the libraries (and source code) from the SPECTRa maven repo at http://spectra-chem.sourceforge.net/maven2/. The groupId is uk.ac.cam.spectra – browse around for artifact ids and versions.

PMR: This is an important tool in the chain and congratulations to Jim for designing and building it. It interfaces with a repository (as they say on kids’ toys, “repository not included”) so that you can customise your own business process. We hope to see departments appreciating the need for repositing their data (it gets lost, it could be re-used, etc.).

The legacy formats (CIF, JCAMP, Gaussian, etc.) are well structured and SPECTRa allows them to be used in a way which maximises the effort that went into creating them. The process is almost automatic for crystallography (a good CIF has all the metadata inside it) but requires a small amount of manual effort for spectra (the molecule is not normally embedded in the JCAMP so has to be provided separately).

The system is potentially searchable by chemistry – it might look something like CrystalEye with a search provided by OpenBabel.

CrystalEye: using the harvester

Wednesday, November 7th, 2007

Jim Downing has written a harvester for CrystalEye. I thought I would have a try and see if I could iterate through all the entries and extract the temperature of the experiment. This is where XML really starts to show its value over legacy formats. Jim’s iterator reads each entry and copies it to a file; I decided to read the entry as an XML document, search for the temperature using XPath and announce it. It’s simple enough that I thought I could do it while watching Liverpool (I used to live on Merseyside). Unfortunately (or fortunately) the torrent of goals distracted me so it had to wait till today.

The temperature is described in the IUCr dictionary and held in CML as (example):

<scalar dictRef="iucr:_cell_measurement_temperature">293.0</scalar>

So this is trivially locatable by XPath (with local-name() and @dictRef):

// iterate through all entries in the document
for (DataEntry de : doc.getDataEnclosures()) {
    if (downloaded >= maxHarvest) {
        return downloaded;
    }
    InputStream in = null;
    try {
        in = get(de.url);
        // standard XOM XML parsing; creates a Document, from which we take the root element
        Element rootElement = new Builder().build(in).getRootElement();
        // standard XPath query via XOM
        Nodes nodes = rootElement.query(
            ".//*[local-name()='scalar' " +
            "and @dictRef='iucr:_cell_measurement_temperature']");
        // if there is a temperature, extract the value
        String temp = (nodes.size() == 0) ? "no temp given" : nodes.get(0).getValue();
        System.out.println("temperature for " + rootElement.getAttributeValue("id") + ": " + temp);
        downloaded++;
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(in);
    }
}
and here’s the output:

1625 [main] DEBUG uk.ac.cam.ch.wwmm.crystaleye.client.Harvester - Getting http://wwmm.ch.cam.ac.uk/crystaleye/summary/rsc/ob/2007/22/data/b712503h/b712503hsup1_pob0401m/b712503hsup1_pob0401m.complete.cml.xml
temperature for rsc_ob_2007_22_b712503hsup1_pob0401m: 115.0
2297 [main] DEBUG uk.ac.cam.ch.wwmm.crystaleye.client.Harvester - Getting http://wwmm.ch.cam.ac.uk/crystaleye/summary/rsc/ob/2007/22/data/b710487a/b710487asup1_ljf130/b710487asup1_ljf130.complete.cml.xml
temperature for rsc_ob_2007_22_b710487asup1_ljf130: 150.0

etc.

It will take the best part of the day to iterate through the entries, but remember that CrystalEye is not a database. We are converting it to RDF (and anyone interested can also do this) when it can be searched in a trivial amount of time and with much more complex questions. (Remember that CrystalEye was not originally designed as a public resource). Until then anyone who wishes to use CrystalEye a lot would do best to download the entries and build their own index.

[Note: I will continue to try to format the code - WordPress makes it very difficult]

CrystalEye and repositories: Jim explains the why and how of Atom

Monday, November 5th, 2007

Since Atom may not be familiar to everyone, Jim Downing has written two expositions on his blog. These explain his thinking on why a series of medium-sized chunks is a better way to support the download of CrystalEye than one or two giant files. Note that he is working on making available some Java code to help with the download – this should do the caching and remember where you left off. If you have technical questions I suggest you leave them on Jim’s blog. If you want to help the project in general use my blog. If you want to hurry the process along by mailing Jim, please refrain. He works very well on occasional beers (he is a brewing aficionado).

Using the Crystaleye Atom feed – November 5th, 2007
Incremental harvesting is done by [the same mechanism], but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: -

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good. [PMR: a minimal sketch follows the list.]
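PMR: For those rolling their own, it is only a few lines of Java. A minimal sketch, where the feed URL is a placeholder and loadStoredEtag stands in for however you persist state between harvests:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of an incremental check: send the ETag from the previous harvest and
// stop if the server answers 304 (Not Modified). The URL is illustrative only.
public class ConditionalGetSketch {
    public static void main(String[] args) throws IOException {
        URL feed = new URL("http://wwmm.ch.cam.ac.uk/crystaleye/feed.xml"); // placeholder
        String lastEtag = loadStoredEtag(); // whatever you saved last time (may be null)

        HttpURLConnection conn = (HttpURLConnection) feed.openConnection();
        if (lastEtag != null) {
            conn.setRequestProperty("If-None-Match", lastEtag);
        }
        // alternatively: conn.setIfModifiedSince(timeOfLastHarvestMillis);

        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            System.out.println("Nothing new since the last harvest - stop here.");
        } else {
            String newEtag = conn.getHeaderField("ETag");
            // ... parse the feed, fetch the new entries, then store newEtag ...
        }
        conn.disconnect();
    }

    private static String loadStoredEtag() { return null; } // stub for the sketch
}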

Implementing a harvester

Atom archiving is easy to code to in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using

java -jar crystaleye-harvester.jar [directory to stick data in]

Letting this rip for a full harvest will take a while, and will take up ~10G of space (although less bandwidth since the content is compressed).

Being a friendly client

First and foremost, please do not multi-thread your requests. [PMR emphasis]

Please put a little delay in between requests. A few 100ms should be enough; the sample harvester uses 500ms – which should be as much as we need.

If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
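PMR: And the two courtesies Jim asks for (a single-threaded loop with a delay, and gzip-compressed transfer) look roughly like this; the list of entry URLs is assumed to come from the Atom feed:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Sketch of a polite fetch loop: one request at a time, ~500 ms between requests,
// asking for (and transparently unwrapping) gzip-compressed responses.
public class PoliteClientSketch {
    static InputStream politeGet(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        InputStream in = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in); // server sent compressed content
        }
        return in;
    }

    public static void main(String[] args) throws Exception {
        List<String> entryUrls = List.of(/* ...URLs taken from the feed... */);
        for (String u : entryUrls) {
            try (InputStream in = politeGet(new URL(u))) {
                // ... save or parse the entry ...
            }
            Thread.sleep(500); // be kind to the server
        }
    }
}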

PMR: Many people don’t think of things like this when they write spiders. Jim’s software should do all the thinking for you.

OJD: Atom History vs Large Archive Files – November 5th, 2007

Incremental harvest is a requirement for data repositories, and the “web-way” is to do it through the uniform interface (HTTP), and connected resources.

We don’t have the resource to provide DVDs of content for everyone who wants the data. Or turning that around – we hope more people will want the data than we have resource to provide for. This isn’t about the cost of a DVD, or the cost of postage, it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.

I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now): -

AD: However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulties in implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders to get this data from the publishers in order to make it better available!
PMR: The clear point is that this is more work than it looks from the outside. Part of the point is that we are trying to improve the process of spidering the scientific web. Nick spent a LOT of time writing spiders – one publisher changed their website design several times over the last two years. If publishers were as friendly to spiders as we are trying to be, the process would be much easier. So if the work with Atom is successful, forward-looking publishers might like to implement it.

ChemSpiderMan Says:
November 5th, 2007 at 4:33 pm

Thanks to Egon for asking his question. As I have been expressing, all I want to do also is “- link to CrystalEye entries”. We will work to support the Atom feeds…

“- link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier). [PMR: The Atom feed contains the link - see above]”

all that is left now is access to the backfile of entries.

PMR: Again to clarify this. The links to the entries are already published in the Atom feed. Each link points to an entry from which (a) the InChI and (b) the CompleteCML are available. The access is there. It’s purely a question of the software (which anyone can write, but Jim is addressing for those who want it customised).
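PMR: To make that concrete, pulling the entry links out of the feed takes only a few lines of XOM. The feed URL below is a placeholder, and exactly which atom:link carries the entry page is something to check against the real feed.

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Nodes;
import nu.xom.XPathContext;

// Sketch: read an Atom feed and print the href of each entry link.
public class AtomLinkSketch {
    public static void main(String[] args) throws Exception {
        XPathContext atom = new XPathContext("atom", "http://www.w3.org/2005/Atom");
        Document feed = new Builder().build("http://wwmm.ch.cam.ac.uk/crystaleye/feed.xml"); // placeholder
        Nodes links = feed.getRootElement().query("//atom:entry/atom:link", atom);
        for (int i = 0; i < links.size(); i++) {
            Element link = (Element) links.get(i);
            System.out.println(link.getAttributeValue("href"));
        }
    }
}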

CrystalEye: request for subsets

Monday, November 5th, 2007

Egon Willighagen has made a clear and appropriate statement/request of what he would like from CrystalEye:

Egon Willighagen Says:
November 5th, 2007 at 11:09 am

Depending on the differences between the RAW and COMPLETE CMLs or maybe CIF files, I would be interested in the one or the other. I am not interested in HTML pages (TOC, indices), images, feeds, histograms, etc, as that would be something my copy would do itself.

The data corpus of CrystalEye, that’s what I would like to download. These CML files are a poor shadow of CrystalEye, only in terms of website functionality. But my interest would not be in setting up CrystalEye2, but would be to have access to the data to:

- link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier)
- derive properties myself
- test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet)
- detect uncommon geometries
- …

PMR: This is clear and probably overlaps closely with what the crystallographic community wishes. I remind readers that this work was initiated by a summer student, Mark Holt, who was sponsored by the International Union of Crystallography. Obviously, therefore, we are interested in feeding material back to the crystallographers if possible. We are also likely to be funded to continue some of the crystallographic stuff.

The information flow in CrystalEye is roughly:

  • robot harvests freely accessible CIFs. Each CIF is given an address/ID based on the provenance (e.g. /publisher/journal/year/issue/DOI). A typical address is http://wwmm.ch.cam.ac.uk/crystaleye/summary//acs/joceah/2007/23/data/jo701566v/jo701566vsup1_a/jo701566vsup1_a.cif.summary.html
  • For each address a page is created (“entry HTML”) which acts like a container. The RAW CML, Complete CML and CIF are addressed from this. Some publishers add copyright notices to their CIFs. While we feel this is unacceptable (because CIFs are facts), and we challenge this practice, we try to honour copyright which means that the CIFs from publishers such as ACS (above) are not held on our server.
  • These CIFs are held unchanged in the entry HTML. The CIFs are then passed through CIFXML-J which converts them to a semantically identical version without added information. There should be no semantic loss and the only syntactic losses are: the precise whitespace formatting, allowed case insensitivity, ordering in the CIF and methods of quoting strings. The result is RAW-CML. If you wish to reconstruct the CIF then CIFXML-J (on sourceforge) should do this without loss. Note that RAW-CML cannot have an InChI, Cartesian coordinates, layout, bond orders, moieties, etc. I do not know whether Jmol will display it correctly (I think it may) and I believe that Open Babel will not transform the fractional coordinates.
  • RAW-CML is then fed into CIF2CML which contains a large number of transformations and heuristics to try to determine the chemical formula and other chemistry from the atom types and positions. It adds bonds, calculates moieties, iterates through them, calculates bond orders, tries to apportion formal charges, generates unique molecules (moieties) with Cartesians, calculates InChI and does a 2D layout. All this should be present in the output: CompleteCML. We expect that there may be bugs in this process due to the imprecision in creating chemistry from atom positions alone. Because CML is extensible, CompleteCML should be a superset of RAW-CML – i.e. all that information is present and unchanged. But I’d welcome comments if it isn’t so.

markupa.PNG

This was the process when Nick reported it at ACS. It’s changed slightly. CIFDOM is now CIFXML and it emits RAWCML. CML* represents CompleteCML. The 2D coordinates (but not the actual images) are held in CompleteCML. Complete CML also contains the moieties, each with its own InChI.

To respond to your requests:

- link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier). [PMR: The Atom feed contains the link - see above]
- derive properties myself [PMR: certainly. You should be able to work directly on the crystal structure and/or the moieties. The only thing you don't have is the fragments.]
- test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet) [PMR: I think there will be objective problems with coordination compounds and organometallics. We normally rely on the author to give the overall charge on a moiety. If they don't we are usually hosed for coordination compounds.]
- detect uncommon geometries. [PMR: Certainly, and we’d love to help. The bondlength plots already do this, and they are probably the first place to start. We would also plan to organise by fragments. Some fragments will have only one occurrence, others will have thousands (we would expect over 100,000 examples of phenyl groups as it can occur 10 times in some structures). Ideally we would do a cluster analysis for each fragment and then you could look for outliers – I did a brief hack at this some years back – in fact I think we corresponded. There is also enormous possibility in intermolecular interactions, perhaps a histogram of close contacts for X..Y (a toy sketch follows below).]
And we invite collaboration.
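As a flavour of what such a close-contact histogram might involve, here is a toy, self-contained sketch. The coordinates and the choice of N..O are invented for illustration; real code would pull Cartesian coordinates from the CompleteCML and exclude intramolecular pairs.

import java.util.List;

// Toy illustration of a histogram of close contacts: bin all N..O distances
// below 4 Å into 0.1 Å bins.
public class CloseContactHistogram {
    record Atom(String element, double x, double y, double z) {}

    static double distance(Atom a, Atom b) {
        double dx = a.x() - b.x(), dy = a.y() - b.y(), dz = a.z() - b.z();
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    public static void main(String[] args) {
        List<Atom> atoms = List.of( // made-up coordinates, for illustration only
                new Atom("N", 0.0, 0.0, 0.0),
                new Atom("O", 2.9, 0.1, 0.0),
                new Atom("O", 3.4, 1.0, 0.5));

        int[] bins = new int[40]; // 0.0 to 4.0 Å in 0.1 Å steps
        for (Atom a : atoms) {
            for (Atom b : atoms) {
                if (!a.element().equals("N") || !b.element().equals("O")) continue;
                double d = distance(a, b);
                if (d < 4.0) bins[(int) (d / 0.1)]++;
            }
        }
        for (int i = 0; i < bins.length; i++) {
            if (bins[i] > 0) System.out.printf("%.1f-%.1f Å: %d%n", i * 0.1, (i + 1) * 0.1, bins[i]);
        }
    }
}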

CrystalEye and Open Data and Open Notebook Science

Monday, November 5th, 2007

There has been more interesting discussion on the contents of CrystalEye, derived data, and the concept of OpenData . I shall address some of the issues and welcome more discussion. Since I have been critical of others I am quite prepared to take criticism myself. Please remember that what you are looking at is the work of a graduate student, Nick Day, who is now writing up his thesis and has effectively finished work on CrystalEye software, other than fixing bugs which would affect his science. Note, however, that the compilation of the database continues automatically every time a new issue of a journal is published. So criticism of me is fair game, criticism of Nick and his work isn’t (and there hasn’t been any).
It is clear that the concepts of Open Data and Open Notebook Science are now of great interest and presumably of great value. I believe that CrystalEye fulfils essentially all the characteristics of ONS – it’s Open and it’s immediate. There is nothing hidden – everyone has access to the same material. Anyone can do research with it. So what are the issues?

CrystalEye and repositories: distribution and integrity (cont)

Monday, November 5th, 2007

Continuing our discussions on how to disseminate CrystalEye without too much effort or breaking too much. In reading this please remember that CrystalEye – like many bioscience databases – was created as a research project. It had the following aims:

  • to see if high-quality scientific data could be extracted robotically from the current literature
  • to test the usefulness and quality of QM methods in high-throughput computation of solid state properties (e.g. by MOPAC)
  • to explore the construction of lightweight “repositories”. This was not a requirement at the start but evolved towards the latter half of the project.

Like most of our exploratory projects we use the filesystem to manage the data. The filesystem is actually extremely good – it’s probably 50 years old and is understood by everyone. And it’s easy to write programs to iterate over it. We never envisaged that we would be requested to share the work so we are having to address new concerns.

Relational and similar databases address the problems of integrity and copying – even so it’s not trivial and may be manufacturer-dependent. So we don’t expect this to be trivial either. On the other hand we are not flying a nuclear reactor so a few broken links are not the end of the world. But we’d like to limit this as much as possible.

Andrew Dalke Says: I think it’s a poor idea to limit the idea of “backup” to mean restoring the “integrity of the filesystem.” I meant being able to create a backup of your repository so you have a valid, recoverable and usable snapshot. A filesystem backup is probably not sufficient without some thought and planning because it does not have the right transactional requirements. Eg, if my app saves to two files in order to generate a backup, and the filesystem backup occurs after the first file is written but before the second then there’s going to be a problem. [PMR: AGREED]

It does sound like if you get a crash while data is being uploaded, and you restore from that backup, then there will be corruption in CrystalEye. That is, “some of the hyperlinks may point to files that were uploaded after the backup point.” [PMR: AGREED]

If data is withdrawn from CrystalEye (eg, a flaw was found in the original data, or there were legal complications causing it to be removed), will that lead to similar problems? [PMR: PROBABLY]
[...]
If you provided versioned records, so that users can retrieve historical forms of a record, then the solution is easy. Ask the spiders to first download a list of URLs which are valid for a given moment in the repository. They can then use those URLs to fetch the desired records, knowing that they will all be internally consistent. Doing this requires some architectural changes which might not be easy, so not necessarily useful to what you have now. I suspect that’s what you mean by “lightweight repository technology”. I’ve been thinking it would be interesting to try something like git, Mercurial or BZR as the source of that technology, but I wouldn’t want to commit to it without a lot of testing.        [PMR: Accurate analysis and thanks for the list]

BTW, what’s a “linkbase”? In the XLink spec it’s “[d]ocuments containing collections of inbound and third-party links.” There are only two Google hits for “standoff linkbase[s]”, and both are from you (one is this posting) without definition, so I cannot respond effectively. I don’t see how issues of internal data integrity had anything to do with needing a linkbase. If all of the data was in some single ACID-compliant DBMS then it’s a well-solved problem.

PMR: In the early days of “XML-Link” there was a lot of experience in hypermedia – Microcosm, Hyper-G, etc. – which relied on bounded object sets (the system had to know what documents it owned and all contributors had to register with the system).

AD: The solution I sketched above does solve this problem and it uses a list of versioned URLs so it might be a “linkbase”. But I can think of other possible solutions, like passing some sort of version token in the HTTP request, so as to retrieve the correct version of a record, or a service which takes the record id and that token and returns the specific document. That would be less ReSTy, but still a solution.

PMR: Generally agreed. However it’s still at the stage of being a research project in distributing repositories and it’s not something we have planned to do or are resourced to do. But there may be some simple approaches.

AD: As for your example of spidering a histogram, that depends on how the spider is implemented. If the 3rd party site which maintains some view of CrystalEye receives a request like that for something it doesn’t know about, it might at that moment query the primary CrystalEye repository and see if it actually is present. In that case the other site acts like a cache. It might do the same with normal pages, and use standard HTTP cache rules to check and freshen its copy so the user is less likely to be affected by version skew.

PMR: The 3rd party is out of our control. That’s why we are building Atom feeds which solve some of the problems. It means implementing a client side tool to manage the caching and Jim will address this.

AD: There’s a question of tradeoff here. How much “corruption”, using your term, is the user willing to accept in order to get more or different functionality? It seems your view is you think there should be zero, no, null chance of corruption, which I think is laudable for a data provider. But evidence in existing use patterns suggests that people don’t mind the occasional error. I use Google despite knowing that it’s not necessarily up to date and can link to removed pages, or pages hidden behind a registration wall. [PMR: see intro] [...]

However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

PMR: I don’t have any problems in general. The PDB and Swissprot (with which I’m most familiar) are collections of flat files so it’s easy to zip and download. CrystalEye contains more complex structure – not too bad, but still complex. It has (at least):

  • table of contents
  • entry pages
  • CIFs
  • raw CMLs
  • complete CMLs
  • moieties
  • fragments
  • images for many of these
  • feeds
  • histograms
  • indexes

So it would be easy to zip the entry pages but these would not have any images and the links would all be broken.  So we could zip all the CIFs (except those from publishers who copyrighted them). But then people would complain they couldn’t read CIFs. So we can zip all the CMLs – and that’s probably the best start. But it means no indexes, no tables of contents, no 2D images, no histograms, no fragments, no moieties.  It will be a very poor shadow of CrystalEye.

And if people are happy with that we’ll think about how to provide versions. No promises.

CrystalEye: what should InChIs reference?

Sunday, November 4th, 2007

In response to my post on the technical issues of CrystalEye, Egon has asked about InChIs (see Unofficial InChI FAQ):

  1. Egon Willighagen Says:
    November 4th, 2007 at 1:43 pm

    Regarding the InChIs: I would prefer one InChI for each moiety, not one InChI for the full structure. Or not only, at least.

Thanks Egon. This is an important and complex area. I’ll try to show some recent examples and make some suggestions. As I go through I am also noticing bugs…
(entry A) 10.1107/S1600536807043048: single molecule per asymmetric unit. No problem:
actanov1.png

InChI=1/C20H28O4/c1-10-11-5-6-12-19(4)8-7-14(21)18(2,3)13(19)9-15(22)20(12,16(10)23)17(11)24/h11-13,15,17,22,24H,1,5-9H2,2-4H3/t11-,12-,13+,15+,17+,19-,20-/m0/s1

(entry B) 10.1107/S160053680705338X.

one dication, two picrates and a solvent:

acatnov2.PNG

Nick has drawn the dication first, but the others are drawable by scrolling.

There is NO InChI for the complete molecule (I’m not sure if this is deliberate), but there IS an InChI for the dication under “Moieties”, as there also is for the solvent. (The anions are missing from the moieties – this may be a CrystalEye bug or it may be an author problem). InChI for dication:

InChI=1/C11H18N4/c1-10-12(3)5-7-14(10)9-15-8-6-13(4)11(15)2/h5-8H,9H2,1-4H3/q+2

InChI for solvent (CH3CN):

InChI=1/C2H3N/c1-2-3/h1H3

(Nick: BUG. The picrates are in the complete CML file but they don’t have InChIs and they don’t appear in the pages)

(entry C) 10.1107/S160053680705009X

This structure has disorder, which always complicates the interpretation of the chemical structure (note the warning).

actanov3.PNG

The InChI is calculated for the major component.

(entry D) 10.1107/S1600536807048301. This contains two identical solvate molecules to one of solvent (dioxan). Because the solvent lies on a symmetry element it is recorded as A B(0.5) rather than A2.B, though this is probably inconsistent.

actanov4.PNG

Note that this is also disordered.

The InChI here represents the compound molecule as A2.B

InChI=1/2C16H14N4O2.C4H8O2/c2*1-11-19-16(22-20-11)15(9-17-10-18-21)14-8-4-6-12-5-2-3-7-13(12)14;1-2-6-4-3-5-1/h2*2-10,21H,1H3,(H,17,18);1-4H2/b2*15-9-;

In summary, therefore, I think we should certainly have InChIs for the moieties (and I think we have, at least in principle). I am less clear how useful it is for the overall crystal structure (as in D). Note that for inorganic structures without discrete moieties there are no InChIs. I am looking for some with discrete moieties.

That’s enough for now. I’ll tackle fragments in the next post or so.