Talks on Open Notebook Science

Two impressive blog posts from Cameron Neylon on Open Notebook Science:

So I have given three talks in ten days or so, one at the CanSAS meeting at NIST,  one at Drexel University and one at MIT last night. Jean-Claude Bradley was kind enough to help me record the talk at Drexel as a screencast and you can see this in various formats here. He has also made some comments on the talk on the UsefulChem Blog and Scientific Blogging site.
The talks at Drexel and MIT were interesting. I was expecting the focus of questions to be more on the issues of being open, the risks and benefits, and problems. Actually the focus of questions was on the technicalities and in particular people wanting to get under the hood and play with the underlying data. Several of the questions I was asked could be translated as ‘do you have an API?’. The answer to this is at the moment no, but we know it is a direction we need to go in.
We have two crucial things we need to address at the moment: the first is the issue of automating some of the posting. We believe this needs to be achieved through an application or script that sits outside the blog itself, so that it can be linked to the process of actually labelling the stuff we make. The second issue is that of an API or web service that allows people to get at the underlying data in an automated fashion. This will be useful for us as we move towards doing analysis of our data as well. Jean-Claude said he was also looking at how to automate processes so clearly this is the next big step forward.
Another question raised at MIT was how you could retro-fit our approach into an existing blog or wiki engine. The key issue here is templates (which are next on my list to describe here in detail), which would probably require some sort of plugin. The other issue is the metadata. Our blog engine goes one step beyond tagging by providing keys with values. Presumably this could be coded into a conventional engine using RDF or microformats – perhaps we should be doing this in our blog in any case?
Incidentally a point I made in both talks, partly in response to the question ‘does anyone really look at it’, is that in many cases it is your own access you are enabling. Making it open means you can always get at your own data, which is a surprisingly helpful thing.
[…]
18:37 05/11/2007, Cameron Neylon, e-notebook, metadata, us trip oct/nov 07, Science in the open
[…]
  1. Neither Wikis nor Blogs provide all the functionality required. Wikis are good at providing a framework within which to organise information, whereas blogs are good at logging information and providing it in a journal format. Barry showed me a hack that he uses in his Wiki-based notebook that essentially provides a means of organising his lab book into experiments and projects but also provides a date-style view. In the Southampton system we would achieve this through creating categories for different experiments, possibly independent blogs for different projects.
  2. Feature requests at Southampton have been driven largely by me, which means that the system is being driven by the needs of the PI. At OpenWetWare the development has been driven by grad students, which means it has focussed on their issues. The question was raised of where the best place to ‘promote’ these systems was. Is it the PIs who, at least at the moment, will get the greatest tangible benefits from the system? Or is it better to persuade grad students to take this up, as they are the end users? Both have very different needs.
  3. Development based on the needs of a single person is unlikely to take us forward as the needs of a specific person are probably not general enough to be useful. Development should focus on enabling the interactions between people, therefore the minimum size ‘user unit’ is two (PI plus researcher, or group of researchers).
  4. The biggest wins for these systems are where collaboration is required and is enabled by a shared space to work in. This is shown by the IGEM lab books and by uptake by my collaborators in the UK. This will be the best place to take development forward.

I need to add links to this post but will do so later. [PMR – know the feeling…]

PMR: This resonates with me – we are clearly at the start of the journey. I find both blogs and wikis too limited, especially for scientific and technical material. Blogs are awful at hypermedia – it’s a real effort to add the links and the software breaks frequently. I’d like to see a blog that guesses what hyperlinks you need based on previous use. Wikis don’t give a sense of direction. Perhaps it makes sense to have a blog which can then be published into a wiki – currently I write some of mine on the basis that they can be scraped later.
I agree with the idea that the primary beneficiary is often the author or the team, rather than the world. I also realise how much implicit information there is – it’s very difficult to make everything available instantly. For example in the computational area we should theoretically expose evidence of submitting jobs but we often don’t keep this for ourselves, relying on the system to manage it for us. It brings home how much work we need to do to document fully what we do.
Posted in open notebook science | Leave a comment

Open NMR: How good is the prediction?

Egon asks about the quality of the prediction:

Egon Willighagen Says:
November 5th, 2007 at 5:58 pm
Peter/Henry, I was wondering about the carbon shifts for atoms in an aromatic ring? Can the QM method you are using work that out? Or can it not distinguish between C=C atoms and C:C atoms (SMILES notation)? The shift differences might really be too small… if I look up C6H6 and C6H8, they are indeed small.

PMR: Nick is the expert here and I’m waiting to talk with him. However here is the first aromatic structure on my list (nmrshiftdb2470-1 (solvent: chloroform)) – I haven’t gone searching.
[image: 2470a.PNG]
PMR: You can see the agreement between obs and calc is good. The maximum deviation seems to be about 2 ppm. The aromatic region is quite dispersed (the two substituents have widely different 13C shifts so that probably helps). The NMR confirms the assignment in NMRShiftDB. I would feel reasonably happy that the shift differences were large enough to identify the atoms. Obviously if the spread was smaller it would be harder. I’m waiting for Nick to collect all the data and mount it.
Note that this is an intra-molecule comparison and I haven’t concerned myself with absolute differences. We believe there is a solvent effect on many carbonyls.  Henry is better placed to comment.

EW: How about 1H NMR prediction, or can QM not do that?

PMR: Yes it can, but the dispersion is much less. 13C shifts run from 0–200 ppm, 1H from 0–10 ppm. So an error of 2 ppm in 13C is not too bad, but in 1H it’s a disaster. Also H atoms are more likely to be affected by conformation, shielding effects and solvent. So it may well be useful, but it will require a lot more work.

Posted in nmr | 3 Comments

CrystalEye and repositories: Jim explains the why and how of Atom

Since Atom may not be familiar to everyone, Jim Downing has written two expositions on his blog. These explain his thinking on why a series of medium-sized chunks is a better way to support the download of CrystalEye than one or two giant files. Note that he is working on making available some Java code to help with the download – this should do the caching and remember where you left off. If you have technical questions I suggest you leave them on Jim’s blog. If you want to help the project in general, use my blog. If you want to hurry the process along by mailing Jim, please refrain. He works very well on occasional beers (he is a brewing aficionado).

Using the Crystaleye Atom feed – November 5th, 2007
Incremental harvesting is done by [the same mechanism], but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: –

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good.
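[Note: a minimal sketch of the ETag approach, in Java (the language the sample harvester is written in). The URL handling and the in-memory ETag store here are illustrative assumptions, not the actual CrystalEye harvester code.]

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ConditionalFetch {
    /**
     * Fetch a URL only if it has changed since we last saw it.
     * ETags seen so far are kept in a map keyed by URL; a real harvester
     * would persist this between runs. Returns null on 304 Not Modified,
     * i.e. when our copy is still good and the incremental harvest is done.
     */
    static String fetchIfChanged(String url, Map<String, String> etags) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        String previousEtag = etags.get(url);
        if (previousEtag != null) {
            conn.setRequestProperty("If-None-Match", previousEtag);
        }
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // nothing new since the last harvest
        }
        etags.put(url, conn.getHeaderField("ETag"));
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}

The same shape works for the If-Modified-Since variant: store the time of the last harvest instead of the ETag and stop on the first 304.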

Implementing a harvester

Atom archiving is easy to code to in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using
java -jar crystaleye-harvester.jar [directory to stick data in]
Letting this rip for a full harvest will take a while, and will take up ~10G of space (although less bandwidth since the content is compressed).

Being a friendly client

First and foremost, please do not multi-thread your requests. [PMR emphasis]
Please put a little delay in between requests. A few 100ms should be enough; the sample harvester uses 500ms – which should be as much as we need.
If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
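[Note: a hedged sketch of the “friendly client” behaviour described above – single-threaded, a pause between requests, gzip accepted. The 500 ms figure comes from Jim’s sample harvester; everything else is illustrative.]

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class PoliteClient {
    // One request at a time, a 500 ms pause before each request, and
    // compressed transfer whenever the server offers it.
    static InputStream politeGet(String url) throws Exception {
        Thread.sleep(500); // be kind to the server between requests
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        InputStream in = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in); // gunzip at the client end
        }
        return in;
    }
}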

PMR: Many people don’t think of things like this when they write spiders. Jim’s software should do all the thinking for you.

OJD: Atom History vs Large Archive Files – November 5th, 2007
Incremental harvest is a requirement for data repositories, and the “web-way” is to do it through the uniform interface (HTTP), and connected resources.
We don’t have the resource to provide DVDs of content for everyone who wants the data. Or turning that around – we hope more people will want the data than we have resource to provide for. This isn’t about the cost of a DVD, or the cost of postage, it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.
I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now): –

AD: However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

OJD: Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulties in implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders to get this data from the publishers in order to make it better available!
PMR: The clear point is that this is more work than it looks from the outside. Part of the point is that we are trying to improve the process of spidering the scientific web. Nick spent a LOT of time writing spiders – one publisher changed their website design several times over the last two years. If publishers were as friendly to spiders as we are trying to be, the process would be much easier. So if the work with Atom is successful, forward-looking publishers might like to implement it.

ChemSpiderMan Says:
November 5th, 2007 at 4:33 pm
Thanks to Egon for asking his question. As I have been expressing, all I want to do also is “- link to CrystalEye entries”. We will work to support the Atom feeds…
“- link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier). [PMR: The Atom feed contains the link – see above]”
all that is left now is access to the backfile of entries.

PMR: Again, to clarify: the links to the entries are already published in the Atom feed. Each link points to an entry from which (a) the InChI and (b) the CompleteCML are available. The access is there. It’s purely a question of the software (which anyone can write, but which Jim is addressing for those who want it customised).
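[Note: as a rough illustration of how little software is needed, the entry links can be pulled out of an Atom page with any standard XML parser. This is not Jim’s code, and the feed URL below is a placeholder.]

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AtomEntryLinks {
    public static void main(String[] args) throws Exception {
        String feedUrl = "http://example.org/crystaleye/feed.xml"; // placeholder, not the real feed address
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new URL(feedUrl).openStream());
        String atomNs = "http://www.w3.org/2005/Atom";
        NodeList entries = doc.getElementsByTagNameNS(atomNs, "entry");
        for (int i = 0; i < entries.getLength(); i++) {
            NodeList links = ((Element) entries.item(i)).getElementsByTagNameNS(atomNs, "link");
            for (int j = 0; j < links.getLength(); j++) {
                // each href leads to an entry page, from which the InChI
                // and the CompleteCML can be reached
                System.out.println(((Element) links.item(j)).getAttribute("href"));
            }
        }
    }
}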

Posted in crystaleye, open issues | 3 Comments

CrystalEye: request for subsets

Egon Willighagen has made a clear and appropriate statement/request of what he would like from CrystalEye:

Egon Willighagen Says:
November 5th, 2007 at 11:09 am
Depending on the differences between the RAW and COMPLETE CMLs or maybe CIF files, I would be interested in the one or the other. I am not interested in HTML pages (TOC, indices), images, feeds, histograms, etc, as that would be something my copy would do itself.
The data corpus of CrystalEye, that’s what I would like to download. These CML files are a poor shadow of CrystalEye, only in terms of website functionality. But my interest would not be in setting up CrystalEye2, but would be to have access to the data to:
– link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier)
– derive properties myself
– test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet)
– detect uncommon geometries
– …

PMR: This is clear and probably overlaps closely with what the crystallographic community wishes. I remind readers that this work was initiated by a summer student, Mark Holt, who was sponsored by the International Union of Crystallography. Obviously, therefore, we are interested in feeding material back to the crystallographers if possible. We are also likely to be funded to continue some of the crystallographic stuff.
The information flow in CrystalEye is roughly:

  • robot harvests freely accessible CIFs. Each CIF is given an address/ID based on the provenance (e.g. /publisher/journal/year/issue/DOI). A typical address is http://wwmm.ch.cam.ac.uk/crystaleye/summary//acs/joceah/2007/23/data/jo701566v/jo701566vsup1_a/jo701566vsup1_a.cif.summary.html
  • For each address a page is created (“entry HTML”) which acts like a container. The RAW CML, Complete CML and CIF are addressed from this. Some publishers add copyright notices to their CIFs. While we feel this is unacceptable (because CIFs are facts), and we challenge this practice, we try to honour copyright which means that the CIFs from publishers such as ACS (above) are not held on our server.
  • These CIFs are held unchanged in the entry HTML. The CIFs are then passed through CIFXML-J which converts them to a semantically identical version without added information. There should be no semantic loss and the only syntactic losses are: the precise whitespace formatting, allowed case insensitivity, ordering in the CIF and methods of quoting strings. The result is RAW-CML. If you wish to reconstruct the CIF then CIFXML-J (on sourceforge) should do this without loss. Note that RAW-CML cannot have an InChI, Cartesian coordinates, layout, bond orders, moieties, etc. I do not know whether Jmol will display it correctly (I think it may) and I believe that Open Babel will not transform the fractional coordinates.
  • RAW-CML is then fed into CIF2CML which contains a large number of transformations and heuristics to try to determine the chemical formula and other chemistry from the atom types and positions. It adds bonds, calculates moieties, iterates through that, calculates bond orders, tries to apportion formal charges, generates unique molecules (moieties) with Cartesians, calculates InChI and does a 2D layout. All this should be present in the output: CompleteCML. We expect that there may be bugs in this process due to the imprecision in creating chemistry from atom positions alone. Because CML is extensible CompleteCML should be a superset of the RAWCML – i.e. all that information is present and unchanged. But I’d welcome comments if it isn’t so.

[image: markupa.PNG – information-flow diagram as reported at ACS]
This was the process when Nick reported it at ACS. It’s changed slightly. CIFDOM is now CIFXML and it emits RAWCML. CML* represents CompleteCML. The 2D coordinates (but not the actual images) are held in CompleteCML. Complete CML also contains the moieties, each with its own InChI.
To respond to your requests:
– link to CrystalEye entries (assuming the CMLs contains a CrystalEye identifier). [PMR: The Atom feed contains the link – see above]
– derive properties myself [PMR: certainly. You should be able to work directly on the crystal structure and/or the moieties. The only thing you don’t have is the fragments.]
– test the new CDK atom typing algorithms (which chemistry is there in the CIF files that CDK cannot deal with yet) [PMR: I think there will be objective problems with coordination compounds and organometallics. We normally rely on the author to give the overall charge on a moiety. If they don’t we are usually hosed for coordination compounds.]
  • detect uncommon geometries. [PMR: Certainly, and we’d love to help. The bondlength plots already do this, and they are probably the first place to start. We would also plan to organise by fragments. Some fragments will have only one occurrence, others will have thousands (we would expect over 100,000 examples of phenyl groups as it can occur 10 times in some structures). Ideally we would do a cluster analysis for each fragment and then you could look for outliers – I did a brief hack at this some years back – in fact I think we corresponded. There is also enormous possibility in intermolecular interactions, perhaps a histogram of close contacts for X..Y.]
And we invite collaboration.

Posted in crystaleye, open issues | 1 Comment

CrystalEye and Open Data and Open Notebook Science

There has been more interesting discussion on the contents of CrystalEye, derived data, and the concept of OpenData . I shall address some of the issues and welcome more discussion. Since I have been critical of others I am quite prepared to take criticism myself. Please remember that what you are looking at is the work of a graduate student, Nick Day, who is now writing up his thesis and has effectively finished work on CrystalEye software, other than fixing bugs which would affect his science. Note, however, that the compilation of the database continues automatically every time a new issue of a journal is published. So criticism of me is fair game, criticism of Nick and his work isn’t (and there hasn’t been any).
It is clear that the concepts of Open Data and Open Notebook Science are now of great interest and presumably of great value. I believe that CrystalEye fulfils essentially all the characteristics of ONS – it’s Open and it’s immediate. There is nothing hidden – everyone has access to the same material. Anyone can do research with it. So what are the issues? […]

Posted in crystaleye | Leave a comment

CrystalEye and repositories: distribution and integrity (cont)

Continuing our discussions on how to disseminate CrystalEye without too much effort or breaking too much. In reading this please remember that CrystalEye – like many bioscience databases – was created as a research project. It had the following aims:

  • to see if high-quality scientific data could be extracted robotically from the current literature
  • to test the usefulness and quality of QM methods in high-throughput computation of solid state properties (e.g. by MOPAC)
  • to explore the construction of lightweight “repositories”. This was not a requirement at the start but evolved towards the latter half of the project.

Like most of our exploratory projects we use the filesystem to manage the data. The filesystem is actually extremely good – it’s probably 50 years old and is understood by everyone. And it’s easy to write programs to iterate over it. We never envisaged that we would be requested to share the work so we are having to address new concerns.
Relational and similar databases address the problems of integrity and copying – even so it’s not trivial and may be manufacturer-dependent. So we don’t expect this to be trivial either. On the other hand we are not running a nuclear reactor, so a few broken links are not the end of the world. But we’d like to limit this as much as possible.

Andrew Dalke Says: I think it’s a poor idea to limit the idea of “backup” to mean restoring the “integrity of the filesystem.” I meant being able to create a backup of your repository so you have a valid, recoverable and usable snapshot. A filesystem backup is probably not sufficient without some thought and planning because it does not have the right transactional requirements. Eg, if my app saves to two files in order to generate a backup, and the filesystem backup occurs after the first file is written but before the second then there’s going to be a problem. [PMR: AGREED]

It does sound like if you get a crash while data is being uploaded, and you restore from that backup, then there will be corruption in CrystalEye. That is, “some of the hyperlinks may point to files that were uploaded after the backup point.” [PMR: AGREED]

If data is withdrawn from CrystalEye (eg, a flaw was found in the original data, or there were legal complications causing it to be removed), will that lead to similar problems? [PMR: PROBABLY]
[…]
If you provided versioned records, so that users can retrieve historical forms of a record, then the solution is easy. Ask the spiders to first download a list of URLs which are valid for a given moment in the repository. They can then use those URLs to fetch the desired records, knowing that they will all be internally consistent. Doing this requires some architectural changes which might not be easy, so not necessarily useful to what you have now. I suspect that’s what you mean by “lightweight repository technology”. I’ve been thinking it would be interesting to try something like git, Mercurial or BZR as the source of that technology, but I wouldn’t want to commit to it without a lot of testing.        [PMR: Accurate analysis and thanks for the list]

BTW, what’s a “linkbase”? In the XLink spec it’s “[d]ocuments containing collections of inbound and third-party links.” There are only two Google hits for “standoff linkbase[s]”, and both are from you (one is this posting) without definition, so I cannot respond effectively. I don’t see how issues of internal data integrity had anything do with needing a linkbase. If all of the data was in some single ACID-compliant DBMS then it’s a well-solved problem.

PMR: In the early days of “XML-Link” there was a lot of experience in hypermedia – Microcosm, Hyper-G, etc. – which relied on bounded object sets (the system had to know what documents it owned and all contributors had to register with the system).

AD: The solution I sketched above does solve this problem and it uses a list of versioned URLs so it might be a “linkbase”. But I can think of other possible solutions, like passing some sort of version token in the HTTP request, so as to retrieve the correct version of a record, or a service which takes the record id and that token and returns the specific document. That would be less ReSTy, but still a solution.

PMR: Generally agreed. However it’s still at the stage of being a research project in distributing repositories and it’s not something we have planned to do or are resourced to do. But there may be some simple approaches.

AD: As for your example of spidering a histogram, that depends on how the spider is implemented. If the 3rd party site which maintains some view of CrystalEye receives a request like that for something it doesn’t know about, it might at that moment query the primary CrystalEye repository and see if it actually is present. In that case the other site acts like a cache. It might do the same with normal pages, and use standard HTTP cache rules to check and freshen its copy so the user is less likely to be affected by version skew.

PMR: The 3rd party is out of our control. That’s why we are building Atom feeds, which solve some of the problems. It means implementing a client-side tool to manage the caching, and Jim will address this.

AD: There’s a question of tradeoff here. How much “corruption”, using your term, is the user willing to accept in order to get more or different functionality? It seems your view is you think there should be zero, no, null chance of corruption, which I think is laudable for a data provider. But evidence in existing use patterns suggests that people don’t mind the occasional error. I use Google despite knowing that it’s not necessarily up to date and can link to removed pages, or pages hidden behind a registration wall. [PMR: see intro] […]

AD: However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

PMR: I don’t have any problems in general. The PDB and Swissprot (with which I’m most familiar) are collections of flat files so it’s easy to zip and download. CrystalEye contains more complex structure – not too bad, but still complex. It has (at least):

  • table of contents
  • entry pages
  • CIFs
  • raw CMLs
  • complete CMLs
  • moieties
  • fragments
  • images for many of these
  • feeds
  • histograms
  • indexes

So it would be easy to zip the entry pages but these would not have any images and the links would all be broken.  So we could zip all the CIFs (except those from publishers who copyrighted them). But then people would complain they couldn’t read CIFs. So we can zip all the CMLs – and that’s probably the best start. But it means no indexes, no tables of contents, no 2D images, no histograms, no fragments, no moieties.  It will be a very poor shadow of CrystalEye.
And if people are happy with that we’ll think about how to provide versions. No promises.
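[Note: “zip all the CMLs” is a few lines of code once you can iterate over a local copy of the tree. This sketch assumes the complete CML files can be recognised by a filename suffix – the suffix used here is hypothetical, not the real CrystalEye naming convention.]

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipCompleteCmls {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]); // local copy of the CrystalEye tree
        List<Path> cmls;
        try (Stream<Path> walk = Files.walk(root)) {
            cmls = walk.filter(p -> p.getFileName().toString().endsWith(".complete.cml.xml")) // hypothetical suffix
                       .collect(Collectors.toList());
        }
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(Paths.get("crystaleye-cml.zip")))) {
            for (Path p : cmls) {
                zip.putNextEntry(new ZipEntry(root.relativize(p).toString()));
                Files.copy(p, zip); // stream each CML file into the archive
                zip.closeEntry();
            }
        }
    }
}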

Posted in crystaleye | 1 Comment

Is science copyrightable?

I have been jolted into wondering whether scientific publications are actually protectable by copyright. I’m almost certainly wrong, but here’s my little journey. I found Tim O’Reilly (et al.): Publishing Digital Fair Use 23:26 03/11/2007, Peter Brantley, Planet SciFoo:

Fair use is a doctrine in United States copyright law that allows limited use of copyrighted material without requiring permission from the rights holders, such as use for scholarship or review.

PMR: and he goes on to explain how a major publisher was ignorant of the basics of fair use:

Of the entire conversation, certainly the greatest disappointment to me was the obviously incomplete understanding held by my publishing colleague of what fair use actually is — in other words, its fundamental characteristics, such as its relativistic nature and lack of definitional precision, the 4-point multiple-factor test (see the good discussion at the University of Texas site, about halfway down, “Using the Four Factor Fair Use Test”), and what the doctrine aims to sustain. My sudden need for a basic, conceptual presentation took me by surprise, and given the fact that I was speaking to a whip-smart VP of a major publishing house, I felt it was an unfortunate one.

PMR: I followed up the Texas link which included:

1. Is the work protected?

Copyright does not protect, this Policy does not apply to, and anyone may freely use*:

PMR: and I followed the last link to Cornell:

§ 102. Subject matter of copyright: In general

(a) Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. Works of authorship include the following categories:
(1) literary works;
(2) musical works, including any accompanying words;
(3) dramatic works, including any accompanying music;
(4) pantomimes and choreographic works;
(5) pictorial, graphic, and sculptural works;
(6) motion pictures and other audiovisual works;
(7) sound recordings; and
(8) architectural works.
(b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.

PMR: Well, IANAL, but the last sentence looks pretty clear to me. Large parts of a scientific paper are:

  • procedure
  • principle
  • discovery

PMR: and the form (including illustrations) is irrelevant.
This is, I assume, US only but nonetheless it seems pretty clear that I can:

  • reproduce a chemical synthesis/recipe (“procedure”)
  • reproduce a chemical graph (“discovery”)
  • reproduce a chemical molecule diagram (“concept”)

Now I’m quite happy to avoid reproducing the publishers’ pagination (I hate PDFs anyway). But can anyone disillusion me as to why I shouldn’t download and reproduce material from PARTS of scientific papers without permission? And is anyone happy to accompany me to the barricades?

Posted in open issues | 5 Comments

CrystalEye: what should InChIs reference?

In response to my post on the technical issues of CrystalEye, Egon has asked about InChIs (see Unofficial InChI FAQ):

  1. Egon Willighagen Says:
    November 4th, 2007 at 1:43 pm
    Regarding the InChIs: I would prefer one InChI for each moiety, not one InChI for the full structure. Or not only, at least.

Thanks Egon. This is an important and complex area. I’ll try to show some recent examples and make some suggestions. As I go through I am also noticing bugs…
(entry A) 10.1107/S1600536807043048 – a single molecule per asymmetric unit. No problem:
[image: actanov1.png]
InChI=1/C20H28O4/c1-10-11-5-6-12-19(4)8-7-14(21)18(2,3)13(19)9-15(22)20(12,16(10)23)17(11)24/h11-13,15,17,22,24H,1,5-9H2,2-4H3/t11-,12-,13+,15+,17+,19-,20-/m0/s1
(entry B) 10.1107/S160053680705338X.

one dication, two picrates and a solvent:
[image: acatnov2.PNG]
Nick has drawn the dication first, but the others are drawable by scrolling.
There is NO InChI for the complete molecule (I’m not sure if this is deliberate), but there IS an InChI for the dication under “Moieties”, as there also is for the solvent. (The anions are missing from the moieties – this may be a CrystalEye bug or it may be an author problem). InChI for dication:
InChI=1/C11H18N4/c1-10-12(3)5-7-14(10)9-15-8-6-13(4)11(15)2/h5-8H,9H2,1-4H3/q+2
InChI for solvent (CH3CN):
InChI=1/C2H3N/c1-2-3/h1H3
(Nick: BUG. The picrates are in the complete CML file but they don’t have InChIs and they don’t appear in the pages)
(entry C) 10.1107/S160053680705009X
This structure has disorder, which always complicates the interpretation of the chemical structure (note the warning).
[image: actanov3.PNG]
The InChI is calculated for the major component.
(Entry D) 10.1107/S1600536807048301 This contains two identical solvate molecules to one molecule of solvent (dioxan). Because the solvent lies on a symmetry element it is recorded as A B(0.5) rather than A2.B, though this is probably inconsistent.
[image: actanov4.PNG]
Note that this is also disordered.
The InChI here represents the compound molecule as A2.B
InChI=1/2C16H14N4O2.C4H8O2/c2*1-11-19-16(22-20-11)15(9-17-10-18-21)14-8-4-6-12-5-2-3-7-13(12)14;1-2-6-4-3-5-1/h2*2-10,21H,1H3,(H,17,18);1-4H2/b2*15-9-;
In summary, therefore, I think we should certainly have InChIs for the moieties (and I think we have, at least in principle). I am less clear how useful it is for the overall crystal structure (as in D). Note that for inorganic structures without discrete moieties there are no InChIs. I am looking for some with discrete moieties.
That’s enough for now. I’ll tackle fragments in the next post or so.

Posted in crystaleye | 2 Comments

CrystalEye and repositories: distribution and integrity

Andrew Dalke has raised two useful issues and I will address them separately. The first is about integrity of a repository (I will start using that word rather than database).

  1. Andrew Dalke Says:
    November 4th, 2007 at 2:19 am
    [quoting PMR] “During that time the database will have grown by 10-15% so that that percentage of links will ipso facto be broken. So any redistribution will involve distributing a broken system.”
    What? Are you saying you don’t have backups for your system? If it goes down and you recover from backups, will pages be broken? I hope not! And if not, then use the backups to generate the distribution. That can’t break the server.

PMR: Yes, we have backups, but that addresses the integrity of the filesystem at a given point in time, not the integrity of links in a hypermedia system. For example, if you are uploading a set of web pages to a server, and it is backed up in the middle of that, and you revert to that backup, the filesystem will be correct at that time but some of the hyperlinks may point to files that were uploaded after the backup point. That’s a difficult problem and unless you operate with bounded object sets or standoff linkbases it’s not soluble. At present CrystalEye does not use linkbases.
We are looking into lightweight repository technology for molecules and Jim will probably be writing about this elsewhere.
Now imagine that a spider starts to download the entries and “finishes” a month later. During that time several thousand new entries will have been added. The spider then extracts the bond length histograms. These will point to the old and new entries and our site will honour the links. But the spider will not have downloaded the new entries. So the histograms on the spider’s site will have hundreds of thousands of broken links. Users of the spider’s histograms may get the impression that our site is broken when it is not. That is a simple example of not being able to honour the integrity of the work.

  1. [PMR, quoted] if the whole DB is zipped into a 100GB file, downloading that is likely to break the server and the connection
    [AD] ftp and bittorrent do very good jobs of transferring 100 GB files. I mentioned in another comment that using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.
  2. The data distribution site does not need to be on the same machine as your service. That’s a key part of a ReST architecture.

PMR: Thanks for the suggestion. This would be OK if we made a snapshot of CrystalEye every year. (Even then it’s hard work to produce a complete distribution that honours integrity.) But we want to keep users updated and feel that Atom feeds (which we have already started) are the better way. Jim’s repository should be able to provide the mechanism for regular snapshots.

Posted in crystaleye | 1 Comment

CrystalEye: data loss and corruption through legacy files

Andrew Dalke raised the issue of data corruption:

  1. Andrew Dalke Says:
    November 4th, 2007 at 2:32 am
    [quoting PMR] “Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data.”
    “Corruption” is a strong word. Why not think of it as the way you wrote in your “Round-trip format conversion” wikipedia article?

PMR: Here is a widespread and almost universal example of corruption which is almost entirely down to the use of SD (MOL) files and/or SMILES in particular (but is common to almost all legacy formats). Nitric oxide (WP) is a very important molecule – it is an essential signalling molecule in the vascular system, and also a serious pollutant from transport. Its formula is NO, one nitrogen atom and one oxygen atom.
A large number of freely accessible databases give other formulas:

PMR: These variations are not because there are different opinions about what “nitric oxide” is, or whether the name may be used differently by different communities. They are because the use of SD/MOL or SMILES has corrupted the information. Because SD files have no mechanism for indicating that an atom does not have implicit hydrogens, many programs are “clever” and add them according to “valence rules” (so a record holding just N=O is typically read back as HN=O, because a hydrogen is added to satisfy nitrogen’s default valence). While these rules are OK for a subset of chemistry they are a disaster for others. Nitric oxide is just one of many examples where they fail. So that is why I cannot answer Chemspider’s request for SD files of CrystalEye – I KNOW it will corrupt the information. It is possible that there is a simple algorithm that could filter out “most” of the entries which would not be corrupted, but it will not be watertight. That is why we have developed CML – it is designed to avoid corruption.

  1. When a document in one format is converted to another there is likely to be information loss. Is “information loss” necessarily “corruption”? From my experience in dealing with PDB files, which has some of these crystallographic properties, I think there can be meaningful information despite the information loss. So long as the tools and the users understand that there are limitations in the conversion.

PMR: There are “obviously” parts of the information that can be omitted without corruption. An example is “iucr:_publ_contact_author_phone”. But what happens if you omit “occupancy” in an entry? It looks like:
[image: nite.PNG – entry showing the _chemical_formula_sum with non-integral atom counts]
Notice that the _chemical_formula_sum contains non-integral atom counts – this is common in crystal structures and is supported by the _atom_site_occupancy flag in CIF, which points to the last field before the two dots.
_atom_site_type_symbol
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_U_iso_or_equiv
_atom_site_adp_type
_atom_site_calc_flag
_atom_site_refinement_flags
_atom_site_occupancy
_atom_site_disorder_assembly
_atom_site_disorder_group
Ni Ni1 0.05235(10) 0.2500 0.4203(2) 0.0152(4) Uani d S 1 . .
Ni Ni2 0.14310(14) 0.2500 0.0802(3) 0.0185(4) Uani d SP 0.80 . .
Ni Ni3 0.15349(14) -0.2500 0.5956(3) 0.0191(5) Uani d SP 0.80 . .
Te Te1 0.25343(5) 0.2500 0.42149(10) 0.0131(3) Uani d S 1 . .
Te Te2 0.00373(5) 0.2500 0.78163(10) 0.0146(3) Uani d S 1 . .
Confirm that Ni(1+0.8+0.8) => Ni2.6 and Te(1+1) => Te2. CML is designed to hold this without loss (through the occupancy attribute) but SD files, SMILES and almost all other legacy formats (except PDB and a few other crystallographic formats) are not. Therefore using SD to bundle this entry and transmit it is guaranteed to corrupt it.
[Note added later. There is a well characterised HN=O molecule – see NIST Webbook – but it is nitrosyl hydride, not nitric oxide.]

Posted in chemistry, crystaleye | 8 Comments