petermr's blog

A Scientist and the Web

 

Archive for November, 2007

SPECTRa tools released

Friday, November 30th, 2007

The SPECTRa tools allow chemists (perhaps group or departmental analytical spectroscopy groups) to submit their data (spectra, crystal structures, compchem) to a repository.

From Jim Downing SPECTRa released

Now that a number of niggling bugs have been ironed out, we’ve released a stable version of the SPECTRa tools.

There are prebuilt binaries for spectra-filetool (command line tool helpful for performing batch validation, metadata extraction and conversion of JCAMP-DX, MDL Mol and CIF files), and spectra-sub (flexible web application for depositing chemistry data in repositories). The source code is available from the spectra-chem package, or from Subversion. All of these are available from the spectra-chem SourceForge site.

Mavenites can obtain the libraries (and source code) from the SPECTRa maven repo at http://spectra-chem.sourceforge.net/maven2/. The groupId is uk.ac.cam.spectra – browse around for artifact ids and versions.

PMR: This is an important tool in the chain and congratulations to Jim for designing and building it. It interfaces with a repository (as they say on kids’ toys, “repository not included”) so that you can customise your own business process. We hope to see departments appreciating the need for repositing their data (it gets lost, it could be re-used, etc.).

The legacy formats (CIF, JCAMP, Gaussian, etc.) are well structured, and SPECTRa allows them to be used in a way which maximises the effort that went into creating them. The process is almost automatic for crystallography (a good CIF has all the metadata inside it) but requires a small amount of manual effort for spectra (the molecule is not normally embedded in the JCAMP so has to be provided separately).

The system is potentially searchable by chemistry – it might look something like Crystaleye  with a search provided by OpenBabel.

CML/MathML and OpenStreetmap in Bremen

Friday, November 30th, 2007

I haven’t posted much because I’m in Bremen (DE) working with Michael Kohlhase on the integration of CML and MathML into scientific publications. We’re hoping to come up with some guidelines shortly.

Last time I was here I got lost because I didn’t realise Bremen was strung out like pearls along the River Weser, many tens of kilometres long. So this time I asked Google for a map of Bremen and was delighted to find Bremen – OpenStreetMap

Free map of Bremen

OpenStreetMap images (and underlying map data) are freely available under the OpenStreetMap License

OpenStreetMap is one of the great ideas for liberating data. Lots of volunteers produce bits of maps by various means – GPS units on cycle couriers are among the most exciting. See OpenStreetMap | Public GPS traces tagged with Bremen – the animated images are worth seeing.

Chemspider and Pubchem – open data

Friday, November 30th, 2007

I was very pleased to see:

ChemSpider Blog » Blog Archive » The Entire ChemSpider Database is On Its Way to PubChem!

which describes how the Chemspider database is being offered to Pubchem as “open data”. Chemspiderman has made a valuable attempt to navigate the complexities of Open Data and recursive licences. It is technically difficult and takes us into unknown territory. For a start it is difficult to describe what the final object is. I understand Pubchem as a collection of links coupled to authority – i.e. Pubchem holds links to the Chemspider compounds but does not actually hold the data. (I am not aware that Pubchem holds any data other than a fairly small amount of computed data (e.g. number of rotatable bonds) and names.) It does, of course, hold the data that NIH collects through the roadmap program. But I’d be happy to be corrected.

Chemspider repeats my suggestions for criteria for Open Data and adds:

CS: For right now I am giving up on trying to track where Open Data might end up. Based on my previous discussions with Peter Suber regarding navigating the complexities of Open Access definitions, I understand there is a need to define our own policies. I’m not going to do that here but what I will be clear with is that once the ChemSpider structure set is deposited in PubChem then we are at the mercies of THEIR data sharing policies. I believe Peter [PMR, not sure which Peter - but if me, see below] holds up PubChem as the primary example of Open Data (but maybe not). So, I believe it should be true to say that the ChemSpider structure set IS Open Data when accessed/downloaded/shared from PubChem. But I understand that will then be the PubChem data set and all association with us will likely be lost. But that is fully acceptable!

PMR: This shows the complexities. We will need to see how the data actually end up in Pubchem. But at present Pubchem holds only links to authorities. Thus if I search for aspirin I get 61 suppliers of information (search result), each entry in which links back to the supplier’s site. So any “data” (e.g. melting point) is not in Pubchem. Unless Chemspider is different, I would expect that only the links would be held in Pubchem. If I am right, then accessing Chemspider through Pubchem is simply another way of accessing Chemspider.

In a comment Rich Apodaca says:

Regardless of how exactly linkage occurs, the end result would be that any third party could, independently of ChemSpider, reconstruct the entire ChemSpider compound database. By using the ChemSpider Web APIs, they could develop a parallel service that re-processes the ChemSpider analytical data and patent/primary literature data, possibly mashing up the data from other sources as well.

This sets the bar very high for Open data in chemistry. I’m not sure what to call it, but it’s a game-changer.

If Chemspider allows the direct download and re-use of their data from their site then I also congratulate them. This is completely independent of whether the entries are linked from Pubchem. However it will be necessary to add a licence statement to the Chemspider pages (not Pubchem) making this clear.

It may be picky but I don’t think that Pubchem – in common with many other bioscience sites – actually gives explicit permission for re-use. Agreed that it is a work of the US government so should be free of copyright. There is an unspoken tradition in bioscience that data and collections are “Open” in some way but it isn’t well spelt out.

It should be.

Data-driven science

Monday, November 26th, 2007

I’ll be writing more about this later. This was catalysed by an email from Douglas Kell at Manchester – we share the same problem, that data-driven science is treated as a second-class activity. He picked up a paper from http://www.sis.pitt.edu/~repwkshop/papers.html on data-driven science and laments:

the obsession of biologists, and especially molecular cell biologists, with hypothesis-DEPENDENT science. At least we are hoping to fix some of this in Systems Biology.
It is still hard for them to understand that it is difficult to make hypotheses about molecules you do not even know exist, and thereby do something REALLY new, and as you say they do not really recognise that building tools is VERY important.

So I have been lamenting the lack of data in chemistry – he and Stephen Oliver lament the culture (BioEssays 26:99–105, 2003, Wiley Periodicals, Inc. – I expect it’s inaccessible to half the readers of this blog):

Summary
It is considered in some quarters that hypothesis-driven methods are the only valuable, reliable or significant means of scientific advance. Data-driven or ‘inductive’ advances in scientific knowledge are then seen as marginal, irrelevant, insecure or wrong-headed, while the development of technology—which is not of itself ‘hypothesis-led’ (beyond the recognition that such tools might be of value)—must be seen as equally irrelevant to the hypothetico-deductive scientific agenda. We argue here that data- and technology-driven programmes are not alternatives to hypothesis-led studies in scientific knowledge discovery but are complementary and iterative partners with them. Many fields are data-rich but hypothesis-poor. Here, computational methods of data analysis, which may be automated, provide the means of generating novel hypotheses, especially in …

I hadn’t realised it was so bad elsewhere. Now I realise it is. No wonder we struggle in cyberscholarship – no data, and even if we have it, it’s not “proper” research.

More later.

[I do not normally blog private emails but this is an obvious exception.]

Why is it so difficult to develop systems?

Sunday, November 25th, 2007

Dorothea Salo (who runs the Caveat Lector blog) is concerned (Permalink) that developers and users (an ugly word) don’t understand each other:

(I posted a lengthy polemic to the DSpace-Tech mailing list in response to a gentle question about projected DSpace support for electronic theses and dissertations. I think the content is relevant to more than just the DSpace community, so I reproduce it here, with an added link or two.)

and

My sense is that DSpace development has only vaguely and loosely been guided by real-world use cases not arising from its inner circle of contributing institutions. E.g., repeated emails to the tech and dev lists concerning metadata-only deposits (the use case there generally being institutional-bibliography development), ETD management, true dark archiving, etc. etc. have not been answered by development initiatives, or often by anything but “why would you even want that?” incomprehension or “just hack it in like everybody else!” condescension.

PMR: This has been a perennial problem for many years and will continue to be so. I’m also not commenting on DSpace (although it is clearly acquiring a large code base). But my impression of the last 10–15 years (especially of W3C and Grid/eScience projects) is that such systems rapidly become overcomplicated, overextended and fail to get people using them.

On the other hand there are the piles of spaghetti bash, C, Python and so on which adorn academic projects and cause just as much heartache. Typical “head-down” or throwaway code.

The basic fact is that most systems are complicated, and there isn’t a lot that can be done about that easily. It’s summed up by the well-known Conservation of Complexity:

This is a hypothesis that software complexity cannot be created or destroyed, it can only be shifted (or translated) from one place (or form) to another.

If, of course, you are familiar with the place that the complexity has shifted to, it’s much easier. So if someone has spent time learning how to run spreadsheets, or workflows, or Python, and if the system has been adapted to those, it may be easier. But if those systems are new then they will have serious rough edges. We found this with the Taverna workflow, which works for bioscience but isn’t well suited (yet) to chemistry. We spent months on it, but those involved have reverted to using Java code for much of our glueware. We understand it, our libraries work, and since it allows very good test-driven development and project management it’s ultimately cost-effective.

 

We went through something like the process Dorothea mentions when we started to create a submission tool for crystallography in the SPECTRa : JISC project. We thought we could transfer the (proven) business process that Southampton had developed for the National Crystallographic Centre, and that the crystallographers would appreciate it. It would automate the management of the process from receiving the crystal to repositing the results in DSpace.

 

It doesn’t work like that in the real world.

 

The crystallographers were happy to have a repositing tool, but they didn’t want to change their process and wouldn’t thank us for providing a bright shiny new one that was “better”. They wanted to stick with their paper copies and the way they already disseminated their data. So we realised this, and backtracked. It cost us three months, but that’s what we have to factor into these projects. It’s a lot better than wasting a year producing something people don’t want.

 

Ultimately much of the database and repository technology is too complicated for what we need at the start of the process. I am involved in one project where the database requires an expert to spend six months tooling it up. I thought DSpace was the right way to go to reposit my data but it wasn’t. I (or rather Jim) put 150,000+ molecules into it, but they aren’t indexed by Google and we can’t get them out en masse. Next time we’ll simply use web pages.

 

By contrast we find that individual scientists, if given the choice, revert to two or three simple, well-proven systems:

  • the hierarchical filesystem
  • the spreadsheet

A major reason these hide complexity is that they have no learning curve, and have literally millions of users and years of experience behind them. We take the filesystem for granted, but it’s actually a brilliant invention – the credit goes to Dennis Ritchie in ca. 1969. (I well remember when my backing store was composed of punched tape and cards.)
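To show how little machinery the filesystem approach needs, here is a minimal sketch in Java (our usual glueware language). The layout it assumes – one directory per compound holding .cif and .jdx files – is a made-up example, not the actual SPECTRa structure:

    // Minimal sketch: treat a directory tree as the "repository" and list the
    // data files in it. The layout (one directory per compound, holding .cif
    // and .jdx files) is a hypothetical example, not the real SPECTRa layout.
    import java.io.File;

    public class FileTreeWalker {

        public static void walk(File dir) {
            File[] children = dir.listFiles();
            if (children == null) {
                return; // not a directory, or unreadable
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    walk(child);
                } else if (child.getName().endsWith(".cif")
                        || child.getName().endsWith(".jdx")) {
                    System.out.println(child.getPath());
                }
            }
        }

        public static void main(String[] args) {
            walk(new File(args.length > 0 ? args[0] : "."));
        }
    }

Browsing, backup and versioning with ordinary tools all come for free, which is exactly why scientists keep reverting to it.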

If you want differential access to resources, and record locking, and audit trails, and rollback, and integrity of committal, and you are building it all from scratch, it will be a lot of work. And you lose sight of your users.

So we’re looking seriously at systems based on simpler technology than databases – such as RDF triple stores coupled to the filesystem and XML.
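To give a flavour of what “simpler than a database” might look like, here is a small sketch using the Jena RDF library; the namespace, property names and values are invented for illustration and are not a real chemistry ontology:

    // Sketch: describe a repository entry as RDF triples and write them out as
    // N-TRIPLES, which can sit in a flat file alongside the data itself.
    // The namespace, property names and values below are illustrative only.
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class TripleStoreSketch {
        public static void main(String[] args) {
            String ns = "http://example.org/chem#"; // hypothetical namespace
            Model model = ModelFactory.createDefaultModel();

            Property molecularFormula = model.createProperty(ns, "molecularFormula");
            Property hasSpectrumFile = model.createProperty(ns, "hasSpectrumFile");

            Resource entry = model.createResource(ns + "entry42"); // placeholder id
            entry.addProperty(molecularFormula, "C2H6O");          // placeholder value
            entry.addProperty(hasSpectrumFile, "spectra/entry42.jdx");

            model.write(System.out, "N-TRIPLE");
        }
    }

The attraction is that the triples are just text: they can be regenerated from the filesystem at any time, and queried when (and only when) we need to.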

And the main rule is that both the users and the developers have to eat the same dogfood.  It’s slow and not always tasty. And you can’t avoid Fred Brooks:

 Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in one step. An intermediate step called the pilot plant is necessary….In most [software] projects, the first system is barely usable. It may be too slow, too big, awkward to use, or all three. There is no alternative but to start again, smarting but smarter, and build a redesigned version in which these problems are solved…. Delivering the throwaway to customers buys time, but it does so only at the cost of agony for the user, distraction for the builders while they do the redesign, and a bad reputation for the product that the best redesign will find hard to live down. Hence, plan to throw one away; you will, anyhow.

Very simply, TTT: Things Take Time.

NMR Challenge: what can a machine deduce from a thesis?

Saturday, November 24th, 2007

One of the ways of extracting chemical structures from the literature is to use the NMR to constrain the possibilities. So, to give you an amusement for the weekend, here are some problems. I have a thesis (which I’m not identifying, but I know the author and he’s happy for the thesis to yield Open Data). I am not sure whether the compounds are in the public literature yet, but they are in the public domain if you know where to find the paper thesis.

Imagine that some future archaeologist had discovered the thesis and only a few scraps had survived. What could be deduced? I’m starting with smallish (hopefully fairly simple) structures and only feeding you some of the information. Depending on what you answer, I’ll either release more or select more complex compounds. All compounds are distinct.

Compound 172

dH (400 MHz, CDCl3): 1.15 (3H, t, J 7.1, OCH2CH3), 1.24 (3H, d, J 5.2, 6-H x 3), 2.84 (1H, qd, J 5.2, 2.0, 5-H), 3.05 (1H, dd, J 7.0, 2.0, 4-H), 4.07 (2H, q, J 7.1, OCH2CH3), 5.99 (1H, dd, J 15.7, 0.6, 2-H), 6.54 (1H, dd, J 15.7, 7.0, 3-H);

Compound 167

dC (100 MHz, CDCl3): 164.5 (CO), 160.3 (C), 107.5 (C), 95.6 (CH), 40.9 (CH2Cl), 24.7 (CH3 x 2);

compound 156

dC (100 MHz, CDCl3): 83.0 (3-C), 79.6 (2-C), 61.8 (5-C), 51.0 (1-C), 25.8 (SiC(CH3)3 x 1), 23.1 (4-C), 18.3 (SiC(CH3)3), -5.1 (Si(CH3) x 2);

Note that the molecular formula, molecular weight, etc. have all been destroyed by the ravages of time.

You can use any method you like, including searching in commercial databases.

What could a machine do with the information above?

How OSCAR interprets text and data

Saturday, November 24th, 2007

I recently posted (Open NMR and OSCAR toolchains) about how OSCAR can extract data from chemical articles, and in particular chemical theses. Wolfgang Robien points out (November 24th, 2007 at 11:03 am):

I think, no, I am absolutely sure, this functionality can be achieved with a few basic UNIX-commands like ‘grep’, ‘cut’, ‘paste’, etc. What you need is the assignment of the signals to specific carbons in your structure, because this (and EXACTLY THIS) is the basis of spectrum prediction and structure verification – before this could be done, you need the structure itself.

Wolfgang is correct that this part of OSCAR is based on regular expressions (which are also used in grep). However, developing regular expressions that work across a variety of styles (journals, theses, etc.) is a lot of work – conservatively it took many months, and the current set of regexes runs to many pages. When I started this work about 7 years ago I thought that chemical papers could be handled by regexes alone, but this is quite infeasible. Even if the language is completely regular (which is possible, but not always observed, in spectral data) we rapidly get a combinatorial explosion. Joe Townsend, Chris Waudby, Vanessa de Sousa and Sam Adams did much of the pioneering work here and showed the limitations. In fact the current OSCAR, which we are refactoring at this moment, consists of several components:

  • natural language parsing techniques (including part of speech tagging and, to come, more sophisticated grammars)
  • identification of chemical names by Bayesian techniques
  • chemical name deconstruction (OPSIN)
  • heuristic chunking of the document
  • lookup in ontologies
  • regular expressions

These can interact in quite complex manners – for example chemical names and formula can be found in the middle of the data. For this reason OSCAR – and any parsing technique – can never be 100% perfect. (We should mention, and I will continue to do so, that parsing PDF – even single column – is a nightmare).
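To give a flavour of the regular-expression layer on its own – and of why it runs out of steam so quickly – here is a toy Java pattern for one common 1H reporting style. It is nothing like OSCAR’s real patterns, which have to cope with dozens of stylistic variants:

    // Toy example of the regex layer: pull shift, integration, multiplicity,
    // coupling constants and comment out of peaks written in one particular
    // style, e.g. "1.15 (3H, t, J 7.1, OCH2CH3)". OSCAR's real patterns are
    // far more general (and far longer).
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PeakRegexDemo {
        private static final Pattern PEAK = Pattern.compile(
            "(\\d+\\.\\d+)\\s*\\((\\d+)H,\\s*([a-z]+)(?:,\\s*J\\s*([\\d., ]+))?(?:,\\s*([^)]+))?\\)");

        public static void main(String[] args) {
            String text = "1.15 (3H, t, J 7.1, OCH2CH3), 2.84 (1H, qd, J 5.2, 2.0, 5-H), "
                    + "5.99 (1H, dd, J 15.7, 0.6, 2-H)";
            Matcher m = PEAK.matcher(text);
            while (m.find()) {
                System.out.println("shift=" + m.group(1) + " nH=" + m.group(2)
                        + " mult=" + m.group(3) + " J=" + m.group(4)
                        + " comment=" + m.group(5));
            }
        }
    }

Every journal, and almost every thesis, breaks a pattern like this in some small way, which is why the natural-language and ontology components above are needed as well.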

Wolfgang is right that we also need the assignment of the carbons to the peakList, and the chemical structure itself. Taking the structure first, we can try to determine it by the following methods:

  • interpreting the chemical name. OPSIN does a good job on simple compounds. I don’t have metrics for the current literature but I think it’s running at ca 20%. That may sound low, but name2structure requires the compilation of many sub-lexicons and sub-grammars (e.g. for multicyclic systems, saccharides, etc.) If there is a need for this, much can be done by community action.
  • interpreting the chemical diagram. Open tools are starting to emerge here and my own dabbling with PDF suggests that perhaps 20-25% can be extracted. The main problems are (a) finding the diagrams and linking them to the serial number and (b) the publishers’ claim that images are copyright.
  • using the crystallography. If a crystal structure is available then the conversion to connection table, including bond orders and hydrogens, is virtually 100%. Again there may be a problem in linking the structure to the formula.
  • reconstruction from spectral data. For simple molecules this should be possible – after all, we set this in exam questions, so a robot should be able to do some of them. The combination of HNMR, CNMR and IR should constrain the possibilities. Whether this is a brute-force approach (generate all structures and remove incompatible ones – sketched below) or one based on logic and rules may depend on the software available and the system.
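Here is a minimal sketch of the brute-force variant only; StructureGenerator and ShiftPredictor are purely hypothetical interfaces (not OSCAR, CML or any existing library), and a real system would also have to handle equivalent carbons, missing peaks and solvent effects:

    // Sketch of the brute-force idea: generate candidate structures, predict
    // their 13C shifts, and discard anything incompatible with the observed
    // peak list. The two interfaces are hypothetical placeholders.
    import java.util.ArrayList;
    import java.util.List;

    interface StructureGenerator {
        List<String> candidates(String molecularFormula); // e.g. SMILES strings
    }

    interface ShiftPredictor {
        double[] predict13C(String smiles);
    }

    public class BruteForceElimination {

        // every observed shift must lie within tolerance of some predicted shift
        static boolean compatible(double[] predicted, double[] observed, double tolPpm) {
            for (double obs : observed) {
                boolean matched = false;
                for (double pre : predicted) {
                    if (Math.abs(obs - pre) <= tolPpm) {
                        matched = true;
                        break;
                    }
                }
                if (!matched) {
                    return false;
                }
            }
            return true;
        }

        static List<String> survivors(StructureGenerator generator, ShiftPredictor predictor,
                                      String formula, double[] observedShifts, double tolPpm) {
            List<String> kept = new ArrayList<String>();
            for (String smiles : generator.candidates(formula)) {
                if (compatible(predictor.predict13C(smiles), observedShifts, tolPpm)) {
                    kept.add(smiles);
                }
            }
            return kept;
        }
    }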

(Of course if the publisher or student makes an InChI available all this is unnecessary).

There are two ways of trying to add the assignment. One is simply by trusting the shifts from the calculation (whether GIAO or HOSE). A problem here is that the authors may – and do – omit peaks or mis-transcribe them. I think I have an approach to manage simple cases here. The other is by trying to interpret the author’s annotations. This is a nice exercise because there is no standard way of reporting it and there is almost certainly no numbering scheme. So we will need to build up databases of numbering schemes and also heuristics of how most authors annotate C13 spectra.

Open means Libre

Friday, November 23rd, 2007

In recent posts (e.g. Open Data – preservation) I have continued to raise the problem that “Open” should not just mean free of charge but also free to use. (Peter Suber calls these price and permission barriers.) So I am glad to see Jo Walsh making the point very strongly (Keeping “Open” Libre) and showing that much of the problem is because the English language cannot easily make the distinction.

We are now in great danger that the same thing is happening to the word “Open”. Strict BBB language requires “Open Access” to remove permission barriers, but as Peter Suber says (“regrettably”) it is starting to be used for all sorts of lesser approaches which reduce the permissions.

This is serious enough for Open Access where people spend huge amounts of energy and stress worrying about what they can and cannot do with published material. It’s even more of a problem for “Open Data” which is only just starting its career.

We are still seeing very little evidence that people in the scholarly publishing community care about keeping data libre. Please prove me wrong and I can include it in my article.

Cameron’s Open proposal

Friday, November 23rd, 2007

In the last two days Cameron Neylon has posted an idea for Open Science and got a lot of interest; see e-science for open science – an EPSRC research network proposal and Follow on to network proposal. The idea is to create networks of excellence in eScience (cyberscholarship), and Open (Notebook) Science would fit the bill perfectly. I’d be delighted to be part of Cameron’s proposal, and this can be taken as a letter of support (unless the powers that be insist on a bit of paper with a logo, which is a pain in the neck).

 

One of the secrets of cyberscholarship is that it flourishes when people do not want to run everything themselves in competition with everyone else. Chemistry often has a very bad culture of fortification rather than collaboration – hence there was so little effective chemistry in the UK eScience program. Southampton has been a notable exception and, for example, we are delighted to be part of the eCrystals program they are running. The last year has shown that, at the grass roots, chemistry is among the leaders in Open Science, and Cameron has detailed this in his proposal.

 

 There are several areas where we’d like to help:

  • making the literature available to machines (OSCAR) and thereby to the community
  • distributed collaborative management of combinatorial chemistry (we can now do this with CML for fragments)
  • shared molecular repositories (again we have a likely collaboration with Soton here)
  • creation of shared ontologies (we collaborated with Soton during the eScience program).

 

(I’ve been spending time coding rather than blogging – blink for a day and you find out how much you’ve missed.)

 

Open NMR and OSCAR toolchains

Thursday, November 22nd, 2007

I am currently refactoring Nick Day’s code that has supported “NMREye” – the collection of Open experiments and Data that he has generated as part of his thesis and which has been trailed on this blog (View post). One intention of this – which got lost in some of the other discussion – is to be able to see whether published results are “correct”. This is, of course, not new to us – students here developed the OSCAR toolkit for checking experimental data (View post). The NMREye work suggests that it should be possible to validate the actual 13C NMR values reported in a scientific experiment.

Nick will take it as a compliment that I am refactoring his code. It was written on a very strict timescale – he had to write the code, and collect and analyse the results, in little more than a month. And his work has a wider applicability within our group, so I am trying to design a library system that supports his ideas while being generally re-usable. This has very useful consequences for CML – the main question, as always, is “does CML support enough chemistry in a simple fashion and can it be coded?”. Here is an example of data from a thesis we are analyzing in the SPECTRaT project:

13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3).

(The “d” is a delta, but I think everything else has been faithfully copied from the Word document.) Note that OSCAR can:

  • understand that this is a 13C spectrum
  • extract the frequency
  • identify the peak values (shifts) and the comments

Try to think how you would explain this to a robot and what additional information you would need. Indeed, try to explain it to a non-chemist – it’s a useful exercise.
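One way to see the problem: after extraction, each peak is just a shift plus whatever free-text comment the author attached, and the “additional information” is everything needed to turn those comments into atom assignments. A minimal holding structure might look like this (an illustration only, not the real CML peakList model or OSCAR’s internal classes):

    // Minimal holding structure for an extracted 13C peak list: a shift plus
    // the author's free-text annotation. Illustrative only - not the real CML
    // peak/peakList model or OSCAR's internal classes.
    import java.util.ArrayList;
    import java.util.List;

    public class ExtractedPeakList {

        static class Peak {
            final double shift;    // ppm
            final String comment;  // author's annotation, e.g. "C-2" or "Ar-ipso-C"
            Peak(double shift, String comment) {
                this.shift = shift;
                this.comment = comment;
            }
        }

        public static void main(String[] args) {
            List<Peak> peaks = new ArrayList<Peak>();
            // a few of the peaks from the thesis extract above
            peaks.add(new Peak(138.4, "Ar-ipso-C"));
            peaks.add(new Peak(136.7, "C-2"));
            peaks.add(new Peak(87.2, "C-3"));
            peaks.add(new Peak(0.5, "Si(CH3)3"));

            for (Peak p : peaks) {
                System.out.println(p.shift + " ppm -> " + p.comment);
            }
        }
    }

Even at this level the gaps are obvious: nothing records the solvent, and “C-2” means nothing without the molecule and the author’s numbering scheme.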

What OSCAR and the other tools cannot do yet is:

  • extract the solvent (this is mentioned elsewhere in the thesis)
  • understand the comments
  • manage the framework symmetry group of the phenyl ring
  • understand peakGroup (the aromatic ring)

So the toolchain has to cover this and much more. However the open source chemistry community (in this case all Blue Obelisk) has provided most of the components. More on this later.