Chemical Informatics and China; and challenges of language

I was honoured last week to be invited by Professor Xiaoxia Li to speak at the 14th Asian Chemistry Congress in Bangkok and then to visit her group in Beijing. Unfortunately I couldn’t stay longer, but the visit impressed me very much, both with the focus of the group and with their high morale. The group is in the Institute of Process Engineering, Chinese Academy of Sciences, in the Haidian district of Beijing (where there are many universities and scientific institutes). Here are Xiaoxia (left) and some of her group [photos on my phone, so variable quality].

Xiaoxia’s group has a lot in common with us. Several of the group are involved in indexing and information retrieval from the Deep Chemical Web (http://chemport.ipe.ac.cn/IPE-ChINGroup/group-publications-en.html). Most of the web is actually inaccessible to search engines because the information is only exposed through a query interface: “Enter your search term:”. Often you have no idea what to enter, and so Bingle passes it by. Her group is developing heuristics and templates for exploring what is in databases and what information can be extracted. It’s very challenging.
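To give a flavour of why this is hard, here is a minimal sketch of one generic approach: start from a seed term, send it to the query interface, and feed words from the results back in as new queries. It is only an illustration and not the method used by Xiaoxia’s group; the toy `RECORDS` list and the `search` and `probe` functions are all hypothetical.

```python
# Toy stand-in for a database that is reachable only through a search box.
RECORDS = [
    "benzene boiling point",
    "toluene boiling point",
    "toluene density",
    "ethanol density",
]

def search(term):
    """Hypothetical query interface: return the records containing the term."""
    return [r for r in RECORDS if term in r.split()]

def probe(seed_terms, max_queries=20):
    """Explore hidden records by re-using words from results as new queries."""
    queue = list(seed_terms)
    asked = set()   # terms already sent to the interface
    found = set()   # records discovered so far
    while queue and len(asked) < max_queries:
        term = queue.pop(0)
        if term in asked:
            continue
        asked.add(term)
        for record in search(term):
            found.add(record)
            # Every unseen word in a result is a candidate for a later query.
            for word in record.split():
                if word not in asked:
                    queue.append(word)
    return found, asked

found, asked = probe(["benzene"])
print(len(found), "records found using", len(asked), "queries")
```

With a real database the difficult parts are exactly the ones this toy skips: guessing a useful seed, and recognising which parts of a result page are worth re-using as search terms.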

We also talked about software for processing chemical names and natural language wholly or partially in Chinese. We tried an experiment with OPSIN (name to structure). Daniel Lowe has explored how Chinese (and other-language) representations of IUPAC names might be processed by OPSIN. He is moderately confident that the core of OPSIN is suitable, and that it is a question of preprocessing and vocabulary. Here is an example taken from http://zh.wikipedia.org/wiki/IUPAC%E6%9C%89%E6%9C%BA%E7%89%A9%E5%91%BD%E5%90%8D%E6%B3%95_%28A%E9%83%A8%29 – a translation of the IUPAC nomenclature rules into Chinese.


2,7,8-三甲基癸
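To illustrate what “preprocessing and vocabulary” could mean for a name like this, here is a minimal sketch that substitutes Chinese nomenclature tokens with English fragments before the name is handed to a name-to-structure tool. The vocabulary table and the function name are hypothetical, and this is not how OPSIN itself is implemented. The example uses the full form 2,7,8-三甲基癸烷, where the suffix 烷 (“ane”) is added here as an assumption.

```python
# Hypothetical vocabulary: Chinese nomenclature tokens -> English fragments.
VOCAB = {
    "甲基": "methyl",
    "三": "tri",
    "癸": "dec",
    "烷": "ane",
}

def to_english_name(chinese_name: str) -> str:
    """Replace known Chinese tokens, longest first; pass everything else through."""
    tokens = sorted(VOCAB, key=len, reverse=True)
    out = []
    i = 0
    while i < len(chinese_name):
        for tok in tokens:
            if chinese_name.startswith(tok, i):
                out.append(VOCAB[tok])
                i += len(tok)
                break
        else:
            # Locants, commas and hyphens are copied unchanged.
            out.append(chinese_name[i])
            i += 1
    return "".join(out)

print(to_english_name("2,7,8-三甲基癸烷"))  # prints 2,7,8-trimethyldecane
```

Plain token substitution works for this simple alkane, but a real preprocessor would probably also need to handle names whose parts come in a different order in Chinese.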

When Daniel reads the Chinese name above from a file into OPSIN, it is interpreted as 2,7,8-trimethyldecane. We tried to reproduce this on a Chinese command line but ran into encoding problems (encoding is one of the commonest problems). However, I am sure it is soluble.
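I don’t know exactly which encodings were involved in our case, but the general shape of the problem is easy to show. The sketch below assumes the file is UTF-8 and the command line uses GBK (both are assumptions): the same name becomes two different byte sequences, and reading one as if it were the other either fails or gives garbage. The helper `decode_name` is hypothetical and simply tries the strict decoder first.

```python
name = "2,7,8-三甲基癸"

# The same text as two different byte sequences.
utf8_bytes = name.encode("utf-8")
gbk_bytes = name.encode("gbk")

def decode_name(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# UTF-8 is strict, so GBK bytes are rejected and fall through to GB18030
# (a superset of GBK).
for raw in (utf8_bytes, gbk_bytes):
    text, enc = decode_name(raw)
    print(enc, text == name)

# Decoding with the wrong codec and ignoring errors silently corrupts the name.
wrong = utf8_bytes.decode("gbk", errors="replace")
print(wrong == name)
```

Trying UTF-8 first is only a heuristic; the robust fix is to state the encoding explicitly wherever the name is read or written.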

 

I was also given a tour of the work in the Institute. There is a lot of exciting work on high-performance computing (using GPGPUs, http://en.wikipedia.org/wiki/GPGPU) and the institute has, I think, the 33rd most powerful machine in the world. Certainly the scale and ambition of investment in science were clear. Among the demonstrations I saw were the simulation of a fluidised bed reactor (flow, temperature) and the molecular dynamics of a complete influenza H1N1 virus (neuraminidase, haemagglutinin, capsids, RNA, everything). We have come a long way since I worked on influenza 20 years ago.

 

I was also very very well entertained – driven everywhere – and shown many of the sights of Beijing. Here are two with colleagues from Xiaoxia’s group:

And (in the Forbidden City)

I was very well exercised by the end!

It was great to talk with a group with mutual interests – the discovery and re-use of information on the web. I gave them an overview of our work, our recent manuscripts, and left a fairly complete copy of OSCAR/OPSIN and related software. Some – at least – should be fairly easily adaptable.

And by happy chance “YY” (Prof Yong Zhang, left) was free and invited me to dinner with his group at Tsinghua University. YY spent 3 years in our group at Cambridge and was responsible for the early development of the World Wide Molecular Matrix (publication in press http://www.dspace.cam.ac.uk/handle/1810/238387 ).

Again I was very well looked after.

It is always great to find other groups who interact synergistically. Chemical informatics is not always glamorous, but it’s important and will increase in value as the barriers to information discovery start to disappear. Many many thanks.
