petermr's blog

A Scientist and the Web

 

Chemical Registry Systems and Public Databases

I am at a 2-day (closed) meeting at EBI on Chemical Registry systems and databases (http://www.ebi.ac.uk/industry/Workshops/workshops.html ). I can’t blog this as it’s a closed meeting. However as always I will try to set my own thoughts out. They are still jumbled (I am talking tomorrow so will be looking for input and inspiration from today’s speakers).

The thing that immediately strikes people in bioinformatics is how far behind chemical information systems are in organisation, practice and thinking. Effectively many are walled gardens where information is controlled by an organisation or authority. In bioscience, information is put into national and international databases such as UniProt, PDB, GenBank and so on. These are Open: people can download the whole lot, annotate it, rework and repurpose it, etc.

In contrast the chemical area consists of:

  • Databases of the papers and reports in the chemical literature. The best known are Chemical Abstracts and Beilstein (now transmuted through commercialism into Reaxys (Elsevier)). ChEMBL abstracts the literature for compounds with activity data.
  • Databases of chemicals supplied (in bottles) by manufacturers.
  • Classifications and collections of information about chemicals (Wikipedia, ChEBI, and several others).
  • Collections of compounds and their measured properties (including biological activity). The best known is NCI’s database of about 250,000 compounds. Many pharma companies have their own private ones, though parts of these are starting to appear Openly.
  • Collections of compounds and measured experimental properties. A few are Open (NMRShiftDB, CrystalEye).
  • Hybrid collections (ChemSpider includes structures, names and donated experimental data, some Open, some not).
  • Theoretical calculations on molecules – few are Open (that’s a reason for Quixote)

This workshop explores how (or whether) these can be brought together.

There are several challenges:

  • Socio-political. Some of the largest collectors (ACS, Elsevier) have a history of building closed, walled systems and there is no public hint they intend to change. It is effectively impossible to join together closed and open systems. So the question is whether to compete with them (any competitor would have to be Open). If so, it’s a large task. (It would be easier if they allowed textmining.)
  • Walled gardens and centres-of-the-universe. The great thing about bioinformatics is that the various databases work together to create interoperable identifiers, and more importantly interoperable ontologies. There is a huge and successful biological ontology (Gene Ontology) and effectively little open chemical ontology (ChEBI is the most obvious, but it’s not universal). Most providers of information have their own view of the universe and it’s OK as long as you buy completely into it.
  • Molecules and compounds. Chemistry is described at several levels, but most importantly by the macroscopic (substance) and the microscopic (molecular). For many compounds (especially pharmaceutical) there is quite a good correspondence between molecule and compound. But it’s not perfect and it breaks down. The breakdown can only be represented properly by annotation and ontologies (sometimes computable).
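
To make the molecule/compound point concrete, here is a minimal sketch in Python of how a substance might be annotated with the molecular forms it actually contains. The class names `Substance` and `MolecularForm` and the method `has_single_structure` are hypothetical, invented for illustration; they are not taken from ChEBI or any existing registry system, and the entries are deliberately simplified.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MolecularForm:
    """One microscopic (molecular) species. Hypothetical illustration only."""
    label: str
    note: str = ""  # free-text annotation; a real system would use ontology terms


@dataclass(frozen=True)
class Substance:
    """A macroscopic substance, annotated with the molecular forms it contains."""
    name: str
    forms: Tuple[MolecularForm, ...]

    def has_single_structure(self) -> bool:
        # True when one structure diagram is a fair description of the substance.
        return len(self.forms) == 1


# Simplified: aspirin treated as a single molecular species.
aspirin = Substance("aspirin", (MolecularForm("acetylsalicylic acid"),))

# Simplified: glucose dissolved in water is an equilibrium mixture of forms.
glucose = Substance(
    "D-glucose (aqueous solution)",
    (
        MolecularForm("alpha-D-glucopyranose"),
        MolecularForm("beta-D-glucopyranose"),
        MolecularForm("open-chain aldehyde", note="minor component"),
    ),
)

for substance in (aspirin, glucose):
    print(substance.name, substance.has_single_structure())
```

For aspirin a single structure is a reasonable answer to "what is it?"; for glucose in solution it is not, and a registry that stores exactly one structure per name has to record that somewhere as annotation.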

I shall talk about all three. The last is the most technically challenging and, effectively, chemists need to decide that they should create and adopt ontologies.

But this will be bitterly resisted by vested interests.

I don’t have a clear way forward. I’ll wait to see what comes out of the discussions.

Meanwhile here are some questions. My talk is called “Names, structures and compounds”.

  • What is staurosporine?
  • How did you find out?
  • Do you believe the result?
  • What is glucose?
  • What is its NMR spectrum?
  • Do you believe the result?
  • What is Mauveine?
  • How did you find out?
  • Do you believe the result?

Which is the most important question?
