Repositories or Lists of Open Molecules

I am looking for lists (or repositories) of small molecules with connection tables (or machine-parsable molecular structures) which are Open. By Open I mean that anyone can, in principle, download, copy or clone part or all of the site, re-use the information and redistribute it without reference to the original site. At present I am aware of:

  • PubChem (10 million+, superset of many Open datasets including NCI. I use this term to subsume everything at
  • ChEBI (> 25 000 terms collected at EBI, not all with connection tables)
  • MSD (ligands in Protein structures, collected at EBI > 5000)
  • WWMM (250,000 calculated structures from the NCI database). Reposited in DSpace,
  • Crystallography Open Database: crystal structures collected from the literature or donated. Soon to be complemented with CrystalEye. This should give nearly 100,000 crystal structures.
  • The Blue Obelisk Data Repository (BODR). A collection of critical information assembled by BO volunteers, primarily as reference data for (Open) software (it includes non-molecular material such as elemental properties). BODR is widely distributed on Gnome and other Open Source distros.

I’ve almost certainly missed some, so please let us know. A candidate collection should contain a substantial amount of data (molecules x attributes) and ideally should provide information complementary to the list above. The connection tables (InChI, CML, SMILES, MOL) or 3D coordinates of the molecules should be easily accessible (i.e. not requiring resolving identifiers).
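To make "machine-parsable" concrete, here is a minimal sketch of reading a connection table from an MDL MOL (V2000) block using only the Python standard library. The ethanol record (heavy atoms only) is hand-written for illustration and does not come from any of the repositories above. The function name `parse_molblock` is hypothetical, and the fixed column positions are an assumption based on the usual V2000 layout. A real project would more likely use an Open toolkit than hand-rolled parsing.

```python
# Hand-written illustrative record: heavy atoms of ethanol in MOL V2000 layout.
# This is NOT data taken from any repository mentioned in the post.
ETHANOL_MOL = "\n".join([
    "ethanol",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  2  3  1  0  0  0  0",
    "M  END",
])


def parse_molblock(text):
    """Return (atoms, bonds) from a V2000 MOL block.

    atoms: list of (element_symbol, x, y, z)
    bonds: list of (first_atom_index, second_atom_index, bond_order),
           with 1-based atom indices as stored in the file.
    Assumes the standard fixed-width V2000 columns.
    """
    lines = text.splitlines()
    counts = lines[3]                 # fourth line holds the counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    return atoms, bonds


atoms, bonds = parse_molblock(ETHANOL_MOL)
for symbol, x, y, z in atoms:
    print(symbol, x, y, z)
for a, b, order in bonds:
    print(f"bond {a}-{b} order {order}")
```

The point of the sketch is that the atoms and bonds are present in the record itself, so nothing has to be looked up through an identifier service before the structure can be re-used.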
Note that there are many useful sites which offer free queries on databases (e.g. NIST Webbook) but these do not fit the criterion of Openness. The test is whether they could be downloaded in toto, cloned and redistributed (with attribution, of course).
If we don’t already have it, perhaps there should be a page on the Blue Obelisk Wiki.

6 Responses to Repositories or Lists of Open Molecules

  1. pm286 says:

    (1) AFAIK ZINC is not Open – it requires explicit permission and cannot be redistributed

  2. Graham says:
    Prof David Wishart in Canada runs DrunBank which is Open.
    PS, thanks for alerting me to PubChem which I was not aware of.

  3. Hi, Peter. This is a great list and it fits very nicely with something we’ve been working on recently and are just about to ‘beta’-launch: the Comprehensive Knowledge Archive Network (CKAN). CKAN is a registry of open knowledge packages (and a few closed ones). It is a central place to search for open knowledge resources as well as to register one’s own. As such it would be the perfect place for the kind of information you list above.
    I’d also like to suggest hyperlinking ‘Open’ (as in ‘which are Open’) to the Open Knowledge Definition at (or now

  4. pm286 says:

    (3) The Drugbank (sic?) rubric reads:
    > DrugBank is offered as a freely available publicly available resource. Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (DrugBank) and original publication (see below). We ask that users who download significant portions of the database cite the DrugBank paper in any resulting publications.
    This looks exciting. Although not completely OPEN in the BOAI sense (which allows commercial re-use), this is enough for scientific innovation and discovery. If it is not already done, the entries should be linked to PubChem.
    As you haven’t come across PubChem, I will blog about it.

  5. Have a look at MolBank:
    Molbank (ISSN 1422-8599, CODEN: MOLBAI) is fully open-access and publishes one-compound-per-paper short notes and communications on synthetic compounds and natural products. An MDL Mol file is available for each compound. The journal is fully HTML and downloadable with wget over HTTP.

