Category Archives: data

Open means Libre

In recent posts (e.g. Open Data - preservation) I have continued to raise the problem that "Open" should not just mean free of charge but also free to use (Peter Suber calls these price and permission barriers). So I am glad to see Jo Walsh making the point very strongly (Keeping “Open” Libre) and showing that much of the problem arises because the English language cannot easily make the distinction.

We are now in great danger that the same thing is happening to the word "Open". Strict BBB language requires "Open Access" to remove permission barriers, but as Peter Suber says ("regrettably") it is starting to be used for all sorts of lesser approaches which reduce the permissions.

This is serious enough for Open Access where people spend huge amounts of energy and stress worrying about what they can and cannot do with published material. It's even more of a problem for "Open Data" which is only just starting its career.

We are still seeing very little evidence that people in the scholarly publishing community care about keeping data libre. Please prove me wrong and I can include it in my article.

Cameron's Open proposal

In the last two days Cameron Neylon has posted an idea for Open Science and got a lot of interest; see: e-science for open science - an EPSRC research network proposal and Follow on to network proposal. The idea is to create networks of excellence in eScience (cyberscholarship), and Open (Notebook) Science would fit the bill perfectly. I'd be delighted to be part of Cameron's proposal and this can be taken as a letter of support (unless the powers that be insist on a bit of paper with a logo, which is a pain in the neck).


One of the secrets of cyberscholarship is that it flourishes when people do not want to run everything themselves in competition with everyone else. Chemistry often has a very bad culture of fortification rather than collaboration - hence there was so little effective chemistry in the UK eScience program. Southampton has been a notable exception and, for example, we are delighted to be part of the eCrystals program they are running. The last year has shown that at the grass roots chemistry is among the leaders in Open Science, and Cameron has detailed this in his proposal.


There are several areas where we'd like to help:

  • making the literature available to machines (OSCAR) and thereby to the community
  • distributed collaborative management of combinatorial chemistry (we can now do this with CML for fragments)
  • shared molecular repositories (again we have a likely collaboration with Soton here)
  • creation of shared ontologies (we collaborated with Soton during the eScience program).


(I've been spending time coding rather than blogging - blink for a day and you find out what you've missed.)


Open NMR and OSCAR toolchains

I am currently refactoring Nick Day's code that has supported "NMREye" - the collection of Open experiments and data that he has generated as part of his thesis and that has been trailed on this blog (View post). One intention of this - which got lost in some of the other discussion - is to be able to see whether published results are "correct". This is, of course, not new to us - students here developed the OSCAR toolkit for checking experimental data (View post). The NMREye work suggests that it should be possible to validate the actual 13C NMR values reported in a scientific experiment.

Nick will take it as a compliment that I am refactoring his code. It was written on a very strict timescale - he had to write the code, collect and analyse the results in little more than a month. And his work has a wider applicability within our group. So I am trying to design a library system that supports his ideas while being generally re-usable. And this has very useful consequences for CML - the main question, as always, is "does CML support enough chemistry in a simple fashion and can it be coded?". Here's an example of data from a thesis we are analyzing in the SPECTRaT project:

13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3).

(the "d" is a delta but I think everything has been faithfully copied from the Word document. Note that OSCAR can :

  • understand that this is a 13C spectrum
  • extract the frequency
  • identify the peak values (shifts) and identify the comments

Try to think how you would explain this to a robot and what additional information you would need. Indeed, try to explain it to a non-chemist - it's a useful exercise.
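
To give a feel for what the robot faces, here is a minimal sketch (in Python, and emphatically not OSCAR's actual code) of how the peak listing above might be tokenized. The regular expressions and the handling of unassigned shifts are illustrative assumptions; real thesis prose is far messier:

```python
import re

# A fragment of the 13C listing above (regular hyphens used for simplicity).
TEXT = ('13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), '
        '128.3, 127.6, 127.5 (Ar-ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3)')

# The nucleus is the leading isotope+element token, e.g. "13C".
nucleus = re.match(r'(\d+[A-Z][a-z]?)', TEXT).group(1)

# The spectrometer frequency appears as "(150 MHz)".
freq = re.search(r'\((\d+(?:\.\d+)?)\s*MHz\)', TEXT).group(1)

# Each peak is a decimal shift, optionally followed by a parenthesised comment;
# bare shifts (128.3, 127.6) implicitly share the comment of the group after them.
peaks = re.findall(r'(\d+\.\d+)(?:\s*\(([^)]*)\))?', TEXT)

print(nucleus, freq, 'MHz')
for shift, comment in peaks:
    print(shift, '->', comment or '(no explicit assignment)')
```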

What OSCAR and the other tools cannot do yet is:

  • extract the solvent (this is mentioned elsewhere in the thesis)
  • understand the comments
  • manage the framework symmetry group of the phenyl ring
  • understand peakGroup (the aromatic ring)

So the toolchain has to cover this and much more. However, the open source chemistry community (in this case all Blue Obelisk) has provided most of the components. More on this later.

Open Data - preservation

An interchange with a correspondent...

You [PMR] said in your Blog:

It is critical to distinguish between “Free” and Open. “Free”, in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The materials database is in this category - and to be fair it didn’t call itself Open.

The Open Knowledge Definition says:

1. Access The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.

================================

[Correspondent] This does not seem to go far enough in that if I have all good intentions and post material on a web server and then drop dead tomorrow the info will disappear pretty soon after that. Possibly lost forever!

The distinction you make between "free" and "Open" suggests that Open means there is some permanency to the arrangement of having it available? Am I interpreting this correctly? How could this be monitored or managed?

PMR: You are absolutely right. I think the problem of preservation has not been addressed. Indeed, until I started thinking about it I hadn't realised how relatively simple the preservation of text was and how difficult the preservation of data was.

First: who can one trust? It's currently easy to deposit material with anyone - Google, Amazon, whoever - and to trust spinning media to be replicated. But it's very risky for long-term preservation. There are many bodies working on this, and the simple message is that it's difficult and depends on what we want to preserve and for how long. There are many levels - the bitstream, the semantic content, the ontological context, etc. Places like the UK's Digital Curation Centre understand and work on exactly this.

[Correspondent] is worried - like me - about the archival of chemical data within the laboratory. What should be done? My personal answers are:

  1. If the data is valuable enough an international data centre will store it. Biology is strong here and bioinformatics centres have an effective commitment to archival and also work on preservation. Chemistry relies on commercial and quasi-commercial organisations which generally accomplish far less.
  2. My own inclination is towards global domain-specific repositories where the data are difficult and the volume merits it; national ones where the problem is understood but needs supporting (e.g. in mainstream chemistry) and where possible departmental ones (e.g. for crystallography, spectroscopy, computational chemistry.)
  3. I am not a strong supporter of terabytes of data being dumped wholesale into institutional repositories without good metadata and analysis software. Maybe future generations will welcome these hidden treasures and will have super-intelligent software.

Open Data - 2

I posted recently about the problems of describing Open Data - how strict should we be about boundaries? Peter Suber has replied (What counts as open data?), and Klaus Graf has given an important emphasis on archiving in a comment to my post. Also Peter has blogged another example of an "Open Access" database: Jan Christian Bryne and eight co-authors, JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update, Nucleic Acids Research, November 2007.

Abstract: JASPAR is a popular open-access database for matrix models describing DNA-binding preferences for transcription factors and other DNA patterns. [...] JASPAR is available [here].

In the last 2 months I have been thinking fairly furiously about Open Data and realising it can be considerably more complex than Open Access scholarly publishing. I'm certainly clear that the borderlines may have to be fuzzy, but not infinitely. Peter writes:

  • Just a quick note on my offline talk with PMR about Material Properties, which I called "OA" in a blog post.  Neither of us could find its licensing terms, so we couldn't tell just how open it was.  I needed (I still need, we all need) a generic term for such resources when we do know they are free of charge but don't know any details about their licensing terms.  For better or worse "OA" has become that generic term, even while it has a narrower, earlier, more technical and more proper sense through the BBB definition.  I readily and often acknowledge that I use the term "OA" both ways --widely and narrowly, as a generic term and as the technical term for the BBB level of openness.  I also readily and often acknowledge that this ambiguity causes problems --see for example the Poynder interview at pp. 30-31.  I can add that I resisted this dual sense as long as I could and only acquiesced when it became an undeniable fact of actual usage.  For perspective, I've also argued that this kind of semantic spread is not a special calamity for our technical term, but affects most technical terms in wide use and needn't prevent precise communication.
  • One tempting solution is to come up with a new generic term so that "OA" can be limited to its strict BBB sense.  That's desirable but difficult, since coining terms is not the same thing as assuring their use, let alone their intended use.  BTW, "free" would not make a better generic term, at least not yet, since it suggests to many people that a work is merely free of charge and does not also remove permission barriers.  A good generic term would cover all kinds of free online content, including those that are BBB OA.
  • I share PMR's hope that the term "open data" can stay fairly well tethered to its technical definition.  But the data world needs a generic term for the same reason that the publication world does.  If we had a good generic term for free online content, perhaps it could allow "open data" to remain univocal.

PMR: I agree with the sentiments here - but suspect we are both unclear about the way forward. I'm reluctant to use OA for databases, since "OA" is already having to work pretty hard to manage the differences in practice and philosophy in the scholarly publishing of articles (data is rarely included). "Open Source" has more or less got its act together, although there is both tension within the community between the Free/Open and viral schools, and also abuse of the term "Open Source". I am keen to avoid the abuse of "Open Data" while it is still struggling to play a role.

In some disciplines "free" implies "Open". In biosciences there is an unwritten agreement that freely available data is Open. Sequences, structures, genes are made available usually without formal copyright or formal licensing. There are thousands of databases with the same attitude as JASPAR (above) - our own MACiE database is similar. In all these it's commonplace to download the whole data - for example we state "Each MACiE entry in the database can be downloaded separately as a CML file. This option is available from the left side panel, underneath the reaction step lists." The bioscience community has a tradition of sharing and re-use which doesn't need to be spelt out. Admittedly there is potential for confusion, some databases do restrict their usage, and some effectively have a no-commercial-use clause. But there is a strong tradition of meetings where the principles are reinforced and where collaboration is made and traded. Generally it is expected that data will be re-used.

In chemistry, in contrast, the tradition is of gathering information and reselling it. There are no public Chemoinformatics Centres in the same way as Bioinformatics Centres - in fact the Bioinformatics Centres are steadily taking over the biological parts of chemistry. So by default it has to be assumed that any database on the web, however freely accessible at a point in time, has no guaranteed permanency of access. There are often explicit barriers to re-use. So it's important to have clear guidelines and clear labels - otherwise "Open Data" or "Open Access" is meaningless and acts only as a way of marketing warm feelings.

This is made more difficult because data are undervalued in the peer-esteem economy. A "paper" - however poorly read, however bad, even to the point of retraction - is part of the sacrosanct "scholarly record". Libraries and curation centres have a duty to capture this. In contrast there is nothing like the same obligation on any organisation to capture public datasets. Admittedly it's harder, but that's not the real reason.

So I reiterate some guidelines. I'm still working these out and would welcome comment. (I don't feel we should stray too far from the Open Knowledge Foundation guidelines.) As a start I would suggest the following:

  • There must be some mechanism whereby the community could, if it wished, capture the resource for public archival without permission. This could be as simple as spidering the site, or a relational dump, or a massive file, or an iterator.
  • There must be no permission barriers to re-use including commercial re-use.
  • The data must either be the whole work (at a given point in time) or be clearly bounded (i.e. there should be no hidden data that the world cannot get access to in the same way).
  • There should be no time limits on access and re-use.

Data is now acquiring the same power as software did two decades ago. It won't be surprising if there are tensions - commercial, political, social. We need to identify and plan for them.

Liz Lyon on Open Science

We've worked closely with Liz Lyon for some time - an advisory role on SPECTRa, and now we are partners in the eCrystals Program. She's posted an impressive set of slides on hundreds of things happening in the data- and knowledge-revolution - the eScience of cyberscholarship. There's a lot on chemistry - Drexel, Soton - and we're pleased to get honourable mentions for some of our projects: Open Science and the Research Library: Roles, Challenges and Opportunities? The keynote address (slide presentation) at the Directors' Meeting of the Association of Research Libraries in Cambridge, Massachusetts, November 2007.

One example is the archiving of the web - Liz gives this blog as an example - http://web.archive.org/ "The Wayback Machine". It doesn't get everything by a long way, but it's a start.

Open Data

There are several reasons why I'm currently thinking about Open Data (see Open Data at WP for some collected wisdom and links). We're currently collecting more chemistry data that we intend to make Openly available (see the CrystalEye knowledge base as an example). I've been asked to write an article for Serials Review (Elsevier) on the subject and am putting my ideas in order. ChemSpider announced Something New and Exciting Coming Soon… which contained an image with "Open Data" (no details). And Peter Suber announced New OA database on material properties, originally from the Chemistry Central blog, which announced "The database is yet another of the free, on-line chemical services to have emerged in recent years." The use of "OA" was, I think, Peter's.

I didn't agree with Peter's description of Material Properties as an "Open Access" database, and I'm worried that we shall see the same imprecision in the use of "Open Data". So I wrote to Peter and am amplifying the arguments here. As a baseline, Peter and I are both on the advisory board of the Open Knowledge Foundation (initiated by Rufus Pollock), which has developed the Open Knowledge Definition. I think it's important to take this as a starting point for this analysis, though there are aspects of databases which make the system much more complex.

It's good that the principle is simple to summarise:

In the simplest form the definition can be summed up in the statement that "A piece of knowledge is open if you are free to use, reuse, and redistribute it". For details read the latest version of the full definition (with explanatory annotations).

I'm going to look at the most important clauses for science/chemistry (emphases are mine) - I have omitted other clauses but I adhere to them as well:

1. Access The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.


3. Reuse

The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.


4. Absence of Technological Restriction

The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, i.e. one whose specification is publicly and freely available ...


5. Attribution

The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work.


6. Integrity

The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.


8. No Discrimination Against Fields of Endeavor

The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.


PMR: There are significantly different types of Open Data in science. There is raw data produced by a scientific experiment, increasingly published alongside the "fulltext" of publications or theses. There is the curated, critical snapshot of a given experiment, perhaps images from a telescope or satellite. In this post I discuss the problems of "databases" or "knowledgebases", which are both fragmented and dynamic (e.g. CrystalEye and the Materials database).

It is critical to distinguish between "Free" and Open. "Free", in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The materials database is in this category - and to be fair it didn't call itself Open.

A major problem, which we have discussed in some detail on this blog over CrystalEye, is that many databases are both hypermedia and dynamic. They are spread over many components and they change with time. Both CrystalEye and Materials fall into that category. It is technically difficult to make them easily available and there is no agreed mechanism for doing this.

The work must be available as a whole. I agree this is critical, but it's often difficult. Leaving aside the dynamic aspect, there are a few possibilities:

  • Bundle the data into a single "file" or a set of files. This has worked historically for the Protein Databank. The difficulties are that there is not usually a single simple object to bundle, and that it requires considerable maintenance.
  • Provide an iterator over the data. This could either be a generic tool such as wget (which recurses over a hyperdocument) or a bespoke tool which is guaranteed to iterate over the data. This is the approach we have adopted (Jim Downing wrote a tool specifically to help the community download the data and has made it available under Open Source); a minimal sketch of the idea follows this list.
  • Collaborate with a data provider (e.g. a Bioinformatics institute). This is a good approach if your community supports the idea of Open Data, but chemistry has yet to see the light.
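
As an illustration of the iterator idea, here is a minimal sketch that walks an index page and yields each linked data file. The URL and the .cml link pattern are hypothetical, not CrystalEye's actual layout, and Jim Downing's tool, not this sketch, is the supported route:

```python
from urllib.request import urlopen
from urllib.parse import urljoin
import re

def iter_data_files(index_url, pattern=r'href="([^"]+\.cml)"'):
    """Yield absolute URLs of every data file linked from an index page."""
    html = urlopen(index_url).read().decode('utf-8', errors='replace')
    for relative in re.findall(pattern, html):
        # Resolve relative links against the index page's own URL.
        yield urljoin(index_url, relative)

# Hypothetical usage - the community could capture the whole resource like this:
# for url in iter_data_files('http://example.org/crystaleye/index.html'):
#     print(url)
```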



A few other comments. "Convenient and modifiable form" and "no technological obstacles" cannot be defined precisely, but I would argue that if the Open Data provider has published their formats, and if there is Open Source code that will read the data, that is sufficient. Note that for many files ASCII is sufficient if the metadata is well provided. There is no requirement for the Open Data provider to provide installation help for downloaders if the instructions are minimally clear.

Open Access for scholarly publications implicitly guarantees certain aspects which are not guaranteed by default for Open Data:

  • The whole of the work is available. This is almost always trivial for articles (but as we have seen is a problem for some sorts of data).
  • There will be continued access to the work. This is based on (Gold) the permanence of Open Access publishers and the copying to inter/national repositories and (Green) the permanence of institutional repositories and in some cases inter/national repositories (self-archival on personal webpages does not guarantee permanent access). Repositories in general do not archive data.
  • The work can be re-used. This is clear if a licence is embedded in the work or provided by the repository. Note that many repositories do not make the licence position clear.
  • The work is in a convenient and modifiable form. Trivially readable for sighted humans. The rest is not always true.

Almost all these are major problems for Open Data.

So I very much hope that we can use Open Data in a strict form which adheres to the Open Knowledge Foundation guidelines. This is a good time to cement or challenge them. But it would be a serious problem if we allow "Freely accessible" to become synonymous with "Open Data".

Open NMR: update

I am very grateful to hko and Wolfgang Robien for their continued analysis of the results of Nick Day's automated calculation of NMR chemical shifts, using the GIAO approach (parameterized by Henry Rzepa). The discussion has shown that some structures are "wrong" and rather more are misassigned.

Wolfgang Robien Says:
November 11th, 2007 at 10:01 am

we need ‘CORRECT’ data - many assignments of the early 70’s are absolutely correct and useful for comparison [...]
As a consequence of your QM-calculations 10 assignment corrections and 1 structure revision within a few hundred compounds have been performed by ‘hko’ (see postings above) - this corresponds to an error rate of approx. 5%! [PMR: In the data set we extracted from NMRShiftDB]. [... discussion of how such errors are detected snipped...]

PMR: Part of the exercise that Nick Day has undertaken was to give an objective analysis of the errors in the GIAO method. The intention was to select a data set objectively. It is extremely difficult to select a representative data set by any means - every collection is made with some purpose in mind. We assumed that NMRShiftDB was "roughly representative" of 13C NMR (and so far this hasn't been an issue). It could be argued that it may not have many organometallics, minerals, proteins, etc., and I suspect that our discourse is mainly about "small organic molecules". But I don't know. It may certainly not be representative of the scope of GIAO or HOSE codes. Again I don't know. Having made the choice of data set, the algorithm for selecting the test data was objective and Nick has stated it (< 20 heavy atoms, no element heavier than Cl except Br, no adjacent acyclic bonds). There may have been odd errors in implementing this (we got 2-3 compounds with adjacent acyclic bonds) but it was largely correct. And it could be re-run to remove these. We stress again that we did not know how many structures we would get and whether they would behave well in the GIAO method. In fact over 25% failed to complete the calculation. (We are continuing to find this - the atom count is not a perfect indication of how long a calculation will take, which can vary by nearly a factor of 10.)
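
For illustration, here is a minimal sketch of that kind of objective selection filter. It uses RDKit (an assumption - Nick's code did not necessarily use it) and omits the "adjacent acyclic bonds" criterion, which is harder to state precisely:

```python
from rdkit import Chem

# Elements up to Cl (Z = 1..17) are allowed, plus Br (Z = 35).
ALLOWED_Z = set(range(1, 18)) | {35}

def passes_filter(smiles: str, max_heavy: int = 20) -> bool:
    """Return True if the molecule has < max_heavy heavy atoms and
    contains no element heavier than Cl other than Br."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False                      # unparseable structure: reject
    if mol.GetNumHeavyAtoms() >= max_heavy:
        return False
    return all(atom.GetAtomicNum() in ALLOWED_Z for atom in mol.GetAtoms())

print(passes_filter('CCO'))       # True: ethanol, 3 heavy atoms
print(passes_filter('CC[Se]C'))   # False: Se is heavier than Cl and not Br
```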

We would not claim that the remaining ca. 250 compounds were "representative". There are no organometallics, no electron-deficient compounds, no overcrowded compounds, no major ring currents, etc. (all of which are areas where we might expect GIAO to do better than some empirical methods). In fact the compounds are generally ones that we would expect connection-table-based methods to score well on, as there are few unusual groups (so the methods are well trained) and no examples where the connection table cannot describe the molecule well (e.g. Li4Me4, Fe(Cp)2, etc.).

Our current conclusion is that the variance in the experimental data is sufficiently large (even after removal of misassignments) to hide errors in the GIAO method. This appears to give good agreement, with an RMS of ca. 2 ppm (but again we stress that the data set is not necessarily representative). If the Br/Cl correction had not been anticipated it would have been clearly visible, and the exercise would have revealed it as a new effect. It is certainly possible that there are other undetected effects (especially for unusual chemistry). But for common compounds I think we can claim that the GIAO method is a useful prediction tool (a toy illustration of the RMS measure follows the list below). It should be particularly useful where connection tables break down, and here are some systems I'd like to see it exposed to:

  • Li4Me4
  • Fe(Cp)2 - although Fe is difficult to calculate well.
  • p-cyclophane (C1c(cc2)ccc2CCc(cc3)ccc3C1)
  • 18-annulene
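
To make the RMS figure concrete, here is a minimal sketch of the comparison; the shift values are hypothetical stand-ins, not Nick's actual data:

```python
import math

# Hypothetical observed vs. GIAO-calculated 13C shifts (ppm) for one molecule;
# the real study compared ca. 250 structures drawn from NMRShiftDB.
observed   = [138.4, 136.7, 136.1, 128.3, 87.2, 80.1]
calculated = [137.9, 138.1, 135.0, 129.0, 86.0, 81.5]

# Root-mean-square deviation over all assigned peaks.
rms = math.sqrt(sum((o - c) ** 2 for o, c in zip(observed, calculated))
                / len(observed))
print(f'RMS deviation: {rms:.2f} ppm')  # ca. 1-2 ppm is the level of agreement discussed above
```
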
PMR: So what I would like is a representative test data set that could be used for the GIAO method. The necessary criteria are:

  • It is agreed what the chemical scope is. I think we would all exclude minerals, probably all solid state, proteins, macromolecules (there are other communities which do that). But I think we should include a wide chemical range if possible.
  • The data set is prepared by one or more NMR-expert groups that have no particular interest in promoting one method over another. That rules out Henry, Wolfgang, ACDLabs, and probably NMRShiftDB.
  • The data set should provide experimental chemical shifts, and the experts should have agreed the assignments by whatever methods are currently appropriate - these could include a group opinion. The assignments should NOT have been based on any of the potential competitive methodologies.

For a competition there would be stronger requirements - it is essential that it is seen to be fair, as reputation and commercial success might hang on the result.

So I make my request again. Please can anyone give me some data that we can use in an Open experiment to test (and if necessary validate/invalidate) the GIAO method? At this stage we'd be happy to take material from anyone's collections, but it would have to be Open so that other groups have the chance to comment.

I hope someone can volunteer. If not we may have to resort to (machine) extraction of data from the current literature. Our experience with crystallography suggests that the reporting and quality of analytical data in general have improved over the last 10 years.

Open science: competitions increase the quality of scientific prediction

In previous posts and comments we have been discussing the value of certain predictive methods for NMR chemical shifts. In the next post I am going to make a proposal for an objective process which I hope will help take us forward. Chemistry (chemoinformatics) is often not good at providing objective reports of predictive quality - the data, algorithms, statistics and analysis are often not formally redistributable and so cannot be easily checked.

In preparation for the suggestion, here are some examples of how competitions enhance the quality of prediction:

CASP

Every 2 years (CASP1 (1994) | CASP2 (1996) | CASP3 (1998) | CASP4 (2000) | CASP5 (2002) | CASP6 (2004) | CASP7 (2006)) the Protein Structure Prediction Center runs a competition:

"Our goal is to help advance the methods of identifying protein structure from sequence. The Center has been organized to provide the means of objective testing of these methods via the process of blind prediction. In addition to support of the CASP meetings our goal is to promote an evaluation of prediction methods on a continuing basis."

There are independent CASP assessors who give their time on an impartial basis to oversee the procedure and judge the results of the predictions. Some more details:

"For the experiment to succeed, it is essential that we obtain the help of the experimental community. As in previous CASPs, we will invite protein crystallographers and NMR spectroscopists to provide details of structures they expect to have made public before September 1, 2006. A target submission form will be available at this web site in mid-April. Prediction targets will be made available through this web site. All targets will be assigned an expiry date, and predictions must be received and accepted before that expiration date."

As in previous CASPs, independent assessors will evaluate the predictions. Assessors will be provided with the results of numerical evaluation of the predictions, and will judge the results primarily on that basis. They will be asked to focus particularly on the effectiveness of different methods. Numerical evaluation criteria will as far as possible be similar to those used in previous CASPs, although the assessors may be permitted to introduce some additional ones.

There are four assessors, representing expertise in template-based modeling, template-free modeling, high accuracy modeling and function prediction. In accordance with CASP policy, assessors are not directly involved in the organization of the experiment, nor can they take part in the experiment as predictors. Predictors must not contact assessors directly with queries, but rather these should be sent to the casp@predictioncenter.org email address.

and they follow up with a meeting.

Text Retrieval Conference

The TREC conference series has produced a series of test collections. Each of these collections consists of a set of documents, a set of topics (questions), and a corresponding set of relevance judgments (right answers). Different parts of the collections are available from different places as described on the data page (http://trec.nist.gov/data.html). In brief, the topics and relevance judgements are available at http://trec.nist.gov/data.html, and the documents are available from either the LDC (Tipster Disks 1-3) or NIST (TREC Disks 4-5); information on collections other than English can be found at http://trec.nist.gov/data.html.

A Third Blind Test of Crystal Structure Prediction

In May 2004 the CCDC hosted a meeting to discuss the results of the third blind test of Crystal Structure Prediction (CSP). The challenge of the competition was to predict the experimentally observed crystal structures of the 4 small organic molecules shown in figure 1, given information only on the molecular diagram, the crystallisation conditions and the fact that Z' would be no greater than 2. The results of the competition are presented, including an analysis of each participant's extended list of candidate structures. A computer program, COMPACK, has been developed to identify crystal structure similarity. This program is used to identify at what positions the observed structures appear in the extended lists. Also, predicted structures obtained from the various participants are compared to determine whether the different approaches and methodologies attempted produce similar lists of structures. The hydrogen bond motifs predicted for molecule I are also analysed and an assessment made as to the most commonly predicted motifs and a comparison made to common motifs observed for similar molecules found in the Cambridge Structural Database.

PMR: These have a range of objective (measured) and subjective (expert opinion) criteria for the "right" answer. The key components are:

  • the mechanism and evaluation must be independent of the competitors
  • all competitors must have an equal chance
  • the answers must be carefully created and hidden before the prediction
  • there is a closing date

It is essential that the data are Open and seen to be a reasonable challenge, and that the analysis process is transparent. It is not essential that competitors' software is Open.

Open Data for common molecules?

Yesterday I needed the measured (i.e. not predicted) mass density of 2-bromo-propanoyl-bromide (CH3-CH(Br)C(=O)Br). This is a moderately common reagent, so I went to look for it on the Web - ultimately finding it on several sites. The value is ca. 2.061 g.cm-3 (many sites omit the units - argh!!). The temperature should also be reported - but isn't. I need the measured density because many chemical recipes give the volumes of reagents, and to work out the molar ratios in reactions I have to convert volume to moles via the density. I may also be interested in other measured properties such as boiling point.
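
As a worked example of why the density matters, here is the volume-to-moles conversion; the recipe volume is hypothetical:

```python
# Converting a recipe volume of 2-bromopropanoyl bromide (C3H4Br2O) to moles.
density = 2.061        # g/cm3, as quoted above (temperature, alas, unreported)
molar_mass = 215.87    # g/mol for C3H4Br2O

volume_ml = 5.0        # a hypothetical recipe quantity, in mL (= cm3)
moles = volume_ml * density / molar_mass
print(f'{volume_ml} mL -> {moles:.4f} mol')   # ca. 0.0477 mol
```
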
The problem is that it's difficult to scrape these sites. They give little indication of copyright, are arcanely structured and often have poor semantics (e.g. units). The best known is the NIST Webbook, part of which reads:

  • Thermophysical property data for 74 fluids:
    • Density, specific volume
    • Heat capacity at constant pressure (Cp)
    • Heat capacity at constant volume (Cv)
    • Enthalpy
    • Internal energy
    • Entropy
    • Viscosity
    • Thermal conductivity
    • Joule-Thomson coefficient
    • Surface tension (saturation curve only)
    • Sound speed

You can search for data on specific compounds in the Chemistry WebBook based on name, chemical formula, CAS registry number, molecular weight, chemical structure, or selected ion energetics and spectral properties.


NIST reserves the right to charge for access to this database in the future. The National Institute of Standards and Technology (NIST) uses its best efforts to deliver a high quality copy of the Database and to verify that the data contained therein have been selected on the basis of sound scientific judgment. However, NIST makes no warranties to that effect, and NIST shall not be liable for any damage that may result from errors or omissions in the Database.


© 1991, 1994, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2005 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.

It's clear that this is not an Open site - most works of the US Government are required to be freely available, but NIST has an exemption for its databases so that it can raise money.

Many suppliers list property information, but it is scattered across somewhat uncoordinated pages. Moreover the copyright and crawling position is often not clear.

My requirement is likely to be via robot - i.e. an asynchronous request for a property I don't have, with the ability to re-use it without explicit permission. I am therefore wondering whether there are Open sites for chemical data that can be accessed without explicit permission. I am not interested in collections of millions of compounds, but rather ca. 10,000 of the most commonly used.

A good source of data is MSDS (Materials Safety Data Sheets), and here is part of a typical one hosted by a group at Oxford University:

General

  Synonyms: nitrilo-2,2',2"-triethanol, tris(2-hydroxyethyl)amine, 2,2',2"-trihydroxy-triethylamine, trolamine, TEA, tri(hydroxyethyl)amine, 2,2',2"-nitrilotrisethanol, alkanolamine 244, daltogen, sterolamide, various further trade names
  Molecular formula: C6H15NO3
  CAS No: 102-71-6
  EC No: 203-049-8

Physical data

  Appearance: viscous colourless or light yellow liquid or white solid
  Melting point: 18 - 21 C
  Boiling point: 190 - 193 C at 5 mm Hg, ca. 335 C at 760 mm Hg (decomposes)
  Vapour density: 5.14
  Vapour pressure: 0.01 mm Hg at 20 C
  Specific gravity: 1.124
  Flash point: 185 C
  Explosion limits: 1.3 % - 8.5 %
  Autoignition temperature: 315 C

Stability

Stable. Incompatible with oxidizing agents and acids. Light and air sensitive.

It looks as if there are somewhere between 5,000 and 100,000 compounds on the site - I haven't counted - and if so this is close to what I am looking for. It looks as if the creators are happy for people to download it - their concern is that it shouldn't be seen as authoritative about safety (a perfectly reasonable position). If so, an Open Data sticker would be extremely useful and solve the problem. (There is the minor problem that there are no connection tables, but links to Pubchem should solve that.)
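
To show how close such pages are to being machine-usable, here is a minimal sketch that turns the "Physical data" block above into key/value pairs. Note that the units problem remains untouched - "185 C" is still just a string:

```python
# A few lines copied from the MSDS block above.
MSDS_BLOCK = """\
Melting point: 18 - 21 C
Boiling point: 190 - 193 C at 5 mm Hg, ca. 335 C at 760 mm Hg (decomposes)
Specific gravity: 1.124
Flash point: 185 C"""

properties = {}
for line in MSDS_BLOCK.splitlines():
    # Each field is "Key: value"; split on the first colon only.
    key, _, value = line.partition(':')
    properties[key.strip()] = value.strip()

print(properties['Specific gravity'])   # '1.124' - note: no units given!
```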

There has been talk of a Wikichemicals - and this is the sort of form it might take. It shouldn't be too difficult to create, and the factual data on the pages doesn't belong to anyone. So I'd like to know whether anyone has been doing this (measured, not predicted data) and whether their resource is Open.