Archive for the ‘data’ Category
In the last two days Cameron Neylon has posted an idea for Open Science and got a lot of interest, see: e-science for open science – an EPSRC research network proposal and Follow on to network proposal. The idea is to create networks of excellence in escience (cyberscholarship) and Open (Notebook) Science would fit the bill perfectly. I’d be delighted to be part of Cameron’s proposal and this can be taken as a letter of support (unless the powers that be insist on a bit of paper with a logo which is a pain in the neck).
One of the secrets of cyberscholarship is that it flourishes when people do not want to run everything themselves in competition with everyone else. Chemistry often has a very bad culture of fortification rather than collaboration – hence there was so little effective chemistry in the UK eScience program. Southampton has been a notable exception, and, for example, we are delighted to be part of the eCrystals program they are running. The last year has shown that at the grass roots, chemistry is among the leaders in Open Science, and Cameron has detailed this in his proposal.
There are several areas where we’d like to help:
- making the literature available to machines (OSCAR) and thereby to the community
- distributed collaborative management of combinatorial chemistry (we can now do this with CML for fragments)
- shared molecular repositories (again we have a likely collaboration with Soton here)
- creation of shared ontologies (we collaborated with Soton during the eScience program).
(I’ve been spending time coding rather than blogging – blink for a day and you find out what you’ve missed.)
13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), 128.3, 127.6, 127.5 (Ar‑ortho-C, Ar-meta-C, Ar-para-C), 87.2 (C-3), 80.1 (C-4), 72.1 (OCH2Ph), 69.7 (CH2OBn), 58.0 (C-5), 26.7 (C-6), 20.9 ((CH3)AC-6), 17.9 ((CH3)BC-6), 11.3 (CH3C‑2), 0.5 (Si(CH3)3). (The “d” is a delta, but I think everything has been faithfully copied from the Word document.) Note that OSCAR can:
- understand that this is a 13C spectrum
- extract the frequency
- identify the peak values (shifts) and the comments
- extract the solvent (this is mentioned elsewhere in the thesis)
- understand the comments
- manage the framework symmetry group of the phenyl ring
- understand peakGroup (the aromatic ring)
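As a sketch of what such machine extraction involves, the nucleus, operating frequency and shift/assignment pairs can be pulled from a string like the one above. This is a toy regex, not OSCAR’s actual grammar (OSCAR does far more), and the shortened input string is just for illustration:

```python
import re

# Shortened version of the 13C spectrum quoted above (illustration only).
TEXT = ("13C (150 MHz) d 138.4 (Ar-ipso-C), 136.7 (C-2), 136.1 (C-1), "
        "87.2 (C-3), 0.5 (Si(CH3)3).")

def parse_c13(text):
    """Extract nucleus, frequency (MHz) and (shift, assignment) pairs.

    A toy sketch, not OSCAR: it assumes every shift is a decimal number
    followed by a parenthesised assignment (one level of nesting allowed,
    e.g. "Si(CH3)3").
    """
    nucleus = re.match(r"(\d+[A-Z][a-z]?)", text).group(1)       # e.g. "13C"
    freq = float(re.search(r"\((\d+(?:\.\d+)?)\s*MHz\)", text).group(1))
    peaks = [(float(shift), label) for shift, label in
             re.findall(r"(\d+\.\d+)\s*\(((?:[^()]|\([^()]*\))*)\)", text)]
    return nucleus, freq, peaks
```

Grouped peaks without individual assignments (the 128.3, 127.6, 127.5 aromatic run) would need an extra rule; that is exactly the “peakGroup” understanding mentioned above.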
You [PMR] said in your Blog:
It is critical to distinguish between “Free” and Open. “Free”, in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The materials database is in this category – and to be fair it didn’t call itself Open.
The Open Knowledge Initiative says:

1. Access The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.

[Correspondent] This does not seem to go far enough, in that if I have all good intentions and post material on a web server and then drop dead tomorrow, the info will disappear pretty soon after that. Possibly lost forever! The distinction you make between “free” and “Open” suggests that Open means there is some permanency to the arrangement of having it available? Am I interpreting this correctly? How could this be monitored or managed?

PMR: You are absolutely right. I think the problem of preservation has not been addressed. Indeed, until I started thinking about it I hadn’t realised how relatively simple the preservation of text was and how difficult the preservation of data was. First: who can one trust? It’s currently easy to deposit material with anyone – Google, Amazon, whoever – and to trust to spinning media to be replicated. But it’s very risky for long-term preservation. There are many bodies working on this, and the simple message is that it’s difficult and depends on what we want to preserve and how long we want to do it. There are many levels – the bitstream, the semantic content, the ontological context, etc. Places like the UK’s Digital Curation Centre understand and work on exactly this. [Correspondent] is worried – like me – about the archival of chemical data within the laboratory. What should be done? My personal answers are:
- If the data is valuable enough an international data centre will store it. Biology is strong here and bioinformatics centres have an effective commitment to archival and also work on preservation. Chemistry relies on commercial and quasi-commercial organisations which generally accomplish far less.
- My own inclination is towards global domain-specific repositories where the data are difficult and the volume merits it; national ones where the problem is understood but needs supporting (e.g. in mainstream chemistry) and where possible departmental ones (e.g. for crystallography, spectroscopy, computational chemistry.)
- I do not strongly support dumping terabytes of data into institutional repositories without good metadata and analysis software. Maybe future generations will welcome these hidden treasures and will have super-intelligent software.
Abstract: JASPAR is a popular open-access database for matrix models describing DNA-binding preferences for transcription factors and other DNA patterns. [...] JASPAR is available [here].

In the last 2 months I have been thinking fairly furiously about Open Data and realising it can be considerably more complex than Open Access scholarly publishing. I’m certainly clear that the borderlines may have to be fuzzy, but not infinitely. Peter writes:
- Just a quick note on my offline talk with PMR about Material Properties, which I called “OA” in a blog post. Neither of us could find its licensing terms, so we couldn’t tell just how open it was. I needed (I still need, we all need) a generic term for such resources when we do know they are free of charge but don’t know any details about their licensing terms. For better or worse “OA” has become that generic term, even while it has a narrower, earlier, more technical and more proper sense through the BBB definition. I readily and often acknowledge that I use the term “OA” both ways – widely and narrowly, as a generic term and as the technical term for the BBB level of openness. I also readily and often acknowledge that this ambiguity causes problems – see for example the Poynder interview at pp. 30-31. I can add that I resisted this dual sense as long as I could and only acquiesced when it became an undeniable fact of actual usage. For perspective, I’ve also argued that this kind of semantic spread is not a special calamity for our technical term, but affects most technical terms in wide use and needn’t prevent precise communication.
- One tempting solution is to come up with a new generic term so that “OA” can be limited to its strict BBB sense. That’s desirable but difficult, since coining terms is not the same thing as assuring their use, let alone their intended use. BTW, “free” would not make a better generic term, at least not yet, since it suggests to many people that a work is merely free of charge and does not also remove permission barriers. A good generic term would cover all kinds of free online content, including those that are BBB OA.
- I share PMR’s hope that the term “open data” can stay fairly well tethered to its technical definition. But the data world needs a generic term for the same reason that the publication world does. If we had a good generic term for free online content, perhaps it could allow “open data” to remain univocal.
- There must be some mechanism whereby the community could, if it wished, capture the resource for public archival without permission. This could be as simple as spidering the site, or a relational dump, or a massive file, or an iterator.
- There must be no permission barriers to re-use including commercial re-use.
- The data must either be the whole work (at a given point in time) or be clearly bounded (i.e. there should be no hidden data that the world cannot get access to in the same way).
- There should be no time limits on access and re-use.
In the simplest form the definition can be summed up in the statement that “A piece of knowledge is open if you are free to use, reuse, and redistribute it”. For details read the latest version of the full definition (with explanatory annotations).
I’m going to look at the most important clauses for science/chemistry (emphases are mine) – I have omitted other clauses but I adhere to them as well:
1. Access The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.

PMR: There are significantly different types of Open Data in science. There is raw data produced by the scientific experiment, increasingly published alongside “fulltext” publications or theses. There is a curated, critical snapshot of a given experiment, perhaps images from a telescope or satellite. In this post I discuss the problems of “databases” or “knowledgebases” which are both fragmented and dynamic (e.g. CrystalEye and the Materials database). It is critical to distinguish between “Free” and Open. “Free”, in this context, simply means that the provider has mounted the data (not necessarily the whole data) on a web page. There is often no licence, no copyright, no guarantee of availability, no commitment to archival, no explicit freedom of re-use. The Materials database is in this category – and to be fair it didn’t call itself Open. A major problem, which we have discussed in some detail on this blog over CrystalEye, is that many databases are both hypermedia and dynamic: they are spread over many components and they change with time. Both CrystalEye and Materials fall into that category. It is technically difficult to make them easily available and there is no agreed mechanism for doing this. The work must be available as a whole. I agree this is critical, but it’s often difficult. Leaving aside the dynamic aspect, there are a few possibilities.
The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work. The license may impose some form of attribution and integrity requirements: see principle 5 (Attribution) and principle 6 (Integrity) below.
4. Absence of Technological Restriction
The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, i.e. one whose specification is publicly and freely available …
The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work.
The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.
8. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for military research.
- Bundle the data into a single “file” or a set of files. This has worked historically for the Protein Databank. The difficulties are that there is not usually a single simple object to bundle, and that it requires considerable maintenance.
- Provide an iterator over the data. This could either be a generic tool such as wget (which recurses over a hyperdocument) or a bespoke tool which is guaranteed to iterate over the data. This is the approach we have adopted (Jim Downing wrote it specifically to help the community download the data and has made it available under an Open Source licence).
- Collaborate with a data provider (e.g. a Bioinformatics institute). This is a good approach if your community supports the idea of Open Data, but chemistry has yet to see the light.
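To make the “bespoke iterator” option concrete, here is a minimal sketch of the idea – a generator that walks a local dump of the dataset and yields one entry at a time. This is illustrative only, not Jim Downing’s actual tool; the `.cml` suffix and the directory layout are assumptions:

```python
import os

def iterate_dataset(root, suffix=".cml"):
    """Yield the paths of data files under `root`, directory by directory.

    A minimal stand-in for a bespoke dataset iterator: the caller can
    process entries one at a time without ever needing the collection
    bundled into a single file. `suffix` is an assumed convention.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)
```

The point of the design is that downloaders never depend on a monolithic bundle: the iterator defines the boundary of “the whole work” at the moment it is run.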
A few other comments. “Convenient and modifiable form” and “no technological obstacles” cannot be defined precisely, but I would argue that if the Open Data provider has published their formats and there is Open Source code that can read the data, that is sufficient. Note that for many files ASCII is sufficient if the metadata is well provided. There is no requirement for the Open Data provider to provide installation help for downloaders if the instructions are minimally clear. Open Access for scholarly publications implicitly guarantees certain aspects which are not guaranteed by default for Open Data:
- The whole of the work is available. This is almost always trivial for articles (but as we have seen is a problem for some sorts of data).
- There will be continued access to the work. This is based on (Gold) the permanence of Open Access publishers and the copying to inter/national repositories and (Green) the permanence of institutional repositories and in some cases inter/national repositories (self-archival on personal webpages does not guarantee permanent access). Repositories in general do not archive data.
- The work can be re-used. This is clear if a licence is embedded in the work or provided by the repository. Note that many repositories do not make the licence position clear.
- The work is in a convenient and modifiable form. Trivially readable for sighted humans. The rest is not always true.
we need ‘CORRECT’ data – many assignments of the early 70’s are absolutely correct and useful for comparison [...] As a consequence of your QM-calculations 10 assignment corrections and 1 structure revision within a few hundred compounds have been performed by ‘hko’ (see postings above) – this corresponds to an error rate of approx. 5% ! [PMR: In the data set we extracted from NMRShiftDB]. [... discussion of how such errors are detected snipped...]

PMR: Part of the exercise that Nick Day has undertaken was to give an objective analysis of the errors in the GIAO method. The intention was to select a data set objectively. It is extremely difficult to select a representative data set by any means – every collection is made with some purpose in mind. We assumed that NMRShiftDB was “roughly representative” of 13C NMR (and so far this hasn’t been an issue). It could be argued that it may not have many organometallics, minerals, proteins, etc., and I suspect that our discourse is mainly about “small organic molecules”. But I don’t know. It may certainly not be representative of the scope of GIAO or HOSE codes. Again I don’t know. Having made the choice of data set, the algorithm for selecting the test data was objective and Nick has stated it (< 20 heavy atoms, <= Cl except Br, no adjacent acyclic bonds). There may have been odd errors in implementing this (we got 2–3 compounds with adjacent acyclic bonds) but it was largely correct, and it could be re-run to remove these. We stress again that we did not know how many structures we would get and whether they would behave well in the GIAO method. In fact over 25% failed to complete the calculation. (We are continuing to find this – the atom count is not a perfect indication of how long a calculation will take, which can vary by nearly a factor of 10.) We would not claim that the remaining ca. 250 compounds were “representative”.
There are no organometallics, no electron-deficient compounds, no overcrowded compounds, no major ring currents, etc. (all of which are areas where we might expect GIAO to do better than some empirical methods). In fact the compounds are generally ones that we would expect connection-table-based methods to score well on, as there are few unusual groups (so well trained) and no examples where the connection table cannot describe the molecule well (e.g. Li4Me4, Fe(Cp)2, etc.). Our current conclusion is that the variance in the experimental data is sufficiently large (even after removal of misassignments) to hide errors in the GIAO method. This appears to give good agreement with an RMS of ca. 2 ppm (but again we stress that the data set is not necessarily representative). If the Br/Cl correction had not been anticipated it would have been clearly visible and the exercise would have revealed it as a new effect. It is certainly possible that there are other undetected effects (especially for unusual chemistry). But, for common compounds I think we can claim that the GIAO method is a useful prediction tool. It should be particularly useful where connection tables break down, and here are some systems I’d like to see it exposed to:
- Fe(Cp)2 – although Fe is difficult to calculate well.
- p-cyclophane (C1c(cc2)ccc2CCc(cc3)ccc3C1)
- 18-annulene

PMR: So what I would like is a representative test data set that could be used for the GIAO method. The necessary criteria are:
- It is agreed what the chemical scope is. I think we would all exclude minerals, probably all solid state, proteins, macromolecules (there are other communities which do that). But I think we should include a wide chemical range if possible.
- The data set is prepared by one or more NMR-expert groups that have no particular interest in promoting one method over another. That rules out Henry, Wolfgang, ACDLabs, and probably NMRShiftDB.
- The data set should provide experimental chemical shifts and the experts should have agreed the assignments by whatever methods are currently appropriate – these could include a group opinion. The assignments should NOT have been based on any of the potential competitive methodologies.
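The RMS figure quoted above is the standard statistic over parallel lists of calculated and experimental assigned shifts; a minimal sketch of that arithmetic (not Nick Day’s actual pipeline, which also handles the Br correction, equivalent atoms, etc.):

```python
import math

def rms_deviation(calc, expt):
    """Root-mean-square deviation (ppm) between calculated and
    experimental 13C shifts.

    `calc` and `expt` are parallel lists, one entry per assigned atom.
    This is the plain textbook statistic used to summarise a method's
    agreement with experiment.
    """
    if len(calc) != len(expt) or not calc:
        raise ValueError("shift lists must be parallel and non-empty")
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, expt)) / len(calc))
```

An experimental variance of ca. 2 ppm in the reference data therefore puts a floor of roughly that size under the RMS any prediction method can report.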
For the experiment to succeed, it is essential that we obtain the help of the experimental community. As in previous CASPs, we will invite protein crystallographers and NMR spectroscopists to provide details of structures they expect to have made public before September 1, 2006. A target submission form will be available at this web site in mid-April. Prediction targets will be made available through this web site. All targets will be assigned an expiry date, and predictions must be received and accepted before that expiration date.” As in previous CASPs, independent assessors will evaluate the predictions. Assessors will be provided with the results of numerical evaluation of the predictions, and will judge the results primarily on that basis. They will be asked to focus particularly on the effectiveness of different methods. Numerical evaluation criteria will as far as possible be similar to those used in previous CASPs, although the assessors may be permitted to introduce some additional ones.
There are four assessors, representing expertise in template-based modeling, template-free modeling, high-accuracy modeling and function prediction. In accordance with CASP policy, assessors are not directly involved in the organization of the experiment, nor can they take part in the experiment as predictors. Predictors must not contact assessors directly with queries; these should instead be sent to the casp@predictioncenter.org email address.

PMR: …and they follow up with a meeting.
The TREC conference series has produced a series of test collections. Each of these collections consists of a set of documents, a set of topics (questions), and a corresponding set of relevance judgments (right answers). Different parts of the collections are available from different places as described on the data page (http://trec.nist.gov/data.html). In brief, the topics and relevance judgements are available at http://trec.nist.gov/data.html, and the documents are available from either the LDC (Tipster Disks 1–3) or NIST (TREC Disks 4–5), information on collections other than English can be found at http://trec.nist.gov/data.html.
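The relevance judgments (“qrels”) mentioned above use a simple whitespace-separated line format – `topic iteration docno relevance` – so reading them into a usable structure is short. A minimal sketch of that standard format:

```python
def read_qrels(lines):
    """Parse TREC relevance judgments ("qrels").

    Each line is `topic iteration docno relevance`; the iteration field
    is historical and usually ignored. Returns {topic: {docno: rel}}.
    """
    judgments = {}
    for line in lines:
        topic, _iteration, docno, rel = line.split()
        judgments.setdefault(topic, {})[docno] = int(rel)
    return judgments
```

This triple of documents, topics and shared right answers is exactly the kind of independent, bounded test collection the chemistry discussion above is missing.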
In May 2004 the CCDC hosted a meeting to discuss the results of the third blind test of Crystal Structure Prediction (CSP). The challenge of the competition was to predict the experimentally observed crystal structure of the 4 small organic molecules shown in figure 1 given information only on the molecular diagram, the crystallisation conditions and the fact that Z’ would be no greater than 2. The results of the competition are presented including an analysis of each participant’s extended list of candidate structures. A computer program COMPACK has been developed to identify crystal structure similarity. This program is used to identify at what positions the observed structures appear in the extended lists. Also, predicted structures obtained from the various participants are compared to determine whether the different approaches and methodologies attempted produce similar lists of structures. The hydrogen bond motifs predicted for molecule I are also analysed and an assessment made as to the most commonly predicted motifs and a comparison made to common motifs observed for similar molecules found in the Cambridge Structural Database.

PMR: These have a range of objective (measured) and subjective (expert opinion) criteria for the “right” answer. The key components are:
- the mechanism and evaluation must be independent of the competitors
- all competitors must have an equal chance
- the answers must be carefully created and hidden before the prediction
- there is a closing date
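One simple way to guarantee that the answers were “created and hidden before the prediction” is a hash commitment: publish a digest of the answer before the closing date, then reveal the answer and salt afterwards so anyone can check it was fixed in advance. This is a generic sketch, not the procedure CASP or the CCDC actually use:

```python
import hashlib
import json

def commit(answer, salt):
    """Return a SHA-256 digest of (answer, salt) for publication
    before the closing date. The salt stops anyone brute-forcing
    the answer from the digest alone."""
    payload = json.dumps({"answer": answer, "salt": salt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def verify(answer, salt, digest):
    """Check a revealed answer against the previously published digest."""
    return commit(answer, salt) == digest
```

The organisers publish only the digest; after the closing date they reveal the answer and salt, and any competitor can re-run `verify` independently.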
It’s clear that this is not an Open site – most US Government agencies are required to make their works freely available, but NIST has an exemption for its databases so that it can raise money. Many suppliers list property information, but it is scattered throughout somewhat uncoordinated pages. Moreover the copyright and crawling position is often not clear. My requirement is likely to be via robot – i.e. an asynchronous request for a property I don’t have, with the ability to re-use it without explicit permission. I am therefore wondering whether there are Open sites for chemical data that can be accessed without explicit permission. I am not interested in collections of millions of compounds, but rather the ca. 10,000 most commonly used. A good source of data is MSDS (Material Safety Data Sheets), and here is part of a typical one hosted by a group at Oxford University:
You can search for data on specific compounds in the Chemistry WebBook based on name, chemical formula, CAS registry number, molecular weight, chemical structure, or selected ion energetics and spectral properties.
- Thermophysical property data for 74 fluids:
- Density, specific volume
- Heat capacity at constant pressure (Cp)
- Heat capacity at constant volume (Cv)
- Internal energy
- Thermal conductivity
- Joule-Thomson coefficient
- Surface tension (saturation curve only)
- Sound speed
NIST reserves the right to charge for access to this database in the future.

The National Institute of Standards and Technology (NIST) uses its best efforts to deliver a high quality copy of the Database and to verify that the data contained therein have been selected on the basis of sound scientific judgment. However, NIST makes no warranties to that effect, and NIST shall not be liable for any damage that may result from errors or omissions in the Database.
© 1991, 1994, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2005 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.
It looks as if there are in the range of 5,000 to 100,000 compounds on the site – I haven’t counted – and if so this is close to what I am looking for. It looks as if the creators are happy for people to download it – their concern is that it shouldn’t be seen as authoritative about safety (a perfectly reasonable position). If so, an Open Data sticker would be extremely useful and would solve the problem. (There is the minor problem that there are no connection tables, but links to Pubchem should solve that.) There has been talk of a Wikichemicals – and this is the sort of form it might take. It shouldn’t be too difficult to create, and the factual data on the pages doesn’t belong to anyone. So I’d like to know whether anyone has been doing this (measured, not predicted data) and whether their resource is Open.
Synonyms: nitrilo-2,2′,2″-triethanol, tris(2-hydroxyethyl)amine, 2,2′,2″-trihydroxy-triethylamine, trolamine, TEA, tri(hydroxyethyl)amine, 2,2′,2″-nitrilotrisethanol, alkanolamine 244, daltogen, sterolamide, various further trade names Molecular formula: C6H15NO3 CAS No: 102-71-6 EC No: 203-049-8
Appearance: viscous colourless or light yellow liquid or white solid Melting point: 18 – 21 C Boiling point: 190 – 193 C at 5 mm Hg, ca. 335 C at 760 mm Hg (decomposes) Vapour pressure: 0.01 mm Hg at 20 C Vapour density: 5.14 Specific gravity: 1.124 Flash point: 185 C Explosion limits: 1.3 % – 8.5 % Autoignition temperature: 315 C
Stability: Stable. Incompatible with oxidizing agents and acids. Light and air sensitive.
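Run-together MSDS entries like the physical-data line above can be split back into fields by keying on the known field names – exactly the kind of metadata-driven parsing a Wikichemicals robot would need. A toy sketch; real MSDS layouts vary widely and this field list is an assumption:

```python
import re

# Assumed field names for the physical-data section of an MSDS page.
FIELDS = ["Appearance", "Melting point", "Boiling point", "Vapour pressure",
          "Vapour density", "Specific gravity", "Flash point",
          "Explosion limits", "Autoignition temperature"]

def parse_physical_data(text):
    """Split a run-together MSDS physical-data line into a dict.

    Splits on the known field names followed by a colon; re.split with
    a capturing group yields [prefix, key1, value1, key2, value2, ...].
    """
    pattern = "(" + "|".join(re.escape(f) for f in FIELDS) + "):"
    parts = re.split(pattern, text)
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}
```

With connection tables linked in from Pubchem, records like this could become the measured-data backbone of the Wikichemicals idea.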