The Crystallography Open Database is an early and excellent example of the way that a community can start to help itself and make its data open.
I’ve taken most of my information from the website (although I also met the founder, Armel Le Bail, 3 years ago). The COD arose out of the frustration of a number of scientists that much published crystallographic data was not Openly or (usually) freely available. There is a clear contrast within the discipline: the structures of protein molecules are Openly available in the RCSB Protein Data Bank, but those of other crystals (organic, organometallic, inorganic, metals, etc.) are not. Le Bail wrote to the databases that hold these other structures and to the International Union of Crystallography, asking that crystallographic data be made freely available. There is a considerable correspondence on the website (here). There is also a petition of 1000+ signatures (including mine) requesting that crystallographic data be made openly available.
I shan’t discuss the history here, but simply outline the current situation. An author publishes a crystal structure as part or all of an experiment. Almost all journals require the data to be made available, inter alia to validate the experiment and to allow it to be reproduced. (It has also proved extremely valuable for re-use.) In the past the only place the data could be stored electronically was these databases, which not unreasonably had to recover costs. Now, however, many journals publish the data directly on their websites (RSC, ACS, IUCr, most but not all Open Access journals, American Mineralogist). Others (Wiley, Springer, Elsevier) do not publish this data (though ironically they may publish other data such as spectra) but require it to be sent to the appropriate crystallographic database. The data is then not Openly available, and is mainly accessed by annual subscription (although individual entries may sometimes be free). Many scientists do not have access (including those who only have an occasional use). The arrangement now acts as an impediment to access rather than a support.
Examples (mostly in 2005) from the COD petitioners include:
We are doing a lot of molecular simulations on all kinds of minerals and access to crystallography structure data bases is crucial for our work…USA
As the author of webmineral.com, any public access to mineral data promotes and encourages further understanding of the mineral (material) sciences….USA
The information on data basis is generated by persons who, in same cases, have to pay for obtaining it. It is not reasonable; it should be freely available!…Brazil
Because I am only an occasional user of crystallographic databases (for example, during the mineralogy section of my soil chemistry course) I cannot justify paying full price for access to a database. I generally just use crystal structures that are prepackaged with CrystalMaker. Open access would allow my students and I to learn to use the crystallography software for a large variety of soil minerals. … (USA)
The data is all obtained by the scientific community and should be available without charge. As a fact of nature it should not be copywritable…. USA
I am a Crystallographer and our University is suffering from a constant lack of funding – so the databases are not being updated and are also not easily available… Australia
This is really a good proposal for people from developing countries…. India
I trust there should be no economical barriers to knowledge, and databases are essential tools for the scientific community and should be accessibile to everyone…. Italia
Egypt is a devoloping Country and We canot have the regular Crystallographic data base which we realy need it in our work accordingly if the crystallographic community allow us to have it on line it will be a big achievement for the third world countries…Egypt
As a result of our poor exchange rate, researchers in South Africa have to pay a very high price to obtain these databases. As a young academic and researcher, I could not afford the CSD in the first two years of my career, which made crystallographic research very difficult. Even now, obtaining the CSD is extremely expensive, and a large percentage of my research funds is used for this…. South Africa
… and many more.
So the volunteers at the COD started soliciting contributions, and – I think – typing up some historical and classic structures. (It is quite difficult to get Open structures for the sorts of materials required in undergraduate teaching, especially minerals). As I understand it the COD sources include:
- typing up public information
- donations from various regular sources (American Mineralogist and others)
- donations from individual groups and laboratories.
The COD has now reached an extremely impressive 70,000+ structures and it’s enormously valuable for our work and for CrystalEye. It seems to cover all fields of crystallography (except, of course, proteins). We have assimilated a great deal of the COD into CrystalEye. Here’s a typical example:
The CIF:
;
Neue Verbindungen mit Ba6 Ln2 (M3+)4 O15-Typ: Ba6 Nd2 Fe4 O15, Ba5 Sr
La2 Fe4 O15 und Ba5 Sr Nd2 Fe4 O15
;
loop_
_publ_author_name
'Mevs, H'
'Mueller-Buschbaum, Hk'
_journal_name_full 'Journal of the Less-Common Metals'
_journal_volume 158
_journal_year 1990
_journal_page_first 147
_journal_page_last 152
which you can view in CrystalEye.
The COD comes as individual CIFs in large ZIP files and we have done bulk imports when releases became available. We are now using the data for Nick Day’s MOPAC calculations.
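As a rough illustration of what a bulk import over such a release might look like, here is a minimal Python sketch that walks a ZIP archive and yields each CIF as text. It is not CrystalEye's actual import code; the function name `iter_cifs` and the choice to tolerate badly encoded bytes are my own assumptions.

```python
import zipfile


def iter_cifs(zip_source):
    """Yield (member_name, text) for every .cif entry in a ZIP archive.

    zip_source may be a path or a file-like object (hypothetical helper,
    not part of any COD or CrystalEye tooling).
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a CIF.
            if info.is_dir() or not info.filename.lower().endswith(".cif"):
                continue
            raw = archive.read(info)
            # Assumption: some old files are not valid UTF-8, so replace
            # undecodable bytes instead of aborting the whole import.
            yield info.filename, raw.decode("utf-8", errors="replace")
```

Something like `for name, text in iter_cifs("cod-release.zip"): ...` would then feed each entry to whatever parser is in use; the archive name here is made up.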
Note that although the COD data is at present in the database, it is not indexed by bibliography (though it is available by bond-length search). That’s because CrystalEye was not set up as a repository and the main bibliographic indexing system is based on regular publishers. It’s something we are reviewing over the summer for our in-house crystallographic repository, after which we shall be able to ingest CIFs from arbitrary sources.
We also have a problem in that many entries are duplicates of structures we already have. That’s because they were published in journals and also submitted to the COD. We don’t want to put them in twice, but it’s not easy to create a uniquifier when metadata are missing; we currently use structural formula and cell dimensions.
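To make the idea of a uniquifier built from formula and cell dimensions concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the function names, the tolerances (0.02 Å on lengths, 0.5° on angles) and the simple formula normalisation are hypothetical and are not the rules CrystalEye actually applies.

```python
import re
from collections import Counter


def formula_key(formula):
    """Normalise a formula such as 'Ba6 Nd2 Fe4 O15' so element order is irrelevant."""
    counts = Counter()
    # Element symbol followed by an optional (possibly fractional) count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[element] += float(number) if number else 1.0
    return tuple(sorted(counts.items()))


def same_structure(entry_a, entry_b, length_tol=0.02, angle_tol=0.5):
    """Compare two entries given as (formula, (a, b, c, alpha, beta, gamma)).

    The tolerances are illustrative guesses, not validated thresholds.
    """
    formula_a, cell_a = entry_a
    formula_b, cell_b = entry_b
    if formula_key(formula_a) != formula_key(formula_b):
        return False
    lengths_ok = all(abs(x - y) <= length_tol for x, y in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(x - y) <= angle_tol for x, y in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok
```

A comparison this simple will miss duplicates reported in a different cell setting (for example with axes permuted), which is part of why a reliable uniquifier is hard to build.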
The data in the COD is variable in quality. Some entries are duplicates, as mentioned. Some are old structures retyped; some are new depositions. Metadata varies a lot. At present some CIFs are syntactically incorrect and we have various heuristics to read them. (The COD does not edit the CIFs it receives, or at least it didn’t.) So quality is a problem.
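The kind of cheap check that can flag a suspect CIF before parsing might look like the Python sketch below. These three tests are my own illustrative guesses at common problems (a missing `data_` header, an unclosed semicolon text field, typographic quotes); they are not the heuristics we actually use, and they only report problems rather than repair them.

```python
def cif_problems(text):
    """Return a list of problems found by a few cheap syntax checks.

    Hypothetical helper: an empty list means none of these particular
    checks fired, not that the CIF is valid.
    """
    problems = []
    lines = text.splitlines()

    # Every CIF data block should start with a 'data_' header.
    if not any(line.lower().startswith("data_") for line in lines):
        problems.append("no data_ block header")

    # Multi-line text fields open and close with ';' in column one,
    # so an odd number of such lines suggests an unclosed field.
    semicolon_lines = sum(1 for line in lines if line.startswith(";"))
    if semicolon_lines % 2:
        problems.append("unterminated semicolon text field")

    # Curly quotes usually mean the file passed through a word processor.
    if any(ch in text for ch in "\u2018\u2019\u201c\u201d"):
        problems.append("typographic quotes instead of ASCII quotes")

    return problems
```

Files that come back with a non-empty list could be routed to a more forgiving reader or set aside for manual inspection.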
However, the COD is extremely useful. Many of our inorganic materials come directly from there. So if you are using CrystalEye you are also using almost all of the COD. We’ll be gradually adding metadata to all our data so that the COD is more prominent and acknowledged. Meanwhile, many thanks.