petermr's blog

A Scientist and the Web


Archive for the ‘data’ Category

Open Data in Science

Sunday, January 6th, 2008

I have been invited to write an article for Elsevier’s Serials Review and mentioned it in an earlier post (Open Data: Datument submitted to Elsevier’s Serials Review). I had hoped to post the manuscript immediately afterward but (a) our DSpace crashed and (b) Nature Precedings doesn’t accept HTML. DSpace is now up again and you can see the article. This post is about the content, not the technology.
[NOTE: The document was created as a full hyperlinked datument, but DSpace cannot handle hyperlinks and it numbers each of the components as a completely separate object with an unpredictable address. So none of the images show up - it's probably not a complete disaster - and you lose any force of the datument concept (available here as zip) which contains an interactive molecule (Jmol) ]

The abstract:

Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable and summarises some early initiatives in formalising the right of access to and re-use of scientific data.

PMR: The article tries not to be too polemical and to review objectively the area of Open Data (in scientific scholarship), in the style that I have done for Wikipedia. The next section shows Open Data in action, both on individual articles and when aggregating large numbers (> 100,000) of articles. Although the illustrations are from chemistry and crystallography the message should transcend the details. Finally I try to review the various initiatives that have happened very recently and I would welcome comments and corrections. I think I understand the issues raised in the last month but they will take time to sink in.

So, for example, in the last section I describe and pay tribute to the Open Knowledge Foundation, Talis and colleagues, and Science/Creative Commons. I will blog this later but there is now a formal apparatus for managing Open Data (unlike Open Access, where the lack of one causes serious problems for science data). In summary, we now have:

  • Community Norms (“this is how the community expects A and B and C to behave – the norms have no legal force but if you don’t work with them you might be ostracized, get no grants, etc.”)
  • Protocols. These are high-level declarations which allow licences to be constructed. Both Science Commons and the Open Knowledge Foundation have such instruments. They describe the principles which conformant licences must honour. I use the term meta-licence (analogous to XML, a meta-markup language for creating markup languages).
  • Licences. These include PDDL and CC0 which conform to the protocol.

Throughout the article I stress the need for licences, and draw much analogy from the Open/Free Source communities which have meta-licences and then lists of conformant licences. I think the licence approach will be successful and will be rapidly adopted.

The relationship between Open Access and Open Data will require detailed work – they are distinct and can exist together or independently.  In conclusion I write:

Open Data in science is now recognised as a critically important area which needs much careful and coordinated work if it is to develop successfully. Much of this requires advocacy and it is likely that when scientists are made aware of the value of labeling their work the movement will grow rapidly. Besides the licences and buttons there are other tools which can make it easier to create Open Data (for example modifying software so that it can mark the work and also to add hash codes to protect the digital integrity).

Creative Commons is well known outside Open Access and has a large following. Outside of software, it is seen by many as the default way of protecting their work while making it available in the way they wish. CC has the resources, the community respect and the commitment to continue to develop appropriate tools and strategies.

But there is much more that needs to be done. Full Open Access is the simplest solution but if we have to coexist with closed full-text the problem of embedded data must be addressed, by recognising the right to extract and index data. And in any case conventional publication discourages the full publication of the scientific record. The adoption of Open Notebook Science in parallel with the formal publication of the work can do much to liberate the data. Although data quality and formats are not strictly part of Open Data, their adoption will bring marked improvements. The general realisation of the value of reuse will create strong pressure for more and better data. If publishers do not gladly accept this challenge, then scientists will rapidly find other ways of publishing data, probably through institutional, departmental, national or international subject repositories. In any case the community will rapidly move to Open Data and publishers resisting this will be seen as a problem to be circumvented.

Does the semantic web work for chemical reactions?

Friday, January 4th, 2008

A very exciting post from Jean-Claude Bradley asking whether we can formalize the semantics of chemical reactions and synthetic procedures. Excerpts, and then comment…

Modularizing Results and Analysis in Chemistry

Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.

This makes sense from the chemist’s standpoint.

When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion optimally directly answering the stated objective.

When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching).

But these are not the only perspectives possible.

What would chemistry look like from a results-centric view?

Let’s see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture.

If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:

  • ADD container (type=one dram screwcap vial)
  • ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
  • WAIT (time=15 min)
  • ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
  • VORTEX (time=15 s)
  • WAIT (time=4 min)
  • ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
  • VORTEX (time=4 min)
  • WAIT (time=22 min)
  • ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
  • VORTEX (time=30 s)
  • WAIT (time=14 min)
  • ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
  • VORTEX (time=5.5 min)
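A step list of this kind maps naturally onto structured records. Here is a minimal sketch in Python; the field names and schema are my own invention for illustration, not any accepted standard:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One step of a synthesis log. Schema is illustrative, not a standard."""
    verb: str                                   # ADD, WAIT or VORTEX
    substance: str = ""                         # empty for WAIT/VORTEX
    inchikey: str = ""                          # machine-resolvable identifier
    params: dict = field(default_factory=dict)  # volume, mass, time, ...

# The first few steps of the EXP150 log above:
exp150 = [
    Action("ADD", "methanol", "OKKJLVBELUTLKV-UHFFFAOYAX", {"volume_ml": 1.0}),
    Action("WAIT", params={"time_min": 15}),
    Action("ADD", "benzylamine", "WGQKYBSKWIADBV-UHFFFAOYAL", {"volume_ul": 54.6}),
    Action("VORTEX", params={"time_s": 15}),
]

# The whole protocol collapses to a tiny verb vocabulary:
verbs = sorted({a.verb for a in exp150})
```

Because each record carries an InChIKey rather than a free-text name, a machine can resolve the substance without natural-language processing.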

It turns out that for this CombiUgi project very few commands are required to describe all possible actions:

  • ADD
  • WAIT
  • VORTEX

By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point.

Also, if we recorded chemistry this way we could do searches that are currently not possible:

  • What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
  • What happens (picture, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?

I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.
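Queries of the first kind become straightforward once the log is structured. A sketch, assuming each step carries a hypothetical "class" annotation (amine, aldehyde, ...) that would in practice have to be derived from the identifier:

```python
def matches_amine_aldehyde_query(steps, min_vortex_s=15):
    """Do the steps add an amine and an aldehyde, followed later by a
    vortex of at least min_vortex_s seconds? (Illustrative logic only.)"""
    added = {}
    for i, step in enumerate(steps):
        if step["verb"] == "ADD":
            added[step.get("class")] = i
    if "amine" not in added or "aldehyde" not in added:
        return False
    later = max(added["amine"], added["aldehyde"])
    return any(step["verb"] == "VORTEX" and step.get("time_s", 0) >= min_vortex_s
               for step in steps[later + 1:])

log = [
    {"verb": "ADD", "class": "solvent"},
    {"verb": "ADD", "class": "amine"},
    {"verb": "ADD", "class": "aldehyde"},
    {"verb": "VORTEX", "time_s": 15},
]
```

Running this over many experiment logs is then a filter, not a literature search.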

Note that we can’t represent everything using this approach. For example, observations made in the experiment log don’t show up here, nor does anything unexpected. Therefore, at least as long as we have human beings recording experiments, we’re going to continue to use the wiki as the official lab notebook of my group. But hopefully I’ve shown how we can translate from freeform to structured format fairly easily.

Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab – can you really reproduce all those experiments without talking to the researcher?)

But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal.

Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician).

Here is an example


database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that.)


I think this takes us a step closer from freeform Open Notebook Science to the chemical semantic web, something that both Cameron Neylon and I have been discussing for a while now.

PMR: This is very important to follow – and I’ll give some of our insights. Firstly, we have been tackling this for ca. 5 years, starting from the results as recorded in scientific papers or theses. Most recently we have been concentrating very hard on theses and have just taken delivery of a batch of about 20, all from the same lab.

I agree absolutely with J-C that traditional recording of chemical syntheses in papers and theses is very variable and almost always misses large amounts of essential detail. I also agree absolutely that the way to get the info is to record the experiment as it happens. That’s what the Southampton projects CombeChem and R4L spent a lot of time doing. The trouble is it’s hard. Hard socially. Hard to get chemists interested (if it was easy we’d be doing it by now). We are doing exactly the same with some industrial partners. They want to keep the lab book. The paper lab book. That’s why electronic notebook systems have been so slow to take off. The lab book works – up to a point – and it also serves the critical issues of managing safety and intellectual property. Not very well, but well enough.

J-C asks

If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that

CML has been designed to support this, and Lezan Hawizy in our group has been working in detail over the last 4 months to see if CML works. It’s capable of managing, inter alia:

  • observations
  • actions
  • substances, molecules, amounts
  • parameters
  • properties (molecules and reactions)
  • reactions (in detail) with their conditions
  • scientific units
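To make this concrete, here is a sketch of what a machine-generated action list in an XML markup of this kind might look like, built with the Python standard library. The element and attribute names below are illustrative only and are not taken from the actual CML schema:

```python
import xml.etree.ElementTree as ET

# Build a CML-flavoured action list; names are illustrative, not CML schema.
reaction = ET.Element("reaction")
actions = ET.SubElement(reaction, "actionList")

add = ET.SubElement(actions, "action", {"role": "add"})
ET.SubElement(add, "substance", {"title": "methanol"})
ET.SubElement(add, "parameter", {"dictRef": "volume", "units": "ml", "value": "1"})

wait = ET.SubElement(actions, "action", {"role": "wait"})
ET.SubElement(wait, "parameter", {"dictRef": "time", "units": "min", "value": "15"})

xml_text = ET.tostring(reaction, encoding="unicode")
```

The point is that amounts, units and substances are separate addressable fields, so a downstream program never has to re-parse prose.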

We have now taken a good subset of literature reactions (abbreviated though they may be) and worked out some of the syntactic, semantic, ontological and lexical environment that is required. Here is a typical result, which has a lot in common with J-C’s synthesis.


(Click to enlarge.) I have cut out the actual compounds, though in the real example they have full formulae in CML and can be used to manage balance of reactions, masses, volumes, molar amounts, etc. JUMBO is capable of working out which reagents are present in excess, for example. It can also tell you how much of everything you will need and how long the reaction will take. No magic, just housekeeping.
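The "housekeeping" is simple arithmetic once amounts and molar masses are machine-readable. A toy version of the excess/limiting-reagent check (all the numbers here are made up for illustration, not taken from any real experiment):

```python
# name: (mass_g, molar_mass_g_mol, stoichiometric_coefficient) -- made-up values
reagents = {
    "amine": (0.050, 107.15, 1),
    "acid":  (0.043,  86.09, 1),
}

# equivalents = moles available per unit of stoichiometric demand
equivalents = {name: (mass / mm) / coeff
               for name, (mass, mm, coeff) in reagents.items()}

limiting = min(equivalents, key=equivalents.get)      # smallest equivalents
in_excess = [name for name in reagents if name != limiting]
```

With 0.050 g of a 107.15 g/mol amine against 0.043 g of an 86.09 g/mol acid, the amine supplies fewer moles and is limiting.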

CML is designed with a fluid vocabulary, so that anything which isn’t already known is found in dictionaries and repositories. So we have collections of:

  • solvents
  • reagents
  • apparatus
  • procedures
  • appearances
  • units
  • common molecules

A word of warning. It looks attractive, almost trivial, when you start. But as you look at more examples, and particularly as you widen your scope, it gets less and less productive. I’ve probably looked through several hundred papers. There is always a balance between precision and recall, and Zipf’s law guarantees a long tail of rare cases. You will never manage everything. There will be procedures, substances, etc. that defy representation. There are anonymous compounds and anaphora.

So we can’t yet build a semantic robot that is capable of doing everything. We probably can build examples that work in specific labs where the reactions are systematically similar – as in combinatorial chemistry.

So, yes, J-C – we would love to explore how CML can support this…

Chemical information on the web – typical problem

Monday, December 31st, 2007

Here’s a typical problem with chemical (and other) data on the web and elsewhere. I illustrate it with an entry from Wikipedia, knowing that they’ll probably correct it and similar entries as soon as it’s pointed out. You don’t have to know much science to solve this one:

Molecular formula XeO4
Molar mass 195.29 g mol−1
Appearance Yellow solid below −36°C
Density ? g cm−3, solid
Melting point −35.9 °C

Here’s part of the infobox for Xenon tetroxide in WP. Why are the data questionable? The problem is universal… [The info box didn't copy so you'll have to look at the web page - probably a better idea anyway. Here's a screenshot] infobox.PNG

UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the “-” is typed. From Wikipedia:

Character codes

Read          Character   Unicode   ASCII   URL         HTML (others)
Plus          +           U+002B    +       %2B
Minus         −           U+2212    (none)  %E2%88%92   &minus; or &#8722;
Hyphen-minus  -           U+002D    -       %2D

The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.
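The three look-alike characters are distinct code points, and standard numeric parsers accept only the ASCII one. A quick check in Python:

```python
import unicodedata

# Three visually similar characters, three different code points:
for ch in "+-\u2212":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+002B  PLUS SIGN
# U+002D  HYPHEN-MINUS
# U+2212  MINUS SIGN

float("-35.9")             # parses fine with ASCII hyphen-minus
try:
    float("\u221235.9")    # the same number written with U+2212 MINUS SIGN
except ValueError:
    pass                   # rejected: not a recognised numeric character
```

So a melting point copied faithfully from the rendered page simply fails to parse.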

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
<td>Yellow solid below −36°C</td>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<p>−35.9 °C</p>

where the “minus” is represented by 3 bytes, the UTF-8 encoding of U+2212 (0xE2 0x88 0x92), which here print as “−”.
Note also that the degree sign is composed of two characters.

If the document is Unicode then this may be strictly correct, but in a scientific context it is universal that ASCII 45 is used for minus.

The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it.

The answer for “minus” is clear – in a scientific context always use ASCII 45. It is difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets.

Which is not common.
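A defensive normalization pass of the kind implied here is easy to write. A sketch; the character list below is illustrative rather than exhaustive:

```python
# Map common look-alike signs to ASCII 45 before numeric parsing.
DASH_FIXUP = str.maketrans({
    "\u2212": "-",   # MINUS SIGN
    "\u2012": "-",   # FIGURE DASH
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
})

def parse_value(text):
    """Parse a numeric string that may use typographic minus signs."""
    return float(text.translate(DASH_FIXUP))
```

With this, parse_value accepts both "-35.9" and the U+2212 form that plain float() rejects; it does nothing, of course, for numbers lost to symbol fonts.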

Note, of course, that it’s much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.

What does USD 29 billion buy? and what’s its value?

Friday, December 28th, 2007

Like many others I’d like to thank the The Alliance for Taxpayer Access

… a coalition of patient, academic, research, and publishing organizations that supports open public access to the results of federally funded research. The Alliance was formed in 2004 to urge that peer-reviewed articles stemming from taxpayer-funded research become fully accessible and available online at no extra cost to the American public. Details on the ATA may be found at

for its campaigning for the NIH bill. From the ATA site:

The provision directs the NIH to change its existing Public Access Policy, implemented as a voluntary measure in 2005, so that participation is required for agency-funded investigators. Researchers will now be required to deposit electronic copies of their peer-reviewed manuscripts into the National Library of Medicine’s online archive, PubMed Central. Full texts of the articles will be publicly available and searchable online in PubMed Central no later than 12 months after publication in a journal.

“Facilitated access to new knowledge is key to the rapid advancement of science,” said Harold Varmus, president of the Memorial Sloan-Kettering Cancer Center and Nobel Prize Winner. “The tremendous benefits of broad, unfettered access to information are already clear from the Human Genome Project, which has made its DNA sequences immediately and freely available to all via the Internet. Providing widespread access, even with a one-year delay, to the full text of research articles supported by funds from all institutes at the NIH will increase those benefits dramatically.”

PMR: Heather Joseph – one of the main architects of the struggle – comments:

“Congress has just unlocked the taxpayers’ $29 billion investment in NIH,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, a founding member of the ATA). “This policy will directly improve the sharing of scientific findings, the pace of medical advances, and the rate of return on benefits to the taxpayer.”

PMR: Within the rejoicing we must be very careful not to overlook the need to publish research data in full. So, as HaroldV says, “the Human Genome Project [...] made its DNA sequences immediately and freely available to all via the Internet”. This was the essential component. If only the fulltext of the papers had been available, the sequences could not have been used – we’d still be trying to hack PDFs for sequences.

So what is the USD 29 billion? I suspect that it’s the cost of the research, not the market value of the fulltext PDFs (which is probably much less than $29B). If the full data of this research were available I suspect its value would be much more than $29B.

So I have lots of questions and hope that PubMed, Heather and others can answer them:

  • what does $29B represent?
  • will PubMed require the deposition of data (e.g. crystal structures, spectra, gels, etc.)?
  • if not, will PubMed encourage deposition?
  • if not, will PubMed support deposition?
  • if not, what are we going to do about it?

So, while Cinderella_Open_Access may be going to the ball is Cinderella_Open_Data still sitting by the ashes hoping that she’ll get a few leftovers from the party?

Update on Open crystallography

Saturday, December 22nd, 2007

There’s now a growing movement to publish crystallography directly into the Open. Several threads include:

… so it was no great surprise when Jean Claude blogged:

X-Ray Crystallography Collaborator

20:41 20/12/2007, Jean-Claude Bradley, Useful Chemistry

We have another collaborator who is comfortable with working openly: Matthias Zeller from Youngstown State University.

With the fastest turnaround for any crystal structure analysis I’ve ever submitted, we now have the structure for the Ugi product UC-150D. For a nice picture of the crystals see here.

PMR: J-C also mailed us and asked how he could archive and disseminate the crystallography. So here’s a rough overview.

Crystallography is a microcosm of chemistry and we encounter many different challenges:

  • not all structures are Open (some not initially, some never). Managing the differential access is harder than it looks. It has to be owned by the Department or Institution. So you probably need access control, and probably an embargo system.
  • Institutional repositories are not generally oriented towards data. Some may, indeed, only accept “fulltext”. So there may be nowhere obvious to go.
  • The raw data (CIF) contains metadata, but not in a form where search engines can find it. That’s an important part of what SPECTRa does – it extracts metadata and repurposes it.
  • The CIF can, but almost universally does not, contain chemical metadata. So part of JUMBO is devoted to trying to extract chemistry out of atomic positions.  Needs a fair amount of heuristic code.
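The heuristics for extracting chemistry from atomic positions are of roughly this shape: infer a bond wherever two atoms sit closer than the sum of their covalent radii plus a tolerance. A toy sketch (radii are approximate textbook values; the real JUMBO rules are considerably more elaborate):

```python
import math

# Approximate covalent radii in angstroms (textbook values, illustrative).
COVALENT_RADIUS = {"C": 0.76, "O": 0.66, "H": 0.31}

def perceive_bonds(atoms, tol=0.4):
    """atoms: list of (element, (x, y, z)) in angstroms.
    Return index pairs whose separation suggests a covalent bond."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (ei, pi), (ej, pj) = atoms[i], atoms[j]
            if math.dist(pi, pj) <= COVALENT_RADIUS[ei] + COVALENT_RADIUS[ej] + tol:
                bonds.append((i, j))
    return bonds

# A water molecule: two O-H bonds should be found, no H-H bond.
water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
```

Even this toy version shows why the code is heuristic: the tolerance is a judgement call, and disorder or unusual geometries defeat simple distance cutoffs.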

So in conjunction with eChemistry and eCrystals and in the momentum of SPECTRa we are continuing to develop software for crystallographic repositories. There are several reasons why people want such repositories:

  • as a high-quality lab companion – somewhere to put your data and get it back later.
  • as somewhere to provide knowledge for data-driven science (e.g. CrystalEye)
  • as somewhere to save your data for publication and dissemination
  • as somewhere to archive your data for posterity (e.g. an IR)

These put different stresses on the software, so Jim and I are developing context-independent tools that can be used in any of them. I’m hacking the JUMBO software (CrystalTool) and he is hacking CrystalEye so it becomes a true repository.

This is our relaxation over the holiday.


Open Access Data, Open Data Commons PDDL and CCZero

Monday, December 17th, 2007

This is great news. We now have a widely agreed protocol for Open Data, channelled through Science Commons but with great input from several sources including Talis and the Open Knowledge Foundation. Here is the OKFN report (I also got a mail from Paul Miller of Talis without a clear link to a webpage).


This means that the vast majority of scientists can simply add CCZero to their data. I shall do this from now on. Although I am sure that there will be edge cases, they shouldn’t affect ANYTHING in chemistry.

Good news for open data: Protocol for Implementing Open Access Data, Open Data Commons PDDL and CCZero

15:21 17/12/2007, Jonathan Gray, external, news, okf, open access, open data, open geodata, open knowledge definition, Open Knowledge Foundation Weblog

Last night Science Commons announced the release of the Protocol for Implementing Open Access Data:

The Protocol is a method for ensuring that scientific databases can be legally integrated with one another. The Protocol is built on the public domain status of data in many countries (including the United States) and provides legal certainty to both data deposit and data use. The protocol is not a license or legal tool in itself, but instead a methodology for a) creating such legal tools and b) marking data already in the public domain for machine-assisted discovery.

As well as working closely with the Open Knowledge Foundation, Talis and Jordan Hatcher, Science Commons have spent the last year consulting widely with international geospatial and biodiversity scientific communities. They’ve also made sure that the protocol is conformant with the Open Knowledge Definition:

We are also pleased to announce that the Open Knowledge Foundation has certified the Protocol as conforming to the Open Knowledge Definition. We think it’s important to avoid legal fragmentation at the early stages, and that one way to avoid that fragmentation is to work with the existing thought leaders like the OKF.

Also, Jordan Hatcher has just released a draft of the Public Domain Dedication & Licence (PDDL) and an accompanying document on open data community norms. This is also conformant with the Open Knowledge Definition:

The current draft PDDL is compliant with the newly released Science Commons draft protocol for the “Open Access Data Mark” and with the Open Knowledge Foundation’s Open Definition.

Furthermore Creative Commons have recently made public a new protocol called CCZero which will be released in January. CCZero will allow people:

(a) ASSERT that a work has no legal restrictions attached to it, OR
(b) WAIVE any rights associated with a work so it has no legal restrictions attached to it, and
(c) “SIGN” the assertion or waiver.

All of this is fantastic news for open data!


Thursday, December 13th, 2007

Last spring I visited Illinois (UIUC) and presented the SPECTRa tools. Scott Wilson, who runs the crystallographic facility, and many of the LIS community were keen to see how they could be used for capturing their crystallography. Yesterday I met Sarah Shreeve at the DCC conference and she told me that they have now budgeted to install a SPECTRa system. This is great – Jim Downing and I will be discussing the technical details – but we’ll be hoping to have some more news RSN.

If anyone else at DCC is interested in SPECTRa  for ingesting crystallography, spectroscopy or compchem, catch me at coffee – I’m around till Saturday.

Scraping HTML

Monday, December 3rd, 2007

As we have mentioned earlier, we are looking at how experimental data can be extracted from web sources. There is a rough scale of feasibility:


I have been looking at several sites which produce chemical information (more later). One exposes SDF (a legacy ASCII file of molecular structures and data). The others all expose HTML. This is infinitely better than PDF, BUT…

I had not realised how awful it can be. The problems include:

  • encodings. If any characters outside the printable ASCII range (32-127) are used they will almost certainly cause problems. Few sites declare an encoding, and even if they do the interconversion is not necessarily trivial.
  • symbols. Many sites use “smart quotes” for quotes. These are outside the ASCII range and almost invariably cause problems. The author can be slightly forgiven since many tools (including WordPress) convert to smart quotes (“ ”) automatically. Even worse is the use of “mdash” (—) for “minus” in numerical values. This can be transformed into a “?” or a block character, or even lost. Dropping a minus sign can cause crashes and death. (We also find papers in Word where the numbers are in symbol fonts and get converted to whatever or deleted.)
  • non-HTML tags. Some tools make up their own tags (e.g. I found <startfornow>) and these can cause HTMLTidy to fail.
  • non-well-formed HTML. Although there are acceptable ways of doing this (e.g. “br” can miss out the end tag) there are many that are not interpretable. The use of <p> to separate paragraphs rather than contain them is very bad style.
  • javascript, php, etc. Hopefully it can be ignored. But often it can’t.
  • linear structure rather than groupings. Sections can be created with the “div” tag but many pages assume that a bold heading (h2) is the right way to declare a section. This may be obvious when humans read it, but it causes great problems for machines – it is difficult to know when something finishes.
  • variable markup. For a long-established web resource – even where pages are autogenerated – the markup tends to evolve and it may be difficult to find a single approach to understanding it.  This is also true of multi-author sites where there is no clear specification for the markup – Wikipedia is a good example of this.

As a result it is not usually possible to extract all the information from HTML pages and precision and recall both fall well short of 100%. The only real solution is to persuade people to create machine-friendly pages based on RSS, RDF, XML and related technology. This solves 90% of the above problems. That’s why we are looking very closely at Jim Downing’s approach of using Atom Publishing Protocol for web sites.
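The heading-as-section problem, at least, is mechanically tractable. A sketch using only the standard library, grouping a flat stream of text under the preceding h2 (real pages need far more defensive handling than this):

```python
from html.parser import HTMLParser

class Sectionizer(HTMLParser):
    """Group a flat HTML stream into [heading, texts] sections split on <h2>."""
    def __init__(self):
        super().__init__()
        self.sections = []      # list of [heading, [text chunks]]
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.sections.append(["", []])

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if not self.sections:
            return              # ignore text before the first heading
        if self._in_h2:
            self.sections[-1][0] += data
        elif data.strip():
            self.sections[-1][1].append(data.strip())

page = "<h2>Synthesis</h2><p>Step one.</p><b>Yield</b> 80%<h2>Analysis</h2><p>NMR data.</p>"
parser = Sectionizer()
parser.feed(page)
```

This recovers the implicit structure of one page style; the "variable markup" problem above means each site, and often each era of the same site, needs its own variant.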

Survey of open chemistry in Chemistry World

Monday, December 3rd, 2007

Richard Van Noorden has written a balanced and informative view of Open Chemistry (Surfing Web2O, Chemistry World, December 2007). He has read much of the chemistry blogosphere and talked with many of us on the phone. The article highlights the opportunities and the frustrations. Here is a brief excerpt:

The rapid evolution of the world wide web is creating fresh opportunities – and challenges – for chemistry….

  • The internet is becoming flooded with free chemical information: from blogs to videos and databases
  • Linking this data together and interacting via the ‘social web’ could revolutionise the practice and teaching of chemistry
  • So-called ‘Open Chemistry’ faces many challenges: not least maintaining data quality and co-existing with trusted subscription databases…

PMR: I think we are beginning to see some movement. The dam is built of sand and trickles are appearing. Some of us are encouraging this and at some stage it must burst.


We are going to need a new technology. Structured databases and portals will start to disappear and semi-structured collections of data (repositories) and people (collaboratories) will grow. There is a lot of interest from outside chemistry. Although chemistry per se is not interested in communal resources there is a big demand in bioscience and we shall get a strong “piggy-back” on the work happening there in text-mining, ontologies and semantic web. We’ll also see the push from repositories in academia and since chemistry is technically one of the easiest places to start, we expect to “leverage” this  [an unhappy verb].


Chemspider and Pubchem – open data

Friday, November 30th, 2007

I was very pleased to see:

ChemSpider Blog » Blog Archive » The Entire ChemSpider Database is On Its Way to PubChem!

which describes how the Chemspider database is being offered to Pubchem as “open data”. Chemspiderman has made a valuable attempt to navigate the complexities of Open Data and recursive licences. It is technically difficult and takes us into unknown territory. For a start it is difficult to describe what the final object is. I understand Pubchem as a collection of links coupled to authority – i.e. Pubchem holds links to the Chemspider compounds but does not actually hold the data. (I am not aware that Pubchem holds any data other than a fairly small amount of computed data (e.g. number of rotatable bonds) and names.) It does, of course, hold the data that NIH collects through the roadmap program. But I’d be happy to be corrected.

Chemspider repeats my suggestions for criteria for Open Data and adds:

CS: For right now I am giving up on trying to track where Open Data might end up. Based on my previous discussions with Peter Suber regarding navigating the complexities of Open Access definitions, I understand there is a need to define our own policies. I’m not going to do that here but what I will be clear with is that once the ChemSpider structure set is deposited in PubChem then we are at the mercies of THEIR data sharing policies. I believe Peter [PMR, not sure which Peter - but if me, see below] holds up PubChem as the primary example of Open Data (but maybe not). So, I believe it should be true to say that the ChemSpider structure set IS Open Data when accessed/downloaded/shared from PubChem. But I understand that will then be the PubChem data set and all association with us will likely be lost. But that is fully acceptable!

PMR: This shows the complexities. We will need to see how the data actually end up in Pubchem. But at present Pubchem holds only links to authorities. Thus if I search for aspirin I get 61 suppliers of information (search result), each entry in which links back to the supplier’s site. So any “data” (e.g. melting point) is not in Pubchem. Unless Chemspider is different, I would expect that only the links would be held in Pubchem. If I am right, then accessing Chemspider through Pubchem is simply another way of accessing Chemspider.

In a comment Rich Apodaca says:

Regardless of how exactly linkage occurs, the end result would be that any third party could, independently of ChemSpider, reconstruct the entire ChemSpider compound database. By using the ChemSpider Web APIs, they could develop a parallel service that re-processes the ChemSpider analytical data and patent/primary literature data, possibly mashing up the data from other sources as well.

This sets the bar very high for Open data in chemistry. I’m not sure what to call it, but it’s a game-changer.

If Chemspider allows the direct download and re-use of their data from their site then I also congratulate them. This is completely independent of whether the entries are linked from Pubchem. However it will be necessary to add a licence statement to the Chemspider pages (not Pubchem) making this clear.

It may be picky but I don’t think that Pubchem – in common with many other bioscience sites – actually gives explicit permission for re-use. Agreed that it is a work of the US government so should be free of copyright. There is an unspoken tradition in bioscience that data and collections are “Open” in some way but it isn’t well spelt out.

It should be.