Monthly Archives: May 2008

What I said to the NIH

The NIH has asked for public comments on its access policy. 150+ people and organisations have responded. Almost no-one has said anything about text/data mining.

So I have:

1. Do you have recommendations for alternative implementation approaches to those already reflected in the NIH Public Access Policy?

There is an urgent requirement in bioscience to use machines to extract information from the full-text of papers ("text-mining" and "data-mining"). Examples of this use are the machine-assisted annotation of genomes, the extraction of concepts from text and the linking of information from many different disciplines. In my own field of molecular informatics it is possible to scan a million PubMed abstracts a day and extract mentions of new chemical compounds of biological interest. It is now well known that abstracts alone do not give sufficient information and that access to the full-text is required.

Many publications are accompanied by data, and indeed for many of these (e.g. about sequences and structures of biomolecules) the data are often more important than the fulltext. Although the STM publishers have urged their members to regard data as facts and therefore free of copyright, several publishers label data as copyright, thus effectively barring the legitimate re-use of data. It is important that the NIH challenges this and forbids it on PMC.

Many data are embedded in the full text and can be extracted by machines ("text-mining"). This process is made more tractable if the text is available in XML form (including XHTML) and I support the use of these formats.

"Text-mining" and "data-mining" are hardly mentioned - if at all - in the NIH's description and requirements. I would therefore wish to see a positive indication that the NIH supports the re-use of the material, in high-throughput mode.


3. In addition to the information already posted at, what additional information, training or communications related to the NIH Public Access Policy would be helpful to you?

The information provided gives users very little positive indication that they can legitimately re-use the material published on PMC. I write a blog on Open Access and Open Data and the informed opinion there was that PMC does not allow data- or text-mining and that attempts to do this will result in the NIH server cutting off access to the given IP. The words "fair use" are useless. In practice no scientist has enough knowledge of case law to know what is and is not fair use and the term effectively frightens many into "no use".

I would urge that the NIH make clear what their policy on data- and text-mining is, using those terms. I would also suggest that the NIH add machine-readable versions of licences or similar documents so that robots are aware of what they may and may not do.

4. Do you have other comments related to the NIH Public Access Policy?

I am a user of the material available on the NIH sites, including PubChem, and PubMed. The volume of information is now so great that machines are essential to use it properly. I believe it is essential for the NIH to enable text/data-mining of its information if it is to recoup the maximum value of its research investment.

Comments on the NIH policy

I knew that the NIH had solicited comments on its publication mandate policy and that I had plenty of time to think about it. Now it's urgent. Here's Peter Suber:

03:38 30/05/2008, Peter Suber,

Public comments on the OA mandate at the NIH are due by 5:00 pm (Eastern Standard Time), Saturday, May 31, 2008, less than two days from now.

Submit your comments through the NIH web form.  But before you do, see some of the comments already submitted.  The pro-OA comments will give you ideas, and the anti-OA comments will show you what objections to answer and what perspective might predominate if you don't send in your own.

This time the NIH wants separate answers to four separate questions.  The web form has four separate spaces for them:

  1. Do you have recommendations for alternative implementation approaches to those already reflected in the NIH Public Access Policy?
  2. In light of the change in law that makes NIH’s public access policy mandatory, do you have recommendations for monitoring and ensuring compliance with the NIH Public Access Policy?
  3. In addition to the information already posted [here], what additional information, training or communications related to the NIH Public Access Policy would be helpful to you?
  4. Do you have other comments related to the NIH Public Access Policy?

If you're thinking that the NIH just concluded a round of public comments for its March 20 meeting, you're right.  See the comments generated by that round (and my blog post on them).  One persistent publisher objection is that the policy has not been sufficiently vetted and one purpose of the new round no doubt is to give the stakeholders one more chance to speak.  We must use it.  Publishers will.

Please submit a comment and spread the word. Even if you have no suggestions to improve the policy, it's important to express your support.

PMR: I'd like to comment, but it's not very clear whether it's US-only.  I'm going to look through the current comments (which are all public) and see whether there's anything non-US. Any help will be valuable.

Assuming I am eligible I'll say something about the critical need to allow access to data. Firstly the NIH could state that data in the papers is factual and therefore not subject to copyright. A public statement of this would be very valuable. Secondly that it will not impose server-side limitations on robotic downloads. It's the responsibility of the reader not to break copyright. Thirdly I will encourage the NIH to require the submission of data as well as free text.

So I'd welcome feedback and also urge anyone who is qualified to post. If that's everyone, then everyone should post. From what I can see they could use a few more...

Robots can detect error; but images MUST be Open

Here is an extremely compelling reason why data - including images and graphs in papers - MUST be regarded as Open Data, not as publisher-owned copyright. Simply, many researchers "doctor" their data. Not "most", but "many". The intent may be simply to make the data look "better" or they may be deliberately falsifying their data to obtain a prestigious publication. Or somewhere in between - the research is sloppy, with unclear results and will "benefit" from enhanced data. Here's a brief excerpt, but read the article. I'll comment later and make a proposal to publishers...

OA enhances error correction

Jeffrey Young, Journals Find Fakery in Many Images Submitted to Support Research, Chronicle of Higher Education, May 29, 2008.

...As computer programs make images easier than ever to manipulate, editors at a growing number of scientific publications are turning into image detectives, examining figures to test their authenticity.

And the level of tampering they find is alarming....

One new check on science images, though, is the blogosphere. As more papers are published in open-access journals, an informal group of watchdogs has emerged online.

"There's a lot of folks who in their idle moments just take a good look at some figures randomly," says John E. Dahlberg, director of the division of investigative oversight at the Office of Research Integrity [at the US Department of Health and Human Services, which includes the NIH]. "We get allegations almost weekly involving people picking up problems with figures in grant applications or papers."

Such online watchdogs were among those who first identified problems with images and other data in a cloning paper published in Science by Woo Suk Hwang, a South Korean researcher. The research was eventually found to be fraudulent, and the journal retracted the paper....

PMR: A typical example is a gel, used to show how many proteins or nucleic acids you have got and how pure they are. I've taken this from Wikipedia:


This is a good gel - the bands are parallel and there are no thumb prints. Many gels don't "run straight" so it's tempting for the author to "straighten" them in Photoshop or similar.

Robots can detect this. Better than humans.

We have software in our group that can detect errors in chemistry. In graphs, molecular structures, text, etc. It would be fairly straightforward to download all the world's published chemistry and check it for errors. Note that in chemistry errors are mainly due to human error rather than fraud. However we find an awful lot. In closed access journals. (That's most of them as most chemistry is closed). Our robots did a check on a journal issue from a prestigious chemical publisher and found an error in almost every article. Some were trivial - missing punctuation, some were spelling errors. Some were serious. In a recent article we've been looking at, 30% of the chemical names are seriously wrong. Our robots find that sort of thing. Some of the chemical formulae are wrong. Some of the molecular masses are wrong.
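To give a flavour of how mechanical one such check can be, here is a minimal Python sketch that compares a reported molecular mass with the mass implied by the formula. It is an illustration only and not how OSCAR works: the function names, the small element table and the 0.5 tolerance are all assumptions made for this example, and it only handles simple formulae without brackets, charges or isotopes.

```python
import re

# Rounded standard atomic masses for a few common elements (illustrative subset;
# a real checker would need the whole periodic table).
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "F": 18.998,
    "P": 30.974, "S": 32.06, "Cl": 35.45, "Br": 79.904,
}

def molecular_mass(formula):
    """Sum atomic masses for a simple formula such as 'C6H12O6'."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Refuse anything the simple pattern did not fully account for (brackets, charges...).
    if "".join(symbol + count for symbol, count in tokens) != formula:
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    return sum(ATOMIC_MASS[symbol] * (int(count) if count else 1)
               for symbol, count in tokens)

def mass_is_consistent(formula, reported_mass, tolerance=0.5):
    """True if the reported mass is within `tolerance` of the computed mass."""
    return abs(molecular_mass(formula) - reported_mass) <= tolerance

if __name__ == "__main__":
    # Hypothetical (formula, reported mass) pairs, as they might be lifted from a paper.
    for formula, reported in [("C6H12O6", 180.16), ("C6H12O6", 190.2)]:
        verdict = "ok" if mass_is_consistent(formula, reported) else "CHECK"
        print(formula, reported, verdict)
```

An element missing from the table raises a `KeyError`, which is deliberate here: a robot should flag what it cannot check rather than guess.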

I think this matters. I think the editors of the journal would agree. I think the publishers would not like to have incorrect science in their journal.

We can do this at near-zero cost. Some of our methods (OSCAR) are well tested, others are still alpha. They don't work on all papers. But we'd like to see if we can detect errors. (And it's not just error detection, but adding a very significant degree of semantics - I showed some of our automatic thesis-eating robot last week at the RSC).

The only thing holding us back is that we may be accused of stealing content. We shan't do this. We would download the articles, scan them for data and delete the articles. We wouldn't sell them to non-subscribers. We shan't post them on the web-site (though we would post the extracted data).

So this is a genuine offer to publishers. We are interested in seeing whether we can extract data and detect errors in publications. Last week all the publishers at the RSC agreed that facts weren't copyright and they agreed that scientific images and artwork should not be regarded as being protected by publisher copyright.

So I'd like feedback from publishers. The time has come when robotic extraction and analysis of data makes sense.

And if a publisher forbids the automatic analysis of scientific data for error detection, and defends this through copyright or server-side police, is that advancing the cause of science? Wouldn't it make sense to make all the data Open?


Where should we get our computing?

Three independent events have made me re-ask the question - where do we get our computing from? They are:

  • A visit to the Barcelona Supercomputing Center. We had a personal tour during the COST D37 visit - thanks very much to all involved. It's a splendid sight - a huge glass box occupying most of the interior of a chapel, with racks of blades (some 2500 I think). And the lovely coloured wiring of the fibre, electric, coolant, etc.
  • I'm chair of the Computer Services Committee in the Department. We're quite a federated organization and that means there are several server rooms. All of them require power. All of them require space. All of them require cooling. All of them need backing up. All of them chop and change as the kit wears out and my colleagues get new grants. All of them need connecting to the network. We have an excellent group of Computer Officers so I don't have to think about them but it's a lot of work and a lot of money.
  • And we have a High Performance Computing facility in the University. It got into the top something-or-other for size or power or ... I'm not sure what the finances are (well I'm not going to blog them) but we are urged to consider it as a primary resource.

And today Jim sent me a recent critique of HPC: HPC Considered Harmful. I have some sympathy with these views (like "Making sure their programs produce correct answers"). So why am I not enthusiastic about HPC?

HPC comes out of the "big science" tradition. CERN, NASA, etc. Where there are teams of engineers on hand to solve problems. Where there are managed projects with project managers. Where there are career staff to support the facilities. Where there are facilities.

Chemistry is long-tail science. Where the unit of allegiance is the lab. There are certainly problems which actively require large machines with large memory. But they often hit the problems of scale. You don't usually get sixteen times as much power by building a machine sixteen times as big. OK, you don't always get sixteen times more science with sixteen times more graduate students.
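One standard way to put numbers on the "sixteen times as big" point is Amdahl's law, which is not mentioned in the critique above and is brought in here purely as an illustration. It assumes a job splits into a strictly serial part and a perfectly parallel part; the function name and the example fractions below are made up for the sketch.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: overall speedup when only `parallel_fraction` of the
    work can be spread evenly across `n_processors`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

if __name__ == "__main__":
    # How much faster does a machine sixteen times as big actually run the job?
    for fraction in (0.5, 0.9, 0.99):
        speedup = amdahl_speedup(fraction, 16)
        print(f"{fraction:.0%} parallel on 16 processors: {speedup:.1f}x")
```

Under these assumptions a job that is 90% parallel gets about 6.4 times faster on sixteen processors, not sixteen times. Real codes also pay for communication and memory contention, which this simple model ignores.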

The Australian e-research effort identified four potential bottlenecks:

  • cpu
  • bandwidth
  • storage
  • data

and concluded that the biggest bottleneck was data.

I'd agree.  Often the primary problem is that we don't have data. That's what much of the blog is about. And, at the other end it's often much easier to produce simulated data than to use it.

So who knows how to manage large-scale computing? The large companies. Amazon, etc. COST D37 had a talk from (I think) Rod Jones at CERN who said that in building computing systems he always benchmarked the costs against Amazon.

I'm certainly looking in that sort of direction.

Open Source Drug Discovery and Closed Access

I got a mail this morning about an article in Cell:


India Takes an Open Source Approach to Drug Discovery

Open source software may have been around for 17 years, but using an open source model to speed up drug discovery is a relatively new idea. This month, India is launching a new open source initiative for developing drugs to treat diseases such as tuberculosis, malaria, and HIV.

Seema Singh

Bangalore, India

Available online 17 April 2008.
PMR: This mentioned many of the things that we have been interested in, in the Open Science and Blue Obelisk communities. A lot of the people we collaborate with - Jean-Claude Bradley, Rajarshi Guha, Matt Todd - are mentioned. This is really exciting. I'd love to tell you about it. It's great.
But I can't. The article is closed access. I couldn't read it at home. If I post a copy of it from Cambridge I will be pursued by Elsevier lawyers.
So here is a really important development. A substantial investment in new ways of developing drugs (and we need new ways, because humans are very poor at discovering drugs and machines are even worse).
If you want to read the article and you don't belong to a rich university it will cost you 31 USD. It's just over 2 pages long.
I suspect that most of the scientists in India who might wish to read this won't have access...
So this is the choice that we need to make...

Institutional Repositories: Caveat Roach

Dorothea Salo at the University of Wisconsin is involved with the Institutional Repository - she is outspoken about librarianship and repositories so I hesitate to label her with either label. On her blog (Caveat Lector) she has created a fictional campus (University of Achaea, with Dr Helen Troia as leading character, who inter alia researches and teaches in Basketology - so maybe CavLec is best described as a basketologist). She occasionally comments on my blog and recently made a reference to "Roach Motel" - a common phrase on her blog. In the UK a roach is a fish but I know that in the US it's a cockroach (inter alia), but I didn't understand "Roach Motel".

It turns out (thank you Wikipedia) that Roach Motel is a trap for cockroaches with the slogan

"Roaches check in -- but they don't check out!"

The analogy is that many things are checked into institutional repositories but few are checked out. She has formally expanded this view under the general theme that IRs aren't working and put the preprint in her own (or rather HPOW's in library-speak) IR:

Please use this identifier to cite or link to this item:

Title: Innkeeper at the Roach Motel
Authors: Salo, Dorothea
Keywords: institutional repositories; open access
Issue Date: 11-Dec-2007
Citation: Salo, Dorothea. "Innkeeper at the Roach Motel." Library Trends 57:2 (Fall 2008).
Abstract: Trapped by faculty apathy and library uncertainty, institutional repositories face a crossroads: adapt or die. The “build it and they will come” proposition has been decisively proven wrong. Citation advantages and preservation have not attracted faculty participants, though current-generation software and services offer faculty little else. Academic librarianship has not supported repositories or their managers. Most libraries consistently under-resource and understaff repositories, further worsening the participation gap. Software and services are wildly out of touch with faculty needs and the realities of repository management. These problems are not insoluble, but they demand serious reconsideration of repository missions, goals, and means.
Appears in Collections: General Library Collection

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

PMR: I'll comment on the content below but my immediate comment on the meta-data is that I have no indication of my rights to this information. It's Openly visible and, I assume, permanently visible so it qualifies for OA1 (==weak OA, I think). So we can label it Open Access. But I don't know whether she has transferred copyright to the publisher and - as I don't read m/any journals in this area - I don't know whether the journal has any Open Access policy. The mantra about DSpace copyright shows typical librarian paranoia - if we don't know what the copyright is, put on the strongest protection possible so we won't get into trouble. So, if I were a basketologist, I would be frightened off using this article in a course on Open Access and Repositories. Dorothea is a DSpace expert so I imagine it took her within the Harnad-Carr six minute event horizon for deposition.

Dorothea is very clear that IRs aren't working and libraries have a limited future unless they change. Read the article. I have to say I am in sympathy with this view. I've come to meet many librarians and repositarians over the last two years or so and I wish them well. In general they don't impact on my colleagues' research (I can't comment on teaching). If librarians are to continue in a research environment they have to discover what researchers want and give it to them. Here are some suggestions:

  • Please help me with the mechanical and arcane parts of putting my CV together.
  • Please help me write the grant and give me a higher chance of success. How many institutions have any idea why their grants fail?
  • I spend hours preparing papers for publication - please do the boring bits for me
  • I've just got this collaboration with ... - and we need an electronic collaborative environment
  • I've just lost all my data.
  • I've lost all my data again
  • I don't want to lose my data a third time

These are the sorts of things researchers want. Maybe IRs can help with some of them. But not yet as we know them.

(In fairness JISC has several projects addressing some of these. But they don't yet scale).

Files in This Item:

File             Size    Format
RoachMotel.pdf   211Kb   Adobe PDF
RoachMotel.doc   248Kb   Microsoft Word

Visualising Open Knowledge

I had hoped to get to the OKF's Visualisation Workshop but had problems on the Continent so had to miss it. Here's a report:

14:47 26/05/2008, Jonathan Gray,

open visualisation workshop

The first Open Visualisation Workshop took place on Saturday as we mentioned last week.

Details, notes and links are available on the event’s wiki page.

The event took place at Trampoline Systems’ new site in East London. To make sure the event was as informal as it was billed to be - we left the schedule open until the day, so we could see what people were interested in doing and plan the workshop accordingly!

After introductions and some brainstorming, we had impromptu talks and demos from:

  • Martin Dittus,
  • Julie Tolmie, Centre for Computing in the Humanities, King’s College London
  • Jan Berkel, Trampoline Systems
  • Jonathan Lister, Osmosoft
  • Gregory Jordan, European Bioinformatics Institute
  • David Aanensen, Division of Epidemiology, Public Health and Primary Care, Imperial College London
  • Jonathan Gray, The Open Knowledge Foundation (me)

Most of the day was occupied by demos and discussions - so we didn’t get around to doing much tinkering with software packages. However, participants said they found it very useful to see people’s work in other fields - and were keen to continue to meet regularly. It was interesting to see how much commonality there existed between visualisation work in very different fields.

Suggestions for future activities included:

  • continuing to build on the list of open source visualisation packages (on the wiki)- possibly including notes, comments and example visualisations from people who have experience using these;
  • domain specific sessions (e.g. visualisation for bioinformatics);
  • shared project to work on, using open source visualisation software to represent an open knowledge package - e.g. using Prefuse to represent data from omdb
  • using different visualisation software to represent the same open dataset - and comparing the results;
  • making very brief screencasts of different visualisation projects with voiceovers from their developers;
  • promoting the open-visualisation mailing list to researchers, developers and practitioners - as participants weren’t aware of any other general mailing list for open-source visualisation technologies;
  • developing a wish-list of features that participants would ideally like to see in open source visualisation software.

It was suggested we have another workshop in June to keep the ball rolling. Nearly everyone there was keen - so we’ve created a doodle page to fix the date.

If you’d like to participate, please:

  1. add your name to the Open Visualisation Workshop wiki page;
  2. select which dates you are free on the doodle page;
  3. sign up to the open-visualisation mailing list.

PMR: I have no idea what the visualisation is actually of - maybe readers would like to guess.

    How Closed Access makes progress difficult

    I'm interested in the nmrdb database and toolkit for NMR spectra. I don't know how long this has been going, but I have only known about it for a day. I used it to predict a spectrum (NMR Prediction through) and now I'd like to know how it did it. I guessed it was based in some way on geometrical models because protons which were topologically equivalent had different chemical shifts and couplings. So I went to the home page and found:

    This page allows to predict the spectrum from the chemical structure based on "Spinus". You may find more information on the authors website.


    PMR: I missed the link to the authors' webpage (the font is small) and thought I would start by looking at the literature references. There should be enough in the abstracts to give me a general idea. [to avoid reproducing the whole abstract - which might go beyond fair use - I have removed the past tense of the verb "to be"]. The first abstract read

    Copyright © 2001 American Chemical Society

    [... authors snipped...]

    Counterpropagation neural networks [...] applied to the fast prediction of 1H NMR chemical shifts of CHn groups in organic compounds. The training set consisted of 744 examples of protons that [...] represented by physicochemical, topological, and geometric descriptors. The selection of descriptors [...] performed by genetic algorithms, and the models obtained were compared to those containing all the descriptors. The best models yielded very good predictions for an independent prediction set of 259 cases (mean absolute error for whole set, 0.25 ppm; mean absolute error for 90% of cases, 0.19 ppm) and for application cases consisting of four natural products recently described. Some stereochemical effects could be correctly predicted. A useful feature of the system resides in its ability to be retrained with a specific data set of compounds if improved predictions for related structures are required.

    PMR: This gives some, but not really enough, information about the method. In particular were 3D coordinates of the molecule generated, and if so how? Were shifts averaged across topologically equivalent protons? So I went to the second abstract (again copyright ACS so the full abstract is not given):

    Feed-forward neural networks [...] trained for the general prediction of 1H NMR chemical shifts of CHn protons in organic compounds in CDCl3. The training set consisted of 744 1H NMR chemical shifts from 120 molecular structures. The method [...] optimized in terms of selected proton descriptors (selection of variables), the number of hidden neurons, and integration of different networks in ensembles. Predictions [...] obtained for an independent test set of 952 cases with a mean average error of 0.29 ppm (0.20 ppm for 90% of the cases). The results [...]significantly better than those obtained with counterpropagation neural networks.

    PMR: Still nowhere near enough information. Now I'm at home (it's a public holiday in UK) and I'm watching the cricket (which is absorbing). Although I could get a login to Cambridge library I choose not to as it gives me the position of a second-class citizen impoverished through closed access. So maybe the authors have bought ACS "Author Choice". Unfortunately not. I will have to pay 2 * 25 USD for access to these articles. And the access only lasts for 48 hours. These are the sort of papers that can't easily be fully digested in 2 days. I'm not sure what happens if I keep copies on my hard disk - I expect that it bursts into flame like Mission Impossible and I daren't take the risk. BTW I can see no possible point in restricting access to 48 hours - and it's yet another indication of the publisher treating the scientific community solely as a source of revenue. Maybe the ACS staff who read this article will enlighten us.

    The point is that this type of procedure - however necessary or not to the survival of the publisher - causes great problems to the community. So the price of preserving a reader-pays publishing economy is to slow down science, encourage many scientists to avoid reading the literature and generally to reduce the coherence of the scientific process. I, for example, am unlikely ever to read these papers now.

    [NMRDB note. The link I missed explained that the prediction was based on CORINA structures. I am still unclear about nmrdb generally - does it have spectra? how many? are they freely available? etc. The web site says very little.]

    Green OA and Open Data - more

    Peter Suber has responded very quickly to my clarification of the connection - or lack of it - between Green OA and Open Data. He has provided some very useful additional information, and I think we are in more or less complete agreement. I reproduce his response and then comment further. (Note that this has nothing to do with the strong/weak OA discussion of 2-3 weeks ago). I'll start by saying that the word "irrelevant" was probably a poor choice and I'll try to choose another one below.

    Green OA and open data

    [PMR response snipped...]

    PS Comments

    • First, I generally agree with PMR's opening characterization of green OA. I'd only add that we should distinguish green OA itself from the strategy proposal (which I do not endorse) to slow down on the pursuit of open data until we succeed with open texts. As usual, I think we should proceed on all fronts at once. I generally agree as well with PMR's understanding of the state of open data in OA repositories. But in describing this state, I'd put the accent in a different place.
    • It's true that most OA repositories today are optimized for texts and not optimized for data. It's also true that few institutions (universities, funders, publishers) encourage or require the deposit of data files in repositories. Finally, it's true that most OA repositories will accept data files, even if few researchers are depositing data files. With this background, my response reduces to two quick points:
      1. First, it doesn't follow that green OA is "irrelevant" for open data, merely that we are under-using the opportunities it provides for open data. We shouldn't confuse researcher practices or institutional policies with repository capacities or green OA. If under-using an opportunity made it irrelevant, then conservation would be irrelevant to climate change and green OA would be irrelevant even to text files.
      2. Second, we have a long way to go to make most repositories as useful for data files as they are for text files. But it doesn't follow that green OA is irrelevant or harmful for open data, merely that its capacity to help users do useful work with OA data files must continue evolving.

    There are many projects trying, in many different ways, to make green OA even more relevant and useful for data than it is now, e.g. by increasing data deposits in repositories and allowing fuller use of data already on deposit. For example, see ASSDA (from ANU), CESSDA (from NSD), Commons of Geographic Data (from the U of Maine), DANS (from the Royal Netherlands Academy of Arts and Sciences), LEAP (from AHDS), LinkingOpenData (from W3C), Pangaea (from a coalition of German research institutions), and StORe (from JISC).

    PMR: I agree with all this and thank PS for the list of institutions encouraging data. I'll try to rephrase:

    • CC-BY and BBB-compliant OA necessarily support Open Data. This is a logical coupling. If you publish an article in this way you automatically give the world permission to re-use it (and, by implication, its associated files).
    • "strong OA" *may* go some way to supporting Open Data. It is possible that some licences will allow re-use. Note that Non-commercial use is incompatible with Open Data and Open Knowledge as currently defined, and it is possible that some "strongOA" sites offer full removal of permission barriers. Even without full removal, the removal of *some* barriers at least points in the right direction and alerts the community (author/reader/repositarian) to the fact that there are barriers and that some people want to remove them. It's not logically coupled, but there may be some empathy with Open Data. It's also possible (though I have no evidence) that strong OA offerings are more conscious of the non-copyrightability of factual data and the value of adding supplemental files to web sites.
    • gold OA is often paid for. Gold OA may remove none, some or all permission barriers. Only the latter leads to Open Data. Since Gold OA requires the author or the funder to pay money, both of them should think hard about what they are getting. PeterS has often noted that certain agreements give authors and readers very little more than what GreenOA offers, but that the author has paid a lot of money. An example is that some publishers expose fulltexts after an initial period (say 2 years) after which GoldOA and Green OA lose any advantage over the freely accessible text.
    • hybrid offerings. I think I have a special concern about hybrid publications (where a joyrnal can contain "OA" and "not-OA". There was a rash of these last year and they did not impress me - poor value for money, poor clarity of presentation, little additional exposure for authors, poor quality of access to data, etc. It's not surprisng that - at least in chemistry - there has been almost no take up. I reviewed a random number of these offerings in this blog and felt that funders were paying a lot for relatively little. Admittedly I was taking BBB as a baseline and downgrading anything weaker. In principle a hybrid offering could offer almost nothing other than visibility and charge a lot. The main value that hybrids have over green OA is that they are discoverable on the publishers site which may be important.
    • green OA. Although logically there is no reason that data cannot be deposited, green OA offers no logical and little social encouragement to do so. If, tomorrow, the whole world had adopted green OA but re-use was strictly forbidden then it would be little use to data-rich sciences. That is a caricature, but there is no doubt that BBB - with its insistence on re-use - is of much greater practical value in many subjects.

    So, to rephrase:

    Open Access and Open Data are logically coupled if and only if the particular flavour and expression of OA requires the removal of all permission barriers. Many people and institutions may indeed jointly promote OA and OD where there is no logical connection; however anything less than BBB may only encourage, not demand Open Data. Apart from BBB-compliance, Open Access has been largely decoupled from the requirement to honour factual data as free from copyright and other permissions.

    Open Data cannot and should not rely on progress in Open Access to promote its cause. And, indeed, there may initially be cases where we can insist on Open Data for supplemental data, and for data embedded in fulltext, without being able to achieve any remission of permission barriers on the full text itself.

    NMR Prediction through

    Chemspider recently reported (NMR Prediction Now Available Via ChemSpider) a new NMR service. Our OSCAR tool is able to extract large numbers of NMR peak lists, so I thought I'd try it out. Since our new software can also extract structures we can create files for input into nmrdb. It seems necessary to use MOL file input and get the answer in the inline format used in many journals (this is referred to as "ACS" but I'm not sure whether this is actually specified by the ACS). So I took the first molecule from one of our theses, used JUMBO to convert it to a MOL file (with 2-D coordinates) and input it manually.

    Here's the result:

    7.254 (4, 1H, dddd, J=7.718, J=7.710, J=1.682, J=1.681), 7.473 (5, 1H, dddd, J=7.874, J=7.710, J=5.054, J=1.760), 7.278 (6, 1H, dddd, J=7.723, J=7.718, J=5.199, J=1.760), 7.304 (7, 1H, dddd, J=7.723, J=5.054, J=1.682, J=1.521), 7.304 (9, 1H, dddd, J=7.874, J=5.199, J=1.681, J=1.521), 3.039 (11, 1H), 3.040 (11, 1H), 3.015 (13, 1H, ddd, J=12.296, J=9.880, J=1.750), 3.289 (13, 1H, ddd, J=12.296, J=4.300, J=1.760), 1.823 (14, 1H, dtt, J=13.309, J=4.300, J=1.750), 1.804 (14, 1H, dtt, J=13.309, J=9.880, J=1.760), 3.289 (15, 1H, ddd, J=12.296, J=4.300, J=1.760), 3.015 (15, 1H, ddd, J=12.296, J=9.880, J=1.750), 2.187 (16, 3H)
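For machine use, an inline peak list like the one above can be split into structured records. The following is only a minimal sketch in Python, not part of OSCAR, JUMBO or the prediction service: the function name `parse_peaks` and the record keys are hypothetical, and it assumes each peak has the shape `shift (label, nH, multiplicity, J=..., ...)` with the multiplicity and couplings optional.

```python
import re

# One peak looks like "7.254 (4, 1H, dddd, J=7.718, J=7.710)".
# Assumption: the shift is a decimal number followed by a parenthesised field list.
PEAK_RE = re.compile(r"(\d+\.\d+)\s*\(([^)]*)\)")


def parse_peaks(text):
    """Parse an inline peak list into a list of dicts (hypothetical record layout)."""
    peaks = []
    for shift, body in PEAK_RE.findall(text):
        fields = [f.strip() for f in body.split(",")]
        multiplicity = None
        couplings = []
        # Fields after the label and hydrogen count are either "J=..." or a multiplicity.
        for field in fields[2:]:
            if field.startswith("J="):
                couplings.append(float(field[2:]))
            else:
                multiplicity = field
        peaks.append({
            "shift": float(shift),               # predicted chemical shift
            "atom": int(fields[0]),              # numeric label (first field)
            "n_h": int(fields[1].rstrip("H")),   # "1H" -> 1, "3H" -> 3
            "multiplicity": multiplicity,        # e.g. "dddd", or None if absent
            "couplings": couplings,              # J values, unlabeled in the output
        })
    return peaks


sample = ("7.254 (4, 1H, dddd, J=7.718, J=7.710, J=1.682, J=1.681), "
          "3.039 (11, 1H), 2.187 (16, 3H)")
for peak in parse_peaks(sample):
    print(peak)
```

Records of this kind make it straightforward to group predicted peaks by their numeric label before comparing them with experimental values.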

    It appears that the numeric labels refer to the atom to which the H is attached. Since I have the experimental values (which are not all assigned by the author) we can write something like:
    obs 7.20
    7.254 (4, 1H, dddd, J=7.718, J=7.710, J=1.682, J=1.681),

    obs 7.94, 7.40 (2*2H not assigned)
    7.473 (5, 1H, dddd, J=7.874, J=7.710, J=5.054, J=1.760),
    7.278 (6, 1H, dddd, J=7.723, J=7.718, J=5.199, J=1.760),

    7.304 (7, 1H, dddd, J=7.723, J=5.054, J=1.682, J=1.521),
    7.304 (9, 1H, dddd, J=7.874, J=5.199, J=1.681, J=1.521),

    obs 3.15
    3.039 (11, 1H),
    3.040 (11, 1H),

    obs 2.74
    3.015 (13, 1H, ddd, J=12.296, J=9.880, J=1.750),
    3.015 (15, 1H, ddd, J=12.296, J=9.880, J=1.750),
    3.289 (13, 1H, ddd, J=12.296, J=4.300, J=1.760),
    3.289 (15, 1H, ddd, J=12.296, J=4.300, J=1.760),

    obs 1.96
    1.823 (14, 1H, dtt, J=13.309, J=4.300, J=1.750),
    1.804 (14, 1H, dtt, J=13.309, J=9.880, J=1.760),

    obs 1.85
    2.187 (16, 3H)
    There is no exact symmetry in the result (we would expect 13 and 15 to be identical, for example), although there is approximate symmetry. It would have been useful for the machine to label the couplings. Overall the RMS deviation is about 0.3 ppm, which is probably quite useful for high-throughput assignment and for checking for major errors.
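As a rough check on the overall deviation, here is a small Python sketch that pairs each predicted shift with the observed value it is grouped under in the comparison above. The pairing is an assumption on my part: the unassigned aromatic group (obs 7.94, 7.40) is left out, every predicted peak counts once, and the name `rms_deviation` is hypothetical. Other pairings would give a somewhat different number.

```python
import math

# Observed shift -> predicted shifts listed under it in the comparison above.
# The unassigned group (obs 7.94, 7.40) is omitted because no pairing is given.
groups = {
    7.20: [7.254],
    3.15: [3.039, 3.040],
    2.74: [3.015, 3.015, 3.289, 3.289],
    1.96: [1.823, 1.804],
    1.85: [2.187],
}


def rms_deviation(groups):
    """Root-mean-square difference between predicted and observed shifts."""
    squared = [
        (predicted - observed) ** 2
        for observed, predictions in groups.items()
        for predicted in predictions
    ]
    return math.sqrt(sum(squared) / len(squared))


rms = rms_deviation(groups)
print(f"RMS deviation over assigned peaks: {rms:.2f}")
```

Under these assumptions the figure comes out at about 0.31 in the units of the shifts, which is in line with the estimate quoted above.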

    There are similar observations on the Chemspider blog.

    If the service can provide an API, and if it persists, that would be of considerable value.