petermr's blog

A Scientist and the Web


Archive for December, 2007

New free journal from Springer – but no Open Data

Monday, December 31st, 2007

Peter Suber reports:

New free journal from Springer

Neuroethics is a new peer-reviewed journal from Springer.  Instead of using Springer’s Open Choice hybrid model, it will offer free online access to all its articles, at least for 2008 and 2009.

The page on instructions for authors says nothing about publication fees.  It does, however, require authors to transfer copyright to Springer, which it justifies by saying, “This will ensure the widest possible dissemination of information under copyright laws.”  For the moment I’m less interested in the incorrectness of this statement than in the fact that Springer’s hybrid journals use an equivalent of the CC-BY license.  It looks like Springer is experimenting with a new access model:  free online access for all articles in a journal (hence, not hybrid); no publication fees; but no reuse rights beyond fair use.  The copyright transfer agreement permits self-archiving of the published version of the text but not the published PDF.

Also see my post last week on Springer’s new Evolution: Education and Outreach, with a similar access policy but a few confusing wrinkles of its own.

PMR: Whatever the rights and wrongs of this approach – I accept PeterS’s analysis of most situations – it represents one of my fears – the increasing complexity of per-publisher offerings. Springer now has at least 3 models – Closed, OpenChoice and FreeOnlineAccess. Even for the expert it will be non-trivial to decide what can and cannot be done, what should and should not be done. If all the major closed publishers do this, each with a slightly different model where the licence matters, we have chaos. This type of licence proliferation makes it harder to work towards common agreements for access to data (it seems clear that the present one is a step away from Open Data).

I used to think instrument manufacturers were bad, bringing out a different data format with every new machine.  I still do. Now they have been joined by publishers.

Is the scientific archive safe with publishers?

Monday, December 31st, 2007

“In the pipeline” is an impressive and much-followed part of the chemical blogosphere. I’m a bit late on its post Kids These Days! which deals in depth with a case (Menger / Christl pyridinium incident) of published scientific error. The case even got as far as Der Spiegel – the German magazine. It’s worth reading (the link will take you to other links and also a very worthwhile set of comments from the blogosphere).

My summary is that some chemists reported the synthesis of a novel set of compounds, published in Angewandte Chemie (Wiley) (2007) and Organic Letters (ACS) (2006). After publication, doubt was thrown on the identification of the products, with the claim that analytical evidence had been misinterpreted. As a result the original authors withdrew their claim. [The blogosphere has the usual range of opinions - the referees should have picked this up, the authors were sloppy, the criticism was rude, the reaction had been known for 100 years, etc. All perfectly reasonable - this is a fundamental part of science - it must be open to criticism and falsifiability. We expect a range of opinions on acceptable practice.]

What worried me was one comment that the publisher had altered the scientific record.

17. Metalate on December 1, 2007 11:00 AM writes…

Has anyone noticed that OL has removed all but the first page of the Supporting Info from the 2006 paper? Is this policy on retracted papers? And if so, why?

Permalink to Comment

PMR: I wasn’t reading this story originally, so went back to the article:


As I am currently not in I cannot get the paper without paying 25 USD (and I don’t want to take the risk that there is nothing there. I’ll visit in a day or two).

But the ACS DOES allow anyone to read the supporting information for free (whether they can re-use it is unclear and it takes the ACS months to even reply on this). So I thought it would be an idea to see if our NMREye calculations would show that the products were inconsistent with the data. I go to the supporting information

and find:


[On another day I would have criticized the use of hamburger bitmaps to store scientific information but that's not today's concern.]
There is only one page. As it ends in mid sentence I am sure Metalate is correct.

The publishers have altered the scientific record

I don’t know what they have done to the fulltext article. Replaced it by /dev/null? Or removed all but the title page?

This is the equivalent of going to a library and cutting out pages you don’t agree with. The irony is that there is almost certainly nothing wrong with the supporting information. It should be a factual record of what the authors did and observed. There is no suggestion that they didn’t do the work, make compounds, record their melting points, spectra, etc. All these are potentially valuable scientific data. They may have misinterpreted their result but the work is still part of the scientific record. For all I know (and I can’t because the publisher has censored the data) the compounds they made were actually novel (if uninteresting). Even if they weren’t novel it could be valuable to have additional measurements on them.

I have a perfectly legitimate scholarly quest. I want to see how well chemical data supports the claims made in the literature. We have been doing this with crystallography and other analytical data for several years. It’s hard because most data is thrown away or in PDF but when we can get it the approach works. We contend that if this paper had been made available to high throughput NMR calculation (“robot referees”) – by whatever method – it might have been shown to be false. It’s even possible that the compounds proposed might have been shown to be unstable – I don’t know enough without doing the calculations.

But the publisher’s censorship has prevented me from doing this.

The ACS takes archival seriously: C&EN: Editor’s Page – Socialized Science:

As I’ve [Rudy Baum] written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish.

PMR: I am not an archivist but I know some and I don’t know of any who deliberately censor the past. So I have some open questions to the American Chemical Society (and to other publishers who have taken on the self-appointed role of archivist):

  • what is the justification for this alteration of the record? Why is the original not still available with an annotation?
  • who – apart from the publisher – holds the actual formal record of publications? And how do I get it? (Remember that a University library who subscribes to a journal will probably lose all back issues – unlike paper journals the library has not purchased the articles, only rented them). I assume that some deposit libraries hold copies but I bet it’s not trivial to get this out of the British Library.
  • where and how can I get hold of the original supplemental data? And yes, I want it for scientific purposes – to do NMR calculations. Since it was originally free, I assume it is still free.

Surely the appropriate way to tackle this is through versions or annotations? One of the many strengths of Wikipedia is that it has a top-class approach to versions and annotations. If someone writes something that others disagree with, the latter can change it. BUT the original version still exists and can be easily located. If there is still disagreement, then WP may put a stamp of the form “this entry is disputed”. Readers know exactly where they are and they can see the whole history of the dispute.

So here, surely, the simple answer is to preserve, not censor, the scientific record. The work may be “junk science” but it is still reported science. Surely an editor should simply add “The authors have retracted this paper because…” on all documents and otherwise leave them in full.

It is obvious that this problem cannot arise with Open Access CC-BY papers because anyone can make a complete historical record as soon as they are published.

[UPDATE. I have now looked at the original article and this seems to have been treated satisfactorily - the fulltext is still available, with an annotation that "The authors have retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139) due to uncertainties regarding what products are formed in the reaction described." That's fair and I have relatively little quibble - although it would still be valuable to see the original and not simply an annotated version.

But the arguments about the supplemental data still persist. If it's deliberate it's very worrying. If it's a technical error in archival it's also very worrying. ]

Why authoring HTML is still a mess

Monday, December 31st, 2007

When HTML was launched it was simple, and it worked if you got it nearly right (that was in 1993). Now there are so many additions, scripts and so forth that it becomes impossible to re-use parts of other people’s HTML. In my previous post I set a small question and to illustrate it copied some HTML from Wikipedia (using cut and paste). It looked OK in my editor, so I posted it. When I looked at the final version in Firefox the pasted infobox had disappeared:


So I assumed it hadn’t got into the final version and blamed WordPress.

Then I looked in IE and found:


and as you can see the infobox shows up perfectly.

So we are still a long way from having decent editing and even longer from semantic editing unless we agree to collaborate and concentrate on making a small set of tools work properly. ICE (Integrated Content Environment) is starting to do that – it needs all our support.

[ARGGGH... the box has now appeared in Firefox - halfway into the succeeding post. Obviously it doesn't show on single posts. Or it comes and goes as it feels like...]

Chemical information on the web – typical problem

Monday, December 31st, 2007

Here’s a typical problem with chemical (and other) data on the web and elsewhere. I illustrate it with an entry from Wikipedia, knowing that they’ll probably correct it and similar as soon as it’s pointed out. You don’t have to know much science to solve this one:

Molecular formula XeO4
Molar mass 195.29 g mol−1
Appearance Yellow solid below −36°C
Density ? g cm−3, solid
Melting point −35.9 °C

Here’s part of the infobox for Xenon tetroxide in WP. Why are the data questionable? The problem is universal… [The info box didn't copy so you'll have to look at the web page - probably a better idea anyway. Here's a screenshot.]

UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the “-” is typed. From Wikipedia:

Character codes

Read          Character  Unicode  ASCII  URL  HTML (others)
Plus          +          U+002B   +      %2B
Minus         −          U+2212               &minus; or &#8722; or &#x2212;
Hyphen-minus  -          U+002D   -      %2D

The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
<td>Yellow solid below −36°C</td>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<p>−35.9 °C</p>

where the “minus” is represented by 3 bytes, which here print as


Note also that the degree sign is composed of two characters.

If the document is Unicode then this may be strictly correct, but in a scientific context it is universal that ASCII 45 is used for minus.
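The byte counts can be checked directly. This is a minimal sketch, assuming the page is encoded as UTF-8 and that a careless reader decodes it as Windows-1252; the helper names `utf8_hex` and `misread` are illustrative only and are not from any of the tools mentioned here.

```python
# Unicode MINUS SIGN and DEGREE SIGN as they occur in the infobox,
# compared with the ASCII hyphen-minus (decimal 45).
MINUS = "\u2212"
DEGREE = "\u00b0"
HYPHEN_MINUS = "-"


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def misread(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with a single-byte encoding.

    This mimics software that ignores the declared character encoding.
    The choice of cp1252 is an assumption of this sketch.
    """
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    print(utf8_hex(MINUS))         # e2 88 92 (three bytes)
    print(utf8_hex(DEGREE))        # c2 b0 (two bytes)
    print(utf8_hex(HYPHEN_MINUS))  # 2d (one byte, decimal 45)
    print(len(misread(MINUS)))     # 3 (one character has become three)
```

Under these assumptions the single minus character turns into three unrelated characters, and the degree sign into two, as soon as one step in the chain guesses the encoding wrongly.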

The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it.

The answer for “minus” is clear – in a scientific context always use ASCII 45. It is difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets.

Which is not common.
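For a program that has to read such pages as they are, one defensive approach is to map the typographic variants onto ASCII 45 before parsing numbers. The following is a rough sketch only: the set of look-alike characters to accept, and the function names, are my assumptions, not an agreed convention.

```python
import re

# Characters that are sometimes typeset where "minus" is meant.
# Which look-alikes to accept is an assumption of this sketch.
MINUS_LIKE = {
    "\u2212": "-",  # MINUS SIGN
    "\u2013": "-",  # EN DASH
    "\u2012": "-",  # FIGURE DASH
    "\u2010": "-",  # HYPHEN
}
_TABLE = str.maketrans(MINUS_LIKE)

# Optional ASCII sign, ASCII digits, optional decimal fraction.
_NUMBER = re.compile(r"[-+]?[0-9]+(?:\.[0-9]+)?")


def normalize_minus(text: str) -> str:
    """Replace minus look-alikes with the ASCII hyphen-minus (45)."""
    return text.translate(_TABLE)


def first_number(text: str, normalize: bool = True) -> float:
    """Return the first number in text, optionally normalizing minus first."""
    if normalize:
        text = normalize_minus(text)
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


if __name__ == "__main__":
    melting_point = "\u221235.9 \u00b0C"  # as it appears in the infobox
    print(first_number(melting_point))                   # -35.9
    print(first_number(melting_point, normalize=False))  # 35.9, sign lost
```

The second call shows the real danger: without normalization the parser does not fail, it quietly returns a positive melting point.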

Note, of course, that it’s much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.
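A program can at least detect when digits fall outside the ASCII range 48-57. The sketch below assumes that symbol-font digits reach the program either as ordinary Unicode decimal digits or as private-use code points (one way they can surface after export; I have not checked this against any particular publisher’s files). The function names are illustrative.

```python
import unicodedata


def classify_digit(ch: str) -> str:
    """Classify a single character as a kind of digit, or not a digit."""
    if "0" <= ch <= "9":
        return "ascii"                # code points 48-57
    category = unicodedata.category(ch)
    if category == "Nd":
        return "unicode-decimal"      # has a defined decimal value
    if category == "Co":
        return "private-use"          # meaning depends on the font
    return "not-a-digit"


def to_ascii_digits(text: str) -> str:
    """Rewrite Unicode decimal digits as ASCII; refuse private-use characters."""
    out = []
    for ch in text:
        kind = classify_digit(ch)
        if kind == "unicode-decimal":
            out.append(str(unicodedata.decimal(ch)))
        elif kind == "private-use":
            raise ValueError(f"cannot interpret U+{ord(ch):04X} without its font")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii_digits("\uff11\uff12.\uff13"))  # fullwidth digits -> 12.3
    print(classify_digit("\uf031"))                 # private-use
```

Digits with a defined Unicode decimal value can be recovered; private-use characters cannot, because their meaning lives in the font and not in the text.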

Exploring RDF and CML

Sunday, December 30th, 2007

I’ve taken the chance of a few days without commitments to investigate how we shall be using RDF. We’ve got several projects where we are starting to use it – CrystalEye – WWMM, eChemistry, SPECTRa : JISC and other ORE-based projects. I’ve been convinced for a few years that CML+RDF has to be the way forward for representing chemistry – the only question was when. CML gives the precision that is required for defining the local structure of objects (such as molecules) and RDF gives the flexibility for supporting a very diverse community who have different approaches and needs. It’s a balance between these two.

RDF represents information by triples – classically

subject – predicate – object

Here’s an example from WP:



        <rdf:Description rdf:about="">
                <dc:title>Tony Benn</dc:title>
                <dc:publisher>Wikipedia</dc:publisher>
        </rdf:Description>

To an English-speaking person, the same information could be represented simply as:

The title of this resource, which is published by Wikipedia, is ‘Tony Benn’

[Tony Benn is a well-known socialist UK politician much respected by people of all parties and none.]

This can be represented by a graph (from the W3C validator service) :


This is a very simple graph. The strength of RDF is that you can add a new triple anywhere and keep on doing it. The weakness of RDF is that you can add a new triple anywhere and keep on doing it. You end up with graphs of arbitrary structure. The challenge of ORE is to make sense of these.
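The "add a triple anywhere" property is easy to see if the graph is held as nothing more than a set of (subject, predicate, object) tuples. This is a toy sketch in plain Python, not Jena or any RDF library; the subject URI is a hypothetical placeholder, and only the Dublin Core namespace is a real identifier.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Hypothetical subject URI, used only for illustration.
RESOURCE = "http://example.org/resource/Tony_Benn"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two statements from the example: a title and a publisher.
GRAPH: Set[Triple] = {
    Triple(RESOURCE, DC + "title", "Tony Benn"),
    Triple(RESOURCE, DC + "publisher", "Wikipedia"),
}


def outgoing(graph: Set[Triple], subject: str) -> Dict[str, List[str]]:
    """Group the objects of all triples about subject by predicate."""
    edges = defaultdict(list)
    for triple in graph:
        if triple.subject == subject:
            edges[triple.predicate].append(triple.obj)
    return {pred: sorted(objs) for pred, objs in edges.items()}


if __name__ == "__main__":
    graph = set(GRAPH)
    # Nothing constrains where a new triple may be attached.
    graph.add(Triple("http://example.org/anything", DC + "title", "Another resource"))
    print(len(graph))
    print(outgoing(graph, RESOURCE))
```

Nothing in the data structure says what shape the graph should have, which is exactly the strength and the weakness described above.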

Molecules have a variable RDF structure. We have to cater for molecules with no names, a hundred names, many properties, parameter constraints, etc. And the data are changing constantly and can come from many places. So there needs to be a versioning system and RDF is almost certainly the best way to tackle this. So here is a typical molecule:


The quality is bad because the graph is much larger and had to be scaled down (you can click it). But it shows the general structure – a “molecule” node, with about 10 “properties” (in the RDF sense) and 3-4 layers.

The learning curve for RDF is steep. The nomenclature is abstract and takes some time to become familiar with. Irritatingly there are at least 4 different syntaxes and some parts of them are very similar. Several query languages as well. However having spent a day with Jena, I can now create RDF from CML and it makes a lot of sense. (Note that it’s relatively easy to create RDF from XML, but no guarantee that arbitrary RDF can be transformed to XML).

The key thing that you have to learn is that almost everything is a Uniform Resource Identifier (URI) or a literal. So up to now we have things in CML such as dictRef, convention, units. In RDF all these have to be described by URIs. This is hard work but very good discipline and helps to firm up CML vocabulary and dictionaries.
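To make the URI discipline concrete, here is a rough sketch of turning a small CML-like fragment into triples using only the Python standard library. It is not how our Jena code works. The dictionary URIs, the base URI, the `cml:mpt`, `cml:name`, `cml:value` and `cml:units` terms and the `units:celsius` identifier are all hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical dictionary namespaces; real ones would be agreed URIs.
PREFIXES = {
    "cml": "http://example.org/dict/cml#",
    "units": "http://example.org/dict/units#",
}
BASE = "http://example.org/molecule/"  # hypothetical base for subject URIs

CML = f"""<molecule xmlns="{CML_NS}" id="m1">
  <name>xenon tetroxide</name>
  <name>XeO4</name>
  <property dictRef="cml:mpt">
    <scalar units="units:celsius">-35.9</scalar>
  </property>
</molecule>"""


def expand(qname: str) -> str:
    """Expand a prefixed name such as 'cml:mpt' into a full URI."""
    prefix, _, local = qname.partition(":")
    if not local or prefix not in PREFIXES:
        raise ValueError(f"cannot expand {qname!r}")
    return PREFIXES[prefix] + local


def cml_to_triples(cml_text: str) -> list:
    """Return (subject, predicate, object) tuples for names and properties."""
    root = ET.fromstring(cml_text)
    subject = BASE + root.attrib["id"]
    triples = []
    # A molecule may have zero, one or many names.
    for name in root.findall(f"{{{CML_NS}}}name"):
        triples.append((subject, expand("cml:name"), name.text))
    # Each property becomes a node carrying its value and its units.
    for prop in root.findall(f"{{{CML_NS}}}property"):
        dict_ref = prop.attrib["dictRef"]
        scalar = prop.find(f"{{{CML_NS}}}scalar")
        node = subject + "#" + dict_ref.partition(":")[2]
        triples.append((subject, expand(dict_ref), node))
        triples.append((node, expand("cml:value"), scalar.text))
        triples.append((node, expand("cml:units"), expand(scalar.attrib["units"])))
    return triples


if __name__ == "__main__":
    for triple in cml_to_triples(CML):
        print(triple)
```

Every predicate and every unit ends up as a URI, and only names and values remain as literals. A prefix that has no URI raises an error instead of passing through, which is the discipline at work.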

So we now have over 100,000 chemical triples and should be able to do useful things very soon.

What does USD 29 billion buy? and what’s its value?

Friday, December 28th, 2007

Like many others I’d like to thank The Alliance for Taxpayer Access

… a coalition of patient, academic, research, and publishing organizations that supports open public access to the results of federally funded research. The Alliance was formed in 2004 to urge that peer-reviewed articles stemming from taxpayer-funded research become fully accessible and available online at no extra cost to the American public. Details on the ATA may be found at

for its campaigning for the NIH bill. From the ATA site:

The provision directs the NIH to change its existing Public Access Policy, implemented as a voluntary measure in 2005, so that participation is required for agency-funded investigators. Researchers will now be required to deposit electronic copies of their peer-reviewed manuscripts into the National Library of Medicine’s online archive, PubMed Central. Full texts of the articles will be publicly available and searchable online in PubMed Central no later than 12 months after publication in a journal.

“Facilitated access to new knowledge is key to the rapid advancement of science,” said Harold Varmus, president of the Memorial Sloan-Kettering Cancer Center and Nobel Prize Winner. “The tremendous benefits of broad, unfettered access to information are already clear from the Human Genome Project, which has made its DNA sequences immediately and freely available to all via the Internet. Providing widespread access, even with a one-year delay, to the full text of research articles supported by funds from all institutes at the NIH will increase those benefits dramatically.”

PMR: Heather Joseph – one of the main architects of the struggle – comments:

“Congress has just unlocked the taxpayers’ $29 billion investment in NIH,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, a founding member of the ATA). “This policy will directly improve the sharing of scientific findings, the pace of medical advances, and the rate of return on benefits to the taxpayer.”

PMR: Within the rejoicing we must be very careful not to overlook the need to publish research data in full. So, as HaroldV says, “the Human Genome Project [...] made its DNA sequences immediately and freely available to all via the Internet”. This was the essential component. If only the fulltext of the papers had been available the sequences could not have been used – we’d still be trying to hack PDFs for sequences.

So what is the USD 29 billion? I suspect that it’s the cost of the research, not the market value of the fulltext PDFs (which is probably much less than $29B). If the full data of this research were available I suspect its value would be much more than $29B.

So I have lots of questions and hope that PubMed, Heather and others can answer them:

  • what does $29B represent?
  • will PubMed require the deposition of data (e.g. crystal structures, spectra, gels, etc.)?
  • if not, will PubMed encourage deposition?
  • if not, will PubMed support deposition?
  • if not, what are we going to do about it?

So, while Cinderella_Open_Access may be going to the ball, is Cinderella_Open_Data still sitting by the ashes hoping that she’ll get a few leftovers from the party?

Why the NIH bill does not require copyright violation

Friday, December 28th, 2007


Rich Apodaca is a founder member of the BlueObelisk – which advocates ODOSOS – Open Data, Open Source and Open Standards (mainly in chemistry). Rich has made major contributions in this area and adds valuable insights on his Depth-First blog. So I was interested that he feels that the NIH bill is misdirected and won’t work because it requires authors to publish as Open Access.


[Note, by the way, that the Blue Obelisk deliberately did not include Open Access in its scope - we are not a universal free love and flowers cult but one that addresses why chemistry needs an overhaul in how its data and knowledge are communicated now and for posterity. We felt that Open Access was orthogonal to ODOSOS. All of us at times publish in closed access journals. Moreover it does not require monk-like adherence to all its principles all the time - but that's another story.] I quote in full since the premises are important…


Rich Apodaca – A New Beginning or More of the Same?

21:02 27/12/2007, Rich Apodaca,

As discussed by Peter Suber, Peter Murray-Rust and others, President Bush signed H.R. 2764 into law yesterday. Among the many items in this bill is one that proponents argue could change the nature of the Open Access debate. Does this new law represent a fundamentally changed game, or just the next inning of the old one?

The text of the new law spells out what is now required:

SEC. 218. The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.

IANAL, but the provision requiring the policy to be implemented “in a manner consistent with copyright law” offers publishers (and scientists) all the flexibility they need to continue business as usual.

The reason is simple. Transfer of copyright from the author of a scientific paper to the publisher is usually one of the first things to happen “upon acceptance” of a manuscript for publication. And the new law makes it perfectly clear that copyright law takes precedence over deposition into PubMed Central.

Most of the journals in question will be hostile to the idea of having their copyrighted material deposited into PubMed Central and so understandably won’t allow it to be done by the authors of papers or anyone else.

Take this hypothetical scenario for example: Professor Gross at California University gets his manuscript approved for publication in the Journal of Nanoscale Devices (JND). Professor Gross is fully aware both of HR 2764 and JND’s refusal to deposit manuscripts into PubMed Central – the reasons why Professor Gross would choose JND anyway are interesting, but not relevant here. Along with the acceptance letter, JND requests prompt return of a signed copyright transfer agreement. Professor Gross sends in the signed form and from that point on, all rights to his article belong to JND. As is their policy, JND refuses Professor Gross permission to deposit a copy of his paper into PubMed Central within 12 months after publication.

Unless I’m missing something, neither Professor Gross nor JND have violated any laws. The assumption made by proponents of the new law seems to be that to implement the new policy, the Director of NIH will forbid publication by grant recipients in journals that don’t allow deposition of articles into PubMed Central.

How many influential scientist do you know of who would tolerate the government telling them which journals they can and can’t publish in? The minute such a misguided policy is put in place, the national scientific outcry would more than overwhelm anything Open Access proponents could muster.

PMR: There are several of the common counterarguments here and I shan’t address all of them. As an axiom let me state that some of them are peculiar to the US and make little sense outside.

The primary confusion is that here the NIH is acting as a grant-giving organisation, not an instrument of government in general. There is no universal US law here, but a contractual agreement between a provider of funds and the recipient. The funder says IF you receive a grant from us THEN you must do X. There is no law requiring anyone in the US or elsewhere to apply for funding to NIH. There are many other funders who support medicine and health including Wellcome, HHMI, Cancer Research UK, etc. Each has its conditions. No one has to apply to any of them.

Almost all funders limit the scope of their funding and impose conditions on recipients. For example a Cancer funder will normally require that the work is related to cancer, a children’s charity to children, etc. There would be cases where national laws might override this (it is likely that funding which is clearly racist would be challenged but it is possible to have a gender specific funder).

All research is likely to be a compromise between:

  • what the researcher would like to do
  • what the funder would like to be done
  • what is feasible and valuable

For example a funder might require that no research involved living animals and some will go further and forbid the use of any animal tissue. The applicant has a choice as to whether they wish to work with constraints or look elsewhere. In some cases (and I hope readers can add them) national funding agencies take strong lines on the permitted use of biotechnology in the work – and this differs from country to country.

In the current case the funder has a contractual requirement that the work be published openly after 12 months. I imagine that this requirement will occur in something like the US CFR:

The Code of Federal Regulations (CFR) is the codification of the general and permanent rules published in the Federal Register by the executive departments and agencies of the Federal Government. It is divided into 50 titles that represent broad areas subject to Federal regulation. Each volume of the CFR is updated once each calendar year and is issued on a quarterly basis. More.

These are regulations on how government is carried out. An application for a new drug has to conform to 21CFR11 (and probably many more). No one is required to develop new drugs but if they do they have to conform. So I hypothesize that in the current case the regulation (which has the force of law) requires the NIH to require grantees to publish their work openly in a specified time frame.

Nothing is said about the manner of publication. The author might, for example, start their own journal specifically for this purpose. They might set up an Open Notebook wiki. (I skip problems of patient confidentiality, etc.). The only requirement would be to satisfy the funders that they had met the regulations. I would not be surprised if the words did not actually specify peer-review (can anyone comment?). If the grant consists of staged contributions then the grantee would have to satisfy the program manager that the work had been published as rapidly as is consistent with good science. I would be amazed if the regulations specified a limited set of journals that were the only ones that could be used, and even more if these were defined by a citation metric algorithm (“you can only publish in journals with IF > 10.0”). There is real scope here for novel types of publication.

Rich: Neither HR 2764 nor any form of government intervention will bring widespread Open Access into being. The only things that will change the status quo are: (1) the availability of tools for making it happen; and (2) the realization by individual investigators that continuing to give away their hard-earned copyright makes them far less competitive than their peers who don’t.

PMR: HR 2764 will have a major impact. Partly because there are many scientists who will be directly affected by it, but partly because it is symbolic. Other funders (e.g. European or national governments) will now be compared against the NIH. I can write to the UK EPSRC and ask them why they don’t do the same. (Of the 7 research councils in the UK, the EPSRC is almost alone in not requiring some form of Open publication). I know the current answer, but who knows – they may have already started to change. Europe has been debating whether European research must be made open.

An analogy with Open Source may be useful. Several funders require that all software created in a program should be released as Open Source. Many universities require that academics maximise the income they generate from their research. These two are often in conflict. My own approach is to release most software as Open Source. However in some cases I have taken industrial funding and the output of that is usually different. If I felt that this would be against fundamental principles I would turn the funding down. Simple.

Open Access proponents should forget about getting the Federal Government to fix the mess that modern scientific publication has become. Instead, they should focus on making Open Access-like options more attractive to scientists.

PMR: This is a purely US argument which is almost incomprehensible on this side of the Atlantic and probably almost everywhere else. No one likes paying taxes, but we accept that government tries to spend them wisely. It [the argument] is epitomized in Rudy Baum’s “Socialized Science” and More Socialized Science articles.

The word socialize means:

1. socialise – take part in social activities; interact with others;
2. socialise – train for a social environment; “The children must be properly socialized”
3. socialise – prepare for social life; “Children have to be socialized in school”
4. socialise – make conform to socialist ideas and philosophies; “Health care should be socialized!”

Meaning 4 (presumably Rudy’s usage) is – I think – entirely unknown outside the US. When I used the apparent synonym “socialist” Rudy corrected me. I therefore have no idea what the word means other than that it seems to be pejorative. There is clearly a strong US-only political undercurrent which we outsiders should not try to swim in.

To finish: Open Access enthusiasts are working very hard to create attractive options. A major part of this (“the tools”) is new publishers and organs. It takes ca. 5 years for a new conventional journal to achieve serious impact factors and a number of these have been, and are being, launched. I expect that, like OUP and BMC Bioinformatics, we shall see many of the new ones prosper.

What I really fear is the growth of “hybrid horrors”. This is where the publishers create something which isn’t really Open but is covered by such a mass of verbiage that it is almost impossible to work through. I’ve spent weeks earlier this year trying to uncover publisher policies and in some cases failing. When I do find out what is happening it is heavily publisher-specific and often not even implemented as they say it is. So I expect to see a continued stream of “slightly-Open” offerings trumpeted as NIH-compliant. This requires heavy work to investigate and police – work which is entirely unproductive and usually unfunded.

The great advantage of the requirement to deposit in PubMed (rather than simply to expose on a publisher or other website) is that the act is clear. You can’t “half-deposit” in PubMed. They have the resources to decide whether any copyright statement allows the appropriate use of the information or is sufficiently restrictive that it does not meet the NIH rules.

At some stage the community will get tired of the continual drain on innovation set by the current approach to publishing. Whether many publishers will be left when that happens is unclear.

Thank you President Bush

Wednesday, December 26th, 2007

From Peter Suber:

OA mandate at NIH now law

This morning President Bush signed the omnibus spending bill requiring the US National Institutes of Health (NIH) to mandate OA for NIH-funded research.  

Here’s the language that just became law:

The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.

PMR: We can now celebrate.

The hard work continues. But now all fulltext derived from NIH work will be available on PubMed. Other funders will follow suit (if they are not ahead). So our journal-eating-robot OSCAR will have huge amounts of text to mine.

The good news is that we believe that this text-mining will, in itself, uncover new science. How much we don’t know, but we hope it’s significant. And if so, that will be a further argument for freeing the fulltext of every science publication.

Update on Open crystallography

Saturday, December 22nd, 2007

There’s now a growing movement to publishing crystallography directly into the Open. Several threads include:

… so it was no great surprise when Jean Claude blogged:

X-Ray Crystallography Collaborator

20:41 20/12/2007, Jean-Claude Bradley, Useful Chemistry

We have another collaborator who is comfortable with working openly: Matthias Zeller from Youngstown State University.

With the fastest turnaround for any crystal structure analysis I’ve ever submitted, we now have the structure for the Ugi product UC-150D. For a nice picture of the crystals see here.

PMR: J-C also mailed us and asked how we/he could archive and disseminate the crystallography. So here’s a rough overview.

Crystallography is a microcosm of chemistry and we encounter many different challenges:

  • not all structures are Open (some not initially, some never). Managing the differential access is harder than it looks. It has to be owned by the Department or Institution. So you probably need access control, and probably an embargo system.
  • Institutional repositories are not generally oriented towards data. Some may, indeed, only accept “fulltext”. So there may be nowhere obvious to go.
  • The raw data (CIF) contains metadata, but not in a form where search engines can find it. That’s an important part of what SPECTRa does – extracts metadata and repurposes it.
  • The CIF can, but almost universally does not, contain chemical metadata. So part of JUMBO is devoted to trying to extract chemistry out of atomic positions.  Needs a fair amount of heuristic code.
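To make the last two points concrete, here is a minimal sketch in Python. It is not SPECTRa or JUMBO code; the function names, the radii table and the tolerance are assumptions chosen for illustration. The first function pulls simple `_tag value` pairs out of CIF text so they could be indexed, and the second guesses bonds from Cartesian coordinates by comparing interatomic distances with sums of approximate covalent radii. A real CIF stores fractional coordinates and needs the unit cell and symmetry applied first, which is omitted here.

```python
import math
import shlex

# Approximate covalent radii in angstrom (illustrative subset, an assumption).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def cif_metadata(cif_text):
    """Collect simple `_tag value` pairs; loop_ blocks and multi-line values are skipped."""
    metadata = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = shlex.split(line)  # keeps quoted values such as 'C6 H6' together
        if len(parts) == 2:
            metadata[parts[0]] = parts[1]
    return metadata


def guess_bonds(atoms, tolerance=0.4):
    """Return index pairs whose distance is within the summed radii plus a tolerance.

    `atoms` is a list of (element, (x, y, z)) tuples in Cartesian angstrom.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (el_i, pos_i), (el_j, pos_j) = atoms[i], atoms[j]
            limit = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds


if __name__ == "__main__":
    sample = "data_demo\n_cell_length_a 10.5\n_chemical_formula_sum 'C6 H6'\n"
    print(cif_metadata(sample))
    water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
    print(guess_bonds(water))
```

The distance test is only the first step of the heuristics: bond orders, charges and disorder all need further rules, which is where most of the code goes.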

So in conjunction with eChemistry and eCrystals and in the momentum of SPECTRa we are continuing to develop software for crystallographic repositories. There are several reasons why people want such repositories:

  • as a high-quality lab companion – somewhere to put your data and get it back later.
  • as somewhere to provide knowledge for data-driven science (e.g. CrystalEye)
  • as somewhere to save your data for publication and dissemination
  • as somewhere to archive your data for posterity (e.g. an IR)

These put different stresses on the software, so Jim and I are developing context-independent tools that can be used in any. I’m hacking the JUMBO software (CrystalTool) and he is hacking CrystalEye so it becomes a true repository.

This is our relaxation over the holiday.


FoX marches on

Saturday, December 22nd, 2007

Toby White joined us – Jim Downing, Peter Corbett and me – in the pub yesterday to unwind and explore the challenges of tomorrow’s information. Toby has been one of the pillars of support for CML – there was no requirement to do so but he and colleagues (mainly in Earth Sciences) saw the value and used it anyway. The added challenge is FORTRAN. FORTRAN is a great language – my first encounter was ca 1970. It’s oriented towards rectangular data – of variable dimensionality. It is extremely good at scientific computing with large numbers of numbers and it understands – as much as most – how real numbers work.

But it’s not easy to interface with XML unless your data model is also rectangular. Historically molecular data was – atoms vertically, coordinates and other properties across. Bit of a problem if data are missing – hacks include magic numbers (e.g. 1.0e-bignumber, or zero-and-hope, or a row of stars (great fun when reading back in)).
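The missing-data problem can be sketched in a few lines of Python (not FORTRAN and not FoX; the function names are hypothetical, and the element and attribute names are CML-like but used here without namespace or schema validation). In a rectangular table a missing coordinate needs a magic number; in XML the attribute can simply be left out, and the reader sees that it is absent.

```python
import xml.etree.ElementTree as ET

OPTIONAL_KEYS = ("x3", "y3", "z3")


def atoms_to_xml(atoms):
    """Build a CML-like atomArray, omitting attributes whose value is None."""
    array = ET.Element("atomArray")
    for atom in atoms:
        element = ET.SubElement(array, "atom", id=atom["id"], elementType=atom["element"])
        for key in OPTIONAL_KEYS:
            value = atom.get(key)
            if value is not None:
                element.set(key, str(value))
    return array


def xml_to_atoms(array):
    """Read the atoms back; a missing coordinate comes back as None, not a sentinel."""
    atoms = []
    for element in array.findall("atom"):
        atom = {"id": element.get("id"), "element": element.get("elementType")}
        for key in OPTIONAL_KEYS:
            raw = element.get(key)
            atom[key] = float(raw) if raw is not None else None
        atoms.append(atom)
    return atoms


if __name__ == "__main__":
    atoms = [
        {"id": "a1", "element": "O", "x3": 0.0, "y3": 0.0, "z3": 0.0},
        {"id": "a2", "element": "H"},  # position not determined
    ]
    text = ET.tostring(atoms_to_xml(atoms), encoding="unicode")
    print(text)
    print(xml_to_atoms(ET.fromstring(text)))
```

Note that a genuine coordinate of 0.0 survives the round trip and stays distinguishable from "not known", which is exactly what zero-and-hope cannot guarantee.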

So Toby has written FoX – a real labour of love. If you develop ANY FORTRAN code, please use FoX for the data i/o. It’s easy and it saves a huge amount of messy glueware. There’s now no technical reason why all comp.chem software shouldn’t emit XML/CML. It’s not just “another file format” – it’s a new way of thinking about information.


From: Toby White


Subject: [FoX] Release of version 3.1

This is to announce the release of version 3.1 of the FoX library.
(download from <>)
This new release features
* extended portability across compilers
  (see <>)
* a “dummy library” capability
  (see <>)
* extended DOM functionality, including several more Level 3 functions,
  and additional Fortran utility wrappers
  (see <>)

Merry Christmas,


PMR:  Enjoy!