PMR: Whatever the rights and wrongs of this approach – I accept PeterS’s analysis of most situations – it represents one of my fears – the increasing complexity of per-publisher offerings. Springer now has at least 3 models – Closed, OpenChoice and FreeOnlineAccess. Even for the expert it will be non-trivial to decide what can and cannot be done, what should and should not be done. If all the major closed publishers do this, each with a slightly different model where the licence matters, we have chaos. This type of licence proliferation makes it harder to work towards common agreements for access to data (it seems clear that the present one is a step away from Open Data). I used to think instrument manufacturers were bad, bringing out a different data format with every new machine. I still do. Now they have been joined by publishers.

New free journal from Springer
Neuroethics is a new peer-reviewed journal from Springer. Instead of using Springer’s Open Choice hybrid model, it will offer free online access to all its articles, at least for 2008 and 2009. The page on instructions for authors says nothing about publication fees. It does, however, require authors to transfer copyright to Springer, which it justifies by saying, “This will ensure the widest possible dissemination of information under copyright laws.” For the moment I’m less interested in the incorrectness of this statement than in the fact that Springer’s hybrid journals use an equivalent of the CC-BY license. It looks like Springer is experimenting with a new access model: free online access for all articles in a journal (hence, not hybrid); no publication fees; but no reuse rights beyond fair use. The copyright transfer agreement permits self-archiving of the published version of the text but not the published PDF. Also see my post last week on Springer’s new Evolution: Education and Outreach, with a similar access policy but a few confusing wrinkles of its own.
Archive for December, 2007
New free journal from Springer – but no Open Data
Monday, December 31st, 2007

Is the scientific archive safe with publishers?
Monday, December 31st, 2007
17. Metalate on December 1, 2007 11:00 AM writes… Has anyone noticed that OL has removed all but the first page of the Supporting Info from the 2006 paper? Is this policy on retracted papers? And if so, why?
PMR: I wasn’t reading this story originally, so went back to the article:
As I am currently not in cam.ac.uk I cannot get the paper without paying 25 USD (and I don’t want to take the risk that there is nothing there. I’ll visit in a day or two).
But the ACS DOES allow anyone to read the supporting information for free (whether they can re-use it is unclear and it takes the ACS months to even reply on this). So I thought it would be an idea to see if our NMREye calculations would show that the products were inconsistent with the data. I go to the supporting information
and find:
[On another day I would have criticized the use of hamburger bitmaps to store scientific information but that's not today's concern.]
There is only one page. As it ends in mid sentence I am sure Metalate is correct.
The publishers have altered the scientific record
I don’t know what they have done to the fulltext article. Replaced it by dev/null? Or removed all but the title page?
This is the equivalent of going to a library and cutting out pages you don’t agree with. The irony is that there is almost certainly nothing wrong with the supporting information. It should be a factual record of what the authors did and observed. There is no suggestion that they didn’t do the work, make compounds, record their melting points, spectra, etc. All these are potentially valuable scientific data. They may have misinterpreted their result but the work is still part of the scientific record. For all I know (and I can’t because the publisher has censored the data) the compounds they made were actually novel (if uninteresting). Even if they weren’t novel it could be valuable to have additional measurements on them.
I have a perfectly legitimate scholarly quest. I want to see how well chemical data supports the claims made in the literature. We have been doing this with crystallography and other analytical data for several years. It’s hard because most data is thrown away or in PDF but when we can get it the approach works. We contend that if this paper had been made available to high throughput NMR calculation (“robot referees”) – by whatever method – it might have been shown to be false. It’s even possible that the compounds proposed might have been shown to be unstable – I don’t know enough without doing the calculations.
But the publisher’s censorship has prevented me from doing this.
The ACS takes archival seriously: C&EN: Editor’s Page – Socialized Science:
As I’ve [Rudy Baum] written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish.
PMR: I am not an archivist but I know some and I don’t know of any who deliberately censor the past. So I have some open questions to the American Chemical Society (and to other publishers who have taken on the self-appointed role of archivist):
- what is the justification for this alteration of the record? Why is the original not still available with an annotation?
- who – apart from the publisher – holds the actual formal record of publications? And how do I get it? (Remember that a University library that subscribes to a journal will probably lose all back issues – unlike paper journals, the library has not purchased the articles, only rented them). I assume that some deposit libraries hold copies but I bet it’s not trivial to get this out of the British Library.
- where and how can I get hold of the original supplemental data? And yes, I want it for scientific purposes – to do NMR calculations. Since it was originally free, I assume it is still free.
Why authoring HTML is still a mess
Monday, December 31st, 2007
So I assumed it hadn’t got into the final version and blamed WordPress.
Then I looked in IE and found:
and as you can see the infobox shows up perfectly.
So we are still a long way from having decent editing and even longer from semantic editing unless we agree to collaborate and concentrate on making a small set of tools work properly. ICE (Integrated Content Environment) is starting to do that – it needs all our support.
[ARGGGH... the box has now appeared in Firefox - halfway into the succeeding post. Obviously it doesn't show on single posts. Or it comes and goes as it feels like...]

Chemical information on the web – typical problem
Monday, December 31st, 2007
| Molecular formula | XeO4 |
| Molar mass | 195.29 g mol−1 |
| Appearance | Yellow solid below −36°C |
| Density | ? g cm−3, solid |
| Melting point | −35.9 °C |
UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the “-” is typed. From Wikipedia:

Character codes
The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.

| Read | Character | Unicode | ASCII | URL | HTML (others) |
| Plus | + | U+002B | &#43; | %2B | |
| Minus | − | U+2212 | | | &minus; or &#x2212; or &#8722; |
| Hyphen-minus | - | U+002D | &#45; | %2D | |

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
</tr>
<tr>
<td>Appearance</td>
<td>Yellow solid below −36°C</td>
</tr>
<tr>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
</tr>
<tr>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<td>
<p>−35.9 °C</p>

where the “minus” is represented by 3 bytes, which here print as −

Note also that the degree sign is composed of two characters. If the document is Unicode then this may be strictly correct, but in scientific practice ASCII 45 is universally used for minus. The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it. The answer for “minus” is clear – in a scientific context always use ASCII 45. It is difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets. Which is not common.

Note, of course, that it’s much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.
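To make the problem concrete, here is a minimal Python sketch of the sort of defensive cleaning a machine reader ends up needing. The function name and the list of "minus-like" characters are my own assumptions for illustration and are certainly not exhaustive. It also confirms the byte counts mentioned above: in UTF-8 the Unicode minus takes 3 bytes and the degree sign 2.

```python
import re

# Assumed, non-exhaustive list of characters that get used where ASCII 45 is meant.
MINUS_LIKE = {
    "\u2212": "-",  # MINUS SIGN (the one in the Wikipedia infobox)
    "\u2013": "-",  # EN DASH, often typed in place of a minus
    "\u2010": "-",  # HYPHEN
}
_TABLE = str.maketrans(MINUS_LIKE)


def parse_number(text):
    """Return the first number found in text as a float.

    Minus-like characters are mapped to ASCII 45 before matching.
    """
    cleaned = text.translate(_TABLE)
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    if match is None:
        raise ValueError("no number found in %r" % text)
    return float(match.group())


if __name__ == "__main__":
    print(parse_number("\u221235.9 \u00b0C"))   # melting point cell from the infobox
    print(len("\u2212".encode("utf-8")))        # bytes used by the Unicode minus
    print(len("\u00b0".encode("utf-8")))        # bytes used by the degree sign
```

This only papers over the problem for one character class; it does nothing for digits drawn from symbol fonts.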
Exploring RDF and CML
Sunday, December 30th, 2007
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
To an English-speaking person, the same information could be represented simply as: The title of this resource, which is published by Wikipedia, is ‘Tony Benn’
[Tony Benn is a well-known socialist UK politician much respected by people of all parties and none.]
This can be represented by a graph (from the W3C validator service):
This is a very simple graph. The strength of RDF is that you can add a new triple anywhere and keep on doing it. The weakness of RDF is that you can add a new triple anywhere and keep on doing it. You end up with graphs of arbitrary structure. The challenge of ORE is to make sense of these.
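For readers who have not met triples before, the Tony Benn example can be unpacked with nothing more than a standard XML parser. This is only a sketch under a strong assumption: it handles the flat rdf:Description pattern shown above and nothing else, so it is not an RDF/XML parser, and real work should go through a proper RDF toolkit. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml):
    """List (subject, predicate, literal) for flat rdf:Description elements only."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree writes a qualified name as "{namespace}local";
            # joining the two parts gives the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, prop.text))
    return triples


if __name__ == "__main__":
    for triple in simple_triples(RDF_XML):
        print(triple)
```

Each tuple is one arc of the graph: the subject and predicate are URIs and the object here is a literal.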
Molecules have a variable RDF structure. We have to cater for molecules with no names, a hundred names, many properties, parameter constraints, etc. And the data are changing constantly and can come from many places. So there needs to be a versioning system and RDF is almost certainly the best way to tackle this. So here is a typical molecule:
The quality is bad because the graph is much larger and had to be scaled down (you can click it). But it shows the general structure – a “molecule” node, with about 10 “properties” (in the RDF sense) and 3-4 layers.
The learning curve for RDF is steep. The nomenclature is abstract and takes some time to become familiar with. Irritatingly there are at least 4 different syntaxes and some parts of them are very similar. Several query languages as well. However having spent a day with Jena, I can now create RDF from CML and it makes a lot of sense. (Note that it’s relatively easy to create RDF from XML, but no guarantee that arbitrary RDF can be transformed to XML).
The key thing that you have to learn is that almost everything is a Uniform Resource Identifier (URI) or a literal. So up to now we have things in CML such as dictRef, convention, units. In RDF all these have to be described by URIs. This is hard work but very good discipline and helps to firm up CML vocabulary and dictionaries.
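As a rough illustration of what "everything is a URI" means for something like a dictRef, here is a small Python sketch that expands a prefixed value into a full URI and builds one triple from it. Everything specific in it is an assumption made up for the example: the prefix table, the example.org namespaces, the "ex:meltingPoint" entry and the function names. It is not how our CML-to-RDF code is actually written.

```python
# Hypothetical prefix table; real dictionaries would publish their own namespace URIs.
PREFIXES = {
    "ex": "http://example.org/dict/",
    "units": "http://example.org/units/",
}


def expand_dict_ref(qname, prefixes):
    """Turn a prefixed name such as 'ex:meltingPoint' into a full URI."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError("unknown or missing prefix in %r" % qname)
    return prefixes[prefix] + local


def property_triple(molecule_uri, dict_ref, value, prefixes):
    """Build one (subject, predicate, literal) triple for a molecular property."""
    return (molecule_uri, expand_dict_ref(dict_ref, prefixes), value)


if __name__ == "__main__":
    # The molecule URI is also invented for the example.
    print(property_triple("http://example.org/mol/XeO4", "ex:meltingPoint", "-35.9", PREFIXES))
```

The discipline comes from the failure case: a dictRef whose prefix has no agreed URI cannot be turned into a triple at all.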
So we now have over 100,000 chemical triples and should be able to do useful things very soon.

What does USD 29 billion buy? and what’s its value?
Friday, December 28th, 2007
… a coalition of patient, academic, research, and publishing organizations that supports open public access to the results of federally funded research. The Alliance was formed in 2004 to urge that peer-reviewed articles stemming from taxpayer-funded research become fully accessible and available online at no extra cost to the American public. Details on the ATA may be found at http://www.taxpayeraccess.org.
for its campaigning for the NIH bill. From the ATA site:
The provision directs the NIH to change its existing Public Access Policy, implemented as a voluntary measure in 2005, so that participation is required for agency-funded investigators. Researchers will now be required to deposit electronic copies of their peer-reviewed manuscripts into the National Library of Medicine’s online archive, PubMed Central. Full texts of the articles will be publicly available and searchable online in PubMed Central no later than 12 months after publication in a journal. “Facilitated access to new knowledge is key to the rapid advancement of science,” said Harold Varmus, president of the Memorial Sloan-Kettering Cancer Center and Nobel Prize Winner. “The tremendous benefits of broad, unfettered access to information are already clear from the Human Genome Project, which has made its DNA sequences immediately and freely available to all via the Internet. Providing widespread access, even with a one-year delay, to the full text of research articles supported by funds from all institutes at the NIH will increase those benefits dramatically.”
PMR: Heather Joseph – one of the main architects of the struggle – comments:
“Congress has just unlocked the taxpayers’ $29 billion investment in NIH,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, a founding member of the ATA). “This policy will directly improve the sharing of scientific findings, the pace of medical advances, and the rate of return on benefits to the taxpayer.”
PMR: Within the rejoicing we must be very careful not to overlook the need to publish research data in full. So, as HaroldV says, “the Human Genome Project [...]made its DNA sequences immediately and freely available to all via the Internet”. This was the essential component. If only the fulltext of the papers had been available, the sequences could not have been used – we’d still be trying to hack PDFs for sequences. So what is the USD 29 billion? I suspect that it’s the cost of the research, not the market value of the fulltext PDFs (which is probably much less than $29B). If the full data of this research were available I suspect its value would be much more than $29B. So I have lots of questions and hope that PubMed, Heather and others can answer them:
- what does $29B represent?
- will PubMed require the deposition of data (e.g. crystal structures, spectra, gels, etc.)?
- if not, will PubMed encourage deposition?
- if not, will PubMed support deposition?
- if not, what are we going to do about it?
Why the NIH bill does not require copyright violation
Friday, December 28th, 2007
Rich Apodaca is a founder member of the BlueObelisk – which advocates ODOSOS – Open Data, Open Source and Open Standards (mainly in chemistry). Rich has made major contributions in this area and adds valuable insights on his Depth-First blog. So I was interested that he feels that the NIH bill is misdirected and won’t work because it requires authors to publish as Open Access.
[Note, by the way, that the Blue Obelisk deliberately did not include Open Access in its scope - we are not a universal free love and flowers cult but one that addresses why chemistry needs an overhaul in how its data and knowledge are communicated now and for posterity. We felt that Open Access was orthogonal to ODOSOS. All of us at times publish in closed access journals. Moreover it does not require monk-like adherence to all its principles all the time - but that's another story.] I quote in full since the premises are important…
Rich Apodaca – A New Beginning or More of the Same?
21:02 27/12/2007,
As discussed by Peter Suber, Peter Murray-Rust and others, President Bush signed H.R. 2764 into law yesterday. Among the many items in this bill is one that proponents argue could change the nature of the Open Access debate. Does this new law represent a fundamentally changed game, or just the next inning of the old one?
The text of the new law spells out what is now required:

SEC. 218. The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.

IANAL, but the provision requiring the policy to be implemented “in a manner consistent with copyright law” offers publishers (and scientists) all the flexibility they need to continue business as usual.

The reason is simple. Transfer of copyright from the author of a scientific paper to the publisher is usually one of the first things to happen “upon acceptance” of a manuscript for publication. And the new law makes it perfectly clear that copyright law takes precedence over deposition into PubMed Central. Most of the journals in question will be hostile to the idea of having their copyrighted material deposited into PubMed Central and so understandably won’t allow it to be done by the authors of papers or anyone else.

Take this hypothetical scenario for example:

Professor Gross at California University gets his manuscript approved for publication in the Journal of Nanoscale Devices (JND). Professor Gross is fully aware both of HR 2764 and JND’s refusal to deposit manuscripts into PubMed Central – the reasons why Professor Gross would choose JND anyway are interesting, but not relevant here. Along with the acceptance letter, JND requests prompt return of a signed copyright transfer agreement. Professor Gross sends in the signed form and from that point on, all rights to his article belong to JND.
As is their policy, JND refuses Professor Gross permission to deposit a copy of his paper into PubMed Central within 12 months after publication.

Unless I’m missing something, neither Professor Gross nor JND have violated any laws.

The assumption made by proponents of the new law seems to be that to implement the new policy, the Director of NIH will forbid publication by grant recipients in journals that don’t allow deposition of articles into PubMed Central. How many influential scientists do you know of who would tolerate the government telling them which journals they can and can’t publish in? The minute such a misguided policy is put in place, the national scientific outcry would more than overwhelm anything Open Access proponents could muster.
PMR: There are several of the common counterarguments here and I shan’t address all of them. As an axiom let me state that some of them are peculiar to the US and make little sense outside.
The primary confusion is that here the NIH is acting as a grant-giving organisation, not an instrument of government in general. There is no universal US law here, but a contractual agreement between a provider of funds and the recipient. The funder says IF you receive a grant from us THEN you must do X. There is no law requiring anyone in the US or elsewhere to apply for funding to NIH. There are many other funders who support medicine and health including Wellcome, HHMI, Cancer Research UK, etc. Each has its conditions. No one has to apply to any of them.

Almost all funders limit the scope of their funding and impose conditions on recipients. For example a Cancer funder will normally require that the work is related to cancer, a children’s charity to children, etc. There would be cases where national laws might override this (it is likely that funding which is clearly racist would be challenged but it is possible to have a gender specific funder). All research is likely to be a compromise between:
- what the researcher would like to do
- what the funder would like to be done
- what is feasible and valuable
The Code of Federal Regulations (CFR) is the codification of the general and permanent rules published in the Federal Register by the executive departments and agencies of the Federal Government. It is divided into 50 titles that represent broad areas subject to Federal regulation. Each volume of the CFR is updated once each calendar year and is issued on a quarterly basis.

These are regulations on how government is carried out. An application for a new drug has to conform to 21CFR11 (and probably many more). No one is required to develop new drugs but if they do they have to conform.

So I hypothesize that in the current case the regulation (which has the force of law) requires the NIH to require grantees to publish their work openly in a specified time frame. Nothing is said about the manner of publication. The author might, for example, start their own journal specifically for this purpose. They might set up an Open Notebook wiki. (I skip problems of patient confidentiality, etc.). The only requirement would be to satisfy the funders that they had met the regulations. I would not be surprised if the words did not actually specify peer-review (can anyone comment?). If the grant consists of staged contributions then the grantee would have to satisfy the program manager that the work had been published as rapidly as is consistent with good science. I would be amazed if the regulations specified a limited set of journals that were the only ones that could be used, and even more if these were defined by a citation metric algorithm (“you can only publish in journals with IF > 10.0”). There is real scope here for novel types of publication.
Rich: Neither HR 2764 nor any form of government intervention will bring widespread Open Access into being. The only things that will change the status quo are: (1) the availability of tools for making it happen; and (2) the realization by individual investigators that continuing to give away their hard-earned copyright makes them far less competitive than their peers who don’t.

PMR: HR 2764 will have a major impact. Partly because there are many scientists who will be directly affected by it, but partly because it is symbolic. Other funders (e.g. European or national governments) will now be compared against the NIH. I can write to the UK EPSRC and ask them why they don’t do the same. (Of the 7 research councils in the UK, the EPSRC is almost alone in not requiring some form of Open publication). I know the current answer, but who knows – they may have already started to change. Europe has been debating whether European research must be made open.

An analogy with Open Source may be useful. Several funders require that all software created in a program should be released as Open Source. Many universities require that academics maximise the income they generate from their research. These two are often in conflict. My own approach is to release most software as Open Source. However in some cases I have taken industrial funding and the output of that is usually different. If I felt that this would be against fundamental principles I would turn the funding down. Simple.
Open Access proponents should forget about getting the Federal Government to fix the mess that modern scientific publication has become. Instead, they should focus on making Open Access-like options more attractive to scientists.

PMR: This is a purely US argument which is almost incomprehensible on this side of the Atlantic and probably almost everywhere else. No one likes paying taxes, but we accept that government tries to spend them wisely.
1. socialise – take part in social activities; interact with others;
2. socialise – train for a social environment; “The children must be properly socialized”
3. socialise – prepare for social life; “Children have to be socialized in school”
4. socialise – make conform to socialist ideas and philosophies; “Health care should be socialized!”

Meaning 4 (presumably Rudy’s usage) is – I think – entirely unknown outside the US. When I used the apparent synonym “socialist” Rudy corrected me. I therefore have no idea what the word means other than that it seems to be pejorative. There is clearly a strong US-only political undercurrent which we outsiders should not try to swim in.

To finish: Open Access enthusiasts are working very hard to create attractive options. A major part of this (“the tools”) is new publishers and organs. It takes ca. 5 years for a new conventional journal to achieve serious impact factors and a number of these have been, and are being, launched. I expect that, like OUP and BMC Bioinformatics, we shall see many of the new ones prosper.

What I really fear is the growth of “hybrid horrors”. This is where the publishers create something which isn’t really Open but is covered by such a mass of verbiage that it is almost impossible to work through. I spent weeks earlier this year trying to uncover publisher policies and in some cases failing. When I do find out what is happening it is heavily publisher-specific and often not even implemented as they say it is. So I expect to see a continued stream of “slightly-Open” offerings trumpeted as NIH-compliant. This requires heavy work to investigate and police – work which is entirely unproductive and usually unfunded.

The great advantage of the requirement to deposit in PubMed (rather than simply to expose on a publisher or other website) is that the act is clear. You can’t “half-deposit” in PubMed. They have the resources to decide whether any copyright statement allows the appropriate use of the information or is sufficiently restrictive that it does not meet the NIH rules.

At some stage the community will get tired of the continual drain on innovation set by the current approach to publishing. Whether many publishers will be left when that happens is unclear.
Thank you President Bush
Wednesday, December 26th, 2007
PMR: We can now celebrate. The hard work continues. But now all fulltext derived from NIH work will be available on PubMed. Other funders will follow suit (if they are not ahead). So our journal-eating-robot OSCAR will have huge amounts of text to mine. The good news is that we believe that this text-mining will, in itself, uncover new science. How much we don’t know, but we hope it’s significant. And if so, that will be a further argument for freeing the fulltext of every science publication.

This morning President Bush signed the omnibus spending bill requiring the US National Institutes of Health (NIH) to mandate OA for NIH-funded research. Here’s the language that just became law:
The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.
Update on Open crystallography
Saturday, December 22nd, 2007
- The Crystallography Open Database which pioneered the idea of collecting crystallographic data and making them Openly available.
- Nick Day’s CrystalEye – aggregation of published Open structures (from journals which don’t appropriate facts)
- the eCrystals collection at Southampton, initially the repository for the National Crystallographic Service and now a JISC-sponsored project to federate crystallographic repositories.
- Other collaborative groups including Reciprocal Net and STaRBURSTT
- the Microsoft eChemistry Project and molecular repositories (see blog)
- we are getting increasing queries about our SPECTRa project.
PMR: J-C also mailed us and asked how w/he could archive and disseminate the crystallography. So here’s a rough overview. Crystallography is a microcosm of chemistry and we encounter many different challenges:

X-Ray Crystallography Collaborator
20:41 20/12/2007,
Useful Chemistry
We have another collaborator who is comfortable with working openly: Matthias Zeller from Youngstown State University. With the fastest turnaround for any crystal structure analysis I’ve ever submitted, we now have the structure for the Ugi product UC-150D. For a nice picture of the crystals see here.
- not all structures are Open (some not initially, some never). Managing the differential access is harder than it looks. It has to be owned by the Department or Institution. So you probably need access control, and probably an embargo system.
- Institutional repositories are not generally oriented towards data. Some may, indeed, only accept “fulltext”. So there may be nowhere obvious to go.
- The raw data (CIF) contains metadata, but not in a form where search engines can find it. That’s an important part of what SPECTRa does – extracts metadata and repurposes it.
- The CIF can, but almost universally does not, contain chemical metadata. So part of JUMBO is devoted to trying to extract chemistry out of atomic positions. Needs a fair amount of heuristic code.
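To give a flavour of the metadata point in the list above, here is a deliberately naive Python sketch that pulls single-line "_tag value" items out of CIF text so they could be indexed. It is not SPECTRa's or JUMBO's code: the function name and the sample data are invented, and it ignores loops, multi-line text fields and most of the CIF quoting rules.

```python
# Invented sample data, using tag names from the core CIF dictionary.
SAMPLE_CIF = """data_example
_chemical_formula_sum 'C2 H6 O'
_cell_length_a 5.4310(2)
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1234
"""


def cif_items(cif_text):
    """Collect single-line '_tag value' pairs; loop headers and loop rows are skipped."""
    items = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a bare tag, e.g. a column name inside loop_
        tag, value = parts[0], parts[1].strip()
        # Strip one pair of matching quotes around the value, if present.
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            value = value[1:-1]
        items[tag] = value
    return items


if __name__ == "__main__":
    for tag, value in cif_items(SAMPLE_CIF).items():
        print(tag, "=", value)
```

Getting from items like these to real chemistry (connection tables, names) is the hard, heuristic part and is not attempted here.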
- as a high-quality lab companion – somewhere to put your data and get it back later.
- as somewhere to provide knowledge for data-driven science (e.g. CrystalEye)
- as somewhere to save your data for publication and dissemination
- as somewhere to archive your data for posterity (e.g. an IR)
FoX marches on
Saturday, December 22nd, 2007
From: Toby White
To: FoX@lists.uszla.me.uk
Subject: [FoX] Release of version 3.1

This is to announce the release of version 3.1 of the FoX library. (download from <http://source.uszla.me.uk/FoX/FoX-3.1.tgz>)

This new release features
* extended portability across compilers (see <http://uszla.me.uk/space/software/FoX/compat/>)
* a “dummy library” capability (see <http://www.uszla.me.uk/FoX/DoX/Compilation.html#dummy_library>)
* extended DOM functionality, including several more Level 3 functions, and additional Fortran utility wrappers (see <http://www.uszla.me.uk/FoX/DoX/FoX_dom.html#dataExtraction>)

Merry Christmas,
Toby

PMR: Enjoy!

