save our spectra

Data in chemistry publications are very standardized, which makes it possible (though not easy) to think about robotic extraction of information. I’ve blogged earlier about the use of text, but what about graphics? This post shows the potential, but also the current unnecessary destruction of data. You don’t need to be a chemist to understand the issue.
Types of graphical object that occur frequently in chemistry are

  • chemical structure diagrams (more later)
  • graphs (i.e. plots, not topology, though these also occur)
  • spectra (used to probe the nature of compounds and also to act as fingerprints)

Here I show some proton-NMR spectra (1H NMR), which are very powerful ways of looking into molecules containing hydrogen atoms (almost all do). It’s closely related to the NMRI used for medical imaging. What is remarkable is its precision – the frequency used is often 500 MHz (i.e. 5 × 10^8 cycles per second). Because of this precision the frequency axis is usually expressed in parts per million (ppm). The scale runs from 10 to 0 ppm. The spectrum is recorded digitally, usually with 2^N points, such as 8192, 16384 or even more. That means that for each ppm there are roughly 1000 points or more.
The values and the precise shapes of the peaks are very important. They are usually quoted to 2 decimal places and the fine structure (“coupling”) can be meaningful even when as small as 1 Hz (i.e. 0.002 ppm at 500 MHz).
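To make those numbers concrete, here is a back-of-the-envelope sketch in Python (the 500 MHz frequency, the 10 ppm window and the 16384-point digitisation are the figures quoted above; nothing else is assumed):

# Rough arithmetic for a 1H NMR spectrum recorded at 500 MHz over a 10 ppm window.
SPECTROMETER_MHZ = 500.0      # observation frequency in MHz
WINDOW_PPM = 10.0             # typical 1H chemical-shift range, 10 down to 0 ppm
POINTS = 16384                # a common 2^N digitisation

hz_per_ppm = SPECTROMETER_MHZ            # 1 ppm of 500 MHz is 500 Hz
points_per_ppm = POINTS / WINDOW_PPM     # roughly 1600 points per ppm
ppm_per_point = WINDOW_PPM / POINTS      # digital resolution in ppm
hz_per_point = ppm_per_point * hz_per_ppm

print(f"{points_per_ppm:.0f} points per ppm")
print(f"resolution: {ppm_per_point:.5f} ppm = {hz_per_point:.2f} Hz per point")
print(f"a 1 Hz coupling is {1.0 / hz_per_ppm:.3f} ppm")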
In the SPECTRa project we’ve been looking at how we can preserve this valuable data – it comes out of the machine in digital form, but it is then often transcribed into a PDF. Sometimes this preserves the graphics structure, sometimes it converts it to a pixellated image. This is the worst sort of hamburger.
Since the spectra are important tools in ensuring reproducibility, and chemists frequently refer to literature values, why do some journals allow such awful spectra? I suppose it’s better than having no spectra at all. Here are some good, bad and ugly examples from the supplemental info of recent synthetic chemistry papers. Since at least 3 of them carry a copyright I shan’t identify the journals. I claim that they are (a) data, (b) a small portion of the work, (c) that publication does not affect sales, and (d) that most people would be ashamed to copyright them anyway.
Note that they all cover about 1 ppm (although for some you have to take the numerals on trust)
aciee11.GIF
Fig. 1 The fuzz is real, but quite a bit is visible
jorg1.GIF
Fig. 2 Good. This seems to have preserved most of the data.
jorg2.GIF
Fig. 3 What are those figures??? Yes, I can guess – but I shouldn’t have to. But the limited pixel resolution has destroyed the peak shapes as well. Look at the non-linearity of the horizontal axis.
aciee2.GIF
Fig. 4 I’ve made this larger so the fuzziness from the pixellation is revealed.
aciee3.GIF
Fig. 5 Quite good. You can certainly see peaks separated by 1 Hz.
jorg5.GIF
Fig. 6 Oh dear. This has the added fun of being a JPG, which adds dots to the spectrum that have nothing to do with the data. JPGs should not be used for this sort of thing.
jorg4.GIF
Fig. 7 This is 8-7 ppm. Another JPG.
So non-chemists should be able to see the point. If an article costs USD 3000 then the scientific community deserves better. How many chemists have cursed the unreadability of numeric data mangled by graphics tools? There is no technical reason why the digital data shouldn’t be deposited with the publisher, the institution, or the department.
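For the non-chemists, the scale of the problem is tiny in computing terms. The sketch below writes and re-reads a full 16384-point spectrum as an ad-hoc plain-text dump (a real deposit would use a standard format such as JCAMP-DX or CML rather than this illustrative CSV):

# Illustrative only: dump and re-read a digitised spectrum as ppm/intensity pairs.
# A real deposit would use a standard such as JCAMP-DX or CML, not this ad-hoc CSV.
import csv, random

def write_spectrum(path, ppm_values, intensities):
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ppm", "intensity"])
        writer.writerows(zip(ppm_values, intensities))

def read_spectrum(path):
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))[1:]          # skip the header row
    return [(float(p), float(i)) for p, i in rows]

# Fake data standing in for a real 16384-point acquisition, 10 ppm down to 0 ppm.
npoints = 16384
ppm = [10.0 * (1 - k / (npoints - 1)) for k in range(npoints)]
intensity = [random.random() for _ in range(npoints)]

write_spectrum("spectrum.csv", ppm, intensity)
points = read_spectrum("spectrum.csv")
print(len(points), "points; a few hundred kilobytes, not a lossy pixellated image")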
The simple question is: do chemists care?

Posted in data | 2 Comments

open data: public domain?

David Wiley has a useful post on the “public domain”. I had always assumed that the public domain was fairly simple – certain types of content were de facto PD, authors could easily donate their work to the PD, and that after a given period (country-dependent) the work became PD. But apparently it’s not that easy. The post is for those who love to see how the law makes simple concepts complicated – I shall only quote one snippet:

Assymetry, Hypocrisy, and Public Domain

Thousands of people complain that the term of copyright is too long. They point out that documents like the US Constitution make it plain that the term of copyright should be finite, and that it is absolutely critical that copyrighted educational content and other cultural artifacts eventually enter the public domain. This was recently demonstrated by the way people rallied around Lessig’s challenge of the Sonny Bono Copyright Term Extension Act in the US.
[…]
The critical importance of the public domain is one reason for the new Open Education License draft. Why not simply create a mechanism for putting works in the public domain? Again, quoting Lessig:

We have no direct license that you can link to so as to place your material in the public domain. This is not because we wouldn’t like to offer such a license. It is instead because the law does not make such simplicity possible. While for most of our history, there were a thousand ways to move creative material into the public domain, most lawyers today are puzzled about whether there is any way to move work into the public domain. We have tried to build a way, but it is not automatic. If you follow this link, there are a number of steps you can take to put material into the public domain. We believe that if you follow these steps, then your work is in the public domain. Again, there’s no way to be certain about this. But this is our best guess, given the murky state of the law.

PMR: I’ll leave it there. I’ll go back to trying to develop new algorithms for describing molecular structure and robotic analysis of the literature. It’s hard. But it’s easier for robots to read data than for a human to read lawyers.

Posted in data, open issues | Leave a comment

open data: concepts from David Wiley

David Wiley has commented very clearly on the issues involved in licensing content (or putting it in the public domain). This is the first of two posts, with my comments interjected.
By background, David seems to be writing in an educational context (i.e. material created for or by instructors and students for the teaching and learning process). There are concepts in his “Four Rs” which probably don’t apply to data, but they are still helpful. At one level one can argue that facts are not copyrightable and simply pursue that line, but “data” is possibly more complex. So while I don’t like the idea of putting licences on data, I think it’s worse to leave them off.
Bits are snipped, but I have included a lot:

Open Education License Draft

If you follow this blog with any regularity you’ll have seen this coming for several weeks now. When I began recommending that people quit using OpenContent licenses and begin using Creative Commons licenses, I said it was one of the hardest things I had ever done. And it was.

Today I take the lid off the next most difficult thing I’ve done. As I describe below, I hate the idea of license proliferation. However, I feel that there are several convincing arguments that we need a new license at this point in the history of open content, and specifically in the history of open education. After providing the arguments and my thoughts below, you’ll find a draft of the first license issued by OpenContent in eight years – the Open Education License.

The Four Rs of Open Content

When I began promoting the idea of open content almost 10 years ago, there were four main types of activity I was interested in promoting (although it took me some time to get to the point where I could articulate them clearly). The four main types of activity enabled by open content can be summarized as “the four Rs”:

  • Reuse – Use the work verbatim, just exactly as you found it
  • Rework – Alter or transform the work so that it better meets your needs
  • Remix – Combine the (verbatim or altered) work with other works to better meet your needs
  • Redistribute – Share the verbatim work, the reworked work, or the remixed work with others

Notice how each of the first three Rs encompasses those that came before it. Reusing involves copying, displaying, performing, and making other uses of a work just as you found it. Reworking involves altering or transforming content, which one would only do if afterward they would be able to reuse the derivative work. Remixing involves creating a mashup of several works – some of which will be reworked as part of the remixing process – which one would only do if afterward they would be able to reuse the remix. (A “remix” in which no reworking is done is an anthology (a collection of simple reuses) and not particularly interesting for the purposes of this discussion.)

PMR: The word “work” emphasizes that much of what DavidW is talking about is creative. I don’t know whether everyone would apply “work” to data, but let’s try it. I have been using simply “re-use”. I want to use all of David’s 4 R’s, but I think it may further confuse if I split my language into another 4 concepts. It may be that we need an overarching term (“reprocessing” is not ideal but it gives the idea).
In the learning objects literature and elsewhere, endless problems have been caused by the fact that people say “reuse” when they actually mean “rework” or “remix,” or some combination of the first three Rs. This is a classic problem of imprecision; of talking fast and loose. Add to this difficulty the fact that each of these three Rs thrives under different conditions, and you’ve got a recipe for general confusion.

For example, take “rework.” This R deals with creating a derivative by altering or adapting a work. Traditionally licenses have tried to strengthen the rework activity through the “copyleft” mechanism. Copyleft is an idea borrowed directly from the world of free or open source software, requiring that derivative works be licensed using the exact same license as the original. This insures that when derivatives are created from a copylefted open content work, those children and grandchildren works remain open content, licensed using exactly the same license as the original.
distribution of copyleft licenses
However, while copyleft strictly requires that all future generations of derivative works be free and open, copyleft significantly hinders the remix activity. For example, conservative estimates say that there are approximately 40 million creative works that are currently licensed using a Creative Commons license. About half of these use the ShareAlike clause (Creative Commons’ copyleft clause). Of those creative works that use SA, about two thirds (~13 million) use By-NC-SA, while the other third (~7 million) uses By-SA. While statistics on GFDL adoption are harder to come by, because Wikipedia and the other Wikimedia projects use the GFDL we can safely estimate at least 7 million works are licensed using the GFDL (which contains its own copyleft clause). Since half of all CC licensed materials are licensed using a copyleft clause and all GFDL licensed materials are licensed using a copyleft clause, this means that over half of the world’s open content is copylefted. And while the CC and GFDL copyleft clauses guarantee that all derivative works will be “open,” they also guarantee that they can never be used in remixes with the majority of other copylefted works. You can’t remix a GFDL work with a By-NC-SA work when the licenses require that the child be licensed exactly as the parent. Each parent had one and only one license – which license would the derivative use? It’s just not possible to legally remix these materials; copyleft prevents this remixing.
PMR: I am not in favour of copyleft for data. I have no fundamental objection to creating a copyrighted work from data as long as there is significant added value. And copyleft is viral – deliberately. If any item in a system/collection/program etc. is copyleft, then the whole is (at least by the algorithm).
While promoting rework at the expense of remix – in other words, taking the copyleft approach – is fine for software, it is problematic for content and extremely problematic for education. As educators, we are always remixing materials for use in our classrooms both in the “real” world and online. Your mileage may vary, but over my last 15 years of teaching I would estimate that my remixing activities outnumber my reworking activities 10:1 or more. If other teachers are like me in this regard, then, copyleft is a huge problem for open education. Like the American football coach who tries to use his successful offensive and defensive strategies with a European football (or soccer) team, the open source advocate who brings the successful idea of copyleft into the world of open content will eventually be disappointed. The primary activity of the open source software developer is reworking; the primary activity of the open educator is remixing. Different activities require different supporting strategies to be successful.

If we are serious about wanting the freedom to legally and frictionlessly remix educational materials, we have one of two choices: either ignore the OpenCourseWares, Wikipedia, and other copylefted open content of the world (i.e., work only with open content that isn’t copylefted), or forcibly constrain ourselves to one subset of the “open” content universe. Do you see the irony?
PMR: Yes. I would argue that if I get factual information from WP then it cannot carry a copyleft. I need the fundamental physical constants and get them from WP. I don’t think that my data and programs are thereby copyleft. All algorithms are now slightly fuzzy.

About the Copyleft and Attribution Restrictions

Some supporters of copyleft licenses like CC By-SA and the GFDL claim that they give users the ability to use and reuse open content with “no restrictions.” Obviously, requirements for attribution and copylefting of derivatives are very real restrictions that should not be overlooked. While supporters claim that “some restrictions are necessary to protect freedom,” and that requirements for attribution and copylefting fall into this category, both these restrictions can be problematic both practically and philosophically. I’ve spent a significant amount of time above describing why this is the case for the copyleft restriction.
When you contemplate the different cultures and cultural values in the world, it isn’t hard to imagine scenarios in which the requirement for attribution would prevent appropriate uses of open content. One need only contemplate any of the areas of enduring unrest in the world to understand that the requirement to attribute a reuse or rework of content to a Sunni or Shia author, for example, will prevent members of the other group from using the content. Sadly, over a dozen other examples of this kind (Israeli / Palestinian, etc.) could be given. It quickly becomes clear that the requirement to attribute the original author can be a subtle but no less real way of discriminating against persons or groups. (If the accusation of being an instrument of discrimination is not convincing enough to some open source advocates, this situation also puts the seemingly innocuous requirement for attribution at odds with one of the basic premises of the open source definition.) I believe it is absolutely crucial that we do everything we can to live up to the ideals of nondiscrimination expressed in the definition, our institutions, and civilization generally.

PMR: I appreciate the logic, but do not feel that attribution for scientific data can cause problems.

Why Not a Public Domain Dedication?

If the appropriate goal for a license is, as it appears, to make open content available without any restrictions, why not simply dedicate the works in question to the public domain? There are a number of problems with a public domain dedication (like that offered by Creative Commons). First, dedicating a work to the public domain is a significantly more involved process than licensing a work. While Creative Commons is rightly famous for how easy their license selection technology and little green buttons make licensing your work with a CC license, the public domain dedication is much more complicated and includes a number of steps, including making a request for Creative Commons to send you an email regarding your intent to place a work in the public domain. This rigamarole is not the fault of Creative Commons; they have simplified as much as possible the process of putting a work in the public domain in the US.
But secondly, and more importantly, it may be impossible under the law in some jurisdictions to place a work in the public domain. For example, in the EU authors have certain rights that cannot be contracted or licensed away, making it impossible for an author to legally relinquish all rights to a work (or put it in the public domain). Creative Commons also recognizes this problem with the statement that their public domain dedication “may not be valid outside of the United States.” Hence, a public domain dedication is not an internationally viable mechanism for open content.

PMR: Certain data is already clearly in the public domain, such as works of the US government. That doesn’t apply to some collections, such as those made by the National Institute of Standards and Technology (NIST), because there is a special bill allowing them to recover costs. Public domain data should not cause problems, but if we ever get to the stage of “the data are PD so we can put a commercial licence on them” then we have a problem that needs addressing quickly.

About the Four Rs and the Four Freedoms

I hate definitions and taxonomies outside the hard sciences. I hate them particularly because I have been involved in the political contests of creating and perpetuating them – specifically, definitions and taxonomies of “learning objects.” Whose definition of learning object is best? Whose taxonomy is best? These are largely meaningless political battles I left behind many years ago.
It therefore surprises no one more than it surprised me that I felt the need to list and explicate the Four Rs, especially in the context of the existing “Four Freedoms.” While the Four Freedoms have their roots in free or open source software, they have been discussed in the context of open content as well. Wikipedia’s Terry Foote summarized the freedoms at our 2005 Open Education Conference as:

  • Freedom to copy
  • Freedom to modify
  • Freedom to redistribute
  • Freedom to redistribute modified versions

Freedom 1 is analogous to the first R, reuse. Freedoms 3 and 4 are analogous to the final R, redistribute. Freedom 2 is either analogous to the second R, rework, or is an amalgamation of the second and third Rs, rework and remix. In either case, the Four Freedoms do not distinguish sufficiently between the rework and remix activities. This leads to the problems described above in which rework is considered and supported at the cost of remix. These are distinct activities that require different environmental conditions.

PMR: Since data do not (I hope) have problems with remixing, these seem clear and simple. I would be happy to define “data re-use” as the 4 F’s above.
The Four Freedoms as listed by Freedom Defined also fail to make this distinction:
  • the freedom to use the work and enjoy the benefits of using it
  • the freedom to study the work and to apply knowledge acquired from it
  • the freedom to make and redistribute copies, in whole or in part, of the information or expression
  • the freedom to make changes and improvements, and to distribute derivative works

While the “father knows best” approach of copyleft places only incentive obstacles in the path of would-be creators of derivative works (by stripping them of the ability to choose how to license their derivative works), copyleft places legal obstacles in the path of would be remixers. This problem is difficult to see through the imprecision of the way the Four Freedoms deals with “modify,” and this is one reason I felt justified in listing and explaining the Four Rs.

PMR: This seems less useful for data. David now offers an Open Education License draft. I think it’s not relevant to data, so it’s snipped.
PMR: So I’m going to continue to use the word “re-use”. It includes
  • Freedom to copy
  • Freedom to modify
  • Freedom to redistribute
  • Freedom to redistribute modified versions
but it also includes the concept of input into programs, creation of new data derived algorithmically or stochastically from the original data, and aggregation with other data sources. We probably also need something on metadata.
Posted in data, open issues | 1 Comment

open data: are licenses needed?

Now that I’m back to regular rhythms and the intensity of scifoo has subsided I’m back to the current main obsession of this blog: access to data and its re-use. It’s catalysed by a post from Peter Suber commenting on David Wiley’s posts on open content licences. I shall quote a lot. (Yes, I could simply transclude by link, but I think it’s useful to highlight the words that DavidW uses.)
I was asked yesterday to summarise for a reporter why I had issues with certain publishers (I’ll post when the report appears). What I am trying to do on this blog at the moment is (a) to find out what the current situations for data access and re-use ARE and (b) then to highlight the cases which I and others think are unsatisfactory for modern data-driven research. I am not “anti-publisher” or “anti-capitalist”, but I am “anti-fuzz” and “anti-FUD”. I try to be relatively fair and I have lauded two publishers whose policies are now clear to me. Sometimes the discourse here seems tedious and repetitive – but that’s the way it is at present.
Since I am a physical scientist and a programmer I often see things in a literal and algorithmic way. If “open access” is defined in a declaration, and everyone in the publishing industry knows about that declaration, then I assume by default that the words impose a logical constraint or enablement on the content. But that is clearly not true. Various publishers (and I am not rehashing their words today) assume that “open access” can be used in whatever way they choose to define it. Perhaps. But it isn’t generally helpful. Similarly, others assume that copyright and licences are linked in some manner that is obvious to them but not to me. So it seems that clear copyright and clear licences are going to have to be part of the future. “Data are not copyrightable” is a simple algorithm but (a) not everyone agrees what data are and (b) some people (especially Europeans) think it doesn’t apply in some cases.
I should also stress that when we use robots to read the literature (as we are now doing) we have to have clear licences. A robot is generally less smart than an adult human and needs telling clearly what it can and cannot do. If that clarity is missing, then default assumptions will be detrimental to some or all of the parties.
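To put that in robot terms, the decision a mining program has to make looks roughly like the sketch below; the metadata field, the licence list and the conservative default are all illustrative, not any real publisher API:

# Hypothetical licence check a text-mining robot might run before processing an article.
# The metadata layout and the licence list are illustrative, not a real publisher API.
PERMISSIVE_LICENCES = {
    "http://creativecommons.org/licenses/by/3.0/",   # example of a clearly permissive licence
}

def robot_may_mine(article_metadata):
    licence = article_metadata.get("licence")
    if licence is None:
        # No clear statement: a robot cannot guess intent, so it must refuse.
        return False
    return licence in PERMISSIVE_LICENCES

print(robot_may_mine({"doi": "10.xxxx/example"}))   # False: licence unclear, default is detrimental
print(robot_may_mine({"doi": "10.xxxx/example",
                      "licence": "http://creativecommons.org/licenses/by/3.0/"}))  # True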
So, as this is a long post already, here’s the link to David Wiley’s posts:
Open Education License Draft and Assymetry, Hypocrisy, and Public Domain
The next posts [*][*] comment on the issues therein.

Posted in data, open issues | 2 Comments

scifoo: images

This blog doesn’t have many pictures but these remind me of three sessions at scifoo with a chance to say a little more after the event. I shan’t (== can’t) identify everyone so feel free to annotate…
cimg1269a.JPG
Andrew Walkingshaw presenting his Golem system. Tim O’Reilly (under the rocket) listened attentively. Golem addresses the important question of how we find out what is in data files when we know the vocabulary used, but not the structure of the document. Data was a key issue in the meeting.
cimg1270a.JPG
The blogosphere (part). Deepak Singh (closest) and Jean-Claude Bradley. There were more people than this photo suggests. As we skipped from blogger to blogger, Bora Zivkovic brought up their blog on the screen and scrolled through it.
cimg1273a.JPG
Andrew Walkingshaw (left) and Alex Palazzo (right) in animated conversation with Philip Campbell (centre, Nature) after the session A+A ran on young scientists and the culture of fear. This was probably the highlight of the meeting for me – where else could an idea which surfaced at 0930 on one day be the subject of a deep debate among equals 26 hours later?

Posted in scifoo | 1 Comment

blogging 101

Today I seem to be catching up with the continuing background radiation from scifoo and it’s a good way to wind down the jetlag. Here’s Richard Akerman again showing that we really went to scifoo. This learning session also was responsible for the two very short posts on this blog where we were showing how it works…

17:55 09/08/2007, Richard Akerman, scifoo2007, web/tech, weblogs, Science Library Pad
This post lists a few basics about blogging (and feeds) and the tools that I use; it also serves as an example of why I blog: sure, I could send this as an email, or bookmark links for my own use, but if I’m going to that effort, I might as well just share it with everyone.
[DSC00450]
Peter Murray-Rust showing his blog
John Santini had the perhaps-misfortune of asking Peter Murray-Rust and me about both the reasons for and the mechanics of blogging; we proceeded to outgeek one another with dueling laptops showing the following:
www.typepad.com is what I use for a blogging platform; you have to pay, but that does have the benefit of separating your site out from the unfortunate profusion of spam blogs on
www.blogger.com, Google’s free blogging platform.
To prevent the flood of spam comments that inevitably flow to all blogs, Peter has a filtering system plus moderation, and I use TypePad’s CAPTCHA system and moderation. It’s unfortunately not possible to filter trackbacks in this way, although you can moderate them.
To get a full picture of your visitors, you need to track both web hits and (RSS) feed hits. I use StatCounter for my web hits, plus both Peter and I use FeedBurner (now owned by Google) to track our feed hits. Google Analytics is another web hit tracking option, but it’s more for high-volume sites. All these tracking tools are free.
You can also track references to your blog through Technorati and other blog/feed search tools, e.g. here are links to Peter’s blog:
http://www.technorati.com/blogs/wwmm.ch.cam.ac.uk/blogs/murrayrust/?reactions
Peter uses Feed Reader to read RSS feeds, I use Bloglines (you can see what I read at http://www.bloglines.com/public/rakerman ).
In terms of reasons and other meta-blogging areas, I blog mainly to have online searchable notes of stuff that I am sure to forget, and also to connect into the library technology community, which I entered only a few years ago. If making connections like that is important to you, make sure to be generous with your outbound links.
John asked about how much of your identity you have to reveal online; you have every choice, ranging from fully anonymous to complete disclosure. Depending on your topic, revealing at least your work title may help to establish your position in the community for people who are reading your blog.
That’s about it: it’s quite easy to start blogging, and through the magic of linking and Google, if you write it, they will come.
Peter has blogged some of his thoughts on the topic in scifoo: blogsession.

PMR: and the photo shows off the CML t-shirt that Mo-seph created for my Christmas present. (His t-shirt style is very individual and I think elegantly simple. But I am not an independent reviewer).

Posted in scifoo | 1 Comment

towards repeatability: push to re-run

Although repeatability has always been a key part of formal scientific procedure, we are now finding several new tools to help us. In principle we can capture every moment of the scientific process and “replay” it for others. Here is Richard Akerman picking up on my summary of views from many in the (chemical) blogosphere and asking whether we can add better metadata about reproducibility.

the peer review logo

In the session “Reinventing scientific publication (Web 2.0, 3.0, and their impact on science)” led by James Hendler at SciFoo, one of the items was an idea from Geoffrey Bilder, for publishers to provide a “peer review logo” that could be attached to (at this point I am interpreting based on my own understanding) e.g. blog postings, some sort of idea of a digital signature to indicate peer reviewed content.  (I know the list well since I’m afraid my major contribution to the evening, despite having thought about this topic a lot, was transcribing the list).
2) ID, logoing, review status tag, trust mechanisms
– other peer review status

I wonder if we should make a wiki where we list all of the grand (and not so grand) challenges of web science communication and discovery, and then people can pick off projects.  The SciFoo prototypes list is one angle on this.  Of course, in the perpetual-beta web world, it’s probably faster to just create a wiki, than to try to start a discussion about whether one should be created.  It’s in that “just do it” spirit that I’m pleased to find there is already a peer review logo initiative in the works, although the angle is to indicate that you’re writing about a reviewed work, not that your work itself has been reviewed.  From Planet SciFoo:
Cognitive Daily – A better way for bloggers to identify peer-reviewed research, by Dave Munger

[we] have decided to work together to develop such an icon, along with a web site where we can link to bloggers who’ve pledged to use it following the guidelines we develop

via Bora Zivkovic, via Peter Murray-Rust
(it’s strange and also good to be blogging now about people that I’ve finally met)

PMR: and vice versa
UPDATE: I do have a vague idea in a similar space, which would be a “repeatability counter”.
As I have learned more about peer review, I have understood that it has many aspects, but preventing fraud is not one of them.  Peer review can help to create a paper that is well-written and has “reasonable” science, but it can’t stop a determined fraudster.  (This isn’t my insight, but comes from a presentation I saw by Andrew Mulligan of Elsevier – “Perceptions and Misperceptions – Attitudes to Peer Review”.)  What does address fraud, and keep science progressing, is falsifiability: someone else does the experiment and sees if they get the same results.  Now I realise there are many different classes of results, but it’s interesting that many of these are not publishable, and are maybe not captured in the current system:
  • We tried to repeat the experiment, but it failed because we didn’t have enough information on the protocol
  • We tried to repeat the experiment, but it failed and we think the paper is in error
  • We successfully repeated the experiment
  • (probably more scenarios I haven’t considered)

So I think it would be interesting to have a sort of “results linking service” where you would click and you would get links to all the people who had tried to reproduce the results, and indications of whether or not they succeeded.  We use citation count as a sort of proxy for this, but it’s imperfect, not least of which because there is no semantic tagging of the citation so you don’t know if it was cited for being correct or incorrect.  I think this kind of experiment linking might add a lot of value to Open Notebook Science and to protocols reporting (whether in the literature like Nature Protocols, or in a web system like myExperiment).  Otherwise I worry that the amount of raw information from a lab notebook makes it hard to extract a lot of value from it.
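A minimal sketch of what a record in such a results-linking service might hold: the outcome tags simply mirror the scenarios listed above, while the field names and the example DOI are invented for illustration.

# Illustrative record structure for a results-linking / repeatability service.
# Outcome tags mirror the scenarios above; field names and the DOI are invented.
from collections import defaultdict

OUTCOMES = {"repeated_ok", "failed_insufficient_protocol", "failed_suspect_error"}

replications = defaultdict(list)   # DOI of original paper -> replication reports

def report(original_doi, reporter, outcome, notes=""):
    assert outcome in OUTCOMES
    replications[original_doi].append(
        {"reporter": reporter, "outcome": outcome, "notes": notes})

def summary(original_doi):
    """Count how many attempts fell into each outcome for a given paper."""
    reports = replications[original_doi]
    return {o: sum(r["outcome"] == o for r in reports) for o in OUTCOMES}

report("10.xxxx/example", "lab A", "repeated_ok")
report("10.xxxx/example", "lab B", "failed_insufficient_protocol",
       "not enough information on the protocol")
print(summary("10.xxxx/example"))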

I have also had a similar idea, specifically in computation-based science. Too many papers read as:

  • We took the following /molecules/data/ (but we can’t tell you exactly what they are as they are confidential/licensed from a possessive publisher/in a binary format/ etc.)
  • we tried all sorts of methods until we found the one that works best for this particular data set. We don’t bother to tell you the ones that didn’t work
  • we generated features for machine-learning using /softwareX/our magic recipe/a recorded procedure which we modified without telling anyone the details/
  • we used /our own version/expensive commercial package/ of naive Bayes/support vector/monte carlo/genetic algorithm/ant colony/adaptive learning/other impressive-sounding algorithm/
  • and plotted the following graph/cluster/histograms (in PDF so you can’t get the data points)
  • and compared it with our competitors’ results  – and – wow! ours are better.

There are hundreds of papers like this. They are not repeatable.
So I had a plan to survey a few journals and come up with an index-of-potential-reproducibility. It would indicate what couldn’t be repeated. Things like:

  • how easy is it to get the original data (access, format, cost, etc.)
  • how easy is it to get the software (cost, platform, installation)
  • are the plotted data available?

That’s a simple index to compute. I expect it holds for many fields (substitute mouse for code, etc.). For more experimental fields the recording-based ideas of JoVE: Journal of Visualized Experiments and Useful Chemistry are obviously valuable. For software we just need a “push to re-run” button.
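Here is a minimal sketch of how such an index-of-potential-reproducibility could be computed; the criteria follow the list above, and the boolean scoring with equal weights is invented for illustration, not a published metric:

# Toy scoring of the reproducibility criteria listed above.
# The criteria names and the equal weighting are illustrative only.
CRITERIA = ("data_accessible", "format_open", "software_obtainable", "plotted_data_available")

def reproducibility_index(paper):
    """Fraction of criteria satisfied: 0.0 (opaque) to 1.0 (potentially repeatable)."""
    return sum(bool(paper.get(c)) for c in CRITERIA) / len(CRITERIA)

paper = {
    "data_accessible": False,          # molecules are "confidential"
    "format_open": False,              # binary / proprietary format
    "software_obtainable": True,       # commercial, but at least obtainable
    "plotted_data_available": False,   # points locked inside a PDF graph
}
print(f"index = {reproducibility_index(paper):.2f}")   # 0.25 for this hypothetical paper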

Posted in data, open issues | 1 Comment

Repositories: give us the tools

From Peter Sefton’s blog:

00:43 09/08/2007, Sefton
I have already mentioned this blog post lamenting the use of PDF instead of HTML in an online journal:

In short, choosing to use PDF rather than HTML tends to make the content less open than it otherwise could be. That feels wrong to me, especially for an open access journal! One could just about justify this approach for a journal destined to be published both on paper and online (though even in that case I think it would be wrong) but surely not for an online-only ‘open’ publication?
http://efoundations.typepad.com/efoundations/2007/08/open-online-jou.html

One of the commenters nails the issue:

Go find ’em a workflow that produces good HTML as well as PDF, and I’m sure they’ll sign right on.
Posted by: Dorothea Salo | August 06, 2007 at 01:54 PM

The workflow that produces good HTML as well as PDF is what we’re after with the ICE-RS project. I talked about the project in my paper for the ETD 07 conference. I use ICE to write this blog, and you get both HTML and PDF. And the e-Journal of Instructional Science and Technology (e-JIST) is published in ICE, meaning that all the papers are in HTML and PDF. Anyone who wants help trying out ICE contact me.

Now why is that paper of mine only available in PDF at the moment?

It’s because it’s a real pain to add it to the Eprints software we use at USQ: you have to upload the HTML and all its images and so on, one at a time.

If you’re using other repository software, at least the stuff that’s commonly used in Australia, then you’re out of luck, as most of it doesn’t handle HTML at all.

It would help for the Open Access community and repository software publishers to help drive the adoption of HTML by making OA repositories first-class web citizens. Why isn’t it easy to put HTML into Eprints, DSpace, VITAL and Fez?

To do our bit, we’re planning to integrate ICE with Eprints, DSpace and Fedora later this year, building on the outcomes from the SWORD project. When that’s done I’ll update my papers in the USQ repository, over the Atom Publishing Protocol interface that SWORD is developing.

PMR: PeterS is right. The time has come for a proper investment in tools. Filling repositories with PDFs is a very limited solution and it does nothing for data-driven science. At present if anyone asks me where they should reposit their data I’m tempted to tell them “in the Cloud” rather than in their repository.
HTML (XHTML) is the necessary first step. It will emphasize the need for structured documents, compound documents, structured document collections, etc. I’m looking forward to SWORD.
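To sketch why this should eventually be straightforward: once repositories expose a SWORD/AtomPub-style interface, depositing an HTML rendition plus its images is just a handful of HTTP requests. The endpoint below is a placeholder, and a real SWORD deposit also needs authentication and packaging metadata, which this sketch omits.

# Hypothetical deposit of an HTML paper plus images to a repository over HTTP.
# The endpoint URL is a placeholder; a real SWORD/AtomPub deposit also needs
# authentication and packaging metadata that this sketch omits.
import requests

DEPOSIT_URL = "https://repository.example.org/sword/deposit/eprints"  # placeholder

def deposit(path, media_type):
    """POST one file to the deposit collection and return the URI of the created item."""
    with open(path, "rb") as fh:
        response = requests.post(DEPOSIT_URL, data=fh.read(),
                                 headers={"Content-Type": media_type})
    response.raise_for_status()
    return response.headers.get("Location")   # AtomPub reports the new resource here

item_uri = deposit("paper.html", "text/html")
for image in ("fig1.png", "fig2.png"):
    deposit(image, "image/png")
print("deposited at", item_uri)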
See also Peter Suber commenting on this:


  • I strongly support tools to improve the quality, handling, and professional uptake of HTML. The sooner we have HTML editions of scholarly eprints, next to or instead of PDF editions, the better. HTML and PDF files can both be OA, but HTML facilitates re-use of the content and PDF (deliberately) retards it.
  • “ICE-RS” stands for Integrated Content Environment for Research and Scholarship.

PMR: notice the little word “re-use”. Start practising how to say it. Then how to explain it. Then how to make it happen.

Posted in data, open issues | Leave a comment

Wiley: your supporting information for chemistry isn't satisfactory

It has become increasingly common for journals to offer – or require – “supporting information” (“supplemental data”, etc.) as an adjunct to the “full-text” article. This is now an essential part of many publications, and this post shows how, when it isn’t done fully, it harms the community.
In the days of real paper it was difficult to publish experimental data. The publisher had a real need to keep down the number of pages, and so large lists of tables, spectra, etc. took up costly space. I don’t know the exact date (ca. 1970-5?) but I remember the crystallography community, presumably through the International Union (IUCr), starting to insist that authors and journals should capture the essential data. This was initially through real paper or faxes to the journal, who would then store them (I wonder what the faxes look like now?!), or by deposition with the crystallographic structure databases, or both. I suspect that the IUCr led the publishing field here, but others emulated this with requirements to deposit protein sequences (late 1970s?). At that stage it was often difficult to coerce authors and I remember Nature being one of the last journals to require their authors to send in such material.
With electronic publication the economics change. There is no resource-limitation on what can be deposited – it is purely a balance between the interests of readers, publishers and authors. Here is what the blogosphere (commenting on TotallySynthetic) have to say about supplemental info in a Wiley journal (Angewandte Chemie == ACIEE). I have kept only comments which refer to suppinfo (SI) – the numbering does not reflect the original.

Geigerin

geigerin.jpg
Deprés and Carret. ACIEE, 2007, EarlyView. DOI: 10.1002/anie.200702031.
[… TotSynth’s comments clipped]

41 Responses to “Geigerin”

  1. Spiro Says:
    August 7th, 2007 at 1:23 “In their previous papers, they had to use a metal to take diphosgene to dichloroketene, but in this case, a bit of ultrasound worked rather well.”Ultra sound avoids the use of activated zinc (ref 14), but you definitely need a metal.
    Maybe I am wrong but there is no supp info available, as often with Angewandte :-(
  2. aa Says:
    August 7th, 2007 at 1:58 spiro, the supporting info is available at http://www.wiley-vch.de/contents/jc_2002/2007/z702031_s.pdf and i think the procedure for the 2+2 is found in their previous methodology paper, referenced in this one.
  3. Spiro Says:
    August 7th, 2007 at 2:49 aa, thanks for your dedication, but I had read these “supporting information” before writing my discontentment.
    It is just that I do not consider this to be a decent supporting information section, even though the three procedures they show are the most important of the article.
    I do not blame the authors, just the journal. If my boss tells me to write a paper without supp info, I cheer. But this is a bad habit IMHO.
    For example, I am perplex about transformation c in scheme 3, especially when I read ref 19. One way or another there may be something which is missing in the conditions (acid?), and a written procedure could clarify things.
  4. carbazole Says:
    August 7th, 2007 at 3:09 The lack of supp info in ACIEE is really frustrating. If you’ve done a total synthesis, why can’t the supp info include any procedures for making compounds not already found in the literature? If I’m doing a lit search, and I find a reaction in Org Lett that I can use, I cheer because it will have a procedure most likely. If it’s for ACIEE, I groan, because the supp infos are so spotty.
  5. Gilgerto Says:
    August 7th, 2007 at 14:10 I totally agree with you carbazole, I cannot conceive that in 2007, a supp. info for a total synthesis includes only 2 procedures and 4 nmr. It is clearly a lack of rigour from Angew…
  6. aa Says:
    August 7th, 2007 at 15:31 spiro- right, sorry about that. yes, the lack of SI in ACIE is pretty terrible. i especially hate when their a reaction in a tot syn that you would like to try and can’t get a detailed procedure.
  7. willyoubemine Says:
    August 7th, 2007 at 15:52 Supporting Info is all that really matters in these papers anyway right? I mean its nice that someone made something, but its irrelevant if irreproducible bc of spotty SI.
  8. HPCC Says:
    August 7th, 2007 at 16:15 [previous]: The ultimate best example was last year’s synthesis by James La Clair… Deoxoudol, or the molecule-that-shall-change-name-upon-criticism-of-its-synthesis! [1]
  9. JamesB Says:
    August 7th, 2007 at 16:42 Isn’t an author obliged to provide experimental detail/spectral data on request for published reactions? My ex-boss certainly behaves like it – and believe me, the hours I put into scanning spectra and compiling supp. info means the SI isn’t “spotty” in the slightest.
  10. carbazole Says:
    August 8th, 2007 at 4:40 Sure, being required to provide spectra/procedures on request is fine, but why can’t it be included online at the time of publication? Are their servers running low on hard disk space? It was different before online publication obviously, journal pages were precious. Why should I have to email someone for something that really should be provided in the first place?
  11. Jose Says:
    August 8th, 2007 at 5:31 It makes me wonder if that might be why so often high level papers get sent to Angew over JACS by certain groups in particular….
  12. willyoubemine Says:
    August 8th, 2007 at 15:56 Jose is onto something.Anyone see baran’s SI for Chartelline in JACS. It was immaculate, the way SI should be.
  13. tom Says:
    August 8th, 2007 at 18:23 Most of the time when I email someone for supporting info I don’t get it.. I emailed one of sharpless’s underlings for SI on allyic azide precursors and got not a single response.

PMR: [1] Deoxoudol – this (or hexacyclinol) is a molecule whose structure was seriously disputed and where the use of calculation and resynthesis relied on supporting information to help decide the problem. I haven’t followed this in detail, but it would be fair to say that many in the community have serious doubts about the original publication.
PMR: There are some very clear and cogent messages from the blogosphere:

  • they take scientific procedures – especially reproducibility – very seriously. Correctness in reporting is critical. The blogosphere periodically voices the opinion that certain groups present their work in a better light than the raw facts warrant.
  • in many cases the data are almost all that matters. They are used to repeat work both for testing and because people want some of the material. If the recipe is wrong careers can be blighted. Many young workers have been required by their supervisors to repeat work that is “wrong” and have suffered as a result when they can’t get it to work.
  • publishing supplemental info is critical. It’s very tedious, but the effort is worth it to the community as a whole. And journals are expected to help enforce this policy.

So some plaudits:

  • The SI in JACS (Journal of the American Chemical Society) is very good. (I refrain from saying “excellent” only because it’s in PDF, not machine-understandable).
  • The crystal structures from IUCr and RSC (Royal Soc Chemistry) are top-class. We have processed over 50,000 with virtually no detectable errors. They are an epitome of how data should be published. We are working on the ACS ones – they are also pretty good, with a few buglets. And we have collected these in crystaleye.

The Wiley suppinfo that the blogosphere has taken issue with consists of 7 pages of which the first is:
aciee.GIF
as you can see the information – DATA – is copyrighted by the publisher. I have mentioned this before, but I note that I lay myself open to being pursued by Wiley for showing any of this scientific information without their permission (see my post Sued for 10 Data Points for current practice in Wiley journals). So I will only post a very little bit and hope this counts as fair use:
spectrum5.GIF
(it’s only a very small amount of 1 page, promise!).
So to add to the blogosphere’s concerns this is an awful way of transmitting scientific information. To be fair I can find this sort of hamburger elsewhere, but they are right that it makes it much harder to use if the data are fuzzy (that’s real fuzz on the spectrum).
It’s clear that the authors have been selective in their SI. I can’t read the original paper (I could if I made the effort to get a password) but there are certainly 11 compounds and only details for 4 in the SI. That means, essentially, that there isn’t enough detailed information to repeat the work.
There is no TECHNICAL reason why all the information cannot be included. The spectrum was a born-digital cow with 32000 points and it’s been squashed to a messy hamburger. More than half the lab info has been held back. And the publisher makes it very difficult (copyright, passwords) to navigate all this.
The chemists at the bench deserve better. We know how to publish spectra and crystal structures without losing information. Let’s see some journals take a pro-active stance here!

Posted in chemistry, data, open issues | 4 Comments

blogging peer-reviewed articles – icons and greasemonkey

One of the features of having subscribed to Planet SciFoo is that I am now getting lots of new feeds. I probably shan’t continue with some of them, but here Bora highlights something similar to what the chemical blogosphere has been doing. (Bora Zivkovic: A bloggers’ icon for posts about Peer-Reviewed Research)

A better way for bloggers to identify peer-reviewed research

Category: General / Site news
Posted on: August 8, 2007 10:12 AM, by Dave Munger

Most CogDaily readers are familiar with the little icon we developed to indicate when we were reporting on peer reviewed research. We created it when we began to offer links to news and blog posts, as a way of distinguishing those less “serious” posts from when we were talking about peer-reviewed journal articles.

But Sister Edith Bogue of Monastic Musings recently pointed out that other academic bloggers could also make use of the icon, to distinguish when they’re blogging about news, family, books, etc., from serious scholarship. But our icon isn’t ideal for this purpose since its design is specifically linked to our site. I also think a public icon should come with some guidelines for use.
So Sister Edith and I, along with ScienceBloggers John Wilkins and Mike Dunford, have decided to work together to develop such an icon, along with a web site where we can link to bloggers who’ve pledged to use it following the guidelines we develop. But we don’t represent the blogging community as a whole, so we thought we’d also ask for your input. I’ll start the discussion with a few key questions. You can post answers — or your own questions — in the comments section.
  • Is “Report on Peer Reviewed Research” a good tagline? Any suggestions for a different wording?
  • What should we call the organization that sponsors the icon? I was thinking something on the lines of “Bloggers for peer review.” Any other ideas?
  • What, exactly, should the icon signify? At a minimum, the blogger should have carefully read the original research report. Any other guidelines? (On CogDaily, it means that both Greta and I have read the report, and that we’re attempting to offer a thoughtful summary of the results)
  • How do we define “peer review?” For example, some conference presentations are technically peer reviewed, but this process seems to me too cursory to qualify — after all, the reviewers haven’t even seen the final product. Some journals with very limited peer review processes also might not qualify. How do we decide what’s in and what’s out? Do we make a list?
  • Should there be a process for policing abuses of the icon? How would that work?
  • What about copyright for the icon? Should it be in the public domain? Or would some sort of license like the GPL or Creative Commons be better?
  • How should we design the icon? A contest? How would results be judged?

That ought to be enough to get the discussion rolling. As I suggested, feel free to offer both answers and additional questions. This is an exciting project!

PMR: Read the comments as well. My understanding is that this gives a blogger a chance to tag their post as being about a peer-reviewed article (not that the blogger has been certified as fit to blog! – you have to make your own choice about this). In chemistry several of the blogs routinely comment on peer-reviewed articles, some almost exclusively. Have a look at TotallySynthetic’s blog which reviews organic syntheses from major journals. Typical example:

Geigerin

6 August 2007

geigerin.jpg
Deprés and Carret. ACIEE, 2007, EarlyView. DOI: 10.1002/anie.200702031.
Isolated from the plant colloquially known as the “vomiting bush”, Geigerin is a member of the guaianolide family, represented in the 5,7,5 tricyclic structure. No further mention of the biological activity is noted in this paper, but they do point out that this isn’t the first synthesis of this family, referencing Lee and Bartons works.
The synthesis starts with a bit of chemistry quite familiar to the group, doing a regioselective [2+2] cycloaddition with diphosgene 2,2,2-trichloroacetyl chloride and 7-methylcycloheptatriene. In their previous papers, they had to use a metal to take diphosgene to dichloroketene, but in this case, a bit of ultrasound worked rather well. They then ring-expanded to give the 5,7 motif required.
geigerin_1.jpg
A few steps further along, they used DMDO to generate the hydroxyl group with good substrate control, which formed the lactone under acid conditions. To reuse a phrase I type a lot – not new chemistry, but nice!
geigerin_2.jpg

PMR: Notice there are FORTY comments already – I select a few of the early ones (numbering is unreliable)

40 Responses to “Geigerin”

    1. Spiro Says:
      “In their previous papers, they had to use a metal to take diphosgene to dichloroketene, but in this case, a bit of ultrasound worked rather well.”Ultra sound avoids the use of activated zinc (ref 14), but you definitely need a metal.
      Maybe I am wrong but there is no supp info available, as often with Angewandte :-(
    2. aa Says:
      spiro, the supporting info is available at http://www.wiley-vch.de/contents/jc_2002/2007/z702031_s.pdf and i think the procedure for the 2+2 is found in their previous methodology paper, referenced in this one.
    3. Spiro Says:
      aa, thanks for your dedication, but I had read these “supporting information” before writing my discontentment.
      It is just that I do not consider this to be a decent supporting information section, even though the three procedures they show are the most important of the article.
      I do not blame the authors, just the journal. If my boss tells me to write a paper without supp info, I cheer. But this is a bad habit IMHO.
      For example, I am perplex about transformation c in scheme 3, especially when I read ref 19. One way or another there may be something which is missing in the conditions (acid?), and a written procedure could clarify things.
    4. carbazole Says:
      The lack of supp info in ACIEE is really frustrating. If you’ve done a total synthesis, why can’t the supp info include any procedures for making compounds not already found in the literature? If I’m doing a lit search, and I find a reaction in Org Lett that I can use, I cheer because it will have a procedure most likely. If it’s for ACIEE, I groan, because the supp infos are so spotty.
    5. kiwi Says:
      Tropylium tetrafluoroborate is the listed starting material – now thats something you don’t see everyday. I sure hope they weren’t buying it…
    6. Liquidcarbon Says:
      Is it stereocontrolled enolization of methyl propionate that introduces the side-chain stereocenter in the “double conjugate” addition?
    7. Gilgerto Says:
      I totally agree with you carbazole, I cannot conceive that in 2007, a supp. info for a total synthesis includes only 2 procedures and 4 nmr. It is clearly a lack of rigour from Angew…

PMR: Impressive. Many thoughtful comments here. Close to my own heart is the concern about the lack of supplemental info – the raw experimental stuff. The chemists at the bench are crying out for this and the journal doesn’t provide it. [ACIEE == Angewandte Chemie, flagship chemistry journal from Wiley].
What Dave Munger and colleagues suggest is that this post from TotallySynthetic carries a little icon *in the post* indicating that the post is about a peer-reviewed article. That’s easy and a good idea, as we would have ways of aggregating all these.
Dave – the chemical blogosphere has taken this idea further with a greasemonkey script that enhances a Firefox view of the original article (Travels of the Blue Obelisk Greasemonkey). This tool works out that the article has a DOI which relates to one or more blogosphere posts and pops up an icon when viewing the TOC. This is completely independent of the publisher. (Whether publishers approve of this – as they should – isn’t technically relevant.) Of course the reader has to have the greasemonkey installed, but that is almost trivial. In that way you get bidirectional feedback – readers of the blog get pointed to the article and readers of the article get pointed to the review.
What this means is that the blogosphere can make a complete range of comments on articles. Most of TotSynth’s are complimentary – I think he selects articles because they are inspiring rather than flawed. But he and others will certainly criticize articles which are suspect, and there has been a fair amount of this over the last year.
(Good time to mention that Nick Day and I will shortly be announcing a greasemonkey for crystaleye that links to crystal data in publications).
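The lookup the greasemonkey performs is roughly the following, sketched here in Python rather than the actual userscript JavaScript; the DOI pattern and the hand-made index are illustrative, standing in for the aggregator the real tool queries:

# Rough sketch of the greasemonkey's logic: find DOIs on a table-of-contents page and
# check whether the blogosphere has discussed them. The index below is hand-made and
# the blog URL is a placeholder; the real tool queries a blogosphere aggregator.
import re

DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

# Hypothetical index mapping DOIs to blog posts that discuss them.
BLOG_INDEX = {
    "10.1002/anie.200702031": ["http://blog.example.org/geigerin"],
}

def annotate_toc(toc_text):
    """Return (doi, posts) pairs for every DOI on the page that has blog coverage."""
    hits = []
    for raw in DOI_PATTERN.findall(toc_text):
        doi = raw.rstrip(".,")
        posts = BLOG_INDEX.get(doi)
        if posts:
            hits.append((doi, posts))
    return hits

page = 'Deprés and Carret. ACIEE, 2007, EarlyView. DOI: 10.1002/anie.200702031.'
print(annotate_toc(page))   # [('10.1002/anie.200702031', ['http://blog.example.org/geigerin'])]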

Posted in "virtual communities", chemistry | 5 Comments