Did I write this paper???

I just looked for one of my papers using Google (not Google Scholar) and found:
[screenshot: wiley.png – the Google result]
Hmm – I don’t remember publishing something in “Cheminform” in 2005. I normally check who my co-authors are before they send a paper off, and I hope they would as well. Moreover, these are my immediate colleagues.
So I look for the PDF (under 8K) to check what I have written, and get:
[screenshot: wiley1.PNG]
What on earth is going on? I have authored a paper I don’t know about, with a “PDF” that makes no sense, from a publisher that I haven’t engaged with.
What’s going on? I think I know, and if I’m right I’m unhappy about it. But I’d like your comments first.

Posted in publishing | 4 Comments

Open Data: I want my data back!


Although I am mainly concerned with campaigning for data associated with scholarly publishing to be Open, the term Open Data has also been used in conjunction with personal data “given” or “lent” to third parties (see Open Data – Wikipedia, which contains Jon Bosak’s quote “I want my data back”). Here is a good example of the problems of getting one’s personal data (and possibly other people’s) back, from Paul Miller of Talis: Scoble, Facebook, Plaxo, open data; time for change?. Excerpts (read the whole post for the details):


I am of course talking, like so many others, about Robert Scoble being barred from Facebook for using an as-yet unlaunched capability of Plaxo that clearly and unambiguously breached Facebook’s Terms and Conditions.
It all began with a ‘tweet’ from Robert Scoble, about the time that post-holiday blues kicked in for those returning to work this (UK) morning;

“Oh, oh, Facebook blocked my account because I was hitting it with a script. Naughty, naughty Scoble!”

Twitter exploded, closely followed by large chunks of the blogosphere. …
Minutiae aside, the whole affair raises a couple of points pertinent to one of the biggest issues for 2008; ownership, portability and openness of data.

  • I want to be able to take my data from a service such as Facebook, and use it somewhere else. That’s what Marc Canter has been arguing forever, along with the AttentionTrust, OpenSocial (to a degree), DataPortability.org and many more. That’s part of the rationale behind all the work we’ve been doing on the Open Data Commons, too. However, whether I want to or not, doing it the way Scoble did is a breach of the terms and conditions of Facebook; terms and conditions to which I – and he – signed up when we chose to use the site. If you don’t like the terms, don’t use the service. It’s as simple as that;
  • Even were I allowed to export ‘my’ data, there’s a fuzzy line between that which is mine and that which isn’t. The fact that I am a Facebook friend with Nova Spivack certainly should be mine to take wherever I choose. The contact details Nova chooses to surface to me as part of that relationship, however? Are they mine to take with me, or his to control where I can surface them? There’s clearly work to do there, although it’s interesting that ‘even’ people such as Tara Hunt are reacting (also on Twitter, of course) with;

“I’m appalled that someone can take my info 2 other networks w/o my permission. Rights belong 2 friends, too.”

PMR: I have no additional comments on this other than to say it’s going to take hard work, forethought to anticipate problems of this sort, and probably a lot of legal work. Kudos to Paul and Talis and their collaborators for helping in these general areas.


In science it’s easy. Our data are ours. They don’t belong to Wiley, ACS, Elsevier, Springer. I’ve just finished a paper on this which you should all see shortly.


We want our data back.


And in future we want to make sure we don’t give away our rights to them. Is that a simple message for 2008?


Posted in open issues | 4 Comments

How to create interactive maps (and graphs?)

Peter Sefton at USQ has developed the ICE system – a carefully thought-out and engineered system for authoring compound documents in a scholarly environment, based largely on XML. He has no illusions about the difficulty of this and the appalling state of current software. I’m visiting Peter next month and, in my usual optimistic fashion, think that we can create a proof of concept for chemistry in a day or so. (Peter has already shown this is possible.) In his latest post he shows how you can create interactive maps. Read Peter’s original post, as it works very nicely (but doesn’t paste directly into my blog).

ICE Mashups, part one, take two

[…]
First up, I invite you to admire this map which shows the route of a bike ride I went on with the Toowoomba Bicycle Users Group.
[… map snipped as it crashed WordPress editing …]
And if you click on the title at the top (open it in a new window) it will take you to Bikely where you can look at the ups and downs of riding on the Darling Downs by clicking on Show then Elevation profile.
[… snipped …]
[image: graphics5 – elevation profile of the ride]
(Note: I think there’s something wrong; it shows a very dramatic dip in the ride that I don’t remember, around the 26km mark).

PMR: (The graph has been automatically created from Peter’s GPS system and part of the post is about how to get the data easily into ICE).
My interest is in having interactive graphs. We have been doing this so far with SVG, but it’s a lot of labour and really ought to be a solved problem already. Graphs are so important. How many times do you want to calculate something from a graph? For example, in the current case it’s natural to ask what the gradient of the steep drop is. It looks to be about a 100 m drop in 100 m horizontal. That’s beyond the range of the average Pommie, although I suspect it is so trivial in Oz that they don’t even recall such minor dips. Maybe swinging on a Tarzan rope would do it.
XY graphs are so common and so important that we ought to be able to cut and paste them, preserving the semantics of the underlying data. One of my New Year WIBNIs.
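Something like this minimal sketch is what I have in mind (in Python; the distance/elevation figures are invented for illustration and are not Peter’s GPS data): the raw numbers travel inside the SVG, so a machine can recover them and, for instance, compute the gradient of the steepest drop.

# Minimal sketch: plot (distance, elevation) pairs as an SVG polyline and embed
# the raw numbers in a <metadata> element so the data semantics survive
# cut-and-paste. The figures below are invented, not Peter's actual GPS track.
points = [(0.0, 600.0), (5.0, 620.0), (10.0, 580.0),
          (26.0, 560.0), (26.1, 460.0), (30.0, 590.0)]   # (km, metres)

# Steepest drop between successive points: rise over run, both in metres
gradients = [(y2 - y1) / ((x2 - x1) * 1000.0)
             for (x1, y1), (x2, y2) in zip(points, points[1:])]
print("steepest gradient: %.2f" % min(gradients))        # -1.00, i.e. 100 m in 100 m

# Scale crudely into a 500 x 200 viewport (no axes or labels in this sketch)
xs = [x for x, _ in points]
ys = [y for _, y in points]
def sx(x): return 500.0 * (x - min(xs)) / (max(xs) - min(xs))
def sy(y): return 200.0 - 200.0 * (y - min(ys)) / (max(ys) - min(ys))
path = " ".join("%.1f,%.1f" % (sx(x), sy(y)) for x, y in points)

svg = """<svg xmlns="http://www.w3.org/2000/svg" width="500" height="200">
  <metadata>%s</metadata>
  <polyline points="%s" fill="none" stroke="black"/>
</svg>""" % (points, path)
open("profile.svg", "w").write(svg)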

Posted in Uncategorized | 4 Comments

New Year's resolutions

Cameron Neylon has made Some New Year’s resolutions

I don’t usually do New Year’s resolutions. But in the spirit of the several posts from people looking back and looking forwards I thought I would offer a few. This being an open process there will be people to hold me to these so there will be a bit of encouragement there. This promises to be a year in which Open issues move much further up the agenda. These things are little ways that we can take this forward and help to build the momentum.

  1. I will adopt the NIH Open Access Mandate as a minimum standard for papers submitted in 2008. Where possible we will submit to fully Open Access journals but where there is not an appropriate journal in terms of subject area or status we will only submit to journals that allow us to submit a complete version of the paper to PubMed Central within 12 months.
  2. I will get more of our existing (non-ONS) data online and freely available.
  3. Going forward all members of my group will be committed to an Open Notebook Science approach unless this is prohibited or made impractical by the research funders. Where this is the case these projects will be publically flagged as non-ONS and I will apply the principle of the NIH OA Mandate (12 months maximum embargo) wherever possible.
  4. I will do more to publicise Open Notebook Science. Specifically I will give ONS a mention in every scientific talk and presentation I give.
  5. Regardless of the outcome of the funding application I will attempt to get funding to support an international meeting focussed on developing Open Approaches in Research.

PMR: This is highly commendable, especially from someone early in their career. Some comments:

  • In some subjects it’s hard to find Open Access journals whose scope covers the work. That’s very true of chemistry, and there is some sacrifice required. However, there is also a high-risk, potentially high-return investment here – publish in an OA journal and you are likely to get higher publicity than from a non-OA journal of similar standing. Senior faculty (like me) must promote the idea that it’s what you publish rather than where you publish that matters. All journals start small, but many grow, including OA ones.
  • ONS. This is technically hard in many areas. At this stage the effort is as important as the achievement – get as much online as you can afford. But complex internal workflows do not lend themselves to ONS easily, and we certainly need a new generation of tools.
  • I don’t know of any funders who explicitly forbid ONS (other than for confidentiality, etc.). Funders should not be concerned about where the work is published, only that it is reviewed and reasonably visible. Funders certainly shouldn’t dictate the proposed journal, and that’s the only obvious mechanism for forbidding ONS.
  • Obviously I hope the application succeeds, and we shall be there.

Best of fortune

Posted in open issues | Leave a comment

New free journal from Springer – but no Open Data

Peter Suber reports:

New free journal from Springer

Neuroethics is a new peer-reviewed journal from Springer.  Instead of using Springer’s Open Choice hybrid model, it will offer free online access to all its articles, at least for 2008 and 2009.
The page on instructions for authors says nothing about publication fees.  It does, however, require authors to transfer copyright to Springer, which it justifies by saying, “This will ensure the widest possible dissemination of information under copyright laws.”  For the moment I’m less interested in the incorrectness of this statement than in the fact that Springer’s hybrid journals use an equivalent of the CC-BY license.  It looks like Springer is experimenting with a new access model:  free online access for all articles in a journal (hence, not hybrid); no publication fees; but no reuse rights beyond fair use.  The copyright transfer agreement permits self-archiving of the published version of the text but not the published PDF.
Also see my post last week on Springer’s new Evolution: Education and Outreach, with a similar access policy but a few confusing wrinkles of its own.

PMR: Whatever the rights and wrongs of this approach – I accept PeterS’s analysis of most situations – it represents one of my fears: the increasing complexity of per-publisher offerings. Springer now has at least 3 models – Closed, OpenChoice and FreeOnlineAccess. Even for the expert it will be non-trivial to decide what can and cannot be done, and what should and should not be done. If all the major closed publishers do this, each with a slightly different model where the licence matters, we have chaos. This type of licence proliferation makes it harder to work towards common agreements for access to data (and it seems clear that the present one is a step away from Open Data).
I used to think instrument manufacturers were bad, bringing out a different data format with every new machine.  I still do. Now they have been joined by publishers.

Posted in open issues, publishing | Leave a comment

Is the scientific archive safe with publishers?

“In the pipeline” is an impressive and much-followed part of the chemical blogosphere. I’m a bit late on its post Kids These Days! which deals in depth with a case (Menger / Christl pyridinium incident) of published scientific error. The case even got as far as Der Spiegel – the German magazine. It’s worth reading (the link will take you to other links and also a very worthwhile set of comments from the blogosphere).
My summary: some chemists reported the synthesis of a novel set of compounds, published in Angewandte Chemie (Wiley, 2007) and Organic Letters (ACS, 2006). After publication, doubt was thrown on the identification of the products, with critics claiming that the analytical evidence had been misinterpreted. As a result the original authors withdrew their claim. [The blogosphere has the usual range of opinions – the referees should have picked this up, the authors were sloppy, the criticism was rude, the reaction had been known for 100 years, etc. All perfectly reasonable – this is a fundamental part of science – it must be open to criticism and falsifiability. We expect a range of opinions on acceptable practice.]
What worried me was one comment that the publisher had altered the scientific record.

17. Metalate on December 1, 2007 11:00 AM writes…
Has anyone noticed that OL has removed all but the first page of the Supporting Info from the 2006 paper? Is this policy on retracted papers? And if so, why?

PMR: I wasn’t reading this story originally, so went back to the article:
[screenshot: orglett1.PNG – the article page in Organic Letters]
As I am currently not on the cam.ac.uk network I cannot get the paper without paying USD 25 (and I don’t want to take the risk that there is nothing there; I’ll visit in a day or two).
But the ACS DOES allow anyone to read the supporting information for free (whether they can re-use it is unclear and it takes the ACS months to even reply on this). So I thought it would be an idea to see if our NMREye calculations would show that the products were inconsistent with the data. I go to the supporting information
and find:
[screenshot: orglett2.PNG – the supporting information]
[On another day I would have criticized the use of hamburger bitmaps to store scientific information but that’s not today’s concern.]
There is only one page. As it ends in mid-sentence I am sure Metalate is correct.
The publishers have altered the scientific record.
I don’t know what they have done to the fulltext article. Replaced it with /dev/null? Or removed all but the title page?
This is the equivalent of going to a library and cutting out pages you don’t agree with. The irony is that there is almost certainly nothing wrong with the supporting information. It should be a factual record of what the authors did and observed. There is no suggestion that they didn’t do the work, make compounds, record their melting points, spectra, etc. All these are potentially valuable scientific data. They may have misinterpreted their result but the work is still part of the scientific record. For all I know (and I can’t because the publisher has censored the data) the compounds they made were actually novel (if uninteresting). Even if they weren’t novel it could be valuable to have additional measurements on them.
I have a perfectly legitimate scholarly quest. I want to see how well chemical data supports the claims made in the literature. We have been doing this with crystallography and other analytical data for several years. It’s hard because most data is thrown away or in PDF but when we can get it the approach works. We contend that if this paper had been made available to high throughput NMR calculation (“robot referees”) – by whatever method – it might have been shown to be false. It’s even possible that the compounds proposed might have been shown to be unstable – I don’t know enough without doing the calculations.
But the publisher’s censorship has prevented me from doing this.
The ACS takes archival seriously: C&EN: Editor’s Page – Socialized Science:

As I’ve [Rudy Baum] written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish.

PMR: I am not an archivist but I know some and I don’t know of any who deliberately censor the past. So I have some open questions to the American Chemical Society (and to other publishers who have taken on the self-appointed role of archivist):

  • what is the justification for this alteration of the record? Why is the original not still available with an annotation?
  • who – apart from the publisher – holds the actual formal record of publications? And how do I get it? (Remember that a University library that subscribes to an electronic journal will probably lose access to back issues if the subscription lapses – unlike paper journals, the library has not purchased the articles, only rented them.) I assume that some deposit libraries hold copies, but I bet it’s not trivial to get this out of the British Library.
  • where and how can I get hold of the original supplemental data? And yes, I want it for scientific purposes – to do NMR calculations. Since it was originally free, I assume it is still free.

Surely the appropriate way to tackle this is through versions or annotations? One of the many strengths of Wikipedia is that it has a top-class approach to versions and annotations. If someone writes something that others disagree with, the latter can change it. BUT the original version still exists and can be easily located. If there is still disagreement, then WP may put a stamp of the form “this entry is disputed”. Readers know exactly where they are and they can see the whole history of the dispute.
So here, surely, the simple answer is to preserve, not censor, the scientific record. The work may be “junk science” but it is still reported science. Surely an editor should simply add “The authors have retracted this paper because…” on all documents and otherwise leave them in full.
It is obvious that this problem cannot arise with Open Access CC-BY papers because anyone can make a complete historical record as soon as they are published.
[UPDATE. I have now looked at the original article and this seems to have been treated satisfactorily – the fulltext is still available, with an annotation that “The authors have retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139) due to uncertainties regarding what products are formed in the reaction described.” That’s fair and I have relatively little quibble – although it would still be valuable to see the original and not simply an annotated version.
But the arguments about the supplemental data still persist. If it’s deliberate it’s very worrying. If it’s a technical error in archival it’s also very worrying. ]

Posted in chemistry | 3 Comments

Why authoring HTML is still a mess

When HTML was launched it was simple, and it worked if you got it nearly right (that was in 1993). Now there are so many additions, scripts and so forth that it becomes impossible to re-use parts of other people’s HTML. In my previous post I set a small question and, to illustrate it, copied some HTML from Wikipedia (using cut and paste). It looked OK in my editor, so I posted it. When I looked at the final version in Firefox the pasted infobox had disappeared:
[screenshot: info2.PNG – the post in Firefox, infobox missing]
So I assumed it hadn’t got into the final version and blamed WordPress.
Then I looked in IE and found:
[screenshot: info3.PNG – the post in IE, infobox visible]
and as you can see the infobox shows up perfectly.
So we are still a long way from having decent editing and even longer from semantic editing unless we agree to collaborate and concentrate on making a small set of tools work properly. ICE (Integrated Content Environment) is starting to do that – it needs all our support.
[ARGGGH… the box has now appeared in Firefox – halfway into the succeeding post. Obviously it doesn’t show on single posts. Or it comes and goes as it feels like…]

Posted in semanticWeb | 1 Comment

Chemical information on the web – typical problem

Here’s a typical problem with chemical (and other) data on the web and elsewhere. I illustrate it with an entry from Wikipedia, knowing that they’ll probably correct this and similar entries as soon as it’s pointed out. You don’t have to know much science to solve this one:

Molecular formula XeO4
Molar mass 195.29 g mol−1
Appearance Yellow solid below −36°C
Density ? g cm−3, solid
Melting point −35.9 °C

Here’s part of the infobox for Xenon tetroxide in WP. Why are the data questionable? The problem is universal… [The infobox didn’t copy, so you’ll have to look at the web page – probably a better idea anyway. Here’s a screenshot:] [screenshot: infobox.PNG]
UPDATE: The problem comes in the character(s) before the numbers. It is not ASCII character 45, which is what most anglophone keyboards emit when the “-” is typed. From Wikipedia:

Character codes

Read | Character | Unicode | ASCII | URL | HTML (others)
Plus | + | U+002B | + | %2B |
Minus | − | U+2212 | (none) | | &minus; or &#8722; or &#x2212;
Hyphen-minus | - | U+002D | - | %2D |

The Unicode minus sign is designed to be the same length and height as the plus and equals signs. In most fonts these are the same width as digits in order to facilitate the alignment of numbers in tables. The hyphen-minus sign (-) is the ASCII version of the minus sign, and doubles as a hyphen. It is usually shorter in length than the plus sign and sometimes at a different height. It can be used as a substitute for the true minus sign when the character set is limited to ASCII.

There is a tension here between scientific practice and the norms of typesetting and presentation. When the WP XML for this entry is viewed it looks something like:

<td><a href="/wiki/Molar_mass" title="Molar mass">Molar mass</a></td>
<td>195.29 g mol<sup>−1</sup></td>
</tr>
<tr>
<td>Appearance</td>
<td>Yellow solid below −36°C</td>
</tr>
<tr>
<td><a href="/wiki/Density" title="Density">Density</a></td>
<td> ? g cm<sup>−3</sup>, solid</td>
</tr>
<tr>
<td><a href="/wiki/Melting_point" title="Melting point">Melting point</a></td>
<td>
<p>−35.9 °C</p>

where the “minus” is represented by 3 bytes, which here print as

 −

Note also that the degree sign is composed of two characters.
If the document is Unicode then this may be strictly correct, but in a scientific context it is universal that ASCII 45 is used for minus.
The consequence is that a large amount of HTML is not machine-readable in the way that a human reads it.
The answer for “minus” is clear – in a scientific context always use ASCII 45. It is more difficult to know what to do with the other characters such as degrees. They can be guaranteed to cause problems at some stage when transforming XML, HTML or any other format unless there is very strict discipline on character encodings in documents, programs and stylesheets.
Which is not common.
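As a rough sketch of the kind of defensive normalisation a machine currently has to do before it can read such HTML as numbers (the mapping table and the helper below are mine, for illustration; they are not any agreed standard):

# Minimal sketch: map typographic characters back to ASCII before parsing
# numbers scraped from HTML. The mapping is my own choice, not a standard.

# U+2212 (minus) really is three bytes in UTF-8:
assert u"\u2212".encode("utf-8") == b"\xe2\x88\x92"

ASCII_EQUIVALENTS = {
    u"\u2212": "-",   # minus sign   -> hyphen-minus (ASCII 45)
    u"\u2013": "-",   # en dash      -> hyphen-minus
    u"\u00b0": "",    # degree sign  -> dropped before numeric parsing
}

def parse_value(text):
    """Turn a scraped cell such as u"\u221235.9 \u00b0C" into a float."""
    for typographic, ascii_char in ASCII_EQUIVALENTS.items():
        text = text.replace(typographic, ascii_char)
    # keep only the leading number; discard units such as "C" or "g cm-3"
    return float(text.strip().split()[0])

print(parse_value(u"\u221235.9 \u00b0C"))   # -35.9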
Note, of course, that it’s much worse in Word documents. We have examples in published manuscripts (i.e. on publisher web sites) where numbers are taken not from the normal ASCII range (48-57) but from any of a number of symbol fonts. These are almost impossible for machines to manage correctly.

Posted in data, fun | 1 Comment

Exploring RDF and CML

I’ve taken the chance of a few days without commitments to investigate how we shall be using RDF. We’ve got several projects where we are starting to use it – CrystalEye/WWMM, eChemistry, SPECTRa (JISC) and other ORE-based projects. I’ve been convinced for a few years that CML+RDF has to be the way forward for representing chemistry – the only question was when. CML gives the precision that is required for defining the local structure of objects (such as molecules), and RDF gives the flexibility for supporting a very diverse community who have different approaches and needs. It’s a balance between these two.
RDF represents information by triples – classically
subject – predicate – object
Here’s an example from WP:


<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
                <dc:title>Tony Benn</dc:title>
                <dc:publisher>Wikipedia</dc:publisher>
        </rdf:Description>
</rdf:RDF>

To an English-speaking person, the same information could be represented simply as:
The title of this resource, which is published by Wikipedia, is ‘Tony Benn’

[Tony Benn is a well-known socialist UK politician much respected by people of all parties and none.]
This can be represented by a graph (from the W3C validator service):
[image: rdf0.png – RDF graph of the Tony Benn example]
This is a very simple graph. The strength of RDF is that you can add a new triple anywhere and keep on doing it. The weakness of RDF is that you can add a new triple anywhere and keep on doing it. You end up with graphs of arbitrary structure. The challenge of ORE is to make sense of these.
Molecules have a variable RDF structure. We have to cater for molecules with no names, a hundred names, many properties, parameter constraints, etc. And the data are changing constantly and can come from many places. So there needs to be a versioning system, and RDF is almost certainly the best way to tackle this. So here is a typical molecule:
[image: rdf1.png – RDF graph for a typical molecule]
The quality is poor because the graph is much larger and had to be scaled down. But it shows the general structure – a “molecule” node, with about 10 “properties” (in the RDF sense) and 3-4 layers.
The learning curve for RDF is steep. The nomenclature is abstract and takes some time to become familiar with. Irritatingly, there are at least 4 different syntaxes, and some parts of them are very similar. There are several query languages as well. However, having spent a day with Jena, I can now create RDF from CML and it makes a lot of sense. (Note that it’s relatively easy to create RDF from XML, but there is no guarantee that arbitrary RDF can be transformed to XML.)
The key thing that you have to learn is that almost everything is a Uniform Resource Identifier (URI) or a literal. So up to now we have had things in CML such as dictRef, convention, units. In RDF all these have to be described by URIs. This is hard work but very good discipline, and it helps to firm up CML vocabulary and dictionaries.
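To make this concrete, here is a minimal sketch using Python and rdflib (rather than Jena, which is what we actually used); the namespaces, identifiers and property names are invented for illustration and are not the real CML/CrystalEye vocabulary:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Invented namespaces for illustration; not the real CML/CrystalEye vocabulary.
CML = Namespace("http://example.org/cml-rdf#")
MOL = Namespace("http://example.org/molecules/")

g = Graph()
mol = MOL["mol0001"]

# A "molecule" node with a handful of properties, each one a triple.
g.add((mol, RDF.type, CML.Molecule))
g.add((mol, CML.name, Literal("xenon tetroxide")))
g.add((mol, CML.formula, Literal("XeO4")))
# dictRef-style pointers become URIs rather than prefixed strings:
g.add((mol, CML.property, URIRef("http://example.org/dict/meltingPoint")))

# Adding another triple anywhere is trivial - the strength and the weakness.
g.add((mol, CML.name, Literal("xenon(VIII) oxide")))

print(g.serialize(format="turtle"))

The point is simply that every predicate (name, formula, property) is itself a URI, which is where the dictionary discipline comes in.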
So we now have over 100,000 chemical triples and should be able to do useful things very soon.

Posted in semanticWeb, XML | 5 Comments

What does USD 29 billion buy? And what's its value?

Like many others I’d like to thank the Alliance for Taxpayer Access

… a coalition of patient, academic, research, and publishing organizations that supports open public access to the results of federally funded research. The Alliance was formed in 2004 to urge that peer-reviewed articles stemming from taxpayer-funded research become fully accessible and available online at no extra cost to the American public. Details on the ATA may be found at http://www.taxpayeraccess.org.

for its campaigning for the NIH bill. From the ATA site:

The provision directs the NIH to change its existing Public Access Policy, implemented as a voluntary measure in 2005, so that participation is required for agency-funded investigators. Researchers will now be required to deposit electronic copies of their peer-reviewed manuscripts into the National Library of Medicine’s online archive, PubMed Central. Full texts of the articles will be publicly available and searchable online in PubMed Central no later than 12 months after publication in a journal.
“Facilitated access to new knowledge is key to the rapid advancement of science,” said Harold Varmus, president of the Memorial Sloan-Kettering Cancer Center and Nobel Prize Winner. “The tremendous benefits of broad, unfettered access to information are already clear from the Human Genome Project, which has made its DNA sequences immediately and freely available to all via the Internet. Providing widespread access, even with a one-year delay, to the full text of research articles supported by funds from all institutes at the NIH will increase those benefits dramatically.”

PMR: Heather Joseph – one of the main architects of the struggle – comments:

“Congress has just unlocked the taxpayers’ $29 billion investment in NIH,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, a founding member of the ATA). “This policy will directly improve the sharing of scientific findings, the pace of medical advances, and the rate of return on benefits to the taxpayer.”

PMR: Within the rejoicing we must be very careful not to overlook the need to publish research data in full. So, as HaroldV says, “the Human Genome Project […] made its DNA sequences immediately and freely available to all via the Internet”. This was the essential component. If only the fulltext of the papers had been available the sequences could not have been used – we’d still be trying to hack PDFs for sequences.
So what is the USD 29 billion? I suspect that it’s the cost of the research, not the market value of the fulltext PDFs (which is probably much less than $29B). If the full data of this research were available I suspect its value would be much more than $29B.
So I have lots of questions and hope that PubMed, Heather and others can answer them:

  • what does $29B represent?
  • will PubMed require the deposition of data (e.g. crystal structures, spectra, gels, etc.)?
  • if not, will PubMed encourage deposition?
  • if not, will PubMed support deposition?
  • if not, what are we going to do about it?

So, while Cinderella_Open_Access may be going to the ball, is Cinderella_Open_Data still sitting by the ashes hoping that she’ll get a few leftovers from the party?

Posted in data, open issues | Leave a comment