petermr's blog

A Scientist and the Web


Archive for the ‘www2007’ Category

semantic chemistry – “dbchempedia” and crystaleye

Friday, May 11th, 2007

An obvious requirement for the chemical semantic web is that we have chemistry – non-trivial as most is in walled-gardens. But things have really moved in the last hour. I left a message on Martin Walker’s Talk on WP, then at lunch met two of the semantic wiki-people – Chris Bizer who is creating dbpedia and Denny V who is creating a semantic chemistry Wiki in Karlsruhe. By the end of lunch Martin replied as below:

  1. Martin Walker Says:
    May 11th, 2007 at 7:45 pm
    As far as I can tell, there are around 3000 compounds with chemboxes, and over 2000 with drugboxes. I think we have many compounds on WP without chemboxes, but they are typically very brief articles (stubs) with little information. Of course linking into the mainstream of chemical information, as dbpedia seeks to do, may provide an incentive for more wikichemists to work on adding chemboxes. Sounds great!
    Martin A. Walker (Walkerma on WP)

So now they are all in touch and will work out a way that chemistry infoboxes on WP can be extracted into RDF. That will be sensational. It will give everyone a semantic chemistry handbook. You’ll be able to search it with the next generation of RDF tools – these are no longer vapourware. TimBL has a “tabulator” which can browse an RDF triple collection, fetching from the web where necessary. There are many other tools, and people are looking for content for them. So chemistry could be an exciting demonstration.
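What does “browsing a triple collection” actually mean? Here’s a toy sketch in Python – nothing to do with the real tabulator, and the compounds and predicate names are invented for illustration – showing that an RDF store is just triples queried by pattern matching:

```python
# A triple is (subject, predicate, object); None in a query pattern is a wildcard.
# The data below is made up for illustration, not taken from a real chembox.
triples = [
    ("wp:Benzene", "chem:formula", "C6H6"),
    ("wp:Benzene", "chem:meltingPointC", "5.5"),
    ("wp:Ethanol", "chem:formula", "C2H6O"),
]

def match(pattern, store):
    """Return every triple in the store that matches the pattern."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Everything we know about benzene:
print(match(("wp:Benzene", None, None), triples))
```

Real tools add the crucial step of fetching more triples from the web as they go, but the query model is this simple.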

At the same time we are looking at Nick Day’s RSS feeds from CrystalEye and it looks like these are great starting places for SPARQL et al.
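To see why a feed is a good starting place, here is a sketch: a made-up fragment in the general shape of an RSS item (the real CrystalEye feed structure will differ), turned into triples that a SPARQL store could ingest:

```python
import xml.etree.ElementTree as ET

# An invented fragment in the general shape of an RSS 2.0 feed;
# the real CrystalEye feeds will differ in detail.
rss = """<rss version="2.0"><channel>
  <item>
    <title>Crystal structure of caffeine</title>
    <link>http://example.org/crystaleye/12345</link>
  </item>
</channel></rss>"""

def items_to_triples(rss_text):
    """Turn each RSS item into (link, property, value) triples."""
    root = ET.fromstring(rss_text)
    triples = []
    for item in root.iter("item"):
        link = item.findtext("link")
        triples.append((link, "dc:title", item.findtext("title")))
    return triples

print(items_to_triples(rss))
```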

dbpedia – structured information from Wikipedia => dbchem

Friday, May 11th, 2007

I’m at a session at WWW 2007 on Linking Data – which I think will be enormously important for us. Something I had never heard of before (it’s new this year):

It scrapes 750,000 infoboxes from WP and turns them into structured RDF. The message is simple – the more implicit structure there is in WP, the easier it is for DBP to extract it. If there is a template for a given category (e.g. chemical compounds) then we can easily create an interface to extract structured RDF. For example DBP now has:
1,600,000 concepts

58,000 persons

75,000 YAGO categories

207,000 WP categories

and I am sure it will be relatively easy to extract the chemistry (Martin, how many compounds are there with infoboxes?)
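To show why templates make extraction easy, here is a sketch of the kind of thing DBP does. The chembox fragment and field names below are invented for the example – real infobox parsing is far messier – but the principle is just “key = value becomes a triple”:

```python
import re

# An invented chembox-style fragment in the general shape of a WP infobox;
# real templates have many more fields and quirks.
infobox = """{{chembox
| Name = Benzene
| Formula = C6H6
| MolarMass = 78.11
}}"""

def infobox_to_triples(subject, text):
    """Extract '| key = value' lines into (subject, key, value) triples."""
    pairs = re.findall(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", text, re.MULTILINE)
    return [(subject, key, value) for key, value in pairs]

print(infobox_to_triples("dbp:Benzene", infobox))
```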

DBP has a SPARQL endpoint, on an OpenLink Virtuoso server (I am sitting next to these guys). A typical query:

“All German musicians born in Berlin in 19th Century”

Extensions include

  • free text search
  • COUNT()
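For flavour, the “German musicians” query might look roughly like this in SPARQL – the class and property names here are my guesses, not DBP’s actual vocabulary – and an “endpoint” is just an HTTP URL that takes the query as a parameter:

```python
from urllib.parse import urlencode

# The natural-language query above, written as SPARQL. The property and
# class names are guesses at the vocabulary, for illustration only.
query = """
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?musician WHERE {
  ?musician a dbp:Musician ;
            dbp:birthPlace dbp:Berlin ;
            dbp:birthYear ?year .
  FILTER (?year >= 1800 && ?year < 1900)
}
"""

# A SPARQL endpoint accepts the query as an ordinary URL parameter:
endpoint = "http://dbpedia.org/sparql"
request_url = endpoint + "?" + urlencode({"query": query})
print(request_url[:60])
```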

Key components are:

  • All concepts are identified by URIs
  • All URIs dereferenceable over the web into a small RDF snippet.
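Dereferenceability is just HTTP plus content negotiation: the same concept URI serves HTML to a browser and an RDF snippet to a machine, chosen by the Accept header. A minimal sketch (the resource URI is illustrative; no request is actually sent):

```python
from urllib.request import Request

# Asking for RDF rather than HTML from a concept URI, via the Accept header.
uri = "http://dbpedia.org/resource/Benzene"
req = Request(uri, headers={"Accept": "application/rdf+xml"})

print(req.full_url, req.get_header("Accept"))
```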

The fantastic thing is that we now have a complete RDF resource FOR FREE. One example which was shown was “von Baeyer”, so whenever we refer to him we get his date of birth, history, probably even his FOAFs! DBP is becoming one of the central information hubs of the emerging web of data.

In that way DBP can become the “popular” chemical hub, while Pubchem-RDF will become the “specialist” chemical hub. Of course they will be linked and possibly even indistinguishable in some RDF snippets.

The queries are fantastic:

“A soccer player with #11 shirt in a club with a stadium of over 40,000 seats born in a country with over 10 M inhabitants”

Let’s think what the Blue Obelisk will be able to do for chemistry. TBL has said we can lash/mash things up “in an afternoon”. I am going to find out today what we can do with the chemistry we have got.

The other RDF resources in the same web are books, US census, geonames, CIA factbook, DBLP, dbtune, FOAF, Revyu.

600 million RDF triples. This is staggering. 100K links out of DBpedia.
And then in 2 months: music, Gutenberg, SW-lifesci, flickr, eurostat, freebase, the HTML web (GRDDL), blogosphere (SIOC), MusicBrainz…

So – let’s do dbchem…!!! There is still a lot for me to learn. There are starting to be several large hubs of links. Which is the hub for a community will depend on what they want and what they create.

tbl+13 – the magic exposed (if not explained)

Thursday, May 10th, 2007

Here is the theme from TBL’s keynote: (slide 14/39)


The Two Magics of Web Science. He used many examples of the slide below, emphasizing the cyclic nature of the process. Start at the top and cycle clockwise.


This shows how Google evolved – a need (or issue) – developments in technology and social structure – success – and then the dark side (in this case spoofing) starts to foul the system. Here’s a similar diagram for wikis

The stars (dotted lines) are the magic – you need both. Creativity has been common but collaboration – real collaboration – is rarer. Sometimes it evolves – e.g. in flickr, where people tag each others’ photos – sometimes it’s built into the system, such as in bioinformatics. Here’s blogs


and here is the bottom-up – or lowercase – semantic web.


So the challenge for the Blue Obelisk community and the blogosphere is what we can do to maximise the chance of building – or evolving – collaboration. Of course we have that on Sourceforge but we need to move to a wider community – the excitement of the blogosphere – the Blue Obelisk Cemetery, etc.

When I tell people here about the coherence, excitement and quality of the chemical blogosphere they are impressed. There are certainly other communities in other disciplines but Chemistry can feel it’s making excellent progress. In the next posts I’ll explain how easy it is to create the lc-semantic web and what benefits it will bring us. Don’t be frightened of SPARQL (a query language), RDF (the basis of the semantic web) or OWL (an ontology reasoning language). There are – now – sufficient tools to tackle this.

WWW 2007 Presentation

Thursday, May 10th, 2007

[This is roughly my presentation for the meeting, with conclusions. I may edit it during the day so early feed readers will have captured early versions]

The presentation concentrates on science, but applies to all scholarly journals. It addresses copyright and licenses; patents are a completely separate issue.
Background Resources:

petermr posts:

We must act:

  • Need statement on Open Data (c.f. Open Access)
  • Funders must insist on Open Data
  • Institutions must insist that staff publish Open Data
  • Authors should use Science Commons Author Addenda in all data
  • Publishers should make all supporting information Open

In any case the scientific semantic web (2.0) will become so powerful it will ultimately sweep away twentieth century practices. Publishers, you have been warned.

Access to and re-use of Open Data in chemistry – impressions

Wednesday, May 9th, 2007

Continuing the preparation of material for WWW 2007 …

It is almost universally held (see Open Data – Wikipedia) that facts cannot be copyrighted. It is common for scientific papers to be accompanied by “supporting information” or “supplemental data”. In most people’s vision, including many publishers, these are “facts” – melting point – molecular weight – amounts of compound obtained, etc. Not “creative works” – any scientist who is “creative” with their facts deserves no sympathy.

But some publishers see it differently. Here’s the American Chemical Society:

Electronic Supporting Information files are available without a subscription to ACS Web Editions. All files are copyrighted by the American Chemical Society. Files may be downloaded for personal use; users are not permitted to reproduce, republish, redistribute, or resell any Supporting Information, either in whole or in part, in either machine-readable form or any other form. For permission to reproduce this material, contact the ACS Copyright Office by e-mail at or by fax at 202-776-8112.

At least it’s viewable (but not usable) for free.

By contrast here’s Wiley:

Angewandte Chemie (no public supporting information)
… Blackwell

Chemical Biology and Web design

no public supporting information, but I could purchase the complete article and post it…

Quick Price Estimate

For a quick price estimate to reuse the content enter the information below and click Quick Price. To order, click Place Order.

I would like to…

  • send it in an e-mail
  • republish it in an academic coursepack
  • republish it in a book
  • republish it on a CD-ROM/DVD
  • republish it in a brochure or pamphlet
  • republish it in a journal or magazine
  • republish it in a newsletter
  • republish it in a newspaper
  • post it on a Web site
  • post it on an intranet site
  • purchase the article

No content delivery. This service provides permission for reuse only.

User type

  • Individual
  • Educational institution
  • STM signatory
  • Pharmaceutical corporation
  • Health care organization
  • Other organization/institution
  • Author of the article

Portion of the article

  • Entire article
  • Text extract
  • Any 1 figure
  • Any 2 figures
  • Any 3 figures
  • Any 4+ figures

Quick Price


Presumably this sale can be made many times – once for each purchaser.

Elsevier: Tetrahedron – no public supporting info

… so …

There seems to be a complete lack of Open Data among these publishers. My recollection may be faulty but I thought that the data used to be more exposed. But the current reality is that major publishers expose virtually nothing of the data…

…One representative of Wiley told me that’s because they want to sell it back to us.

The pit-bull and the pendulum

Wednesday, May 9th, 2007

Continuing the preparation of my WWW 2007 panel material blogwise (and with
apologies to those who have heard me before on this) the following epitomises the difference of interests in the Open/Closed Access/Data community. In 2004 Rudy Baum (C&EN: Editor’s Page – Socialized Science) wrote strongly against opening chemical data:

National Institutes of Health director Elias A. Zerhouni seems hell-bent on imposing an “open access” model of publishing on researchers receiving NIH grants. His action will inflict long-term damage on the communication of scientific results and on maintenance of the archive of scientific knowledge. More important, Zerhouni’s action is the opening salvo in the open-access movement’s unstated, but clearly evident, goal of placing responsibility for the entire scientific enterprise in the federal government’s hand. Open access, in fact, equates with socialized science.

Late on Friday, Sept. 3, NIH posted its proposed new policy on its website, setting in motion a 60-day public comment period (C&EN, Sept. 13, page 7). Under the policy, once manuscripts describing research supported by NIH have been peer reviewed and accepted for publication, they would have to be submitted to PubMed Central, NIH’s free archive of biomedical research. The manuscripts would be posted on the site six months after journal publication.

Many observers believe that, if the NIH policy takes effect, other funding agencies will quickly follow suit. In short order, all research supported by the federal government would be posted on government websites six months after publication. This is unlikely to satisfy open-access advocates, who will continue to push for immediate posting of the research.

I find it incredible that a Republican Administration would institute a policy that will have the long-term effect of shifting responsibility for communicating scientific research and maintaining the archive of science, technology, and medical (STM) literature from the private sector to the federal government. It’s especially hard to understand because access to the STM literature is more open today than it ever has been: Anyone can do a search of the literature and obtain papers that interest them, so long as they are willing to pay a reasonable fee for access to the material.

What is important to realize is that a subscription to an STM journal is no longer what people used to think of as a subscription; in fact, it is an access fee to a database maintained by the publisher. Sure, many libraries still receive weekly or monthly copies of journals printed on paper and bound as part of their subscription. Those paper copies of journals are becoming artifacts of a publishing world that is fast receding into the past. What matters is the database of articles in electronic form.

As I’ve written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish. Maintaining an archive, however, costs money. It is not hard to imagine a scenario in which some publishers, their revenues squeezed at least in part by loss of subscriptions as a result of open-access policies, decide to cut costs by turning off access to their archives. The material, they would rationalize, is posted on government websites.

Which is, I suspect, the outcome desired by open-access advocates. Their unspoken crusade is to socialize all aspects of science, putting the federal government in charge of funding science, communicating science, and maintaining the archive of scientific knowledge. If that sounds like a good idea to you, then NIH’s open-access policy should suit you just fine.

“put the [] government in charge of funding science, communicating science, and maintaining the archive of scientific knowledge. If that sounds like a good idea to you, then [] open-access policy should suit you just fine.”

Well, I can’t see much wrong with that – it’s certainly a major theme of funding in the UK. It’s not the government alone, of course, there’s the splendid work being done by the Wellcome Trust and other funding bodies. There is the problem of cost, of course, and publishing and archiving costs money. But if a funding body funds research it has a right (and a duty IMO) to make sure that work is as widely available as possible for the longest possible time.

Of course not all publishers use words like “socialized science” – which sounds slightly strange in other countries. But, lest you think that this was a storm in a teacup, 3 years on we have (news @ – PR’s ‘pit bull’ takes on open access – Journal):

Published online: 24 January 2007; Corrected online: 25 January 2007 | doi:10.1038/445347a

PR’s ‘pit bull’ takes on open access

Journal publishers lock horns with free-information movement.

Jim Giles

The author of Nail ‘Em! Confronting High-Profile Attacks on Celebrities and Businesses is not the kind of figure normally associated with the relatively sedate world of scientific publishing. Besides writing the odd novel, Eric Dezenhall has made a name for himself helping companies and celebrities protect their reputations, working for example with Jeffrey Skilling, the former Enron chief now serving a 24-year jail term for fraud.

Although Dezenhall declines to comment on Skilling and his other clients, his firm, Dezenhall Resources, was also reported by Business Week to have used money from oil giant ExxonMobil to criticize the environmental group Greenpeace. “He’s the pit bull of public relations,” says Kevin McCauley, an editor at the magazine O’Dwyer’s PR Report.

Now, Nature has learned, a group of big scientific publishers has hired the pit bull to take on the free-information movement, which campaigns for scientific results to be made freely available. Some traditional journals, which depend on subscription charges, say that open-access journals and public databases of scientific papers such as the National Institutes of Health’s (NIH’s) PubMed Central, threaten their livelihoods.

Media messaging is not the same as intellectual debate.

From e-mails passed to Nature, it seems Dezenhall spoke to employees from Elsevier, Wiley and the American Chemical Society at a meeting arranged last July by the Association of American Publishers (AAP). A follow-up message in which Dezenhall suggests a strategy for the publishers provides some insight into the approach they are considering taking.

The consultant advised them to focus on simple messages, such as “Public access equals government censorship”. He hinted that the publishers should attempt to equate traditional publishing models with peer review, and “paint a picture of what the world would look like without peer-reviewed articles”.

Dezenhall also recommended joining forces with groups that may be ideologically opposed to government-mandated projects such as PubMed Central, including organizations that have angered scientists. One suggestion was the Competitive Enterprise Institute, a conservative think-tank based in Washington DC, which has used oil-industry money to promote sceptical views on climate change. Dezenhall estimated his fee for the campaign at $300,000–500,000.

In an enthusiastic e-mail sent to colleagues after the meeting, Susan Spilka, Wiley’s director of corporate communications, said Dezenhall explained that publishers had acted too defensively on the free-information issue and worried too much about making precise statements. Dezenhall noted that if the other side is on the defensive, it doesn’t matter if they can discredit your statements, she added: “Media messaging is not the same as intellectual debate”.

Officials at the AAP would not comment to Nature on the details of their work with Dezenhall, or the money involved, but acknowledged that they had met him and subsequently contracted his firm to work on the issue.

“We’re like any firm under siege,” says Barbara Meredith, a vice-president at the organization. “It’s common to hire a PR firm when you’re under siege.” She says the AAP needs to counter messages from groups such as the Public Library of Science (PLoS), an open-access publisher and prominent advocate of free access to information. PLoS’s publicity budget stretches to television advertisements produced by North Woods Advertising of Minneapolis, a firm best known for its role in the unexpected election of former professional wrestler Jesse Ventura to the governorship of Minnesota.

The publishers’ link with Dezenhall reflects how seriously they are taking recent developments on access to information. Minutes of a 2006 AAP meeting sent to Nature show that particular attention is being paid to PubMed Central. Since 2005, the NIH has asked all researchers that it funds to send copies of accepted papers to the archive, but only a small percentage actually do. Congress is expected to consider a bill later this year that would make submission compulsory.

Brian Crawford, a senior vice-president at the American Chemical Society and a member of the AAP executive chair, says that Dezenhall’s suggestions have been refined and that the publishers have not to his knowledge sought to work with the Competitive Enterprise Institute. On the censorship message, he adds: “When any government or funding agency houses and disseminates for public consumption only the work it itself funds, that constitutes a form of selection and self-promotion of that entity’s interests.”

So the pit-bull is loose – which way will the pendulum swing?

The reality of closed access

Wednesday, May 9th, 2007

Here’s a typical example of getting information from the literature. Assume I don’t belong to a rich University
and I need to find out about cystic fibrosis. I can go to the splendid Pubmed (MEDLINE)

PubMed is a service of the U.S. National Library of Medicine that includes over 17 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubMed includes links to full text articles and other related resources.

I type in “cystic fibrosis” and find 27497 articles! Here’s #7:

7: Dellon EP, Leigh MW, Yankaskas JR, Noah TL. Related Articles, Links
Abstract Effects of lung transplantation on inpatient end of life care in cystic fibrosis.
J Cyst Fibros. 2007 May 2; [Epub ahead of print]
PMID: 17481967 [PubMed - as supplied by publisher]

Sounds important, let’s read it. First we get the abstract (the summary):

[title and authors omitted]

BACKGROUND: The impact of lung transplantation on end of life care in cystic fibrosis (CF) has not been widely investigated. METHODS: Information about end of life care was collected from records of all patients who died in our hospital from complications of CF between 1995 and 2005. Transplant and non-transplant patients were compared. RESULTS: Of 38 patients who died, 20 (53%) had received or were awaiting lung transplantation (“transplant” group), and 18 (47%) were not referred, declined transplant, or were removed from the waiting list (“non-transplant”). Transplant patients were more likely than non-transplant patients to die in the intensive care unit (17 (85%) versus 9 (50%); P=0.04). 16 (80%) transplant patients remained intubated at or shortly before death, versus 7 (39%) non-transplant patients (P=0.02). Do-not-resuscitate orders were written later for transplant patients; 12 (60%) on the day of death versus 5 (28%) in non-transplant patients (P=0.02). Transplant patients were less likely to participate in this decision. Alternatives to hospital death were rarely discussed. CONCLUSIONS: Receiving or awaiting lung transplantation affords more aggressive inpatient end of life care. Despite the chronic nature of CF and knowledge of a shortened life span, discussions about terminal care are often delayed until patients themselves are unable to participate.

But I want to know more… maybe the data need re-interpreting … so let’s read the whole article …

Access Online Article
Effects of lung transplantation on inpatient end of life care in cystic fibrosis
Journal of Cystic Fibrosis, In Press, Corrected Proof, Available online 3 May 2007,
Elisabeth P. Dellon, Margaret W. Leigh, James R. Yankaskas and Terry L. Noah View Abstract
You must have cookies enabled on your browser to successfully login.
If you have a User Name & Password, you may already have access to this article. Please login below.
User Name:
Athens/Institution Login
Forgotten your User Name or Password?
If you do not have a User Name and Password, click the “Register to Purchase” button below to purchase this article.
Price: US $30.00
Register to Purchase

… and it will cost you 30 USD…

So that – I hope – is an accurate depiction of the difference between Open (PubMed) and Closed (Journal of Cystic Fibrosis).

The importance of Open Data

Wednesday, May 9th, 2007

(Note for true effect, go to the real live pages mentioned here).

Here is a page from the Canadian National Committee for CODATA (sent by Alison Ball). I’m going to choose just one of many data sources:


About This Database

This database is devoted to the collection of mutations in the CFTR gene and is currently maintained by Julian Zielenski, Anluan O’Brien and Lap-Chee Tsui as a resource for the international cystic fibrosis genetics research community. It was initiated by the Cystic Fibrosis Genetic Analysis Consortium in 1989 to increase and facilitate communications among CF researchers. The specific aim of the database is to provide CF researchers and other related professionals with up to date information about individual mutations in the CFTR gene and phenotypic data associated with CFTR genotypes. While we will continue to ensure the quality of the data, we urge the international community to give us feedback and suggestions. Since the purpose of this database is to facilitate research, we ask our colleagues to use the information with great discretion in clinical settings. Similarly, we ask those who are looking for genotype-phenotype correlation to exercise extreme care in interpreting the recorded data. For information related to this mutation database, please send an email to cftr.admin. For general information on cystic fibrosis, please use our linked sites. Previous website can be found here.

Comments or questions? Please email to cftr.admin
The Database was last updated at Mar 02, 2007

If you have never seen a bioinformatics database, try the following: mRNA(cDNA) and Polypeptide Sequence

and you might get something like:

[An interactive page: click in a graph to get the CFTR mRNA(cDNA) sequence of 600 nts, enter start and end nucleotides of the mRNA(cDNA) or start and end amino acids of the polypeptide, pan and zoom along the sequence, and get a sequence-only copy of the DNA sequence or of the polypeptide sequence in one- or three-letter symbols.]

The point here is that the data is of great interest to many people – by no means just scientists. It allows you to query mutations in every part of the genome. Imagine if this data were locked up behind a commercial firewall…

Open Canada

Wednesday, May 9th, 2007

(and maybe a reader will give me the French translation – I do not know the gender).

I am delighted to be speaking at WWW 2007 in Banff, Canada as Canada has a very high profile in Open activities. This is not a comprehensive survey and depends on people I have met or have encountered virtually – order is idiosyncratic.

Heather Morrison has been very active in promoting Openness – coincidentally she posted a request for a summary of open data just today.

Here’s her home page, splendidly called The Imaginary Journal of Poetic Economics

Imagine a world where anyone can instantly access all of the world’s scholarly knowledge – as profound a change as the invention of the printing press. Technically, this is within reach. All that is needed is a little imagination, to reconsider the economics of scholarly communications from a poetic viewpoint.

At the same time, she got a reply from Alison Ball:


There are descriptions of Canadian data sites at Some examples of data sets that could be of interest to laypersons are:

  • Canadian Poisonous Plants Information System
  • Canadian Bird Trends
  • CADRMP Adverse Reaction Database
  • National Climate Data and Information Archive
  • Ontario Sport Fish Contaminant Monitoring Program

Wow! Just the sort of material I was looking for! These are exactly the sorts of data that should be Open. Alison is in the National Research Council of Canada which is also much more proactive than most at insisting that data become free.
And here are some more people and initiatives that I have been sent, and that Canada can be proud of:

Of the 16 people at the Budapest meeting that became BOAI, three are Canadian: Jean-Claude Guedon, Leslie Chan, and Stevan Harnad.

Francis Ouellette – see the Ouellette declaration. Andrew Waller, librarian from the University of Calgary and OA advocate. John Willinsky & the PKP Project / OJS. Canada has a recent, well-funded project called Synergies to support scholarly publishing – most of the nodes (SFU, University of Calgary) are strong supporters of open access.

  • Public Knowledge Project / Open Journal Systems: First International PKP Scholarly Publishing Conference, July 11-13, Vancouver, BC (highly recommended!!)
  • Canadian Association of Research Libraries (CARL) IR Project – all Canadian university research libraries either have, or are developing, an IR; the CARL metadata harvester currently covers 12 IRs, and work is being done to roughly double this number in the very near future.
  • Canadian Institutes of Health Research – draft Access to Research Outputs policy. If passed in a form similar to the draft, this will be a very strong policy, which calls for open data as well as Open Access to published results.
  • Open Access Declaration for the Ouellette Laboratory
  • IJPE Canadian Leadership in the Open Access Movement series

And Heather is even now teaching a whole (1-credit) course on open access – is this the first?

TBL+13: If everybody did it it would be awesome

Wednesday, May 9th, 2007

13 years ago I sat entranced listening to Tim Berners-Lee giving the closing address at the first WWW conference in CERN, Geneva. I was particularly influenced by one diagram which changed the way I thought about the world. I don’t know whether the one below – which I pinched from Dan Connolly’s tutorial – is Tim’s original but it captures the idea:

The Semantic Web… is an open world and universal space for machine-readable data.

things in documents

To a computer, then, the web is a flat, boring world devoid of meaning…This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them…Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.

TimBL, WWW1994

This diagram showed that there was a “subterranean” semantic world of documents that moved in sync with the real world. Which drives which we are still discovering!

So what is today’s message? Since there have been 13 years of world effort it’s probably not as epoch-making for me. But the “two magics” are creativity and collaboration. Both are critical. And the chemical blogosphere and the Blue Obelisk are where creativity and collaboration meet for chemistry. It has to be the start of the future.
And the single message to take away from Tim’s talk:
“If everybody did it it would be awesome”.