Bioclipse – Rich Client

I’m at the Bioclipse workshop in Uppsala – excellently run by Ola Spjuth and colleagues. Rich clients – where the client has significant functionality beyond the basic browser – are critical for the interchange of scientific information. A typical example is Maestro (NASA image viewer) where the typical browser does not – and cannot – have the local power and functionality required. So NASA wrote their own and you can download it and run it locally.
It’s easy to confuse Rich clients with AJAX services. For, say, Google maps all the functionality is on the server – cut off the web and you cannot use Google Maps. The maps are downloaded during use (you can often see the tiles coming down and covering the area). Nothing wrong with this, but you have to do it the way that Google has designed – you cannot re-use the maps in different ways (there are doubtless complex copyright issues anyway).
But isn’t everything moving towards a “cloud” model where all our data, functionality, etc. is remote – on Google, Yahoo, or whatever? Yes, for a lot of our everyday lives. But I think science is different. (Maybe I’m stuck in 20th C thinking… but I’ll try anyway). A scientist – or their lab – has data which is mentally on their laptop, or in the instruments, or in the fume hood or wherever. Most labs are probably not yet prepared to store this in Google. And they have local programs and protocols which only run on their local machines.
Moreover relying on GYM (Google, Yahoo, Microsoft) to provide all your environment means a loss of control. Scientists have already lost much of their independence to publishers (they control what you can publish – I just heard today of a publisher who would only accept graphics as EPS – why? – because it makes it easier for their technical department.)
There are obvious challenges to using a Rich Client. If every discipline has their own then the user gets confused. The developer has to manage every platform, every OS (at least in the future). Doesn’t this get unmanageable?
Yes, if everyone does things differently. But if everyone uses the same framework it becomes much easier. And that’s what we have with Eclipse. It’s the world’s leading code development environment and there are thousands of commercial plugins. It’s produced by IBM and is Open Source, Free (obviously) and designed for extensibility. So it will prosper.
If the scientific community converges on Eclipse as a Rich Client then we have enormous economies of scale. That’s what Bioclipse is doing in the chemistry and bio-sciences. In fact, however, much of the work is generic – data manipulation, display, RDF, etc. so other sciences can build on that and contribute their own expertise.
There are downsides. Everybody is familiar with browsers – very few scientists yet know Eclipse. But that can change. Eclipse has many tools for easy installation, tutorials, guided learning, updates, etc. All for free. So we expect to go through a period of “early adoption”.
In my talk yesterday I described Bioclipse as Disruptive technology (WP) – technology which destabilises current practice and leads to improvements in quality and cost. Even more importantly it returns power to the scientist – they are in control of their data and how to repurpose it. We hope to develop Bioclipse as a browser-publisher so that the scientist works in an environment where they decide how to emit data, not subservient to the technical editing departments of publishers or the proprietary formats so common in chemical information.

Posted in blueobelisk, chemistry, data, programming for scientists | Leave a comment

Sued for 10 Data Points

Peter Suber has blogged about an important discussion of Wiley's threat of legal action over the reproduction of a data graph from one of its publications. (There's quite a bit to read if you follow the links, but it's worth it.) Also read the followups, where several Open luminaries comment in a more equable manner than I feel capable of at the moment.

PS:  The Batts/Wiley story broke in late April when I was traveling.  If I’d been at my desk, I’d have covered it or at least I’d have tried.  But because the comments proliferated explosively, I wasn’t at my desk, and I had a full load of other work, I decided that I had to let it go.  I’m glad to catch up a bit with this post.  I’m also glad to have the chance to recommend comments by Mark Chu-Carroll, Cory Doctorow, Matt Hodgkinson, Bill Hooker, Rob Knop, Brock Read, Kaitlin Thaney, Bryan Vickery, and Alan Wexelblat.  Finally, Katherine Sharpe at ScienceBlogs, where the controversy began, solicited comments from five “experts and stakeholders” (Jan Velterop of Springer, John Wilbanks of Science Commons, Mark Patterson of PLoS, Matt Cockerill of BMC, and me [PeterS].)

The graph had 10 points. This, gentle readers, is Data. Numbers. Facts. Facts are Non-copyrightable. End of story. The author got round it by re-entering the data – well done – and absolutely correct – you cannot copyright numbers.
I have not seen the original graph, but I cannot assume that the technical authors at Wiley had created a "creative work" of immense artistic and cultural merit. There is a limit to what one can do with 10 data points. Perhaps they were going to hang it in Tate Modern. (Most publishers actually create "destructive works" on data – omissions, hamburgers, etc.).
We have to redeem our data – and quickly. There are several legal ways.

  • create supplementary data which we post on our web sites, in institutional repositories
  • just do it – as in this story. You have right on your side. Get your institution to back you. Make a fuss. Tell the world that the publishers are making it harder to save the planet. They are. We need data to save the planet. What if this were a graph (from a rival publisher) of the prediction of sea-level rise at Chichester (it’s on the sea – that’s where Wiley lives). Wouldn’t Wiley wish to know when they would be flooded?
  • Extract data from the publication in numeric form and post it. It will be increasingly possible to do this at zero cost. We’ll start explaining how in later blogs. And it will be legal.
Posted in data, open issues | 4 Comments

Linked Open Data

This is one of the key issues for me at present. Paul Miller (Talis) – who with his colleagues is constantly working towards a community license – writes (Linked Data the real Semantic Web ?):

It has been interesting to follow the rise of the ‘Linked Data‘ meme in the Semantic Web community recently, and to track it alongside longer term (but quieter) mutterings around ‘Open Data‘ from the likes of Tim O’Reilly and XTech programme committees past and present.
The recent push is due in no small part, I believe, to the sterling efforts of the Linking Open Data community, and to the support they’ve been receiving from W3C’s Semantic Web Education & Outreach (SWEO) group, of which I’m a rather quiet member.
Listening to Tim Berners-Lee’s keynote in Banff a week or so back, there was a strong steer toward ‘Linked Data’, and the opportunities presented by the relationships between resources and the aggregate of those resources. This thread came up again and again, most notably in the Linked/Open Data sessions. Thinking about it again, the whole Linked Data thrust actually comes across as a far more compelling way to describe the value of the Semantic Web to the non-geek audience. Are we seeing some formal shift in W3C’s language as we and they grapple to clearly express the value of these misunderstood ‘new’ approaches? Let’s hope so, as these Data Web/ Web of Data stories get far less bogged down in the horrors of ‘triples’, ‘ontologies’ and other concepts designed to send most audiences into an irretrievable tailspin…
If the Web of Data is the target, of course, the thorny issue of to whom the data belong, and the ways in which the data may be used, come to the fore once more. This is an area we’ve been tackling with contributions such as the Talis Community License, and it came up in Rob’s contribution in Banff [Rob’s audio here, PDF of everyone’s slides here], as well as papers from both of us at XTech last week. We’ve seen a lot of interest in some of the issues we’ve been stressing around the need to apply some licence to data, and the importance of understanding the rights that do – and don’t – apply to data as opposed to creative works, and look forward to finishing the work we started with the TCL and getting the whole thing onto some more formal footing.
One conversation from last week that has carried over onto email this week was with Rufus Pollock of the Open Knowledge Foundation. They don’t have a license, but they do usefully define a set of principles to underpin the notion of ‘open knowledge’, and they explicitly include the separate notion of data;

“The Open Knowledge Definition (OKD) sets out principles to define the ‘open’ in open knowledge. The term knowledge is used broadly and it includes all forms of data, content such as music, films or books as well any other type of information.
In the simplest form the definition can be summed up in the statement that ‘A piece of knowledge is open if you are free to use, reuse, and redistribute it’.”

We’re seeing movement as a growing body of implementors, commentators and analysts recognise the potential of linking disparate data resources together, leveraging some of the more basic capabilities of RDF and other Semantic Web enabling technologies. We’re also seeing a matching awareness of the need to protect use of those data sets (and not merely to safeguard the interests of data owners, but also – and far more tellingly – to give confidence to data aggregators and users), and a refreshing willingness to engage openly and cooperatively in reaching a pragmatic solution. It’s a great time to be involved in this space, and Talis looks forward to playing our full part across the piece.
Update: Rufus Pollock has begun a Guide to Open Data Licensing on their wiki…

One of the drivers is that systems such as Freebase and Metaweb claim to be able to manage huge amounts of linked Open Data. I’m hoping so as it will revolutionise the closed minds in chemical information. I’ll be trying out some of these ideas at the Bioclipse meeting tomorrow.

Posted in open issues, semanticWeb | 1 Comment

Bioclipse and the Information Revolution

I am honoured to have been asked to talk at the Embrace Workshop on Bioclipse 2007 (EWB 07), BMC, Uppsala (07.05.23) next week in Sweden. This post explains why Bioclipse is so important (it goes beyond bio/chem) and also provides the title and abstract of my talk. So first the facts – http://en.wikipedia.org/wiki/Bioclipse :

The Bioclipse project is a Java-based, open source, visual platform for chemo- and bioinformatics based on the Eclipse Rich Client Platform (RCP). Bioclipse uses, as any RCP application, a plugin architecture that inherits basic functionality and visual interfaces from Eclipse, such as help system, software updates, preferences, cross-platform deployment etc. Via these plugins, Bioclipse provides functionality for chemo- and bioinformatics, and extension points that easily can be extended by other, possibly proprietary, plugins to provide additional functionality.
The first stable release of Bioclipse includes a CDK plugin (bc_cdk) to provide a chemoinformatic backend, a Jmol plugin (bc_jmol) for 3D-visualization, and a BioJava plugin (bc_biojava) for sequence analysis. Bioclipse is developed as a collaboration between the Proteochemometric Group, Dept. of Pharmaceutical Biosciences, Uppsala University, Sweden, and the Research Group for Molecular Informatics at Cologne University Bioinformatics Center (CUBIC).

Bioclipse is based on the enormously professional and influential Eclipse framework – developed by IBM and made Open Source. I use Eclipse every day for my software development. It contains a rich set of resources (editors, browsers, searchers) along with the management of key components (compilers, repositories (SVN/CVS)). But because the Eclipse framework is written so flexibly, many of these can be "stripped out" and replaced with domain-specific components (for bio- and chem- applications). Not surprisingly, many of the Blue Obelisk projects have produced components which are now part of, or pluggable into, Bioclipse.
Over the last two weeks I have been heavily influenced by the vision of the "lowercase semantic web" and this will be an important aspect of my presentation:

“Bioclipse and the Information Revolution”

(Peter Murray-Rust,

Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, Cambridge, CB2 1EW, UK)

Chemistry is a complex subject and its information management requires complex software. Traditionally this has been provided by groups (often commercial companies) which supply monolithic software systems, and by large information aggregators who compile, curate and redistribute products and services. In recent years innovation and the creation of value have slowed, and much of the emphasis has been on integration within commercial customers (e.g. pharmaceutical) rather than the development of new functionality. In particular the academic community – on whose research the industry relies – is deprived of a software and information environment in which it can freely innovate.

By contrast the web has seen a recent explosion of innovation and wealth creation – often categorised as "Web 2.0" or the "semantic web". This is exemplified by the rise of the blogosphere (see Chemical blogspace), where many (young) scientists are trying new ways of communication and information re-use.

But the current web is based very heavily on text and graphics and has very little support for formalized disciplines such as chemistry. The browsers have little native support for XML (and what little there is can be found in vertical plugins, e.g. for mobile telephony). Much of the technology is based on centralised APIs such as Google Maps, whose centralist, thin-client model does not translate to chemistry. And, if it did, it could consolidate the central control of information which many of us feel to be restrictive.

The current set of tools (wikis, blogs, etc.) are syntactically weak and (excluding a few experiments) have no semantic support. Authors require semantic chemical tools, but are frustrated. Most rich chemical information rests on the laboratory bench – molecules, reactions, spectra, crystal structures, reports, recipes, etc. If this were made publicly available in semantic form, chemistry could move towards a peer-to-peer network that accurately represented the "ownership" of information.

The chemistry Open Source and Open Data community has now produced a critical mass of tools, many in wide use (post-beta), with more at alpha and beta. It has been brave in creating components – often unglamorous but increasingly robust – which are interoperable and reconfigurable. They are increasingly being taken seriously, for example in pharma.

Until now the bench chemist – often trained on "clicks" within a Microsoft environment and ignorant of commandlines or scripting – has found that too much integration is required. But Bioclipse can and will change that. Any tech support department in any institution will be familiar with Eclipse and can help with installation and integration, and maybe even wrap some plugins.

The challenge for Bioclipse is to generate “viral” penetration within the chemical community. To do this it must:

  • be trivially installable. (I am currently installing V1.1.1).
  • be navigable. Is the user interface – with its perspectives – one that chemists can learn?
  • provide enough functionality to be useful.
  • require little or no maintenance.
  • ideally have a unique selling point (do something useful that other systems don’t)
  • interoperate with other systems (Bioclipse won’t be able to do everything)
  • create a semantically rich editor-browser platform (perhaps based on RDF)

This is a big challenge, but most of the Blue Obelisk and other Open Source community will be helping. (Bioclipse is Java, so non-Java applications such as OpenBabel and InChI require additional engineering). The areas where Bioclipse can take a lead include:

  • management of chemical documents (papers, theses, lab reports), using chemical linguistics such as OSCAR3
  • integration of structured ontologies such as GoldBook, ChEBI, CML dictionaries
  • validation of chemical information (using CML and other XML technologies documents and data sets can be formally validated)
  • integration of robots (e.g. harvesting of public chemical information)
  • integration into the chemical blogosphere (e.g. support for microformats and RDF).
  • linking of information within chemistry (e.g. analytical data to spectra)
  • linking between disciplines (e.g. small molecules to bioscience applications)

Given these, and given support for “most” of what chemists already require, Bioclipse should have immediate appeal. This will be strengthened by the needs and support of other communities such as

  • publishers (who need structured information that can be repurposed)
  • librarians (who need future-proof semantics for archival and preservation in institutional repositories)
  • regulators (who need searchable semantic information)

If it can spread virally, Bioclipse will be part of a Disruptive technology which will change the face of chemical information and effectively start the creation of the chemical semantic web.

Posted in blueobelisk, open issues, programming for scientists, XML | 2 Comments

Audible Open Data at WWW2007

Danny Ayers, who ran the developers track at WWW2007, recorded our Open Data session. Some presentations had slides, and Steve Coast and I in particular used animated/interactive material, but I think the ideas come across. The Q&A had a lot of audience participation (which we all encouraged) but not all speakers were close to microphones – but hey, it's the first time!

Danny Ayers: Open Data Podcasts

The Dev Track at WWW2007 began with a group of presentations on Open Data, chaired by Paul Miller of Talis. I did a rough & ready recording, which I’ve trimmed and chunked. Quality varies quite a bit (Q&A particularly), and some slides were shown, but I believe there’s plenty in the audio to make sense of the material.

Personally I was particularly pleased by this session because it revealed information that I for one would never have searched out without prompting. Turned out to be very interesting. “When you start using data, you need to start paying attention to these things…”

In retrospect the session was perhaps slightly mis-billed having “Semantic Web” in the title, when the material was about data in general. But the room was full and the audience engaged, so no harm done. In fact I think RDF was only mentioned once – see if you can spot it…  
Q: “Where do we go from here?”
A: “Evangelize!”
Many thanks to everyone involved.

I was particularly pleased to see the wide engagement – this is the dev track, which is not about politics – but it was clear that access to data really matters to a lot of people. There is obviously a need for licenses – Talis have been working on one, for example.
Thanks Danny

Posted in open issues, www2007 | Leave a comment

More on triple stores – molecules next

Since I think triple stores will change the way we think about information, the current posts are something of a stream of consciousness. (Rather tedious, as it is almost impossible to put nicely formatted TT stuff into WordPress.) We shall certainly benefit from HTML-based demos, and Kingsley Idehen from OpenLink – with whom I spent a lot of time at WWW2007 – has done this…

  1. Kingsley Idehen Says:
    May 19th, 2007 at 10:36 pm

    Apropos Richard's comments, you can also view the Uppsala URI via an RDF Browser, as demonstrated by the OpenLink RDF Browser that is installed on the DBpedia server – All About Uppsala Permalink. I am also knocking up a screencast to show how you can use the iSPARQL Query Builder to paint your SPARQL Graph Patterns on a QBE Canvas. If you are impatient you can see some iSPARQL and DBpedia screencasts I've already prepared (re. other queries). Important Note: We have enhanced the hyperlink behavior so that you can experience the power of Linked Data and URI Dereferencing. Just click "Explore" or "Dereference" in the Lookup window presented by the enhanced hyperlink.

Thanks Kingsley. While I’m writing, here are a few more thoughts which the experts are welcome to correct.
A triple can be regarded as a directional arc in a graph (“graph” is used topologically – i.e. what is connected to what. Graphs have nodes with arcs between them and all can be labelled with URIs or literals or other primitives). A triple store is somewhere that you can add your labelled arcs and it will sort out whether they have nodes or predicates in common with other arcs. I think of the triple store as a huge graph which can be searched with SPARQL. Indeed if we have a large enough machine all the world’s triples can be loaded and you can ask any question which can be framed as SPARQL.
The real excitement is that if you use the same vocabulary as other people then you can combine your graphs. So, for example, if the Blue Obelisk and chemical blogosphere community all use the same microformats and all the blogs and data files all use them consistently you can ask questions of the whole blogosphere. Remember that things like dates, places, publications, etc. are already catered for – it is up to us to do the chemistry. So remember when I set a puzzle on finding a compound with yellow crystals, an unusual spacegroup and a molecular weight about 250? This is just the sort of query that opens up with WP or the chemical blogosphere.
To do this with relational databases is virtually impossible, but with RDF it is conceptually simple. If the information is in there, SPARQL will find it.
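The graph-plus-pattern idea above can be sketched in a few lines of Python. This is a toy illustration only, not a real triple store: the `TripleStore` class, the example triples and the abbreviated URIs are all invented for the example, and a wildcard match stands in for a real SPARQL engine (which adds indexing, joins over multiple patterns, persistence, and much more).

```python
# Toy in-memory triple store: a set of (subject, predicate, object) tuples.
# A None in a pattern acts like a variable in a SPARQL graph pattern.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None is a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TripleStore()
store.add(":Anders_Celsius", "dbpedia:deathplace", ":Uppsala")
store.add(":Anders_Celsius", "rdf:type", ":Person")
store.add(":Carl_Linnaeus", "dbpedia:deathplace", ":Uppsala")

# "Who died in Uppsala?" -- roughly the equivalent of
#   SELECT ?s WHERE { ?s dbpedia:deathplace :Uppsala }
who = sorted(t[0] for t in store.match(p="dbpedia:deathplace", o=":Uppsala"))
print(who)  # [':Anders_Celsius', ':Carl_Linnaeus']
```

The point of the sketch is the last line: the query never names the answer – it names a pattern, and any graph merged into the store (yours, mine, the blogosphere's) can contribute matches.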
The problem is scale. Triple stores work quite well (I think) if everything fits into memory but start struggling when they get too big. To test this I downloaded the Jena Semantic Web Framework – which is Open Source – and loaded it in Eclipse. I then got persons.nt from dbpedia (about 50Mb), which has triples about people in WP. I found that out-of-the-box I could load about half the file (but I haven't reset the VM size) and that it would answer questions in a second or two. So for personal use – up to a million triples – Jena looks good.
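For readers curious what a file like persons.nt actually contains: N-Triples is deliberately line-oriented, one triple per line, ending in a full stop. Here is a minimal parsing sketch in Python; the `parse_nt` function and the sample line are my own illustration, and a real loader (Jena, rdflib) additionally handles literals, language tags and escaping that this ignores.

```python
# Minimal N-Triples line parser (sketch only; use a real RDF library
# such as Jena or rdflib for production work).
import re

# one triple per line: subject predicate object .
NT_LINE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

def parse_nt(lines):
    triples = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        m = NT_LINE.match(line)
        if m:
            triples.append(m.groups())
    return triples

sample = [
    '# toy excerpt in the style of dbpedia persons.nt',
    '<http://dbpedia.org/resource/Anders_Celsius> '
    '<http://dbpedia.org/property/deathplace> '
    '<http://dbpedia.org/resource/Uppsala> .',
]
triples = parse_nt(sample)
print(len(triples))  # 1
```

Because the format is line-oriented, a loader can stream a 50Mb file without holding the text in memory – the memory pressure comes from the resulting graph, which is exactly the scaling problem described above.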
I have a naive idea that everyone will recognise the value of triple stores and technology will therefore advance so fast that size will not be a problem. I’m also hoping that content will be more important than computational resource and that free triple stores will appear in the cloud. Certainly I would expect to see continuation of free services for dbpedia (I don’t know the basis of the current server).
And we'll be looking out for molecules. But first we have to convert them to triples – and that is already happening.

Posted in blueobelisk, chemistry, semanticWeb | Leave a comment

dbpedia, RDF and SPARQL – for chemistry

A comment on the last post

Richard Cyganiak Says: May 19th, 2007 at 7:55 pm
Thanks for this nice introduction Peter. Note that the DBpedia URIs also work in a web browser, so you can go to http://dbpedia.org/resource/Uppsala and the DBpedia server will generate a web page showing the information it has about the item.

I have been playing some more with dbpedia and RDF. (It is important to realise that – as in the above example – many of the accesses will be through UIs, not raw RDF, so casual readers shouldn't worry.) There are several ways of expressing RDF, and these are slightly confusing, but the important thing is to know what the predicates and potential values are. The exploratory approach (using a browser) Richard suggests is useful.
Let's look at chemistry. Assume we know very little about dbpedia. So access Arsenic in Wikipedia:

Arsenic

From Wikipedia, the free encyclopedia

[Periodic table navigation: germanium ← arsenic (33) → selenium; P above As, Sb below]
General
Name, Symbol, Number arsenic, As, 33
Chemical series metalloids
Group, Period, Block 15, 4, p
Appearance metallic gray
Standard atomic weight 74.92160(2)  g·mol−1
Electron configuration [Ar] 3d10 4s2 4p3
Electrons per shell 2, 8, 18, 5
Physical properties
Phase solid
Density (near r.t.) 5.727  g·cm−3
Liquid density at m.p. 5.22  g·cm−3
Melting point 1090 K
(817 °C, 1503 °F)
Boiling point subl. 887 K
(614 °C, 1137 °F)
Critical temperature 1673 K
Heat of fusion (gray) 24.44  kJ·mol−1
Heat of vaporization ? 34.76  kJ·mol−1
Heat capacity (25 °C) 24.64  J·mol−1·K−1
Atomic properties
Crystal structure rhombohedral
Oxidation states ±3, 5
(mildly acidic oxide)
Electronegativity 2.18 (Pauling scale)
Ionization energies
(more)
1st:  947.0  kJ·mol−1
2nd:  1798  kJ·mol−1
3rd:  2735  kJ·mol−1
Atomic radius 115pm
Atomic radius (calc.) 114  pm
Covalent radius 119  pm
Van der Waals radius 185 pm
Miscellaneous
Magnetic ordering no data
Electrical resistivity (20 °C) 333 n Ω·m
Thermal conductivity (300 K) 50.2  W·m−1·K−1
Young’s modulus 8  GPa
Bulk modulus 22  GPa
Mohs hardness 3.5
Brinell hardness 1440  MPa
CAS registry number 7440-38-2
Selected isotopes
iso NA half-life DM DE (MeV) DP
73As syn 80.3 d ε → 73Ge; γ 0.05D, 0.01D, e
74As syn 17.78 d ε → 74Ge; β+ 0.941 → 74Ge; γ 0.595, 0.634; β− 1.35, 0.717 → 74Se
75As 100% As is stable with 42 neutrons
References

Arsenic (IPA: /ˈɑːsənɪk/, /ˈɑɹsənɪk/) is a chemical element that has the symbol As and atomic number 33. Its Atomic Mass is 74.92. Its Ionic Charge is (3−). Its position in the periodic table is shown at right. This is a notoriously poisonous metalloid that has many allotropic forms: yellow (molecular non-metallic) and several black and gray forms (metalloids) are a few that are seen. Three metalloidal forms of arsenic with different crystal structures are found free in nature (the minerals arsenic sensu strictu and the much rarer arsenolamprite and pararsenolamprite), but it is more commonly found as arsenide and arsenate compounds. Several hundred such mineral species are known. Arsenic and its compounds are used as pesticides, herbicides, insecticides and various alloys.

The box on the right is the "infobox", and many of these properties find their way into dbpedia. The property names are not yet fully standardised and not all are memorable, but Wikipedian chemists can – and I suspect will – work to standardize them. In that case they can be used for RDF/SPARQL queries. For example:
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/resource/Category:Chemical_elements
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/resource/Category:Toxicology
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/resource/Category:Metalloids
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/resource/Category:Pnictogens
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/resource/Category:Arsenic
  As http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/class/yago/chemical_element
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Articles_with_unsourced_statements
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Chemical_elements
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Toxicology
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Metalloids
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Pnictogens
  As http://www.w3.org/2004/02/skos/core#subject http://dbpedia.org/resource/Category:Arsenic
  As http://dbpedia.org/property/reference http://www.asmalldoseof.org/
  As http://dbpedia.org/property/reference http://www.atsdr.cdc.gov/HEC/CSEM/arsenic/
  As http://dbpedia.org/property/reference http://www.clu-in.org/contaminantfocus/default.focus/sec/arsenic/cat/Overview/
  As http://dbpedia.org/property/reference http://www-cie.iarc.fr/htdocs/monographs/vol23/arsenic.html
  As http://dbpedia.org/property/reference http://www.greenfacts.org/arsenic/arsenic-1.htm
  As http://dbpedia.org/property/reference http://www.inchem.org/documents/ehc/ehc/ehc224.htm
  As http://dbpedia.org/property/reference http://www.npi.gov.au/database/substance-info/profiles/11.html
  As http://dbpedia.org/property/reference http://www.webelements.com/webelements/elements/text/As/index.html
  As http://dbpedia.org/property/reference http://www.origen.net/arsenic.html
  As http://www.w3.org/2000/01/rdf-schema#label Arsenic
  As http://www.w3.org/2000/01/rdf-schema#label Arsen
  As http://www.w3.org/2000/01/rdf-schema#label Arsénico
  As http://www.w3.org/2000/01/rdf-schema#label Arsenic
  As http://www.w3.org/2000/01/rdf-schema#label Arsenico
  As http://www.w3.org/2000/01/rdf-schema#label ヒ素
  As http://www.w3.org/2000/01/rdf-schema#label Arsenicum
  As http://www.w3.org/2000/01/rdf-schema#label Arsen
  As http://www.w3.org/2000/01/rdf-schema#label Arsênio
  As http://www.w3.org/2000/01/rdf-schema#label Arsenik
  As http://www.w3.org/2000/01/rdf-schema#label 砷
  As http://dbpedia.org/property/number 33
  As http://dbpedia.org/property/symbol As
  As http://dbpedia.org/property/left http://dbpedia.org/resource/germanium
  As http://dbpedia.org/property/right http://dbpedia.org/resource/selenium
  As http://dbpedia.org/property/above http://dbpedia.org/resource/phosphorus
  As http://dbpedia.org/property/below http://dbpedia.org/resource/antimony
  As http://dbpedia.org/property/color1 #cccc99
  As http://dbpedia.org/property/color2 black
  As http://dbpedia.org/property/group 15
  As http://dbpedia.org/property/period 4
  As http://dbpedia.org/property/block p
  As http://dbpedia.org/property/k [[sublimation (physics)|subl.]] 887
  As http://dbpedia.org/property/k 1090
  As http://dbpedia.org/property/c 614
  As http://dbpedia.org/property/c 817
  As http://dbpedia.org/property/f 1137
  As http://dbpedia.org/property/f 1503
  As http://dbpedia.org/property/mn 73
  As http://dbpedia.org/property/mn 74
  As http://dbpedia.org/property/mn 75
  As http://dbpedia.org/property/sym As
  As http://dbpedia.org/property/na 100
  As http://dbpedia.org/property/na http://dbpedia.org/resource/synthetic_radioisotope
  As http://dbpedia.org/property/n 42
  As http://dbpedia.org/property/hl [[1 E6 s|17.78]] d
  As http://dbpedia.org/property/hl [[1 E6 s|80.3]] [[day|d]]
  As http://dbpedia.org/property/dm1 ε
  As http://dbpedia.org/property/dm1 http://dbpedia.org/resource/electron_capture
  As http://dbpedia.org/property/de1
  As http://dbpedia.org/property/pn1 73
  As http://dbpedia.org/property/pn1 74
  As http://dbpedia.org/property/ps1 http://dbpedia.org/resource/germanium
  As http://dbpedia.org/property/dm2 http://dbpedia.org/resource/Positron_emission
  As http://dbpedia.org/property/dm2 http://dbpedia.org/resource/Gamma_radiation
  As http://dbpedia.org/property/ps2
  As http://dbpedia.org/property/ps2 http://dbpedia.org/resource/germanium
  As http://dbpedia.org/property/de2 0.05[[Delayed nuclear radiation|D]], 0.01D, [[Conversion electron|e]]
  As http://dbpedia.org/property/de2 0.941
  As http://dbpedia.org/property/pn2 74
  As http://dbpedia.org/property/dm3 γ
  As http://dbpedia.org/property/de3 0.595, 0.634
 

You can work out some of the property names by comparing against their values. For example…

  As http://dbpedia.org/property/c 817

links to the melting point in degC. There is still some work to be done (e.g. linking properties and units) but RDF is well suited for this and we’ll be working out some approaches.
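To make the properties-and-units point concrete, here is a hedged Python sketch. The predicate names and values are taken from the dbpedia listing above, but the `UNITS` lookup table is purely my own assumption about what a standardised predicate-to-unit mapping might look like – it is not part of dbpedia.

```python
# Hypothetical predicate -> unit table (my assumption, not dbpedia's).
UNITS = {
    "http://dbpedia.org/property/c": "degC",
    "http://dbpedia.org/property/k": "K",
    "http://dbpedia.org/property/f": "degF",
}

# Values as they appear in the arsenic listing above.
triples = [
    ("As", "http://dbpedia.org/property/c", "817"),
    ("As", "http://dbpedia.org/property/c", "614"),
    ("As", "http://dbpedia.org/property/k", "1090"),
]

def with_units(triples):
    """Attach a unit to each value whose predicate is in the table."""
    return [(s, p, v + " " + UNITS[p]) for s, p, v in triples if p in UNITS]

for s, p, v in with_units(triples):
    print(s, p.rsplit("/", 1)[-1], v)  # e.g. "As c 817 degC"
```

Note the remaining ambiguity the post mentions: both 817 (melting) and 614 (boiling, sublimation) share the /c predicate, so units alone are not enough – the predicates themselves still need standardising.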

Posted in chemistry, semanticWeb | 3 Comments

SPARQL, SNORQL triple store – an introduction

I am overwhelmed by the potential power of RDF and the semantic web and am sure that this is a large part of our information-based future. I have been playing with dbpedia and its SPARQL interfaces and here I introduce you to the ideas.

Recall that dbpedia is a dump of (all?) the organized information (infoboxes and categories) in Wikipedia. They have created an RDF version of this (a triple store). Until recently triple stores had problems of scale, but at WWW2007 several people/organizations asserted they could deal with many billions of triples. Here we are going to access dbpedia through the SNORQL web interface. I suggest you click along with me – note that the syntax is a little hairy in places but you will get used to it. As I am going to Uppsala next week for the Bioclipse meeting, I’ll use “Uppsala” as the theme.
A triple looks like:

subject – predicate – object

and any or all of the components may be absolute URIs. Here is an example (in N3 notation):

<http://dbpedia.org/resource/Anders_Celsius> <http://dbpedia.org/deathplace> <http://dbpedia.org/resource/Uppsala> .

(Note that the syntax is NOT XML and NOT HTML – and do not omit the ‘.’ at the end!) In simple terms this states that in dbpedia resource/Anders_Celsius is the subject, deathplace is the predicate and resource/Uppsala is the object. These are all symbols for nodes in the dbpedia triple store (graph) and so far have no lexical form and no semantics. However it is not surprising that we humans can interpret this as

The “deathplace” of “Anders Celsius” is “Uppsala”. dbpedia (and WP) attach human-readable annotations to these components so we can reasonably assert that “Anders Celsius died in Uppsala”.
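Triple-pattern matching of this kind is easy to emulate. A minimal Python sketch (a toy in-memory store, not dbpedia’s actual machinery) treats each triple as a tuple and `None` as a wildcard:

```python
# Minimal in-memory triple store: triples are (subject, predicate, object)
# tuples, and None in a query pattern matches anything.
triples = {
    (":Anders_Celsius", "dbpedia:deathplace", ":Uppsala"),
    (":Anders_Celsius", "dbpedia:birthplace", ":Uppsala"),
    (":Uppsala", "rdf:type", "dbpedia:class/City"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Where did Anders Celsius die?"
print(match(s=":Anders_Celsius", p="dbpedia:deathplace"))
# [(':Anders_Celsius', 'dbpedia:deathplace', ':Uppsala')]
```

A SPARQL engine does essentially this, at scale and with indexes, over billions of triples.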

We’ll show how to find “Celsius” later. For now let’s assume that http://dbpedia.org/resource/Uppsala will occur as subjects and as objects in dbpedia (It’s unlikely to be a predicate). For human readability we use prefixes:

PREFIX dbpedia: <http://dbpedia.org/>
PREFIX : <http://dbpedia.org/resource/>

so <http://dbpedia.org/resource/Uppsala> … becomes … :Uppsala

and <http://dbpedia.org/deathplace> … becomes … dbpedia:deathplace

dbpedia has a number of well-known namespace-URI systems and SNORQL lists several as PREFIXes:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

so let’s issue a simple query – “list all triples where :Uppsala is the subject”:

SELECT  *
WHERE {
  :Uppsala ?predicate ?object .
  }

This gives about a hundred triples of which Uppsala is the subject. Here are a few:
rdf:type dbpedia:class/City

The objects here are Wikipedia types – often drawn from “Category” pages. These are becoming increasingly well formalized and will represent a common human label for well-described objects. So Uppsala “is a City”, “is a town”, “is a University_town” and so on.

dbpedia2:abstract: “If you are searching for the Uppsala of Norse mythology, see also Gamla Uppsala. Uppsala (older spelling Upsala) is a city in central Sweden, located about 70 km north of Stockholm. It is the fourth largest city in Sweden with its 130,000 inhabitants; including immediate surroundings, Uppsala Municipality amounts to 183 403 (2005). Uppsala is the capital of Uppsala County (Uppsala Län), and Sweden’s ecclesiastical centre, being the seat of Sweden’s archbishop since 1164. Uppsala is famous for its university, the oldest still existing in Scandinavia and Northern Europe, founded in 1477 (a Studium Generale was founded in Lund already in 1425).”@en

This is the @en abstract – there are at least 12 other language equivalents. They don’t all have the same information. Note that the object is here a literal (a string), not a URI.

owl:sameAs http://sws.geonames.org/2666199/

This is very important: geonames is the agreed central web classification/gazetteer/taxonomy of placenames. Entry #2666199 is precisely the Uppsala in Sweden. This allows linking to GIS coordinates, etc.
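SNORQL is just a web front end to dbpedia’s SPARQL endpoint, so the :Uppsala query above can also be issued over plain HTTP. A minimal Python sketch (the endpoint path and the format parameter are assumptions about how the service is configured; the request is built here but not actually sent):

```python
from urllib.parse import urlencode

# Build an HTTP GET request for the query against dbpedia's public
# SPARQL endpoint. The "format" parameter asks for JSON results;
# sending the request is left to the reader.
query = """
PREFIX : <http://dbpedia.org/resource/>
SELECT * WHERE { :Uppsala ?predicate ?object . }
"""

params = urlencode({"query": query,
                    "format": "application/sparql-results+json"})
url = "http://dbpedia.org/sparql?" + params
print(url[:60] + "...")
```

Any HTTP client (a script, a rich client like Bioclipse, or just a browser) can then consume the results programmatically rather than through the web form.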

Now let’s try :Uppsala as the object – “something has predicate Uppsala”.

SELECT  *  WHERE {  ?subject ?predicate  :Uppsala  .  }

:Uppsala_County dbpedia2:capital
… Uppsala County has capital Uppsala

:Area_code_018 dbpedia2:State
… area code 018 corresponds to State Uppsala

:Uppland_Rune_Inscription_1011 dbpedia2:city
… inscription 1011 is in (city) Uppsala

:Anders_Celsius dbpedia:deathplace
… Anders Celsius died in Uppsala

Already this is incredibly powerful – “tell me everything about X” in machine-understandable form. But when we link patterns we can get more:
“Find all people who were born and died in Uppsala and tell me all about them”:

SELECT  ?subject ?predicate ?object
WHERE {
?subject dbpedia:deathplace  :Uppsala .
?subject dbpedia:birthplace  :Uppsala .
?subject ?predicate ?object .
}
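The two patterns share the variable ?subject, so the query behaves like a join. A toy Python sketch of the idea (an in-memory set of triples, not dbpedia itself): the answer is the intersection of “died in Uppsala” and “born in Uppsala”.

```python
# Toy illustration of a shared-variable join: Celsius was born and died
# in Uppsala; Linnaeus died there but was born elsewhere (Rashult).
triples = {
    (":Anders_Celsius", "dbpedia:deathplace", ":Uppsala"),
    (":Anders_Celsius", "dbpedia:birthplace", ":Uppsala"),
    (":Carolus_Linnaeus", "dbpedia:deathplace", ":Uppsala"),
}

def subjects_with(p, o):
    """All subjects appearing in a triple with this predicate and object."""
    return {s for s, pp, oo in triples if pp == p and oo == o}

died = subjects_with("dbpedia:deathplace", ":Uppsala")
born = subjects_with("dbpedia:birthplace", ":Uppsala")
print(died & born)  # {':Anders_Celsius'}
```

SPARQL does this intersection for us, then the third pattern (?subject ?predicate ?object) pulls in everything known about each survivor of the join.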

I’ll leave you to run that. (The result is not someone I know of – but then I don’t watch Dynasty or Stargate.)
And that’s only the beginning. Try asking “Which Stargate actors were born outside the US?” “Which actors played in both Dynasty and Stargate?” “Which actors were born and died in the same city?”.

And these are relatively simple questions. Imagine what will be possible when the Wikipedia chemistry gets fully into dbpedia!

Posted in semanticWeb | 3 Comments

Open Publishing – SPARC and Science Commons

Peter Suber highlighted the joint initiative of SPARC and Science Commons (a “spin-off” of Creative Commons and W3C) in creating an addendum that allows authors to state what THEY would like done with their publications.

The Scholar’s Copyright Addendum Engine will help you generate a PDF form that you can attach to a journal publisher’s copyright agreement to ensure that you retain certain rights.

Description

Each addendum gives you non-exclusive rights to create derivative works from your Article and to reproduce, distribute, publicly perform, and publicly display your article in connection with your teaching, conference presentations, lectures, other scholarly works, and professional activities. However, they differ with respect to how soon you can make the final published version available and whether you can authorize others to re-use your work in various ways. Below is a summary of the available options.

Science Commons / SPARC Addendum

Access – Reuse:
You retain sufficient rights to grant to the reading public a Creative Commons Attribution Non Commercial license or similar license that allows the public to re-use or re-post your article so long as you are given credit as the author and so long as the reader’s use is non-commercial. (This is a joint offering from Science Commons and SPARC and represents a new version of the former SPARC Addendum.)

Other Options From Science Commons

Immediate Access:
You retain sufficient rights to post a copy of the published version of your article (usually in pdf form) online immediately to a site that does not charge for access to the article. (This is similar in many ways to the MIT Copyright Amendment below)
Delayed Access:
You also have the right immediately to post your final version of the article, as edited after peer review, to a site that does not charge for access to the article, but you must arrange not to make the published version of your article available to the public until six months after the date of publication.

Additional Options from MIT

MIT Copyright Amendment:
Developed at MIT, this amendment is a tool authors can use to retain rights when assigning copyright to a publisher. It will enable authors to continue using their publications in their academic work at MIT, to deposit them into the MIT Libraries’ DSpace repository, and to deposit any NIH-funded manuscripts on the National Library of Medicine’s PubMed Central database. More information is available from the MIT Libraries.

(I’ll pass on PDF… 🙂 This is very welcome because it should encourage many authors to assert these rights. With an approach that has been worked out in advance, it is easier to tell publishers what YOU want. Henry Rzepa and I have been through this – won some, lost some – but it’s just that bit harder for a publisher to reject authors’ wishes when they come supported by two formal bodies with a great deal of moral weight. If all authors did this – and there is nothing to lose – it would raise the issue with publishers, and some would increasingly see the rationale and adjust to the moral reality. Since this is a complex issue it is very valuable to have bodies who have thought through the most important questions.
So, when you publish, add this form. It will be fun to see how the publisher reacts.

Posted in open issues | Leave a comment

Avoiding Mass Extinction with OpenData

A very impressive talk yesterday by Gavin Starks about the challenge of Climate Change. If you ever have the chance to hear or meet him, do. The talk has been blogged by the indefatigable Talis/Nodalities (Paul Miller and, in this case, Rob Styles) as
Climate Change isn’t about saving the planet
Gavin’s message was simple – a necessary condition for saving the planet (and ourselves) is to have a consistent approach to using the available data. That means Open Data and Open Standards for using it.
As simple as that. How will future generations (if there are any) judge those people or organisations who did not share data?
Explore Gavin’s Avoiding Mass Extinction Engine

“AMEE” is a technical service that features:

:: Measurement

Access to standardised CO2 data and calculations (including the official UK Government figures)

:: Profiling

Store and retrieve personal footprints

:: Sharing and Transparency

Help develop, extend, share and collaborate on the measurement of energy consumption.

Our mission: enable and encourage engagement to address a truly global issue

Posted in open issues, xtech2007 | Leave a comment