XTech2007 – XMLRDF

Alf Eaton and Gavin Bell (Nature) put together a lively BOF this evening on scientific publishing. They presented many of the key components – XML, persistent identifiers, ontologies, etc. Nice to see credit being given to PLoS for its pioneering use of these things (e.g. IDs for supplemental data).
A strong feeling from all that PDF must be supplemented or replaced by greater structure. “XML” is a useful mantra – although XML by itself is sometimes too constraining – and we need RDF. Maybe XMLRDF is a better mantra – it needs the XML to emphasise the difference from PDF and the RDF to point towards the future.
An anecdote of how the biter gets bitten – a publisher had acquired a chunk of content from another source (? merger/acquisition) and found that the PDFs were read-only – the hamburgers had been encrypted and the password lost. So they could be viewed but not re-used. Time for a change!
[ADDED IN PROOF] A much fuller post from Paul Miller

Posted in XML, xtech2007 | 1 Comment

XTech2007 – XForms – do I need them?

Now in XTech2007 – arrived in time for the afternoon session on XForms by Steve Pemberton. XForms allow you to pass XML into/out of forms rather than relying on HTML. They include things like validation – if you tell it something is a date, it can check that it makes sense as a date. And there’s stuff about credit cards, etc. So it makes sense to adapt them for – say – chemistry so that we can check data and molecules on submission.
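As an aside, here is a rough analogue (a Python sketch, purely illustrative – XForms does this declaratively inside the form) of the kind of check a processor applies when a field is typed as xsd:date:

```python
# A rough analogue (illustrative only) of the check an XForms processor
# applies when a field is bound to the xsd:date type.
from datetime import date

def looks_like_xsd_date(value: str) -> bool:
    """True if the string parses as YYYY-MM-DD (ignoring xsd:date's optional timezone)."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

print(looks_like_xsd_date("2007-05-16"))  # True
print(looks_like_xsd_date("2007-13-01"))  # False – month 13 does not make sense as a date
```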
I hadn’t looked at them for ca. 3–4 years as I hadn’t seen any implementations. In fact, according to Steve, XForms is the MOST implemented W3C spec ever. The reason I have missed them is that they tend to be used in mobiles as well as browsers, and there is also a lot of star-centred business – a company whose customers all use XForms and there is central control. Nothing wrong with that, but it won’t be obvious to non-customers. The insurance industry has also gone for them in a big way.
But most of the implementations come from the user communities themselves rather than being based on libraries (which is what we need). There is X-Smiles, which might help us – I think it’s now mature. But the scale seems a bit daunting: “we used to have 30 programmers working on UIs for 5 years; now we solved the problem in 1 year with only 10 programmers”. Sic.
But there do seem to be plugins for Firefox (or they are in the pipeline) – using, I think, XBL, and some with SVG. So maybe there is still hope for the browser in this area.
But whether we can move quickly towards a validating chemical data entry tool … I will continue to hack with broken tools for a little while
(In the original version of this post I used the erroneous “XMLForms”)

Posted in XML, xtech2007 | 3 Comments

Data validation and Protocol validation

In response to a lively blog interchange on quality and validation of data Antony Williams has produced a useful comment (into which I insert annotations):

Thanks for the feedback on the definitions. I have connected with our collaborators at ACD/Labs, specifically the PhysChem product manager, and have pointed him to your comments on the blog. I will leave it to him to choose whether or not to edit the definitions.

Thanks. Definitions (ontologies) are key to the emerging semantic web and this is becoming mandatory so I obviously encourage this.

The display of the units for PSA on the initial search results page was an oversight since it is on EVERY other view of the results display so thanks for pointing it out. It was fixed within minutes of reading your blog.

I have an obsession with scientific units of measurement – it is a solved problem in principle but rarely implemented. We are developing semantic approaches to units.

Regarding your observations about Prussian Blue and solubility. There’s a lot of misinformation out there for sure… http://ptcl.chem.ox.ac.uk/MSDS/IR/iron_III_ferrocyanide.html is named as Prussian Blue and defined as soluble in water. However, I am going with the Wikipedia definition, which talks about “Soluble” Prussian Blue – Prussian Blue is insoluble, but it tends to form such small crystallites that colloids are common. These colloids act like solutions, for example they pass through fine filters. According to Dunbar and Heintz, these “soluble” forms tend toward compositions with the approximate formula KFe2Fe(CN)6.

There was nothing special about my selection of Prussian Blue – and in fact your suggestions below take care of many other concerns

Based on your multiple comments I am considering recalculating the properties having prefiltered and excluded compounds based on the following constraints:

This is indeed what I consider the correct way to manage the database – create a series of protocols and measure the value of each of them in improving the accuracy/quantity ratio.

1) Exclude substances containing elements other than As, B, Br, C, Cl, F, Ge, H, I, N, O, P, Pb, S, Se, Si, Sn – the elements supported by the ACD/PhysChem predictors.
2) Only include single-component substances – this would resolve your issue with CaCO3 and Prussian Blue.
3) Exclude substances represented as a single atom.
4) Exclude structures containing isotopes.
5) Exclude radicals.
6) Exclude structures with a delocalized charge.

These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment on that, but it may be that those are also useful filters. In general you may need to be prepared to sacrifice a considerable quantity of data in return for greater confidence in quality.

I welcome your comments….

=====
These are the types of filters that we now routinely institute in deciding which components of a chemical dataset are worth including. We normally develop them (e.g. for CrystalEye) by computing the difference between theory and experiment and then devising filters – and certainly all of the above have been included. In crystallography we also use the temperature of the experiment, etc. – you may wish to remember that many physical properties depend directly on both temperature and the physical state of the substance. Developing protocols can take time, but it is worth it.
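For concreteness, here is a minimal sketch of the kind of pre-filter listed above – assuming Python with RDKit and SMILES input; the function and variable names are mine, not ChemSpider’s or ACD/Labs’:

```python
# A minimal sketch (assuming RDKit and SMILES input) of constraints 1-5 above;
# purely illustrative, not any production protocol.
from rdkit import Chem

ALLOWED_ELEMENTS = {"As", "B", "Br", "C", "Cl", "F", "Ge", "H", "I",
                    "N", "O", "P", "Pb", "S", "Se", "Si", "Sn"}

def passes_prefilter(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                   # unparsable structure
        return False
    if len(Chem.GetMolFrags(mol)) > 1:                # multi-component (salt, mixture)
        return False
    if Chem.AddHs(mol).GetNumAtoms() < 2:             # single-atom substance
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ELEMENTS:  # unsupported element
            return False
        if atom.GetIsotope() != 0:                    # explicit isotope label
            return False
        if atom.GetNumRadicalElectrons() > 0:         # radical
            return False
    return True          # constraint 6 (delocalized charge) would need a separate check

print(passes_prefilter("OC(=O)c1ccccc1"))   # benzoic acid: True
print(passes_prefilter("[Na+].[Cl-]"))      # two components: False
```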
Best

Posted in data | 4 Comments

XTech2007 and Open Data

I had got lazy about tagging my posts until Brian Kelly gently reproached me for not adding “WWW2007” as a tag. The point is that Technorati and other engines index the tags and you can search on them. (What, you didn’t know THAT? Well, not really.) So now I do. And if you look at http://www.technorati.com/ and search for “www2007” in “tags” you’ll find many posts (and many photographs).
So I’m now going to add “xtech2007” to my tags. Is this the right tag? I don’t know – but the community will converge on something. And:
http://www.technorati.com/tag/xtech2007
shows that there are already 14 posts and that Paul Miller – who ran our Open Data session – will also be there. (Well I knew that, but there are others I don’t).
So there is the Open Data track – quite diverse – and I get the impression this will be as exciting as WWW2007. The world is changing as we look at it.

Posted in open issues, xtech2007 | 1 Comment

WWW2007 postscript

I am delighted that I had the chance to go to WWW2007 – at one stage I’d wondered whether there would be anything of interest other than the session I was in (Open Data). Or whether I would know anyone… After all, it was 13 years since the last/first WWW meeting I went to (although obviously there is a lot of overlap with XML). And would I have lost touch with all those W3C Recommendations (== standards)? As it turned out I got so excited I found it difficult to sleep.
The features I take away are:

  • “Web 2.0” is big with the industry people – the keynotes (I’ve already mentioned TimBL) concentrated on the new webSociety where the technical stuff should be part of the plumbing. Nothing really new, but optimism about pixelsEverywhere (i.e. we shan’t need laptops – we’ll read our email on the gas pumps) – trust and identity, revenue generation, etc.
  • “Semantic Web” – overlaps with, but is different from, Web 2.0. The immediate progress (for which I am glad) will be the lowercase sw – just do it NOW! – for which the human nodes and arcs will be critical. The sw will be rather like a W. Heath Robinson machine – all string and sealing-wax – but every joint will be surrounded by humans pouring on oil, adding rubber bands, etc. We’ve no idea what it will evolve into, but we are optimistic.
  • “Linked Data” – a very strong and exciting theme. We are generating RDF triples in advance of knowing how we are going to connect them. It’s somewhat like a neural net. We think there will be an explosion of insight when this happens – beyond what we have done with Web 2.0 mashups, useful though those are. I’m currently trying to load the basic tools so I can play with real stuff.
  • “Open Data”. Very positive and exciting. There is no doubt that the Web of the next few years will be data driven. Everyone was dismissive of walled gardens and sites without RDF-compatible APIs – including Creative and other Commons licenses. The semantic web can only function when data flows at the speed of the internet, not the speed of lawyers, editors and business managers. And I have no doubt that there will be businesses built on Open Data. Excitingly for me, there seems to be no real difference between Open Data in maps, logfiles, and scholarly publications. (So I’m looking forward to XTech2007.)
  • Sense of community and history. A strong desire to preserve our digital history. Google finds the following image from WWW94 and CERN

P. Murray-Rust
Yes – I was running a Biology and the Web session, only to find that Amos Bairoch was in the audience! How much of this is still in the collective web semi-consciousness? Somehow I am assuming that everything I now do leaves preserved digital footprints – is that naive? And what, if anything, could I do?

Posted in open issues, semanticWeb, www2007, XML | 3 Comments

What's in a namespaceURI?

On more than one occasion we had heated debates about whether a namespaceURI may/must be resolvable. In the session on Linked Data, TimBL made it clear that he thought that all namespaceURIs must be resolvable. This conflicted with my memory of the Namespaces in XML spec, which I remembered as saying that the namespace was simply a name (indeed there can be problems when software attempts to resolve such URIs). So I turned to Namespaces in XML 1.0 (Second Edition), which is more recent (and which I hadn’t read), and I’m not sure I’m clearer. I can find:

“Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged. “

and

“It is not a goal that it be directly usable for retrieval of a schema (if any exists). Uniform Resource Names [RFC2141] is an example of a syntax that is designed with these goals in mind. However, it should be noted that ordinary URLs can be managed in such a way as to achieve these same goals.”

So this sounds like “may” rather than “must” be dereferenceable.
Now namespaceURIs also exist in RDF documents (whether or not in XML format), and Tim was very clear that all URIs must be dereferenceable. I don’t know whether this is formalised.
Looking for RDF I find Resource Description Framework (RDF) / W3C Semantic Web Activity which contains:

“The RDF Specifications build on URI and XML technologies”

and the first link contains:

“Uniform Resource Identifiers (URIs, aka URLs) are short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way. They reduce the tedium of “log in to this server, then issue this magic command …” down to a single click.

All documents date from 2006.
So I think there is an “XML namespaceURI” and an “RDF namespaceURI” which, if not identified separately, are confusing. Or maybe the time has come to make all namespaceURIs dereferenceable even if their owners assert they are only names. In which case what is the value of the resource? The simplest should be the “Hello World!” of the URI:

“Hello Tim!”

I shall try to make namespaceURIs resolvable although this is difficult when not connected to the Internet.
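As a small experiment, here is a sketch (Python standard library only; the example namespace is just the familiar RDF one) of testing whether a namespaceURI actually dereferences:

```python
# A small sketch of checking whether a namespaceURI dereferences;
# purely illustrative, and not a statement about what the specs require.
import urllib.error
import urllib.request

def dereferences(namespace_uri: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET of the namespace URI returns a 2xx response."""
    try:
        with urllib.request.urlopen(namespace_uri, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError):
        return False

print(dereferences("http://www.w3.org/1999/02/22-rdf-syntax-ns#"))
```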

Posted in semanticWeb, www2007, XML | Leave a comment

Web 2.0 and/or Semantic Web

Web 2.0 and Semantic Web are sometimes used synonymously, sometimes as distinct terms. I’ve come in halfway through a presentation (I missed the speaker’s name) and taken away:
Web 2.0

  • blogging
  • AJAX
  • small-scale mashups
  • proprietary APIs
  • niche vocabularies
  • screenscraping

whereas Semantic Web is

  • large-scale data linking
  • comprehensive ontologies
  • standard APIs
  • well-defined data export
  • data reconciliation

and suggested that we put them together as:

  • blogging
  • AJAX
  • large-scale data linking
  • standard APIs
  • niche vocabularies
  • well-defined data export
  • data reconciliation

“There’s just one Web after all”

Posted in "virtual communities", www2007, XML | Leave a comment

Parsing Microformats (revised post)

Useful presentation online (in S5) from Ryan King (of Technorati) on parsing microformats. (I’ve been out of touch with HTML4 and I’m learning things.) We’ll need a day or two of virtual Blue Obelisk discussion to make sure we are adhering to the specs (yes, there are some). You don’t have to LIKE them – but they seem to be the way that it works. For example, the value of a class may be a list of whitespace-separated tokens. Spans may be nested. All class names are lowercase.
I tried to give the examples in an earlier version of this post but the raw XHTML breaks WordPress. You’ll have to read Ryan’s talk – it’s very clear there.
The main thing is that we have to know what we are doing, not make it up from HTML vocabulary as we go along. So it’s definitely important that the Blue Obelisk has a Wiki page on how we should be using microformats. If Ryan has material relevant to BO I’ll blog it later.
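As a minimal illustration of the whitespace-token rule above (Python standard library; the vCard-style markup string is hypothetical):

```python
# Minimal illustration: a class attribute's value is a whitespace-separated
# list of tokens, so a parser must split it rather than match the raw string.
from html.parser import HTMLParser

class ClassTokenCollector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                print(tag, value.split())   # e.g. "fn n" -> ['fn', 'n']

ClassTokenCollector().feed('<span class="vcard"><span class="fn n">Ryan King</span></span>')
```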

Posted in programming for scientists, www2007, XML | Leave a comment

ChemZoo IS 2.0! AND I am Jester of the Month!

I have been enlightened. ChemZoo IS 2.0, and I’ll explain why….
One of the features of Web 2.0 (WP) is that communities arise by the very fact of being on the Internet:

  • Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform, and an attempt to understand the rules for success on that new platform.
  • A social phenomenon embracing an approach to generating and distributing Web content itself, characterized by open communication, decentralization of authority, freedom to share and re-use, and “the market as a conversation”.

There have been many stories this week at WWW2007 of “Web 2.0”, and one theme is that successes are not predictable. Yahoo’s CEO told us that he still did not understand the success of Google in areas where Yahoo and Microsoft had tried almost identical ventures. The popular Craigslist (a free community listing of classified adverts) received requests from users to charge for adverts to reduce ranking spam – thus generating an unforeseen business model. Many mashups are experiments – and some succeed. So anything is possible, and ChemSpider and this blog have formed an unplanned Web 2.0 community.
I have been appointed Jester to this community:

ChemSpider Announces Program to Recognize Tester of the Month – First Winner: Peter Murray Rust [1]

What is the purpose of this community? In a single word: “Grot”. For those who do not know British sitcoms, WP:

Grot was the fictional company founded by Reginald Perrin in the television series The Fall and Rise of Reginald Perrin.
A last gesture of defiance at the world, Perrin envisaged Grot as a company selling items that were deliberately useless – machines which did nothing, soluble umbrellas, games without instructions and the like. Unexpectedly, the stores became a great success, so in an attempt to sabotage them, Perrin hired many of the former staff of the now-bankrupt company for which he had worked, Sunshine Desserts.

So this is the monkeys’ model – calculating and publishing Grot. We now know that the molecules and properties in the Zoo are intended only for chemoinformatics – not for the real world. (In our own world we are computing materials properties for use by engineers and we have to work hard to get them correct). The chemoinformatics world works within walled gardens. Many papers read like:

  • We took N molecules (from a closed source, because someone owns them),
  • calculated non-observable properties (such as Polar Surface Area) – for which there are no clear definitions –
  • using a closed source program which you (the reader) cannot afford to use,
  • put the results into our own SVM-NaiveBayes-PLS-Multidimensional program which we are about to sell to a company so can’t let you have,
  • and got an R-squared of 0.9 (a measure of agreement) using a closed commercial stats package,
  • and published this as a Closed Access publication (without supporting info).

The real joke is that you can get citations and career advancement by doing this. There was even a 1.5 day session at ACS chemoinformatics about how you can obfuscate (sic) your molecules so that no-one knows what you are working with. The monkeys are obfuscating the properties as well.
So the Chemspider model is automatic generation of chemoinformatics papers based on Grot – let’s watch for the first.
Here’s today’s monkey joke:
The monkeys have made a Grot calculator, for converting between quantities with units. Try the following jokes – I am not going to give the punchlines…

  • convert 1 cm into kilometers
  • convert 1 gm into kilograms

… and note that by acting as Jester and pointing you, readers, to the monkey site I am increasing their traffic, popularity and business. Unfortunately it’s only for a month, and so I shall be relinquishing my role shortly.
[1] May 12th, 2007 at 4:47 am […] In the past few weeks the best feedback that we have received to allow us to either improve our system or review the science behind the predictions has come from Peter Murray-Rust at the University of Cambridge, who has provided feedback regarding ChemSpider issues around inorganic and organometallic complexes. His posts have been addressed in comments of two blog postings (1, 2) and we are presently working on comments from his latest post (3). We thank Peter for his feedback to improve ChemSpider. We look forward to his ongoing feedback to help us improve the ChemSpider System and Services. Thanks Peter […]

    Posted in "virtual communities", chemistry | 1 Comment

    Yahoo! pipes – yet another workflow?

    Nice presentation about YP – looks like they are going to start allowing custom web services. Drastically reduces coding – often to zero:

    • 10% of the web 2.0 pyramid (coders, remixers, bloggers)
    • assume prior knowledge (loops, data types)

    Heart of system is, of course, web data.

    • engine tuned for RSS but not necessarily.
    • editor – nearly everything can do it in browser. Instant “ON” – no downloads. Dataflow apps tuned to visual programming. learn and propagate by “view source” (this is valuable metaphor)
    • design – easy to use. highlights valid connections. l2r readability. (find pizza within 1 mile of foo), dragability. debuggable on refresh.

    Certainly looks slickr than Taverna. Uses <canvas> tag in many browsers. Runs on any modern browser (IE6/7 via SVG). Performance degrades with transparent layers. worst problem for Canvas is that it occludes DOM events (only click). [Obviously fairly hairy programming was required – transparency, drag etc.]. API rate limits (i.e. if your pipe is popular you might use up API rate)
    XML, JSON, KML GeoRSS. Disposable Applications? And perhaps XML-over-the-web has finally arrived?

    Posted in www2007 | 3 Comments