petermr's blog

A Scientist and the Web

 

Archive for July, 2009

Open Semantic Chemistry

Tuesday, July 28th, 2009

In a reply to my post on Chem4Word Egon makes a valuable contribution (Egon Willighagen says: July 27, 2009 at 5:37 pm)

I think the cheminformatics community is seeing the value of semantics in chemical editing, and understood that even closed-source product have shown serious evolution in this area. JChemPaint also followed the semantic path for a while, but does not have the advantage of tight integration in a production phase editing tool like Chem4Word has. With the current marketshare of Word, this editor will quickly see a quick uptake and bring semantic chemical editing to a new audience, that of organic chemists. This is positive, and anything drawn in this tool will be semantic and interoperate with other tools. That is positive too, even if many of us will not use the editor at all, like me.

I agree (although prediction of a quick uptake is an inexact science). He is also right that he will not use the tool directly. However, there are immediate spinoffs for the whole open chemistry community regardless of platform:

  • The system is modular. That means that it does not have to be used in Word (although obviously the benefits of creating a compound document will be absent). There is an essentially standalone tool allowing chemical manipulation of objects (relies on WPF/XAML and C#). There is also a library of routines (.NUMBO) which are independent of anything except the C# language. To what extent C# will be a help or a hindrance in the Open chemical world I don’t know.

  • The APIs have been designed to be largely platform and language independent. It’s difficult to write completely independent APIs (as for example CORBA IDLs) but the following signature is characteristic of the CID interface between the UI and the .NUMBO library:

    public static bool CanFlipAboutExternalAcyclicBond(
        ContextObject contextObject,
        IEnumerable<XElement> atomPointers)

The contextObject holds the complete state in CML, so a generic library (such as JUMBO) can implement the same calls relatively easily. That means, inter alia, that the system can be used for batch processing of data without the need for graphics.
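The shape of such a call can be sketched outside C#. What follows is a minimal Python illustration, not the Chem4Word or .NUMBO code: the function name is_acyclic_bond, the convention of passing two atom ids, and the cut-down, un-namespaced CML are all assumptions made for the example. It only tries to show the stateless style: the whole molecule arrives as CML text, the atoms of interest arrive as ids, and nothing is remembered between calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical example data: a three-membered ring (a1, a2, a3) with one
# substituent (a4) on a1. Hydrogens are omitted to keep the sketch short.
EXAMPLE_CML = """
<molecule>
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="C"/>
    <atom id="a4" elementType="C"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
    <bond atomRefs2="a3 a1" order="1"/>
    <bond atomRefs2="a1 a4" order="1"/>
  </bondArray>
</molecule>
"""


def is_acyclic_bond(context_cml, atom_ids):
    """Stateless query: True if the two atoms share a bond that is not in a ring."""
    root = ET.fromstring(context_cml)
    # Build an adjacency map from the bond list (atomRefs2 holds "id1 id2").
    neighbours = {}
    for bond in root.iter("bond"):
        a, b = bond.get("atomRefs2").split()
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    start, end = atom_ids
    if end not in neighbours.get(start, set()):
        return False  # the two atoms are not bonded at all
    # The bond is acyclic if, ignoring it, `end` cannot be reached from `start`.
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for nxt in neighbours[current]:
            if {current, nxt} == {start, end}:
                continue  # skip the bond under test
            if nxt == end:
                return False  # a second path exists, so the bond is in a ring
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True
```

Because the function needs nothing but its arguments, the same check could be run over a directory of CML files in a batch job with no UI loaded.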

Many of the components are declarative (in various flavours of XML) and hence language-independent. Thus the primary CML validation in import is done using a CML XML Schema and a Schematron validator. This means that the process could be trivially ported to any other language or platform simply through standard XML APIs.
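As an illustration of how little language-specific machinery this needs, here is a hand-written Python stand-in for one Schematron-style assertion. It is a sketch under assumptions, not part of the Chem4Word pipeline: the rule itself (every bond must point at two atoms that exist), the function name check_bond_references and the un-namespaced CML are mine, and a real port would run the actual CML Schema and Schematron rules through a proper validator.

```python
import xml.etree.ElementTree as ET


def check_bond_references(cml_text):
    """Return a list of rule violations; an empty list means the rule holds."""
    # ET.fromstring raises ParseError if the text is not well-formed XML.
    root = ET.fromstring(cml_text)
    atom_ids = {atom.get("id") for atom in root.iter("atom")}
    problems = []
    for bond in root.iter("bond"):
        refs = (bond.get("atomRefs2") or "").split()
        if len(refs) != 2:
            problems.append(f"bond {bond.get('id')}: atomRefs2 must name two atoms")
            continue
        for ref in refs:
            if ref not in atom_ids:
                problems.append(f"bond {bond.get('id')}: unknown atom '{ref}'")
    return problems
```

The rule is expressed against element and attribute names only, which is why the same check could be written in any language with a standard XML API.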

XML is platform independent (you do not have to worry about line-endings, blank space, etc.)

The CML-Lite schema has been thoroughly refactored and fairly well tested, so we have a good, proven foundation for semantic chemistry.

And, above all, it will be Open. That means that the community will be able to contribute and benefit.

How can people benefit and contribute if they do not use Microsoft technology? To the extent that the chemical architecture is language-independent we should be able to develop and refine the chemical algorithms and semantics independently of C#. At present we are hotly debating what is meant by “add a positive charge to an atom”, which I hinted at before. Think about the effect (i.e. what is the formula and electron count) of the following:

  • add a + to the N in (CH3)N

  • add a + to CH4

  • add a - to CH4

  • add a - to N=O

  • add a - to C6H6 (benzene)

  • add a - to Na

  • add a - to Na+

  • add a - to B in BH3

  • add a - to F in HF

Now consider what would happen if you had the option “add a radical” (often denoted by .).

I doubt very much whether the chemistry community agrees completely on the results, other than that it probably contains a - and/or + and/or . glyph somewhere. But if we do not know how many electrons there are, or what the spin multiplicity is, we cannot submit this to a QM calculation.

For this reason I think the Open Chemistry community (and especially the Blue Obelisk community) can help systematize these declarative processes. My current position is that there are no universal valence rules and that there needs to be a separate set of rules for each element, each with its own special cases. I suspect that much of this is implicit, and perhaps explicit, in Openbabel, CDK, JUMBO, Avogadro and other Open software. If we can extract these into a set of rules that are declarative (i.e. not expressed in a specific procedural language) then we can start to get semantic consistency in our tools.
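The part that is not in dispute is the bookkeeping a QM program needs once the formula and charge have been settled. A minimal Python sketch of that arithmetic follows; the function names and the small element table are assumptions for the example, and nothing here decides which formula a “+” or “-” gesture should produce.

```python
# Atomic numbers for the elements used in the examples above (data, not code).
ATOMIC_NUMBER = {"H": 1, "B": 5, "C": 6, "N": 7, "O": 8, "F": 9, "Na": 11}


def electron_count(formula, charge):
    """Total electrons = sum of atomic numbers minus the net charge."""
    return sum(ATOMIC_NUMBER[el] * n for el, n in formula.items()) - charge


def multiplicity_is_possible(electrons, multiplicity):
    """Multiplicity 2S+1 implies (multiplicity - 1) unpaired electrons.

    The remaining electrons must pair up, so the two counts share parity.
    """
    unpaired = multiplicity - 1
    return 0 <= unpaired <= electrons and (electrons - unpaired) % 2 == 0


# "Add a + to CH4" read two ways gives two different inputs to a calculation:
ch4_minus_electron = electron_count({"C": 1, "H": 4}, charge=+1)   # CH4 with one electron removed
ch4_plus_proton = electron_count({"C": 1, "H": 5}, charge=+1)      # CH5+, a proton added
```

The first reading has an odd electron count and so cannot be a singlet, while the second can. That is exactly the information that is lost if a tool records only the glyph.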

Here are some more. What’s the result of deleting one =O atom from:

  • CH3C(=O)CH3

  • CH3S(=O)CH3

  • CH3N(-O.)CH3

  • CH3-N(=O)

  • CH3-N(=O)=O

And are there any general rules?

Junk Science? The blogosphere thinks so

Tuesday, July 28th, 2009

I was alerted last week by a blogospheric PhD student (worked with us for some time before going to Oxford) to the following story from Totally Synthetic (TotSynth).

NaH as an Oxidant Liveblogging!

Even if you are not a scientist, please read on: it’s entertaining and informative. It deserves to be put in front of every young scientist as it shows the process of science as it should be done.

When I was at high school I read a popular and good chemistry paperback (Penguin) which highlighted the scientific method through a passage from Dorothy Sayers’ Strong Poison where she describes in graphic and entertaining detail how a Marsh test for arsenic was carried out. The thread in the blogosphere captures completely the rigour, the attention to detail, the likelihood of false trails, the unexpected, the need for reference to authority and the need to question authority.

If I were teaching young chemists I would set them this as a real exercise. As a group, and in the lab. Give them a month. By the end of that month they would know far more about reactions, thermodynamics, spectra than they would get from formal lectures.

Moreover it highlights a real message of the evolving scientific web which is that what is said matters more than where it is published. For non-chemists I will interpret:

A group of scientists submitted a manuscript to The Journal Of The American Chemical Society. This is a well-known and high quality journal which is often used (naively) as a numeric metric of the value of a chemist (how many JACS articles have they published?). The ACS stresses the value of peer-review (as do I) and that its quality is low in Open Access journals (which I dispute). The published article (Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride) is advertised by the following graphical abstract (which I reproduce without permission as fair-use)

[Image: graphical abstract of the article]

The potential utilities of the simplest hydride reductant sodium hydride (NaH) as an oxidation promoter have long been overlooked.

This claim is sensational in that it goes completely against received chemical knowledge. Any first year student, if given the top (blue) reaction, would be expected to draw the arrow in the OTHER direction (right to left). They would certainly fail (part of) an exam if they wrote what the authors have claimed. So it’s not an obscure finding. If true it would mean that (free) energy would have to come from an unknown source. Not impossible, but extremely unlikely. On the order of cold fusion or Benveniste’s homeopathic water.

The claim apparently went through the reviewers and editors with little comment. But the blogosphere picked it up and Totally Synthetic decided to question the finding. You must read the blog. There’s a blending of careful attention and excitement: what IS the answer?

So I’m not going to give away the punchline. But I will say that the peer reviewing is closed so I cannot absolutely comment on whether the paper should have been accepted. Currently I regard the paper as an outstanding example of junk science published in a journal which prides itself on selling high-quality science. But I haven’t read the paper (as it’s closed access and will cost me 30 GBP for 2 days only). So my mind always remains slightly open.

This should convince any sceptic that the blogosphere is an essential part of current science.

See also comments in RSC’s Chemistry World. It includes comment from Paul Docherty (Totally Synthetic):

‘I was alerted to the paper by readers of my blog, who noticed its controversial abstract almost as soon as it appeared online,’ says Docherty. ‘A quick inspection of Wang’s results astounded me, as he seemed to suggest that black was apparently now white; most curiously, his postulated mechanism only accounted for half of his results. Most provocative papers in organic chemistry take some time and resources to verify, but Wang’s chemistry seemed very amenable to a quick test reaction. It only took a few minutes to set up his chemistry in my fume-hood, and a similarly short amount of time to analyse the results. As I was writing about this on my blog, my readers did likewise, each using slightly different materials and conditions, allowing a very quick “scoping” of the chemistry.’

Some oxidation of alcohols was observed in most cases, but a consensus was rapidly reached that an oxidising contaminant was making its way into the reaction, be it oxygen from the air adsorbed to the NaH, or traces of sodium peroxide or hydroxide or some other trace contaminant. When stringent steps were taken to ensure absolutely that no air could enter the reaction system, no oxidation was seen.

Update including Chem4Word

Monday, July 27th, 2009

I have been silent for over two weeks, not because there was nothing to say but because we have been working very hard to get the first version of Chem4Word frozen. For Joe and me that means that when we get up in the early morning we think of nothing else and when we try to go to sleep it is whizzing round in our heads. This type of 100+ hour coding week can turn people into subhumans.

But we’ve frozen the API and are technically in bug-fix mode. There are, of course, bugs to fix and we are tackling them. But we have our sights on releasing RSN (real soon now).

I should make it clear that Chem4Word will be Open Source. Everyone in the project is geared towards that. Microsoft is now starting to release considerable amounts of Open Source, and we are pushing hard to get the final legal clearance. I’m happy to discuss on this blog what Microsoft + Open Source means in a later blog post. I know there are readers who believe that Microsoft’s motto is “do only evil”, and I used to be close to that view. But Microsoft has changed, and so have I.

Our current strategy (and this may change) is to release as Open Source and to create a governance model that will allow managed Open development. There are lots of projects in software engineering such as Eclipse, Apache, etc. which have successful models. There are no such models in chemistry so we are in new territory. I’d welcome suggestions and offers.

I’ll be writing more about C4W but at present just a statement of some of the major bits:

  • C4W consists of several modules, some of which are formally independent of Word.

  • The chemistry engine (based on CML and JUMBO, hence .NUMBO – dotNUMBO) is written in C#

  • The graphics and UI are based on WPF/XAML in C#

  • There is a stateless interface (CID) between the UI and .NUMBO which defines an abstraction of chemical commands

  • There is an import pipeline which enforces syntactically and semantically valid chemistry, thus avoiding the problem of not knowing what the chemical input actually represents.

  • There is considerable functionality (e.g. gallery, navigator) to interact with the Word document.

Chem4Word is a semantic editor; I suspect it’s the first for chemistry. Writing semantically correct code and documents is a hard discipline. Most current chemical tools require a sighted human to make judgements as to what something means, but this does not work in the era of the Semantic Web where machines must make accurate deductions. For example many tools allow the user to add a + charge to an atom, but what does this actually mean? Does it change the implicit hydrogen count? Or the spinMultiplicity? The answer is that it depends on the chemistry and there is no universal algorithm to do this. So C4W is built with a framework that allows semantics to be imposed by the chemistry.

In summary, we have got a toolset with significant novel functionality and even, in places, some limited chemical intelligence. When it’s released I will write blog posts explaining some of this.
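To see why the “+” button is ambiguous, it helps to write two of its possible meanings down as explicit, separate operations. The Python sketch below is purely illustrative and is not how C4W does it: the Atom record, its field names and both functions are hypothetical, and the spin handling is a deliberate simplification that is only safe for the closed-shell example shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Atom:
    # Hypothetical record; field names loosely echo CML attribute names.
    element: str
    formal_charge: int = 0
    hydrogen_count: int = 0
    spin_multiplicity: int = 1


def add_plus_by_protonation(atom):
    """Read '+' as 'attach H+': one more hydrogen, electron count unchanged."""
    return replace(
        atom,
        formal_charge=atom.formal_charge + 1,
        hydrogen_count=atom.hydrogen_count + 1,
    )


def add_plus_by_ionisation(atom):
    """Read '+' as 'remove one electron': hydrogens unchanged, spin changes."""
    # Simplifying assumption: a singlet becomes a doublet; otherwise one
    # unpaired electron is removed. Real chemistry has more cases than this.
    if atom.spin_multiplicity == 1:
        new_spin = 2
    else:
        new_spin = atom.spin_multiplicity - 1
    return replace(
        atom,
        formal_charge=atom.formal_charge + 1,
        spin_multiplicity=new_spin,
    )


# The N of ammonia, before any charge is added.
ammonia_n = Atom("N", hydrogen_count=3)
```

Applied to the same nitrogen, the first reading gives an NH4+ style atom with four hydrogens and the second gives a radical cation with three, so a tool has to know which operation the chemistry calls for before it can record the result.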

Many thanks to the team: Joe, Tola, Tim, Alex, Lee, Jim+Jim.

Open Data is coming

Sunday, July 12th, 2009

We (mainly Cameron Neylon and me) ran a session this morning on Open Data. These are un-sessions which need preparation but not a strict agenda. Certainly not a lecture. So we kicked off very briefly with the scene and moved to the Panton Principles on what scientists want to do in publishing data for the benefit of the community.

In very simple terms:

  • Scientists want their data to be available to anyone and re-usable for any purpose without explicit permission.

  • The only requirement is that the source of the data be acknowledged.

  • Any further constraints are set by community norms in the particular domain. Those might involve human data, need for validation and data integrity, etc. Adherence might be a condition of funding. But they are set by the community, not by the author through a licence.

We’d anticipated that there would be some suggestions that commercial use could be forbidden. In fact there was none and we take great heart from this. We are all convinced that non-commercial restrictions (e.g. CC-NC) cause enormous problems. They propagate through the data chain. They are unclear (what is commercial? Teaching? Books? It’s impossible to say).

People sometimes say “don’t you risk getting ripped off by someone who takes your Open Source code or Open Data and sells it?” The answer is emphatically NO. The whole of the Blue Obelisk will agree with this stance. To reiterate:

Someone can take my Open Source and incorporate it into a commercial program. I am quite prepared for this to happen. The condition is simply that they must acknowledge the source. They must not pass off the work as their own (I have had this happen and it made me very angry). But commercialisation is in principle a good development. It leads to a successful economy; we need the revenue streams. It may convince those who evaluate my work that it has additional merit (it may not, of course). Similarly, if the data is valuable then products may be built on top of that. Again the developer must honour the source of the data. And in all cases there can be no backwards restrictions on the freedom of anyone to use the Open Source and Open Data in whatever directions they wish.

We got hung up a bit on “what is data?”. I think this will work itself out, so long as commercially interested parties are not allowed to draw the line. It’s critical that academia and funders and learned societies (limited to those without financial interests) evolve practices that create workable boundaries.

Of course it will become much easier when everything is Open Access. That’s my personal motivation: I spent too long today discussing with people about what is data, because they have to defend their business.

And a splendid surprise. Creative Commons were here and John Wilbanks joined us for lunch. John’s talking in London on 22nd and coming to the Panton the next day. Watch this blog…

Scifoo and LambdaMOO

Sunday, July 12th, 2009

Scifoo is magic. The first excitement is how many people you DON’T know. That means that you are going to be stretched in unimaginable directions. Then there are the people you do know virtually but have never met. Then the sessions, which are often very direct and pragmatic while others are way out. We observe Chatham House here (no names, no opinions) but there’s enough we can talk about. Like the project to embed a poem in an archeobacter genome (yes, it makes sense). Are there life forms on earth based on other principles than DNA/RNA/proteins? That led to a fascinating explanation of 2-D electron gases at 30 millikelvin.

So one person I was delighted to catch up with was Pavel Curtis. Pavel is a visionary who influenced much of what I did during the 1990s. Pavel worked at Xerox PARC and created the legendary LambdaMOO, a text-based virtual environment (MUD) based on OO programming, hence MOO. LambdaMOO was (is?) a vibrant community with often several hundred players and with the freedom to develop its own democracy or anarchy. For modern users of high-performance graphics games it may seem ridiculous that ASCII text can hold much power but it can, in the same way as a bare stage and the spoken word can recreate fantasy lands.

I’ll deal with some of the sessions on a per blog basis, but among today’s were:

superblog. The leader was becoming successful in running his blog as a magazine, contracting writers and bringing in increasing and significant advertising revenues. If it goes right you can earn sizeable amounts. But we discussed the many other reasons why we blog and it was interesting that two of the members were doing it as part of creative arts. We also noted how the traditional blog following of commenters was now dispersed over twitter and friendfeed.

Making ice cream with liquid nitrogen. Tasted great. Pictures later.

The Enernet: an energy network based on Internet principles. Based on a Moore’s law like approach to energy (it will become exponentially cheaper; I came in late so I missed how this would happen).

Wolfram Alpha. Making progress (it seems to have corrected the bugs I reported some time ago).

The two sessions I’ll deal with in detail are Open Data and Wave.

Scifoo: Wave, Open Data and much more

Saturday, July 11th, 2009

We’re in the heat of the Scifoo unconference at Google, hosted by Nature, O’Reilly and Google. It’s a fantastic experience and I’ve been fortunate to be re-invited. It’s about what you can help to create in this atmosphere: how can we change the world? We’re discouraged from real-time blogging and direct quotes without permission so this gives an atmosphere of trust and excited collaboration.

We all introduce ourselves in 3 words (and woe betide if you overrun). Mine were OpenData, Chemistry and Hacking. We’ve spent time before we came contributing to a Wiki and bouncing ideas off each other. How can we create human 2.0? How does knitting relate to migraine? Anything.

Then we post possible sessions. These are topics people might be interested in joining. So Cameron and I have put up two: one on OpenData and one on Wave. Google’s rooms are either very large or quite small. So you have to guess whether your topic may attract people. Great fun.

I think I can reveal we had a presentation from Steph on Wave yesterday and there are lots of ideas on what we can do. (Wave is from Google in Sydney.) So we’ll have a developer there in the afternoon and see what happens. Wave is Java, XML, Python with robots on the server and gadgets on the client. In true Aussie style the robots all end in -y: spelly is the spellchecker, rosy etta does translations in real time. So it looks a dead cert to translate OSCAR to OZ and become OZZY. (Except that you lot have now got 4 centuries for 5 wickets; I can’t bear to watch any more.) Anyway, OSCAR can act as a robot which translates written chemistry into semantic chemistry. This is a great way to get programs out.

We also want to see what we can do in the client. Can Jmol run there? We’ll find out.

On Open Data this morning we’ll see who comes before we decide on the program. We can show the Panton principles, The OKF IsItOpen and also collect ideas of where open data works in Science and where it doesn’t. When it doesn’t I expect the main problems to be:

  • restrictions by publishers, including universities

  • lack of a naming scheme

  • no examples of why this is so exciting.

But it will be foolish to guess what will happen. After all we are at Scifoo and it’s all about the future, even when we look backwards.

Is It Open [Data] Service launched. I try it out!

Saturday, July 11th, 2009

The Open Knowledge Foundation has just launched the alpha version of Is It Open? a service designed to help clarify whether scientific data (normally on publishers’ websites but could be anywhere) is truly Open according to the ideas of the OKF and Science Commons. This service allows anyone to ask a formal question about the status of Data Openness and make the process public. This will avoid much wasted effort in repeated questioning and hopefully also promote the value of Open Data.

I have posted a question to Bryan Vickery at Chemistry Central, a BMC journal. I’m confident his answer will be yes, but it will serve as a test of the service and give us useful feedback. Assuming the process works we will be looking for volunteers who will mail other publishers. In this way we can crowdsource the process of getting the formal position of journals and publishers (the two do not always map).

Here’s the request, and I’ll publish a response when I get it.

Enquiry: dc8b764b-7cb3-47af-ab91-e14fcbeed61c

Summary: Please can you confirm that Data in Chemistry Central are fully Open

Status: Unresolved

Started: 2009-07-11T01:12:47.299048

ID: dc8b764b-7cb3-47af-ab91-e14fcbeed61c

To: bryan.vickery@chemistrycentral.com

Subject: Please can you confirm that Data in Chemistry Central are fully Open

Date: 2009-07-11 01:12:46.490647

Status: Not Yet Sent

Dear Brian,

I am writing to ask you about the Openness of data in Chemistry Central. I know that the journal is Open Access and I would like you to confirm that the data published in it are fully Open. I am using an exciting new tool (Is It Open? – http://isitopen.ckan.net/ – developed by the Open Knowledge Foundation) to ask you this question. You are the first to be asked :-)

The Open Knowledge Foundation and Science Commons have developed instruments for ensuring that Data can be marked as Open. By this we mean that Data (as distinct from text) can be re-published, re-used, data- and text-mined and used in similar processes without explicit permission and for whatever purpose [subject only to the need to acknowledge authorship and provenance]. While we are confident that your Open Access statement implies both the motivation and practice of this, we’d be grateful for confirmation.

A major reason for asking this from you is that many publishers do not make it clear whether their data is Open. If you can give us this assurance you will act as an example to which we can point others. We are, in fact, hoping to generate a larger number of similar enquiries to other publishers.

The Open Knowledge Foundation has created web buttons (see http://www.opendefinition.org/buttons) which can be used to indicate that Data is Open. A growing number of sites use these as they are a simple and effective way of indicating immediately to humans and robots that the data are Open. I would be delighted if you could think about including such as button on you site, particularly where data-rich documents might occur.

Our enquiry service is at an early stage and we’d welcome feedback – how could we improve it. Were we clear enough?

Many thanks in anticipation of your collaboration.

Peter Murray-Rust

[1] http://www.opendefinition.org/1.0/
[2] http://www.opendefinition.org/licenses/

– Sent by “Is It Open?” (http://isitopen.ckan.net/about) A service which helps scholars (and others) to request information about the status and licensing of information.

Off to Scifoo and Microsoft

Thursday, July 9th, 2009

Update…

I have been very busy hacking Chem4Word (Joe Townsend is the Doctor Who) and he has assigned me lots of tasks. It’s all starting to look quite good. We’ll be discussing this with Alex Wade and Lee Dirks in Microsoft next week (I blagged myself an invite to the Faculty Summit; many thanks, Tony).

En route I am visiting Scifoo (many thanks, Timo), a great mind-blowing mixture of interesting people and ideas run by Google, Nature and O’Reilly. I’ve been before and it was fantastic. This year I think that the themes that Cameron Neylon and I have been developing (Open Data for Science and pervasive data capture and notebooks) will be very important. Oh, and the Campers will get free access to GoogleWave. I would love to put CML into that. Let’s see how we get on…

Busy. I’ve given two interviews recently: one on software for searching patents and one on the problems of scholarly publishing. I’ll try to piece this together on the plane.

The Open Knowledge Foundation is FIVE

Friday, July 3rd, 2009

Open Knowledge Foundation Blog

Open Knowledge Foundation Newsletter No. 11

July 2nd, 2009

Open Knowledge Foundation Newsletter No. 11 has just been sent out:

Welcome to the eleventh Open Knowledge Foundation newsletter!

Contents:

The OKF turns five and we need your support!

Open Database License (ODbL) goes 1.0

European Open Data Inventory + Summit

Launch of the Open Data Grid

New developments on Public Domain Works

Other news in brief

Thanks to our volunteers!

Support the Open Knowledge Foundation

Further information

THE OKF TURNS FIVE – AND WE NEED YOUR SUPPORT!

This month the Open Knowledge Foundation is five years old. Over those last five years we’ve done much to promote open access to information, from sonnets to stats, genes to geodata, not only in the form of specific projects like Open Shakespeare and Public Domain Works but also in the creation of tools such as KnowledgeForge and the Comprehensive Knowledge Archive Network, standards such as the Open Knowledge Definition, and events such as OKCon, designed to benefit the wider open knowledge community. (To find out more about what we’ve been up to in the last year, you can read our latest annual report [1]).

Universities should act while they have the chance

Wednesday, July 1st, 2009

From David Wiley’s blog (“Iterating towards openness”); David is founder of OpenContent.org. After a general discussion about free-being-inevitable (reviewing reviews of Chris Anderson’s upcoming book, Free: The Future of a Radical Price) he moves to higher education:

Competition! Massive amounts of almost-no-barrier-to-entry competition. Much of it will be poor. I suppose you can take some comfort in that. But some of it will be very, very good. And that should scare existing institutions silly. The education game is about to change, and you (your institution) have three choices:

1. Innovate your way forward. If you allow your business model to become flexible and responsive, you can feel your way forward, influencing the emergent educational context as it simultaneously influences your business model. (A dynamic system!)

2. Wait for others to innovate their way forward. Let them shape the future educational context without your input, and hope that 10 years from now higher education is still a place where your institution is relevant. (If it isn’t, you’ll have only yourself to blame.)

3. Ignore / deny that anything is changing (or will ever change). Higher education is too important, too deeply woven into the fabric of society, too critical for employers, and too big a business to fail. (See you on the other side with GM and AIG.)

[...] but higher education will have to deal with [Chris's] thesis as surely as I’m typing this post. As Lehi taught, there are two types of things in this world: things to act and things to be acted upon. The day is close at hand when each university will have to decide which they are.

I had been planning to blog about universities and their attitude to the digital world, so this gives me the incentive. The points are general…

In 1992 I got very excited about the power of digital learning and embraced many of the startup ideas. These included the Globewide Network Academy which is a voluntary organisation (much the same dynamics as Wikipedia, but nearly 10 years ahead). We used MOOs to create VLEs and Marcus Speh ran the first Virtual course on the Web (Object Oriented design using C++) – the material fell foul of copyright Mordor even then. It won a best-of-the-web in 1994 at WWW1.

These were heady days. I thought the world was changing before my eyes. And I was invited to a Chair in the University of Nottingham to run a virtual course in Computer-Based Drug Design for the pharma industry. It was a technical success (highly rated by the Teaching Quality Assessment) but it didn’t have a sustainable business model and after a few years it closed down and I moved to Cambridge. But I have been looking for that spark elsewhere in Higher Education and I haven’t seen it.

By contrast, go back to 1970 when Harold Wilson initiated one of the great British achievements of the twentieth century, The Open University. That was stunning. The vision led the technology by a long way: much of the material was posted paper, you could get online access to a computer over a teletype (110 baud) for 2 weeks a year, and in some cases people had to climb a mountain to pick up the BBC signals. But again it changed my vision for ever. Anyone could, and did, go to the OU. Even if you couldn’t, the programs were often stunning. The maths used graphics which for 1970 were miles beyond chalk-and-talk.

And now? Where are the universities changing the face of the world? Where communication is infinitely cheap. Where students are wired up with more power than the whole of the world 30 years ago. Where the Internet is changing democracy, where are the changes in academia? Why, at least, are there few substantial discussions about what education means in a distributed world? It’s too easy to see the reverse, where education is simply a branded deliverable contract between a customer (student) and a supplier (university).

Well, the internet changes that business very quickly. So unless there are some radically new ideas, Universities may find that others are eating their lunch.

In a later post I want to address the complex and depressing cycle between research and publication and the role of universities.