Please save the European Net

From La Quadrature Du Net it looks like this is the last chance to save the neutrality of the Net in Europe (if you understand this better than I do, please add enlightenment):

Threats to citizens’ basic rights and freedoms and to the neutrality of the Internet could be voted into EU legislation regarding electronic communication networks without any safeguard (Paris, May 4 2009 – Telecoms Package). EU citizens have two days to call Members of the European Parliament (MEPs) and ask them to vote for the Citizens’ Rights Amendments in the second reading of the Telecoms Package. These amendments include all the safeguards that were removed in the compromise amendments, as well as provisions protecting against net discrimination practices and filtering of content.

And

Telecoms Package: When rapporteurs betray EU citizens.

On both parts of the Telecoms Package, with rapporteurs Malcolm Harbour (IMCO report) and Catherine Trautmann (ITRE report), agreements have been reached with the Council of the EU that destroy or neutralize major protections for citizens against graduated response, “net discrimination” and filtering of content on the Internet. There is little time left, but the Parliament has a last chance, with the plenary vote on May 6th, to reaffirm its commitment to protecting EU citizens.

I have written to my MEP again. I didn’t hear back the first time.

It is no surprise that the average European is disillusioned with the political process. The report was produced by two rapporteurs, one of whom (Malcolm Harbour) has this Wikipedia entry:

Views on software patents

Malcolm is well known for his controversial pro-software patent views. He was one of the most outspoken supporters of the EU software patents directive until its ultimate rejection by the European Parliament in July 2005. He has been characterised as a software patent extremist, since he favours permitting Program Claims, a view not shared by most other supporters of software patents.

Within the European Parliament, he is associated with the Campaign for Creativity, a pro-software patent lobbyist group, in part because of unsolicited email sent from his address on behalf of that group.

Oh dear.

I wonder about his track record… let’s try FFII (http://eupat.ffii.org/players/mharbour/index.en.html)

Malcolm Harbour, Member of the European Parliament, UK Conservative, has been an active and forceful promoter of software and business method patents in Europe, all while pretending that he was “only closing loopholes in the current law so as to avoid US-style broad patentability” and that claims to the contrary came from “misguided lobbyists in the European Parliament”. Harbour vigorously promoted program claims and opposed all amendment proposals, including those approved in CULT and ITRE, which put any limits on what can be patented. Patent lobbyists have great confidence in “Malcolm”. Some write letters to MEPs telling them to look out for Harbour’s amendment proposals and to support them as soon as they come out. Harbour, until recently an automobile industry manager at Rover, speaks in a very self-confident manner which gives many of his listeners, including MEPs from other countries and other parties, the impression that he is in power and they can rely on him and follow him.

Looking at Harbour’s contributions to the discussions, it seems that Harbour has difficulty understanding how any process running on a computer (or, his favorite example, a mobile phone) could be unpatentable subject matter. He combines this lack of understanding with a deep trust for “IP experts” speaking in the name of large companies and industry associations. A trust which has been reciprocated and nurtured for a while already.

So this is Britain’s contribution to judging the values of Net freedom. The other rapporteur is http://en.wikipedia.org/wiki/Catherine_Trautmann

who is a Socialist…

So what hope do I have in writing to my MEP? I have also rung his answer phone…

Maybe the story continues…

Posted in Uncategorized | Leave a comment

Should Open Source code create Open Data?

An important discussion on Ben Brumfield’s blog, Open Source vs. Open Access:

I’ve reached a point in my development project at which I’d like to go ahead and release FromThePage [a genealogy program] as Open Source. There are now only two things holding me back. I’d really like to find a project willing to work together with me to fix any deployment problems, rather than posting my source code on GitHub and leaving users to fend for themselves. The other problem is a more serious issue that highlights what I think is a conflict between Open Access and Open Source Software.

[good discussion of BBB omitted] until…

The freedom to run the program, for any purpose.

6. No Discrimination Against Fields of Endeavor

Traditionally this has not been a problem for non-commercial software developers like me. Once you decide not to charge for the editor, game, or compiler you’ve written, who cares how it’s used?

However, if your motivation in writing software is to encourage people to share their data, as mine certainly is, then restrictions on use start to sound pretty attractive. I’d love for someone to run FromThePage as a commercial service, hosting the software and guiding users through posting their manuscripts online. It’s a valuable service, and is worth paying for. However, I want the resulting transcriptions to be freely accessible on the web, so that we all get to read the documents that have been sitting in the basements and file folders of family archivists around the world.

My quandary is this: none of the existing Free or Open Source licenses allow me to require that FromThePage be used in conformance with Open Access. Obviously, that’s because adding such a restriction — requiring users of FromThePage not to charge for people reading the documents hosted on or produced through the software — violates the basic principles of Free Software and Open Source. So where do I find such a license?

Have other Open Access developers run into such a problem? Should I hire a lawyer to write me a sui generis license for FromThePage? Or should I just get over the fear that someone, somewhere will be making money off my software by charging people to read the documents I want them to share?

PMR: The comments are also useful and generally urge Ben to be brave and Open Source his program. I’ve suffered the same concerns myself and looked for ways to protect against uses I didn’t like (including applying a curse to the code, which is a great deterrent when it works). However, I add some further suggestions here, and as this is the first time I have aired them I look forward to comments:

  • Community norms. Ben should specify in clear text what his wishes for the code and its output are. This has no legal force, but the effect within an Open community can be significant. If others in the genealogy field share his views (or if he can find suggestions or practices already in use), that helps the community converge on a generally accepted set of norms.

  • Open Data tags. Modify the software so that it outputs the OKF’s open data buttons (http://opendefinition.org/buttons/) by default in the document. This is easy to do: it’s simply a hyperlink. Make this the default when the program is run and add a runtime switch such as -nookfbutton. This lets a user remove the button, but doing so is a conscious act, rather like clicking through a shareware notice. The button will advertise the value of Open Data.

  • Note that this requires the data to be Open and re-usable without hindrance, which may not be what you want if you wish to restrict commercial use. However, I would urge against that: NC (non-commercial) clauses are problematic, difficult to define and difficult to enforce.
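The default-on button idea can be sketched in a few lines. In this hypothetical Python sketch only the flag name -nookfbutton and the buttons URL come from the post; the function names and badge markup are invented for illustration:

```python
import argparse

# Badge markup is illustrative; the real buttons live at opendefinition.org/buttons/
OKF_BUTTON_HTML = (
    '<a href="http://opendefinition.org/buttons/">'
    '<img alt="This is Open Data" src="open-data.png"/></a>'
)

def render_document(body_html, okf_button=True):
    """Append the OKF open-data button to the generated page unless
    the user has consciously switched it off."""
    if okf_button:
        return body_html + "\n" + OKF_BUTTON_HTML
    return body_html

def parse_args(argv):
    parser = argparse.ArgumentParser(description="exporter sketch")
    # The button is on by default; removing it is a deliberate act.
    parser.add_argument("-nookfbutton", action="store_true",
                        help="omit the OKF open-data button from the output")
    return parser.parse_args(argv)

args = parse_args([])  # no flag given: the button stays on
page = render_document("<p>transcription...</p>", okf_button=not args.nookfbutton)
```

The point of the design is simply that openness is the path of least resistance: the user must actively opt out.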

Posted in Uncategorized | 3 Comments

Henry Rzepa's Inaugural Lecture

Henry gave an absolutely stunning presentation which combined history, topology, quantum mechanics, chirality and much more:
http://www.ch.ic.ac.uk/rzepa/talks/inaugural/
He doesn’t use PowerPoint either. Look at it, even if you don’t know any chemistry, and you’ll see why.
I don’t have time to write more.

Posted in Uncategorized | Leave a comment

Meeting on the 1515 from Cambridge

Today Henry Rzepa invited me to his inaugural lecture (maybe more of that later). So I went off to the station and was walking down platform 2 when I thought I saw a familiar face: Cameron Neylon. Now Cameron is coming to visit us and talk the week after next — had I missed the date?

No, it was a group of folks involved in (Open) publication who’d been to PLoS to talk about article-level metrics (i.e. can we measure the value of a paper rather than simply the journal?). But very shortly we were joined, again by chance, by Kaitlin Thaney — weird, since I’d just been in Boston (and she was in Cambridge). Kaitlin’s from Science Commons, which readers of this blog will know works to make science Open, primarily through CC0, community norms, and other instruments and advocacy.

This was really fortunate, since I have been invited to a COST meeting in Portugal in two weeks’ time to talk on Open Data relating to HLA (Human Leukocyte Antigen).

HLA is one of those WOW! proteins: as soon as you see it, you get the message — it wraps a peptide and presents it to T cells. It’s got a groove which binds the peptide in a specific manner.

Anyway this group wants to explore data sharing in their community and they’ve asked me to help them with Open Data. Although I pontificate on this, I’m not really an expert (although very few people are). So I was able to catch up with Kaitlin and Cameron and pick their brains so that I could give some reasonable advice in Portugal.

The simple advice is: be simple. That means dedicating your data to the public domain. Anything else is more complicated, and this complexity brings effort and potential division. That’s why the OKF’s OPEN DATA tag is so useful.

Just add it!

Posted in Uncategorized | Leave a comment

BioIT 2009 – Where do we get Semantic Open Data?

Since Open Data are very rare in chemistry, how do we get them? We’ve been working on making this easier, but it’s still very hard work. So here’s a brief overview:

  • Born-semantic Open. This is the ideal situation, where tools and the culture are used to create data which is intrinsically semantic and aggressively labelled as Open. There is no production-scale example of this yet. We hope that Chem4Word will be used as a primary tool for creating semantic documents, and that we can add an Open Data stamp to the result. In that way every use of the tool will create Open Semantic data.

  • Conversion of structured and semi-structured Open legacy data to Semantic Open data. An example is CrystalEye, where we aggregate Open legacy data (CIFs) and convert it to CML. This is then published with the Open Data tag on every page. Spectra, if in JCAMP, are also tractable. It is also possible, though harder, to convert most computational chemistry outputs into CML: Gaussian archive files and GAMESS outputs are relatively simple; Gaussian logfiles are a nightmare.

  • Born-digital computation. By inserting FoX or other CML-generating tools into the source code of computational chemistry programs we get lossless conversion of their output into CML, with complete ontological integrity. We’ve done this for at least 10 codes.

  • Recovery from text and PDF (text mining). Conversion to PDF destroys all semantics, most structure and all ontology. So it comes down to messy heuristics to recover anything, and we never get 100% recall or precision. We don’t touch bitmaps. Our current tools in Cambridge can:

  1. extract chemical structures from images. This depends on how the image is actually represented, but with vectors rather than bitmaps we have achieved 95% precision on several documents

  2. extract spectra from images. This is also tractable; we haven’t done enough to get metrics and we haven’t covered all the types of instrument, but again ca. 95% is manageable

  3. text-mining. OSCAR2 is able to recover peak lists from chemical analytical data with over 90% success, the failures being mainly due to typos and punctuation. OSCAR3 can extract reaction information (Lezan Hawizy) with probably >80% precision. We can also convert chemical names to structures (OPSIN), where Daniel Lowe has made impressive progress and, for certain corpora, can achieve ca. 70%

Some important caveats: anything other than born-semantic is lossy. The recall/precision can range from 95% to 5%. That sounds silly, but the results are critically dependent on how the documents were created and published. The more human steps (copying, editing) there are, the worse the recall and precision. But with high-quality PDFs an impressive amount can be extracted.
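For concreteness, this is how recall and precision are scored against a hand-checked gold set — a generic sketch, not our actual evaluation code:

```python
def precision_recall(extracted, gold):
    """Precision: fraction of extracted items that are correct.
    Recall: fraction of true items that were actually extracted."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# 20 items extracted, 19 of them correct; 20 true items, one missed entirely
p, r = precision_recall(list(range(19)) + [99], range(20))
```

With those (made-up) numbers both precision and recall come out at 95%, the same ballpark as the vector-graphics figure quoted above.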

But the major challenge is restrictive approaches to extraction. If publishers threaten extractors with legal action for getting science from papers we have destroyed the dream of Linked Open Data in Science.

But for those publishers who support the creation of publication of Born-semantic Open data the future is really exciting.

This Blog Post prepared with ICE 4.5.6 from USQ in Open Office

Posted in Uncategorized | Leave a comment

BioIT 2009 – What is data? -1

This is a post — probably the last — in a series outlining Open Semantic Data in Science at BioIT Boston (see BioIT in Boston: What is Open?).

I’ve explained Open and Semantic. Now I’ll attempt data. I thought this would be simple; it’s not.

Why use the word data at all? I think the reasons are pragmatic. There’s an oft-presented hierarchy:

Data → Information → Knowledge → Wisdom

(the last influenced by T.S.Eliot)

Different people would have cutoffs at different points on this hierarchy but I think the following are fairly common attributes of data:

  • it is distinct from most prose (although some prose would be better recast as data)

  • it is generally a component of a larger information or knowledge structure

  • facts and data are closely related

  • many data are potentially reproducible or unique observations; they are not opinions (though different people may produce different data)

  • data, as facts, are not copyrightable.

  • Collections of data and annotated data (data + metadata) may have considerably enhanced value over the individual items.

  • Data can be processed by machine

Here are some statements which provide data:

and here are some which are not data:

  • her work is well respected

  • we thank Dr. XYZZY for the crystals

  • we find this reaction very difficult to perform

What’s the point of making the distinction? From my point of view:

  • Data can and increasingly should be converted to semantic form.

  • Data are not copyrightable and should be free to the community

  • Linking Open Data is now possible and has stunning potential.

So my self-appointed mission is to carry this out in the domain that I at least partially understand: chemistry.

  • We already have the semantic framework (CML, ChemAxiom)

  • we are managing to liberate data (Pubchem, Chemspider, CrystalEye, NMRShiftDB, CLARION, etc.)

  • when we have liberated enough we can start to provide Linked Open Data.

We aren’t there yet, as there are very few fully Open Data sets in chemistry (CrystalEye may be the only one that asserts Openness through OKF). And unless there is something to link to, we can’t do very much.

But we are moving fast. Four of us (Antony Williams, Alex Tropsha, Steve Heller and PMR) met for an hour yesterday to discuss what we’d like to do and how we might do it. We have complementary things to bring to the table, so watch for developments.

This Blog Post prepared with ICE 4.5.6 from USQ in Open Office

Posted in Uncategorized | 2 Comments

BioIT 2009 – chemical semantic concepts

This is a post in a series outlining Open Semantic Data in Science at BioIT Boston see (BioIT in Boston: What is Open? ).

To build a complete semantic framework we need formal systems to express our concepts, normally markup languages or ontologies. In Cambridge we use both, particularly Chemical Markup Language and ChemAxiom (Nico Adams). Let’s start with ChemAxiom which Nico has blogged. His diagram gives a good overview:

[ChemAxiom overview diagram]

Since Nico leads our Polymer Informatics effort there is a special concentration on polymers, but the framework is very general and can be used for mainstream chemistry. As shown, ChemAxiom emphasizes substances and their properties. Because of our backgrounds as (in part) physical chemists there is also an emphasis on methods of measurement (metrology) and scientific units of measurement.

The ontology is descended from the Basic Formal Ontology, which is an Upper Ontology. This has abstracted concepts which are common to many different domains: science, commerce, literature, etc. From Wikipedia:

The BFO or Basic Formal Ontology framework developed by Barry Smith and his associates consists in a series of sub-ontologies at different levels of granularity. The ontologies are divided into two varieties: SNAP (or snapshot) ontologies, comprehending continuant entities such as three-dimensional enduring objects, and SPAN ontologies, comprehending processes conceived as extended through (or as spanning) time. BFO thus incorporates both three-dimensionalist and four-dimensionalist perspectives on reality within a single framework

In this scheme a compound would be a continuant, whereas a reaction as performed by a given person would be an occurrent. Note that language often maps different concepts onto the same linguistic term:

The reaction took place over 1 hour (occurrent)

The aldol reaction creates C-C bonds (continuant?)

Is the latter use a continuant or an occurrent? This is the sort of thing we discuss in the pub.

ChemAxiom is expressed in OWL 2.0, and this brings considerable power of inference. For example, if something is a quantity of type temperature, the ontology can assert that kelvin is a compatible unit while kilogram is not. And this is only a simple example of the power of ChemAxiom.
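The flavour of that inference can be mimicked in a few lines; the real reasoning lives in the OWL ontology, and the table below is an invented toy, not ChemAxiom:

```python
# Toy mapping of quantity kinds to admissible units (illustrative only).
# In ChemAxiom this knowledge is axiomatised in OWL and checked by a reasoner.
COMPATIBLE_UNITS = {
    "temperature": {"kelvin", "celsius", "fahrenheit"},
    "mass": {"kilogram", "gram"},
    "length": {"metre", "angstrom"},
}

def unit_compatible(quantity_kind, unit):
    """Would the ontology accept this unit for this kind of quantity?"""
    return unit in COMPATIBLE_UNITS.get(quantity_kind, set())
```

So a melting point quoted in kilograms would be flagged automatically, which is exactly the kind of error that plain text lets slip through.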

In general we expect ChemAxiom to have a stable overall framework as expressed in the diagram but for the details to change considerably as the community debates (hopefully good-naturedly) over the lower level concepts. Nico is presenting this at the International Conference on Biomedical Ontology.

Chemical Markup Language uses a wider-ranging set of chemical and physical concepts, derived to varying degrees from chemical communication between humans and machines. We (Henry Rzepa and I) have used published articles, chemical software systems, chemical data and computational chemistry outputs to define a range of concepts which now seem fairly stable. That’s perhaps not too surprising, as many of them are over 100 years old. The main concepts are:

  • molecules, compounds and substances
  • chemical reactions, synthesis and procedures
  • crystallography and the chemical solid state
  • chemical spectroscopy and analytical chemistry
  • computational chemistry.

There are about 50 sub-concepts, and all of these are expressed as XML elements. Thus we can write:

<cml:property dictRef="properties:meltingPoint">
  <cml:scalar dataType="xsd:double" units="units:kelvin">450</cml:scalar>
</cml:property>

The general semantics are that we have a property with a numeric value. The precise semantics are added by links to dictionaries (or to ontologies such as ChemAxiom). With this framework it is possible to mark up much of current chemical articles.
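A consumer can read those general semantics straight out of the fragment. The sketch below assumes the standard CML namespace URI; the element names and dictionary prefixes are as in the fragment:

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # the published CML namespace
fragment = f"""
<cml:property xmlns:cml="{CML_NS}" dictRef="properties:meltingPoint">
  <cml:scalar dataType="xsd:double" units="units:kelvin">450</cml:scalar>
</cml:property>
"""

prop = ET.fromstring(fragment)
scalar = prop.find(f"{{{CML_NS}}}scalar")
value = float(scalar.text)       # the numeric value: 450.0
units = scalar.get("units")      # "units:kelvin", resolved via a units dictionary
concept = prop.get("dictRef")    # "properties:meltingPoint", resolved via a dictionary
```

Note that the code only knows it has "a property with a number and units"; what melting point *means* comes from following the dictRef link.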

It’s also possible to mark up much computational chemistry output, and we’ve done this for many major codes such as Gaussian, Dalton, GULP, DL_POLY, MOPAC, GAMESS, etc. This has made it possible to chain processes together into a semantic workflow:

[semantic workflow diagram]

Here one program emits XML which feeds automatically into the input of the next. All output is semantic and can be stored in XML or RDF repositories such as our Lensfield, which is being developed by Jim Downing and others.
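The chaining can be sketched as stages that each consume and emit XML, so nothing semantic is dropped between programs. Both stage names and tags here are invented for illustration, not the real codes:

```python
import xml.etree.ElementTree as ET

def optimise_geometry(molecule_xml):
    """Hypothetical first stage: annotates the molecule it was given."""
    mol = ET.fromstring(molecule_xml)
    mol.set("optimised", "true")
    return ET.tostring(mol, encoding="unicode")

def compute_spectrum(molecule_xml):
    """Hypothetical second stage: reads the first stage's XML directly."""
    mol = ET.fromstring(molecule_xml)
    ET.SubElement(mol, "spectrum", {"type": "ir"})
    return ET.tostring(mol, encoding="unicode")

# One program's XML output feeds the next program's input, unchanged
result = compute_spectrum(optimise_geometry('<molecule id="m1"/>'))
```

Because every intermediate is well-formed XML carrying its own identifiers, any stage’s output can also be dropped into a repository as-is.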

Which brings me nicely onto Data… in the next post

This Blog Post prepared with ICE 4.5.6 from USQ in Open Office

Posted in Uncategorized | 1 Comment

BioIT 2009 – What is semantic?

I’m talking at BioIT about Open Semantic data and am going through the concepts in order. I’ve looked at Open (BioIT in Boston: What is Open?). Now for semantic(s).

I’m influenced by the Semantic Web and the Wikipedia article is a useful starting point. It also highlights Linked Open Data and I’ll write about that later. Let’s recall TimBL’s motivation:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web — the content, links, and transactions between people and computers. A Semantic Web, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialize.
Tim Berners-Lee, 1999

I have this dream for chemistry — it’s easier than trade and bureaucracy — and an early chemical semantic web is now technically possible. What do we require?

The most important thing is to realise that our normal means of discourse, spoken and written, are pervaded by implicit semantics. Take:


Compound 13 melted at 458 K

To an anglophone human chemist the meaning of this is obvious, but to a machine it is simply a string of characters (ASCII 67, ASCII 111, …). Although Natural Language Processing can interpret some of this (and our SciBorg project has addressed it), it’s still hard for a machine to get the complete meaning. So let’s look at the parts.

The concepts are:

  • A specific chemical compound
  • the concept of melting point
  • a numeric value
  • scientific units of measure
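A minimal, hypothetical extractor for sentences of exactly this shape shows those four concepts being captured explicitly (real NLP, as in SciBorg and OSCAR, is far harder; the pattern below is purely illustrative):

```python
import re

# Handles only the exact sentence shape quoted above — a toy, not a parser
PATTERN = re.compile(r"Compound (\d+) melted at (\d+(?:\.\d+)?) (K|C)")

def extract_melting_point(sentence):
    m = PATTERN.search(sentence)
    if m is None:
        return None  # heuristics failed: no semantics recovered
    return {
        "compound": m.group(1),       # a specific chemical compound
        "property": "meltingPoint",   # the concept of melting point
        "value": float(m.group(2)),   # a numeric value
        "units": m.group(3),          # scientific units of measure
    }

fact = extract_melting_point("Compound 13 melted at 458 K")
```

Once the statement is in this structured form it can be serialised as CML or RDF; the hard part is getting it out of free text in the first place.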

We use two main ways of expressing this: XML (Chemical Markup Language) and RDF. Nico Adams and I argue about when XML should be used and when RDF. Nico would like to express everything in RDF, and that time may come, but at present we have a lot of code that can understand CML, so I prefer to mix them (see below). In any case I’m not going to display them here.

What are the essential parts of our semantic framework?

  • The concepts must be clearly identified in a formal system. This can be an ontology or a markup language or both. In each case there is a schema or framework to which the concepts must conform. CML has a schema (essentially stable) and Nico’s ChemAxiom Ontology has to conform to the Basic Formal Ontology.
  • There must be an agreed syntax for expressing the statements. In CML this is XML with a series of dictionaries also expressed in CML. For RDF there are a number of universally agreed syntaxes.
  • All components in the statement should have identifiers. In CML this is managed through ID attributes, in RDF through URIs. TimBL’s vision is that if everyone uses URIs based on domain names then the world becomes a Giant Global (RDF) Graph. There is much debate as to whether a URI should also be an address; I’ll blog that later. Without question the management of identifiers is a key requirement in the 21st century.
  • There should be software that does something useful with the result. This is often overlooked: systems like RDF allow navigation and validation of graphs, and often a tabulation of the results. But chemists will want to view a spectrum as a spectrum, not as a set of RDF triples. We’ve made good progress here; currently my thinking is that CML acts as the primary way of exposing chemical functionality to programs.

I think I’ll post this (also to check ICE) and then talk about chemical concepts in the next post.

This Blog Post prepared with ICE 4.5.6 from USQ in Open Office

Posted in Uncategorized | Leave a comment

Chem4Word demo – afterwards – 1

Rudy and I presented the C4W demo to about 150 people, mainly from pharma/IT. We’d heard about the Pistoia Alliance, which aims to bring together pre-competitive elements in the pharma industry to share standards, data and protocols. So it was a good setting.

I was presenting from Rudy’s machine, whose touchpad I was rather unfamiliar with, and we also magnified the text so that people could read it, at the cost of some clipping. I’d had some trouble remembering the precise keystrokes to load and show the ontology markup, so we actually started with a fairly complete document. As always with interactive demos, it’s quite an effort to remember where to go next, what to say, and to keep it moving.

Anyway, C4W has had its first public showing and we can now show you a screen shot:

[screenshot: Chem4Word semantic table]

The point to stress is that each cell is semantic: they all show different surface features of the molecule in the row. We’ve chosen five views (identifier, formula, name, 2D diagram and InChI), but each of these could be clicked to change to a different representation.

What’s important to realise is that C4W is not a molecular editor; it’s a chemical document editor. Here we are editing the DOCUMENT:

[screenshot: editing the document]

We can, of course, edit molecules, and here’s an example:

[screenshot: the molecule editor]

Here the S atom has been picked and dragged; when the editor is saved this will update the document.

More later…

Posted in Uncategorized | 3 Comments

Chem4Word demo

Rudy Potenzone (MS) and I spent a long breakfast (or pre-breakfast) preparing the demo of Chem4Word, which gets shown in ca. 30 minutes (by me, with Rudy on hand, as it’s on his machine). We’ve both done demos before — my first was ca. 1978 (more later, perhaps) — but they were common for me in the mid-90s when I was selling (a little) into pharma. Rudy’s been in many companies and organisations — CAS, MDL, Polygen (Accelrys) — and is now at MSR, where he heads up industry liaison in chemistry. We’re excited about the potential of C4W and I have to organize my brain to make sure I remember my lines.

The demo has been planned for a month and there are 35+ separate operations to show. I have a crib sheet, as it’s easy to forget not how to show things but what to show and when.

Next post might tell you how it went. And perhaps some screen shots…

Posted in Uncategorized | Leave a comment