petermr's blog

A Scientist and the Web

 

Archive for April, 2011

What are the formal restrictions on text-mining?

Sunday, April 17th, 2011

#oscar4 #okfn #pantonpapers

A little while ago I suggested that we create whitepapers (“Panton Papers”, http://blogs.ch.cam.ac.uk/pmr/2010/07/24/open-data-the-concept-of-panton-papers/ ) to help our development of open science. We’ve come up with some titles and I’ve drafted one on text-mining http://blogs.ch.cam.ac.uk/pmr/2011/03/28/draft-panton-paper-on-textmining/ . There’s now a useful response from Todd Vision on the Open-science discussion list (http://lists.okfn.org/pipermail/open-science/2011-April/000698.html )

Peter’s draft whitepaper on text-mining is badly needed and nicely put. I was particularly interested in this passage:

 

“The provision of journal articles is controlled not only by copyright but also (for most scientists) by the contracts signed by the institution. These contracts are usually not public. We believe (from anecdotal evidence) that there are clauses forbidding the use of systematic machine crawling of articles, even for legitimate scientific purposes.”

 

We have also heard tell of the existence of such clauses, but also have not been able to secure first-hand evidence for them. It would be very nice to promote this from “anecdotal” to “documented”, and I would like here to put out a wider plea for anyone who might be able to provide the language of these contractual restrictions. Alternatively, I would welcome suggestions for how we are to know what exactly we are prohibited from doing in light of the confidential nature of the contracts.

 

If copyright holders really wish to enforce such restrictions, it seems odd that their very existence is little more than a rumor. Can secret restrictions be legally enforced?

 

Todd

 

So I'd very much like to have authoritative evidence in this area. If anyone has first-hand evidence of the FORMAL restrictions on text-mining please let us know on this blog or on open-science. Ideally this would be hard, documented detail, but if institutions are actually debarred from publishing this information then I, as a reader of the material, am in an interesting position. So before poking around in my usual fashion I'd like any evidence that's publishable without fear of lawsuits.

If not I will have to try to find this out the hard way.

OSCAR4 launch roundup

Friday, April 15th, 2011

#oscar4

IMO the OSCAR4 launch was a great success. We had visitors from outside the Unilever Centre, and others joined remotely on Twitter and the streaming video. The talks were very well presented and were all captured on the stream and more permanently. I have had a peek at the recordings and they will be useful in their own right – the text is a little small in places for my liking (but the blog is public). For me there was a real surprise – captured on video. I compared OSCAR3 to OSCAR4 as a Heath Robinson train to a Lego™ train. As I switched back to the web page there was a Google splash page with Stephenson's Rocket – completely unplanned. For a few secs I thought it was deliberate but it's just a coincidence. Follow the web page http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch to see details of where the videos are available.

Here are the famous cupcakes and the journal-eating-robot…

OSCAR4 is a library – we trailed bits of its API – and Dave Jessop did an excellent job of pulling out the essentials and showing how, with a few commands, we could search for chemical terms, customise the dictionaries and ontologies and create chemical structures.

The first task was to extract NamedEntities from text. If you know what text is, and know what a NamedEntity is, then it's simple. Here's the code – the whole code:

 

Oscar oscar = new Oscar();
List<NamedEntity> namedEntityList = oscar.findNamedEntities(text);

 

This 2-line program (idiomatic in almost all modern languages) says:

  • Create an Oscar object
  • Feed it some text and get a list of the named entities.

That’s it. If you understand the terms “text” and “named entity” then you can run Oscar. If you don’t know what a named entity is then just run the program and look at what comes out.

Simple.
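To make this concrete, here is a minimal, self-contained sketch of the two-line program (the import paths are my assumption from the OSCAR4 sub-project layout, and I rely on NamedEntity's toString() for output rather than guessing at its getters):

import java.util.List;
import uk.ac.cam.ch.wwmm.oscar.Oscar;                  // assumed import path
import uk.ac.cam.ch.wwmm.oscar.document.NamedEntity;   // assumed import path

public class OscarHello {
    public static void main(String[] args) {
        // Create an Oscar object with all the defaults
        Oscar oscar = new Oscar();
        // Feed it some text and get back the named entities
        String text = "Toluene was dissolved in ethyl acetate.";
        List<NamedEntity> namedEntityList = oscar.findNamedEntities(text);
        // Look at what comes out
        for (NamedEntity namedEntity : namedEntityList) {
            System.out.println(namedEntity);
        }
    }
}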

Here’s OSCAR munching a spectrum:

String text = "1H NMR (d6DMSO, 400MHz) d 9.20 (1H, m), 8.72 (1H, s)," +
        " 8.58 (1H, m), 7.80 (1H, dd, J=8.3 and 4.2 Hz) ppm.";

 

Let's take it in reverse order. In a well-engineered library there's often only one way you can fit the bits together. Like the Lego™ train.

 

Hmm, I need a DataParser (that was hinted at):

 

        List<DataAnnotation> annotationlist = DataParser.findData(procDoc);

 

So now I need to feed it a ProcessingDocument:

 

        ProcessingDocument procDoc = ProcessingDocumentFactory.getInstance().makeTokenisedDocument(tokeniser, text);

 

And to make one of those I need a Tokeniser. Ah, here's one of those:

 

        Tokeniser tokeniser = Tokeniser.getDefaultInstance();

 

So I made a default one. This is the convention-over-configuration stuff – use what is in the box and it will work, usually in a reasonable manner.

Put them together in the right order:

Tokeniser tokeniser = Tokeniser.getDefaultInstance();
ProcessingDocument procDoc =
        ProcessingDocumentFactory.getInstance().makeTokenisedDocument(tokeniser, text);
List<DataAnnotation> annotationlist = DataParser.findData(procDoc);

 

And I have my DataAnnotation (only one, because there is only one spectrum). It's in XML and can easily be converted into CML and so displayed in CML-compliant tools.

Simple?
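For completeness, here is the whole spectrum example as one runnable sketch (the tokeniser and oscardata import paths are my assumptions from the sub-project names; each DataAnnotation is printed via its toString()):

import java.util.List;
import uk.ac.cam.ch.wwmm.oscar.document.ProcessingDocument;          // assumed import paths
import uk.ac.cam.ch.wwmm.oscar.document.ProcessingDocumentFactory;
import uk.ac.cam.ch.wwmm.oscartokeniser.Tokeniser;
import uk.ac.cam.ch.wwmm.oscardata.DataAnnotation;
import uk.ac.cam.ch.wwmm.oscardata.DataParser;

public class SpectrumMuncher {
    public static void main(String[] args) {
        String text = "1H NMR (d6DMSO, 400MHz) d 9.20 (1H, m), 8.72 (1H, s)," +
                " 8.58 (1H, m), 7.80 (1H, dd, J=8.3 and 4.2 Hz) ppm.";
        // Assemble the bits in the only order they fit together
        Tokeniser tokeniser = Tokeniser.getDefaultInstance();
        ProcessingDocument procDoc =
                ProcessingDocumentFactory.getInstance().makeTokenisedDocument(tokeniser, text);
        List<DataAnnotation> annotationList = DataParser.findData(procDoc);
        // Expect a single annotation here: the one NMR spectrum, as XML
        for (DataAnnotation annotation : annotationList) {
            System.out.println(annotation);
        }
    }
}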

So here we are in the Panton afterwards…

OSCAR: The Journal-eating robot

Tuesday, April 12th, 2011

#oscar4

Follow http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch for all news and URLs (live stream). Hashtag #oscar4.

To introduce the launch of OSCAR4 I will give a short timeline and hopefully credit those who have made contributions.

When I joined the Unilever Centre I had the vision that the whole of chemical information could be used as a basis for intelligent software agents – a limited form of "artificial intelligence". By collecting the world's information, machines would be able to make comparisons, find trends, look up information, etc. We would use a mixture of algorithms, brute force, canned knowledge, machine learning, tree-pruning, etc.

The knowledge would come from 2 main sources:

  • Data files converted into semantic form
  • Extraction of information from “written” discourse. This post deals with the latter – theses, articles, books, catalogues, patents, etc.

I had started this in ca. 1997 when working with the WHO on markup of diseases and applying simple regular expressions. (Like many people I thought regexes would solve everything – they don't!)

In 2002 we started a series of summer projects, which have had support from the RSC, Nature, and the IUCr among others. We started on the (very formulaic) language used for chemical synthesis to build a tool for:

  • Checking the validity of the information reported in the synthesis
  • Extracting and systematising the information.

This was started by Joe Townsend and Fraser Norton and at the end of the summer they had a working applet.

2003

Sam Adams and Chris Waudby joined us and turned the applet into an application (below); Vanessa de Sousa created "Nessie", a tool which linked names from documents to collections of names, such as catalogues (at that stage these collections were quite rare).


Here is OSCAR-the-journal-eating-robot in 2004 eating a bit of journal, analysing it, validating it and transforming the data into CML (e.g. a spectrum).

By 2004 OSCAR was in wide use on the RSC site (http://www.rsc.org/Publishing/Journals/guidelines/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/index.asp ) and could be downloaded for local work. During this period the variants of OSCAR were loosely named OSCAR, OSCAR-DATA, OSCAR1 and OSCAR2. I pay tribute to the main authors (Joe Townsend, Chris Waudby and Sam Adams) who produced a complex piece of software that has run without maintenance or complaint for 7 years – probably a record in chemistry. At this time Chris developed an n-gram approach to identifying chemical names (e.g. finding strings such as "xyl" and "ybd" which are very diagnostic for chemistry).

By 2005 OSCAR's fame had spread and (with Ann Copestake and Simone Teufel) we won an EPSRC grant ("Sciborg") to develop natural language processing in chemistry and science (http://www.cl.cam.ac.uk/research/nl/sciborg/www/ ). We partnered with the RSC, IUCr and Nature PG, all of whom collaborated by making corpora available. By that time we had learnt that the problem of understanding language in general was very difficult indeed. However, chemistry has many restricted domains, and that gave a spectrum of challenges. Part of SciBorg was, therefore, to develop chemical entity recognition (i.e. recognising chemical names and terms in text). Simple lookup only solves part of this as (a) there are many names for the same compound and (b) hypothetical or recently discovered/synthesised compounds are not in dictionaries. This work was largely carried out by Peter Corbett (with help from the RSC) and the tool became known as OSCAR3.

Entity recognition is not an exact science and several methods need to be combined: lookup, lexical form (does the word look “chemical”), name2Structure (can we translate the name to a chemical?), context (does it make sense in the sentence?) and machine learning (training OSCAR with documents which have been annotated by humans). There are many combinations and Peter used those which were most relevant in 2005. OSCAR rapidly became known as a high-quality tool.

This required large amounts of careful, tedious work. A corpus had to be selected and then annotated by humans. Strict guidelines were prepared; here’s an example from Peter Corbett and Colin Batchelor (RSC). What is a compound? There are some 30 pages like this.


The humans agreed at about 90% precision/recall, setting the limit for machine annotation (OSCAR is about 85%). OSCAR can do more than just chemical compounds and here is a typical result:

 

In 2008 Daniel Lowe took up the name2Structure challenge and has developed OPSIN to be the world's most accurate name2structure translator, with a wide range of vocabularies. OPSIN was bundled as part of OSCAR3, which also acquired many other features (web server, chemical search, annotation, etc.). We were joined by Lezan Hawizy, who initiated Unilever-funded work on extracting polymer information from the literature. Joe Townsend worked on the JISC SPECTRaT project, looking at extracting information from theses. In 2009 we won a JISC grant (Cheta) to see whether humans or machines were better at marking up chemistry in text, and work proceeded jointly at Cambridge (Lezan Hawizy) and the National Centre for Text Mining (NaCTeM, Manchester).

In late 2009 OMII contracted to start the refactoring of OSCAR and in 2010 we won an EPSRC grant to continue this work in conjunction with Christoph Steinbeck (ChEBI at EBI), and Dietrich Rebholz. The hackforce included David Jessop, Lezan Hawizy and Egon Willighagen resulting in OSCAR4.

OSCAR is widely used but much of our feedback is anecdotal. It's been used to annotate PubMed (20 million abstracts) and we have used it to analyse >100,000 patents from the EPO ("the green chain reaction"; example: http://greenchain.ch.cam.ac.uk/patents/results/2009/solventFrequency.htm ). In 2010 we were funded by EPSRC "Pathways to Impact" to extend OSCAR to atmospheric chemistry (Hannah Barjat and Glenn Carver).

Besides the people mentioned above, thanks to Jonathan Goodman, Bobby Glen, Nico Adams and Jim Downing.

Summary:

Thanks to our funders and supporters: EPSRC, DTI, Unilever, RSC, Nature, IUCr, JISC.

OSCAR4 now has a robust architecture and can be extended for work in a variety of disciplines. It interfaces well with other Open Source tools (e.g. from the Blue Obelisk) and can be customised as a component (e.g. in Bioclipse) or using OS tools and its own components. The main challenges are:

  • The commonest form of scientific document, PDF, is very badly suited to any information extraction.
  • Almost all publishers forbid the routine extraction of information from their publications.

 

OSCAR4: how you can make a library work for you

Monday, April 11th, 2011

#oscar4

As we've said, OSCAR4 is a set of library components that can be reconfigured in many ways (rather than a single monolithic application). On Wednesday (in Cambridge or remotely) we'll be looking at how to bolt the bits together. We'll make a Java IDE (Integrated Development Environment, such as Eclipse, NetBeans or IntelliJ) available to participants. (Actually this is probably the hardest part of the whole thing – getting your environment working.)

Now you have to get OSCAR4 from https://bitbucket.org/wwmm/oscar4/. It’s managed by a system called Mercurial which looks after what to download and how to configure it on your system. Create a directory on your system and download OSCAR4 (Bitbucket even tells you how:

hg clone https://petermr@bitbucket.org/wwmm/oscar4 )

 

When you’ve done that you’ll find a series of directories representing the main sub-projects in OSCAR (which hopefully have meaningful names):

(Exactly how this displays will depend on the options in your IDE.)

OSCAR4 provides the simplest options through the oscar4 package. We use the "convention over configuration" approach http://en.wikipedia.org/wiki/Convention_over_configuration . What's that? Basically this says that you create default actions, default naming conventions, etc. so that you can run the system as easily as possible and get the most useful results. If you want something more sophisticated you change the configuration. Here's the structure:

The main class is Oscar (actually uk.ac.cam.ch.wwmm.oscar.Oscar) and it has a series of methods. This is not as bad as it looks. The get methods (getters) and the setters can be ignored for now (because we will use the defaults). So this leaves us with only 4 methods we might use. Here's the most important:


/**
 * Wrapper method for identification of named entities. It calls the methods
 * normalise(String), tokenise(String), and recogniseNamedEntities(List).
 *
 * @param input the input text.
 * @return the recognised chemical entities.
 */
public List<NamedEntity> findNamedEntities(String input);

 

Note that this already calls tokenise() and recogniseNamedEntities(List). So we don’t need to worry about these at first reading. The final method is:


/**
 * Wrapper method for the identification of chemical named entities
 * and their resolution to connection tables. It calls the methods
 * normalise(String), tokenise(String), recogniseNamedEntities(List), and
 * ChemNameDictRegistry.resolveNamedEntity(NamedEntity).
 *
 * @param input String with input.
 * @return the recognised chemical entities as a List of ResolvedNamedEntitys,
 *         containing only those named entities of type NamedEntityType.COMPOUND
 *         that could be resolved to connection tables using the current dictionary registry.
 */
public List<ResolvedNamedEntity> findResolvableEntities(String input);

 

So, if we understand the meanings of the nouns, this is almost English prose. Identifying Chemical Named Entities is the primary purpose of OSCAR4. Normalising the string is necessary, but we'll trust OSCAR. Tokenisation is splitting the sentence into words (not necessarily just at spaces). And the ChemNameDictRegistry is a registry of ChemNameDicts. These are dictionaries mapping chemical names to structures (e.g. "methanol" -> CH3OH). So, in this case, instead of returning NamedEntitys we resolve them into structures before returning.

 

Simple?!
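As a sketch, a resolution run then looks just like recognition (again the import paths are my assumptions, and I lean on toString() rather than walking the ResolvedNamedEntity API):

import java.util.List;
import uk.ac.cam.ch.wwmm.oscar.Oscar;   // assumed import paths, as before
import uk.ac.cam.ch.wwmm.oscar.chemnamedict.entities.ResolvedNamedEntity;

public class ResolveDemo {
    public static void main(String[] args) {
        Oscar oscar = new Oscar();
        // Only NamedEntityType.COMPOUND entities that resolve to a connection
        // table via the ChemNameDictRegistry come back from this call
        List<ResolvedNamedEntity> entities =
                oscar.findResolvableEntities("Methanol and ethanol are both alcohols.");
        for (ResolvedNamedEntity entity : entities) {
            System.out.println(entity);
        }
    }
}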

 

So the first exercise is:

 

    /**
     * Use the Oscar API class to identify named entities in
     * the text string below. Print each to the console.
     */
    public static void main(String[] args) {

        String text = "Ethyl acetate and acetone both contain the carbonyl group.";
        //TODO Implement your method here!

    }

 

To get you started, I mention that you have to create an Oscar object, and then select the appropriate method:

 

Oscar oscar = new Oscar();
oscar.doSomething(argument);

 

and that’s all there is to it…
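And if you want to check your answer, one possible solution (imports omitted) is:

    public static void main(String[] args) {

        String text = "Ethyl acetate and acetone both contain the carbonyl group.";

        // Create an Oscar and ask it for the named entities in the text
        Oscar oscar = new Oscar();
        List<NamedEntity> namedEntityList = oscar.findNamedEntities(text);

        // Print each to the console; expect hits such as "Ethyl acetate",
        // "acetone" and "carbonyl"
        for (NamedEntity namedEntity : namedEntityList) {
            System.out.println(namedEntity);
        }
    }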

What is OSCAR4 and why we created it

Monday, April 11th, 2011

#oscar4

On Wednesday we are launching OSCAR4 (http://blogs.ch.cam.ac.uk/pmr/2011/04/08/oscar4-launch/) . OSCAR4 has involved a very large amount of work (“refactoring”) which has resulted in some change to the surface functionality and a huge change to the architecture.

What does that mean? Essentially it means that we (the extended group of Egon, Daniel, David, Lezan, Sam and Bala and others, with not much PMR) have almost completely ripped out the guts of OSCAR3 and replaced them with a series of modules that are engineered to work reliably and interoperate. Here are two steam engines of roughly the same form. The first (Heath Robinson, from Wikipedia) is made of sealing wax and string, glue, wood and a prayer


while the second http://uk.wikipedia.org/wiki/%D0%A4%D0%B0%D0%B9%D0%BB:Lego_Creator_4837_-_Mini_Trains.jpg is made from reusable components. That's a major difference between OSCAR3 and OSCAR4 – we now have something that can be extended and that interoperates. (That is not to belittle the efforts of the authors of OSCARs 1, 2 and 3, who have built excellent software that is useful and widely used. But every piece of software tends to become bloated, and refactoring is an essential part of software engineering. The world changes and expectations change.) Fred Brooks says (http://www.softwarequotes.com/printableshowquotes.aspx?id=556): "Plan to throw one away; you will anyhow." So time for OSCAR4.

What’s the difference? OSCAR4 consists of a “core” of OSCAR3 which is the main language engine. We’ve removed the following from the core:

  • Chemical substructure and similarity search. Structure isn’t fundamental to the language processing (unlike OPSIN where chemical structure matters). So searching for entities can be done through decoupled services or other libraries.
  • Scrapbook. A place where people can keep the structures in their documents. Again we decouple this – we could, for example, now use Chem#.
  • Lookup from Pubmed. Again this can be decoupled.
  • Annotation. This is useful for training models but doesn't need to be part of the main libraries.

Everything else has been kept in some form. We’ve also added:

  • Configurable lexicons (dictionaries, …). This allows anyone to add their own names and structures.
  • Configurable workflow (perhaps the most powerful refactoring). This means that you can swap in your own Tokeniser, Hyphenator, machine-learning model, dictionaries, and ontologies (name–identifier pairs). It makes OSCAR compatible with tools such as UIMA.
  • ChemicalTagger (the chemical phrase analyser from Lezan Hawizy). Although this isn't formally part of the core it's very likely to be used in conjunction with OSCAR4. This combination is a very powerful chemical language analyser.

On Wednesday we’ll take you through this (hopefully including those online). It’s important to realise that OSCAR4 is a library of components, not an application. (It’s easy to build applications, of course). So OSCAR is not a web server (but can be bolted into one). It’s not a mobile app but should be capable of being included in one. Etc. Think gearboxes and axles, not cars. The scientist of the future will build their applications from components, just as they build their glassware from ground-glass components. OSCAR4 is designed as a tool that can be included in any application, whether Open or commercial (it’s Open Source).

This means understanding how to bolt things together. It's not hard, any more than building trains from Meccano™. We'll give some simple examples of how to process a document, how ChemicalTagger works, how one might create a server. We'll show how to use Java, understand the docs, etc. And how to extend and modify OSCAR4 without fear of breaking it.

NWChem and Quixote: a first JUMBO template-based parser

Monday, April 11th, 2011

#quixotechem #nwchem #jumboconverters

As I’ve mentioned (http://blogs.ch.cam.ac.uk/pmr/2011/04/09/nwchem-a-fully-open-source-compchem-code-from-pnnl/ ) I now see NWChem as my flagship project for the Open Source Quixote (http://quixote.wikispot.org/Front_Page ) project to create an open source semantic framework for computational chemistry.

I was really flattered and encouraged by the NWChem group at PNNL, and so spent much of the journey back (plane, taxi) and the weekend hacking a JUMBO parser (http://blogs.ch.cam.ac.uk/pmr/2011/03/24/extracting-data-from-scientific-calculations-and-experimental-log-files/ ). This is quite different from normal parsers written in C, Perl, Python, Java, etc. as it's declarative. There is no need for the parser writer to write a single line of procedural code.

Here's an example. I want to parse:

Brillouin zone point: 1

weight= 1.000000

k =< 0.000 0.000 0.000> . <b1,b2,b3>

=< 0.000 0.000 0.000>

.. terminated by a blank line ..

Assuming that the phrase “Brillouin zone point:” only occurs in this setting (and there are things I can do if it’s more complicated), I find it with a regular expression:

<template id="brillouinzp" repeat="*" pattern="\s*Brillouin zone point:\s*" endPattern="\s*">

This gives the starting regex and ending regex (\s* means any amount of whitespace) and the template can be repeated as many times as wanted (*). In the case of NWChem it works very well. Then to parse the 4 lines (records) we use per-line regexes (only active in the current scope, brillouinzp):

<record id="brill">\s*Brillouin zone point:{I,n:brill.zone}</record>
<record id="weight">\s*weight={F,n:brill.wt}</record>
<record id="k">\s*k\s*=&lt;\s*{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>
<record id="k1">\s*=&lt;{F,n:brill.h}{F,n:brill.k}{F,n:brill.l}>.*</record>

This extracts the fields into CML elements (cml:scalar in this case), each identified with a unique ID. The {…} are shorthands for regular expressions for Integer, Float, etc. with an associated QName. The format n:foo maps onto a unique URI, with xmlns:n="http://www.xml-cml.org/dictionary/nwchem". (This dictionary will shortly exist publicly, but we hope the definitive one will be managed by the NWChem group – they know what the terms mean!)
Every field must have an entry in the NWChem dictionary and so far we have extracted about 200 terms from the example files – I expect this to reach about 300-400. Thus the entry with id="brill.wt" will describe the floating point number (F) that is the weight of the zone point.

There are tools to extract arrays, matrices and molecules and to transform to CML where necessary (the above will probably be transformed into concise and semantic CML). Finally we terminate the template:

</template>

This declarative approach (inspired by XSLT) has many advantages over the procedural:

  • It’s much quicker to write and more concise
  • It’s much easier to see what is going on without delving into the code
  • There are no IF statements to take care of unexpected or variable chunks of output
  • It’s much easier to document (it’s all in XML and can be processed by standard tools)
  • It’s easier to create sub-parsers for special cases
  • There is no loss of information – unparsed records are reported as such
  • It maps easily onto a dictionary structure
  • It preserves the implicit hierarchical structure of documents
  • It would be possible to generate some templates directly from a corpus of documents
  • It provides an excellent input for Avogadro and Jmol

It requires the authors to learn regex, but they would have to do that anyway. Its main limitations are:

  • It’s based on lines (records) and does not work well where line ends are simply wrapping whitespace
  • It relies on distinctive phrases (especially alphabetic) – it's not designed for dense numeric output (though it will work for some)

There are about 120 templates so far for NWChem and the approach has stood up well to new examples. It has parsed files of ca. 1 MByte in a few seconds. (Remember that these files can take days to compute, so the time is trivial.) So I'm convinced that it works, and scales. I don't yet know how easy others will find it, but we've had good first impressions.

I will keep you in touch. Open Data, Open Source and Open Standards are coming to computational chemistry!

 

Southampton’s Blog3 and ScholarlyHTML

Monday, April 11th, 2011

#scholarlyhtml

There were several exciting things to come from the recent workshops (World University Network lab notebook, and OREChem) at PNNL; this post is on Southampton's Blog3 (http://blog3.rubyforge.org) (Jeremy Frey, Simon Coles, Mark Borkum and others). They've been using blogging as a means to provide an "electronic lab notebook" (a term which I think is rather dated, and the Soton work goes much beyond it).

They've been doing it for some years, but I think the progress over the last year or so has been important. Originally blog technology was very flaky (and still is), so they've written their own, which creates a better semantic platform. I'm not sure about its interoperability and the issues in using it beyond Soton, but I hope to find out.

It has convinced me that blogging is the way to go for capturing and enhancing scientific work – at least for academia and probably for companies as well. There is so much common ground with established practice on the web. Obviously if strong AAA (Authorisation, Authentication, Accounting) is required this takes a LOT more effort whatever technology is used – there are no easy answers (and academia is a hotchpotch of so many different problems in that area).

This reinforces my conviction that HTML5, not PDF, is the way to go for science. (It always was until PDF lost us ten years of progress). ScholarlyHTML fits perfectly into this. It helps to define what convention(s) a blog should emit and what it can consume. If, for example, the blog has created a chemical compound record, then we should use a convention that supports and constrains this in ScholarlyHTML (http://blogs.ch.cam.ac.uk/pmr/2011/03/14/scholarly-html-%E2%80%93-major-progress/ ). Of course we can and should embed CML in this where appropriate – e.g. for molecules, crystals, calculations, etc.

If we all adopt ScholarlyHTML for our science – and the relatively modest discipline it imposes – then we can have something close to semantic interoperability.

And where we can’t it’s because we don’t fully understand the science, not because we cannot manage the syntax.

NWChem: a fully Open Source compchem code from PNNL

Saturday, April 9th, 2011

#nwchem #quixotechem

I've spent a great 4 days at Pacific Northwest National Laboratory (http://www.pnl.gov/ ) where we've been doing a number of things – including OREChem (which I'll blog later). It's been great to talk with the people who have developed and are continuing to develop NWChem (www.nwchem-sw.org/ ) – their flagship computational chemistry package (which does both atomistic and plane wave calculations). It's very large and I'll be finding out more during the plane journey back.

 

But the key first thing is that it’s Open Source.

The normal practice in computational chemistry is to develop a business model where costs can be recovered. Sometimes this is free-to-academics-pay-by-industry. Sometimes it’s pay-by-everybody.

I have no moral principles against charging for software. But there is a utilitarian downside. It fragments the community (there are probably 10 other codes which do much the same as NWChem). It leads to closed algorithms ("you can't see our code because you might steal it"). And it is difficult to develop a modern model where there are community contributions.

The result is that many codes have an architecture and community that creaks.

NWChem has broken the mould. (I should mention that there are plane wave codes which have also done this, Quantum ESPRESSO and ABINIT – and I work with them as well.)

So I and other Quixotans are working with Open Source codes to add semantics. That will take them from FORTRAN-like tools with serious impedance in input/output to potentially semi-intelligent information engines. It means that the language of compchem will not be 20 separate languages, but languages which truly reflect the physics and chemistry.

What’s the plane journey got to do with it?

I’m writing a declarative parser for NWChem. Here’s a chunk of current log output:

Lattice Parameters

——————

 

lattice vectors in a.u. (scale by 1.000000000 to convert to a.u.)

 

a1=< 5.920 0.000 0.000 >
a2=< 0.000 10.255 0.000 >
a3=< 0.000 0.000 9.653 >

a= 5.920 b= 10.255 c= 9.653

alpha= 90.000 beta= 90.000 gamma= 90.000

omega= 586.0

 

reciprocal lattice vectors in a.u.

 

b1=< 1.061 0.000 0.000 >
b2=< 0.000 0.613 0.000 >
b3=< 0.000 0.000 0.651 >

 

Now CML understands this – it has lattice vectors (real and reciprocal). But what’s “omega”? I’m a crystallographer and I’ve never heard of omega. There’s a clue later in that the volume is also given as 586.0. So I am guessing that omega is the symbol for volume in some community of practice.

So we are creating a vocabulary that the whole NWChem community can contribute to. I even hope that someone will comment on this post, but even if not the communal process will soon resolve this problem.

Once and for all.

So by an open community process we make rapid progress. Which will soon mean that the Open codes will have a major semantic advantage over the closed codes.

At that stage scientists will start to wonder whether "free as in beer" and "free as in speech" are actually very valuable concepts, and ones worth throwing their effort behind.

I look forward to much continued collaboration with the NWChem group and the Quixotans.

 

And an exciting plane journey.

 

OSCAR4 Launch

Friday, April 8th, 2011

#oscar4launch

I am delighted to announce the launch of OSCAR4:

http://www-pmr.ch.cam.ac.uk/wiki/OSCAR4_Launch

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup or name-to-structure parsing using OPSIN[1], or with identifiers from the ChEBI ('Chemical Entities of Biological Interest') ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare[3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

OSCAR4 is a major rewrite and the people involved – Lezan Hawizy, Bala Kolluru, David Jessop, Sam Adams and others – deserve great credit. OSCAR4 makes it much easier to incorporate as a module for:

  • Training/machine-learning
  • Domain adaptation
  • Web applications
  • Etc.

We see OSCAR4 as potentially applicable to a wide range of corpora in the physical sciences (not just chemistry), and as particularly suited to named entities, quantities with units and errors, and chemistry-in-other-disciplines.

 

The Freedom Cloud: The future of our culture is in the balance

Friday, April 1st, 2011

#okcon2010 #okfn

I have known Becky Hogge for several years – Becky is deeply involved in the Open movement and is inter alia a board member of the OKF. She's just published an essay http://www.opendemocracy.net/becky-hogge/freedom-cloud which so exactly mirrors my own thoughts (and leads them) that I want you all to read it. It also mirrors keynotes at OKCon10. The message is simple:

At this very moment the freedom of the world’s culture is in the balance

That's a strong statement and it's an act of faith. We are in the middle of a great cultural change (due to the Internet). Because we are in the middle of history we cannot (by definition) assess it objectively. But in 20 years historians will look back and say that in 2010-2015 the battle for freedom was won or lost. (Of course if the loss is too traumatic – a 1984-like newspeak culture – there will be no historians. And no language in which to express our loss.)

Read Becky, not my summary. But in essence the forces of control (mainly large corporations: Google, Apple, Microsoft (though diminished), Thomson-Reuters, Macmillan, Elsevier, Murdoch) are looking to monopolise our thought and culture. The printing press liberated our culture, but printing presses can be controlled. The heady days of 1993, when everything was possible, have withered and we have Facebook, Google, etc.

So why aren't these a "good thing"? Google does no evil, so we shouldn't worry. But history teaches that all large organizations self-corrupt. I used to work in pharma (Allen and Hanburys). It did no evil. I knew the people who ran it. They made medicines to cure people or manage diseases (e.g. Ventolin). I know they would be incapable of the excesses of current pharma. But now the pharma industry is managed by standard corporate goals. So it wasn't surprising that a publisher and a pharma company got together to create a fake scientific journal solely for making money for both. Truth was abandoned.

By analogy that has to be true for all large industries. Some have better corporate roots than others, but the benevolent dynasties of 19th-century industrialists (from which my personal history springs) – Cadbury, Rowntree, Lever, Nettlefold – have gone, and there are no moral or religious checks. So we have to question and check everything that large corporations do.

The key problem is the control of information and, through that, the control of people and people's thought. Facebook controls people. Google controls people. And through their lobbying of governments, publishers and media control people. Net-neutrality is critical – we have to fight for it. Established laws are not a useful precedent – we have to create the visions that thinking moral citizens would adopt. Charters and constitutions (e.g. why the OK definition is so important). Our 21st C equivalent of the Bill of Rights.

The good thing at present is that there are many more educated, literate humans than in the 18th century. Even in Scotland, whose Enlightenment was responsible for much of our current freedom of thought.

Some days I wake up and think what a lot of things we are liberating. And other days I think how much is being ripped away before our eyes. Why does academia not rise up and protect its freedoms? Its primary job is to define our possible cultures and put them in front of us. And, if we want to pursue freedom (as opposed to personal glory), to help us and to go to the wall if necessary. Yes, as Becky recounts, freedom is in the balance from Libya to Bahrain. But it's also in the balance in Washington and London.

All I can do is “keep buggering on”. And hope that the little bits of very hard won freedom will inspire and can be aggregated to an emergent phenomenon of world internet freedom.

It can happen.