MathML and CML communities

I was delighted to meet old friends from the MathML/OpenMath community last week at Mathematical Knowledge Management 2007 – Patrick Ion, Robert Miner, James Davenport and Michael Kohlhase (apologies to any I have omitted). OpenMath (1993) was one of the first non-textual markup languages and was based on SGML, while MathML came along later (1999). The languages are distinct but deliberately converging and (from WP):

OpenMath consists of the definition of “OpenMath Objects”, an abstract datatype for describing the logical structure of a mathematical formula, and the definition of “OpenMath Content Dictionaries”, or collections of names for mathematical concepts. The names available from the latter type of collections are specifically intended for use in extending MathML, and conversely, a basic set of such “Content Dictionaries” has been designed to be compatible with the small set of mathematical concepts defined in Content MathML, the non-presentational subset of MathML.

so I shall tend to use them interchangeably. Note, however, that MathML is an activity of the W3C, while (WP)

OpenMath has been developed in a long series of workshops and (mostly European) research projects that began in 1993 and continues through today.

MathML and CML have had a long history of association. We tend to present on the same platforms (e.g. NSF / NSDL Workshop on Scientific Markup Languages). Each has its particular growth points – they are accepted as formal means of scholarly publication by several major publishers and there are a variety of toolkits.
Here I want to emphasize that each is required not just in its own domain, but by neighbouring ones. Thus chemistry needs MathML, geology needs CML, etc. This requires a different mindset when developing tools – it isn’t necessary to address all the cutting edge research in the mother subject – but important to make sure that you can solve a useful number of problems in everyday science and engineering.
As an example I asked the maths community whether I could search for a given differential equation, e.g.:

dx/dt = -k*x

You can, of course, type this directly into Google and get results like this but that only works when the variables are x and t. Thus
da/dt = -ka  … or …
da/a +kdt = 0
or many other forms represent the same abstract equation.
So I was delighted to find that several people were actively working on this – it means we can search the world’s literature for given functional forms independently of how they are represented. It’s hard – in some cases very hard – and varies between countries. It’s similar to the chemist’s use of InChI (see Unofficial InChI FAQ) to normalize and canonicalize chemical structure (it doesn’t matter whether you write HCl or ClH – the InChI is the same). And Google is quite good at finding these forms.
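To make the idea of a representation-independent search concrete, here is a minimal sketch in Python. It assumes the equation has already been parsed into a nested tuple; the tuple layout and the operator names (`eq`, `diff`, `mul`, `neg`) are invented for this illustration and are not OpenMath or MathML. It only renames symbols in order of first appearance, so it catches the dx/dt versus da/dt case but not algebraic rearrangements such as da/a + k dt = 0, which is where the really hard work lies.

```python
def canonical(expr, names=None):
    """Rename symbols to v0, v1, ... in order of first appearance.

    `expr` is either a symbol name (str) or a tuple (operator, arg1, arg2, ...).
    The tuple layout is a hypothetical stand-in for a real content-markup tree.
    """
    if names is None:
        names = {}
    if isinstance(expr, str):
        # Reuse the canonical name if this symbol has been seen before.
        return names.setdefault(expr, f"v{len(names)}")
    op, *args = expr
    return (op, *(canonical(arg, names) for arg in args))


# dx/dt = -k*x and da/dt = -k*a, written as (hypothetical) trees
eq_x = ("eq", ("diff", "x", "t"), ("mul", ("neg", "k"), "x"))
eq_a = ("eq", ("diff", "a", "t"), ("mul", ("neg", "k"), "a"))

same_form = canonical(eq_x) == canonical(eq_a)  # True: only the names differ
```

A search index could then store the canonical tree as its key, much as a chemical index might store the InChI rather than the formula as typed.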
Even more fundamental is the use of dictionaries – OM has the content dictionaries and CML has CMLDict/Golem. They aren’t identical but close enough that it’s easy to convert between them.  The dictionary concept is very powerful and allows languages to be extended almost indefinitely. It also allows different groups to develop their own systems – which may even be incompatible – you load in the appropriate dictionary. And the software is effectively written.
So there is now a strong bond between the MathML and CML community. They are starting to adopt the idea of blogging and social computing (chemistry has led the way here), while we shall adopt some of the formalities of OM in our representations of physical science.
We’re going to pursue the following (at least) and keep in touch through the blogosphere:

  • mixed mathematics and chemistry (see next post)
  • social computing, which could involve student projects, etc.
  • combining forces in the advocacy of markup languages in scholarly scientific publications and the communal dissemination of data.

So – to show this isn’t just talk, MichaelK and I are starting to see how a “simple” formula in physical chemistry can be represented. We’ll show you shortly.

Posted in chemistry, mkm2007, programming for scientists, XML | Leave a comment

Open Reading Mashup – would it work for chemistry?

Bill Hooker – a staunch supporter and campaigner for Open Data – has published his first mashup. He queries whether it actually is one – and I tend to agree – but the effect is to bring together different sources of information – Nature’s table of antibody suppliers and Google’s custom search. And whatever it is technically it’s an example of how linking information on the web can dramatically enhance our power.
FWIW my view of a mashup is informed by the classic ChicagoCrime where two entirely unconnected resources – the city’s crime reports, and Google maps – are linked by common, agreed, metadata – in this case the geographical coordinates of the crime. By accessing the crime data the coordinates are transmitted to Google maps and displayed; alternatively you can query the maps (including areas and routes) and access the crime database to find out what sort of crime happened in that geographical object.
The requirements are simple:

  • common, agreed metadata
  • public information
  • re-usable without restriction (e.g. CC-BY)

then anyone can create a mashup where they discover two sources with the same metadata.
So could it work for chemistry? Of course – we publish the information with common metadata (chemical identity, standardized dictionaries of properties, perhaps some human metadata – persons, institutions, etc.). The software overhead is very light – a few pages of JavaScript is probably enough – a nice web interface and so on.
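As a rough illustration of how light that overhead is, here is a sketch (in Python rather than JavaScript) of the core of a mashup: joining two independent sources on a shared identifier. The function name, the field names and the records are all hypothetical; the only thing the two sources have to agree on is the key, here an InChI string.

```python
from collections import defaultdict


def mashup(source_a, source_b, key):
    """Pair up records from two independent sources that share `key`."""
    index = defaultdict(list)
    for record in source_b:
        index[record[key]].append(record)
    # Merge every record in source_a with each matching record in source_b.
    return [
        {**a, **b}
        for a in source_a
        for b in index.get(a[key], [])
    ]


# Hypothetical records; the InChI strings are the shared metadata.
compounds = [
    {"inchi": "InChI=1S/ClH/h1H", "name": "hydrogen chloride"},
    {"inchi": "InChI=1S/H2O/h1H2", "name": "water"},
]
mentions = [
    {"inchi": "InChI=1S/ClH/h1H", "mentioned_in": "hypothetical-blog-post-1"},
]

linked = mashup(compounds, mentions, "inchi")
```

Neither source needs to know about the other; anyone who finds two collections with the same metadata can write the join.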
There’s one tiny flaw – where’s the public re-usable information? There should be lots – we scientists publish over 1 million compounds per year – that will make a lovely mashup. But then the publishers copyright it and through various techniques – cutting off subscriptions, sending in lawyers, hiring pitbulls – put endless obstacles in our way.
But there are growing chinks: Pubchem, ChEBI, the chemical blogosphere, openness of patents, CrystalEye and COD, Blue Obelisk, e-theses. The growing support from outside chemistry. Some of this I’ll publicize, some we’ll discuss in private e-sessions where we plot our next moves on how we can wrest our rightful data back. A little bit of FUD from our side might be useful 🙂

Posted in "virtual communities", blueobelisk, chemistry, open issues | 3 Comments

Blogging in science and mathematics

In a splendid post – reproduced in full (indented) – Kyle Finchsigmate highlights the difference between chemistry and other sciences.

WTF is up with the Science blogosphere?

15:11 30/06/2007, Kyle Finchsigmate, sciency politics, The Chem Blog
[Image: vismap.jpg]
I [i.e. KF] was recently interviewed by Nature on the state of blogs and anonymity and whatnot and the interviewer had an interesting question: Why is the population of chemistry blogs so high relative to other disciplines? There have been a number of posts on why chemistry is so absent in popular media – (IMHO, it lacks the “God element” present in biology/medicine and physics, especially astrophysics, which makes it less interesting to the masses. The complexities and fuzzy logic employed by astrophysicists requires nothing less than a religiosity to believe some of the odd shit they sling out.)
Anyway, it’s disheartening that there are very few blogs about physics and biology compared to chemistry since the future is integrated approaches.  If I could, I would seek out a lively biologist to team up with on the Chem Blog, but I know of none.  It would be awesome to have a readable blog about either field, since I consider both of them too far off topic to be approachable.
Anyway, my response to the Nature interviewer (the interview should be available via pod cast next month) was that the chemistry blog-o-sphere had a number of very strong voices and drew a lot of inspiration to a lot of people.  Particularly catalytic in that was Dylan Stiles, Paul Bracher and Paul Docherty.  When asked if I was a strong voice, I arrogantly replied affirmatively, but that’s just my style and I was the subject of the interview in any case, so I had to have some degree of impact.  I know that I have made no secret that I started blogging because of Dylan’s post on Otera’s Catalyst, which he employed in his recent Org. Let. publication.  (I did not find it via Bengu Sezen [a researcher involved in ongoing controversy about the validity of published results – PMR], though I did exploit her to jump start my blog via trackbacks to Dylan’s, which is a wee shameful, but it worked.  Besides, if Blogging really had any superstar it was her.  It is, after all, the news that makes the reporter, not the other way around [though, with blogging, that argument can be contested].)
The walls are too high really to make a plea to people in other fields to start blogs, since I don’t think physicists or biologists frequent this blog, but if they WERE here, I would ask them to consider it.

[PMR: Question to Kyle – is the diagram the chemical blogosphere?]

PMR: I had exactly this experience when I was invited to talk to the Mathematical Knowledge Management group last week. I asked if anyone blogged – not really. There are maths blogs but I get the impression that they are mainly aimed at school, problem solving, etc. No-one was blogging the meeting (other than me! – look for the mkm2007 tag on http://www.technorati.com). That’s a pity as the talks were excellent. As a crystallographer I was fascinated to hear Tom Hales’ work on proving Kepler’s conjecture (cubic/hexagonal close packing really are the densest ways of packing uniform spheres in 3D).
In chemistry the blogosphere gives up-to-the-minute reports – in a large meeting it’s not impossible that people transfer sessions when they read the blogs (except that the ACS normally refuses to provide wireless even to speakers and you have to buy your own). I won’t speculate on why this is so, but I certainly felt more confident of starting a blog because Tenderbutton had shown the way. (It’s well known that his supervisor disapproved – “went ballistic” is what I heard from one senior chemist). Note, however, that in synthetic chemistry there are long periods of watching reactions “bubble away” and blogging can be a near-zero-cost multitasking activity.
There are many motivations for blogging – when I talk to young scientists I point out that several chemists are now on the first steps to science journalism having been scooped up by science publishers.  For myself the blog has several novel advantages.

  • I can post ideas in progress. That’s anathema in chemistry, though I see signs we are changing it.
  • It summarises my current position – especially where peer-reviewed publication may not be the best way. Difficult to publish technology in science journals.
  • It is a platform for advocacy
  • It reaches out to other disciplines (and I’ll say more about maths in later posts)
  • It acts as a record of my talks. In general I blog about what I am going to say at a meeting. This alerts people to the issues, and may also be a fallback if my machine breaks down. At WWW2007 I posted the summary of my ideas and many people in the audience had already read them or were following them as I went through the talk. This is especially useful where (as then) I only had 5-10 minutes – you can give details that you don’t have time to say physically. And, since my talks are stochastic, it reminds me if there is anything I have forgotten as I come to the end of the talk.

So I think my talk has catalysed at least a subset of the maths community to think about blogging. Michael Kohlhase‘s blog is an example and I’ll be talking more later about the collaboration we have set up between MathML/OpenMath and CML – this might be exciting news for science publishers and reporters. So perhaps one of the most important aspects of blogging – for me – is:

  • A way of reaching beyond the boundaries of my own domain. It’s obviously an effective approach in Open Data, etc. as I have had several people in the LIS (library-information sci) community tell me they were glad I had restarted my blog. I think that Michael and I will make it work for chemistry and maths. He is intimately connected to the area of mathematical knowledge while I connect to the Blue Obelisk of chemical open source. Thus if we say “who would like to help with the management of geometrical algorithms in the BO repository?” it’s quite possible we’ll get someone from the maths community being interested. And when – as we hope – MathML and CML start to really interoperate we will have the basis of some of the formal knowledge architecture of the immediate future.

That couldn’t happen without the blogosphere. Blogging is an integral part of modern scientific knowledge. And the more enlightened scientific publishers know it. Unfortunately very few senior chemists do.

Posted in "virtual communities", blueobelisk, mkm2007, open issues, www2007 | 3 Comments

Collaborative Organic Synthesis (a subversive proposal)

Every month we get several new chemistry blogs – I don’t have time to do more than glance at them but I was struck by a newcomer, TotallyRetrosynthetic. (TotallyFoo is a metasyntactic linguistic style sparked off by TotallySynthetic.) Retrosynthesis is the process of working out how to make a chemical compound by starting at the target and working backwards – if you want to climb a mountain start at the top and descend and then retrace your steps. Of course for chemistry this is done on paper. (It ought to be done in silico as well, but organic chemists fear they will lose virility by using computers to help them).
TR suggests a subversive proposal – that chemists should collaborate and distribute the work:

Join the project to help the cause

I would like to extend the suggestion made to European Chemist to become a member of this project (in the comments section of Daphnicyclidins post here) to others as well who are willing to contribute to the cause of this project. It need not be Daphnicyclidins. It could be your dream proposal that you had come up at some point of your career which never saw the light. All you have to do is join this project as a member, create a new page where you explain briefly about the importance of your “myresearchproposal” and post it there. I will make a post on it on this blog. Peers would review it, comments would flow, and the idea would refine and evolve further.
Then person ‘A’ from Australia could try the step 1 of your proposal since he has the required expertise and the materials, and post his experimental results here. Then ‘B’ from Brazil can pitch in and evaluate the feasibility of key transform since he has the closely related model compounds in his lab, and post his results here. Then person ‘C’ from China could carry some computational studies for the observations of the ‘B’ or the proposed out come of step 10 and list his results and inferences. Biochemist ‘D’ can collaborate for molecules that are made and screen them against his targets. Some funding agency ‘X’ could sponsor projects that are worth of pursuing. It could be any proposal, as long as it is yours):
The flow of the materials, ideas, expertise and resources in the fashion described above would render your dream project become reality that you never thought of, be it because of your limited resources or lack of opportunities. I agree that it might take time and efforts, but isn’t it what anyhow some project would take even in a traditional setup; and never see the light at all, at times, because it met a dead end at some final stages since it was done in the closed doors, and never tapped the expertise of other scientists out there and thus wasted the taxpayers’ money. Imagine, the same project being carried out else where in the world or in the immediate next door with the taxpayers’ money again, because it was not done in an open fashion. I could be exaggerating a bit, but I want you to give a serious thought to it.
I am sure that the regular members of Chemical Blogosphere know the ‘potential’ that is referred to. Let that proposal of yours, and your scientific talent be not wasted!
Alternatively, some PI could come forward to try your proposal with his resources in a traditional fashion. He is welcome to do so upon mutual agreements.
The objectives and advantages could go on and on ……….
So, I welcome you to join me if you are willing to become the part of this project and take it forward. You never know your ideas might add something that I did not think of as far as the project is concerned. So pull out that proposal you drafted that has been sitting under the heap of papers, and refine it a bit with your added expertise, and post it to MyResearchProposals. I would also suggest those of you who wants to be A, B, C, D, and X also join the team, and you are important here. If you just want to be a mere knowledgeable peer you are most welcome to be a member so that you can review the things and leave your impact.
Click here to see the file to check what is it that we have been talking about.
I would suggest trusting in the scientific attitude of the scientists . After all, we are talking about the progress of science. As you all know, this project is still in its incipient stage – things will be defined, and actions will be planned as we progress.
(Caution: You be the judge for your proposal or idea, and decide if you want to be part of the things here, and act accordingly.)
Cheers!
Shiva

This is a wonderful vision. It is, of course, what we try to arrange in Open Source where different software modules are offered and different people agree to accept them. It’s hard, often fails, but works very well in many cases.
Could it work in chemistry? Yes, if chemistry is seen as a collaborative science where there is a common goal for the benefit of humankind. Unfortunately we have a little way to go. Currently synthetic organic chemistry is often a competitive sport rather than a distributed science. The goal is to make things that are more impressive than your competitors, rather than make things that are useful in themselves. It’s rather like the plumage wars that male birds engage in. And graduate students are often seen as wage slaves or cannon fodder. A regular reading of the chemical blogosphere reveals that the non-Open process results in over-hyped yields (i.e. the reported success of experiments), badly presented supporting data, etc. While these are relatively infrequent (I hope), those in the blogosphere who actually do the work are sufficiently concerned that it is a common topic. By way of contrast who ever heard of an Open Source programmer who manufactured code?
I’m not suggesting that chemistry should go Open and start collaborating. It won’t happen. But why don’t we pursue the idea of Open Chemical Synthesis directed against real targets of benefit to humankind? The idea of international collaboration should be feasible – many years ago the EC (I think) funded the sequencing of the yeast genome by farming out the chromosomes among 17 nations. This could be done in Open Medicinal Chemistry. So I hope TR gets some critical mass of interest and maybe finds a funder who wants to do science rather than sport.

Posted in "virtual communities", chemistry, open issues | Leave a comment

Stochastic hyperslide at MKM2007

I have just given my presentation at Mathematical Knowledge Management 2007 for which I wrote an abstract about 2-3 months ago: Mathematics and scientific markup. I knew that in the intervening time I would find something new to get excited about – and this has happened – I have added the excitement of the lc-semanticweb. Of course the technology and community have developed since then.
As many of you know I rail against Powerpoint as a prime destroyer of semantic content. Powerpoint also constrains the presenter to a linear mode – yes you can skip a few slides and maybe even hide them, but it’s not easy to flip about. And it’s a poor launch platform for interactive demos.
I’ve done my slides in XHTML+SVG, believing this is the right way to remain true to my campaign for XML. (I’ll do Powerpoint when it’s necessary for business purposes – e.g. to integrate with colleagues, but that’s about it). This worked for a bit but soon hit problems of scale. I started addressing that with XSLT to add menus to the presentation. In fact I started with the wrong technology (for some bizarre reason I chose it to be Windows specific) and have now simply changed to XHTML.
I have over 12000 XHTML slides. (before you get the wrong idea,  many of these are scraped – so 3000+ from one example of OSCAR3). But nonetheless there are very many. I want to be able to reassemble them for each talk, and I want the technology to be as simple as possible – ideally none. (The efforts I have used in the past have all been broken by browser “upgrades” – a synonym for disasters).
Some ideas are:

  • use a database and craft metadata for each slide
  • use something like Spotlight or local Google

but these don’t assemble the talk. So at present I have about 100 directories (maybe with trivial subdirectories) and 5-20 slides per directory. I make the talk by selecting directories which may have some general bearing on the talk – perhaps 20-30. Admittedly it takes memory to work out what is likely to be in each folder but I have to work hard at a talk and the time is well spent. I then asterisk those directories which I HAVE to present (i.e. if I get to 5 mins before the end and haven’t mentioned them, I break off and visit them). I prepare demos (such as BIOCLIPSE, OSCAR1, GoogleInChI, Blue Obelisk GreaseMonkey) and visits to the WWW (when the organizers have provided it – e.g. the ACS hardly ever does even when I ask in advance – it makes little sense to have sessions about the Web when you can’t get there).
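For what it’s worth, the directory-and-asterisk scheme can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not what I actually run: the function names are made up, and it assumes the slides are files ending in `.html` sitting directly inside each chosen directory.

```python
from pathlib import Path


def assemble_talk(root, selected, must_show=()):
    """List the slides for the chosen directories, flagging the mandatory ones.

    Returns a list of (directory name, sorted slide paths, is_mandatory).
    Assumes slides are *.html files directly inside each directory.
    """
    must_show = set(must_show)
    talk = []
    for name in selected:
        slides = sorted(Path(root, name).glob("*.html"))
        talk.append((name, slides, name in must_show))
    return talk


def still_to_show(talk, visited):
    """Mandatory ("asterisked") directories that have not been visited yet."""
    return [name for name, _, mandatory in talk if mandatory and name not in visited]
```

Calling `still_to_show` five minutes before the end gives the list of directories to break off and visit. Appending each directory name to `visited` as it is opened would also leave a crude record of what was shown and in which order.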
So I prepared this for today’s talk to the MKM. A very nice audience to present to as they understand all about semantic content, namespaces, XML, dictionaries – so none of that has to be explained. I said my hyperslide would be stochastic – I didn’t know what slides I would present and in what order. The demos might break.
They did. BIOCLIPSE hung on Jmol rotation (although I got to demo Jmol later). However I am sure the audience appreciated the value – we’d seen Eclipse being used for theorem proving, etc. GreaseMonkey worked yesterday, but failed today. Now I have reinstalled it and it works great. GoogleInChI failed (is the Google API finally broken?). But OSCAR1 and OSCAR3 worked – and the links out to Pubchem and the chemical blogosphere. And the polymer builder, although I didn’t have time to explain exactly how it was a symbol manipulator. And I certainly covered less than half of what I might have said. But at least the hyperslide approach means I never overrun – as you can stop when you need to.
There are downsides. It’s difficult to keep a record (that’s why videos are useful). And Powerpoint does have the merit of acting as a document container. I’ve tried both S3 and Slidy but neither helps you to assemble talks.
The only complete way to make slides available is to put them under SVN on WWMM. I can copy the directories to a pen drive. But none of this is a record of which slides were visited in which order and what was said.
I’d be interested in whether anyone else is mad enough to create new ways of managing their slides? And whether they have any ideas. At present I’m almost motivated to try Javascript, but the last time I did that – 5 years ago – everything broke within a year.

Posted in general, mkm2007 | 2 Comments

A snapshot of the chemical blogosphere

I want to show the mathematicians the vibrancy and value of the chemical blogosphere so – at random – I picked today’s TotallySynthetic. By chance it’s very fitting as it is a review of a paper by one of the blogospheric heroes – Tenderbutton (== Dylan Stiles). Dylan wrote a fascinating and idiosyncratic blog for over a year until his supervisor and pressure of work combined to stop him – fittingly he has a regular blogging column for a chemical publisher. Here’s a flavour of the current post which I’ll show tomorrow:
== Excerpt (with some deletions) from TotallySynthetic Blog ==

Spirotryprostatin B

27 June 2007
[Image: spirotryprostatin.jpg]
Trost, and Stiles. Org. Lett., 2007, ASAP. DOI: 10.1021/ol070971k.
He’s done it :) . Those of us who read Tenderbutton from the start will have known of Dylan’s work on this tasty little number, and he’s done it proud. Eight steps to the natural product; we’ll start with step one (or rather, the first non-literature step):
[Image: spirotryprostatin_1.jpg]
We’ve looked at Otera’s catalyst before, but as a quick reminder, it’s a funky trans-esterification catalyst. It’s been known for a little while, and the mechanism is in this JOC article. Nice to see it being used; what was it like to handle, Stiles?
[… SNIP CHEMICAL DETAILS …]
Good job, old chap!

10 Responses to “Spirotryprostatin B”

  1. milkshake Says:
    June 28th, 2007 at 1:56 A lovely detail is that a simple oxindole was used as a starting material…
  2. the dude Says:
    June 28th, 2007 at 2:32 Could someone (maybe even Dylan) comment … Why can’t you use any of the plethora of other esterification catalysts? Thanks.
  3. provocateur Says:
    June 28th, 2007 at 2:32 just a doubt…how is otera’s catalyst superior to the usual transesterification catalyst , titanium isopropoxide?
  4. Spiro Says:
    June 28th, 2007 at 2:37 #1 Absolutely. This guy, Trost, is plagiarizing Baran!
  5. Spiro Says:
    June 28th, 2007 at 3:31 #2 This is the one-million-dollar question! Provocateur’s comment about the “usual transesterification catalyst” makes me laugh. There are probably as many usual transesterification catalysts as there are Synth. Commun., Tet. Lett., Chem. Commun. or Org. Lett. articles titled “XYZ, a superactive catalyst for (trans)esterification of QWERTY-acids with YTREWQ-alcohols under mild conditions”.
    […]
    PPS: I have read Otero’s book on esterification : http://www.wiley.com/WileyCDA/WileyTitle/productCd-3527304908.html
    This is the poorest book I read from Wiley. After reading this book and many more articles on the topic I came to the conclusion that this field is a hoax.
  6. kiwi Says:
    June 28th, 2007 at 10:11 #2, #3 – Oteras catalyst is a beaut, nice sparkling white crystals, stable as a rock. a shaved lab rat can make it on a scale of tens of grams…

=========== end of excerpt ==========

Posted in "virtual communities", chemistry | 2 Comments

Synergy between "MKM"-math and "CML"

In this post I am using the context of the Mathematical Knowledge Management 2007 conference to try to construct similarities in the MathML and CML communities and their thought processes. I’ll show this list in my presentation tomorrow – I may not cover all the points. Similarities:

  • both are evolving languages – MathML is not “finished” and won’t be for some time.
  • smallish part of the main community, often struggling to get the message across.
  • absolute belief that computer mechanization is essential and beneficial.
  • belief this will happen, but timescales are unclear
  • need for a formal, somewhat arbitrary, selection of core components. e.g. geometry has 42 formal concepts; CML has 100 elements. Often pragmatic.
  • extension is through dictionaries (OM has ContentDictionaries, CDs; CML has convention-based dictionaries (CML/Golem))
  • systems are still evolving and will continue indefinitely. Need versioning and flags.
  • Correctness checking (e.g. of publications) is important, but undervalued by the mainstream communities.
  • Fragmented support for development. Have to concentrate limited resources to aim in communal direction.
  • Would revolutionise publication process but difficult to get mainstream involvement including learned Socs.
  • Problems with browsers – how do we transmit rich content? And with every browser release things break.
  • Legacy – can we extract useful info from Bitmaps? PDF? LaTeX/ChemDraw?
  • Commercial organizations are often apathetic or even antagonistic
  • Virtual communities can flourish. Openness is critical
  • Both have a strong bottom-up approach. MathML has enough pragmatists to make it work in practice.
  • both are needed by other disciplines – sometimes the support comes from them rather than the mainstream.

Some differences:

  • Maths is relatively poor – although there are commercial companies providing tools and services.
  • Maths sometimes produces proof of concept without later sustainability.
  • Maths believes in standards. Chemistry doesn’t (yet).

Possible joint activities:

  • workshop on combined infrastructure of MLs, tools, examples
  • joint projects (inc. students presented with “toy” systems)
  • synergy in developing tools (esp. client side)
  • blogs etc. aimed at joint activities.
  • minimal level of infrastructure for OM CML interoperability. Converters? libraries? stylesheets?
  • rich client

I’ve gone over these with Michael Kohlhase and we roughly agree on them.

Posted in chemistry, mkm2007, programming for scientists | Leave a comment

The Handbook of Integer Sequences

I am at the Mathematical Knowledge Management 2007 and having a wonderful time. At present Neil Sloane is talking about his marvellous On-Line Encyclopedia of Integer Sequences, a collection of every known (and voluntarily communicated) sequence, e.g. what is the next term in:
1,2,3,4,5
1,2,4,8…
and over 100,000 more

  • 1964 started
  • 1973 2500 seqs
  • 1995 1 m**3 of mail
  • 1995 5500 seqs
  • 1996 10000
  • 2007 131000

A large volunteer community – with a virtual party for the 100,000th sequence. One has set the sequences to music (try “Listen” – this is sometimes great fun).
2000000 lines of flat file, 120M edited with emacs – total 450 Mbytes including all info. 10K seqs/year, 30 comments/day, 600 emails/day
shortest seq 76337 (1 term)
longest a27 500,000 terms (natural numbers) – this raised a laugh. The point of including it is that you can plot one sequence against another.
Used to find out everything about a sequence. Difficulty between conjectures and proofs.
Sequence fans mailing list. But the database is (rightly) restricted to a set of trusted editors. And ultimately almost all is done by Neil.
200 examples of sequence pairs which have identical content but are different sequences.
I find it wonderful that one can search Google with a sequence and it will hit Neil’s site (if it’s in the database). It’s one of the few digital objects where the content acts as an index in Google. InChI is another.
It’s fascinating. Every sequence that comes in is transformed by over 100 transforms to see what its structure might be. It’s a real living ecology of digital objects. Neil has shown us how datamining the database – comparing sequences with each other – has resulted in new mathematical theorems. Wouldn’t it be wonderful if chemical information was available for datamining and not owned by commercial interests?
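To give a flavour of what “transform and compare” means, here is a toy sketch in Python. It is not Neil’s software: the two transforms (first differences and partial sums) are just familiar examples I have picked, and the tiny table of known sequences with its labels is made up for the illustration.

```python
from itertools import accumulate


def first_differences(seq):
    """Differences between consecutive terms."""
    return [b - a for a, b in zip(seq, seq[1:])]


def partial_sums(seq):
    """Running totals of the terms."""
    return list(accumulate(seq))


TRANSFORMS = {"first differences": first_differences, "partial sums": partial_sums}


def related(seq, known):
    """Yield (transform, known name) where a transform of `seq` starts a known sequence."""
    for t_name, transform in TRANSFORMS.items():
        image = transform(seq)
        for k_name, terms in known.items():
            if image and terms[:len(image)] == image:
                yield t_name, k_name


# A tiny, illustrative stand-in for the database
known = {
    "powers of two": [1, 2, 4, 8, 16, 32],
    "natural numbers": [1, 2, 3, 4, 5, 6],
}
hits = list(related([1, 2, 4, 8], known))
```

Here the first differences of 1, 2, 4, 8 turn out to be the start of the powers of two again; with a hundred transforms and a hundred thousand sequences, unexpected matches of this kind are what suggest conjectures.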

Posted in mkm2007 | Leave a comment

OpenMath/MathML, CML and communities of practice

James Davenport – one of the originators of OpenMath – is presenting the current status. (OpenMath and MathML are converging and although they are distinct I shall conflate them here).
What is the formal semantics? OM has 4 flags:

  • official
  • experimental
  • private
  • obsolete

Normally only the official and obsolete are visible on the official http://www.openmath.org site.
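The flag scheme is simple enough to sketch. The following Python is my own illustration, not OpenMath software: the four status names come from the talk, but the dictionary names in the example catalogue are invented.

```python
from enum import Enum


class Status(Enum):
    OFFICIAL = "official"
    EXPERIMENTAL = "experimental"
    PRIVATE = "private"
    OBSOLETE = "obsolete"


# Only these statuses are normally shown on the public site.
PUBLIC = {Status.OFFICIAL, Status.OBSOLETE}


def visible(dictionaries):
    """Names of the dictionaries whose status makes them publicly visible."""
    return sorted(name for name, status in dictionaries.items() if status in PUBLIC)


# Hypothetical catalogue; the dictionary names are invented.
catalogue = {
    "cd_alpha": Status.OFFICIAL,
    "cd_beta": Status.EXPERIMENTAL,
    "cd_gamma": Status.OBSOLETE,
    "cd_delta": Status.PRIVATE,
}
shown = visible(catalogue)
```

Keeping obsolete entries visible matters: old documents that refer to them can still be interpreted.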
What if we have conflict? An example is systems (like Euclid) which start from 1 whereas OpenMath may start from 0. (I’m on weak ground here…). MathML has ContentDictionaries (CD) for representing each approach – but we need a mechanism for conversion.
This is very similar to the CML idea of convention attribute.
For example the British and the French have different symbolisms for integrals. They aren’t just linguistic. In CD sqrt(-1) is i, in engineering it is j.
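As a rough illustration of what a conversion mechanism between conventions might look like, here is a small Python sketch. It reflects neither the OpenMath CD machinery nor CML’s convention attribute: the Convention class, the two example conventions and the pairing of an index origin with an imaginary-unit symbol are all hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Convention:
    """A named bundle of notational choices (hypothetical structure)."""
    name: str
    imaginary_unit: str  # symbol written for sqrt(-1)
    index_origin: int    # number given to the first element


# Illustrative conventions only; the pairing of fields is an assumption.
CD_STYLE = Convention("cd-style", imaginary_unit="i", index_origin=0)
ENGINEERING_STYLE = Convention("engineering-style", imaginary_unit="j", index_origin=1)


def convert_index(index: int, source: Convention, target: Convention) -> int:
    """Re-express an index counted under `source` in terms of `target`."""
    return index - source.index_origin + target.index_origin


def render_complex(value: complex, convention: Convention) -> str:
    """Write a complex number using the convention's symbol for sqrt(-1)."""
    sign = "+" if value.imag >= 0 else "-"
    return f"{value.real:g} {sign} {abs(value.imag):g}{convention.imaginary_unit}"


if __name__ == "__main__":
    z = complex(3, -2)
    print(render_complex(z, CD_STYLE))
    print(render_complex(z, ENGINEERING_STYLE))
    print(convert_index(0, CD_STYLE, ENGINEERING_STYLE))
```

The point of the design is that the value itself never changes; only the declared convention decides how it is numbered or written, so a conflict shows up as an explicit conversion step and is never silently ignored.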
Do we skate over inconsistencies or refuse to emit any results where there are conflicts?
“A theorem prover which emits wrong results is probably worse than useless.” [A chemistry system which emits wrong compounds? Of course. (But most of the current ones do.)]
Is notation independent of semantics? Apparently not. And it’s a harder problem than it looks to convert. How do we speak:
T(n) = O(n^2)
Apparently no mathematician would speak “equals” – they would say “is in”. JD suggested a “notation” approach (I see this as similar to CML’s convention). So our symbology is much deeper than we usually realise.
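For readers outside mathematics, the “is in” reading can be spelled out. This is the standard textbook definition, not something taken from JD’s slides, and the constants c and n_0 are the usual ones from that definition:

```latex
T(n) \in O(n^2)
\quad\Longleftrightarrow\quad
\exists\, c > 0,\ \exists\, n_0 \in \mathbb{N} \ \text{such that}\ 
|T(n)| \le c\, n^2 \ \text{for all}\ n \ge n_0 .
```

Here O(n^2) is a set of functions, so the written “=” is really membership and is not symmetric: nobody would write O(n^2) = T(n).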
So OM defines a smallish set of symbols which are fundamental. These are hardcoded. (In chemistry these might be element symbols, coordinates, etc.) Then there are those that OM uses widely but are not fundamental, and then those that the community develops – they have to be supported by OM, but are not part of it.
I’ll stop here – I hope you get the idea that OM is struggling to formalize those things which are absolute, those which are universal but not fundamental, and those which are widely used. If those words (which are mine not James’) conflict, that’s probably accurate.
He finishes:
“OM is useful, but being semantic, let’s use it carefully”. That goes for CML as well.
My message is that MLs are community processes, depend heavily on interpreting implicit semantics, and in doing so will help to formalize our understanding.
Maths has a community process. CML is much less advanced, but uses the Blue Obelisk as a clearing house. But I take from OM and communities like FOAF that CML should have flags on its components such as:

  • accepted
  • experimental
  • convention (e.g. JSpecView)
  • obsolete
Posted in chemistry, mkm2007

Towards Mechanized Mathematical Assistants (and the Scientist's Amanuensis)

I’m now at MKM2007 whose book[1] has the splendid title:
Towards Mechanized Mathematical Assistants
the vision that computers can work together with humans to enhance the understanding of both parties. This vision has much in common with our own SciBorg project where we use the idea of a “scientist’s amanuensis” for a computer system that can help humans understand chemistry in machine-accessible form.
(see also An Architecture for Language Processing for Scientific Texts)
As with many of my talks I’m blogging some of the talks – and more importantly the encounters – during this meeting. I haven’t decided what I’m going to say tomorrow and these notes may form part of it.
I’m first of all searching for similarities in the thought processes between the MKM community and our own chemical knowledge community (i.e. those people in chemistry who have realised we are in the Century of information and knowledge – a small percentage). That’s also true in mathematics – the MKM community – which overlaps with symbolic computation and computer algebra – is not mainstream but, unlike chemistry, has achieved critical mass (i.e. there are regular meetings, serial publications, etc. – that’s a pointer for ourselves – we need to do these things as well). So here are some top-level isomorphisms:

  • need to check the correctness of submitted manuscripts. Maths has many places where the author states “it’s obvious…”, or lemma 12.23.34 appears without an actual derivation. In chemistry and other sciences we have suspect or missing data.
  • it’s not “proper maths/chemistry”. Markup is trivial/non-essential/notRealResearch, etc. You have to prove theorems or make compounds to be a proper researcher.
  • The Open Source systems are no good because the commercial systems (Mathematica, Maple, ChemDraw, OpenEye) are better. If it doesn’t cost money it can’t be any good.
  • Machines can never have the insight that humans have.
  • The publishers/senior editors know what is good for the community. We’ve done it on printed paper for 200+ years. It works, why change it? (It’s also a lot of trouble.)

The first lecture (Paule) showed that machines can now prove some quite complex theorems. Certainly miles beyond me. Things like Wallis’ formula for pi and Ramanujan’s first identity. And, if we are going to use symbolic algebra systems, we need other systems to prove that the CAS are correct.
[1] Springer, LNAI 4573, eds: Kauers, Kerber, Miner, Windsteiger. Not, of course, Open Access – you have to buy a physical book.

Posted in mkm2007