Chem4Word – aspects of Openness

I’ll now reply to the second part of Rich’s question.

Rich Apodaca says:

Metallocenes? Axial Chirality? Apache/MIT/BSD License? OpenOffice? GitHub?

My current understanding is that C4W will be posted on CodePlex when we believe it’s in a reasonable state for community work. Good Open Source projects need clear release management and we haven’t yet addressed that, but we shall. I’m not very familiar with CodePlex but it tells me:

CodePlex is Microsoft’s open source project hosting web site. Start a new project, join an existing one, or download software created by the community.

and

Microsoft is hosting the CodePlex site solely as a web storage site as a service to the developer community

From WP I find:

CodePlex is an open source project hosting website from Microsoft. It allows shared development of open source software. Its features include wiki pages, source control based on Team Foundation Server but accessible using Subversion, discussion forums, issue tracking, project tagging, RSS support, statistics, and releases. Some of the available licenses are more restrictive than traditional open source licenses[1].

Choosing an Open Source licence is not trivial, and we haven’t yet chosen one. We’d welcome informed comment. The site offers, inter alia, Apache 2.0 and MsPL. JUMBO is Artistic because it allowed me to require forkers to rename and acknowledge. It’s compatible with MS Open philosophy. Much of the Blue Obelisk is LGPL which is probably compatible. Open Babel is GPL, probably incompatible with Microsoft licences.

Open Office? This probably relates to interoperability between the two systems. There are several aspects; some interoperate and some don’t. I’m visiting Peter Sefton in Toowoomba this month and we’ll probably find out what works and what doesn’t. Peter has managed to get CML molecules into ICE/ODT and interoperating with Open Office, but I don’t know the details. But as a rough guide I’d say:

  • The C4W code is C# so only works in an environment that supports that. There are Open C# implementations, but I have no idea what they are like. I doubt they support the UI and graphics.
  • WPF and XAML. Probably not interoperable. But I know of no high-quality graphics/UI system that is truly interoperable.
  • customXML (where the CML is). I don’t know details, and I suspect they are somewhat hairy, but the CML itself is unaffected.
  • OOXML and ODT. I suspect these are interoperable enough for most chemistry purposes.

GitHub. WP says:

GitHub is a web-based hosting service for projects that use the Git revision control system. It is written in Ruby on Rails by Logical Awesome developers Chris Wanstrath, PJ Hyett, and Tom Preston-Werner. GitHub offers both commercial plans and free accounts for open source projects. It is similar to Bitbucket which uses Mercurial.

I’m not sure how this is relevant. Currently we mainly use Subversion on Sourceforge, but Nico has mounted his ontology on BitBucket and I am sure he’ll tell us about it. We haven’t decided on the developer strategy yet but I would expect a single point of contact will work initially. I haven’t found MS’s TFS very cuddly yet and I prefer Subversion, but I am sure all these systems will develop.

Posted in Uncategorized | 3 Comments

Chem4Word + CML representational power

Rich Apodaca is an original member of the Blue Obelisk and has developed his own chemical authoring tool (ChemWriter 299USD). He’s just posted the rather enigmatic comment…

Rich Apodaca says:

Metallocenes? Axial Chirality? Apache/MIT/BSD License? OpenOffice? GitHub?

I’m taking this to mean that he’s asking slightly tongue-in-cheek (a) about the power of C4W’s and CML’s representation of chemistry and (b) the openness and community aspects of C4W. It’s actually an excellent opportunity to follow those up. Here’s a recent post…

Language for Chemical Representation Part 2: Real-World Problems

Posted by Rich Apodaca 8 days ago

The last installment in this series discussed the limitations in today’s molecular languages and how FlexMol is designed to overcome them. Although these limitations are clearly present theoretically, what’s the practical effect likely to be?
For the last two years, a series of articles highlighting specific examples from the current chemical literature have appeared here. Variously titled “How would your cheminformatics tool do this?”, “Can your cheminformatics tool to this?”, and “Cheminformatics Puzzler”, each entry featured an article from a mainstream chemistry journal in which SMILES, Molfile, CML, and/or InChI would be incapable of faithfully representing a centerpiece structure. The examples are taken from well-read journals in synthetic organic, natural products, and medicinal chemistry.
The purpose was not to bash these languages, but rather point to an important common set of limitations among them – a kind of groupthink if you will.

and earlier

The fundamental problem with ‘standard’ molecular languages such as molfile, SMILES, InChI, and CML is their simplification of bonding and stereochemistry. Bonding is defined as an association between two atoms using two electrons. Stereochemistry is defined in terms of one or more chiral templates.

FlexMol takes a different approach. Bonding is defined in terms of systems of one or more pairs of atoms interacting with the cooperation of zero or more electrons. Stereochemistry is defined in terms planes passing through atom pair axes.
As we shall see, this flexible system enables the faithful, lossless representation of almost any chemical substance consisting of a single, well-defined molecular entity.

There's a fundamental misunderstanding here about the role of CML (which is anything but group-think). CML addresses the semantics of chemistry. I could reply – in the same lighthearted vein:

Zeolites? Clathrates? Block co-Polymers? HEPES buffer? transition states? gaussian logfile output? cell dimensions? multiplets? eigenvectors?
and assert that CML can deal with all of them and ChemWriter cannot.

In fact CML can deal with any of the examples that Rich has mentioned in his article because it is (a) extensible (b) namespaced and (c) linked to ontologies. CML can add any properties to any of its primitives (atoms, bonds, etc.) It can define multicenter bonds and bonds between bonds. It has primitives for lines, planes, etc which should be sufficient for representing any of the geometry mentioned. JUMBO can do geometry and algebra on these if required. It has a primitive for electron. It also can hold 2D and 3D coordinates for atoms, so that it can represent the drawings of any of the species in Rich's diagrams.
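To make "extensible and namespaced" concrete, here is a minimal sketch in Python (not C4W code, which is C#). It builds a small CML molecule and hangs a foreign-namespace annotation on a bond. The CML element and attribute names (molecule, atomArray, atom, bondArray, bond, atomRefs2, hydrogenCount) are the usual ones; the extension namespace, the axisPlane element and its angle attribute are invented purely for illustration and are not part of any published convention.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
# Hypothetical namespace standing in for any third-party extension.
EXT_NS = "http://example.org/hypothetical-extension"

ET.register_namespace("cml", CML_NS)
ET.register_namespace("ext", EXT_NS)


def q(namespace, tag):
    """Return the '{namespace}tag' form that ElementTree expects."""
    return f"{{{namespace}}}{tag}"


def build_annotated_molecule():
    """Build ethane in CML and attach a made-up annotation to its bond."""
    mol = ET.Element(q(CML_NS, "molecule"), {"id": "m1"})
    atoms = ET.SubElement(mol, q(CML_NS, "atomArray"))
    ET.SubElement(atoms, q(CML_NS, "atom"),
                  {"id": "a1", "elementType": "C", "hydrogenCount": "3"})
    ET.SubElement(atoms, q(CML_NS, "atom"),
                  {"id": "a2", "elementType": "C", "hydrogenCount": "3"})
    bonds = ET.SubElement(mol, q(CML_NS, "bondArray"))
    bond = ET.SubElement(bonds, q(CML_NS, "bond"),
                         {"id": "b1", "atomRefs2": "a1 a2", "order": "1"})
    # The extension lives in its own namespace, so it cannot clash with CML.
    ET.SubElement(bond, q(EXT_NS, "axisPlane"), {"angle": "90"})
    return mol


molecule = build_annotated_molecule()
xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)
```

Whether anyone else can do something useful with the annotation is a separate question, and depends on agreeing what the extension means.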

That some of these CML primitives are not used in practice is because to be useful there needs to be agreement between two or more people. If Rich wishes people to use FlexMol then either they all have to use his software or other vendors have to install FlexMol readers and writers. If he can show me a groundswell of users of FlexMol and if it appears useful for them to convert to CML then I'd be happy to give some pointers.

What would emerge is a set of primitives and ontology terms that was FlexMol-specific - in CML we call this a convention. There are already several conventions in CML - a typical example is a JSpecView convention for spectra. This requires that a spectrum contains data (in CML generally it's perfectly reasonable to have an empty spectrum) so that JSpecView can display it. Another convention is CML-lite - a subset of primitives which are processable by default by C4W.

But because CML is semantic and because it uses ontologies it can hold a very wide range of chemistry. If a processor does not understand some of it, then it simply passes it through without loss. Whether this changes the semantics can be decided by the ontology and although that's at an early stage the basic infrastructure works.
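The pass-through behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular CML processor is written: the ext namespace and the unknownFeature element are hypothetical, and the "processing" is just a shift of 2D x coordinates. The processor touches only the atoms it understands and serialises everything else unchanged.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)
# Hypothetical extension namespace; registered only to keep its prefix tidy.
ET.register_namespace("ext", "http://example.org/hypothetical-extension")

SOURCE = """<cml:molecule xmlns:cml="http://www.xml-cml.org/schema"
    xmlns:ext="http://example.org/hypothetical-extension" id="m1">
  <cml:atomArray>
    <cml:atom id="a1" elementType="C" x2="0.0" y2="0.0"/>
    <cml:atom id="a2" elementType="O" x2="1.5" y2="0.0"/>
  </cml:atomArray>
  <ext:unknownFeature kind="not-understood">keep me</ext:unknownFeature>
</cml:molecule>"""


def shift_x(xml_text, dx):
    """Shift every atom's x2 by dx and leave all other content untouched."""
    root = ET.fromstring(xml_text)
    for atom in root.iter(f"{{{CML_NS}}}atom"):
        atom.set("x2", str(float(atom.get("x2")) + dx))
    return ET.tostring(root, encoding="unicode")


result = shift_x(SOURCE, 1.0)
print(result)
```

The element the processor does not understand comes out the other side with its attributes and text intact; deciding whether the shift has changed its meaning is the ontology's job, not the processor's.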

I appreciate that for generations raised on FORTRAN-like formats (Mol) and implicit information (SMILES) that it will take time to migrate to XML-based ontology-driven chemistry. But it's the only way forward that can cover mainstream chemistry whether it be molecules, reactions, crystallography, nanotechnology, computation, spectra, physical properties and their measurement.

Because chemistry is a lot more than organic molecules drawn pictorially...

Posted in Uncategorized | 5 Comments

A ramble through alternative chemistry

For many years I used to have a small patch of weeping scalp, but when I came to Cambridge I went to Ray’s barber shop in All Saints’ Passage – he was about 75 and got his politics from the Daily Mail but was otherwise an entertaining talker (you didn’t try to stop him). He told me that the patch was because I was an academic and I used my left brain more so I have this patch on my right scalp. I kept listening. So the way to get rid of it was Dead Sea Spa Magik Mineral Shampoo. It’s not cheap, but it’s not outrageous so I bought a bottle. And for whatever reason the patch disappeared and hasn’t come back.

I haven’t done a single blind trial on me and I couldn’t easily do a double blind one so it may be the placebo. However it works and I can just about afford it (I save by leaving longer times between haircuts…). Sadly Ray has upped shop…

Today I had to renew it and went to a well known supplier of food and nutraceuticals. Ever since reading Ben Goldacre’s Bad Science I now believe that almost all are worthless. (I used to believe antioxidants were useful but he showed from the literature that there were no proven benefits.)

So what’s in this Dead Sea stuff? I assumed it was a simple shampoo with masses of KCl and MgSO4 or something. Here goes:

  • Harmonized water (TM) (Aqua Maris Sal) which I take to mean water from the Dead Sea. I do not know how it is harmonized or what that means. Maybe it’s homeopathic and you dilute with ultra concentrated salt solution.
  • Sodium Coceth sulfate. Never heard of it but WP gives:
    “Sodium Coceth Sulfate is a semisynthetic detergent-like compound derived from fatty acids obtained from coconut oil, modified using ethylene oxide (oxirane). It is a milder foaming agent found in baby cleansers, gels, and cleaners.” Well, since I am supposed to know all about polyethylene oxide this is useful knowledge.
  • Cocamidopropyl betaine. Not sure what OSCAR/OPSIN would make of this name (I’ll have more on that later). I guess it’s coconut oil treated with a betaine. Yup, I was right (see WP). I don’t think OPSIN would get that yet but I expect Pubchem does. Look for yourself.
  • Sodium beeswax (sic). I guessed this was saponification of beeswax and found:

    This derivative of Beeswax is derived through the saponficiation process, which involves the hydrolyzation of Beeswax in the presence of sodium in order to create soaps. It serves as a natural emulsifier and provides the same skin-caring properties as Cera Flava (Beeswax).

    I hope they have some hydroxide ions as well otherwise it could be quite fun.

  • Zinc pyrithione. This is a well-characterized pure compound with a beautiful structure (5-coordinate Zn). WP says it’s antifungal and antidandruff so I just wonder whether that should be where I go next.
  • Lauryl Pyrrlidone. It should be Lauryl pyrrolidone of course and for those of you – like me – who stop counting at capric acid it’s easier for OPSIN as 1-dodecylpyrrolidin-2-one.
  • Sodium chloride. Wonder where they get that?
  • Xanthum gum
  • Fragrances, plants extracts, etc.

Ah, at the bottom it explains “harmonized”:

“Special de-ionised water with pure Dead Sea Minerals: Magnesium, Potassium, Calcium, Sodium as Bromides, Chlorides and Sulphates”

Isn’t the idea of specially deionising water and stuffing it full of concentrated salts enormous fun!

So, determining to get out as fast as possible and whizzing past the shelves that BadScience rightly attacks, my eye was suddenly caught by
Methylsulfonylmethane
Since Daniel (to be explained later) has been improving OPSIN out of all recognition I can do this in my head and get MeSO2Me. It’s a solid (tablets). Never heard of it as a pharmaceutical. So when I get to the counter I ask what MSM does. The assistant looks it up and it cures arthritis, anxiety, and much else including worms and parasites. (I can believe that – it looks as if it could be unpleasant although WP’s verdict is mostly harmless and probably worthless). Further information: “Active ingredient: sulfur”.

Oh dear. The Victorians used to give children brimstone (native sulfur). So I asked did they sell sulfur (they might). No, she said, but would I like some chondroitin sulfate..

Thanks I said, but no. I get my Sulfur in Calcium Sulfate from the water company – and they throw it in for free, every day.

Posted in Uncategorized | 1 Comment

Chem4Word – the journey so far

We’ve been very silent about Chem4Word (C4W) for several reasons, but a major one is that I don’t like vapourware. I’ve spent too long in the pharma industry getting high-pitch sales including (ca. late 1980s, all true):

  • “We have a revolutionary method for predicting protein structure. It’s so powerful we aren’t telling anyone anything about it or how it works – you have to buy it to find out”
  • “Our graphics can render spheres 10 times faster than the competition, so you can design 10 times more drugs”
  • “we are launching our product to a selected group of pharma companies; we’ve got one slot left, but we need a PO from you by the end of the week”
  • “The Bioengine is so powerful that it can understand Japanese and fold proteins. It’s only 22M USD but you get an ETA supercomputer thrown in (ETA went belly-up the next week)”.

Needless to say none of these were heard of again.

So I have been careful not to create vapour-ware during the gestation period (let’s say 9 months so far). And, gratifyingly, a lot of things have changed in a positive direction. So I can say, accurately, that I am delighted with where we arrived on Tuesday last week. It’s been a rather twisty journey to get there and this has resulted in false trails, confusion, fun, pub sessions, belief, despair, etc. A month earlier I would have said the project velocity was negative, we had a broken system that I would be embarrassed to show anyone, our architecture was a disaster, etc. I was terrified of showing it at BioIT. Today I am proud and very positive and I will tell you about it in a series of posts.

So what is C4W? At the most general it’s an act of faith by Microsoft Research about our work on semantic chemistry at Cambridge and how Word2007 can become the semantic framework. But it’s the people that matter – I can list 20 at Microsoft and I will do so over the weeks but I’ll start with Lee Dirks (project sponsor) and Alex Wade (program manager). Without both of these, working very long hours at difficult times and with great patience, the project would have crashed. They have never flinched from the belief that the project would succeed, and they have changed direction on several occasions in response to need. We’ve been through different approaches to architecture, different types of project management and different allocation of resources. This may sound like thrashing; I can tell you it isn’t.

More specifically C4W is a semantic and ontological chemistry system which includes creation, editing, publishing and re-use of what Henry Rzepa and I call datuments – integrated data and documents. It’s not YACE – “yet another chemical editor”, or YAELNb (lab notebook) or a “ChemFoo killer”. It’s what I have been wanting for 15 years – a properly resourced Open implementation of a semantic chemistry system – a collaboration with a 600-pound gorilla which can make the dream happen.

In developing Chemical Markup Language I was always aware that I would need help. Chemists are conservative and when people said “who is using CML? Only the Blue Obelisk? Oh, then we shan’t bother” – I had to accept this as the verdict of the market – I and others had to create a complete ecosystem for CML and then people might start using it. That is, of course, hard. But I knew I needed a 600PG (I can’t find the origin of this phrase and the mass varies – 800, 900).

In likening Microsoft to a 600PG we know that gorillas are largely harmless unless you upset them or get in their way. I am getting to know the gorilla well in parts and so far I can co-exist without being squashed. I’m in control of those parts I need to be in control of and happily leave other bits to the gorilla.

So what is C4W? It’s a flexible, modular, validatable, semantic ontological chemistry platform in C#, XML/CML and RDF with graphics/UI in WPF and XAML. It emphasizes validation and semantic correctness (e.g. all hydrogens must be specified and no information can be provided by default). The implementation is declarative/functional in that modules are side-effect-free and information is computed lazily or on demand. Recomputing rather than storing makes sure that information cannot get corrupted. An XML data model means that everything is visible. Today’s machines are fast enough that graphics can be loosely coupled to the data model – we can pass the data repeatedly every few milliseconds. XML gives the flexibility missing in fixed storage models. And a lot more…
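Two of these ideas, explicit hydrogens and recomputing rather than storing, are easy to sketch. What follows is a Python illustration only, not the C4W implementation (which is C#): the Atom and Molecule classes, the validate method and the formula property are hypothetical names, and the rule "a missing hydrogen count is an error" is my reading of the principle above rather than C4W's actual validator.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Atom:
    element: str
    hydrogen_count: Optional[int] = None  # None = the author did not state it


@dataclass
class Molecule:
    atoms: List[Atom] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return one message per atom lacking an explicit hydrogen count."""
        return [
            f"atom {i} ({atom.element}) has no explicit hydrogen count"
            for i, atom in enumerate(self.atoms)
            if atom.hydrogen_count is None
        ]

    @property
    def formula(self) -> str:
        """Recompute the formula on every access; nothing is cached."""
        problems = self.validate()
        if problems:
            raise ValueError("; ".join(problems))  # never guess a default
        counts = Counter()
        for atom in self.atoms:
            counts[atom.element] += 1
            counts["H"] += atom.hydrogen_count
        symbols = [s for s in counts if counts[s] > 0]
        if "C" in symbols:
            # Hill order: C first, then H, then the rest alphabetically.
            rest = sorted(s for s in symbols if s not in ("C", "H"))
            ordered = ["C"] + (["H"] if "H" in symbols else []) + rest
        else:
            ordered = sorted(symbols)
        return "".join(
            s + (str(counts[s]) if counts[s] > 1 else "") for s in ordered
        )


# Methylsulfonylmethane, MeSO2Me, with every hydrogen count stated.
msm = Molecule([Atom("C", 3), Atom("S", 0), Atom("O", 0), Atom("O", 0), Atom("C", 3)])
print(msm.formula)
```

Because the formula is derived each time it is asked for, editing an atom can never leave a stale formula behind, and an atom with no stated hydrogen count is reported rather than silently filled in.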

What does it currently do? We decided early on that there had to be compromises between functionality, aesthetics, and semantic correctness. We’re strong on the semantics, which needs to be correct right from the start. Word gives a lot of semantic functionality for free – XML validators, smart tags, etc. We’re working on the aesthetics over the next few weeks to get the “user experience” right. Word itself has a great deal of UI functionality – we have a navigator, a gallery, etc. and the group continues to improve its experience in MS UI tools. We’ve developed a completely new approach to chemical styling. The functionality concentrates on processing existing molecules, including “tweak” functionality, validation and normalization. We’ve deliberately left molecule creation to later because it’s open-ended and we need to do it semantically (most current programs are graphically oriented and have virtually no validation other than valence checking on carbon). In the modern era with Pubchem it’s probable that most scientists will be able to find their molecules already exist and there are anyway free drawing tools that emit CML.

As we’ll be formally showing C4W later this month I’m not going to show screenshots, etc. I am very excited indeed about the ontology feature and Nico Adams’ ChemAxiom but I am keeping my mouth shut and leaving him to blog about it.

We will be releasing C4W under an Open Source licence and looking for collaborators. We don’t know the mechanism of this – it will appear in CodePlex – and we shall be looking for developers who are able and keen to work in a collaborative .NET environment. If you are interested in this, and can contribute, let us know. Accept that we can’t make time promises at this stage and until we know the .NET chemistry community better we don’t know what degree of management will be required.

Posted in "virtual communities", Uncategorized | 4 Comments

library of the future – warming down

I’m winding down on the library of the future theme – I am impatient to tell you about Chem4Word, whether and how we can rescue Cheminformatics from its current position as a pseudoscience, how to do language processing and textmining properly, and so on. But I owe my hosts at “The JISC” and Bodley thanks and the wider blogosphere some wind-down.

It’s been valuable to me and to some others I have spoken to. The event itself was great and brilliantly staged and Dicky, Helen and many others deserve great credit. (I will pass over the late night session with Dicky and Rachel, other than to say that after retiring at 0200 getting the 0651 back to London was a major effort of will.)

I loved the Twitter feed and SecondLife. The feed was instantaneous and had comments all round the world – search for #LOTF09 and PMR (http://search.twitter.com/search?q=%23lotf09+PMR) and you get lots of tweets – most of these while I was speaking and answering questions. It’s a new experience and I think it’s great (perhaps it was a good thing I couldn’t see the feeds during my talk – one was “Classic PMR”). Here’s a few from the first page:

kevingashley: “The archivist of the future will not come from the archives of the past; they will be a revolutionary” #digccurr (~from PMR #lotf09)
adrianstevenson: PMR “Google, Amazon etc will nick the business of education if we’re not careful” #lotf09
briankelly: @adrianstevenson You’re right. I came across ‘Critical Friend’ concept from JISC, but used it wrt PMR’s #lotf09 blog posts.
tomroper: #LOTF09 Q: role of university presses in battle for scientific information? A PMR: publishing and learned organisations poor bedfellows

The SL was an experiment and it worked, within limits. The display was on except during talks, which I think was a pity; keeping it on during talks should – in retrospect – be tried. I could see Jennifer/broniba/Akua_Inkpen but I couldn’t talk to her. Also any messages from SL were relayed through a human. But the streaming video in SL looked great from where I sat – the avatars were watching a live screen and we were watching live avatars. (I have an avatar but I haven’t worked out how to dress it – I need a lesson).

I need to clear up a misunderstanding which arose from my blogging – the library blogosphere felt I was attacking them. I was deliberately not, but I was giving them very . So I will briefly reply to a comment on John Dupuis’ blog

John Dupuis said…

I found it refreshing that although [PMR] started with the assumption that we’re all useless and dead, he did seem willing to at least listen and learn. Hopefully, we can listen and learn from him as well and see where we can improve.

At the end of the day, there’s a lot of mutual misunderstanding between scientists and librarians. It’s hard to know which is more of an issue: them not understanding us or us not understanding them.

PMR: If you read my earliest posts on this you will see that I did NOT assume librarians were “useless and dead” and I have never said that. I blogged #LOTF09 on the assumption I would get input from the library community. I know it was being read but even after nine days I got effectively no feedback. In bioscience, chemistry, information science, semantic web, open access, replies come within hours. I went into the FriendFeed and Twittersphere and picked up things like (paraphrased)  “we’ve written this report, why doesn’t PMR read it”, “why does PMR think that he can blog in our territory and expect a response”, “it’s PMR’s responsibility to approach us properly”, “he should apologize for criticizing us”.

So, getting no feedback, I turned up the outrage knob a bit and said

“I don’t blame the organizers (and I’m grateful to Dicky for sending the Ithaka report).  I’m left with the overwhelming impression that the community is now past caring about the future of the library. That’s essentially what Ithaka said 2-3 years ago – that ULibraries had to be visible and rebrand themselves.  They’re not and they aren’t.”

By community I meant the academic community as a whole, and I stand by what I said. I have tried to act as a messenger between  scientists and librarians. I have, I believe, been factually accurate. When JohnD says:

“It’s hard to know which is more of an issue: them not understanding us (ULibrarians)  or us not understanding them (scientists).”

I feel I am acting as a Cassandra or Jeremiah by saying that the issue is deadly clear. I am sorry to say the following, but it’s true.

Most scientists don’t care about (science) librarians. I talked last week with a very senior bioscientist – dean of science – editor of a prestigious journal (which does its own data reposition, without help from libraries). I will not repeat what he said in detail because it will hurt too much, but a simple version is that libraries are a costly irrelevance and should be got rid of. I have no doubt that’s a very widespread view in science.

There is no compulsion for the scientist to come to or understand the library. Scientists are already finding ways (Pubmed, EBI, Wikipedia, Nature Precedings, Wellcome Trust, etc.) to manage information without libraries. Scientific research manages hundreds of billions of dollars annually. ULibraries are topsliced, at least in part, from that. Topslicing is always an unpopular tax.

So I am trying to help libraries, rather than attacking them, when I say that it is up to them to approach scientists – and very rapidly – because that is where the money and the power is.

That’s the simple truth. It may be too late to do anything – I don’t know.

Posted in Uncategorized | 4 Comments

librarians of the future – Christine and Kimberley

It was great to meet up again with Christine Borgman from UCLA at the Microsoft meeting. Christine and I have much in common about what needs to be done for digital scholarship.

Christine runs a Masters (I think) in LIS and serially hijacked many of the invitees to take part in virtual sessions with her students. So I gave 15-20 minutes of brain dump over the video link. I said that I would be talking in Oxford and asked for contributions. This one, from Kimberley Garmoe, arrived just too late for me to reference it… I am really flattered by the reference to the Enlightenment…

I was the small and nearly invisible voice from the back of the room at UCLA. I do not intentionally hide behind tall people, but somehow they always end up in front of me. I immensely enjoyed your talk, and agreed in principle with everything you had to say. I also think we are on the cusp of the communications revolution, the importance of which can only be compared to the first 300 years of printing. The revolution is already upon us, and clearly will not be televised. I hope that you do not take umbrage with my argumentative tendencies. Your style is so engaging and ideas so compelling that I found it impossible to remain passive and polite.

Dr. Borgman is correct, I am one of the students whose background is not the library sciences, or in any science for that matter. I am a Ph.D. student in European history, and my work is on the communications revolution in German mass media at the end of the 18th century. I will spare you further gory details. However, I think that much your argument reaches back into late Enlightenment thought, and I am happy to see that the legacy of the Enlightenment lives on.

I share your concern that information should not be monopolized, but I would point out that monopoly in the production of information seems to have existed from the early days of print. And by this I do not mean merely the official monopolies granted individual printers, but also the tendency of first printers, and then publishing houses, to establish control over the selection and distribution of information over long periods of time. Even the world of print there have been long decades of disaggregation and competition, but time and time again we end up with powerful monopolies. I would like to know what you think, why does information end up monopolized?

I am certain that your talk at the Bodleian went exceedingly well, and I only wish I was there to hear it. However I know I can look forward to it in the Bodleian print series.

With my best regards,

Kimberly Garmoe

See a very comprehensive account of (some of) the proceedings. I expect that video, etc. may appear as well.

Posted in "virtual communities", nmr, Uncategorized | 1 Comment

libraries of the future – what I shall say

I am blogging what I hope to cover in my 15 minutes, and I am speaking from the view of practising STM researchers in publicly funded institutions. Please feel free to follow the links during the presentation. There are also ca 30 visitors from Second Life.

Power Corrupts; Powerpoint corrupts absolutely (Tufte); This is a semantic, distributable, accessible presentation

Polemic warning: HIGH; Speed: FAST; ComfortFactor: NEGATIVE; Known Copyright Violations = 0

Thanks and Greetings

Background

I researched by

  • talking to scientific colleagues..
    The main conclusion was that the formal “library” was largely unseen/irrelevant and at best a service department
  • raising the subject on my blog and following Twitter and FriendFeed. Feedback was slow until I tweaked the outrage knob slightly and was then mainly from tech-aware librarians. The main thrust was that they were doing a good diverse job which wasn’t appreciated by me or scientists in general.
  • I have been described by Brian Kelly as a “Critical Friend” to libraries (and I accept the compliment).
  • Conclusion: the librarian of the future will not come from the librarian of the present. They will be real revolutionaries.

What scientists do and want in their information environment

  • Quality peer-review
  • Immediate seamless access and search to all published information. Not photocopies
  • Electronically, not on paper. Little bits of lots of papers.
  • Interdisciplinary. No subject libraries. No arbitrary discipline boundaries
  • Access to experimental data and its re-use
  • Write papers and grants as efficiently as possible.
  • Build a personal collection of relevant papers. iPDF
  • Experimental data – Collect, version, annotate and preserve (medium term).
    Maybe through society/publisher
  • Recognition for their work, papers, data, software, services, methods…

Where the world/web is going anyway

  • All information will be free and online
  • Everyone will be pervasively connected
  • Evolution, not planning
  • Rapid entry of major players – (GYM) Google, Yahoo, Microsoft
  • Personal information collections online
  • Clouds and Communities
  • The semantic web (TimBL)
  • Micropayments??

A few resources we use or have created

Software and informatics are the new instruments of communication. The code is mightier than the report…

  • Pubchem library of molecules. About 20 million contributions from researchers, suppliers, agencies, etc. A vigorous campaign by American Chemical Society to close it down as government-sponsored “socialised science”. The campaign failed.
  • Sourceforge. A true repository where I store all my code, versioned, preserved, sharable.
  • C3DeR crystallographic repository. This captures all experiments in the department, and publishes them under embargo.
  • Crystaleye nightly robotic aggregation of the world’s published crystallography
  • The Blue Obelisk A group of chemists dedicated to Open Data, Open Source, Open Standards, who have developed many widely used libre resources.
  • Chem4Word. An Open tool for authoring chemistry in Word2007, thus returning power to the authors who can declare their data as Open.
  • The Open Knowledge Foundation provides a wide range of protocols, visions, tools. We have developed an Is It Open? service for requesting information from publishers as to whether their information is Open.
  • Wikipedia. We see WP as an important reference work for teaching, learning and research and are helping to add semantic chemistry.

Battle for the ownership of scholarship

The web is hosting a battle between universal access to information and control by major commercial interests. The balance between “good” and “evil” shifts monthly – free sites become closed and data appropriated; hitherto monopolists (e.g. Microsoft) promote open information…

The Universities had a golden chance 10 years ago to regain control of scholarly publishing (e.g. through University Presses). They completely lost the plot.

The Universities have ceded ownership of scholarship to the publisher giants – Elsevier, Thomson ISI, Wiley, and most regrettably learned societies which have lost their mission (American Chemical Society). The power of the web still allows us to reclaim this but we must be quick.

Hundreds of billions of research dollars pass from Funders to Universities but this is in large part “controlled” by commercial and pseudo-commercial publishers who decide what is meritorious by mindless algorithms suited to their profitability – as meaningful as “top of the pops” sales are to musical quality.

The (quasi) commercial publishers are vigorously lobbying governments to restrict access to scholarly information epitomised by the PRISM association for denigration of Open Access (believed to be Elsevier, ACS and a few others).

Newcomers look to the web, not libraries, for their information and publishers (in the most general sense) will exploit this to create direct links between authors, publishers and consumers. GYM, Elsevier, Facebook, Twitter… Universities should welcome this and seek to control their interests.

Open Access, Open Data are not about business models, but the soul of scholarship. HEADS OF SCHOLARLY INSTITUTIONS MUST SPEAK OUT AND ACT, OR THEY WILL LOSE CONTROL. They must collaborate, not compete on this. There is not much time left.

What can you do? JUST DO IT

TimBL says “just do it”. Pubchem has “just liberated molecules”. Greg Crane (Perseus) has “just liberated classical scholarship”. Wikipedia has “just liberated encyclopedias”. Openstreetmap.org has “just liberated geospatial data”. The OKFN “just built” the IsItOpen system. Undergraduates in our group “just built” the OSCAR system and the C3DeR chemistry repository. Harvard (and others) “just declared” autonomy of their scholarship.

And tell the world about it. Every day.


Wellcome would like comments on author-pays licence

Robert Kiley has asked me to help garner comment on a potentially restrictive author-pays (or, as it may be, funder-pays) licence. I’ll post his request in full and add my own comment.

Open access licence: researcher opinion sought

A learned society has offered the Wellcome Trust an open access, author pays option for researchers who seek publication in their journal. However, the licence they wish to attach to these articles is more restrictive than the Trust would normally require when paying an OA fee.

The purpose of this posting is to seek opinion from the research community on whether these restrictions will, in any way, limit a researcher’s ability to re-use this content.

Summary of relevant licence conditions

The relevant section of the licence is shown below, in italics

PMC or UKPMC mirror site users may access, download, copy, display and redistribute articles, as well as text and data mine content in articles for non-commercial purposes only, subject to the following conditions:

  • In the case of text-mining, User may incorporate individual words, concepts and quotes up to 100 words per matching sentence, whereas longer paragraphs of text and images cannot be used.
  • Users may not create derivative works (as defined in the U.S. Copyright Act, 17 U.S.C. §101 et seq.) based upon the documents.

The Wellcome Trust is seeking input from the research community to help determine:

A) Whether such a licence would impact on your ability to re-use and re-purpose this open access content.

B) If so, please give some examples of research activities that would be limited by this licence.

If you would like to respond to this issue, please use the comment function below or send an email to r dot kiley at wellcome dot ac dot uk.

PMR: The first thing to remember is that funder-pays Open Access is – in large part – a business model. The funder wishes a good – universal access to the information – and is prepared to pay the provider an appropriate sum. There is, possibly, a balance between the fee and the extent of the service provided (although personally I am an absolutist). IOW it could be that the publisher says “we shan’t charge you very much and this makes author-pays access affordable to those who normally couldn’t. But understand that we cannot give so much in return.” I doubt this is the case here – I do not know and shall not guess the society but I suspect that the fees are comparable with normal author-pays.

A major problem is “what is a fair price for author-pays”. I have complete trust that Wellcome has thought very hard about this, because it sets the scene for the market. If the society is properly run for the benefit of its members then they should be able to determine whether the fee covers the cost of processing (after all these are non-profit orgs). If the books are closed and the fee is large then there is a suspicion that revenue is used for other purposes or that there is no drive towards efficiency.

It’s complicated because the society probably gets most of its revenue from closed access publications – they don’t stop coming and the subscriptions don’t stop either. So the society should lower its subscriptions pro rata to the author-pays income. Is there any evidence for this?

The wording of the licence implies – though I would hate to impugn any society publisher – that the society (or its senior officers) cares more about income than about the furtherance of the scientific domain in which the community has invested its trust. A scientific society, especially in the chemistry/bioscience area, should actively welcome text-mining as this is a major tool for extracting science from textual publications. I was sorry, for example, when Nature (for whom I have many regards) created the OTMI (open text mining initiative) where they exposed the text of the paper but jumbled the order of the sentences. This obfuscation of science sits badly on a learned organization (even a for-profit one).

Our group text- and data-mines chemistry and bioscience. In some cases this is a worthy activity because running text is the way that bioscience is expressed. In others, such as chemical data, it’s because the authors, publishers, readers, don’t yet understand the value of semantics and ontologies (Chem4Word addresses that problem at the authoring stage).

What worries me about the licence is that the society wishes to restrict the use of semantic information. Why? Possibly because it has an interest in redistributing its own semantic version of the work or an aggregate at a higher cost. If so we are seeing yet again deliberate restriction in the progress of science.

Finally I really discourage the use of NC licences. They are difficult to manage and have complexities in mashups and derivative works. It’s a pity that the proponents of Gold-OA haven’t universally insisted on CC-BY – there are many who do, like SPARC, BMC, PLoS, etc. Anything less is a less valuable good and should not carry the same fee.

I understand that restrictions and politics in the US have made it difficult to get full CC-BY for the NIH policy (“eyeballs are sufficient”) but we must not let this creep.


Chem4Word at Microsoft External Research Symposium

I took several months off blogging – completely. Why? Because I was concentrating on Chem4Word – a semantic chemical authoring tool sponsored by and jointly built with Microsoft External Research in Redmond. In this and subsequent posts I’ll now tell you about it and what hopes we have for it. Yesterday – and I’ll explain why in later posts – was a breakthrough day for us and I felt I could start to tell the world about it.

When we started I had no idea of the amount of mental and emotional effort I was going to have to put into this. We started about 2 years ago when Tony Hey and Lee Dirks invited some academic chemical informatics folk to Redmond to plan what started as eChemistry and has now turned out to be OREChem. I’ll be blogging about this regularly elsewhere. But Tony and Lee also asked me if I would like our group and theirs to jointly develop a chemical drawing tool (I think that’s the phrase used then) for Word.

Why our group? Why not a commercial software company with lots of experience, with customers, with proven algorithms, etc. I can’t put words into his mouth but remember that Tony ran the UK eScience program and got a view of many disciplines – medicine, earth, environment, transport, oil, and particularly bioscience. These disciplines were applying the new tools of the distributed web, looking to share semantic information and prepare for the multi-party clouds that we are now seeing. He visited the ACS-CINF session to see what the state of the art was, and I’ll have to leave him to say what he concluded.

Microsoft – through the strenuous efforts of Jean Paoli – was a very early adopter of XML and so it was natural that Tony should look to CML (Chemical Markup Language) as the basis of our project. We’ve started from day zero with an XML data model, which has an amazing number of benefits. It supports side-effect-free programming (ably steered by Jim Downing) and can take advantage of the incredibly powerful LINQ system in dotNet (it’s really “.NET”, but it’s very easy to miss the dot).
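
To give a flavour of what an XML data model plus side-effect-free querying looks like, here is a minimal sketch. It is in Python rather than C#/LINQ, and it is not Chem4Word code: the methanol fragment is hand-written for illustration and the function names are hypothetical. It assumes only the basic CML elements (molecule, atomArray, atom, bondArray, bond) and the CML schema namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical hand-written CML fragment for methanol (illustration only).
METHANOL_CML = """
<molecule xmlns="http://www.xml-cml.org/schema" id="methanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
    <atom id="a3" elementType="H"/>
    <atom id="a4" elementType="H"/>
    <atom id="a5" elementType="H"/>
    <atom id="a6" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
    <bond atomRefs2="a1 a4" order="1"/>
    <bond atomRefs2="a1 a5" order="1"/>
    <bond atomRefs2="a2 a6" order="1"/>
  </bondArray>
</molecule>
"""


def element_counts(cml_text):
    """Count atoms per element; the parsed tree is read, never modified."""
    root = ET.fromstring(cml_text)
    atoms = root.findall("cml:atomArray/cml:atom", CML_NS)
    return Counter(atom.get("elementType") for atom in atoms)


def neighbours(cml_text, atom_id):
    """Return the sorted ids of atoms bonded to atom_id."""
    root = ET.fromstring(cml_text)
    bonds = root.findall("cml:bondArray/cml:bond", CML_NS)
    # Each bond names its two atoms in the atomRefs2 attribute.
    pairs = (bond.get("atomRefs2").split() for bond in bonds)
    return sorted(b if a == atom_id else a for a, b in pairs if atom_id in (a, b))


if __name__ == "__main__":
    print(element_counts(METHANOL_CML))
    print(neighbours(METHANOL_CML, "a2"))
```

The point is that the molecule lives in the XML and every question is a query that returns a new value, which is roughly the style that LINQ encourages on the .NET side.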

So we started with a very fluid high-level plan and at the same time started the contractual negotiations. For those of you who don’t know, *all* large companies take a long time to finalise contracts on the first occasion. (And so do many small ones…). And Universities aren’t always quick. So the contract is measured in hundreds of days of negotiation – many of these anticipating the problems we might run into. (In retrospect – no surprise – a lot of this seems very peripheral or even obsolete.)

The basis of the chemistry was a CML engine. I’d already written JUMBO (Java Universal Molecular Browser for Objects), which is neither Universal nor any longer a browser – but this has been converted into dotNUMBO (in C#). I’d hoped that we could autoconvert the software and I’m very glad we didn’t try. Although all lines of code have been typed in blood the design is much cleaner and it’s much smaller. And we are starting with a subset of chemistry.

The effort has been incredible. I think Jim and I are nominally on 10% of our time, but that’s >10% of 168 hours/week at times. Over the last 2-3 months we’ve taken to having daily telcons – sometimes short and sometimes cancelled – but often over an hour. And this effort has been made by the MS people as well – we’ve had many visits from Lee, Alex, Savas, and also a week’s stay from JimM. I think it’s taken over some of their lives as well….

It’s been a bigger project than any of us thought (I think). It’s been run on a project management system (TFS) where we all invent scenarios and all get (multiple) tickets for tasks. Many of these are past their sell-by date. At times they hung over me (at least) as a cloud of doom, showing how slowly the project was going. Parts of milestones were missed. Scenarios moved to later milestones, etc.

But we set ourselves two deadlines which could not be moved. This is often a good way of helping to ensure communal vision and prioritisation. The deadlines were

  • yesterday (the Microsoft External Research symposium, where we would present to a wider range of academics and MS staff)
  • end of April – BioIT in Boston – a large event with many commercial vendors and purchasers.

We made it yesterday with 10 minutes to spare. It was great. More – much more – later…


libraries of the future – "just do it"

I have the phrase “just do it” associated with TimBL – certainly he was saying it at WWW in Banff during the semantic web workshops. (see, e.g. http://www.w3.org/2001/04/30-tbl). What I take this to mean is that if you spend too much time working out all the ways to do things the world moves on – and very rapidly.

The point for JISC/Bodleian tomorrow is that libraries have to be at the front of the web. Some are, and particularly JISC. I’ve been very impressed with the things that JISC have been supporting – rapid innovation for part of a person for part of a year. Lightweight protocols. Developer Heaven – a solid week of geeks bashing information – Jim and Nico went and I’m envious.

Software is a new form of creativity. If we want to get our ideas across it’s often better to write a program than a document. That usen’t to be true – code development, distribution, compilers, licences were all killers for rapid development. But now the technology has risen to meet the expectations. I watch people craft ideas into web pages and services within hours (I’m only partially literate and it’s not the best use of my time). The technology has arrived to liberate the expression of much of scientific semantics.

“Just do it” means building something that may or may not work and may or may not take off. Like SAX, which we developed (through David Megginson) on the xml-dev list in a month. Like chemical/MIME which took me and Henry Rzepa an afternoon at the pub. Sure these are the exceptions – good code takes lots of work – JUMBO has taken 20 years. But Nick Day wrote CrystalEye in less than a year; Joe Townsend and Chris Waudby wrote the first OSCAR over two summers. Dan and Lee put together the C3DeR crystal repository over 2.5 months last summer and I’ll be showing it tomorrow.

Next time you are tempted to read another report or write another one why not “Just go out and do it”. It can be painful, but it can be great fun.
