HHMI – green or gold? And the data?

Peter Suber has highlighted a new policy by HHMI and given a careful critique of what “Open” may or may not mean. It’s a good illustration of the fuzzy language that is often used to describe “Open”.  See: HHMI mandates OA but pays publishers to allow it
HHMI Announces New Policy for Publication of Research Articles, a press release from the Howard Hughes Medical Institute (HHMI), June 26, 2007.  Excerpt:

The Howard Hughes Medical Institute today announced that it will require its scientists to publish their original research articles in scientific journals that allow the articles and supplementary materials to be made freely accessible in a public repository within six months of publication.
[… snip …]
HHMI also announced today that it has signed an agreement with John Wiley & Sons. Beginning with manuscripts submitted October 1, Wiley will arrange for the upload of author manuscripts of original research articles, along with supplemental data, on which any HHMI scientist is an author to PMC. The author manuscript has been through the peer review process and accepted for publication, but has not undergone editing and formatting. HHMI will pay Wiley a fee for each uploaded article.
In addition, the American Society of Hematology, which publishes the journal Blood, has extended its open access option to HHMI authors effective October 1. Cech said that discussions with other publishers are ongoing.
The policy and supporting resources have been posted on the Institute web site and may be found [here].

To supplement this press release see

  1. The policy itself, dated June 11, 2007, to take effect January 1, 2008
  2. The Institute’s new page on HHMI & Public Access Publishing

Comments (by Peter Suber – absolutely to the point as always).

  • HHMI is finally mandating that its grantees provide OA to their published articles based on HHMI-funded research within six months of publication.  We knew last October that it was planning to adopt a mandate, but now it’s a reality.  Moreover, HHMI is taking the same hard line that the Wellcome Trust has taken:  if a grantee’s intended publisher will not allow OA on the funder’s terms, then the grantee must look for another publisher.  This is all to the good.  Funders should mandate OA to the research they fund, and they should take advantage of the fact that they are upstream from publishers.  They should require grantee compliance, not depend on publisher permission.
  • But unfortunately, HHMI is continuing its practice of paying publishers for green OA.  I criticized this practice in SOAN for April 2007 and I stand by that criticism.  HHMI should not have struck a pay-for-green deal with Elsevier and should not be striking a similar deal with Wiley.  HHMI hasn’t announced how much it’s paying Wiley, and it’s possible that the Wiley fees are lower than the Elsevier fees.  But it’s possible that they’re just as high:  $1,000 – $1,500.  We do know that its Wiley fees will not buy OA to the published edition, but only OA to the unedited version of the author’s peer-reviewed manuscript.  HHMI hasn’t said whether its Wiley fees will buy unembargoed OA or OA with a CC license.  The Wellcome Trust’s fees to Elsevier buy three things of value –immediate OA, OA to the published edition, and OA with a CC license– while HHMI’s fees to Elsevier buy none of these things.  If HHMI gets all three of these valuable things for its Wiley fees, then it’s basically paying for gold OA and no one can object to fees that are high enough to cover the publisher’s expenses.  But paying for green OA, when the publisher’s expenses are covered by subscription revenue, is wrong and unnecessary even if the fees are low.   For details, see my April article.


PMR:
Note that “Green” OA is very unlikely to make the Data Open. By default the publisher may restrict text-mining, and may have copyrighted the data (Wiley certainly have done and do this). So unless there is a CC license – which makes it effectively “gold” in this very unsatisfactory terminology – it’s almost useless to data-driven science.

Posted in data, open issues | Leave a comment

What we do at UCC – job opportunity in Polymer Informatics

I don’t normally say very much in this blog about what our day jobs are; now is a useful time to do so. The Centre is sponsored by Unilever PLC – the multinational company with many brands in foods and Home and Personal Care (HPC). It came about through some far-sighted collaboration between Unilever and Cambridge to create a Centre where cutting-edge research would be done in areas which didn’t just address present needs but also looked to the future.
This is typified by Polymer Informatics – where we have an exciting vacancy. Many of Unilever’s products contain polymers – you can think of them as long wriggly molecules. They can be very hard, as in polythene, or flexible, as in silicones or additives in viscous liquids. Next time you put something on your hair, teeth, face, toilet bowl, laundry, etc., there’s a good chance it will have a polymer ingredient of some sort.
Work in my group looks forward to where the world will be in 5 or even 10 years’ time. Here’s a list of some of the technologies involved in the current position:

OSCAR3, natural language processing, text-mining, Atom, Eclipse/Bioclipse, SPARQL, RDF/OWL, XPath, XSLT, etc.

What’s that got to do with wriggly molecules? Everything. Science is becoming increasingly data- and knowledge-driven. In many cases the “answer is out there” if only we knew where to look – publications, patents, theses, blogs, catalogs, etc. We may not need to go back to the lab but can use reasoning techniques to extract information from the increasingly public world of information. And, as we liberate the major sources – scholarly publications, theses, patents – from their current closed practices, we shall start to discover science from the relations we find. Open scientific information has to be part of the future.
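To make the RDF/SPARQL side of this concrete, here is a toy sketch of the kind of query over assembled facts that such reasoning could build on, using the rdflib library. The ex: vocabulary and the triples are invented purely for illustration; real work would run over statements extracted by tools like OSCAR3 and use proper chemical ontologies.

```python
# A toy illustration of querying assembled chemical facts with SPARQL.
# The ex: vocabulary and the facts themselves are invented for this sketch.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/chem#")

g = Graph()
g.bind("ex", EX)

# Facts that might have been text-mined from papers, patents or theses
g.add((EX.polymerA, RDF.type, EX.Polymer))
g.add((EX.polymerA, EX.glassTransitionCelsius, Literal(105.0)))
g.add((EX.polymerB, RDF.type, EX.Polymer))
g.add((EX.polymerB, EX.glassTransitionCelsius, Literal(-120.0)))

# "Which polymers have a glass transition below 0 degrees C?"
query = """
PREFIX ex: <http://example.org/chem#>
SELECT ?p ?tg WHERE {
    ?p a ex:Polymer ;
       ex:glassTransitionCelsius ?tg .
    FILTER (?tg < 0)
}
"""
for row in g.query(query):
    print(row.p, row.tg)
```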
The phrase Pasteur’s Quadrant is sometimes used to describe research which is both commercially exploitable and cutting-edge scholarship. That’s a useful vision. I have certainly found in my time in the Centre that industrial problems are often very good at stimulating fundamental work. So polymer informatics has taken me to new fields in knowledge representation. Polymers, unlike crystals, are not well-defined but fuzzy – they can have variable lengths, branching, chemical groups, etc. They flop about and get tangled. This requires a new type of molecular informatics and we have had to explore adding a sort of functional programming to CML to manage it. We now have a markup language which supports polymers, and several of its features are novel.
And, as the world develops, the information component in products continues to increase.  So we know we are going in an exciting direction.

Posted in chemistry, Uncategorized | Leave a comment

Top-down or bottom-up ontologies?

I am working out some of the ideas I want to talk about at Mathematical Knowledge Management 2007 – and in this post I explore how a knowledge framework might be constructed, and also how it can be represented in machine-understandable form. This, I think, will be seen as one of the central challenges of the current era.
I have worked on bodies responsible for the formalisation of information and have also for 15 years been co-developing Chemical Markup Language (with Henry Rzepa). This journey has not yet ended and my viewpoint still changes occasionally.
My first view – perhaps stemming from a background in physical science – was that it should not be too difficult to create machine-processable systems. We are used to manipulating algorithms and transforming numeric quantities between different representations. This process seemed to be universal and independent of culture. This view was particularly influenced by being part of the International Union of Crystallography’s development of the Crystallographic Information Framework dictionary system.
This is a carefully constructed, self-consistent system of concepts implemented in a simple formal language. Physical quantities of interest to crystallographic experiments can be captured precisely and transformed according to the relations described, but not encoded, in the dictionaries. It is now the standard method of communicating the results of studies on small molecules and is the reason that Nick Day and I could create CrystalEye. Using XML and RDF technology we have added a certain amount of machine processability.
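To give a flavour of why the CIF approach lends itself to machine processing, here is a deliberately naive sketch that pulls a few dictionary-defined data items out of a CIF file. The item names are standard core CIF tags; the file name is a placeholder, and real code should of course use a full CIF parser and the dictionaries themselves.

```python
# Naive extraction of a few standard CIF data items (dictionary-defined names).
# This ignores loops, multi-line values and much else - a real parser is needed
# for serious use; the point is only that the names are formal and stable.
def read_cif_items(path, wanted=("_cell_length_a", "_cell_length_b",
                                 "_cell_length_c", "_chemical_formula_sum")):
    items = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            for tag in wanted:
                if line.startswith(tag):
                    items[tag] = line[len(tag):].strip().strip("'\"")
    return items

if __name__ == "__main__":
    # "example.cif" is a placeholder for any small-molecule CIF
    print(read_cif_items("example.cif"))
```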
Perhaps encouraged by that, Lesley West and I came up with the idea of a Virtual Hyperglossary (original site defunct, but see VIRTUAL HYPERGLOSSARY DEVELOPMENTS ON THE NET) which would be a machine-processable terminology covering many major fields of endeavour. Some of this was very naive; some (e.g. the use of namespaces) was ahead of the technology. One by-product was an invitation to INTERCOCTA (Committee on Conceptual and Terminological Analysis) – a UNESCO project on terminology. There I met a wonderful person, Fred W. Riggs, who very gently and tirelessly showed me the complexity and the boundaries of the terminological approach. Here (Terminology Collection: Terminology of Terminology) is an example of the clarity and carefulness of his writing. One of Fred’s interests was conflict research and his analysis of the changing nature of “Turmoil among nations”. I am sure he found my ideas naive.
So is there any point in trying to create a formalization of everything – sometimes referred to as an Upper Ontology? From WP:

In information science, an upper ontology (top-level ontology, or foundation ontology) is an attempt to create an ontology which describes very general concepts that are the same across all domains. The aim is to have a large number of ontologies accessible under this upper ontology. It is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempts to describe those general entities that do not belong to a specific problem domain.

The article lists several attempts to create such ontologies, one of the most useful for those in Natural Language Processing being

WordNet, a freely available database originally designed as a semantic network based on psycholinguistic principles, was expanded by addition of definitions and is now also viewed as a dictionary. It qualifies as an upper ontology by including the most general concepts as well as more specialized concepts, related to each other not only by the subsumption relations, but by other semantic relations as well, such as part-of and cause. However, unlike Cyc, it has not been formally axiomatized so as to make the logical relations between the concepts precise. It has been widely used in Natural Language Processing research.

(and so is extremely valuable for our own NLP work in chemistry).
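For readers who have not used it, this is roughly how WordNet is consulted in practice, here via NLTK (this assumes the NLTK WordNet corpus has been downloaded; the exact synsets returned depend on the WordNet version):

```python
# Minimal WordNet lookup: senses and hypernyms of "molecule".
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("molecule"):
    print(synset.name(), "-", synset.definition())
    # Follow the subsumption (is-a) relation one step up
    for hypernym in synset.hypernyms():
        print("   is-a:", hypernym.name())
```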
But my own experience has shown that the creation of ontologies – or any classification – can be an emotive area and lead to serious disagreements. It’s easy for any individual to imagine that their view of a problem is complete and internally consistent and must therefore be identical to that of others in the same domain. And so the concept of a localised “upper ontology” creeps in – it works for a subset of human knowledge. And the closer to physical science, the easier it is to take this view. But it doesn’t work like that in practice. And there is another problem. Whether or not upper ontologies are possible, it is often impossible to get enough minds together with a broad enough view to make progress.
So my pragmatic approach in chemistry – and it is a pragmatic science – is that no overarching ontology is worth pursuing. Even if we get one, people won’t use it. The International Union of Pure and Applied Chemistry has created hundreds of rules on how to name chemical compounds and relatively few chemists use them unless they are forced to. We have found considerable variance in the way authors report experiments and often the “correct” form is hardly used. In many cases it is “look at the current usage of other authors and do something similar”.
And there is a greater spread of concepts than people sometimes realise. What is a molecule? What is a bond? Both are strongly human concepts and so difficult to formalize for a machine. But a program has to understand exactly what a “molecule” is. So a chemical ontology has to accept variability in personal views. One ontology per person is impossible, but is there scope for variability? And if so, how is it to be managed?
So far CML has evolved through a series of levels and it’s not yet finished. It started as a hardcoded XML DTD – indeed that was the only thing possible at that stage. (In passing, it’s interesting to see how the developing range of technology has broadened our views on representability.) Then we moved to XML Schema – still with a fairly constrained ontology but greater flexibility. At the same stage we introduced a “convention” attribute on elements. A bond was still a “bond” but the author could state what ontology could be attached to it. There was no constraint on the number of conventions, but the implied rule is that if you create one you have to provide the formalism and also the code.
An example is “must a spectrum contain data?”. We spent time discussing this and have decided that with the JSpecView convention it must, but that with others it need not. This type of variable constraint is potentially enforceable by Schematron, RDF or perhaps languages from the mathematics community. We have a system where there is “bottom-up” creation of ontologies, but whose creators agree they need a central mechanism for formalizing them – a metaontology. The various flavours of OWL will help but we’ll need some additional support for transformation and validation, especially where numbers and other scientific concepts are involved.
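As a sketch of how such a convention-dependent rule might be enforced, here is a Schematron check, run through lxml, that a spectrum claiming the JSpecView convention actually contains data. The element and attribute names are illustrative rather than the definitive CML/CMLSpect vocabulary.

```python
# Sketch: enforce "a JSpecView spectrum must contain data" with ISO Schematron.
# Element/attribute names are illustrative, not the definitive CML schema.
from lxml import etree
from lxml.isoschematron import Schematron

SCHEMATRON = etree.XML(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="cml" uri="http://www.xml-cml.org/schema"/>
  <pattern>
    <rule context="cml:spectrum[@convention='jspecview']">
      <assert test="cml:spectrumData">
        A JSpecView spectrum must contain spectrumData.
      </assert>
    </rule>
  </pattern>
</schema>
""")

DOC = etree.XML(b"""
<cml xmlns="http://www.xml-cml.org/schema">
  <spectrum convention="jspecview"/>
</cml>
""")

validator = Schematron(SCHEMATRON)
print(validator.validate(DOC))   # False: the rule above is violated
```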

Posted in mkm2007 | Leave a comment

Mathematical Knowledge Management 2007

I have been invited to give a lecture at the Mathematical Knowledge Management 2007 meeting next week in Hagenberg, Austria. My talk is entitled Mathematics and scientific markup. I am both excited and apprehensive about this – what is a chemist (whose level of mathematics finishes at Part1A for scientists in Cambridge) doing talking to experts in the field?
However in the spirit of the new Web I’m blogging my thoughts before the meeting. This serves several purposes:

  • helps me get my ideas in order
  • gets feedback from anyone who may have an interest
  • identifies other people who may also be blogging about the meeting
  • acts as a public resource from which I can give my talk if I have problems with my machine.

The conference topic …

Mathematical Knowledge Management is an innovative field in the intersection of mathematics and computer science. Its development is driven on the one hand by the new technological possibilities which computer science, the internet, and intelligent knowledge processing offer, and on the other hand by the need for new techniques for managing the rapidly growing volume of mathematical knowledge.
The conference is concerned with all aspects of mathematical knowledge management. A (non-exclusive) list of important areas of current interest includes:

  • Representation of mathematical knowledge
  • Repositories of formalized mathematics
  • Diagrammatic representations
  • Mathematical search and retrieval
  • Deduction systems
  • Math assistants, tutoring and assessment systems
  • Mathematical OCR
  • Inference of semantics for semi-formalized mathematics
  • Digital libraries
  • Authoring languages and tools
  • MathML, OpenMath, and other mathematical content standards
  • Web presentation of mathematics
  • Data mining, discovery, theory exploration
  • Computer Algebra Systems
  • Collaboration tools for mathematics

Invited Speakers:

Neil J. A. Sloane, AT&T Shannon Labs, Florham Park, NJ, USA – The On-Line Encyclopedia of Integer Sequences
Peter Murray-Rust, University of Cambridge, Dept. of Chemistry, UK – Mathematics and scientific markup

What has molecular informatics to do with this? More than it appears. Chemistry overlaps considerably with mathematics, and here formal systems are important. It should be possible to explore the formal representation of thermodynamics or material properties in semantic form (though I may find that my use of “semantics” is imprecise or even “wrong”). Repositories are an obviously exciting area – can we find mathematical objects either by form or by metadata? OCR is important for all content-rich disciplines – see below. Inference and semantics are becoming increasingly important in the emerging web. And so I tick about half the topics above – not in mathematical detail, of course, but in the general approach to the problems.
As an example, what objects contain enough structure and canonicalized content that they act as their own discovery metadata? Most objects need a human or a lookup table to add the metadata for their web discovery. For example, you need to know the names of humans – you cannot work these out by looking at them. But in chemistry we can describe a molecule by its InChI – a canonical representation of the connection table (which is not easily human-interpretable). This is both its content and its discovery metadata. Searching Google for an InChI will find instances of that molecule on the web. I wondered what other objects could be identified just by their textual content. Perhaps a poem (although it won’t tell you who wrote it). I started typing lists of numbers into Google and suddenly found I was getting hits on Neil Sloane’s Encyclopedia.
In it, a sequence can be identified by its content – search Google for “1,3,6,10,15” and you get A000217 in The On-Line Encyclopedia of Integer Sequences. I had a chat with a well-known computer scientist (an ex-mathematician) at WWW2007 and he bet that I couldn’t tell him the next term in a sequence within 5 minutes. I bet him the drinks that I could. So, as we had wireless in the bar, I searched Google and immediately found the answer – he was astounded – and bought the drinks.
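Both examples rest on the same point: a canonical representation is simultaneously content and discovery metadata. For the chemical case, here is a minimal sketch, assuming RDKit (with InChI support) is installed, that turns a structure into the InChI string one would paste into a search engine:

```python
# Content as its own discovery metadata: derive a canonical InChI from a
# structure and use the string itself as the search term.
# Assumes RDKit with InChI support is installed.
from rdkit import Chem

caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
inchi = Chem.MolToInchi(caffeine)
print(inchi)
# Pasting this InChI into a web search finds pages describing the same
# molecule, with no separate human-assigned metadata needed.
```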
So many of the problems are generic between domains. Searching for MKM2007 I found this paper on how to extract mathematics from PDF (Retro-enhancement of Recent Mathematical Literature). It’s better than recreating cows from hamburgers as they have some access to source – but there are similarities to what we are trying to recover from PDF.
I shall use the tag mkm2007 for this and subsequent posts (in which I’ll explore things like the difference between top-down and bottom-up management systems). No one has yet used it but maybe someone will find it – let’s see.

Posted in mkm2007, semanticWeb | Leave a comment

The NIHghts who say 'no' – to chemoinformatics

A recent post from The Sceptical Chymist: The NIHghts who say ‘no’ [1]

The NIHghts who say ‘no’

Apologies to our international readers for the U.S.-centric post, but the National Institutes of Health announced earlier today that PAR-07-353, a grant involving Cheminformatics Research Centers, has been canceled for “programmatic reasons.” For those of you who haven’t heard of the Cheminformatics Research Centers, they are part of the Molecular Libraries Roadmap Program (MLP), which is

an integrated set of initiatives aimed at developing and using selective and potent chemical probes for basic research … [The MLP] was proposed to introduce high-throughput screening approaches to small molecule discovery, formerly limited to the pharmaceutical research industry, into the public sector… [and] is made up of the following major components: (1) access to a library of compounds (Molecular Libraries Small Molecule Repository); (2) access to bioassays provided by the larger research community; (3) support for the development of breakthrough instrumentation technologies; (4) access to a network of screening and chemical probe generation centers (MLPCN) where assays are screened and probe development is undertaken; (5) Pubchem, the primary portal through which the screening results of the MLPCN are made public and (6) the Cheminformatics Research Centers (CRCs) with multiple roles focused on high-level data analysis and dissemination with a focus on developing new understanding of the cellular processes (genes and pathways).

One reason why this is so surprising is because the grants were due next week (June 28th). I imagine the timing of this decision (and the decision itself) is bound to upset a number of people in this community, especially since many applicants were probably working around the clock to get their grant submitted before the (now non-existent) deadline…
Does anyone know more about this story or why the grant was canceled?
Joshua
Joshua Finkelstein (Senior Editor, Nature)

First, Joshua, no apologies needed – this affects world science not just US chemoinformatics. (And a reminder that Nature was active in helping to report the activities of those of us who wished to promote the value of the Pubchem effort). In Cambridge we are (or were) working with potential applicants in this program and were in the process of preparing material. We were informed:
There is a notice published today in the NIH Guide that cancels the “Preapplication for Cheminformatics Research Centers (X02)”, PAR-07-353. Here is the essential element of that notice:
“This Notice is to inform the scientific community of the cancellation of the PAR-07-353 entitled, Preapplication for Cheminformatics Research Centers (X02), due to programmatic reasons. Applications should not be submitted for the June 28, 2007 date. Any application submitted will not be assigned or reviewed. NIH intends to re-issue this announcement at a later date.”
The point is that NIH is (or was until yesterday) reaching out to the world community. Chem(o)informatics is a key tool in understanding biology and in discovering new leads for pharma. There was a real chance to revitalise chemoinformatics which has been languishing for many years bedevilled by lack of:
  • Open Data
  • Open software
  • Open processes
  • a modern approach to information
  • reproducible science
Just last week we had a 3-day workshop at the Unilever Centre on Machine Learning, aimed primarily at chemoinformatics. One of the speakers, a statistician, has published concerns about the quality of science in many chemoinformatics publications – it suffers from the lack of everything in the list above.
The NIH program would have given a major boost to Open reproducible science. The data would have been fully Open and reusable, software would have been Open and modular, and chemistry would have had a major example of how data could be re-used for science rather than being aggregated and resold. Whether or not our group had been funded I was looking forward to this program as it would have been a highly cost-effective use of funds. And it could have shown the pharma industry, which relies heavily on this approach but does so little to encourage good practice in it, a way forward. Will any pharma CEOs speak out? Or do anything to help a science on which they depend?
It would be irresponsible to speculate on what “programmatic reasons” means. It could be that, like Britain, the US wants to spend its income on wars rather than health. But unfortunately not every US organization approves of the NIH’s funding of Pubchem and related projects, which are often seen as “socialist” and as the government competing against the “private sector”. Will all scientific journalists highlight the major damage to chemistry that this cancellation causes?
So, Joshua, please keep investigating. Maybe there is a need for another scientific Woodward-Bernstein.
[Added subsequently – my private email favours cock-up rather than conspiracy. But it’s still indefensible to pull grants 6 days before the deadline. And my concern in the previous paragraph still holds – chemoinformatics should have proper public support to do proper science, not languish.]
[1] From Monty Python. In the current case the humour is ironic.

[This is the second grant I have failed to get today. At least the other one – for FP7 – got to the submission stage and was – presumably – reviewed]

Posted in chemistry | Leave a comment

Data and Institutional Repositories

One of the themes of ETD2007 was a strong emphasis on IRs. Not surprising, since they are topical and a natural place to put theses and dissertations. Almost everyone there – many from the Library and Information Services (LIS) community – had built, or was building, IRs. I asked a lot of people why. Many were doing it because everyone else was, or there was funding, or similar pragmatic motives. But beyond that the motives varied considerably. They included:

  • To promote the institution and its work
  • To make the work more visible
  • To manage business processes (e.g. thesis submission, or research assessment exercises).
  • To satisfy sponsors and funding bodies
  • To preserve and archive the work
  • To curate the work

and more. The point is that there is no single purpose and therefore IR software and systems have to be able to cope with a lot of different demands.
The first-generation IRs (ePrints, DSpace, Fedora) addressed the reposition of single eManuscripts (“PDFs”) with associated metadata. This now seems to work quite well technically, although there are few real metrics about whether they enhance exposure and there is poor compliance in many institutions. There is also a major problem with some Closed Access publishers (e.g. in chemistry) formally forbidding Open reposition. So the major problems are social.
Recently the LIS community has started to highlight the possibility of repositing data. This is welcome, but needs careful thought – here are a few comments.
Many scholars produce large amounts of valuable data. In some cases the data are far more important than the full text. For example, Nick Day’s crystallographic repository CrystalEye contains 100,000 structures from the published literature and, although it links back to the text, can be used without needing to. This is also true of crystallographic data collected from departmental services, as in our SPECTRa system. With the right metadata, data can often stand alone.
All scientists – and especially those with sad experiences of data loss (i.e. almost all) – are keen for their data to be stored safely and indefinitely. And most would like other scientists to re-use their data. This needs a bit of courage: the main drawbacks are:

  • re-analysis could show the data or the conclusions were flawed
  • re-analysis could discover exciting science that the author had missed
  • journals could refuse to publish the work (isn’t it tedious to have to mention this in every post :-()

But many communities have faced up to it. The biosciences require deposition of many sorts of data when articles are published. A good example is the RCSB Protein Data Bank, which has a very carefully thought-out and tested policy and process. If you are thinking of setting up a data repository, this document (and many more like it) should be required reading. It works, but it’s not trivial – 205 pages. It requires specialist staff and constant feedback to and from the community. Here’s a nice, honest chunk:

If it is so simple, why is this manual so long?
In the best of worlds all information would be exchanged using both syntactically and semantically precise protocols. The reality of the current state of information exchange is somewhat different. Most applications still rely on the poorly defined semantics of the PDB format. Although there has been significant effort to define and standardize the information exchange in the crystallographic community using mmCIF, the application of this method of exchange is just beginning to be employed in crystallographic software.
Currently, there is wide variation in specification of deposited coordinate and structure factor data. Owing to this uncertainty, it is not practical to attempt fully automated and unsupervised processing of coordinate data. This unfortunately limits the functionality that can be presented to the depositor community through Web tools like ADIT, as it would be undesirable to have depositors deal with the unanticipated software failures arising from data format ambiguities. Rather than provide for an automated pipeline for processing and validating coordinate data, coordinate data processing is performed under the supervision of an annotator. Typical anomalies in coordinate data can be identified and rectified by the annotator in a few steps permitting subsequent processing to be performed in an automated fashion.
The remainder of this manual describes in detail each of the data processing steps. The data assembly step in which incoming information is organized and encoded in a standard form is described in the next section.

So data reposition is both highly desirable and complex. I’m not offering easy solutions. But what is not useful is the facile idea that data can simply be reposited in current IRs. (I heard this suggested by a commercial repository supplier at the Digital Scholarship meeting in Glasgow last year. He showed superficial slides of genomes, star maps, etc. and implied that all this could be reposited in IRs. It can’t, and I felt compelled to say so.)
At ETD2007 we had a panel session on repositing data. Lars Jensen gave a very useful overview of bioscientific data – here are my own points from it:

  • There is a huge amount of bioscience data in repositories.
  • These are specialist sites, normally national or international
  • There is a commitment to the long-term
  • much bioscience is done from the data in these repositories
  • the data in them is complex
  • the community puts much effort into defining the semantics and ontologies
  • specialist staff are required to manage the reposition and maintenance
  • there are hundreds of different types of data – each requires a large amount of effort
  • the relationships between the data are both complex and exceedingly valuable

All bioscientists are aware of these repositories (they don’t normally use this term – often “data bank”, “gene bank”, etc. are used.) They would always look to them to deposit their data. Moreover the community has convinced the journals to enforce the reposition of data by authors.
Some other disciplines have similar approaches – e.g. astronomers have the International Virtual Observatory Alliance. But most don’t. So can IRs help?
I’d like to think they can, but I’m not sure. My current view is that data (and especially metadata) – at this stage in human scholarship – have to be managed by the domains, not the institution. So if we want chemical repositories the chemical community should take a lead. Data should firstly be captured in departments (e.g. by SPECTRa) because that is where the data are collected, analysed, and – in the first instance – re-used. For some other domains it’s different – perhaps it might be at a particular large facility (synchrotron, telescope, outstation, etc.).
Some will argue that chemistry already operates this domain-specific model. Large abstracters aggregate our data (which is given for free) and then sell it back to us. In the 20th century this was the only model, but in the distributed web it breaks. It’s too expensive, and does not allow for community ontologies to be developed (the only Open ones in chemistry are developed by biologists). And it’s selective and does not help the individual researcher and department.
Three years ago I thought it would be a great idea to archive our data in our DSpace repository. It wasn’t trivial to put in 250,000 objects. It’s proving even harder to get them out (OAI-PMH is not designed for complex and compound objects).
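To make the OAI-PMH point concrete: the protocol hands back metadata records a page at a time as Dublin Core XML, which is fine for discovery but says nothing about moving a multi-gigabyte compound object. A rough harvesting sketch (the repository URL is a placeholder):

```python
# Minimal OAI-PMH harvest: page through Dublin Core records with ListRecords.
# Only metadata comes back - the compound data objects themselves are not part
# of the protocol. The base URL below is a placeholder.
import requests
from xml.etree import ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://repository.example.org/oai/request"

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
while True:
    root = ET.fromstring(requests.get(BASE_URL, params=params, timeout=30).content)
    for record in root.iter(OAI + "record"):
        identifier = record.find(f"{OAI}header/{OAI}identifier")
        if identifier is not None:
            print(identifier.text)
    token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
    if token is None or not (token.text or "").strip():
        break
    # Continue from where the previous page stopped
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```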
Joe Townsend, who works with me, will submit his thesis very shortly. He wants to preserve his data – 20 GBytes. So do I. I think it could be very useful for other chemists and eScientists. But where to put it? If we put it in DSpace it may be preserved but it won’t be re-usable. If he puts it on CD it requires zillions of actual CDs. And they will decay. We have to do something – and we are open to suggestions.
So we have to have a new model – and funding. Here are some constraints in chemistry – your mileage may vary:

  • there must be support for the development of Open ontologies and protocols. One model is to encourage groups which are already active and then transfer the maintenance to International Unions or Learned Societies (though this can be difficult when they are also income-generating Closed Access Publishers)
  • funders must make sure that the work they support is preserved. Hundreds of millions or more are spent in chemistry departments to create high-quality data and most of these are lost. It’s short-sighted to argue that the data only live until the paper publication
  • departments must own the initial preservation of this data. This costs money. I think the simplest solution is for funders to mandate the Open preservation of data (cf. Wellcome Trust).
  • the institutions must support the generic preservation process. This requires the departments actually talking to the LIS staff. It also requires LIS staff to be able to converse with scientists on equal terms. This is hard, but essential.

Where the data finally end up is irrelevant as long as they are well managed. There may, indeed, be more than one copy. Some could be tuned for discoverability.
So the simple message is:

  • save your data
  • don’t simply put it in your repository

I wish I could suggest better how to do this well.

Posted in data, etd2007 | 2 Comments

CML on ICE – towards Open chemical/scientific authoring

Because WWMM had outages my blogging is behind: I’d written a post on Peter Sefton’s ICE, but WWMM went to sleep and I haven’t re-posted it. Peter and I met at ETD2007 and immediately clicked. He has beaten me to it.
ICE is a content authoring tool based on Open Office. It works natively with XML and Subversion. So it adds a dramatic aspect to document authoring – versioning with full community access and collaboration (if required). For example, if Peter and I write a paper about this we’d use the ICE server at University of Southern Queensland to store the versions. And of course as it’s Open Source anyone can set one up – it would be ideal for the Blue Obelisk community to author papers with.
But what catalysed this was the possibility of authoring theses. Students are looking for imaginative approaches and many will be happy to be early adopters of this new technology. If the domain-specific components are in XML (or other standards) it becomes easy to integrate them into ICE. And it is fantastic to be able to revert to previous versions at any point – I find Subversion easier than Word change management, for example.
So some points from PeterS’s post:

View this page as PDF

jmolInitialize("http://ptsefton.com/files/jmol-11.2.0");

I mentioned before that at the ETD 2007 conference I met Prof Peter Murray-Rust. [1] We’re going to collaborate on adding support for CML, the Chemical Markup Language, to ICE, so that people can write research publications that include ‘live’ data.

[1] I’m just Petermr or PMR 🙂

Here’s a quick demo of the possibilities.
I went to the amazing Crystaleye service.

PMR: This is Nick Day’s site. We’d hoped to announce it formally a week or so ago but machine problems kept us back. But we’ll get some posts out this coming week. [We thank Acta Cryst/IUCr for a summer studentship which helped greatly to get it off the ground.]

The aim of the CrystalEye project is to aggregate crystallography from web resources, and to provide methods to easily browse, search, and to keep up to date with the latest published information.
http://wwmm.ch.cam.ac.uk/crystaleye/

Crystaleye automatically finds descriptions of crystals in web-accessible literature, turns them into CML and builds pages like Acta Crystallographica Section B, 2007, issue 03-00.
From that page I grabbed this two dimensional image of (C6H15N4O2)2(C4H4O6-2),
[image: graphics3]

PMR: Minor point: this is just the anion – there is a separate image for the cation. (The 3D structure below displays the cations as well.)

There’s a Java applet on the page that lets you play with the crystal in 3D. Here’s a screenshot of the 3D rendering.

[image: graphics2]
There’s lots more work to be done, but I thought I’d show how easy it is to make an ICE document that shows the 2D view for print, with the 3D view for the web, via the applet. Be warned, this may not work for you. The applet refuses to load in Firefox 2 for me, but it does work in Safari on Mac OS X. If you follow the ‘view this page in PDF’ link above you’ll see just the picture.

PMR: image and applet deleted here …

What’s happening here?
My initial hack is really simple. I grab the image and paste it into ICE like any other image, but then I link it to the CML source. I wrote a tiny fragment of Python in my ICE site to go through every page, and if it finds a link to a CML file containing an image, it adds code to load the CML into the Jmol applet. This is a kind of integration-by-convention, AKA microformat.
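PMR: for readers wondering what such a hook might look like, here is my own guess at a minimal version – not PeterS’s actual code – which scans rendered pages and attaches a Jmol loader to any image wrapped in a link to a CML file:

```python
# PMR: my guess at a minimal version of the hook PeterS describes above -
# not his actual code. For each image wrapped in a link to a .cml file,
# emit a small script that loads the CML into the Jmol applet.
from pathlib import Path
from bs4 import BeautifulSoup   # assumes BeautifulSoup 4 is installed

def add_jmol_viewers(html_path):
    soup = BeautifulSoup(Path(html_path).read_text(encoding="utf-8"),
                         "html.parser")
    for link in soup.find_all("a", href=lambda h: h and h.endswith(".cml")):
        if link.find("img") is None:
            continue   # integration-by-convention: an image inside a CML link
        script = soup.new_tag("script", type="text/javascript")
        # jmolApplet(size, script) comes from Jmol.js, set up by jmolInitialize()
        script.string = 'jmolApplet(300, "load %s");' % link["href"]
        link.insert_after(script)
    Path(html_path).write_text(str(soup), encoding="utf-8")

for page in Path("site").rglob("*.html"):   # "site" is a placeholder directory
    add_jmol_viewers(page)
```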

The main bit of programming only took a few minutes, but sorting out where to put the CML files and the Jmol applet, and integrating the changes into this blog took ages. I ended up putting the files here on my web site which meant putting a big chunk of stuff into subversion, something that should have been done ages ago, but the version of svn that runs on my other server refuses to do large commits over HTTPS ‘cos of some SSL bug and I can’t figure out how to update it which meant switching the repository to use plain HTTP, and so on. It wasn’t made easier by me mucking around with the Airport Extreme router and our ADSL modem at the same time, halting internet access at home for a couple of hours.
To make this integration a bit more usable and robust we want to:

  • Work out a workflow that lets you keep CML files in ICE and easily drop images in to your documents, letting ICE render using the applet when it makes HTML.
  • Integrate forthcoming work from Peter & team that will provide high quality vector graphics instead of the PNG files I’m using now.

PMR: I have now hacked JUMBO so it generates SVG images of 2D (and soon 3D) molecules. Note that this then allows automatic generation of molecular images in PDF files (through FOP/SVG).

  • Investigate embedding CML in an image format such as EPS that word processors understand.
  • Generalize this approach for other e-scholarship applications. We’re working with the Alive team at USQ on this.
  • Talk to the DART & ARCHER teams.

I am also extremely keen to talk to these teams – they are doing very similar and complementary work to our SPECTRa and SPECTRa-T projects in capturing scientific data at source.
I am impressed by the Australian commitment to Open Access, Open Data and collaborative working. ICE is an excellent example of how we can split the load. The ICE team likes working with the technical aspects of documents (I don’t really, though I have to). The Blue Obelisk likes working with XML in chemistry. The two components naturally come together.
This is something I have been waiting for for about 12 years. We haven’t got there yet, but we are well on the way.

Posted in "virtual communities", blueobelisk, chemistry, data, etd2007, programming for scientists | 2 Comments

Author Choice in Chemistry at ACS – and elsewhere?

A number of closed access journals/publishers have brought out “Author Choice” and similar approaches where authors pay publishers for “open access”. The details probably vary from publisher to publisher and I have been idly looking for examples in chemistry.
It is actually extremely difficult to find articles on the basis of their access rights, and you might think that I am weird to take this approach – surely the content is more important than the metadata? But I am interested in material that my robots can do exciting things with and I will take just about anything – chemistry is a desert for open text-mining (BJOC and Chemistry Central and a few others excepted).
I have found the first example of this in chemistry, alerted – of course – by the blogosphere, in this case The ChemBlog. The post casually announces JACS ASAP AUTHOR CHOICE, which points to a graphical abstract. The abstract links to a fuller abstract, the HTML and the PDF. It also announces that this is an author choice paper and links to:

ACS AuthorChoice: Articles bearing the ACS AuthorChoice logo have been made freely available to the general public through the ACS AuthorChoice option.
The ACS AuthorChoice option establishes a fee-based mechanism for individual authors or their research funding agencies to sponsor the open availability of their articles on the Web at the time of online publication. Under this policy, the ACS as copyright holder will enable unrestricted Web access to a contributing author’s publication from the Society’s website, in exchange for a fixed payment from the sponsoring author. ACS AuthorChoice also enables such authors to post electronic copies of published articles on their own personal websites and institutional repositories for non-commercial scholarly purposes.

and in further information:

The American Chemical Society’s Publications Division is pleased to announce an important new publishing option in support of the Society’s journal authors who wish or need to sponsor open access to their published research articles. The ACS AuthorChoice option establishes a fee-based mechanism for individual authors or their research funding agencies to sponsor the open availability of their articles on the Web at the time of online publication. Under this new policy, to be implemented later this Fall [2006, PMR], the ACS as copyright holder will enable unrestricted Web access to a contributing author’s publication from the Society’s website, in exchange for a fixed payment from the sponsoring author. ACS AuthorChoice will also enable such authors to post electronic copies of published articles on their own personal websites and institutional repositories for non-commercial scholarly purposes.
The base fee for the ACS AuthorChoice option will be set at $3,000 during 2006-2007, with significant discounts applied for contributing authors who are members of the American Chemical Society and/or who are affiliated with an ACS subscribing institution. The fee structure will be as follows:
$3,000: Base Fee (authors who are neither ACS members nor affiliated with an ACS subscribing institution)

I first went to the abstract:

J. Am. Chem. Soc., ASAP Article 10.1021/ja070003c S0002-7863(07)00003-0
Web Release Date: June 22, 2007
Copyright © 2007 American Chemical Society
A Red Cy3-Based Biarsenical Fluorescent Probe Targeted to a Complementary Binding Peptide
Haishi Cao, Yijia Xiong, Ting Wang, Baowei Chen, Thomas C. Squier, and M. Uljana Mayer*
Cell Biology and Biochemistry Group, Pacific Northwest National Laboratory, Richland, Washington 99352
uljana.mayer@pnl.gov
Received January 1, 2007
Abstract:
We have synthesized a red …peptide motifs.
[Full text in html]
[Full text in pdf]

There are several aspects of this:

  • The abstract (and the full text) is copyright ACS.
  • There is no mention in the abstract that this is an Author Choice publication
  • or in the HTML or PDF full text
  • the material cannot be used for commercial purposes
  • The link from the abstract to the HTML goes to the toll-access login – i.e. this link is closed.
  • as is the PDF

Indeed I thought the whole paper was closed until I realised that the Open Access was possible only through the graphical abstract. (To be fair, this is what the DOI – at present – points to, so the open access version would be found in Google through the DOI.) I cannot see any reason why the full abstract should only point to closed versions of the paper. Indeed I cannot see any reason why there are closed versions of the paper at all.
I hope this is an oversight rather than deliberate. I know and respect people at ACS and I know they read these messages. But the value that they have offered to the authors – who may have paid 3000 USD – seems minimal. They have insisted that the authors donate their intellectual property to the ACS, advertised the Openness of the article in only one place (and not in the article itself), and restricted the use of the article (copyright probably restricts my holding the article on my machine for serious computation). It is not in the spirit or letter of the Budapest Declaration.
I have seen very little evidence of “Open Choice” and other schemes having any impact in chemistry. Technically the publishers can claim that they offer “Green” Open Access for a fee. I suspect – though I don’t know – that other commercial publishers of chemistry such as Springer and Wiley have similar low uptakes of the schemes. I have no idea whether this worries them – they get the subscription income anyway. Certainly I don’t get the impression that they intend to change the publishing model this way. I have written to Springer asking for details on openness of data in their Open Choice policy and what the take-up is in chemistry but haven’t heard back – maybe this blog will elicit a response. And, indeed, I’d be delighted to hear from any other closed access publishers – they will get a considered response.
Although I support Open Access my energies are in Open Data, so I remain fairly quiet on the mechanisms and business models whereby OA may be achieved. The level of OA offered by these mechanisms is totally unsatisfactory for the modern semantic world. (That’s why I am turning to theses.)
[NOTE: If you reply please use a good ratio of text to links. I get 500 spam per day and although Akismet is excellent it occasionally consigns posts with a high link density to the spambin.]

Posted in chemistry, data, open issues | Leave a comment

The power of the scientific eThesis

This is the summary of a presentation I am giving tomorrow at ETD2007 (run by the Networked Digital Library of Theses and Dissertations). I’m blogging this as the simplest way of (a) reminding me what I am going to say and (b) acting as a very rough record of what I might have presented. (My talks are chosen from a menu of 500+ possible slides and demos and I don’t know which until the start of the presentation, so it’s very difficult to have a historical record. The blog carries the main arguments.)
Main themes (many of which have been blogged recently):

  • the thesis need not be a dull record of a final result but a creative work which lives and evolves until and beyond the “final submission”
  • theses should be semantic and interactive, supported by ontologies and go beyond “hamburger PDF”. Theses are computable.
  • We must develop communal semantic authoring/creation environments and processes.
  • the process should move rapidly towards embracing open philosophies and methodology. Metadata and ontologies should be open.
  • young people should be actively involved in all parts of managing the thesis process.
    (Harvard Free Culture)
  • “Web 2.0” will transform society and therefore the academic process. We must be prepared for this.
  • It is not clear that current approaches to “repositories” will help rather than hinder innovation and dissemination of eTheses. They will only be useful for preservation if they are semantic.

In detail, scientific theses need support for authoring and validating (a minimal validation sketch follows the list below):

  • thesis structure (templating) – e.g. USQ’s Integrated Content Environment ICE system which supports XML/”Word”
  • MathML
  • SVG (graphics)
  • CML (Chemistry)
  • GML (maps)
  • Numeric data (various, including CML)
  • graphs (various, including CML)
  • tables (various, including CML)
  • scientific units (various, including CML)
  • ontologies and dictionaries (various, including CML)
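
As one small example of what machine validation could mean, here is a sketch that counts the semantic islands (MathML, SVG, CML) in an XHTML thesis. The namespaces are the standard ones; the file name is a placeholder, and a real validator would go on to check each island against its schema or dictionary.

```python
# Sketch: count the semantic islands in an XHTML thesis by namespace.
# A real validator would go on to check each island against its schema
# (MathML, SVG, CML...); "thesis.xhtml" is a placeholder file name.
from collections import Counter
from lxml import etree

NAMESPACES = {
    "http://www.w3.org/1998/Math/MathML": "MathML",
    "http://www.w3.org/2000/svg": "SVG",
    "http://www.xml-cml.org/schema": "CML",
}

doc = etree.parse("thesis.xhtml")
counts = Counter()
for element in doc.iter():
    if not isinstance(element.tag, str):
        continue   # skip comments and processing instructions
    qname = etree.QName(element)
    if qname.namespace in NAMESPACES:
        counts[NAMESPACES[qname.namespace]] += 1

for language, n in counts.items():
    print(f"{language}: {n} element(s)")
```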

Some exciting thesis projects:

Why PDF is so awful: Organic Theses: Hamburger or Cow?
Subversion (CML project)
Wikipedia – caffeine – (info boxes)
GoogleInChI – semantic chemical search without Google knowing
The power of the semantic Web –dbpedia.org – Using Wikipedia as a Web Database.
Chemical blogspace – overview of exciting developments in chemistry
Local demos including analysis of theses:

What should institutions and NDLTD do to promote this vision?

  • involve young people in all parts of the process – understand Web 2.0 culture and democracy. Be brave
  • help promote their vision against the conservatism of institutions, learned societies and commercial interests
  • promote thesis creation as a complete part of the research process. Start on day 0 with tools and encouragement. Get students from year+1 to explain the vision
  • Harness the power of social computing (Google, Flickr, Wikipedia, etc.). You will have to anyway. Give credit for innovation in this area
  • Co-develop semantic authoring tools, including scientific languages. Use rich clients for display.
  • Promote the use of ontologies and similar resources as integral parts of the scholarly process. Insist on marked up information and entities
  • Use software to validate data in theses. Give these tools to examiners.
  • insist that data belongs to the scientific community. Use creative commons licenses from day 0.

and overall… Use the power of the scholarly community to show that it can communicate science far better than the absurd e-paper, unacceptable business models, and repression of innovation that are forced on us by the commercial and pseudo-commercial publishers. Destroy the pernicious pseudo-science of citation metrics. Reclaim our scholarship.

Posted in etd2007, open issues | Leave a comment

"open access" – some central questions

I am grateful for the recent correspondence from Peter Suber and Stevan Harnad as it helps me get my thoughts in order for ETD2007. In response to Stevan:
Open Access: What Comes With the Territory,
Peter has analysed the central question very clearly (as always).

I expect that all of us will agree with the analysis below. The position each of us takes may vary:

Summary [of Stevan’s post]: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author’s own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author’s (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia — all the wrong stuff and irrelevant to OA.

Comments

  • Stevan is responding to Peter Murray-Rust’s blog post from June 10. But since I agreed with most of what Peter MR wrote, I’ll jump in.
  • Stevan isn’t saying that OA doesn’t or shouldn’t remove permission barriers. He’s saying that removing price barriers (making work accessible online free of charge) already does most or all of the work of removing permission barriers and therefore that no extra steps are needed.
  • The chief problem with this view is the law. If a work is online without a special license or permission statement, then either it stands or appears to stand under an all-rights-reserved copyright. The only assured rights for users are those collected under fair use or fair dealing. These rights are far fewer and less adequate than OA contemplates, and in any case the boundaries of fair use and fair dealing are vague and contestable.
  • This legal problem leads to a practical problem: conscientious users will feel obliged to err on the side of asking permission and sometimes even paying permission fees (hurdles that OA is designed to remove) or to err on the side of non-use (further damaging research and scholarship). Either that, or conscientious users will feel pressure to become less conscientious. This may be happening, but it cannot be a strategy for a movement which claims that its central practices are lawful.
  • This doesn’t mean that articles in OA repositories without special licenses or permission statements may not be read or used. It means that users have access free of charge (a significant breakthrough) but are limited to fair use.

PMR: “The chief problem with this view is the law”. That puts it precisely, and that’s where Stevan and I differ. At the moment I think we have to work within the law, and I think the law debars me from crunching. There may come a time where we feel that civil disobedience is unavoidable but it hasn’t arrived yet – if it does I shall be there.
And some comments on other parts of Stevan’s post:

Get the Institutional Repository Managers Out of the Decision Loop

The trouble with many Institutional Repositories (IRs) (besides the fact that they don’t have a deposit mandate) is that they are not run by researchers but by “permissions professionals,” accustomed to being mired in institutional author IP protection issues and institutional library 3rd-party usage rights rather than institutional author research give-aways.

PMR: I have had similar thoughts. I got the distinct impression that some IRs are run like Victorian museums – look but don’t touch. The very word “repository” suggests a funereal process – it’s no surprise that having put much of my stuff into DSpace I find it’s an enormous effort to get it out. Why don’t we build “disseminatories” instead?
[Stevan’s analysis of how we should deposit papers omitted. I don’t disagree – I’m just more interested in data at present.]

Now, Peter, I counsel patience! You will immediately reply: “But my robots cannot crunch Closed Access texts: I need to intervene manually!” True, but that problem will only be temporary, and you must not forget the far larger problem that precedes it, which is that 85% of papers are not yet being deposited at all, either as Open Access or Closed Access. That is the inertial practice that needs to be changed, globally, once and for all.

PMR: Here we differ. In many fields there has been little movement and no Green journals. We could wait another five years for no effect. But my main concern is the balance between Green access and copyrighted data. The longer we fail to address the copyrighting of data the worse the situation will become. Publishers are not stupid – they have revenue-oriented business people working out how to make money out of our data – Wiley told me so. Imagine, for example, that a publisher says “I will make all our journals green as long as we retain copyright. And we’ll extend the paper to cover the whole of the scientific record”. That would be wonderful for Stevan and a complete disaster for paper-crunchers. We can’t afford to wait for that to happen.

Just as I have urged that Gold OA (publishing) advocates should not over-reach (“Gold Fever”) — by pushing directly for the conversion of all publishers and authors to Gold OA, and criticizing and even opposing Green OA and Green OA mandates as “not enough” — I urge the advocates of automatized robotic data-mining to be patient and help rather than hinder Green OA and Green OA (and ID/OA) mandates.

PMR: I am not – I hope – hindering Green access. I am not personally agitating for Green or Gold – my energies go into arguing that the experimental process must not be copyrighted by the publisher or anyone else. And that institutional repositories should start to be much, much more proactive and actively support the digital research process.

Posted in etd2007, open issues | 1 Comment