ChEBI

ChEBI has been very important to us – here’s Duncan Hull:

Building a Better ChEBI

Chemical Entities of Biological Interest, ChEBI, is a freely available dictionary [1] of molecular entities, especially small chemical compounds. Like all big dictionaries and ontologies, it has its own unique challenges. Fortunately, those nice people at the EBI are holding a workshop to discuss future developments in ChEBI. In preparation for the workshop, here are some brief notes on how ChEBI could be made better. [Disclaimer: I’m fairly new to ChEBI and “thinking out loud” here; add comments below if I’ve said anything stupid or wrong]

ChEBI: Too much, too young?

Some dictionaries try to describe too much. When it comes to writing down knowledge, it isn’t always easy to know where to stop. To define scope, the BI in ChEBI stands for “Biological Interest”. This raises the question: why does ChEBI describe all sorts of subatomic particles that are of little (or no) biological relevance? While electrons (ChEBI:10545) and protons (ChEBI:24636) play an important role in biology, you have to wonder what the biological interest of neutrinos (ChEBI:36352) and bosons (ChEBI:36341) is. Who decides what is “biologically interesting”, and how?
Then there is the inescapable legacy of IUPAC, with which ChEBI aligns itself closely; unfortunately IUPAC nomenclature is a bit dated and cumbersome (or so I’m told).

ChEBI: I just can’t get enough?

Some people are never happy. Take any dictionary or ontology and they will pick holes in it. “It doesn’t say this, it doesn’t say that, this is wrong” etc. In no particular order:

If I missed anything off the list of things that are “wrong” with ChEBI, please let me know. If you’re going to the workshop, see you there (along with Christoph[e] Steinbeck and Kirill Degtyarenko, I suppose).

PMR: I shan’t be at EBI but I think some of us will be. Also Christoph (sic) visited us last week. ChEBI is an example of the sort of chemical ontology that should be commonplace already in mainstream chemistry but isn’t. Crystallography has its CIF dictionaries, bioscience has GO and OBO, but chemistry? Yes, we have the IUPAC Gold Book, which has been lovingly crafted into XML. But it’s not an ontology – it’s a rather haphazard terminology whose structure amounts to little more than random entailment.
So – as always – the bioscientists are putting the chemists to shame and eating their lunch. Public databases of chemical structure? Pubchem. Public databases of chemical reactions? KEGG. Public abstracts of chemical papers? Pubmed. And what do the chemists think? Most don’t even know these exist.
There is no doubt that the bioscientists are getting more interested in chemistry. They’ll start developing tools and databases. And if those happen to be built on this side of the Atlantic – as ChEBI is – there is no way that lobbying Congress will get them closed down.
A bioscientist asked me today why she couldn’t get a free tool that translated IUPAC names into structures. IUPAC doesn’t have the resources to do this sort of thing – it works by voluntary labour. Whereas commercial and quasi-commercial chemical databases turn over more than 1 billion per year. Bioscience funds the infrastructure of informatics. Chemistry seems hell-bent on staying in the dark ages.
I’ve got a shopping list of what I’d like to see chemical bioinformatics create. And I’d be happy to talk to those interested in getting them off the ground.


Interaction with Chemspider

Antony Williams (Chemspiderman) has posted a useful comment on this blog under “I am still DELIGHTED with Chemspider” (May 10th, 2008) – [I sometimes have trouble with permalinks to comments]. I’ll pick up some points and reply, but first let me say that there is a loose and hopefully ongoing synergy between our two sites, which is not a definite project but which may from time to time turn into one.

ChemSpiderMan Says:
May 13th, 2008 at 5:56 am
Peter, I thank you for the applause regarding our implementation of licensing on ChemSpider. I also acknowledge and accept the apology you have issued publicly to John Wilbanks, ChemSpider and members of the advisory board.

PMR: This episode appears to have had a silver lining in places, and I’ll blog that separately under Open Data.

I believe that some good has likely come out of the conversations over the weekend – maybe a little more confusion, maybe a little more clarification (especially around John’s “data in the public domain” comments) and maybe a few more relationships. This latter part is especially of interest to me as we work on creating a community for chemists.
Now to the outcome for ChemSpider. ChemSpider went live in March of last year with a “who knows where it will go” approach. From the moment we went live you have paid attention. However, rarely has this been with any sense of support but, rather, a framework of negativity. You have criticized our science and our intent. You have projected your judgments as truths. I have addressed these judgments many times but rarely with acknowledgment from your side. It has been a lot of work for both of us. To be clear, I have judged your efforts around Open Notebook Science for NMR similarly.

PMR: I am putting the past behind us.

ChemSpider appears to have a center spotlight now in terms of licensing and Open Data. I acknowledge these are significant parts of YOUR agenda and a key part of what you have worked on for many years. I judge your other agendas to be Open Access, Semantic Web and associated technologies. I honor your work in these areas and feel you have contributed and will continue to contribute to the ongoing shifts of Open science prevailing at present. Thank you.
Our agenda for ChemSpider is different. We are building a community for chemists (notice the recent shift from the original vision “Building a Structure Centric Community for Chemists” as we expand beyond structures only). At present, we are doing what we can to support the needs of chemists researching structure-based information. We are integrating information. We are more than a “linkbase”. We are actively supporting Open Notebook Science. We ARE listening to our users, the community, our collaborators and our advisory group. We have delivered a valuable solution in the past year with no cost to the users or the tax-payers, with no grants, and based on the hard work of a small dedicated team of volunteers only.
[…] I believe that you judge our efforts to be in conflict with those of your WorldWide Molecular Matrix but I doubt that is true.

PMR: I don’t think they are in conflict. The WWMM is not a fixed concept and evolves. Indeed part of its philosophy was that it was a peer-to-peer system with no centre. I still believe that to be true. What unites the components is a shared sense of purpose and a shared technological infrastructure.

I will respond shortly to some historical posts regarding your call for a structure collection for your eChemistry Project with Microsoft. We are willing to help and I am open to a discussion should you wish to collaborate. I am working with the Wikipedia:Chemistry team to build a validated SDF file for the public domain and we can make this available to you.

PMR: The current position is that we would like to collect a set of common compounds which have high-quality data and which are likely to emanate from trusted sources. Wikipedia is one, Pubchem is another, MSDS are another, plus various other Open web pages. We wish to make sure that this information is consistent (which is not necessarily the same as correct), and this is not easy. We are very happy to use Chemspider as one conduit for this, but as with all sources we have to be sure of the quality. Chemspider adds quality through human checking, and it’s valuable to have an audit trail. Some of the sources (such as the ICSC MSDS) are much worse than we thought – despite claiming to be “peer-reviewed” there is a significant percentage of identifiable errors, e.g. molecular masses that do not match formulae and formulae that do not match names. We hope to assemble information from a number of sources and use RDF to check the consistency (it can generally never check correctness).
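Purely as a minimal sketch of the kind of check meant here (in Python; the atomic-weight table, tolerance and function names are illustrative, and real formulae need a far more capable parser):

    import re

    # Illustrative subset of atomic weights (g/mol); a real check needs a full table.
    ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                      "S": 32.06, "Cl": 35.45}

    def mass_from_formula(formula):
        """Compute a molecular mass from a simple formula string such as 'C6H6'."""
        total = 0.0
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
            total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
        return total

    def consistent(formula, stated_mass, tol=0.05):
        """Flag entries whose stated molecular mass disagrees with their formula."""
        return abs(mass_from_formula(formula) - stated_mass) <= tol

    print(consistent("C6H6", 78.11))  # True: benzene's stated mass matches its formula
    print(consistent("C6H6", 92.14))  # False: 92.14 is toluene's mass - flag the record

A check like this catches the mass/formula mismatches; name/formula mismatches need a name-to-structure tool and are much harder.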
There is a general problem with robotic aggregation of information – it can highlight agreement or disagreement, but it can also introduce noise. Thus Pubchem has a great deal of noise and there is no simple robotic means of removing it. Indeed (though I can’t find it) I think Shannon has a theorem that proves that machines cannot guarantee the correctness of information. In a similar vein, the interpretation of names can add significant noise, not just because of ambiguity but also through generic use and metonymy. Peter Corbett uses the example of “a pyridine”, which almost certainly means something other than C5H5N. I’ll probably write more later.
We are preparing our CrystalEye data in a form in which it can be reused by you and others. It’s harder than it looks – partly because crystallography and chemistry are not the same, and partly because there is no system of unique identifiers in the original data. But Jim has, or will have, tools to access it.
Best


Chemical compounds in Wikipedia

Wikipedia is (rightly) becoming the first place that people look for well-understood scientific information, including chemistry. Chemical compounds are particularly suited to this: the concept is over 150 years old and it is universal practice to index parts of chemistry through compounds. In most cases the compound can be given one or more identifiers, though the relationships between these can be complex. Examples are names, serial numbers and other arbitrary IDs, and chemical structures.
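To make the identifier tangle concrete, here is a hypothetical record for one compound (a Python sketch; the values shown are illustrative of the identifier types, and the mapping is many-to-many in general – one trivial name may denote several structures, and one structure carries many names):

    # One compound, several identifier systems (caffeine, for illustration only)
    caffeine = {
        "wikipedia_title": "Caffeine",             # human-readable; pages can be renamed
        "cas_number": "58-08-2",                   # arbitrary registry serial number
        "formula": "C8H10N4O2",                    # not unique: isomers share a formula
        "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # encodes the chemical structure
    }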
Recently two derivative works of WP compounds were announced: one from ChemSpider (CS) and one from Metamolecular (MM).

This post is primarily to welcome these developments and add some general comments.

  • The style of the two sites is different and they appear to be completely independent. They are somewhat complementary: CS integrates the entries into a datacentric format; MM describes entries as monographs and has an emphasis on text and images. Neither site references the other AFAICS.
  • I think both sites use the WP title and URL as the primary identifier. WP also has a set of numeric identifiers which I think represent the internal WP uniquification system. This may matter at some point, as WP entries can be deleted or moved while the identifiers are sacrosanct.
  • Both sites have a search capability (I have not compared them). I may have missed it but there was no clear way to download results.
  • It is not clear what the ingestion strategy is for either site. MM has a mechanism for humans to ingest entries at the same time as they author them on WP.
  • I am not clear what data transformation (if any) is carried out automatically by the ingest process. Data in WP Infoboxes is still variable (the DBPedia 2008-02 release shows at least 4 different syntaxes for molecular mass). An ingestion program either has to deal with all lexical variants (quite a problem) or simply ingest the raw string. There is also potential confusion between minus signs, hyphen-minus characters, negative values and ranges, and scientific units are not always easy to extract; a sketch of such normalisation follows this list.
  • Does either site have an RSS feed for new entries?
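Here is the promised sketch of what normalisation of molecular-mass strings might look like (Python; the variant forms handled are my assumptions about what infoboxes contain, not an exhaustive survey):

    import re

    def parse_mass(raw):
        """Return (low, high) in g/mol from an infobox-style molecular-mass string."""
        s = raw.replace("\u2212", "-").strip()               # Unicode minus -> hyphen-minus
        s = re.sub(r"\s*(g/mol|g\s*mol-1|Da|u)\s*$", "", s)  # strip common unit spellings
        m = re.match(r"^(\d+(?:\.\d+)?)\s*[–-]\s*(\d+(?:\.\d+)?)$", s)
        if m:                                                # a range such as "180.1–180.2"
            return float(m.group(1)), float(m.group(2))
        return float(s), float(s)                            # a single value

    print(parse_mass("78.11 g/mol"))  # (78.11, 78.11)
    print(parse_mass("180.1–180.2"))  # en-dash range: (180.1, 180.2)

The alternative – ingesting the raw string – simply defers the problem to every downstream consumer.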

Wikipedia has about 5000 compounds (the number is fuzzy because most people would not include proteins, probably not peptides, and nucleic acids). There are also many substances which describe a range of constituents, such as petrol, polystyrene and many solid-state compounds.
I have, in the past, downloaded WP data from the lists of organic and inorganic compounds (this totals considerably less than the ca. 5000 in the two derivative sites). Is there a central page, preferably with RSS or a watch list, which lists those entries primarily considered to be chemical compounds?
Our own work on collections of common compounds using RDF is progressing well, though it has been technically harder than we thought, mainly due to variability in data input. We will use and gratefully acknowledge material from the sites above, and particularly from DBPedia (though there needs to be continued work in standardising the infoboxes to give consistent semantics). It is, however, critical that the process of copying or transclusion does not introduce errors (which I suspect is likely until there are consistent infoboxes). We shall, of course, make our results freely and Openly available, modulo the difficult issues which have been raised about data sharing and re-use.
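As a hedged illustration of what “use RDF to check consistency” can mean in practice (the namespace, the predicate name and the use of the rdflib library are my assumptions about one plausible implementation, not a description of our actual code):

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/chem/")          # hypothetical namespace
    g = Graph()

    benzene = URIRef("http://example.org/compound/benzene")
    # The same property asserted by two sources whose values disagree.
    g.add((benzene, EX.molecularMass, Literal(78.11)))  # e.g. from a Wikipedia infobox
    g.add((benzene, EX.molecularMass, Literal(78.5)))   # e.g. from an MSDS sheet

    # Flag subjects whose sources disagree. This checks consistency, not correctness:
    # unanimous sources can still all be wrong.
    for s in set(g.subjects(EX.molecularMass, None)):
        values = {v.toPython() for v in g.objects(s, EX.molecularMass)}
        if len(values) > 1:
            print("Inconsistent molecular mass for", s, ":", sorted(values))

Merging sources into a common data model makes the disagreements visible; deciding which source is right remains a human job.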


Peter Suber on the definition of OA

Peter Suber has again been foiled by our WordPress comment system, so I copy his latest comment here and respond to it.

Hi Peter[MR]:  Some people objected to “weak OA” on the ground that it disparaged some difficult and significant achievements.  Some objected to “strong OA” on the ground that it glorified some weak or not-very-open variations on the theme.  You’re clearly in the second camp:  “I feel deeply unhappy about the use of ‘strongOA’ to describe something which has most of its permission barriers still in place….”
Both objections are justified, which is why I’m no longer using the terms.  However, as I said in the passage you quote, the distinction itself (between removing no permission barriers and removing some) remains important, widely accepted, and non-controversial.  I’m currently leaning toward terms that are purely descriptive and carry no judgments – such as “gratis” and “libre”, which have the advantage of expressing exactly this distinction in the world of software.
But even with neutral terms, we must accept the fact that there’s more than one permission barrier to remove and therefore more than one degree or kind of “strong” or “libre” OA.  The neutral term for that *family* of OA, therefore, will not be synonymous with any single flavor, such as CC-BY or CC-NC.  It will still be the case, as it always has, that the most precise way to refer to a single flavor is to refer to a license.
You add:  “If I were a funder wishing to support OA I would have little idea what I should be campaigning for.”  Funders should focus on substance, not labels, just as researchers should.  If they want to remove price barriers alone (which is the goal of most funder policies to date), they should do so.  If they want to remove some permission barriers as well (which some now do), they should decide which ones.  The unitary label “OA” never made these decisions easier and a plurality of labels for different varieties of OA doesn’t make the decisions harder.
For example, no funder has ever required BBB OA for its funded research, even if we think they should have and even if they were among those who insisted that “OA” was synonymous with “BBB OA”.
I’m looking for clear and neutral terms for different types of openness precisely so that we can talk about our substantive policy options more clearly.  As I said in my original post on strong and weak OA, the term “OA” is now used ambiguously for both species.  This is a fact of usage and it hurts communication.  Clear and neutral terms solve the problems caused by ambiguity and shouldn’t affect our thinking about substance and strategy at all – except to make it clearer.

PMR: This is very helpful. Yes, I am in the camp which thinks that we should be campaigning for BBB, but I accept that there are others who believe that goals may often (even most often) be more modest. Again I hope it’s clear that less-than-BBB can be intensely frustrating for scientists, whereas for many disciplines self-archiving is adequate.
As you say, “strongOA” – and worse “full-OA” – can be misleading. Let’s take a real example, the ACS’s AuthorChoice, which you blogged (http://www.earlham.edu/~peters/fos/newsletter/04-02-07.htm) and which I asked the community to define a week or two ago. AuthorChoice is an option where authors pay 2000-3000 USD (depending on membership, etc.) for the right to have their article published immediately and permanently on the publisher web site. It clearly comes under Stevan’s/your definition of weakOA (currently a placeholder term). It ticks all Stevan’s boxes – immediate, permanent.
It also removes precisely one permission barrier – the right to post a copy on web pages or in an IR. It does not remove other permission barriers to potential uses such as:

  • creating material for teaching/coursebooks, etc.
  • datamining
  • textmining
  • republishing of images, e.g. in books and research papers
  • inclusion in novel works through re-use (mashup)
  • re-purposing (e.g. through novel display technologies, etc.)

(this is not exhaustive, but gives an idea of the things I am interested in).
Under the new (temporary) definitions this would be classified as “strongOA”, as would CC-BY and BBB. You can see why I would object to terms such as “fullOA” or “reuseOA”, as these would be seriously misleading. A term should not be capable of facile misinterpretation. I’m not likely to be very much help here, but I still think there should be (at least) three terms – fooOA, barOA, BBB-OA – where the first two replace weak and strong.
Here’s what you said – accurately – about the ACS scheme, where you calibrated it against your nine points. Several of these are not about permission barriers and so aren’t relevant. You also, accurately, noted that this was not very different from greenOA in its final effect. The tone of your comments (with which I agree) implied that you felt that authors (and implicitly readers) were not getting a particularly good deal. I would feel that “strong OA” could be taken as somewhat flattering. It might be logical, but it isn’t comfortable:

PeterS: Paying for green open access
Two announcements in March [2007] showed that some publishers want to charge for OA archiving and at least one foundation is willing to pay for it.  Neither amounts to a trend, but both could slow the progress of green OA, either by the direct imposition of new and needless costs or by confusing policy-makers about the economics of green OA.
First the American Chemical Society (ACS) re-announced its hybrid journal program, AuthorChoice, and reminded us that authors who wish to self-archive must pay the AuthorChoice fee.  Then Elsevier and the Howard Hughes Medical Institute (HHMI) agreed that when an HHMI-funded author publishes in an Elsevier journal, HHMI will pay Elsevier a fee to deposit the peer-reviewed postprint in PubMed Central six months after publication.  Here’s a closer look at each policy.
* The ACS AuthorChoice program
ACS first announced AuthorChoice in August 2006 and re-announced it in March 2007.  I looked for a change in the policy that might explain the second announcement but couldn’t find one.  Perhaps the uptake was lower than ACS expected.  Perhaps it simply wanted to remind people of something it feared was being overlooked.  (I know the feeling.)
ACS press release announcing AuthorChoice, August 14, 2006
http://pubs.acs.org/pressrelease/author_choice/
http://www.earlham.edu/~peters/fos/2006_09_03_fosblogarchive.html#115747997209615778
ACS press release re-announcing AuthorChoice, undated but c. March 6, 2007
http://pubs.acs.org/4authors/authorchoice/press_release.html
http://www.earlham.edu/~peters/fos/2007_03_04_fosblogarchive.html#117328108051594942
To review AuthorChoice, I’ll draw on my nine questions for hybrid journal programs from SOAN for September 2006
http://www.earlham.edu/~peters/fos/newsletter/09-02-06.htm#hybrid
Of the nine, the ACS gives a good and welcome answer to just one:  it will let authors deposit articles in repositories independent of the ACS.  It gives unwelcome answers to three more:  it does not let participating authors retain copyright; it does not promise to reduce its subscription prices in proportion to author uptake (hence using the double-charge business model); and it will charge its AuthorChoice fee even to authors who want to self-archive.  It leaves us uncertain on the remainder:  Will it let participating authors use OA-friendly licenses?  Will it waive fees in cases of economic hardship?  Will it force authors to pay the fee if they want to comply with a prior funding contract mandating deposit in an OA repository?  Will it lay page charges on top of the new AuthorChoice fee?
The ACS was not a green publisher before adopting AuthorChoice.  Hence, its current position, disallowing no-fee, no-embargo self-archiving, is not a retreat.
Nor is the ACS the first publisher to charge a fee for self-archiving.  About a week before the ACS announced AuthorChoice, Wiley announced its hybrid program, Funded Access, which has the same effect.  Hence the ACS is the second, and so far Wiley and the ACS seem to be alone in this category.
Wiley charges $3,000 for OA archiving and deposits the published edition immediately upon publication.  The ACS charges the same fee for the same benefit, but also offers discounts for ACS members and those affiliated with institutions subscribing to ACS journals.
At both publishers, these fees pay for gold OA, and I should make clear that I have no objection to charging for gold OA.  On the contrary; if we are to have it, we must pay for it (through author-side publication fees, institutional subsidies, or some other way).  However, I do object to charging for gold OA when authors only want green OA.  It’s like offering a car with a free bicycle to people who only want to buy a bicycle.
(BTW, by “green OA” I mean OA through a repository and by “gold OA” I mean OA through a journal.  Gold OA includes peer review and green OA does not.  Gold OA begins at the moment of publication and applies to the published edition of an article.  Green OA is sometimes immediate, sometimes delayed, and can apply to any version of an article:  a preprint, the published edition of the postprint, or the peer-reviewed but not copy-edited version of the postprint.)
As the ACS policy is currently worded, it only charges the fee for self-archiving the published edition of an article.  Hence it leaves the door open for no-fee self-archiving of the final version of the peer-reviewed manuscript, rather than the published edition.  On the American Scientist Open Access Forum, Stevan Harnad asked whether ACS planned to charge for that form of self-archiving as well.  Adam Chesler, the ACS Assistant Director of Sales and Library Relations, said yes.
Chesler’s answer makes the ACS policy even worse than it seemed at first.  It’s bad enough to force authors to pay for gold OA in order to get green OA; at least they really get gold OA too, wanted or not.  But under this new wrinkle in the policy, even self-archiving authors who don’t get gold OA must pay for it.


Missing comments

I’ve been investigating Peter’s missing comments problem, and have found the cause, but not yet the solution…
The comments on these blogs are passed through the Akismet spam filter and then to the human spam filter (Peter) before being posted. The Akismet plugin allows you to see the most recent ~150 spam posts, so you can rescue ham. At the current rate of spamming, this gives Peter just ~4 hours from posting time to rescue ham from Akismet.
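The arithmetic behind that window, spelled out as a sketch (the current rate is inferred from the two figures above; the doubled rate is hypothetical):

    buffer_size = 150                    # spam posts the Akismet plugin retains
    for spam_per_hour in (150 / 4, 75):  # current rate (~37.5/hour) and a doubled one
        window = buffer_size / spam_per_hour
        print(f"{spam_per_hour:.1f}/hour -> {window:.1f} hour rescue window")
    # 37.5/hour -> 4.0 hour rescue window; 75.0/hour -> 2.0 hour rescue window

In other words, the rescue window shrinks as the spam rate grows, which is why a Captcha plugin looks attractive.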
I’m going to start looking for a Captcha plugin for WordPress – can anyone recommend one?
Jim Downing


Current issues

With all of the delicacy of an elephant dissecting fruit flies I have jumped into three issues, all of which require careful reply and may take a little while. Meanwhile I have a talk to give on Open Access next week and I haven’t a clue what to say. So I’ll talk about Open Data which I think I do understand.
All three areas have had substantive discussion/reply and the current states are:
Word, OOXML, ODT etc.

Peter Sefton: Some comments on OOXML, ODF and Microsoft Word

There is a conversation about file formats and word processors going on between Peter Murray-Rust and Glyn Moody.
Peter made some comments about wanting access to chemistry publications in Word format so he can better extract chemical information embedded in them, which sparked some push-back regarding the Microsoft OOXML format.
Unsurprisingly I have a few opinions on all of this. [MUCH more useful stuff snipped]
Glyn Moody: How Microsoft Uses Open Against Open
To my shame, Peter Murray-Rust put up a reply to my post below in just a few hours, where it had taken me days to answer his original posting. So with this reply to his reply, I’m trying to do better.

PMR: Interim summary. I’m fully conscious of the issues. Some are ethical – should one work with commercial companies, and under what circumstances? Others relate to standards – should one never use anything except standards? And then there are the practicalities – is it possible to find technical ways forward? This discussion will continue…
Open Data

11:10 11/05/2008, Cameron Neylon
[…]
The other story of the week has been the (in the end very useful) kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion that has been revealed regarding data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain, and that the public domain and the freedom of the data itself need to be protected from erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable? […]

PMR: There’s also extremely valuable material from John Wilbanks (follow the links from Cameron). Current position: I blundered, but this has probably created some good. The issue is now clear: licensing data diminishes our rights to it. (Continued kudos to Chemspider, who I am sure will follow the discussion as the situation is clarified.) The Open Knowledge community is clear and united. No half-measures. Data should be Open.
Open Access – “Strong and Weak”
Very little discussion till yesterday, when Peter Suber issued a definitive post:

23:06 11/05/2008, Peter Suber, Open Access News
Stevan Harnad, Lower Bound Needed for Permission-Barrier-Free Open Access, Open Access Archivangelism, May 4, 2008.

Summary:  “Permission-Barrier-Free OA,” because it is on a continuum, needs at least a minimal lower bound to be specified.
“Price-Barrier-Free OA” is not on a continuum. It just means free access online. However, it too needs to make a few obvious details explicit:
(1) The free access is to the full digital document (not just to parts or metadata).
(2) There is no “degree of free” access: Lower-priced access is not “almost free” access.
(3) The free access is immediate, not delayed or embargoed.
(4) The free access is permanent and continuous.
(5) The access is free for any user webwide, not just certain sites, domains or regions.
(6) The free access is one-click and not gerrymandered (as Google Books or copy-blocked PDF are).
Hence “Almost-OA” [via Closed Access plus the “Email Eprint Request” Button] is definitely not OA — though it will help hasten OA’s growth.
Nor does Price-Barrier-Free OA alone count as Permission-Barrier-Free OA. The only way to give that distinction substance, however, is to specify a minimal lower bound for Permission-Barrier-Free OA.

Comments

  • The background here is the distinction that Stevan and I once described with the terms strong and weak OA.  We now agree that we picked infelicitous terms to describe the distinction and are looking for better ones.  But the distinction itself remains important, widely accepted, and non-controversial.  Here Stevan elaborates on one side of the distinction:  what we called “weak OA” or the removal of price barriers without the removal of permission barriers.  I want to elaborate on the distinction itself or on the borderline between the two halves.
  • Here’s how I [PS] described the borderline in a comment on Peter Murray-Rust’s blog last week:

    The borderline between strong and weak OA is easy to define. Weak OA removes no permission barriers and strong OA removes at least some permission barriers. (Both of them remove price barriers.)

PMR: This is clear, and implies that the removal of even one permission barrier makes something “strongOA”. (Stevan Harnad has proposed the term “FULL-OA” instead. I shall continue to use “strongOA” until there is an alternative.)
PMR: I feel deeply unhappy about the use of “strongOA” to describe something which has most of its permission barriers still in place and for which someone may have been persuaded to pay thousands of dollars. If I were a funder wishing to support OA I would have little idea what I should be campaigning for.
It leaves me with a dilemma. I can continue to engage in discussion of what Open Access is and how it should be practised. In doing so I shall upset Stevan Harnad, who thinks that debating definitions is a distraction from actually self-archiving my articles. I shall also perhaps reveal cracks in the Open Access community, since we are now looking at a spectrum of options where many practitioners advocate different practices. [I continue to applaud SPARC for its no-frills approach – BBB/CC-BY or nothing – and the main Open Access publishers who use CC-BY licences.]
Or I can shut up and devote my energy to Open Data and Open Knowledge instead. Here, at least, I think there is a clear understanding of what we are doing. I’ll probably summarise my Open Access position before that. I will continue to be “in favour of Open Access” but I’ll stop short of explaining what it actually is and refer people to SPARC/BBB.


OOXML and ODT/F; should we work with commercial tools?

Glyn Moody has taken me to task for espousing Word as a tool that should be considered for archival of scholarly output. Not Word alone, but as a supplement to PDF. I explained my reasons and motivation. Now Glyn has continued the discussion and I reply…

A Word in Your Ear

A little while back I gave Peter Murray-Rust a hard time for daring to suggest that OOXML might be acceptable for archiving purposes. Here’s his response to that lambasting:

My point is that – at present – we have few alternatives. Authors use Word or LaTeX. We can try to change them – and Peter Sefton (and we) are trying to do this with the ICE system. But realistically we aren’t going to change them any time soon. My point was that if authors deposit Word we can do something with it, which we cannot do with PDF. It may be horrible, but it’s less horrible than PDF. And it exists.

There are two issues here. The second concerns translators between OOXML and ODF. Although in theory that’s a good solution, in practice, it’s not, because the translators don’t work very well. They are essentially a Microsoft fig-leaf so that it can claim using OOXML isn’t a barrier to exporting it elsewhere. They probably won’t ever work very well because of the proprietary nature of the OOXML format: there’s just too much gunk in there ever to convert it cleanly to anything.

PMR: I didn’t regard it as a lambasting but as a controlled, robust discussion. I understand and appreciate where Glyn is coming from. I haven’t said anything I regret, but I’m also aware that there are unclear boundaries. Before diving in I should get a potential conflict of interest out of the way. We are about to receive funding from Microsoft for the OREChem project (see the post on Chemistry Repositories). This does not buy an artificial silence on commenting on Microsoft’s practice, any more than accepting a grant from JISC or EPSRC would stop me speaking my mind. Nor do I have to love their products. I currently hate Vista. However I need an MS OS on my machine because it makes it easier to use tools such as LiveMeeting (a system for sharing desktops). I’ve used LiveMeeting once and I liked it. OK, Joe did the driving because he knows his way round better than me, but I can learn it. Not everything MS does is bad and not everything it does is good.
The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use. I’ll be demoing it publicly in a week’s time (more later). If we had material in ODT we’d use that, but we don’t. There may be a few synthetic chemists somewhere who use ODT, and we’d really like to hear from them, but currently we have to work with what chemists actually do.
I’m sorry to hear the translators aren’t good and I’m not surprised. I can’t imagine they are as bad as trying to get structured documents out of PDF. Remember also that we don’t want to do everything at this stage.
Our primary goal is to evangelize the semantic chemical web. To do this we have to create a lot of infrastructure: demonstrators, ontologies, microformats, etc. This is all independent of the tools used to create the starting material. Everything we do will be modular, and none of the chemistry will have hardcoded OOXML stuff in it. I believe that were we to have a chemical thesis in ODT then we’d be able to adapt our material very easily.
Our motivation, therefore, is to work with scientists as they are, not as we would like them to be. There is no point in trying to make them use Docbook, for example. (Last time I tried I couldn’t even get it to work for me – the stylesheet stackdumped, something I have never seen before.) My worry about Open Office (which emits ODT) is that I don’t yet believe it has reached a state where I could evangelize it without it falling over or being too difficult to install.

The larger question is what needs to be done to convince scientists and others to adopt ODF – or at least a format that can be converted to ODF. I don’t have any easy answers. The best thing, obviously, would be for people to start using OpenOffice.org or similar: is that really too much to ask? After all, the thing’s free, it’s easy to use – what’s not to like? Perhaps we need some concerted campaign within universities to give out free copies of OOo and run short hands-on courses so that people can see this for themselves. Maybe the central problem is that the university world (outside computing, at least) is too addicted to its daily fixes of Windows and Office.

PMR: I face this sort of problem daily in chemistry. Chemists would rather pay for something commercial than become early adopters. I particularly blame the pharma industry. I regularly meet people who say things like: “Oh yes, we use OSCAR to develop our textmining experience and then we go out and buy X, Y, Z”. Since X, Y and Z are commercial I can’t evaluate them and say they are worse than OSCAR, but I firmly believe that in many respects we are ahead of them. Companies want to buy things where they can sue the suppliers.

We see this with the Blue Obelisk. The pharma companies all use its tools – OpenBabel, Jmol, CDK, possibly JUMBO/CML (but how would I know? I only wrote it and made it Open/Free). Occasionally they write and say “we’d really like to develop Open Source”; I write back enthusiastically and then they dump me.

So the strategy is to create something that is self-evidently good and miles ahead of the current software offerings. Then people will have to take notice.

What do we do with universities? I wish I knew. Universities have an information budget running into zillions (publications, subscriptions, librarians, repositarians, etc.). They are completely incapable of managing this market. Worth is decided by citations, which are decided by a commercial organisation. Repositories aren’t full, and there’s no control or ownership of their output, which they simply gift to the publishers. When was the last time a provost spoke out on this? (Yes, I except Harvard, and probably Soton, QUT and Stirling, but almost all major universities have failed to tackle this.)

So I could spend my time writing letters to Vice-Chancellors.

Or I could develop the next phase of the Open Chemical Semantic Web.

I’ve chosen the latter. But it would certainly help if some readers did the former.


I am still DELIGHTED with Chemspider

A few days ago I applauded (blog post) ChemSpider for releasing their data under CC-BY-SA, and I still do. CC-BY-SA is compliant with the BBB definition/declaration.
There has been some apparent criticism of this, created because I unintentionally posted a private mailing-list message to my blog. I apologize to John, Chemspider and the other members of the advisory board. The language reflects the type of discussion that a small community uses in private.
John has now blogged this:

04:17 10/05/2008, john wilbanks, john wilbanks’ blog
This is a comment I posted on the ChemSpider blog, one of two I tried to post. I’m cross posting here to make sure it’s public. Make sure to click through to the blog, it’s on the topic of using CC licenses on data. I sent an email to a list that got blogged, before I could get a chance to reconcile everything and contact the Chemspider guys. I think they should get complimented for their intentions and that they deserve tea and sympathy, because this licensing stuff is really complicated, and all they wanted to do was share.
In short, it’s a demonstration of how confusing data licenses make the position of data providers essentially untenable. From my perspective, the answer is either go public domain, or don’t. If you don’t, please make the metadata public domain. Anything [else] is simply too confusing to figure out, and it’s going to be worse.
Part of the problem is that we have created a cargo cult around licenses. A contract will come from the heavens and make us free! But in data we’ve got the public domain right there to teach us. All we have to do is look up from the lawyer’s desk and follow the yellow brick road…er, the NCBI’s lead.
jtw
>>>>>>>>>>>>>>>
I tried to post a comment but don’t know if it got through.
I did not intend for my comments to become public – that was a post to an advisory board list, intended to highlight precisely how this issue demonstrates the difficulty providers have in understanding licensing of data.
Creative Commons licenses were built for cultural works, like this blog or a website or music. They weren’t built for data. Data has different qualities and characteristics and thus requires different licensing approaches.
I would recommend you read the official CC position on this, which is the Science Commons Open Access Data Protocol (http://sciencecommons.org/projects/publishing/open-access-data-protocol/) and that you look at the best available legal tool to achieve the protocol (http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/). These are regimes that facilitate data integration, unlike the CC BY SA license.
Please know that I salute your intent here and don’t want to slander you – you’re trying to share, and you’re confused on how to do so. I do believe that in our conversations I did indeed recommend to you the idea of releasing an RDF dump of your database in the public domain, using only the NCBI approach listed on this very blog. That’s essentially what we recommend at CC, as you’d see in the protocol.
Again, it was not my intent for this to go public before I could reach you, and I’m very sorry for that. It is never fun to make a decision and get pummeled for it, and from my perspective you don’t deserve the pummeling.
I’ll cross post this to my blog to make sure it gets online.

PMR: On the assumption that no lasting harm has been done, some good comes out of the episode:

  • I will take greater care over what I repost.
  • As there is very little general understanding of the benefits and disadvantages of licences, it is clear that we (SC/OKF) have to work hard with each new community, some of whom have virtually no experience of licensing.
  • We have had the chance to publicize how much we care about getting the mechanics of Openness right and what the issues are. We do not want fuzzy borderlines.

Strong and Weak OA – where are we?

Jonathan Gray of the Open Knowledge Foundation reviews the postings over the last few days on the new ideas of strongOA and weakOA.

03:01 09/05/2008, Jonathan Gray
Over the past week or so there has been a flurry of posts about ‘strong’ and ‘weak’ open access, including the following:

Peter Suber and Stevan Harnad both agree:

[…]We have agreed to use the term “weak OA” for the removal of price barriers alone and “strong OA” for the removal of both price and permission barriers. To me, the new terms are a distinct improvement upon the previous state of ambiguity because they label one of those species weak and the other strong. To Stevan, the new terms are an improvement because they make clear that weak OA is still a kind of OA.
On this new terminology, the BBB definition describes one kind of strong OA. A typical funder or university mandate provides weak OA. Many OA journals provide strong OA, but many others provide weak OA.

Furthermore, Peter Suber adds:

As soon as we move beyond the removal of price barriers to the removal of permission barriers, we enter the range of strong OA. Hence, an article with a CC-NC license is strong OA because it allows some copying and redistribution beyond fair use (even if it doesn’t allow all copying and redistribution). My own preference is still for the CC-BY license, but we shouldn’t speak as if CC-NC were not strong OA or as if there were just one kind of strong OA.

According to this schema, a cost-free publication counts as weak open access, and a publication licensed under a CC-NC licence counts as strong open access. Stevan Harnad agrees with the distinction but suggests the need for ‘value-neutral’ terms to describe it, suggesting ‘basic’ and ‘full’.
It’s worth adding to this discussion that there is also Open Definition-compliant open access, which I understand is equivalent to BBB open access and which is more permissive than ‘strong’ or ‘full’ open access. As we blogged a couple of weeks back, anything with the SPARC Europe Seal will be open access in this sense.
As Peter Murray-Rust comments:

Open Source has the OSI, which determines whether or not a given licence is OS. Open Knowledge, after only a short time of volunteer effort, has the OKF, an agreed definition and a list of conformant licences.

Scholarly publications, as literary works, constitute knowledge and hence are covered by the OKD. A journal, monograph or any other publication can still be ‘open as in the OKD’ as with other forms of knowledge. Debates about open access aside, demarcating between knowledge that is ‘open’ and ‘closed’ is precisely what the OKD is there for!
It will be interesting to see what emerges as the new classificatory scheme for open access, and where OKD compliant publications sit on the spectrum. Perhaps these will be called ‘OKD/BBB compliant open access’ journals, or suchlike.

PMR: I am now completely confused about what Open Access is. A month ago I thought it was simple – conformance to the BBB declarations. I was happy to see this extended to strongOA and weakOA, which I took to mean BBB-compliance and everything else that is freely accessible, immediately and permanently. The saving grace is that this is also SPARC’s interpretation, on which they award their seal of Open Access Journal.
I do not know what a permission barrier is. I assume it’s a barrier for the reader/user, not a barrier for the archiver, librarian, author, funder, etc. But no one makes it clear. So if I don’t know what a permission barrier is I don’t know when one has been removed. And I am sure interpretations differ.
For some reason the Open Access movement does not seem to wish to define its terminology (other than the very clear words in the BBB declarations). I’ve set out my concerns in the posts above but, apart from SPARC, had no response. OK, I’m not a professional OA player – I would fail an examination on Open Access – and I’m primarily interested in Open Data.
Does it matter?
Yes.
Because many journals now charge large amounts of money (kilodollars) for Open Access. If they don’t define what it is then many purchasers of OA will be paying for a pig-in-a-poke. And some of the pigs turn out to be turkeys.
I’ve given six examples of “Open Access” – they’d be typical examination questions when Open Access becomes a degree subject (if it isn’t already). And I’m in danger of failing.


Open Knowledge; London meeting and later

The Open Knowledge Foundation has a clear and pragmatic approach to making knowledge (data, creative works, monographs, etc.) Open. It’s very clear what Open Knowledge is (see The Open Knowledge Definition), unlike Open Access, which is still working out what it is. Last week the OKF had a London meeting – I meant to blog it (a) during the meeting, but I was much too interested in the business, and (b) after we got back from the pub. So just to say that we were a diverse set of people – from academia, transport, government, etc. OKF has achieved a lot already – it’s respected, it keys into organisations such as Science Commons, and it has made major contributions to defining the meaning and the practice of Open Knowledge. We’re not an advocacy group, but we do want to evangelize OK. Jonathan Gray ran the meeting and it was nice to meet again – among others – Cameron Neylon, who is Opening science in the laboratory, and Jordan Hatcher, who was the major engine behind the legal apparatus of Open Knowledge.
What is Open Knowledge? One idea we left with was to create a 10-minute vidcast showing what sorts of things people are doing and why it is so valuable. I think Jonathan will be following this up.
John Wilbanks of Science Commons is on the advisory board, and here’s a substantial post from him. I add a few comments at the bottom…

I second the motion.
I would add to it that I'd like to see a meaningful discussion of the
risks of Share Alike and Attribution on data integration. Chemspider's
move to CC BY SA fits into this discussion nicely - it's a total
violation of the open data protocol we laid out at SC, which says "Don't
Use CC Licenses on Data" - but it does conform inside the broader OKD.
It'd be nice to articulate when and how the goals of integrating
databases fit in the OKD. If you're not trying to integrate data, then
SA and BY are ok I suppose. But in data, the *most restrictive license
wins* - which creates inherent problems with at least SA and NC clauses.
This is the opposite of software code in which the virally open license
can defeat closed licensing, and is a reflection of the inherently
different behavior of data from a copyrights and sui generis rights
perspective.
For example, given the lack of clarity on the database directive, and
the fact that any query result from a database might well qualify as
"substantial extraction" then the rational assumption is that answers to
database queries should be regarded as "derivative works" under
licensing regimes like the ODL. This would also apply to any query to an
integrated knowledgebase like our Neurocommons. It would also in theory
apply to a Google result if that Google result touched the database
licensed underneath (especially if the Google result brought back a lot
of the underlying database, somewhere in its "5,000,000 results" set).
This is the case of any query to a federated web if the results of that
query touch on the database, as the result of a query to a database is
most definitely itself a data product and thus subject to the license...
This is, at least, the conclusion of the nearly two years of legal
research we did at SC. We continue to look for holes in the reasoning
and welcome any and all feedback from this group on the topic...
jtw
jo [email] wrote:
> On Thu, May 08, 2008 at 01:08:30PM -0700, Jonathan Gray wrote:
>> Peter Murray-Rust and I also recently discussed how it would be great to
>> have a document, or a series of documents, on the rationale behind not
>> including licenses with non-commercial restrictions in the OKD.
> 
> I'd *really* like to see this especially if it includes a strong
> account of the economic case for "public sector" organisations not
> placing non-commercial restrictions on data (a compromise viewed as
> "politically acceptable" too much ...  Rufus et al's recent paper on
> Trading Fund potential models for distribution of raw/unrefined data
> made some useful points but it was all kind of enmeshed in the local
> policy context...
> 
> 
> jo
PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism – CC-BY-SA conforms to the OK definition, but is difficult to operate for re-use). That's why we use the OKF's OpenData sticker on CrystalEye.
Note that Science Commons' promotion of the idea of Community Norms is very important. The bioscience community does not license data because there is a general agreement (a Community Norm) that it is Open and re-usable. Whereas in chemistry the Community Norm is: copyright it, protect it, sell it back to the authors, stop it being integrated, send in the lawyers, etc. Some chemical information organizations (no names) have lots of lawyers and seem to get a lot of practice by suing people. So maybe chemistry needs licences as it drags itself out of the information swamp.