Klaus Graf on what is strongOA

I am delighted that Klaus Graf – a fervent and consistent supporter of strongOA – has replied at some length to my question on what is strongOA. If we do not address this question now then we shall lose most of the value of the new terminology. He also comments on what is weakOA, which I think is much more difficult. I’ll park that for the present and see if we can agree on the strong/weak boundary. If you are interested in this discussion you must be vocal now.

Klaus Graf Says:
May 1st, 2008 at 12:17 am
I have commented in German at http://archiv.twoday.net/stories/4900938/ but will try to write some thoughts also here.
May I recall the suggestions from PMR and Charles Bailey (color codes) at:
http://archiv.twoday.net/stories/4110564/ (2007)
There is no reason why the OA community shouldn’t say:
Strong OA documents are by default CC-BY (“Make all research results CC-BY”).

PMR: I think I agree with the sentiment but would change the wording. I think it means “if you are looking for a way to express the intention that your artefact is strongOA, use a CC-BY licence”. I agree 100%. The existence of CC-BY has helped to save OA from an awful mess. And I congratulate publishers such as PLoS and BMC who use this licence universally to express strongOA. Whatever you think about their business models and whether strongOA is a good thing, I hope no-one would disagree that they have done an excellent job of making it absolutely clear what they are offering. And because of their lead, others such as MDPI can follow naturally.

Regarding data re-use it might in some cases be more appropriate to choose PD. (Please keep in mind that some Wikipedia articles have hundreds of contributors. To list them completely doesn’t make sense, although the GNU FDL requires this.)

PMR: Science Commons and the Open Knowledge Foundation address this point. The OKF has “Open Data” tags, and SC has a mechanism for declaring work to be in the public domain (the PDDL).

Weak OA documents are
(1) OA-Free (Fair use only)
(2) OA-NC (no commercial use) with and without Copyleft (SA)
(3) OA-ND (no derivative works)
(4) OA-NC-ND

PMR: I pass on these at present other than to agree that *-NC, *-ND and *-NC-ND are NOT strong OA.

[… embargo stuff snipped …]
KG: I was not too amused to read from Suber that weak OA is a kind of OA but I think the positive aspects of the Suber/Harnad-agreement prevail.

PMR: I know how you feel, as I feel it myself. Up till now I have seen “OA” as implying strongOA. That has got me involved in a lot of lively discussions on this blog when I have critiqued those who use it as a general label. But it is clearly not a precise term and we can’t turn it into one. It has the same value as “healthy” in “healthy foods”. It may represent positive if muddled intent and it may represent an attempt to ride on a bandwagon (or worse). So let us agree that “OA” by itself contains little precise meaning.

I agree with PMR (and PS) that labeling OA is essential. Let’s remember the Berlin definition: “including a copy of the permission as stated above”.
And the best label is …. CC-BY!

PMR: Labelling is critical. One example of where this can start TODAY is academic theses. Librarians, repositarians, BOGS, get going. Add “CC-BY” to every thesis. INSIDE the thesis. Page 1. Add more labels: “strongOA”, “Open Knowledge”. It’s trivial. Microsoft even has a Word plugin for adding licences. There is no excuse. So here’s a simple bumper sticker – I am sure you can do better:
strong OA – start today
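
On the mechanics: the stamping really is trivial. Here is a minimal sketch using python-docx (my choice of tool – the post mentions only Microsoft’s Word plugin); the filenames and licence wording are illustrative assumptions.

```python
# Sketch: stamp a CC-BY statement onto page 1 of a thesis. python-docx and
# the filenames are my assumptions, not tools named in the post.
from docx import Document

LICENCE = ("This thesis is licensed under CC-BY "
           "(http://creativecommons.org/licenses/by/3.0/). strongOA.")

doc = Document("thesis.docx")
# insert_paragraph_before puts the licence ahead of everything else, i.e. page 1
doc.paragraphs[0].insert_paragraph_before(LICENCE)
doc.save("thesis-labelled.docx")
```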


Why we need semantic chemical authoring -4

Wolfgang Robien has nicely solved the puzzle of the inconsistency between the formula and molecular mass of chloromethyl methyl ether:

Wolfgang Robien Says:

  April 30th, 2008 at 4:25 pm
  What you read as ‘Cl’ (=chlorine) – because you expect it – is written as ‘CI’ (=carbon+iodine) …
  quite simple, but a good example showing that even the most trivial checks might avoid errors. The ‘advantage’ of this error is that no conclusion is built on it. Every misassignment in NMR spectroscopy might have the consequence that another assignment is based on it – making it look more reliable, because of better statistical parameters.
  For a summary of more or less ’sophisticated’ errors in NMR spectra see:
  nmrpredict.orc.univie.ac.at/csearchlite/NMR_misinterpretation.html

PMR: Thanks Wolfgang. It is interesting how many people (including me) could not see the problem. It is so natural to communicate using visual signs that we forget that they frequently mislead. There are some more general points:

  • Wolfgang shows some examples of errors in reported spectra. I imagine that these are a very small percentage of those actually in the literature. The 10 examples break down roughly equally into scientific errors and typos. He makes the point, and I completely support him, that these are avoidable. If we use semantic tools when authoring then we can avoid many errors: it’s straightforward to check the expected NMR values of many compounds, and doing so is a small fraction of the effort of actually making them. There are known ranges for many properties (see the sketch after this list) – the IUCr does this very well and has an extensive dictionary/ontology system for all mainstream crystallography. It’s made a huge difference to the quality of crystallography, which is generally considerably higher than that of spectra. It is not a problem of tools, it’s a problem of will.
  • It can be quite difficult to know what a compound – even a simple one – actually is if we cannot assume the semantic coherence of the document. There are ca 1500 common compounds and substances in the ICSC collection (from which my example was taken). I’ve blogged about this before and shall return, but there are:
  1. names, often opaque and sometimes inconsistent
  2. identifiers such as CAS or RTECS, for which there is no authoritative free resolution system (it costs USD 6 to find out what a single CAS number is)
  3. chemical formulae, for which there is no standard method of reporting
  4. molecular structures, with a variety of approaches mainly based on connection tables
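
Returning to the first bullet: here is a minimal sketch of the kind of entry-time range check a semantic authoring tool could run. The 13C shift ranges are rough illustrative values I have assumed, not data from Wolfgang’s examples.

```python
# Sketch of an entry-time plausibility check. The 13C shift ranges below are
# rough illustrative values, not authoritative data.
EXPECTED_13C_SHIFT_PPM = {
    "carbonyl C": (155.0, 220.0),
    "aromatic C": (100.0, 160.0),
    "aliphatic C": (0.0, 90.0),
}

def plausible_shift(atom_type, shift_ppm):
    """True if a reported 13C shift lies inside the expected range."""
    low, high = EXPECTED_13C_SHIFT_PPM[atom_type]
    return low <= shift_ppm <= high

assert plausible_shift("carbonyl C", 178.2)      # looks fine
assert not plausible_shift("carbonyl C", 17.8)   # probable typo - flag it
```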

This is unacceptable for the machine information age. The ICSC documents are important (they are about safety) yet it is enormously difficult for a machine to read them accurately. Almost every other web source of information, and many databases, are also difficult or impossible to read accurately. The only modern way forward is a combination of XML (CML and others), RDF and ontologies. You’ll be seeing more of this over the next few weeks and months, with ways of showing how it can be achieved. Yes, there is a need for new tools, and demonstrators, but the main problem is one of will.
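
As a taste of what that combination looks like, here is a minimal sketch using rdflib (my choice of library); the chem# namespace and property names are invented for illustration, not an existing ontology.

```python
# Sketch: an ICSC-style entry as RDF triples a machine can read unambiguously.
# The chem# namespace and property names are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef

CHEM = Namespace("http://example.org/chem#")            # hypothetical vocabulary
entry = URIRef("http://example.org/compound/107-30-2")  # keyed by CAS number

g = Graph()
g.add((entry, CHEM.name, Literal("chloromethyl methyl ether")))
g.add((entry, CHEM.casNumber, Literal("107-30-2")))
g.add((entry, CHEM.formula, Literal("C2H5ClO")))
g.add((entry, CHEM.molecularMass, Literal(80.5)))

print(g.serialize(format="turtle"))
```
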
And, in passing, it’s worth noting that the problem only occurred for sighted people. It’s worth remembering that our information is not just for them.


What is strongOA?

In previous posts (see links in Why weakOA and strongOA are so important) I have welcomed the Suber-Harnad approach to OA, labelling objects either as “strong OA” or “weak OA”. In this post I want to explore what strong OA is. I believe this is possible and relatively simple. I hope that all OA advocates will be able to agree on an operational procedure that will simply and absolutely determine whether something is strong OA.
A useful starting point is the Wikipedia “definition”. I have copied this verbatim and added two suggested clarifications:

Open access (OA) is free, immediate, permanent, full-text, online access, for any user, web-wide, to digital scientific and scholarly material,[1] primarily research articles published in peer-reviewed journals [PMR: and academic theses]. OA means that any user, anywhere, who has access to the Internet, may link, read, download, store, print-off, use, and data-mine the digital content of that article [PMR: without requiring to consult authors, publishers, or hosting sites]. An OA article usually has limited copyright and licensing restrictions.

PMR: The Budapest declaration includes the definition:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

For background let’s assume some axioms:

  • we must have an operational procedure for determining the strongOAness of an object. Without such a procedure we have argued endlessly over things that need no longer take up our energies. The only people who will argue are those who wish to muddy the OA waters, including those who wish to rent weakOA objects for the sale price of strongOA.
  • we can explore the discussion in the arena of scholarly and research publications. There will be an overlap with certain digital objects (data sets, computer code) but we’ll omit discussion here. The most important artefacts are research articles (normally peer-reviewed) and theses (of all types, undergraduate, masters, doctoral) published through a University or scholarly organisation.
  • that there are overriding statements of intent which also contain definitions. These include the BBB declarations [above] and the Open Knowledge Definition (which is part of the basis of Open Data and Science Commons). I believe these all describe strongOA (and it would be difficult to dumb them down without breaking my idea of strongOA). It is how to translate these definitions into practice that I address here.

I see the following challenges for strongOA.

  • The logical consequences of strongOA are extensive. I believe it is possible for anyone to download a complete journal, repackage it and resell it without the publisher’s or author’s permission. They must, of course, preserve the provenance (authorship) but that’s all that is formally required. Just as people re-use and resell my computer code (as in Bioclipse) they can do the same with my articles and – theoretically – the whole OA content of, say, PLoS or BMC. In practice I think that would be slightly questionable and that’s where community norms come in – it’s useful to say “you may do this but we’d rather you didn’t”. I generally enforce this by adding the bit-rot-curse to my code. So there will be a culture change as publishers adopt strongOA – there will be mistakes – and we should help them adjust.
  • There may be a tendency to blur the boundary. “This article is strongOA as long as it is for non-commercial use”. No. It is either strongOA which requires the permission for commercial use or it’s not strongOA. We have to agree on this.
  • We have to police strongOA. One of the plus points of Open Source and the weak points of OA (up to now) has been the policing. If you say something is strongOA and it isn’t someone should take you to task (gently if it’s a mistake). If we don’t do this then the bright shiny present that the Suber-Harnad terminology has created will tarnish. Fuzzy practice begets fuzzy thinking.
  • We have to be able to know (not just guess) the strongOA status of an object at all times. This is critical. I shall continue to stress this. It’s not good enough to say “I am emailing this document from repository X, which is classified as an Open Access repository, so you can do anything you like with it”. The document/artefact has to announce that it’s strongOA. Nothing else will do, because provenance by association gets lost. The only way that I know of doing this is by embedding a licence, or a reference to a licence, in the document. Typical licences include CC-BY, the GNU documentation licence (FDL), and Science Commons/Open Knowledge (meta)licences such as the PDDL or CC0. The licence can be asserted either by embedding RDF in the XML/HTML or by adding an approved icon from the organisations above.
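
To make the last point concrete, here is a minimal sketch of stamping a machine-readable licence marker into an HTML page. The rel="license" convention is the one Creative Commons promotes; the page content and the helper function are invented for illustration.

```python
# Sketch: embed a machine-readable licence marker in HTML. rel="license" is
# the Creative Commons convention; the page itself is invented.
LICENCE_HTML = ('<a rel="license" '
                'href="http://creativecommons.org/licenses/by/3.0/">'
                'This work is licensed under CC-BY (strongOA)</a>')

def stamp(html):
    """Insert the licence marker immediately after <body>."""
    return html.replace("<body>", "<body>\n" + LICENCE_HTML, 1)

page = "<html><body><h1>My article</h1><p>...</p></body></html>"
print(stamp(page))
```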

To summarise at this point:
strongOA requires a clear borderline defined by a licence (or licence reference) embedded in the document and policed by the scholarly community.
This discussion has been about what strongOA is, not whether it’s a good thing or how to achieve it. It ought to be something that responsible publishers have a view on as well as authors, funders, repositarians, human readers and machine users.


Why weakOA and strongOA are so important

Yesterday Peter Suber and Stevan Harnad announced (Strong and weak OA) a critically important step forward in OA – that the terms “weak OA” and “strong OA” should be used to describe the various approaches, philosophies and practices. I reported this (Weak and Strong OA) and promised to elaborate from my perspective.
Until yesterday the label “OA” was too fuzzy to allow precise definition of practice. This had serious practical consequences:

  • an author or funder paying for “OA” might be getting less than they expected.
  • a reader or user (human or machine) might not know what they could and could not do with an “OA” article. “OA” did not guarantee rights of re-use and it was possible that a reader could do something in good faith that would earn them a lawyer’s letter from the publisher (or worse).
  • Many of us (funders, librarians, authors, readers) wasted huge amounts of time trying to make clear what could and could not be done. Generally this led to erring on the side of extreme caution (==paralysis) and was a godsend for those trying to inject FUD into the system.

I accept Peter and Stevan’s observations that most OA is not strong OA and that there is a place for weak OA. I support that view. I shall of course campaign for strong OA, but now it is entirely clear (as I intend to show) what it is that I am campaigning for.
More later, but until then we should all practise cataloguing digital objects into three categories: nonOA, weakOA, strongOA. I think the results may be surprising.
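
For practice, here is one possible operationalisation as a sketch – the licence set encodes my reading of the strong/weak boundary, not an agreed standard.

```python
# Sketch: classify a digital object from its licence and its price barrier.
# The licence set is my reading of the strong/weak boundary, not a standard.
STRONG_LICENCES = {"CC-BY", "CC0", "PDDL"}  # price AND permission barriers removed

def classify(licence, free_to_read):
    if licence in STRONG_LICENCES:
        return "strongOA"
    if free_to_read:                        # price barrier removed only
        return "weakOA"
    return "nonOA"

assert classify("CC-BY", True) == "strongOA"
assert classify("CC-BY-NC", True) == "weakOA"   # *-NC is not strongOA
assert classify(None, False) == "nonOA"
```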


Why we need semantic authoring tools in chemistry – 3

The type of problem highlighted in my recent post is a very serious one, so rather than giving the answer I want to help you discover it for yourself. Hopefully you will then have a wow! or aha! or buggerthat! moment that will help orient you to the importance of semantic tools. Persevere in this and you will see why I rant against PDF, and why weak OA does not normally provide high-quality semantic documents.
You need to know only a very little chemistry and I’ll explain it all below. But first, the essence of the problem (relating to methyl chloromethyl ether – you can look it up on WP but that’s not necessary to solve the problem).
In essence the chemical formula as given:

CH3OCH2CI

is completely incompatible with the molecular mass as given:

Molecular mass: 80.5

For those who have forgotten high-school chemistry all you need to know is:

  • Elements are defined by an unambiguous symbol. Thus “H” means hydrogen, “C” means carbon, “O” means oxygen. You can look up all the information in Wikipedia.
  • The count of each element is one, unless subscripted. Elements can be repeated. Thus CH3OH is read as one carbon, three hydrogens, one oxygen and another hydrogen. Adding them up gives one carbon, four hydrogens and one oxygen.
  • To get the molecular mass you look up the atomic masses of each element in the Wikipedia entry (or on the Blue Obelisk site) and multiply by the count. The example above (methanol) goes: 1 carbon @ 12 = 12; 4 hydrogens @ 1 = 4; 1 oxygen @ 16 = 16. Add these together and the answer is 32 (you can check in Wikipedia). Note that you should round the atomic masses to the nearest 0.5 (my teaser is not a problem of decimal points).
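
This check is entirely mechanical, which is the point. Here is a minimal sketch (the element table covers only a few elements; extend it as needed):

```python
# Sketch: compute a molecular mass from a formula string and compare it with
# the stated value. Masses are rounded to the nearest 0.5, as above.
import re

ATOMIC_MASS = {"H": 1.0, "C": 12.0, "N": 14.0, "O": 16.0,
               "Cl": 35.5, "I": 127.0}

def molecular_mass(formula):
    """Sum atomic masses for a formula such as 'CH3OH' (methanol)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total

assert molecular_mass("CH3OH") == 32.0   # the worked example above

def consistent(formula, stated_mass, tolerance=0.25):
    return abs(molecular_mass(formula) - stated_mass) <= tolerance

# Now try consistent(<the formula exactly as printed>, 80.5) for the puzzle.
```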

If you do this for the puzzle compound you should discover the problem.
And you’ll see why it bears on PDF, OA, and all the rest.
If we had semantic chemical tools where the information was checked as it was entered, this COULDN’T happen. For that we need something like a chemical plugin for Word.
Is there a good fairy out there?


Weak and Strong OA

Peter Suber (and Stevan Harnad) have just published a very important announcement about the definition of various types of OA. I’ve known about it for some days and have been waiting till it’s public. I’ll copy it in full and then comment.

The term “open access” is now widely used in at least two senses.  For some, “OA” literature is digital, online, and free of charge.  It removes price barriers but not permission barriers.  For others, “OA” literature is digital, online, free of charge, and free of unnecessary copyright and licensing restrictions.  It removes both price barriers and permission barriers.  It allows reuse rights which exceed fair use.
There are two good reasons why our central term became ambiguous.  Most of our success stories deliver OA in the first sense, while the major public statements from Budapest, Bethesda, and Berlin (together, the BBB definition of OA) describe OA in the second sense.
As you know, Stevan Harnad and I have differed about which sense of the term to prefer – he favoring the first and I the second. What you may not know is that he and I agree on nearly all questions of substance and strategy, and that these differences were mostly about the label. While it may seem that we were at an impasse about the label, we have in fact agreed on a solution which may please everyone. At least it pleases us.
We have agreed to use the term “weak OA” for the removal of price barriers alone and “strong OA” for the removal of both price and permission barriers.  To me, the new terms are a distinct improvement upon the previous state of ambiguity because they label one of those species weak and the other strong.  To Stevan, the new terms are an improvement because they make clear that weak OA is still a kind of OA.
On this new terminology, the BBB definition describes one kind of strong OA.  A typical funder or university mandate provides weak OA.  Many OA journals provide strong OA, but many others provide weak OA.
Stevan and I agree that weak OA is a necessary but not sufficient condition of strong OA.  We agree that weak OA is often attainable in circumstances when strong OA is not attainable.  We agree that weak OA should not be delayed until we can achieve strong OA. We agree that strong OA is a desirable goal above and beyond weak OA.  We agree that the desirability of strong OA is a reason to keep working after attaining weak OA, but not a reason to disparage the difficulties or the significance of weak OA.  We agree that the BBB definition of OA does not need to be revised.
We agree that there is more than one kind of permission barrier to remove, and therefore that there is more than one kind or degree of strong OA.
We agree that the green/gold distinction refers to venues (repositories and journals), not rights.  Green OA can be strong or weak, but is usually weak.  Gold OA can be strong or weak, but is also usually weak.
I’ve often wanted short, clear terms for what I’m now going to call weak and strong OA.  But I also wanted a third term.  In my blog and newsletter I often need a term which means “weak or strong OA, we don’t know which yet”.  For example, a press release may announce a new free online journal, digital library, or database, without making clear what kind of reuse rights it allows.  Or a new journal will launch which makes its articles freely available but says nothing at all about its access policy.  I will simply call them “OA”.  I’ll specify that they are strong or weak OA only after I learn enough to do so.
Stevan and I agree in regretting the current, confusing ambiguity of the term, and we agree that the weak/strong terminology turns this ambiguity to advantage by attaching labels to the two most common uses in circulation.  I find the new terms an especially promising solution because they dispel confusion without requiring us to buck the tide of usage, which would be futile, or revise the BBB definition, which would be undesirable.

Postscript.  Stevan and I were going to write up separate accounts of this agreement and blog them simultaneously.  But when he saw my draft, he decided to blog it verbatim without writing his own.  That’s agreement!

PMR: This is an enormous advance. I shall write several posts on different aspects but here I will simply congratulate P and S on their agreement. From now on it becomes clear that the OA movement is united, coherent and points in a single direction. But the actual mechanism is important as well as we move from the political aspect of OA to include the strictly operational.
Similar movements – e.g. Open Source – have had their prophets and differences of orientation. But there again we see the unity as greater than the differences – differences which by now are accepted rather than divisive.
It’s also worth pointing out that OA is changing. Obviously the numbers keep increasing, the awareness increases and closed access is increasingly less defensible in many cases. CC-* is now much more prominent than a few years ago. The new OA terminology helps us understand these changes and work out where to go next.
I believe that the Open Knowledge Definition applies to, and can be used to define, strong OA. There is therefore a series of Strong Opens (Source, Access, Data, Knowledge), all of which require the removal of permission barriers. There are minor differences, but those arise from the natures of the endeavours, not from the fundamental knowledge rights. In our own area it’s reflected by the Blue Obelisk’s Open Data, Open Source and Open Standards.
Lots more later. See some of you in London.


Why PDF is a Hamburger

In a recent comment Chris Rusbridge asks:
April 29th, 2008 at 4:47 pm

I’ve been thinking about a blog post related to your hamburger rants. But the more I try to think it through, the murkier it gets. Is the problem that PDF cannot store the semantic information? I’m not sure, but I’m beginning to suspect maybe not, ie PDF can. Is the problem that the tools that build the PDFs don’t encode the semantic information? Probably. Is the semantic information available in the publisher’s file from which the PDF is built? Possibly to probably, depending on the publisher and their DTD/schema. Is the semantic information available in the author’s file? Probably not to possibly, depending on author tools (I’m not sure what chemists use to write these days; Word would presumably be dire in this respect unless there is a chemistry plug-in; LaTeX can get great results in math and CS, but I’m not sure how semantic, as opposed to display-oriented, the markup is). And even if this were to all happen, does chemistry have the agreed vocabulary, cf the Gene Ontology in bio-sciences, to make the information truly “semantic”? And…

PMR: Thank you Chris. It’s a good time to revisit this. There are several aspects.
(From Wikipedia) PDF combines three technologies:

  • A sub-set of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.
…and, on the problem of accessibility:

One of the major problems with PDF accessibility is that PDF documents have three distinct views, which, depending on the document’s creation, can be inconsistent with each other. The three views are (i) the physical view, (ii) the tags view, and (iii) the content view. The physical view is displayed and printed (what most people consider a PDF document). The tags view is what screen readers read (useful for people with poor eyesight). The content view is displayed when the document is re-flowed to Acrobat (useful for people with mobility disability). For a PDF document to be accessible, the three views must be consistent with each other.

… so why is this a problem?
First let me dispose of the argument that “PDF is only bad if it’s authored with tools from a Moscow sweat-shop; good PDF is fit for any purpose”. PDF is concerned with positioning objects on the page for sighted humans to read. Yes, there are the two other views, but they are often inconsistent or impenetrable. Because most of us are sighted the problem does not grate, but for the others it can be very difficult. Let’s assume I have the text string “PDF”. In a normal ASCII document (including HTML and Word) the “P” comes first, then the “D”, then the “F”. In PDF it’s allowable (and we have found it!) to have the following instructions in the following order:

  1. position the “F” at coordinate (100,200)
  2. position the “D” at coordinate (90.3, 200)
  3. position the “P” at coordinate (81.2, 200)

When drawn out on screen the F would come first, then the D, then the P. The final result would read naturally, but a speech synthesizer would deliver the characters in the order “F”, “D”, “P”. I believe that the US government was sufficiently concerned about accessibility that it required Adobe to alter the PDF software so that the characters would be read aloud in the right order. This is the Eric Morecambe syndrome (his response to Andre Preview telling him he was playing all the wrong notes):

I am playing all the right notes, but not necessarily in the right order.
Eric Morecambe
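
To make this concrete, here is a minimal sketch of recovering reading order from PDF-style positioned glyphs. The coordinates mimic the instruction stream above (remember that PDF’s y axis points up the page).

```python
# Sketch: glyphs in instruction-stream order, each with page coordinates.
glyphs = [("F", 100.0, 200.0),   # drawn first...
          ("D", 90.3, 200.0),
          ("P", 81.2, 200.0)]    # ...but leftmost on the page

def reading_order(glyphs):
    """Sort top-to-bottom (descending y; PDF y points up), then left-to-right."""
    return "".join(ch for ch, x, y in sorted(glyphs, key=lambda g: (-g[2], g[1])))

assert reading_order(glyphs) == "PDF"
# A naive reader following the instruction stream would say "F, D, P".
```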

This spills over into all common syntactic constructs. Run-of-the-mill PDF has no concept of a paragraph, a line end or other common constructs. This gets worse with technical documents – you cannot tell where the diagrams or tables are or even if they are diagrams and tables. HTML got it 90% right – it has concepts such as “img”, “table”, “p”. PDF generally does not.
To reiterate: PDF is a cheap and reliable way of transporting a printed page from one site to another, and an inexpensive way of storing pages without paper. Beyond that it gets much less valuable very rapidly.
There’s a general problem with semantic information. If I write “the casus belli is very important” the emphasis (italics) tells me that the words carry semantic information. It doesn’t tell me what this information is. You have to guess, and often we cannot guess, or we guess wrong. This type of semantics is very fragile – if the phrase is cut-and-pasted you’ll probably lose the italics in most systems. If, however, you use HTML and write class="latin" or class="authorEmphasis" you immediately see that the semantics are preserved. So HTML can, with care, carry semantics. PDF generally cannot.
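
A minimal sketch of why the class attribute matters to a machine – the classes can be pulled out and acted upon, where italics would have been lost:

```python
# Sketch: recover the semantic classes from marked-up HTML.
from html.parser import HTMLParser

DOC = ('<p>the <span class="latin">casus belli</span> is '
       '<span class="authorEmphasis">very important</span></p>')

class SemanticSpans(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current, self.spans = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current:
            self.spans.append((self.current, data))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

parser = SemanticSpans()
parser.feed(DOC)
assert parser.spans == [("latin", "casus belli"),
                        ("authorEmphasis", "very important")]
```
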
To answer your other points rapidly (I will come back to them in more detail): I used to think Word was dire. Word 2007 has changed that. It can be used as an XML authoring tool. Not always easily, but it preserves basic semantics. And as for a chemical plugin to Word…
…I’ve run out of time 🙂


Open Knowledge in London

The Open Knowledge Foundation is holding the first Open Knowledge London Meetup on Wednesday 30th April:

The first Open Knowledge London meetup will take place this Wednesday at the London Knowledge Lab. The meetup should be great opportunity for informal discussion of open knowledge projects and issues. If you’d like to participate or present, please add details to the wiki page!

  • When: Wednesday 30th April, 19:00-21:00
  • Where: London Knowledge Lab, 23-29 Emerald Street, WC1N 3QS.
  • Wiki: http://okfn.org/wiki/LocalGroups/LondonGroup

PMR: I intend to be there (haven’t yet checked my diary). One of the values of blogging is that JudithMR knows what part of the world I am, or will be, in. I have to make a sacrifice – a lifelong supporter of Liverpool FC, I shall not be able to watch the match.

‘Some people believe football is a matter of life and death.
I’m very disappointed with that attitude.
I can assure you it is much, much more important than that.’

However Open Knowledge is also a matter of life and death and takes precedence. It is part of what we need to save the planet.

Why we need semantic chemical authoring-2

We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies, and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations), which lists about 1500 materials (most are chemical compounds though some are mixtures).
The information on the web is HTML pages and we wrote a scraper to extract the information from each. I’d planned to show a screenshot but WordPress has stopped me uploading any images, so you’ll have to visit the link. In any case you wouldn’t be able to see the point from a screenshot. Scraping is not fun – the HTML is as bad as almost any other HTML. It needed a 2-pass process – first through HTMLTidy and then analysis by XML tools. From this we extract the most important information and turn it into CML – names, formulae, connection tables, properties, etc.
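
For flavour, here is a minimal sketch of the two-pass idea, with lxml.html standing in for HTMLTidy; the URL and the label/value table layout are assumptions for illustration, not the real INCHEM markup.

```python
# Sketch of the two-pass scrape: a forgiving parser repairs the HTML (the
# HTMLTidy role), then XPath extracts label/value pairs. The URL and table
# layout are illustrative assumptions, not the real INCHEM structure.
import lxml.html

def scrape_card(url):
    doc = lxml.html.parse(url).getroot()    # pass 1: tolerant HTML parsing
    record = {}
    for row in doc.xpath("//table//tr"):    # pass 2: XML tooling
        cells = [c.text_content().strip() for c in row.xpath("./td")]
        if len(cells) == 2:
            record[cells[0]] = cells[1]     # e.g. "CAS #" -> "107-30-2"
    return record

# record = scrape_card("http://example.org/inchem/0509.htm")  # hypothetical
```
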
We wanted to see if the aggregation and consistency checking could be done by machine, using RDF. This is surprisingly hard as none of the sites contains all the information we need and many have large sparse patches. There is also the subtle problem of identifying the platonic nature of each chemical – what should we actually use as an entry for – say – alumin(i)um chloride? Or should there be more than one?
We’ve got the data in. There are a large number of simple but niggly lexical problems, such as the degree symbol for temperature (totally inconsistent within and between documents). And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)
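
The shape of the semantics is roughly this (CML expresses it in XML; the sketch below is just the idea, with names of my own choosing):

```python
# Sketch: keep the full semantics of "between 120 and 130 at 20 mm Hg"
# instead of flattening it to one number. Names are my own choosing.
from dataclasses import dataclass

@dataclass
class RangedQuantity:
    low: float
    high: float
    units: str
    condition: str = ""   # e.g. the pressure at which the value holds

boiling_point = RangedQuantity(low=120.0, high=130.0,
                               units="Celsius", condition="at 20 mm Hg")
```
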
And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).
If we had semantic authoring tools this wouldn’t happen. I’ll be blogging soon (I hope) about our activity in this area.
UPDATE: My best go at scraping the bit of the page with the error. It’s now semi-semantic (HTML) so you should be able to track the error down. You only have to know a little bit of chemistry…


Dimethylchloro ether
Chloromethoxymethane
CAS #: 107-30-2
RTECS #: KN6650000
CH3OCH2CI
Molecular mass: 80.5

TANSTAAFL: Openness is not a Free Lunch

In a reply to a recent post Rich Apodaca made the point that Open Access (Open Data) will require business models:

Rich Apodaca Says:
April 28th, 2008 at 1:38 am
By identifying and executing the right business model, the idea of control will become much less important. For example, you’ll find few complaints about Google essentially controlling the online search market; the vast majority of users are delighted to be able to search with the service whenever they want – and to have Google index their site.
This only happened because Google found the right business model and executed on it.
Maybe open access pricing and business models bring out nonproductive arguments because those putting them forward (and responding) are stuck in old patterns of thinking, or too heavily dependent on the current system. Scholars and publishers likely both share responsibility here.
My guess is that the open access scientific publication system that ends up working will start out by horrifying most of today’s scholars and being ridiculed or ignored by today’s publishers. But there will be a few niche groups for whom the truly disruptive open access innovation in scientific publishing will be a godsend.
Developing a workable open access business model starts by identifying who these groups are and how solving their problem can solve other important problems. It continues with finding a price and medium of exchange (perhaps not even money) that the market will find tolerable for awhile.
How can this issue be anything other than central to making open access work?

PMR: I agree generally with this – it’s often characterised as TANSTAAFL (There Ain’t No Such Thing As A Free Lunch). I think most of the major innovators (certainly the funders) realise this – that’s why they are prepared to develop funder-pays approaches for Open Access.
Data is/are a particular problem. Data are more expensive than manuscripts. It’s virtually cost-free to download, read, copy and transmit a standard PDF or HTML file, or any other document whose sole endpoint is to be read by humans. The creation of a reading human is, of course, not cost-free – the investment in the average human is large – but it’s not generally borne by higher education or scientific research (YMMV). But data are complex, and we are only at the start of learning what we can do with them. Open Data is not an end, but without it there is no beginning.
Data are normally produced for a particular purpose, and re-using them for another costs money. I’ll exemplify this with CrystalEye data – about 120,000 crystal structures and 1 million molecular fragments – which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up – it’s a tribute to his work that CrystalEye runs without attention for months on end.) The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.
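
By way of illustration, querying such an endpoint looks roughly like this (SPARQLWrapper is my choice of client; the endpoint URL and vocabulary are placeholders, not the real Talis-hosted ones):

```python
# Sketch: query a CrystalEye-style SPARQL endpoint. The endpoint URL and the
# chem# vocabulary are placeholders; I do not know the real ones.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/crystaleye/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX chem: <http://example.org/chem#>
    SELECT ?entry ?formula
    WHERE { ?entry chem:formula ?formula . }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["entry"]["value"], row["formula"]["value"])
```
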
This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams (Chemspider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye – I don’t know of any other case. Antony and I have had several discussions about CrystalEye – basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).
This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches, such as using standards like XML and RDF, can dramatically lower it. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded, but I don’t think Chemspider has used this.
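
Systematic download via such a feed is only a few lines (feedparser is my choice of library; the feed URL is a placeholder, not the real CrystalEye address):

```python
# Sketch: walk an Atom feed and fetch each linked entry. The feed URL is a
# placeholder, not the real CrystalEye feed.
import urllib.request

import feedparser

feed = feedparser.parse("http://example.org/crystaleye/feed.atom")  # hypothetical
for entry in feed.entries:
    with urllib.request.urlopen(entry.link) as response:
        data = response.read()              # one crystal structure per entry
    print(entry.title, len(data), "bytes")
```
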
The real point is that Chemspider wishes to use the data for a purpose different from the one for which they were produced. That’s fine. But as Rich says, it costs money. It’s unrealistic to expect that we should carry out the conversion for a commercial company for free. We’d be happy to consider a mutually acceptable business proposition, and it could probably be done by hiring a summer student.
I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanisms are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).
FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project, and the results could represent some of the ways in which Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry take a close look at RDF/SPARQL/CML/ORE, as those are going to be standard in other fields.
TANSTAAFL…
