Suber-Harnad strongOA/weakOA borderline

Peter Suber (again mysteriously unable to post comments) and Stevan Harnad clarify the weakOA/strongOA borderline. This is making it clearer to me. I accept that I completely misread or misunderstood the original statement. I’ll comment below. At present I am interested in seeing where the borderline is set, not what the two partitions are called. When we – and hopefully the community – agree on where the border is, I’ll comment on whether I think this is a useful way forward. My main concern now is what a “permission-barrier” is.
======================= from Peter Suber =======================
Peter: This is incorrect and confusing. The borderline between strong and weak OA is easy to define. Weak OA removes *no* permission barriers and strong OA removes *at least some* permission barriers. (Both of them remove price barriers.)
The fact that strong OA covers a range of different positions which may or may not be easy to distinguish is not relevant to the distinction between strong and weak OA itself.
Please see again the <a href="http://www.earlham.edu/~peters/fos/2008/04/strong-and-weak-oa.html">statement</a> in which Stevan and I introduced terms for describing this distinction. (NB: We all agree on the need for new terms. All I’m doing here is clarifying and reiterating the distinction itself.)
====================== from Stevan Harnad =====================
Stevan Harnad Says:
May 4th, 2008 at 12:50 pm
The Need to Specify a Minimal Lower Bound for Permission-Barrier-Free OA
I agree that Permission-Barrier-Free OA (by whatever name we give it) needs at least a minimal lower bound to be specified, otherwise it is too vague.
Price-Barrier-Free OA (regardless of what name we give it) does not need an upper or lower bound, because it is not on a continuum. It just means free access online. However, as I have said before, it does need to be shored up a bit by stating the obvious:
(1) The free access is to the full digital document (not just the metadata).
(2) The free access is one-click and non-gerrymandered: Instant download without having to do a song and dance for every page.
(3) The free access is immediate, not delayed or embargoed. (A document is not OA if it *will be* available free in a year, or in 10, or in 10,000.)
(4) The free access is permanent and continuous: A document is not OA if it is available free for a limited time, say, for an hour, or on even-numbered calendar months.
(5) There is no “degree of free” access: Low-priced access is not “almost” free access.
(6) The free access is for anyone netwide, not just those at certain sites or in certain domains or regions.
For Green Price-Barrier-Free OA self-archiving and Green Price-Barrier-Free OA mandates, all of these specifications are dead-obvious, irrespective of what proper name we choose for it. They are spelled out only for the pedantic and the obtuse.
But in the case of Permission-Barrier-Free OA, regardless of the name (and even in the case of the BBB definition), a minimal lower bound has to be specified, otherwise the condition is so vague as to make no sense. It is not just Peter Murray-Rust who is perplexed. Anyone would be. The BBB definition gives examples, but it does not give a lower bound. That is like saying “hot” means temperatures like 30 degrees, 300 degrees or 3000 degrees. That leaves one in perplexity about what, between 0 degrees and 30 degrees, counts as not hot. In particular, does Price-Barrier-Free OA alone count as Permission-Barrier-Free OA? The answer is No, but the only way to give this content is to specify a minimal lower bound for Permission-Barrier-Free OA.
========================== PMR =====================
This is useful. I was going to address the price-free criterion but Stevan has done a good job. I’ll add one or two additional concerns and see if he or anyone else thinks they should be included.
I hope we can agree on a spectrum like this:
closedAccess || weakOA ||<– strongOA –>| removal-of-all-permission-barriers (BBB, CC-BY)
Here || means a complete, clear barrier while | means potentially inclusive.  <– –> means a range of options. To help clarify:

  • closedAccess means either price barriers and/or access restrictions (specific logins) and/or embargo period and/or metadataOnly and/or abstractOnly and/or limited access to multiple documents.
  • weakOA means that anyone anywhere can bring up the full document on their screen (as often as they wish). They have no rights other than “fair use” (which is not relevant to OA, IMO). All possible permission-barriers are in place. [We need to discuss these later.] I would interpret this as meaning there is only one type of weakOA, and manifestations differ only in non-OA matters.
  • strongOA covers a spectrum defined by the removal of one or more permission barriers. Let’s assume there are N such. Then a document with 1 PB removed and N-1 PBs remaining is strongOA and so is a document with 1 PB remaining and N-1 removed. If all N PBs are removed this is a special case.
  • BBB, CC-BY are a special case of strongOA with no permission barriers that could be potentially removed.

If we can agree on this that will be useful. If not I have again failed to understand.
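The spectrum above can be sketched as a tiny classification rule. This is purely illustrative: it assumes the N permission barriers can be enumerated, and the barrier names below are hypothetical placeholders, not an agreed list.

```python
# A minimal sketch of the proposed closedAccess || weakOA || strongOA spectrum.
# The barrier names are illustrative placeholders, not an agreed list.
PERMISSION_BARRIERS = {
    "local-copy", "emailing", "reproduction", "indexing", "mining", "printing",
}

def classify(price_barrier_free, barriers_removed):
    """Place a document on the closedAccess || weakOA || strongOA spectrum."""
    if not price_barrier_free:
        return "closedAccess"          # price barrier or access restriction in place
    if not barriers_removed:
        return "weakOA"                # free to read; all permission barriers remain
    if barriers_removed == PERMISSION_BARRIERS:
        return "strongOA (BBB/CC-BY)"  # special case: every barrier removed
    return "strongOA"                  # some, but not all, barriers removed

print(classify(True, set()))                # weakOA
print(classify(True, {"mining"}))           # strongOA
print(classify(True, PERMISSION_BARRIERS))  # strongOA (BBB/CC-BY)
```

Note that in this sketch weakOA is a single point (no barriers removed) while strongOA is a whole family of states, which is exactly the asymmetry discussed above.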
Now weakOA. Here are some additional comments for Stevan’s list:

  • weakOA must have the explicit or implicit permission of the publisher. Self-archiving (by individuals or institutions) forbidden by the publisher could lead to take-downs and is therefore not permanent OA.
  • I’d replace “gerrymandering” (which is to do with an electoral system) by “restricted technical access”. This includes DRM, explicit logins and restriction of the number or type of downloads.
  • weakOA in hybrid journals may be overridden by contractual obligations entered into by the reader’s institution. Thus, as a private individual, I may be allowed to access multiple articles in a hybrid journal, but if I do this from a university address the system might cut me off for exceeding the download limit.

weakOA has ALL conceivable permission barriers. We need to know what they are, because otherwise we cannot define anything as weakOA. Here are some currently enforced permission-barriers.

  • you may not store a local copy of the article on your machine
  • you may not mail copies of the article
  • you may not reproduce any of the article beyond fair use
  • you may not index it nor publish an index
  • you may not use it for data- or text-mining
  • you may not make print copies of the article

Stevan said:

Most Gold OA today is just Price-Barrier-Free OA.

which implies that it retains all the permission barriers. However, if, for example, we regard the prohibition of storage on disk as a permission-barrier, and a publisher removes it, that makes the article strongOA. So it is probably easy to claim that almost all OA is strongOA. If so, the use of “any permission-removal” to define strongOA makes it almost meaningless, as it is always possible to find something you can do without ending up in court.
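The degeneracy can be made concrete with a toy count (the barrier names are again illustrative): under the “any permission barrier removed” rule, every non-empty combination of removed barriers counts as strongOA, so nearly every possible permission state qualifies.

```python
from itertools import combinations

# Illustrative barrier names only, not an agreed list.
barriers = ["local-copy", "emailing", "reproduction", "indexing", "mining", "printing"]

# Every possible permission state = every subset of barriers that might be removed.
states = [set(c) for r in range(len(barriers) + 1) for c in combinations(barriers, r)]

# Under the "any barrier removed" rule, any non-empty removal set is strongOA.
strong = [s for s in states if s]

print(len(strong), "of", len(states), "states are strongOA")  # 63 of 64 states are strongOA
```

With N barriers, 2^N − 1 of the 2^N possible states are strongOA, which is why the definition carries so little information without a lower bound.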
Note, in passing, that the NIH/PMC system is not even weakOA, and so not OA at all. There is a restriction on the number of articles that can be downloaded, so whenever I access any particular one I may be cut off.
That’s not any form of Open Access.

Posted in semanticWeb, Uncategorized | 1 Comment

Peter Suber's comments on strongOA/weakOA

For some reason this blog does not accept comments from Peter Suber so he has sent one by email. I copy it and then add a brief comment:

[PS] I made two points in my <a href="http://www.earlham.edu/~peters/fos/2008/04/strong-and-weak-oa.html">blog post</a> last week on strong/weak OA which address two of Klaus’ concerns.
(1) Stevan and I wanted a term for the kinds of OA which remove price barriers and some or any permission barriers. We suggested the term “strong OA” to cover them. As I said in my blog post, “there is more than one kind of permission barrier to remove, and therefore…there is more than one kind or degree of strong OA.” CC-NC removes some permission barriers (and therefore is strong in this sense) and leaves some in place (and therefore is not as strong as other licenses). The only way to say that CC-NC is not strong is to use “strong” in some other sense. Klaus is free to do that, but he should then be clear that he’s using his own definition of the term.
Disagreeing how to define the term “strong OA” (and therefore on whether CC-NC counts as strong OA) shouldn’t hide the fact that we agree on the BBB, agree on CC-NC, and on all the other issues of substance which he raises here. Nor should it hide the fact that clarity about the strong/weak distinction (regardless of the terms we use to express it) can promote communication and minimize needless quarrels among people who agree on issues of substance. Stevan and I were trying to promote this kind of clarity, starting with ourselves. We wanted to be clear and careful about distinctions we all recognize, not to be mufti.
(2) Klaus writes, “Harnad and Suber don’t have the right to [change] BBB.” This is a deeper misunderstanding because it’s not about words. Stevan and I were not changing the BBB. In fact, this was uppermost in my mind. Formerly, Stevan did want to revise the BBB. I didn’t, and we have now found common ground on this. As I said in my post: “We agree that the BBB definition of OA does not need to be revised.” This was an important development for all supporters of the BBB.
I agree with Klaus, Bernd-Christoph Kämper, and others that CC-BY is better than CC-NC. That has been my position for years, and I haven’t retreated from it. Anyone who thinks I did should reread what I said.

PMR: This helps to clarify the position in part. It seems that for Suber-Harnad strongOA represents a spectrum of strengths, and using this terminology we can reasonably say “this strongOA is stronger than that strongOA”. I (and I suspect many others) misread strongOA to mean a single barrier line corresponding to (roughly) BBB or CC-BY. This is an easy line to define (I hope). A spectrum of strongOA is not.
strongOA thus represents the removal of some or all permission-barriers. Unless we know what a permission barrier is, we can’t say which barriers a document or service has. Is the freedom to download what you read on the screen (you can’t help downloading it) a permission? Because if so (and I assume it isn’t), then almost all OA can be classified as strongOA. In fact Suber-Harnad have said that almost all OA is weakOA.
So is CC-NC the proposed borderline? If so we need to know – and have a right to comment.

[PS] A minor point: I would have welcomed this discussion in any of the forums I formerly moderated. But in fact I no longer moderate any forums. The BOAI Forum is now moderated by Iryna Kuchma and the SPARC OA Forum is moderated by Stacie Lemimck. In both cases, the switchovers were announced on the lists themselves.

PMR: Thanks. But in general there doesn’t seem to be a huge amount of general discussion on either of these – mainly reports and announcements. But I don’t subscribe. It would certainly be better to have a central area rather than try to manage it on a blog.


The strongOA-weakOA borderline is undefinable

Stevan Harnad (one of the creators of the current terms strongOA and weakOA – Peter Suber is the other) now makes it clear (this blog, May 3rd, 2008 at 2:00 pm) that an objective operational definition of strongOA is impossible, so I shall stop trying:

Permission-Barrier-Free OA is a continuum of CC-license levels
You can’t define Permission-Barrier-Free OA absolutely any more than you can define “hot” absolutely, because both are a matter of degree.

PMR: I am glad that this is now clear. This means that for one person a document or journal can be strongOA (your term) while for another it is merely weakOA (your term). “strong” and “weak” are thus subjective and cannot therefore be used to determine (say) whether a journal article can legally be used in any particular way or whether a publisher is charging a reasonable fee for a funder-pays article.

Price-Barrier-Free is not a matter of degree: It means accessible free online (immediately, permanently).

PMR: I think I agree, but I will explore this and see if we concur.

Green OA means whatever OA means (whether price OA or permission OA), but provided by author self-archiving.
Gold OA means whatever OA means (whether price OA or permission OA), but provided by publishing in an OA journal.
Virtually all Green OA today is just Price-Barrier-Free OA (a necessary but not sufficient condition for permission OA(s))
Most Gold OA today is just Price-Barrier-Free OA.

PMR: I would find life much easier if the colour labels disappeared. Self-archiving OA is easy to understand; “Green” is confusing. It may not be confusing to you, but it confuses a lot of people.

What you call just a matter of “sociopolitical observations and beliefs” is what I call working to actually generate OA.

PMR: I don’t believe I ever said “just”. I tried to use a descriptive, non-emotive phrase that would distinguish it from the technical descriptions that I favour.

The preoccupation with definitional details (while time’s a’passing and research access and impact continue to be lost, daily, needlessly and cumulatively, while we dither) is what I would call just a matter of “sociopolitical observations and beliefs”.

PMR: I think it’s now very clear where we both stand – we have common objectives within the area of Open Access. We differ on how we talk about them and how we want to achieve them. I like well-defined situations and algorithmic rules; you like grand visions and rhetoric (not a pejorative term). I thought that this week we had common ground in “strongOA” and felt that was a major achievement. Since, however, anyone can redefine it to mean what they like, we have world views that seem unlikely to merge without friction.


SPARC requires CC-BY for their "OA Seal"

I missed this announcement from SPARC or I would certainly have trumpeted it… Comments at end…

From: David Prosser [email]
Sent: 23 April 2008 17:18
To: ‘sparc-europe@arl.org’
Subject: Launch of the SPARC Europe Seal for Open Access Journals

Lund, Sweden, 23 April 2008

SPARC Europe and the Directory of Open Access Journals Announce the Launch of the SPARC Europe Seal for Open Access Journals

Seal to Set Standards for Open Access Journals

For more information, contact: David Prosser, [email] or Lars Björnshauge, [email]

Oxford, UK and Lund, Sweden — SPARC Europe (Scholarly Publishing and Academic Resources Coalition), a leading organization of European research libraries, and the Directory of Open Access Journals (DOAJ), Lund University Libraries today announced the launch of the SPARC Europe Seal for Open Access journals.  Growing numbers of peer-reviewed research journals are opening up their content online, removing access barriers and allowing all interested readers the opportunity of reading the papers online, with over 3300 such journals listed in the DOAJ, hosted by Lund University Libraries in Sweden.

However, the maximum benefit from this wonderful resource is not being realised as confusion surrounds the use and reuse of material published in such journals.    Increasingly, researchers wish to mine large segments of the literature to discover new, unimagined connections and relationships.  Librarians wish to host material locally for preservation purposes.  Greater clarity will bring benefits to authors, users, and journals.

In order for open access journals to be even more useful and thus receive more exposure and provide more value to the research community it is very important that open access journals offer standardized, easily retrievable information about what kinds of reuse are allowed.  Therefore, we are advising that all journals provide clear and unambiguous statements regarding the copyright status of the papers they publish.  To qualify for the SPARC Europe Seal a journal must use the Creative Commons By (CC-BY) license which is the most user-friendly license and corresponds to the ethos of the Budapest Open Access Initiative.

The second strand of the Seal is that journals should provide metadata for all their articles to the DOAJ, who will then make the metadata OAI-compliant.  This will increase the visibility of the papers and allow OAI-harvesters to include details of the journal articles in their services.

‘We want to build on the great work already done by the publishers of many open access journals and improve the standards of open access titles,’ said David Prosser, Director of SPARC Europe.  ‘Working with the DOAJ means that we can provide help and guidance to journals who wish to move beyond the first step of free access to full open access, and our long-term aim is to ensure that all journals listed in the DOAJ can attain the standards expressed within the Seal.’

‘Improving the standards of the rapidly increasing numbers of open access journals and contributing to the widest possible visibility, dissemination and readership of the journals is very much in line with our mission,’ said Lars Björnshauge, Director of Libraries at Lund University. ‘We are very happy to see the enormous usage of the DOAJ and the support from our membership.’

‘Legal certainty is essential to the emergence of an internet that supports research. The proliferation of license terms forces researchers to act like lawyers, and slows innovative educational and scientific uses of the scholarly canon,’ said John Wilbanks, Executive Director of Science Commons. ‘Using a seal to reward the journals who choose to adopt policies that ensure users’ rights to innovate is a great idea. It builds on a culture of trust rather than a culture of control, and it will make it easy to find the open access journals with the best policies.’

‘This is an excellent program with two important recommendations.  CC-BY licenses make OA journals more useful, and interoperable metadata make them more discoverable.  The recommendations are easy to adopt and will accelerate research, facilitate preservation, and make OA journal policies more open and more predictable for users.  I hope all OA journals will adopt them – not to get the Seal from SPARC Europe and the DOAJ, but for the same reasons that moved these organizations to launch the program: to make OA journals more visible and useful than they already are,’ said Peter Suber, Open Access Advocate & Author of Open Access News.

PMR: I don’t know how I missed this… it’s exactly what I want. [Does “Europe” mean that only European journals or publishers qualify? I assume not.]

So it solves most of my problems:

  • an organisation I respect and which has the guts and perspicacity to take on difficult problems.
  • clear thinking and language
  • concern about the needs of (scientific) scholars

So the operational borderline is BBB-OA == CC-BY. Simple.
Doubtless with a few tweaks it could be applied to papers in hybrid journals (though the sooner they go the better) and theses.
CC-BY is simple. It’s 2 letters, a hyphen-minus and 2 more letters. People know what it means. If they don’t it’s the top-hit in Google (Creative Commons Attribution 2.0 Generic). Pronounce it “see-see-by”. It takes less than a second to utter it.
And since SPARC is the central organisation handing out OA gongs, everyone will shortly start to see them.

  • “What’s this?”
  • “it’s the SPARC OA seal”
  • “What does it mean?”
  • “It means you don’t have to worry. You can do what you like”.
  • “Does our Institutional Repository have them? Can I put one on my thesis?”

… but that’s another part of the story.


How many forms of OA are there now?

Although the main OA world seems oblivious to the need to define what it is talking about, the discussion continues on this blog! Stevan Harnad now writes a comment which, when taken with Peter Suber’s comments, leaves me more perplexed than before:

Stevan Harnad Says:
May 3rd, 2008 at 2:41 am
The Two Forms of OA Are Now Defined: They Now Need Value-Neutral Names

PMR: This title worries me greatly. There are not two forms of OA. There are now at least three. The BBB declaration contains a definition. That definition is signed by both Suber and Harnad, and they have agreed that it should not be amended. Peter Suber makes it clear on these pages that “strongOA” is NOT the same as BBB-OA. It appears that BBB-OA is a special case of strongOA. So even in the Suber-Harnad camp there are logically three types of OA: weak (I will abbreviate it to wOA so as not to offend), strongOA and BBB-OA. Whatever the rights or wrongs of each, that seems to be the current state.
BBB is well defined in a single sentence of power and clarity:
By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
I am now completely unclear what wOA and sOA are. They are defined in terms of “permissions”, which is currently undefined and not self-explanatory. (I repeat: nothing in OA is self-explanatory.)

To repeat, “Weak/Strong” OA marks a logical distinction: price-barrier-free access is a necessary condition for permission-barrier-free access, and permission-barrier-free access is a sufficient condition for price-barrier-free access. That is the logic of weak vs. strong conditions.

I have already agreed this. I just don’t have any idea what permission-barrier-free means.

The purpose of our joint statement with Peter Suber was to make explicit what is already true de facto, which is that both price-barrier-free access and permission-barrier-free access are indeed forms of Open Access (OA), and that virtually all Green OA today, and much of Gold OA today, is just price-barrier-free OA, not permission-barrier-free OA, although we both agree that permission-barrier-free OA is the ultimate desideratum.
But what Peter Suber and I had not anticipated was that if price-barrier-free OA was actually named by its logical condition as “Weak OA” (i.e., the necessary condition for permission-barrier-free OA) then that would create difficulties for those who are working hard for the adoption of the mandates to provide price-barrier-free OA (Green OA self-archiving mandates) that are only now beginning to grow and flourish.

PMR: I have no further views on this – it is a sociopolitical observation, but it does not help us define what we are talking about. I would be happy if the types of OA were called OA1, OA2, OA1.2a, etc.

In particular, Professor Bernard Rentier, the Rector of the University of Liege (which has adopted a Green OA self-archiving mandate to provide price-barrier-free OA) is also the founder of EurOpenScholar, which is dedicated to promoting the adoption of Green OA mandates in the universities of Europe and worldwide. Professor Rentier said quite explicitly that if price-barrier-free OA were called “Weak OA,” it would make it much harder to persuade other rectors to adopt Green OA mandates — purely because of the negative connotations of “weak.”
Nor is the solution to try to promote permission-barrier-free (”Strong OA”) mandates instead, for the obstacles and resistance to that are far far greater. We are all agreed that it is not realistic to expect consensus from either authors, university administrators or funders on adoption or compliance with mandates to provide permission-barrier-free OA at this time, and that the growth of price-barrier-free OA should on no account be slowed by or subordinated to efforts to promote permission-barrier-free OA (though all of us are in favour of permission-barrier-free OA too).

PMR: These are sociopolitical observations and beliefs that do not help clarify what we are talking about.

So, as the label “weak” would be a handicap, we need another label. The solution is not to spell it out longhand every time either, as “price-barrier-free OA,” etc. That would be as awkward as it would be absurd.

PMR: If price-barrier-free OA actually describes what you are talking about, it seems a useful term. If people adopt it, it will get acronymised (e.g. PBFOA, or shorter). If it doesn’t get adopted, then it doesn’t matter. Most people are capable of managing multiword terms. It seems perverse to exchange clarity for short English words which are bound to confuse. “Soft” OA is completely meaningless.
You use “Green” above. I thought I knew what it meant. Now I don’t. If it means self-archiving-in-institutional-repositories, call it that, shortened to SAIR, as perhaps distinguished from self-archiving-on-web-pages (SAWP). “Green” is a political term, not a technical one.

So we are looking for a short-hand or stand-in for “price-barrier-free OA” and “permission-barrier-free OA” that will convey the distinction without any pejorative connotations for either form of OA. The two forms of OA are now defined, explicitly and logically. They are now in need of value-neutral names.
Suggested names are welcome — but not if they have negative connotations for either form of OA. Nor is it an option to re-appropriate the label “OA” for only one of the two forms of OA.

PMR: I agree completely with the last sentence. I have spent the last year on this blog getting into rows with people because I thought OA was defined, and now I know it isn’t – not until you define a permission-barrier. Until we have some definitions we are in a mess.


The merits and demerits of PDF

Chris Rusbridge and I have been indulging in a constructive debate about whether PDF is a useful archiving tool. Chris, as readers know, runs the Digital Curation Centre. I’ll reproduce the latest post from his blog and intersperse it with comments. But just before that I should make it clear that I am not religiously opposed to PDF, just to the present incarnation of PDF and the mindset that it engenders in publishers, repositarians, and readers. (Authors generally do not use PDF.)

20:28 02/05/2008, Chris Rusbridge,
[Robotic mining] is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.
PMR: Agreed

However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily to be deliberately obscure, but perhaps as side-effects of the process leading to the PDF).

PMR: Agreed. An expansion is “most articles are authored in Word/OO or LaTeX and converted to PDF for the purposes of publishing.”

Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man[*], but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I’ve referred to his arguments in the past, and we’ve been having a discussion about it over the past few days (see here, its comments, and here).

PMR: [*] Alma Swan does me the honour to quote this in her talks 🙂

I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether and if so, how PDF could be improved to be more fit for the scientific purpose.

PMR: I don’t think scientists care about PDF. It’s something that comes down the wire. If it came down in Word they wouldn’t blink. So it’s the publishers, not the readers. And most authors create Word. They tip it into the publisher’s site, which converts it to PDF.
PMR: Having said that, “PDF” is rapidly moving from a trademark to an English word. Rather than “send the manuscript” it’s “send the PDF”. Just like “please send us your Powerpoints”.

One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, e.g. through the use of microformats or RDFa. If references to a gene could be tagged according to the Gene Ontology, and references to chemicals tagged according to agreed chemical names, InChIs etc., then the data-mining robots would have a much easier job. Maybe PDF already allows for this possibility?

PMR: This is completely possible at the technical level. My collaborator Henry Rzepa is keen on using PDF as a compound document format and metadata container. It can do it. But nobody does, and it will certainly require tools that have to be bought.
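To make the microformat idea concrete, here is a toy sketch. The `class="chem"` and `data-inchi` attribute names are hypothetical conventions invented for this illustration, not any published standard; the point is only that once identifiers are embedded in the markup, a mining robot’s job becomes trivial.

```python
from html.parser import HTMLParser

# Toy document: "chem" / "data-inchi" are invented attribute names, not a standard.
DOC = '<p>We dissolved <span class="chem" data-inchi="InChI=1S/H2O/h1H2">water</span> in ethanol.</p>'

class ChemExtractor(HTMLParser):
    """Collect machine-readable chemical identifiers from tagged markup."""
    def __init__(self):
        super().__init__()
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "chem" and "data-inchi" in a:
            self.inchis.append(a["data-inchi"])

p = ChemExtractor()
p.feed(DOC)
print(p.inchis)  # ['InChI=1S/H2O/h1H2']
```

Contrast this with extracting the same chemical name from a typeset PDF, where the word “water” carries no machine-readable identity at all.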

PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF’s determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.
I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.

PMR: I’m not asking for XML. I’m asking for either XHTML or Word (or OOXML)

CR: I think we should tackle this in several ways:

  • try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs

PMR: I am one hundred percent in favour of this. The problem is that most publishers are one hundred percent against it. For business reasons, not technical. Because they are worried that people might steal their content (oops, the content that we wrote).

  • try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs

PMR: Most large publishers have an XML format. It’s trivial for them to create HTML and many do. This is a business problem, not a technical one.

  • tackle more domain ontologies to get agreements on semantics

PMR: agreed. This is orthogonal to whether we use PDF, Word or clay tablets (many ancient civilisations used markup)

  • work on microformats and related approaches to allow semantics to be silently encoded in documents

PMR: Absolutely agreed. It needs tools but we have some cunning plans for chemistry which will be revealed shortly.

  • try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these

PMR: ditto

  • try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.

PMR: The first part already exists. I would not espouse the second as I don’t want to have to purchase another set of tools for something that should be free.

Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science…

PMR: It will work at some stage – the stage when the publishers want to help scientists in their endeavour rather than prevent them taking the next logical step because it might impact on subscriptions or be extra work. The W3C community, the Googles, Flickrs, etc. do all this already. They have semantic linked data. It’s just that the scientific publishing Tardis is still stuck in the nineteenth century. It looks lovely from the outside.

Posted in Uncategorized | 8 Comments

Further discussion on strongOA and weakOA

I have still seen very few public comments but have now had comments on this blog from Stevan, PeterS and Klaus Graf which is at least a good spectrum. So I’ll comment in detail, and meanwhile hope to goad some of the silent throng into actually saying something…

Klaus Graf Says:
May 2nd, 2008 at 10:03 pm e
I do not think it’s a good thing if the two leading OA advocates decide “par ordre de Mufti”. The OA community has few free forums. Harnad and Suber are strictly moderating their lists and Suber doesn’t allow comments to his weblog.

PMR: I have sympathy with this view but am trying to remain objective. Klaus is right that there isn’t a clear place to discuss OA. FWIW I do not censor this blog other than spam so anyone can post whatever they wish here.

For me weak OA isn’t enough (and the pejorative connotations of “weak” are appropriate) and CC-BY-NC is definitively NOT strong OA.
Many thousands of scholars and scientists support the BBB definition of OA, which includes commercial use and derivative works. BBB is a necessary condition for strong OA because it is the only authoritative consensus. If one person (Harnad) has another opinion, that’s the problem of this person. Harnad and Suber don’t have the right to change BBB and the accepted definition of OA.
The German librarian and OA advocate Bernd-Christoph Kämper has given some arguments that commercial use is necessary. Like me he regrets that Suber has softened his position.
http://archiv.twoday.net/stories/4900938/ (Comment in German)
Please read the Archivalia entries he mentions. They are in English.

PMR: Being objective, Suber and Harnad have created their own terminology. It doesn’t have to be agreed by any organisation. Stevan has created several different terminologies in the past – green, gray, light green, gold, etc. – which (IMO) were highly confusing. You’ll see from his post below that he wants to get rid of “weak” and “strong” after only a few days. Whatever the rights and wrongs, I regret the confusion this will cause. If, after having announced what seemed to be a clear position, it is then renamed, it will give the message that the community cannot work out what it is talking about.

BTW: I agree to license my comments here under the Creative Commons Attribution license (the default is NonCommercial).

PMR: This list is already CC-BY (after your prompting :-). We originally started with CC-NC, and changed when we found out how to fix WordPress.
PMR: In my view the Open Access movement desperately needs a central organisation. The funders and universities are pumping millions if not billions into “OA” and they cannot even define what it is. Open Source has the OSI, which determines whether or not a given licence is OS. Open Knowledge, after only a short time of volunteer effort, has the OKF, an agreed definition and a list of conformant licences. Funders pay money for stuff that is ultra-weak OA; universities haven’t a clue what the status of the material in their repositories is. I ask “Please can I have some Open Access theses from your repository to datamine?” and am told “Sorry, we don’t know whether you are allowed to or not”. It’s clearer from the British Library – “no to everything.”
=================================================================

Stevan Harnad Says:
May 2nd, 2008 at 8:23 pm e
Remedy Needed to Prevent Unintended Negative Connotations of “Weak” from Becoming a Liability
Important caveat: “Weak/Strong” OA marks the logical distinction: price-barrier-free access is a necessary condition for permission-barrier-free access, and permission-barrier-free access is a sufficient condition for price-barrier-free access. That is the logic of weak vs. strong conditions.

PMR: Great! The first algorithm I have seen. I agree completely. We need algorithms. strongOA subsumes weakOA.
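
Since we are asking for algorithms, here is a toy formalisation of that subsumption (my own sketch, not Suber-Harnad’s wording): a strong-OA document satisfies everything weak OA requires, plus more.

```python
# Toy model: weak OA = price barriers removed; strong OA = price barriers
# removed AND at least some permission barriers removed.

def is_weak_oa(price_free, permission_barriers_removed):
    """Weak OA: free online access; permission barriers are irrelevant."""
    return price_free

def is_strong_oa(price_free, permission_barriers_removed):
    """Strong OA: price-barrier-free AND at least some permission barriers removed."""
    return price_free and permission_barriers_removed > 0

# Every strong-OA document is also weak OA; the converse does not hold.
for price_free in (True, False):
    for removed in (0, 1, 5):
        if is_strong_oa(price_free, removed):
            assert is_weak_oa(price_free, removed)
```

The nested loop is the whole point: there is no combination that is strong but not weak, while (True, 0) is weak but not strong.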

But since Peter [==PeterS] and I agreed on the distinction, and agreed that both price-barrier-free access and permission-barrier-free access are indeed open access, many of our colleagues have been contacting us to express serious concern about the unintended pejorative connotations of “weak.”

PMR: I think it’s a pity the colleagues are not more openly vocal. Personally I was delighted with the terms “weak” and “strong” as I thought they gave exactly the right connotations. I have no idea whether I’m in a minority because no-one is saying anything.

As a consequence, to avoid this unanticipated and inadvertent bias, the two types of OA cannot be named by the logical conditions (weak and strong) that define them. We will soon announce a more transparent, unbiased pair of names. Current candidates include:
Transparent, self-explanatory descriptors:
USE OA vs. RE-USE OA
READ OA vs. READ-WRITE OA
PRICE OA vs. PERMISSION OA

PMR: There are NO SELF-EXPLANATORY TERMS in OA. Until this is recognised the situation is as bad as ever. I do not understand “RE-USE”. I am certain that I would interpret it differently from Stevan and PeterS. If I interpret something differently (as I did with the strongOA/weakOA borderline) it’s not because I’m stupid, or ignorant, or wilful, but because it’s not clear. So far we have a score of 2-2 (Suber, Harnad vs Murray-Rust, Graf) as to where the intended border was before it was defined. The definition is arbitrary, not self-explanatory.
What does “PERMISSION” mean? I now have no idea. If I am allowed to read something, that is a permission. If I am allowed to mount it on my web site but nowhere else, that is a permission. And so forth. I promoted the idea of OA-permission and OA-free (or something like it) some while ago but it didn’t carry weight at the time…
Unless we actually start to define these terms they will continue to be of little value.

Generic descriptors:
BASIC or GENERIC OR CORE OA vs. EXTENDED or EXTENSIBLE or FULL OA
SOFT OA vs. HARD OA
EASY OA vs. HARD OA

PMR: I don’t see any point in these at all. And unless they are defined they are meaningless.

The ultimate choice of names matters far less than ensuring that the unintended connotations of “weak” cannot be exploited by the opponents of OA, or by the partisans of one of the forms of OA to the detriment of the other. Nor should mandating “weak OA” be discouraged by the misapprehension that it is some sort of sign of weakness or of a deficient desideratum.
Stevan Harnad

PMR: The issue is in the balance. Duck the definitions and the publishers will take everything except *-BY for a ride.
In sciences such as Natural Language Processing we have defined metrics. We have a set of annotation guidelines and then require humans to annotate a corpus. For chemical name recognition (i.e. “is this a chemical or not”) the experts agree at 92%. I suspect that if you gave the average involved person (funder, publisher, repositarian) a set of documents on defined sites and asked them whether they were strong or weak OA you would be lucky to get 50% agreement. And 50% is useless.
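
To make the analogy concrete, here is a minimal sketch of the sort of pairwise agreement metric behind figures like the 92% above (the annotator labels are invented for illustration):

```python
# Simple pairwise percentage agreement between two annotators.

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two hypothetical annotators deciding "is this token a chemical name?"
a = [True, True, False, True, False, False, True, False]
b = [True, False, False, True, False, True, True, False]
print(percent_agreement(a, b))  # 0.75
```

Running the same calculation over “is this document strong or weak OA?” judgements from funders, publishers and repository managers would make the 50% claim testable.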

Posted in Uncategorized | 2 Comments

Peter Suber on what is strongOA

Peter Suber has replied to my general request for the definition of strongOA. (It got lost in the comments queue, as has also happened to StevanH :-):

Hi Peter.
Just one point of clarification.  As I said in my blog <a href="http://www.earlham.edu/~peters/fos/2008/04/strong-and-weak-oa.html">post</a> on Tuesday, “there is more than one kind of permission barrier to remove, and therefore…there is more than one kind or degree of strong OA.”  BBB OA is definitely strong OA, but not all strong OA is BBB OA.
As soon as we move beyond the removal of price barriers to the removal of permission barriers, we enter the range of strong OA.  Hence, an article with a CC-NC license is strong OA because it allows some copying and redistribution beyond fair use (even if it doesn’t allow all copying and redistribution).  My own preference is still for the CC-BY license, but we shouldn’t speak as if CC-NC were not strong OA or as if there were just one kind of strong OA.

PMR: This is very useful. At this stage I simply want to find out where the dividing line is. Not whether it should be drawn there. So if Suber-Harnad strongOA includes CC-NC, then that’s part of the definition.
It’s essential that our community decides how to draw the boundary. It’s not like a healthy-unhealthy spectrum – there should be a clear dividing line between strongOA and weakOA that we all know how to operate. Because otherwise we end up with debates about what we mean, rather than what we want to do about it.
The first question, then, is how we distinguish strongOA from weakOA. Peter and Stevan have given principles, and I now need algorithms to determine how to apply them.
A good starting point is to take the current labels and major OA publishers or OA journals and see if we agree. We have to agree.
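
As a strawman, here is what such an algorithm might look like, applying the borderline exactly as Peter states it above: anything that removes at least some permission barriers (e.g. a CC licence allowing copying and redistribution beyond fair use) is strongOA. The licence list is illustrative only, not an authoritative ruling.

```python
# Toy classifier for the Suber-Harnad strong/weak borderline.

PERMISSION_GRANTING_LICENCES = {"CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-NC-SA", "CC0"}

def classify_oa(free_to_read, licence=None):
    if not free_to_read:
        return "not OA"      # price barriers still in place
    if licence in PERMISSION_GRANTING_LICENCES:
        return "strongOA"    # at least some permission barriers removed
    return "weakOA"          # free to read, but all rights reserved

assert classify_oa(True, "CC-BY-NC") == "strongOA"  # per Peter's comment above
assert classify_oa(True) == "weakOA"
```

If we cannot agree on the contents of a table like PERMISSION_GRANTING_LICENCES, we have not agreed on the borderline.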
In passing I am surprised that this announcement has not generated more discussion. I think it’s one of the most important OA events this year (after, say, NIH and SCOAP3). It’s critical for all of the following, and more:

  • librarians (who are going to have to manage it)
  • funders (who want to know what they are paying for)
  • authors (who worry about their rights)
  • and information-seekers like me who want to know what they can do with the data

So please start commenting. Give us examples of strongOA, and not-so-strong, …

Posted in Uncategorized | 3 Comments

How can we create semantic chemical information?

We’re looking to create an Open semantic resource for chemistry for a group of common chemicals – partly as a partner in the ORECHEM (Chemistry Repositories) project and partly because we need it for our own work in machine understanding of chemical text (OSCAR). We are developing an RDF-based repository and want to populate it with semantic information. Initially a maximum of 2-3 thousand common chemicals with names, identifiers, chemical formulae of various types and the commonest (mainly physical) properties. Nothing particularly special – the sorts of things that undergraduates will come across – but it must be semantic and it must be Openly redistributable without permission (Open Data). Note that we cannot legally robotically access the major chemical databases maintained by CAS and Beilstein. Nor can we (yet) extract enough high-quality information from the published literature. We are close to having the technology, but still encounter the lawyer-barrier.
Initially we want to hold the following:

  • chemical names
  • chemical composition (brutto formula – When did this term start being used, and is there a definition?)
  • chemical structure/connection table
  • identifiers
  • molecular mass
  • physical properties
  • (possibly) safety data
  • links to other sites
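
To show the shape of the data, here is a sketch of how one entry with the fields above might look as simple subject-predicate-object triples. The “ex:” property names are invented for illustration and are not our actual schema.

```python
# One chemical record as a list of (subject, predicate, object) triples.

aspirin = "ex:aspirin"
triples = [
    (aspirin, "ex:name",          "aspirin"),
    (aspirin, "ex:name",          "acetylsalicylic acid"),
    (aspirin, "ex:formula",       "C9H8O4"),
    (aspirin, "ex:smiles",        "CC(=O)Oc1ccccc1C(=O)O"),
    (aspirin, "ex:molecularMass", 180.16),
    (aspirin, "ex:meltingPointC", 135),
]

# Trivial query: every name asserted for the entry.
names = [o for s, p, o in triples if s == aspirin and p == "ex:name"]
```

The point of the triple form is that multi-valued fields (names, identifiers) need no special casing, and provenance can later be attached per-triple.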

A useful starting point is Rich Apodaca’s list of free chemistry databases (and you can find many of the databases mentioned here in it). This is about 15 months old and there may be more now. I also include ChemSpider as it has some community contributions. Many databases are not really relevant as they are too large, do not have programmatic access, or are only partly chemical.
The obvious starting point is Wikipedia and we are working closely as part of Wikipedia to add semantics to the information. Indeed we would see the results of our endeavours as giving a resource which could be used to help WP in quality control. The current main problem is the inconsistency of the information, especially in the variable syntax and semantics of the InfoBox.
It would be nice to start with the NIST webbook but this is not Open – and there are copious copyright notices and indications that NIST may wish to charge in the future. This is unusual for US government works but NIST has a special dispensation to recover costs.
ChEBI is the most semantic Open resource, and has been assembled by humans, but there are many common chemicals not in it.
The various aggregators have very large numbers of molecules and therefore do not define a useful starting point for “common chemicals”.
All of these are potentially useful for enhancing information once it has been found.
The other major resource on the Web is MSDS (Material Safety Data Sheets). These collections are freely accessible but probably not Open. However they form a useful starting point. The two main ones are the Inchem site and the collection hosted on the Oxford University Physical and Theoretical Chemistry server (Chemical and Other Safety Information from the Physical and Theoretical Chemistry Laboratory). Each has somewhere between 1000 and 2000 unique chemical compounds. Manufacturers are obliged to create an MSDS for their products, and we can expect them to be accurate because they carry some legal force.
How can we check the “correctness” of the information on web pages? In general we can’t. All we can do is compare information and note where it agrees or disagrees. To go further we need to know who the “authority” is. We trust some authorities more than others for a whole variety of reasons. But in general there is no “right” or “wrong”; there are assertions made more or less strongly by authorities to which we give variable weights.
A good example is Wikipedia. I trust many of the articles in physical science to a high degree. I may modulate that with the age of the article and the number of different collaborators. This relies on the “wisdom of crowds”, but I think it works well in chemistry. Chemspider has harnessed the wisdom of crowds, but I suspect that only a very small fraction of their entries have been human-curated and I give an example below which seems to need attention.
I trust the suppliers of MSDS sheets … but read on.
In general the statement that “the formula of aspirin is C9H8O4” is unverifiable. I can, however, assert that “Wikipedia asserts that the formula of aspirin is C9H8O4” is true. (To be picky I should give the date of this assertion.) Chemspider makes the same assertion. So does Oxford PTCL. And so do all the other ones I have checked. But there are lots of them, and for this we need robots. And, for less common compounds, as we’ll see, it doesn’t work out as well…
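
The robot’s data model follows directly: record who-asserts-what, rather than bare “facts”, and then group sources by the value they assert. A minimal sketch (the source list is illustrative):

```python
# Each row is (source, compound, property, asserted value).

from collections import defaultdict

assertions = [
    ("Wikipedia",   "aspirin", "formula", "C9H8O4"),
    ("Chemspider",  "aspirin", "formula", "C9H8O4"),
    ("Oxford PTCL", "aspirin", "formula", "C9H8O4"),
]

def consensus(assertions, compound, prop):
    """Group sources by the value they assert; full agreement = one group."""
    by_value = defaultdict(list)
    for source, c, p, value in assertions:
        if c == compound and p == prop:
            by_value[value].append(source)
    return dict(by_value)

print(consensus(assertions, "aspirin", "formula"))
# {'C9H8O4': ['Wikipedia', 'Chemspider', 'Oxford PTCL']}
```

Disagreement simply shows up as more than one key, with the dissenting sources listed against their value, which is exactly what a human curator needs to see.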
How can a robot identify a chemical on a web site? It’s got the following choices:

  • Common Names. This is how Wikipedia organizes the top-level access to chemicals. But, as we know, most common chemicals have tens or hundreds of synonyms and some of these synonyms refer to more than one compound.
  • Systematic name. This can be useful, and it’s what we use in OSCAR. But it’s hard work parsing the totality of chemical names as there are many dialects, sub-grammars, etc. There are no good metrics for this – I have heard reported values of ca. 60% for name recognition for the commercial packages (our OPSIN does reasonably well for simple compounds but needs a lot more work – it’s an area where volunteer contributions might scale).
  • (Brutto) formula – e.g. C4H10O. This does not normally identify a compound completely but is a useful constraint – two compounds with different formulae can be held to be non-equivalent.
  • Molecular mass. This is often reported and can usually be calculated from the brutto formula. Again it can be used as a constraint to assert non-equivalence.
  • Connection tables (also serialized as SMILES and InChI). These work well for organic compounds, poorly for inorganic ones. But there can be different levels of precision (hydrogens, stereochemistry, etc.). Identical connection tables (after canonicalisation) can be held to show equivalence, but some compounds have several connection tables (e.g. glucose).
  • Identifiers. Potentially identifiers are the easiest and most powerful tool. An identifier is a unique string associated by an authority with a substance (not necessarily pure). If an authority (X) asserts that substance A(X) and substance B(X) have the same identifier then they can be said to be equivalent. There are many authorities making such assertions. Ultimately it is only the authority (X) who can make assertions about its identifiers. To be widely useful the authority should provide a lookup (resolution) service which is both human- and machine-accessible. In practice many authorities don’t do this or provide only a toll-access service. The identifiers are also often copyrighted and may or may not be copied. This often leads to other authorities (Y) who copy identifiers without permission and make their own assertions, which may or may not be compatible with those of the authority (X). Frequently, also, the source of the identifier is not given. Thus many people who submit information to Pubchem give identifiers and these are listed as “[RN]” = registry number. For aspirin, for example, there seem to be many identifiers – in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]
  • Physical properties. It is generally assumed that for any pure compound many of the physical properties are invariant. (This is not true if it has solid polymorphs or similar metastable states, but it’s a very useful guide for non-equivalence.)
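
The brutto-formula and molecular-mass checks above are easy to mechanise. A sketch (atomic masses are rounded; a real implementation would use the full IUPAC table):

```python
# Using the brutto formula as a non-equivalence constraint.

import re

ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}

def parse_brutto(formula):
    """'C9H8O4' -> {'C': 9, 'H': 8, 'O': 4}."""
    counts = {}
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(n) if n else 1)
    return counts

def molecular_mass(formula):
    return sum(ATOMIC_MASS[el] * n for el, n in parse_brutto(formula).items())

def definitely_different(formula_a, formula_b):
    """Different brutto formulae => not the same compound (the converse fails)."""
    return parse_brutto(formula_a) != parse_brutto(formula_b)

# Aspirin vs butanol: different formulae, so definitely different compounds.
assert definitely_different("C9H8O4", "C4H10O")
# Ethanol and dimethyl ether share C2H6O: this check cannot separate them.
assert not definitely_different("C2H6O", "C2H6O")
```

Note the asymmetry, which mirrors the text: the check can only ever prove non-equivalence, never equivalence.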

In the next post I’ll show how we get on with some typical exploration. It may show the scale of the problem we face in reconciling current chemical information.

Posted in semanticWeb, Uncategorized | Leave a comment

More PDF hamburger

Chris Rusbridge (Director of the Digital Curation Centre) has added another thoughtful comment which has helped me clarify my ideas.

Chris Rusbridge Says:
April 30th, 2008 at 5:00 pm e
Well Peter, thanks for amplifying your answer, but I still think you miss the point. Despite your “First let me dispose of…” comment above, much of your post says that many PDF files are badly structured. I absolutely agree, but it wasn’t the point I was trying to make (and anyway it’s true of most other formats; I’m staggered how few people use Word styles properly, for instance, yet without them many simple transformations become impossible).

PMR: I agree we needn’t follow this:

What I’m asking is:
IF we have good authoring tools (preferably able to tag scientific information semantically)
AND we have a good publishing workflow that preserves and perhaps standardises that semantic information
AND we have good tools for converting the in-process document to PDF, XHTML etc
THEN COULD the PDF contain semantic information, in the form of metadata, tags, RDFa, microformats, etc?

PMR: Potentially yes, in practice I think we have missed the opportunity and it won’t happen. But there may be technical and social advances that will prove me wrong. The baseline is that we need structured documents which can act as a compound document and to which semantics can be added. And I agree that we can separate the creation of semantic information from the final object (at least in most cases). And that in many cases it isn’t easy to create the semantic information. And it is conceivable that this could be done with PDF – it is a compound document format and it can hold metadata (DC, MPEG, etc.). But it’s neither the best, nor is it widely used for this. So it would require a great deal of effort to satisfy me on these two points.
In SPECTRaT we looked at this and had heated discussions about whether we could make PDF/A work as a container document. Henry Rzepa has made several attempts to use PDF as a container for chemistry. I don’t think it’s going to fly. And we came to the strong conclusion that if we wanted to involve machines in helping us get useful information out of theses then PDF – in its varieties – doesn’t help.

(Or conversely, if we don’t use decent authoring tools, don’t care about encoding the semantic information in the first place, don’t care about document structure, use cobbled-together publishing systems ignoring standard DTDs, what does it matter if we use manky Russian PDF converters since there’s no semantic information there anyway!)

PMR: That’s a pragmatic approach. It keeps scientific information in the C20.

… and if PDF cannot currently contain semantic information at a fine enough level, what would need to be added to make it possible? But there’s no point going down this route (try to get PDF better for science) if the earlier parts of the workflow make sustaining semantic information too hard.

PMR: As you know I favour XML documents. These have many advantages over PDF. More below.

BTW, from the PDF/A standard (ISO 19005-1): “The future use of, and access to, these objects depends upon maintaining their visual appearance as well as their higher-order properties, such as the logical organization of pages, sections, and paragraphs, machine recoverable text stream in natural reading order, and a variety of administrative, preservation and descriptive metadata.” So a conforming PDF/A file should at least be able to recover text in the natural order… I do suspect PDF metadata is at the wrong level of granularity, although PDF reference says it can apply to various levels of object, but I don’t know…

The fundamental problem is that PDF is a graphically oriented language. Scholarly work is a small part of what it is used for. It’s perfectly OK to have text reading vertically, in huge or tiny fonts, with arbitrary graphics strokes and primitives. It has no sense of “content-versus-presentation” that is the key design principle in all of XML.
It is extremely difficult to read the average PDF. Firstly it’s often encrypted or compressed, and you cannot read it without bespoke software. In contrast XML documents are designed to be readable by humans without special software. And I often do read them. It’s not fun but it’s perfectly possible to read MathML or CML in a normal text editor.
There are virtually no useful freely available tools for PDF, while there are zillions for XML. Every computer on the planet now has an XML-DOM and SAX engine (I take some credit for the latter). In PDF there are heroes such as Ben Litchfield who has created PDFBox, and I’ve worked a lot with it, but in general the problem of reading PDF into a machine is horrendous. For example: “where does one paragraph end and the next one start?”. This is impossible to determine in PDF. There is no concept of paragraph. There isn’t even a concept of word – there are heuristics saying that if one character is sufficiently close to another on the screen then it’s probably part of a word. While I accept that there may – somewhere – be expensive PDF tools that can add information on where the words and paragraphs are in a document, and expensive tools that can decode this, there is nothing in general use. In contrast every teenage web-hacker can use HTML to define where the words start and end and where the paragraphs are.
Maybe some people don’t mind not being able to determine word boundaries, paragraphs, tables, graphics, lists, etc. But scientists have to. XHTML solves all these problems. It’s not perfect but with microformats it can be made into a very sophisticated approach. It’s universal, free, and fit-for-purpose.
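
To make the contrast concrete, here is what “trivial to determine” looks like for XHTML: paragraph and word boundaries are explicit in the markup, so a few lines of standard-library code recover them with no glyph-position heuristics. (The example document is invented.)

```python
# Extract paragraphs from an XHTML fragment using only the Python stdlib.

from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p":
            # Normalise whitespace; word boundaries are just spaces.
            self.paragraphs.append(" ".join("".join(self._buf).split()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

doc = "<html><body><p>The formula of aspirin is C9H8O4.</p><p>Second paragraph.</p></body></html>"
p = ParagraphExtractor()
p.feed(doc)
print(p.paragraphs)  # ['The formula of aspirin is C9H8O4.', 'Second paragraph.']
```

The equivalent task against a typical PDF requires layout heuristics and third-party libraries, which is precisely the asymmetry the post describes.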

Posted in Uncategorized | Leave a comment