cyberscience: Labels and licences

Bill Hooker has supported my suggestion of labels (We must have licences for publications) for describing the re-usability of publications. I will use “labels” rather than “licences” at present, as it allows us to describe practice rather than mandate it.

  1. Bill Says:

    I shall adopt the same naming method, making licenses explicit when talking about specific journals, publishers or whatever. I think this will go a long way towards alleviating confusion/term dilution, and also towards fixing in the consumers’ (researchers’) minds that OA must come with a licence in order to be useful.
    I suggest one other way forward: pay careful attention to mandates as they are established. If they are worded clearly, much publisher weaseling can be pre-empted; if the publishing lobby and their pet pit-bulls get their way, the mandates we end up with will be full of loopholes.

I have an absolute need to know precisely what the status of a paper is – effectively, can my robots download it without the publisher cutting off all supply of journals to the University of Cambridge? This has happened twice already, even though we did nothing wrong, and I don’t particularly want it to happen again. So it would be valuable to know what type of information I was likely to find on publishers’ (and institutional – I don’t want to be banned from University repositories either) sites. So I suggested a licencing system, and I’m happy to start with the idea of labels (Bill and I and other Blue Obelisk members are starting to do this, although it’s more complicated than it looks).
Most OA licences are permissive – you are allowed to do X. Most non-OA licences are primarily restrictive – you (author) are not allowed to do X; you (reader) are not allowed to do X, Y, Z. So I am not sure yet what the CA (closed access) labels will look like until we have surveyed a few. We’ll probably start with “CA”, meaning “assume you can’t do anything”, and extend it to “CA-PW” (can post on personal web site but may not advertise where this is) and “CA-ER” (limited number of electronic reprints – a gorgeously fatuous idea – you give a publisher URL to your friends and they can read (but not redistribute) a copy of the paper there; every read decrements a counter). I am sure there are other splendid licences out there – we should run a little competition for the most restrictive licence.
The simple point is that any publisher with a restrictive licence is actively crippling cyberscience. We are paying commercial organisations to stop us doing the next generation of science. It’s even more fatuous than the European Common Agricultural Policy where farmers are…
… I need to do some real work – coding – to relax from this.

Posted in open issues, semanticWeb | Leave a comment

Open Access metrics can be simple and fun

Here’s a simple idea for showing how Open a given field of endeavour is (thanks to Peter Suber Measuring the OA Quotient of a research topic):

Matt Cockerill, How open is your research area? BMC blog, July 22, 2007. Excerpt:

Using PubMed’s “Limits” tab, it is easy to filter searches by date of publication, and also by whether an article has a link to an online full text, and whether that online full text is freely available….
One handy side effect of this is that it is possible to search PubMed for articles in the last 60 days, and to calculate an Open Access Quotient to quantify just how open a particular research field is – i.e. what fraction of the research in that area is available with open access immediately following publication.
[Open Access Quotient = (PubMed results with open access fulltext links for last 60 days) / (PubMed results with fulltext links for last 60 days)]
The OAQ for PubMed as a whole currently stands at 6.8%, but this overall figure conceals major variation between fields.
[Malaria 19.8%, microarray 16.9%, genomic 12.9%, influenza 12.3%, AIDS 11.3%, cancer 7.2%, cardiovascular 5.0%, clinical trial 4.0%, PubMed average 6.8% …]
Is there a research area with a higher Open Access Quotient than malaria? Why not help us find out?
We’ll send an “I’m Open” BioMed Central T-shirt to whoever can identify the biomedical field with the highest Open Access Quotient (and we’d also be interested to know what fields seem to have the lowest).
To qualify, a PubMed Search should be based on conceptual keywords (not author or journal names) and should return at least 100 articles which have online fulltexts published in the last 60 days. Send your findings to blog@biomedcentral.com

PS Comment. For topics covered by PubMed, the OAQ is a great idea. I’ve been hoping for such a measurement for all topics since 2002, but it’s impractical (so far) for fields where there is no PubMed or equivalent. By all means, however, let’s start with PubMed and measure what is measurable.

PMR: We can use this in reverse for the game of “The Most Closed Discipline on the Planet”. It’s fairly easy to use PubMed – I have already shown (Is Natural Product Synthesis Interesting?) how we can measure the interest in a discipline by how many hits it gets in PubMed. (The answer to the above question – “is anyone other than chemists interested in what the chemists are trying to make?” – is generally “no”.) So I expect Natural Product Synthesis to be a candidate, since no-one other than chemists reads it and so there is no reason for it to be Open. Just point a robot at PubMed with the words “total synthesis” and see how many Open articles you get. My prediction is < 0.1%.
More seriously, metrics and labels are critical. We need to be able to tell at a glance what the OA status of a paper is. Linking to the full text is second best – it won't find many of the "green" OA papers.
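The OAQ arithmetic itself is trivial to automate. A minimal sketch in Python (the counts and the `pubmed_term` helper are illustrative assumptions, not BMC's actual code; the `[sb]` search-subset filters are the ones the excerpt describes):

```python
def oaq(open_fulltext_count, fulltext_count):
    """Open Access Quotient: fraction of recent full-text papers that are free to read."""
    return open_fulltext_count / fulltext_count

def pubmed_term(topic, free_only=False):
    # PubMed's search-subset filters: "free full text[sb]" selects papers whose
    # full text is freely readable; "full text[sb]" selects all with online full text.
    # The 60-day window would be applied separately (e.g. E-utilities reldate=60).
    subset = "free full text[sb]" if free_only else "full text[sb]"
    return f"({topic}) AND {subset}"

# Illustrative counts only (not real PubMed results):
print(f"{oaq(198, 1000):.1%}")   # → 19.8%
```

Running the free-only and full-text queries for the same topic and date window, then dividing the two counts, reproduces the OAQ as defined above.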

Posted in chemistry, open issues | Leave a comment

We must have licences for publications

I have written several times over the last few weeks about how important it is to clarify and protect the re-use of scientific data in publications (Open Data) and have, on several occasions, argued that the primary means that we currently have is the label “Open Access”. “Open Access” is precisely defined in three declarations (BBB) (see Open Access in WP) and I have consistently argued that this should be the touchstone of whether something is Open Access and how it can be used.
But the term has been muddied by sloppy usage and in many cases seems to mean only “free to read”. As I have pointed out, the lack of clarity is extremely serious for data-rich and data-driven sciences, where little enough data is published anyway and where it is essential that we have access to, and the right to re-use, data.
I do not wish to become some logic-chopper or fundamentalist who monotonously intones “BBB” while the rest of the world gets on and develops the new information model. The problem is that without this emphasis we are killing data-driven science. We are walking into a world where the role of publishers is to stop people reading things and certainly to stop them using scientific data – this is already happening.
Bill Hooker has summed this up:


BH: Peter Suber commented on the last entry to clarify his position on the varying uses of the term “Open Access”:

PS: For me, OA in the strict sense removes both price barriers and permission barriers; all the major public definitions say so; and I’m only too glad to repeat this whenever it comes up. However, as a matter of word usage, the term now covers more territory than this and I’ve stopped fighting that fact. That is, the term is often used for content that is merely free-to-read.

BH: Peter goes into more detail in a recent entry on his blog:

…many projects which remove price barriers alone, and not permission barriers, now call themselves OA. I often call them OA myself. This is only to say that the common use of the term has moved beyond the strict definitions. But this is not always regrettable. For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating. By going beyond [I would say “outside” — BH] the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship. This is useful, for we often want to refer to that whole category, not just to the upper end. When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge. But we don’t always need this extra precision.

In other words: Yes, most of us are now using the term “OA” in at least two ways, one strict and one loose, and yes, this can be confusing. But first, this is the case with most technical terms (compare “evolution” and “momentum”). Second, when it’s confusing, there are ways to speak more precisely. Third, it would be at least as confusing to speak with this extra level of precision – distinguishing different ways of removing permission barriers from content that was already free of charge – in every context. […]

BH: and in the Sept 2004 edition of the SPARC OA Newsletter:

One danger is the dilution of our term. That’s why [this newsletter discusses] the BBB definition and its place in our history. But another danger is the false sharpening of our term. If we thought that the BBB definition settled matters that it doesn’t settle, then we could prematurely close avenues of useful exploration, needlessly shrink the big tent of OA, and divisively instigate quarreling about who is providing “true OA” and who isn’t.

The BBB definition functions as a usefully firm definition of “open access” even if it leaves room for variation. We should agree that OA removes some permission barriers (e.g. on copying, redistribution, and printing) even if it leaves different OA providers free to adopt different policies on others (e.g. on derivative works and commercial re-use). My personal preference, for example, is to permit derivative works and commercial re-use. But (as I wrote in FOSN for 1/30/02) I want to make this preference genial, or compatible with the opposite preference, so that we can recruit and retain authors on both sides of this question.

BH: I’ve omitted a lot of good information to save space here; anyone interested in this issue should read all of the linked discussions. In particular, the SPARC newsletter goes into useful specifics about the OA-related activities of a number of publishers.

BH: Peters Suber and Murray-Rust have both pointed out that one way to be specific about “levels” of openness is to be explicit about licensing — PMR:

If the community wishes to continue to use “open access” to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:

  • “open access (CC-BY)” – explicitly carries CC-BY license
  • “open access (BOAI)” – author/site wishes to assert BOAI-nature of document(s) without specific license
  • “open access (FUZZY)” – fuzzy licence (or more commonly absence of licence) for document or site without any guarantee of anything other than human visibility at current time. Note that “Green” open access falls into this category. It might even be that we replace the word FUZZY by GREEN, though the first is more descriptive.

BH: I take Peter S to be saying that it’s inevitable that “Open Access” will come to mean, in general use, more things to more people than strict BOAI, and we will not achieve anything by making arseholes of ourselves over it. (Even if that’s not quite the way Peter S would put it, that’s the way I’ve come to look at the situation.) There’s no point in picking quarrels we don’t have to have. It’s enough to be more careful in our own usage, for which purposes suffixes à la Peter MR should prove very useful when we need extra precision. I don’t think we need invent terms (“fuzzy”) just yet — “OA (specific licence, with hyperlink if writing online)” and “OA (free to read)” should cover most cases.

BH: If we can get to the point where the average consumer — basically, any researcher — responds to an OA claim or label by asking “which licence?”, we will have done an end-run around the problem of term dilution.

Well, I’m a pragmatist – I don’t really want to end up pursuing a Stallman-like insistence on a particular model. Tempora mutantur (and, I hope) nos mutamur in illis. But we are in grave danger of something worse.
I hesitate to spell out what will happen if we let the greediest publishers (and we have to accept that a large motivation of the commercial publishers is simple greed) have their way. By pointing out a future which is to their benefit and detrimental to us, we give them a new potential business model. And the less public support we get for protecting data, the more the publishers will steal it.
So if we accept the sort of model that is being brokered by HHMI we get the following:

  • journals are by default completely closed. Even if they now permit self-archiving (Green access), why should they continue? They get more money by closing the journal and then asking the author’s funder to pay for non-subscribers to read it. So the HHMI deal may cause more journals to close.
  • By default, Gold Access allowed the authors to retain copyright and the licence – if any – to permit any conceivable re-use. It’s now clear from what PeterS has said that “Open Access” covers everything from less-than-green to full CC-BY or PD. A publisher – such as MDPI/Molecules – can claim that their papers are full Open Access even though they are copyrighted by the publisher and forbid commercial re-use. Have I lost this one? Am I making myself stupid here? Should I rightly accept the criticism of a Nature editor that I am unacceptably rude in suggesting that “open access” cannot be redefined by publishers as they wish?
  • It will lead to a model where publishers start charging additionally for the data. It’s pretty close to that already, but I had thought that the Wellcome model of paying for complete Open Access also helped to clarify access to and re-use of data. If funders blur this by offering OA models which depend on the funder, and where everyone has to spend hours reading the small print, then we have lost.

So is there a positive way forward? (I already feel like a traitor for having suggested that there may be a publishing model where publishers ransom our scientific data and sell it back to us.) But I am afraid the publishers are no longer simplistic and they will have worked this out. Several (Elsevier, Wiley) have huge scientific databases which they will wish to populate with the primary data in publications.
Yes. It comes from those publishers which are proactive in promoting Open Access. I don’t know the situation outside chemistry and some bioscience, but there are some hybrid publishers, such as the Int. Union of Crystallography, which seems to me to have the most active hybrid program in the chemistry arena. But I am not in favour of the fuzzy hybrids who do not allow re-use of data. I do not know where Nature stands – yes, they have some interesting experiments and a single Open journal. And I don’t think they wish to own supplemental data. But they forbid full text-mining (else they wouldn’t have created OTMI, where all the text appears in jumbled form).
So it seems clear that we have to have licences. I shall take the following position:

  • Any publisher or author who exposes a CC-BY or Open Knowledge Foundation licence I shall call “OA-BY”. This permits full data re-use.
  • Any publisher or author who exposes a CC-NC or CC-ND or similar I shall call OA-NC or OA-ND. This does not permit full data re-use but does permit some. We may have to kludge some of the worst “conditions” like “you may post this on your web site but not in your institutional repository”
  • Any publisher or author who posts a paper that I can read I shall call OA-FREE.
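Purely as a sketch of how such labels could be applied mechanically (the label names are the ones above; the `oa_label` helper and the licence-string matching are my assumptions, not an existing tool):

```python
def oa_label(licence):
    """Map a licence identifier to one of the OA labels above (sketch)."""
    l = (licence or "").upper().replace(" ", "-")
    if l in ("CC-BY", "OKF", "PD", "CC0"):
        return "OA-BY"      # full data re-use permitted
    if "CC-BY-NC" in l or l.endswith("-NC"):
        return "OA-NC"      # some re-use, but no commercial re-use
    if "CC-BY-ND" in l or l.endswith("-ND"):
        return "OA-ND"      # some re-use, but no derivative works
    return "OA-FREE"        # readable, no explicit re-use rights
```

A robot crawling publisher sites could attach such a label to every paper it finds, so the status is visible at a glance.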

I also issue a warning to the funders, librarians and provosts: unless they take re-use of data seriously and fight for it, they – and we – will lose, and quite quickly.

Posted in data, open issues | 5 Comments

cyberscience: Why 100% is never achievable

In the current series of posts I argue that data should be Open and re-usable and that we should be allowed to use robots to extract it from publishers’ websites. A common counter argument is that data should be aggregated by secondary publishers who add human criticism to the data, improve it, and so should be allowed to resell it. We’ll look at that later. But we also often see a corollary expressed as the syllogism:

  • all raw data sets contain some errors
  • any errors in a data set render it worthless
  • therefore all raw datasets are worthless and are only useful if cleaned by humans

We would all agree that human data aggregation and cleaning is expensive and requires a coherent business model. However, I have argued that the quality of data-collection technology and of the added metadata can now be very high, so that “raw” data sets such as CrystalEye can be of very high quality; and, if we can assess the quality automatically, we can make useful decisions as to what purposes a set is fit for.
In bioscience it is well appreciated that no data is perfect, and that everything has to be assessed in the context of current understanding. There is no “sequence of the human genome” – there are frequent releases which reflect advances in technology, understanding and curation. The Ensembl genome system is on version 45. We are going to have to embrace the idea that whenever we use a data set we need to work out “how good it is”.
That is a major aspect of the work that our group does here. We collect data in three ways:

  • text-mining
  • data-extraction
  • calculation and simulation

None of these is perfect, but the more access we have to them and their metadata, the better off we are.
There are a number of ways of assessing the quality of a dataset.

  • using two independent methods to determine a quantity. If they disagree then at least one of them has errors. If they agree within estimated error then we can hypothesize that the methods, which may be experimental or theoretical, are not in error. We try to disprove this hypothesis by devising better experiments or using new methods.
  • relying on expert humans to pass judgement. This is good but expensive and, unfortunately, almost always requires cost-recovery that means the data are not re-usable. Thus the NIST Chemistry WebBook is free-to-read for individual compounds but the data set is not Open. (Note that while most US government works are free of copyright, there are special dispensations for NIST and similar agencies to allow cost-recovery.)
  • relying on the user community and blogosphere to pass judgement. This is, of course, a relatively new approach but also very powerful. Every time someone accesses a data item and reports an error, the data set can potentially be enhanced. Note that this does not mean that the data are necessarily edited, but that an annotation is added to the set. This leads to communal curation of data – fairly common in bioscience, virtually unknown in chemistry since – until now – almost all data sets were commercially owned, and few people will put effort into annotating something that they do not – in some sense – own. The annotation model will soon be made available on CrystalEye.
  • validating the protocol that is used to create or assess the data set. This is particularly important for large sets where there is no way of predicting the quantities. There are 1 million new compounds published per year, each with ca 50 or more data points – i.e. 50 megadata. This is too much for humans to check so we have to validate the protocol that extracts the data from the literature.
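The first approach – checking whether two independent determinations agree within estimated error – reduces to a one-line numeric test. A sketch, assuming uncertainties combine in quadrature and using a coverage factor of 2 (both assumptions of mine, not values from the post):

```python
import math

def agree(x1, u1, x2, u2, k=2.0):
    """Do two independent measurements agree within their combined estimated error?

    x1, x2: measured values; u1, u2: estimated standard uncertainties.
    k is a coverage factor (k=2 corresponds roughly to a 95% interval).
    """
    return abs(x1 - x2) <= k * math.hypot(u1, u2)
```

If `agree` returns False, at least one method has errors; if True, we hypothesize both are sound and try to disprove that with better experiments.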

The first three approaches are self-explanatory, but the fourth needs comment. How do we validate a protocol? In medicine and information retrieval (IR) it is common to create a “gold standard”. Here is WP on the medical usage:

In medicine, a gold standard test is a diagnostic test or benchmark that is regarded as definitive. This can refer to diagnosing a disease process, or the criteria by which scientific evidence is evaluated. For example, in resuscitation research, the gold standard test of a medication or procedure is whether or not it leads to an increase in the number of neurologically intact survivors that walk out of the hospital.[1] Other types of medical research might regard a significant decrease in 30 day mortality as the gold standard. The AMA Style Guide prefers the phrase Criterion Standard instead of Gold Standard, and many medical journals now mandate this usage in their instructions for contributors. For instance, Archives of Physical Medicine and Rehabilitation specifies this usage.[1].
A hypothetical ideal gold standard test has a sensitivity of 100% (it identifies all individuals with a disease process; it does not have any false-negative results) and a specificity of 100% (it does not falsely identify someone with a condition that does not have the condition; it does not have any false-positive results). In practice, there are no ideal gold standard tests.
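The quoted definitions translate directly into code. A quick sketch (the function names and the counts in the examples are mine, invented for illustration):

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real cases the test identifies (100% = no false negatives)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-cases the test correctly rejects (100% = no false positives)."""
    return true_neg / (true_neg + false_pos)
```

A hypothetical test that catches 90 of 100 real cases has 90% sensitivity; one that wrongly flags 5 of 100 healthy subjects has 95% specificity.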

In IR the terms recall and precision are normally used. Again it is normally impossible to get 100%, so values of 80% are common. Let’s see a real example from OSCAR3 operating on an Open Access thesis from St Andrews University [1]. OSCAR3 is identifying (“retrieving”) the chemicals (CM) in the text, and OSCAR’s results are underlined; at this stage we ignore whether OSCAR actually knows what the compounds are.

[OSCAR’s annotations, underlined in the original interactive markup, are marked here with underscores; CM = chemical entity, ONT = ontology term (e.g. “fuel”, “emission”).]

In 1923, J.B.S. _Haldane_ introduced the concept of renewable _hydrogen_. _Haldane_ stated that if _hydrogen_ derived from wind power via electrolysis were liquefied and stored it would be the ideal _fuel_ of the future. [5] The depletion of _fossil fuel_ resources and the need to reduce climate-affecting emissions (also known as green house gases) has driven the search for alternative energy sources. _Hydrogen_ is a leading candidate as an alternative to _hydrocarbon_ fossil fuels. _Hydrogen_ can have the advantages of renewable production from _carbon_-free sources, which result in _emission_ levels far below existing _emission_ standards. _Hydrogen_ can be derived from a diverse range of sources offering a variety of production methods best suited to a particular area or situation.[6] _Hydrogen_ has long been used as a _fuel_.
The gas supply in the early part of the 20th century consisted almost entirely of a coal gas comprised of more than 50% _hydrogen_, along with _methane_, _carbon monoxide_, and _carbon dioxide_, known as “town gas”.

Before we can evaluate how good this is, we have to agree on what the “right” result is. Peter Corbett and Colin Batchelor spent much effort on devising a set of rules that tell us what should be regarded as a CM. Thus, for example, “coal gas” and “hydrocarbon” are not regarded as CMs in their guidelines but “hydrogen” is. In this passage OSCAR has found 12 phrases (mainly words) which it thinks are CMs. Of these 11 are correct but one (“In”) is a false positive (OSCAR thinks it is the element Indium). OSCAR has not missed any, so we have:
True positives = 11
False positives = 1
False negatives = 0
So we have a recall of 100% (we got all the CMs) with a precision of 11/12 ≈ 92%. It is, of course, easy to get 100% recall by marking everything as a CM, so it is essential to report the precision as well. The average of these quantities (more strictly, the harmonic mean) is often called the F or F1 score.
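The arithmetic above, plus the F score, takes only a few lines of Python (the counts are the ones from the OSCAR example; this is just the standard definition, not OSCAR's own evaluation code):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

# First OSCAR example: 11 true positives, 1 false positive ("In"), 0 missed
print(f"recall {recall(11, 0):.0%}, precision {precision(11, 1):.0%}, F1 {f1(11, 1, 0):.2f}")
# → recall 100%, precision 92%, F1 0.96
```

Note how the F score penalizes the trick of marking everything as a CM: recall would be 100% but precision (and hence F1) would collapse.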
In this example OSCAR scored well because the compounds are all simple and common. But if we have unusual chemical names OSCAR may miss them. Here’s an example from another thesis [2]:

[OSCAR’s annotations, underlined in the original interactive markup, are marked here with underscores.]

and _D-tryptophan_ and some _tryptophan_ and _indole_ derivatives (1) and _D-tryptophan_ and _D-tryptophan_ and _D-tryptophan_ (1) (1) (1)
Halometabolite PrnA / _PrnC_
_Pyrrolnitrin_ (5) _Rebeccamycin_ (6) _Pyrrindomycin_ (7) _Thienodolin_ (8) _Pyrrolnitrin_ (9) _Pentachloropseudilin_ (10) Pyoluteorin (11) –

Here OSCAR has missed “Pyoluteorin” (see PubChem for the structure), and we have 12 true positives and 1 false negative. So we have a recall of 12/13 ≈ 92% and a precision of 100%.
Peter Corbett measures OSCAR3 ceaselessly – it’s only as good as the metrics – so read his blog to find out where he’s got to. But I had to search quite hard to find false negatives. Performance is also dependent on the corpus used: the two examples are quite different sorts of chemistry and score differently.
Unfortunately this type of careful study is uncommon in chemistry. Much of the software and information is commercial and closed. So you will hear vendors tell you how good their text-mining software or descriptors or machine-learning are. And there are hundreds of papers in the literature claiming wonderful results. Try asking them what their gold standard is, and what the balance is between precision and recall. If they look perplexed or shifty, don’t buy it.
So why is this important for cyberscience? Because we are going to use very large amounts of data and we need to know how good it is. That can normally only be done by using robots. In some cases these robots need a protocol that has been thoroughly tested on a small set – the gold standard – and then we can infer something about the quality of the rest of the data. Alternatively we develop tools for analysing the spread of the data, its consistency with known values, etc. It’s hard work, but a necessary approach for cyberscience. And we shall find, if the publishers let us have access to the data, that everyone benefits from the added critical analysis that the robots bring.
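Inferring the quality of a whole extracted data set from a protocol tested on a small gold standard is essentially interval estimation on a proportion. A sketch using the Wilson score interval (the choice of interval and the 90-correct-out-of-100 sample are my assumptions for illustration, not a method from the post):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% confidence)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# If the extraction protocol is correct on 90 of 100 gold-standard items,
# the accuracy on the full corpus is plausibly in this range:
lo, hi = wilson_interval(90, 100)
print(f"{lo:.3f}–{hi:.3f}")   # → 0.826–0.945
```

The interval narrows as the gold standard grows, which quantifies how much hand-checked data is needed before trusting the robot on the remaining millions of items.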


[1] I can’t find this by searching Google – so repository managers, make sure your theses are well indexed by search engines. Anyway, Kelcey, many thanks for making your thesis Open Access.
[2] “PyrH and PrnB Crystal Structures”, School of Chemistry and Centre for Biomolecular Sciences, University of St Andrews, Walter De Laurentis, December 2006, Supervisor – Prof. J H Naismith

Posted in cyberscience, data | 1 Comment

cyberscience: CrystalEye at WWMM Cambridge

We’ve mentioned CrystalEye frequently on this blog but not announced it formally. We were about to post it about three weeks ago but had a serious server crash. Also we are very concerned about quality and want to make sure there are as few bugs as possible (there will always be bugs, of course). But now here is a formal announcement. The work and the site have all been done by Nick Day (ned24 AT our email address). I am only blogging this because Nick is too busy using the data for research and hoping to get enough results by the end of the academic year.
We believe that this is one of the first resources in any discipline created and maintained by robotic extraction of data from primary publications and re-published for Open re-use. Every day a spider visits the sites of publishers who expose crystallographic data, downloads it, validates it, etc. Here’s a schematic diagram:
[Figure: spider.png – schematic of the daily spidering workflow]
This post stresses the right-hand side – the systematic scraping of data from publisher sites; the left-hand side shows we can also do it for theses (the SPECTRa-T project). That’s still in its infancy, but at least it’s within academia’s control (as long as they don’t give away the right to control theses).
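The daily scraping step boils down to finding the crystallographic (CIF) files a publisher exposes and fetching them. A minimal sketch of the link-harvesting part, using only the Python standard library (the real CrystalEye spider is far more elaborate; the `CifLinkFinder` name and the `.cif`-extension heuristic are my assumptions):

```python
from html.parser import HTMLParser

class CifLinkFinder(HTMLParser):
    """Collect hrefs on a publisher page that look like crystallographic data files."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".cif"):
                self.links.append(href)

# Hypothetical snippet of a publisher's supplementary-data page:
finder = CifLinkFinder()
finder.feed('<a href="paper1.html">paper</a> <a href="data/b501234a.cif">CIF</a>')
print(finder.links)   # → ['data/b501234a.cif']
```

Each harvested link would then be downloaded and passed into the validation workflow below.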
CrystalEye adds a LOT of validation. The central part above expands to a detailed workflow:
[Figure: markupa.PNG – the detailed markup and validation workflow]
We are generating individual 3D molecules, 2D diagrams, InChIs and HTML – and none of this is available in its raw form on the publisher site. Remember that we are simply taking the raw data from the author – untouched by the publisher. So we are adding a great deal of value without infringing any rights.
There’s a lot more to say about CrystalEye. So read the FAQ. But we’ll be back and explain how it changes the face of secondary publishing. We simply don’t need humans retyping the literature.
Now if you are a publisher (or editor employed by a publishing house – I have to watch my terminology) this post will probably cause one of the following reactions:

  • This is really exciting. How can we help the development of cyberscience? We could benefit from a whole new market.
  • This is an appalling threat. The scientists are stealing our data. How can we stop them?

If you are not in the first category you are in the second. “This is boring”, “data isn’t important, only full-text matters”, “we can’t afford to be involved”, “let’s wait for 10 years and see what happens”. So I am gently inviting the publishers to tell me whether they will help me or prevent me from using their data for cyberscience.

Posted in cyberscience, data, open issues | Leave a comment

cyberscience: Where does the data come from?

[In several previous and future posts I use the tag “cyberscience” – a portmanteau of E-Science (UK, Europe) and Cyberinfrastructure (US) which emphasizes the international aspect and importance of the discipline.]
Cyberscience is a vision for this century:

The term describes the new research environments that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet. In scientific usage, cyberinfrastructure is a technological solution to the problem of efficiently connecting data, computers, and people with the goal of enabling derivation of novel scientific theories and knowledge.

So we in Cambridge are attempting to derive novel scientific theories and knowledge. We may do much of this in public – on this blog and other media – and we welcome any help from readers.
Many cyberscientists will bemoan the lack of data. Often this is because science is difficult and expensive – if I want a crystal structure at 4K determined by neutron diffraction I have to send the sample to a nuclear reactor. If I want images of the IR spectrum of parts of the environment I need a satellite with modern detectors. If I want to find a large hadron I need a large hadron collider.
But we are getting a lot better at automating data collection. Almost all instruments are now digital and many disciplines have now put effort into devising metadata and markup procedures. The dramatic, but largely unsung, developments in detectors – sensitivity, size, response, rate, wavelength, cost – make many data trivially cheap. And there is a corresponding increase in quality – if all data are determined from a given detector then it is much easier to judge quality automatically. We’ll explore this in later posts.
So where does data come from? I suggest the following, but would be grateful for more ideas:

  • directly from a survey to gather data. This is common in astronomy, environment, particle physics, genomes, social science, economics, etc. Frequently the data are deposited in the hope that someone else will use them. It’s unknown in chemistry (and I would not be optimistic of getting funding). It sometimes happens from government agencies (e.g. NIST) but the results are often not open.
  • directly from the published literature. This is uncommon: (a) we haven’t agreed the metadata, (b) scientists don’t see the point, and (c) the journals don’t encourage or even permit it. It has, however, been possible in our CrystalEye project, which the next post will highlight. Note, of course, that this is an implicit and valuable way of validating published information.
  • retyped from the current literature. Unfortunately this is all too common. It has no advantages, other than that it is often legal to retype facts but not for robots to ingest them as above. It is slow, expensive, error-prone and almost always leads to closed data. It may be argued that the process is critical and thus adds value – in some cases this is true – but most of it is unnecessary: robots can often do a good job of critiquing data.
  • output of simulations and other calculations. We do a lot of this – over 1 million calculations on molecules. If you believe the results it’s a wonderful source of data. We’ve been trying to validate the calculations in CrystalEye.
  • mashups and meshups. The aggregation of data from any of the 4 sources above into a new work. The aggregation can bridge disciplines, help validation, etc. We are validating computation against crystallography and crystallography against computation. Both win.
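The two-way validation in the last bullet reduces, at its simplest, to flagging any quantity where the experimental and computed values disagree by more than a tolerance. The sketch below is a toy illustration with hypothetical labels and a hypothetical 2% threshold; a real comparison would weight by the experimental standard uncertainties.

```python
def flag_discrepancies(experimental, computed, rel_tol=0.02):
    """Compare experimental and computed values (e.g. bond lengths in
    angstroms) keyed by the same labels; return the labels whose relative
    deviation exceeds rel_tol, so either the data or the calculation
    gets a second look."""
    flags = []
    for label, obs in experimental.items():
        calc = computed.get(label)
        if calc is None:
            continue  # no computed counterpart to compare against
        if abs(calc - obs) / abs(obs) > rel_tol:
            flags.append(label)
    return flags
```

Run over two dictionaries of bond lengths, this flags only the bonds where theory and experiment genuinely part company – and it is agnostic about which side is wrong, which is exactly the point: both win.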

So given the power of cyberscience why is it still so hard to find data even when they exist? (I except the disciplines which are explicitly data-driven). Here are some ideas:

  • The scientists don’t realise how useful their data are. Hopefully public demonstration will overcome this.
  • The scientists do realise how useful their data are (and want to protect them). A natural emotion and one that repositories everywhere have to deal with. It’s basically the Prisoner’s dilemma, where individual benefit competes against communal good. In many disciplines the societies and the publishers have collaborated to require deposition of data regardless of screams from authors, and the community can see the benefit.
  • The data are published, but only in human-readable form (“hamburgers”). This is all too common, and we make gentle progress with publishers to try to persuade authors and readers of the value of semantic data.
  • The publishers actively resist the re-use of data because they have a knee-jerk reaction to making any content free. They don’t realise data are different (I am not, of course, asserting that I personally agree with copyrighting the non-data, but the argument is accepted by some others). They fear that unless they avidly protect everything they are going to lose out. We need to work ceaselessly to show them that this is misguided and that this century is about sharing, not possessing, scientific content.
  • The publishers actively resist the re-use of data because they understand its value and wish to sell us our own data back. This blog will try to clarify the cases where this happens and try to give the publishers a chance to argue their case.

The next post shows how our CrystalEye project is helping to make data available to cyberscience.

Posted in cyberscience, data, open issues | Leave a comment

Request to Elsevier for robotic extraction of data from their journals

In previous posts I have written on the value of robotic extraction of data in scientific articles. By default Elsevier do not allow robotic extraction:

All content in this Site, including site layout, design, images, programs, text and other information (collectively, the “Content”) is the property of Elsevier and its affiliated companies or licensors and is protected by copyright and other intellectual property laws.
… and …
You may print or download Content from the Site for your own personal, non-commercial use, provided that you keep intact all copyright and other proprietary notices. You may not engage in systematic retrieval of Content from the Site to create or compile, directly or indirectly, a collection, compilation, database or directory without prior written permission from Elsevier.
The Site may contain robot exclusion headers, and you agree that you will not use any robots, spiders, crawlers or other automated downloading programs or devices to access, search, index, monitor or copy any Content

PMR: So I have written the following letter:

To: permissions@elsevier.com
Subject: Permission to extract crystallographic data robotically from Elsevier publications
Dear Clare Truter,
I and colleagues have built a repository of crystallographic information published in scientific journals. This data is factual, and not copyrighted by the original authors. Major publishers such as the International Union of Crystallography and the Royal Society of Chemistry encourage (and often demand) the publication of such data as part of the scientific record and mount it on their sites as “supporting information” or “supplemental data”. It is of extremely high quality and over the last 30 years the crystallographic and chemical community have shown that it is an essential resource for data-driven science – a concept which the NSF and JISC, among others, see as a large part of future science.
We have built robots which have analysed over 50,000 papers on publishers’ sites and extracted the crystallography. Note that the major publishers I have referred to do NOT require a subscription to access this information. We have agreed protocols whereby our robots run at times and frequencies that do not cause denial of service (DOS) – i.e. we try to be responsible.
Elsevier journals do not expose this as public supplemental information but I believe it is available to toll-access subscribers. I would like permission to extract crystallographic data from any Elsevier journals using robotic techniques and to make the TRANSFORMED extracted data public under a CC-BY licence (Creative Commons) or an OpenData licence from the Open Knowledge Foundation. All data so extracted would be referenced through the DOI of the article, thus allowing any user (human or robot) to give full citation and therefore credit to the authors and the journal.
To help the discussion we note that facts, per se, are not copyrightable and that the authors do not claim copyright. The data are almost always direct output from an instrument. We need not store the actual documents (normally retrieved as IUCr CIF files) as our derived work is a value-added document in XML-CML which retains none of the creative work of formatting and pagination in the original.
I am sure you will agree that this is a reasonable request and that Elsevier as a major scientific publisher would wish to do whatever it could to foster the birth of a new science.
I am guessing that Elsevier journals (e.g. Tetrahedron, Polyhedron, etc.) contain a total of ca 20,000 relevant papers – until we are able to examine them robotically I can’t be more precise. Obviously I cannot write for permission for each paper individually so I am asking for general permission to carry out robotic extraction of crystallographic data from all Elsevier journals to which I have access through my institution. And I would obviously agree to devising a robotic protocol that was friendly to your web server.
If you and colleagues wish to be convinced of the value and quality of this cyberscience please have a look at http://wwmm.ch.cam.ac.uk/crystaleye where you can see the aggregated material from the other publishers. Although we haven’t published the results formally yet, two graduate students have carried out thousands of days’ work of theoretical calculations on the data which we believe have led to new insights into crystal and molecular structure.
I hope that Elsevier will be excited by the new vision and that we can move rapidly towards extracting this data. Note that the robots operate on a daily basis and provide news feeds to the community about new exciting derived data.
Note that this is a public request – I have explained the reasons on my blog (http://wwmm.ch.cam.ac.uk/blogs/murrayrust?/p=432) in which this letter is contained. Since this is a matter of considerable current public interest I request permission to post your replies – if there is material that you wish to remain confidential please send a separate mail to me indicating confidentiality which I will honour.
Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069

Posted in data, open issues | Leave a comment

cyberscience: by default Elsevier's licence, copyright, etc. cripples us

I have recently been invited to write an article on Open Access for an Elsevier journal, “Serials Review”. I would normally refuse as this is a closed access journal but it is an opportunity to get some of the arguments for Open Access and Open Data across. Before committing to write this article I explored all the conditions imposed by Elsevier – and also who might be able to read the article, for what period, whether they could keep a copy, re-use it, etc. I think we can all agree that whatever the rights and wrongs and differences of philosophy, practice, etc. it is complicated. So, in writing this article I am setting aside at least two days of my time for discourse with the publisher. Fortunately in this case the discourse will form part of the article (except, of course, for those parts which the publisher does not allow).
In the SR article I shall publish creative digital works, authored by me. I shall have to negotiate with the publisher as to who retains copyright and what re-use is possible. (I have been distressed by Wiley’s policy that scientific data (as graphs) is the publishers’ property.) I need to find out if Elsevier take the same view.
I’m carrying out this discussion in public. This is because previous conventional correspondence with publishers has frequently been unsatisfactory. For example, letters are sometimes not acknowledged, and discussions break off in midstream (Springer have never replied to my reasonable question as to why they don’t use CC-BY given that their informal discourse suggests this). So I shall post a public copy of my questions, copy them to a named individual at the publisher, and inform them that I would like to publish their reply unless they indicate otherwise. So this is the first of these requests.
I will be the first to admit that I do not always reply to reasonable requests and am not blameless. We are all overwhelmed by information. But I sometimes get the impression that replies and non-replies from publishers are designed to wear the correspondent down rather than to help.
So I haven’t highlighted Elsevier on this blog because they don’t publish any Open Access chemistry that I could discover. (Maybe the HHMI will change this with their agreement with Elsevier for double-payment, free-to-read-but-not-to-do-anything-useful-with “open access”.)
Cyberscience involves using HUGE amounts of data. CrystalEye has used a modest 100,000 crystal structures from about 50,000 publications over several years. The reason it isn’t higher is:

  • Most chemistry departments throw this data away or let it decay (ca 80% revealed in our SPECTRa study)
  • Most theses do not contain this info in machine-readable form.

In principle we can change this by changing the culture (and putting in some funding). But the most serious reason is:

  • Several of the major publishers either do not require this data, or throw it away, or hide it behind toll-access firewalls or worst of all (like Wiley) copyright it.

So I have been trying to find out what Elsevier’s policy is. Be warned, it is complex. Let’s look at readers first (or users in the case of robots). Here is what I found on the masthead of Serials Review. (I have not asked permission to quote this material, but I have given attribution, and will defend this as fair use, in the public interest. I have tried to copy it exactly by cutting and pasting, but may have excised some of the non-printing characters that most publishers add in their creative works).

This website (“Site”) is owned and operated by Elsevier B.V., Radarweg 29, 1043 NX Amsterdam, The Netherlands, Reg. No. 33156677, BTW No. NL005033019B01 (“Elsevier”).
By accessing or using the Site, you agree to be bound by the terms and conditions below (“Terms and Conditions”). These Terms and Conditions expressly incorporate by reference and include the Site’s Privacy Policy and any guidelines, rules or disclaimers that may be posted and updated on specific webpages or on notices that are sent to you. If you do not agree with these Terms and Conditions, please do not use this Site.
Elsevier reserves the right to change, modify, add or remove portions of these Terms and Conditions in its sole discretion at any time and without prior notice. Please check this page periodically for any modifications. Your continued use of this Site following the posting of any changes will mean that you have accepted the changes.
Copyrights and Limitations on Use
All content in this Site, including site layout, design, images, programs, text and other information (collectively, the “Content”) is the property of Elsevier and its affiliated companies or licensors and is protected by copyright and other intellectual property laws.
You may not copy, display, distribute, modify, publish, reproduce, store, transmit, create derivative works from, or sell or license all or any part of the Content, products or services obtained from this Site in any medium to anyone, except as otherwise expressly permitted under applicable law or as described in these Terms and Conditions or relevant license or subscriber agreement.
You may print or download Content from the Site for your own personal, non-commercial use, provided that you keep intact all copyright and other proprietary notices. You may not engage in systematic retrieval of Content from the Site to create or compile, directly or indirectly, a collection, compilation, database or directory without prior written permission from Elsevier.
The Site may contain robot exclusion headers, and you agree that you will not use any robots, spiders, crawlers or other automated downloading programs or devices to access, search, index, monitor or copy any Content, including but not limited to harvesting other’s postal or email addresses from the Site for purposes of sending unsolicited or unauthorized commercial material, is prohibited. Any questions about whether a particular use is authorized and any requests for permission to publish, reproduce, distribute, display or make derivative works from any Content should be directed to Elsevier Global Rights.

PMR: This is clear. Elsevier owns everything in the article I submit to them. (I may have missed something, but I get the impression that if I query it they will thank me and claim ownership to it as well). By default Elsevier forbids me to carry out cyberscience on any of their material. Maybe this is an oversight and the responsible thing is to write for permission. Tedious, but here goes:

PERMISSIONS

PERMISSIONS – JOURNALS
Elsevier is pleased to announce our partnership with Copyright Clearance Center to meet your licensing needs. With Copyright Clearance Center’s Rightslink® service it’s faster and easier than ever before to secure permission from Elsevier titles to be used in a book/textbook, journal/magazine, newspaper/newsletter, coursepack/classroom materials, TV programme/documentary, presentation/slide kit, thesis/dissertation, CD-ROM/DVD, promotional materials/pamphlet, conference proceedings, CME material, training materials, in a poster, use a journal cover image, or photocopy. Getting permission via Rightslink is easy, simply follow the steps below:
1. Locate your desired content on ScienceDirect (guest users can view abstracts for free)
2. Click on the “Request Permissions” link to the right of the abstract
3. The following page will then be launched (turn off your pop-up blocker):
Rightslink
Select the way you would like to reuse the content. Create an account if you haven’t already. Accept the terms and conditions and you’re done. For questions about using the Rightslink service, please contact Customer Support via phone 877/622-5543 (toll free) or 978/777-9929, or email customercare@copyright.com.
PERMISSIONS – BOOKS / MATERIAL NOT ON SCIENCEDIRECT
Requests to re-use material from any of our publications not on ScienceDirect can be submitted by completing the Permission Request Form below. For clarification, this covers the following imprints:
  • Academic Press
  • Balliere Tindall
  • Butterworth-Heinemann (US)
  • Cell Press
  • Churchill Livingstone
  • CMP
  • Elsevier
  • Elsevier Current Trends
  • JAI
  • Medicine Publishing
  • Morgan Kaufmann
  • Mosby
  • North-Holland
  • Pergamon
  • Syngress
  • Saunders

Submitting your request via the Permissions Request Form enables us to respond more quickly and our aim is to process routine requests within 15 working days of receipt. However, every effort will be made to meet more immediate deadlines if indicated.
It is also possible to contact our Permissions departments directly:
Clare Truter
Rights Manager
Global Rights Department
Elsevier Ltd
PO Box 800
Oxford OX5 1DX
UK
Tel: (+44) 1865 843830 (UK) or (+1) 215 239 3804 (US)
Fax: (+44) 1865 853333
E-mail: permissions@elsevier.com or healthpermissions@elsevier.com

The next post contains a copy of a letter to Elsevier requesting permission.

Posted in data, open issues | 1 Comment

Copyright drains academic productivity and the birth of cyber-science

I am now starting a train of thought that will show how cyber-science (e-science in UK) might be practised. It’s real, and the work that Joe, Nick and I have done will lead to conventional publications in reputable journals. Yet the work depends completely on the access to data in and associated with primary publication. In a few cases the publishers have been helpful but in general the publishers are doing everything they can to stop me and my colleagues accessing this data in the appropriate form for cyber-science. In effect the publishers are actively preventing work being done. I’ll first outline some basics and then soon start showing the science we have done in e-crystallography. We may very well present some of the results before they are formally submitted to a journal, so those publishers who forbid this can relax: you won’t have to deal with a paper from us.
I missed the article below when it first came out. It’s a well presented case of the damage that copyright is doing to academia in requiring librarians, other staff and students to spend a lot of time worrying about copyright issues that none of them want. The reason it is so relevant is that I have just been invited to write two articles on Open Access, in each case for closed access publications. Normally I would simply refuse on the basis that I cannot be true to my message if I publish in closed access outlets. (I have to fudge this for chemistry as there are no open access journals in my area – yes, I shall publish in Chemistry Central when possible…). But in this case there is a special case to get my message across to a wider community – I am not interested in the academic glory of publishing in these journals – I have no idea what the impact factor is – but it is a useful way of spreading the message.
When I publish in a closed access journal the first thing I worry about is what rights I have to transfer to the publisher. The first thing. Not “what am I going to say?”, “is this article friendly towards text-mining software?”, “how shall I attach repository-friendly software?”. But “will I or my readers get lawyers’ letters if we get it wrong?” Please accept that this is an enormous drain on my creativity. I shall show in the next post that it is likely to take at least a day of my time debating with the publisher what I am allowed to say in my article and what not. Nothing to do with the content of the article, simply whether the publisher wishes to possess my creative work. If I multiply this by the number of closed access publishers I publish with and the hassle that I or my co-author Henry Rzepa has with them we are talking several days a year just worrying about copyright. Wasted time. (Note of course that when Henry and I published with BioMed Central we did not have to worry about any of this – only the awful publishing technology we were encouraged to use.)
So, simply, copyright in academic publication – even if we were all in favour of it and thought it the most wonderful thing in the world – is a HUGE drain on academia. Enough of me – here’s Paul Staincliffe – abstract and some snippets…
[Note, ironically, that the article carries no copyright or licence and the eprints engine also gives no licence or copyright. So in principle I have to write to Paul and ask for permission to abstract or hope that my snippets will be interpreted as fair use. In contrast our own repository at Cambridge announces something really helpful like “all items are protected by copyright”. As Paul shows librarians are worrying too hard. Surely the intent of a repository is “we would like you to read and re-use and republish anything in here without our permission unless we say otherwise”. We could fix this tomorrow if we wanted. ePrints, DSpace, Fedora – take note – just add a “default licence = CC-BY unless otherwise”]. Enough of me…

The nonsense of copyright in libraries : digital information and the right to copy
Staincliffe, Paul (2006) The nonsense of copyright in libraries : digital information and the right to copy. In Proceedings LIANZA Conference 2006, Wellington (New Zealand).
Abstract
The notion of copyright is deeply entrenched in the psyche of librarians, who remain one of the few groups who consistently support or uphold it. Given the growth of digital information and consequential change in the behaviour of information creators and users the paper posits that copyright administration in libraries has become a cumbersome burden whose “time has come”. Changes in information provision by libraries towards delivering more digital information have ironically highlighted the paradox libraries face between providing the best possible service and upholding copyright. The notion that there exists in the digital environment a “right to copy” is put forward. Copyright is legally complicated, controversial, subject to a number of misunderstandings and generally not fully understood even by the librarians whose daily tasks include administering it. To better understand the current status of copyright and its impact on libraries the notion of copyright is briefly outlined, along with what exactly copyright is, its historical roots and its suitability in the current environment. In examining the legislation the paper critiques its aims and how it fails in these; compares arguments in favour and against its retention, investigates how it serves to restrict creativity rather than encourage it and in closing suggests why libraries should abandon the struggle to uphold copyright. Examples from New Zealand, Australia, the US and the UK are used to highlight inconsistencies that support the argument that copyright in the digital environment is a nonsense that no longer works.
… snippets …
The notion of copyright is deeply entrenched in the psyche of librarians, who remain one of the few groups who consistently support or uphold it.
Anybody who has ever watched student behaviour in an information commons or a customer at a photocopier will know that copyright is the last thing on their mind as they download or copy page after page after page of data. If you are a conscientious librarian you will also be faced with a dilemma; should I approach them, should I question or challenge them on their behaviour, should I be the ogre and remind them of the copyright regulations which we have gone to great trouble to display prominently, or should I just pretend I didn’t see them, or just let them get on with their work? After all it’s hard enough being a student without me policing their behaviour and they must need the data anyway….
Some brief facts regarding copyright:
1. It is complex and confusing. A whole legal industry has grown around the notion. Legal journals and texts are printed in large numbers and the discipline now encompasses intellectual property and trademark law. As with all legal disciplines, opinion on the same issue is often at either end of the spectrum.
2. The New Zealand Act itself is lengthy at almost 200 pages (New Zealand Government 1994).
3. For a uniform notion, different rights exist in different countries. There are many similarities but no one single agreement on rights exists universally (including the Berne Convention). For example, Crown works in New Zealand are copyright protected (s. 26). In the US, federal government works are not protected but state or local government works may be protected (Crews 2006).
4. It promotes a monopoly arrangement and trading position.
5. The work must be original. Copyright cannot exist on a copy that has been plagiarised.
6. A distinction must be made between the works themselves and the copyright to the work. The two are separate entities but are intertwined. It is important to understand the notion that it is the “expression” that is protected and not the “idea”.
7. Ownership of an item does not confer any rights of copyright over the item.
8. It encompasses a bundle of rights including the right to copy, make adaptations, perform or broadcast the work and have sole ownership (although true to the illogical nature of copyright, this is not as straightforward as it may appear).

PMR: anyone disagree? So we are saying we have a complex C20 byzantine juggernaut that we have to operate in the C21.
… PS: Who wants copyright?

Libraries and universities
Libraries and universities find themselves in an interesting position. Ironically they find it hard to accept the loss of copyright because they are founded on print culture. They are faced with having to open up to a challenge, having to change and having to accept the new. Our profession is not renowned for its willingness to grasp change and the new. Yet as far as the profession is concerned we really have nothing to gain from it. We do not receive royalties or a fee; we are not remunerated for administering the provisions of the legislation or the cost of notices, time, stress and worry about what we are asked to do by customers. A recent example from my own experience is of a music portfolio submitted in fulfilment of a degree which necessitated an inordinate amount of time and correspondence between the student, the faculty, departmental secretaries and library staff. Even though library staff are generally following their organisations’ regulations, they often suffer from the stress, and retain a sense of concern, that they are not held personally liable should someone decide to sue for breach of copyright.
Copyright places a huge financial burden on academic institutions in New Zealand. In 2004, total revenue for copyright licenses paid to Copyright Licensing was $4.8 million ($4 million domestic revenue), up 14% from 2003. 50% of the domestic licensing revenue was paid by universities, amounting to some $2 million. Even the Chief Executive Officer of Copyright Licensing was forced to admit that “licensing in the educational sector has almost reached saturation level” (Sheat 2005).
… and who doesn’t …
In addition to students and creators the following could be said to be in favour of abandoning copyright:
1. Those paying the vast amounts of licensing fees.
2. Universities and other academic institutions who must administer copyright.
3. Those unable to make use of a copyrighted work.
4. Those that blatantly or through ignorance ignore copyright (a conservative 99% of the population?).
5. Faculty members who happily turn their papers over to domains such as institutional repositories or pre-print archives would seem to have little concern for giving up their rights.
6. Those attempting to trace copyright holders to gain permission, or determine if copyright still applies to a work (known as orphan works).
7. Those that see copyright as unworkable in relation to digital data.

PMR: #6 and #7 are the points at issue for me. I’ll show in the next post how restrictive licences from publishers are destroying e-science and cyber-infrastructure.

PS: Conclusion
Copyright legislation is complicated and in many cases either contradictory, illogical and/or completely confusing. In the midst of a new era of information creation and distribution the legislation fails to keep pace with developments. The public and young people in particular, regard copyright as an illogical impediment to their social or work behaviour. The existence of punitive penalties and the lack of prosecutions of library customers contribute to the notion held by customers that copyright can be violated without fear of prosecution. Library staff are generally completely unaware of the breaching of legislation, or hesitate to challenge customers they suspect of breaching legislation.
The digital environment has created new formats of data and the ability to transmit that data instantaneously and with ease. The rationale for copyright, based on analogue works and on securing an economic reward for the creator’s labour to stimulate further works, is no longer valid. In the digital environment, once a work is created and made publicly available, access to and control over it are practically unenforceable, and both copyright and the right to copy break down.

Posted in data, open issues

Hughes afraid of the Big Bad Wolf?

From Peter Suber, blogging Alex Palazzo


Alex Palazzo, JCB to HHMI: Why did you sell out to Elsevier? The Daily Transcript, July 18, 2007. Excerpt:

Yesterday…I came across a commentary by Mike Rossner and Ira Mellman, the two big guys at the Journal of Cell Biology. The commentary concerns the resolution of a year long fight between the Howard Hughes Medical Institute and Elsevier. To force the hand of the publishers and to support open access, HHMI instituted a new policy – they would evaluate prospective and continuing HHMI investigators based on published manuscripts that were freely accessible within 6 months of the publication date. In other words, HHMI evaluators could not consider any manuscript that was published in a journal like Cell, whose policy is to allow open access of manuscripts only after 12 months. Since Elsevier is one of the major publishing companies that has a >12month wait period, and since Elsevier owns Cell, one of the premiere journals, this action by HHMI was seen by some as a clash between these two institutions.
Recently HHMI and Elsevier came to a compromise, in that the former would pay the latter $1,500 per manuscript that came from an HHMI investigator. In exchange, Elsevier would allow free access to these publications via PubMed Central within the 6 month waiting period. So is this a victory for open access? Not really….
[Here]…is the editorial from the June 18th edition of JCB:

How the rich get richer. HHMI will bestow monetary rewards on a commercial publisher in return for the type of public access already provided by many nonprofit publishers….Two problems with this deal immediately come to mind. First, there is a clear potential for conflict of interest when a publisher stands to benefit financially by publishing papers from a particular organization. Second, and even more seriously, this action by HHMI undermines the effort to persuade commercial publishers to make their content public after a short delay, by rewarding them for not doing so….
For many years The Rockefeller University Press and many other nonprofit publishers have released all of their content to the public after only six months, and have proven that such a policy does not reduce subscription revenues. We thus provide all authors with a free service for which HHMI will now pay Elsevier. Commercial publishers should need no financial incentive to provide this service to the scientific research community, on whom they rely for their content, their quality control, their subscribers, and for the patronage of their advertisers. Instead, Elsevier has accepted a deal that does a disservice to that community by increasing publication costs and thus further reducing the funds available for research.
HHMI has rewarded Elsevier for their steadfast refusal to release their content by further enhancing their already highly profitable business model….It is unfortunate that HHMI has forfeited its substantial bargaining power in a deal that represents a setback to the mission of public access….

    PS: Comment. Exactly. See my similar evaluation in the April issue of SOAN.

PMR: I share this sadness and add brief comments on why it is even worse:

  • The quality of “open/free/author” “access/choice/science” created by publishers is a disgrace. There is no effort at providing anything that an author could feel pleased with.
  • The final “product” is hardly ever “Open Access” in the full BBB sense. Readers (that quaint archaic word for end-users) have few rights – they cannot text-mine, extract data, etc. At the very least the funders should get full OA for their contribution. In many cases they are simply paying for a marginal difference in visibility compared with the other papers on the site.
  • It gives the impression that HHMI either doesn’t care, doesn’t understand the issues, or is overwhelmed by corporate pressure, FUD, or enticements. The really sad thing is that it makes life so much more difficult for the real pioneers such as the Wellcome Trust, who have publicly fought this all the way, insisting on full Open Access and taking access to data seriously.
  • It suggests that all that matters is some fuzzy visual access to research. Scientists need proper access to data. They need robots extracting it. (Of course, if the publishers had any pretensions to living in the 21st century they would actually require authors to publish their data. I shall return to this.)
  • And there is just a general feeling of mess. With Wellcome and the NIH, we had a united, simple message: full Open Access, proper value for money.

But the days of darkness are numbered. Not through the vapid efforts of HHMI, but because the publishers – who are now the problem – will be swept away by the next information revolution, which, through their resistance to the natural pressures for change, will condemn most of them to oblivion within a decade. And there is nothing they can do to stop it. Adapt or die.

Posted in data, open issues, Uncategorized