OA at Stirling

I was on the staff at the University of Stirling (in Scotland) for 15 years, so I am delighted to repost Peter Suber: Stirling U adopts an OA mandate (Stirling research goes global, a press release from Stirling University, April 9, 2008). The press release includes an announcement and the full text of the new policy. Excerpt:

The University of Stirling has become the first academic institution in the UK to oblige staff to make all their published research available online.
Stirling is leading the way in open access to its research work, after the University’s Academic Council issued an institutional mandate which requires self-archiving of all theses and journal articles.
PMR: Nostalgia – I used to sit on the Academic Council…
Professor Ian Simpson, Deputy Principal (Research and Knowledge Transfer) said: “We believe that the outcomes of all publicly funded research should be made available as widely as possible. By ensuring free online access to all our research output, we will maximise the visibility and impact of the University’s work to researchers worldwide.”
The four year project to create STORRE (Stirling Online Research Repository) has been brought to fruition by information technology specialists Clare Allan and Michael White.
Clare Allan said: “The University now requires all published journal articles to be deposited by authors, as soon as possible after they are accepted for publication, and in compliance with the publishers’ copyright agreements.
“It is an important landmark in our archival development and marks the conclusion of a process that started in 2004 when Stirling was one of 20 academic institutions which signed up to the OATS (Open Access Team for Scotland) declaration. The repository project initially focused on electronic theses and in session 2006/07 we became one of the first universities to require these to be submitted electronically.
“The next stage was a pilot scheme for self-archiving of journal articles by some researchers, and this has now become mandatory. We are also building up a retrospective archive.” …
Michael White added: “We are hopeful of a very positive response from researchers to the requirement to self-archive, as they will benefit from greater visibility of their work – such as increased citations from their published work, which in turn can lead to improved funding. To quantify this, they can track how often each article is viewed.” …

PS Comment. The Stirling policy is not only the first university-level OA mandate in the UK [PMR: PS corrected this – Soton beat them by a few days], but the second worldwide (after Harvard’s) to be adopted by faculty rather than administrators. Moreover, it’s detailed and strong. I’m especially glad to see that it requires deposit “immediately upon acceptance for publication” even if it permits delayed OA “until the item has been published, and until any publishers’ or funders’ embargo period has expired.” Kudos to all involved.

PMR: We’ve already benefitted from the Open theses available from Stirling and use them as exemplars for our Spectra-T work.
It is clear that decisions like this, pushing the frontiers of Open Access, help to change the world. The more that this happens, the greater the courage it gives to others. Stirling – as a new (1967) university – was always keen to innovate, and I’m proud to feel part of this. (We had our 40th reunion last autumn.)

UKSG – Jim Griffin

Spent the last 1.5 days at UK Serials Group in Torquay – a mixture of university LIS people, publishers and suppliers of services/products in scholarly publishing. A clear indication that major change is in the air, but few pointers that it’s moving ahead. Our final speaker – Jim Griffin – is the incoming president and is bringing experience from the music industry. He’s exploring the idea that any child should have access to any cultural work, yet we cannot undervalue creativity by leaving it unrewarded. We are in the time of a bionomic flood – networks behave like biological things. Information is like rivers or the flow of blood. It takes the straightest or most economic course – that’s the natural flow. A river recutting its course may flood existing communities.
There is good in this force. But there are cases where it has deprived creators of a just reward. Our content is at risk from technology – good and bad. We pay lip service to control, but actually move towards compensation. Actuarial compensation replaces actuarial control. The idea of charging for content is not tenable. And we haven’t yet started the revolution. We shall be awash with wireless digitization. Soon we shall have pervasive digits – all we want – and we shall have no need to carry them around. The future is just-in-time delivery – not storage. Delivery will eat distribution. Warehouses are signs of inefficiency.
McLuhan said that we will never understand the media of our time – we are as unconscious of them as fish are of water. We need to look in the rear-view mirror to see where we are going.
The 1920s were when electricity started, and the changes were more savage than the dotcom era. Music becoming electric was more savage than music becoming digital. Television was an outrageous dislocation. How did we deal with it then? Speak to the nonagenarians as a resource for understanding change.
In the time of Victor Hugo it was unlawful to read books aloud. They formed the first collective licensing organisation for music. We cannot now control the spread of music. Is a robot downloading music something that requires compensation? We live in a time of Tarzan economics – we cling to a vine of product to keep us off the floor. The future is about swinging to the next vine and letting go of the current one.
As a community you have an advantage in having a high female membership. Web 3.0 is about feminising marketing by starting relationships that never end.
Four characteristics:

  • we will be there when we have removed the need for piracy, rather than searching for mechanisms to stop it. Until then we are in trouble. If the business model depends on stopping people making copies, it becomes impossible. Technology has obviated the mechanism of copyright as an enforceable business.
  • Piecing out ideas in small bits is the antithesis of creativity, which is about bundling. Substituting albums with singles will not save the music industry.
  • Your competition is not with pirates but with the clock.
  • The universities are the key – any business model must be built on them.

We will have done the right thing when our content feels free even if it isn’t.


Have any closed access articles appeared in PMC?

I’m giving a talk to the UK Serials Group tomorrow and I would like to be able to show articles deposited in PMC under the new mandate. I have no idea whether any have arrived (or whether authors have been doing this for months in anticipation). Any pointers (URLs) would be much appreciated.
While I’m blogging, Bill Hooker has a valuable post reiterating clearly all the points about full Open Access:

 

Removal of permission barriers is already part of the definition of OA

 

He finishes:

 

Having said all that, though, I’ll add that an explicit description of machine readability requirements would be an addition to the accepted definition of OA — and one that I would welcome. Peter Murray-Rust recently noted that, according to the “price and permission barriers” view of Open Access, PubMed isn’t OA — even PubMed Central isn’t OA. I’ll go even further: can anyone point me to a single Open Access repository? I don’t know of even one such site that removes both price and permission barriers. Surely there must be some, but the Big Names (PubMed Central, arXiv, Cogprints, CiteSeer, RePEc, etc — see ROAR) don’t seem to qualify, because digital objects in these repositories carry their own copyrights, rather than being covered by a blanket license provided by the repository.
Can this be true? Five years after the BBB definition came together, more than ten years since Stevan Harnad’s subversive proposal and on the first day of the NIH mandate — widely referred to as an OA mandate! — can it be that we really don’t have a single truly OA repository in all the world? And if it is true, would it help to make the official definition more explicitly machine-friendly?

PMR: Indeed. I don’t know of a true OA repository (of fulltext) – where a robot can go and be assured that it can download anything without getting letters from lawyers. We MUST develop full machine-readable licences (yes CC has these already for articles, but are they actually used?)
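To make “machine-readable” concrete, here is a minimal sketch of the kind of check a robot would like to be able to make – look for a standard rel="license" link on an article’s landing page (one of the mechanisms CC recommends). The URL is hypothetical and this is only an illustration of the idea, not a real harvesting tool:

```python
# Toy sketch: does an article landing page declare its licence in a
# machine-readable way via a rel="license" link? The URL is hypothetical.
import urllib.request
from html.parser import HTMLParser

class LicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # <link rel="license" href="..."> or <a rel="license" href="...">
        if tag in ("link", "a") and "license" in (a.get("rel") or "").lower():
            self.licenses.append(a.get("href"))

url = "https://example.org/some-article"   # hypothetical article page
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
finder = LicenseFinder()
finder.feed(html)

if any(h and "creativecommons.org/licenses" in h for h in finder.licenses):
    print("Machine-readable CC licence found:", finder.licenses)
else:
    print("No machine-readable licence found - a robot cannot tell what it may do.")
```

Until most articles carry something this simple, a robot (and its lawyers) are left guessing.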

 


Update on text-mining NIH

There have been a number of useful replies to my concern over text-mining the NIH. To resolve some of the confusion:

  • NIH have ca 1,000,000 journal articles. These are NOT permissionFree Open Access. There is a limit on what you may legally do with them. They retain publisher copyright. In each case there may have been intensive negotiations with publishers as to what conditions apply. You may not bulk download these, either through robots or OAI-PMH. robots.txt is irrelevant to these articles. If you try to mine you will probably be cut off. If you republish for whatever purpose and go beyond fair use (in the publisher’s judgment) you may be pursued by the publisher. This is NOT good enough for any text-mining.
  • The NIH have ca 50,000 articles in “OA journals” or otherwise known to be “OA” – which are permissionFree. You may mine these and do whatever you wish (though I am unclear whether there is a trap on CC-BY vs CC-NC vs CC-ND). In addition there are about 10,000 author-deposited articles.

Only 6% of the NIH material is therefore Open/permissionFree. You can mine this, etc., as several correspondents have pointed out. But as we know from the OA struggle, 5% is about all you get from requests to authors and publishers. It required a legal mandate from George Bush to ensure that authors HAVE to deposit. This is the real concern of my postings – the 94% that cannot be text-mined.
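For anyone who wants to check these proportions themselves, here is a rough sketch using the NCBI E-utilities esearch service. Note that both [filter] terms are my assumptions about PMC’s search syntax and may need adjusting – treat this as an illustration, not a definitive recipe:

```python
# Sketch: compare the size of Pubmed Central with its Open Access subset via
# NCBI E-utilities (esearch). Both [filter] terms are assumptions about PMC's
# search syntax and may need adjusting.
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmc_count(term):
    params = urllib.parse.urlencode(
        {"db": "pmc", "term": term, "retmode": "json", "retmax": 0})
    with urllib.request.urlopen(ESEARCH + "?" + params) as resp:
        return int(json.load(resp)["esearchresult"]["count"])

total = pmc_count("all[filter]")
open_subset = pmc_count('"open access"[filter]')
print(f"PMC total:          {total}")
print(f"Open Access subset: {open_subset} ({100.0 * open_subset / total:.1f}%)")
```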
We need to show WHY textmining is critical. Here is a splendid post from Glen Newton. PLEASE let us collect other examples to send to the NIH…

FREE THE ARTICLES! (Full-text for researchers & scientists and their machines)

At a recent plenary I gave [earlier post] at the Colorado Association of Research Libraries Next Gen Library Interfaces conference, I went a little off-script and was educating (/haranguing) the mostly librarian audience about the present-and-near-future importance of the accessibility of full-text research articles to their researchers and scientists.
By accessibility of full-text I didn’t mean the ability of a human to access the PDF or HTML of an article via a web browser: I was referring to the machine-accessibility of the text contained in the article (and the metadata and the citation information).
I was concerned because of the increasing number of discipline-specific tools that use full-text (& metadata & citations) to allow users (via text mining, semantic analysis, etc.) to navigate, analyze and discover new ideas and relationships, from the research literature. The general label for this kind of research is ‘literature-based discovery’, where new knowledge hidden in the literature is exposed using text mining and other tools.
Most publisher licenses do not allow for the sort of access to the full-text that many of these discovery and exploration tools need.
When I asked for a show of hands of how many were aware of this issue, of the ~200 in the audience, no one raised their hand.
I went on to suggest/rant that librarians should expect more of their researcher/scientist patrons to be needing/demanding this sort of access to the full-text of (licensed) journal articles. They need to anticipate this response, and I suggested the following non-mutually-exclusive strategies:

  • demanding licenses from publishers and aggregators that allow them to offer access to full-text for analysis by arbitrary patron tools
  • asking publishers to publish their full-text in the Open Text Mining Interface (OTMI)
  • supporting Open Access journals, which allow for much of this out of the box

Recently I retro-discovered an article [1] in The Economist, which explains to the lay-person some of the kinds of things that can be done with access to the literature. This study [2] shows how researchers discovered the biochemical pathway involved in drug addiction from the literature alone. They did no experiments. The discovery was derived from an analysis and extraction of information from more than 1,000 articles! And this is not the first time this sort of thing has happened [3]. Clearly, this sort of analysis can save time and money in discovering important and relevant scientific knowledge.
[1] Drug Addiction: Going by the book (2008). The Economist, January 10 print issue.
[2] Li, C., Mao, X., Wei, L. (2008). Genes and (Common) Pathways Underlying Drug Addiction. PLoS Computational Biology, 4(1), e2. DOI: 10.1371/journal.pcbi.0040002
[3] Swanson, D. (1986). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1), 7–18.
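PMR: to make “literature-based discovery” concrete, here is a toy sketch of the Swanson-style “ABC” idea – find a hidden A–C link via shared intermediate terms B. The four “abstracts” are invented one-liners, and this is emphatically not the method of [2]; it just shows the principle:

```python
# Toy Swanson-style "ABC" literature-based discovery.
# If A co-occurs with B in some documents, and B with C in others, but A and C
# never co-occur directly, then A-C is a candidate hidden link.
# The "abstracts" are invented; real studies mine thousands of papers.
from collections import defaultdict

abstracts = [
    "fish oil reduces blood viscosity",
    "fish oil lowers platelet aggregation",
    "raynaud syndrome involves high blood viscosity",
    "platelet aggregation is elevated in raynaud syndrome",
]
terms = ["fish oil", "blood viscosity", "platelet aggregation", "raynaud syndrome"]

occurs = defaultdict(set)            # term -> set of documents mentioning it
for i, text in enumerate(abstracts):
    for t in terms:
        if t in text:
            occurs[t].add(i)

a, c = "fish oil", "raynaud syndrome"
if not (occurs[a] & occurs[c]):      # A and C never appear together...
    bridges = [b for b in terms if b not in (a, c)
               and occurs[a] & occurs[b] and occurs[b] & occurs[c]]
    print(f"Candidate hidden link: {a} <-> {c}, via {bridges}")
```

None of this is possible, of course, unless the robot may read the papers in the first place.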

PMR: I am now off to talk to the UK Serials Group in Torquay. I shall highlight this example of why it’s so important.


The value of text-mining

In response to my concern about access to the full text in Pubmed Central, the blog Suicyte Notes questions the value of text-mining:

I cannot think of a single example where text-mining has ever made a major contribution to solving any real-life biomedical problem. Even if there are such examples, their number will be small. If we compare the health benefits from text mining efforts to those provided by real (human) scientists reading the literature, I have no doubt that the latter would prevail by a big margin.
There should be no doubt about it, it would clearly be a good thing to enable text-mining on PMC. However, describing the current situation of free access to PMC papers for scientists as useless without added text-mining capabilities appears to be, well, kind of biased.

PMR: I actually said “desperately impoverished”, not “useless”, to which I stick. The post has generated a series of comments on Suicyte which are worth reading and generally highly supportive.
From my own experience the average bioscience paper is incredibly difficult to read. The terminology is arcane and in places bizarre. What does “hedgehog” mean? You and I might think it was a spiky mammal, but actually it’s a gene and signaling pathway. (The Drosophila community delights in using amusing names – such as “clueless” – for their genes; other communities use opaque abbreviations/acronyms such as BRCA1 and RAD5.) And there is the chemistry – how many readers know what “epibatidine” is? So a very simple, quick, extremely valuable lightweight use of text-mining is to annotate papers for easy reading.
This annotation and republication is forbidden by most publishers. Many have no interest in making papers easy to read or use – it costs money. So, through our OSCAR software, we could – if we were allowed – annotate the chemistry in most of the world’s literature. We are forbidden to do so.
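To show how modest the request is, here is a toy sketch of such lightweight annotation – a simple dictionary lookup that glosses arcane terms inline. This is not OSCAR (which uses curated dictionaries and trained recognisers); the glossary entries are illustrative:

```python
# Toy annotator: gloss arcane terms inline so a paper is easier to read.
# The glossary is illustrative; OSCAR and real annotators use curated
# dictionaries, ontologies and machine-learned recognisers.
import re

glossary = {
    "hedgehog":    "a gene / signalling pathway, not the spiky mammal",
    "BRCA1":       "a human gene associated with breast-cancer susceptibility",
    "epibatidine": "an alkaloid first isolated from a poison frog",
}

def annotate(text):
    """Wrap every glossary term with an inline explanation."""
    for term, gloss in glossary.items():
        text = re.sub(rf"\b{re.escape(term)}\b", f"{term} [{gloss}]",
                      text, flags=re.IGNORECASE)
    return text

print(annotate("Signalling through hedgehog modulates the response to epibatidine."))
```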
Even this lightweight annotation would be an enormous boon to science. But fulltext-mining goes way beyond that. The bioscience literature is full of observations that are either not explained or are later revised. Machines play a major role in trying to help us understand this mass of science. I’m not claiming that machines can replace humans – the human-to-human communication in most papers is so esoteric and unsemantic that it’s currently impossible. [If we had semantic authoring things would be different.] But when machines do those bits that humans hate – searching, linking, resolving synonyms, etc. – human productivity is vastly increased, to the extent that we undertake new things as we are freed from the boring stuff. One example from the comments (Lars Jensen):

3) There are actually cases where text data mining was used to make discoveries of direct medical relevance. The most famous examples are the links between Raynaud syndrome and fish oil and between migraine and magnesium deficiency.


Text-mining the NIH mandated papers

There have been some very useful responses (see comments to Can I data- and Text-mine Pubmed Central? and followup) to my assertion that we may not text-mine the major part of the material to be deposited under the NIH mandate. Peter Suber, as always, makes it precisely clear in No data- or text-mining at PMC:

  • Peter MR is right. PMC removes price barriers and leaves permission barriers in place. Users may not exceed fair use, which is not enough for redistribution or most kinds of text- and data-mining. For detail – and official confirmation – see Question F2 in the NIH FAQ:

What is the difference between the NIH Public Access Policy and Open Access?
The Public Access Policy ensures that the public has access to the peer reviewed and published results of all NIH funded research through PubMed Central (PMC). United States and/or foreign copyright laws protect most of the articles in PMC; PMC provides access to them at no cost, much like a library does, under the principles of Fair Use.
Generally, Open Access involves the use of a copyrighted document under a Creative Commons or similar license-type agreement that allows more liberal use (including redistribution) than the traditional principles of Fair Use. Only a subset of the articles in PMC are available under such Open Access provisions. See the PMC Copyright page for more information.

PMR: I was aware of this paragraph and – like so many – it expounds general principles without giving precise indications as to what can be done. It is clear now that PMC does not – by default – promote text- or data-mining.
[Additional points: Some correspondents suggest the only NIH/PMC barrier is robots.txt – a guide to how and when robots can download. This is not the primary problem – it is that the papers in PMC still carry restrictive copyright, and PMC restricts download of everything except the Open Access subset. Others have commented that PMC has an Open Access subset. Yes – I am aware of this and have been working on how to mine it. But if voluntary Open Access – authors and publishers – were delivering what we want, there would have been no need for the mandate. The mandate forces authors to deposit copies of Closed Access publications, thereby removing price barriers.]
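For clarity, here is what robots.txt actually settles – and what it does not. A minimal sketch of the politeness check a well-behaved crawler performs (the article path is made up); passing it says nothing whatsoever about copyright or permission barriers:

```python
# robots.txt is a politeness convention: it tells a crawler which paths it may
# fetch, but says nothing about what copyright lets you do with the content.
# The article URL below is hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.ncbi.nlm.nih.gov/robots.txt")
rp.read()

article_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/"  # hypothetical
if rp.can_fetch("MyTextMiningBot", article_url):
    print("robots.txt permits fetching - but permission barriers may still apply")
else:
    print("robots.txt disallows fetching this path for our robot")
```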
PeterS continues:

  • Removing price barriers from NIH-funded research was a major victory, and one we couldn’t have achieved if we demanded the removal of permission barriers at the same time. But Peter is right that researchers need more and that we have to keep working for further goals. In time, I hope we can shorten the permissible 12 month embargo and remove permission barriers from the copies covered by the NIH policy.

PMR: Yes. So it’s critical that we provide a submission to the NIH on this point. I know there are individuals in the NIH who appreciate the value of full-text mining but it’s not obvious to many.
I will deal with the question of “is text-mining useful” [which surprised me] in a separate post.


NO-ONE MAY DATA- OR TEXT-MINE PUBMED CENTRAL

I realised with considerable disappointment (Can I data- and Text-mine Pubmed Central?) that I might not be able to text- and data-mine the material that the NIH has required to be deposited in Pubmed Central under its mandate. Now I have confirmation by email from an authoritative source (who asks not to be named in case the information is not quite precise). But in general terms the answer is simple:
NO-ONE MAY DATA- OR TEXT-MINE PUBMED CENTRAL
In short, Pubmed Central is “free access” (no price barriers), not “open access” (no permission barriers). You may not download material from it (except to expose it to your own eyeballs), and certainly may not redistribute it. You may not data-mine it.
I am aware of the struggle that was required to get George Bush to sign the mandate and it certainly wasn’t the time to break ranks. But now that the mandate is passed (and starts tomorrow) we must press ahead immediately to campaign for full access to the text.
We have the right and the duty to submit our views to the NIH. For example, Stevan Harnad has argued (recommendations to the NIH) that it is better to reposit in institutional repositories (“green”). Whether or not this is a good idea (and I personally don’t think so, as it makes data-mining almost impossible), it is clearly outside the current approach from Pubmed Central. For example, I gather, the mirrors of PMC have to agree to the same absolute permission barriers that PMC imposes – it would be impossible to ensure that thousands of libraries enforced this – almost draconian – contractual system.
So we have to argue to the NIH that bioscience is desperately impoverished by the unreasonable permission barriers that are now in place.  I’m not a (US) politician and I think the NIH and advocates have done well to win the first battle. But at present the policy is seriously hindering modern science.
So the whole area is incredibly complex. The goal is simple – use scientific publications to further our understanding of science and – hopefully – make progress in enhancing human health. For this we MUST have robots. We cannot do it with humans alone – every week we get thousands of new papers.
I’d be grateful to know what the position is with Wellcome. I thought they had removed permission barriers.


Egon Willighagen and the Blue Obelisk

I mentioned that I was acting as Egon Willighagen’s promoter; here is his report of being awarded his doctorate:

T plus 51 hours: a short photo impression

I normally do not do these kinds of blog items, but, in reply to Christoph’s blog, here’s an overview of the ceremony (see also T-26 and T+18):

Egon thoroughly deserves this. His thesis is an excellent overview of chemometrics and chemoinformatics (since everything was done in public I can say this as all the opponents praised Egon).
Egon is, of course, a major figure in the Blue Obelisk and an uncompromising defender of Open Source and Open Data. So it was no surprise that several members of the Blue Obelisk (including Christoph Steinbeck and Stefan Kuhn) met at the reception:
And, here (map) was the dinner in the evening:

We took time to discuss current status and future plans. We now have a good range of software and it’s increasing steadily. The main obstacle is data and we talked about ways of liberating it from the literature using data- and text-mining approaches for compounds and spectra such as NMR and IR. These are increasingly well reported technically so that it is feasible to create a repository of data from the literature. This mirrors what we have done in CrystalEye…


The geographic spread of (Open) crystallography

Andrew Walkingshaw has made an impressive movie on The geographic spread of crystallography
I’d hoped to present this at OR08 in my plenary but the Mac movie technology defeated me/Jim/Vista. I think Jim Downing managed to show it later.
What the movie shows is every Open crystallographic publication over the last 7 years mashed up with the geographic location of the work. I’ll leave you to pick up the main message from the movie – it’s very clear.
By Open publication I mean:

  • any Open supplemental data (even attached to closed publications).
  • any data contributed to the Crystallography Open Database.
  • any material extracted from institutional or departmental repositories (the eCrystals federation should create some of this).

So it’s completely automatic to translate the aggregated crystallography into CML (JUMBO) and thence into RDF (Andrew). Andrew then extracts geolocations from the authors’ addresses and mashes them into KML for Google display.
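For those curious about the last step, here is a minimal sketch of turning geocoded author addresses into KML placemarks that Google Earth/Maps can display. This is not Andrew’s code; the records (names and coordinates) are invented for illustration:

```python
# Sketch: emit KML placemarks from (institution, latitude, longitude) records.
# Not Andrew's pipeline - the records below are invented for illustration.
from xml.sax.saxutils import escape

records = [
    ("Hypothetical Chemistry Dept, Cambridge", 52.20, 0.12),
    ("Hypothetical Institute of Crystallography, Paris", 48.86, 2.35),
]

placemarks = "\n".join(
    "  <Placemark>\n"
    f"    <name>{escape(name)}</name>\n"
    f"    <Point><coordinates>{lon},{lat},0</coordinates></Point>\n"
    "  </Placemark>"
    for name, lat, lon in records
)

kml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
    f"<Document>\n{placemarks}\n</Document>\n</kml>\n"
)

with open("crystallography.kml", "w", encoding="utf-8") as f:
    f.write(kml)   # open this file in Google Earth to see the placemarks
```

Note that KML wants coordinates as longitude,latitude – an easy mistake to make.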
Andrew’s done a great job, and it’s not detracting from it to say it was done in days, not months (as would have been required 2-3 years ago).
Could we do the same for chemistry rather than crystallography? Yes – as the authors’ addresses are on the abstracts. And our robots can download and mine the abstracts, can’t they?
Or can they? I am now less clear than ever about what I can legally do. Thank you, publishers.


Open Combinatorial Chemistry in the Undergraduate Laboratory?

In a recent post with several valuable comments – Open Science in the Undergraduate Laboratory: Could this be the success story we’re looking for? – Cameron Neylon brings together ideas on how undergraduates could do chemistry in parallel to explore a wide range of compounds which could be screened against biological targets relevant to malaria. I don’t know how many chemistry undergraduates there are worldwide – with India and China I’ll guess 1 million. Jean-Claude has publicised a well-known reaction – the Ugi – which bolts together three different groups (call them X, Y, Z). You can buy 100 different variants of X (X1–X100) and the same for Y (Y1–Y100) and Z (Z1–Z100). So you can in principle make 1 million compounds (100*100*100). Each undergraduate would do a slightly different reaction, purify the compound and record its spectra. The records would all be Open.
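The combinatorics are easy to check with a few lines (the component labels are placeholders, not real reagents):

```python
# Sketch of the combinatorial arithmetic: 100 variants of each of three
# components X, Y, Z give 100 * 100 * 100 = 1,000,000 possible products.
# The labels are placeholders, not real reagents.
from itertools import islice, product

X = [f"X{i}" for i in range(1, 101)]
Y = [f"Y{i}" for i in range(1, 101)]
Z = [f"Z{i}" for i in range(1, 101)]

print(len(X) * len(Y) * len(Z))        # 1000000 possible X+Y+Z combinations

# Each student could be assigned one (X, Y, Z) triple; show the first five.
for triple in islice(product(X, Y, Z), 5):
    print(triple)
```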
They have also suggested we can do theoretical calculations on these. With a modern machine this takes a few hours to get very good accuracy, and the calculations could be run on the students’ own machines. We have developed CML-based technology, as in CrystalEye, which can represent this in semantic fashion. The whole collection should take about a terabyte – not challenging by today’s standards. And it could all be stored in PubChem, ChemSpider, Amazon or even an institutional repository.
Of course not all reactions will work, and some of the undergraduates will make mistakes (that’s how education works!).  But it’s a great vision. The main problem is that most chemists are very conservative and undergraduates do the same experiment year after year. This would take some effort…
… but it would be worth it.
