#pantonfellow update; making videos is fun

We are currently processing the applications for our Panton Fellowships (sponsored by CCIA – http://en.wikipedia.org/wiki/Computer_%26_Communications_Industry_Association – to whom many thanks). On Thursday Michelle Brook and I worked on this at C4CC – the OKFN’s London hangout.


Firstly many thanks to everyone who applied (and to those who spread the word). Obviously we won’t give any personal details at this stage. This year we expanded the eligibility to “Europe” and we’ve had applications from north, east, south and west, which is very gratifying.

It’s a pleasure and privilege to read the applications (and I’ve read them thoroughly). People are doing exciting things and having exciting ideas. We appreciate the time you’ve put in.

The Panton Advisory Board (which does the analysis and selection) includes Tim Hubbard, Cameron Neylon, Rufus Pollock, John Wilbanks and myself, and is serviced by Michelle. The remit of the Fellowships is broad – it should be based on the ideas of Panton (Open Science Data) and should further those aims, but we are looking for applicants to come up with new ideas – and we aren’t afraid of risk. It’s important that the Fellows engage with a community and aim to make an impact, but we don’t constrain how. Although it’s absolutely not required, it’s gratifying to see applicants who have already come up with ideas and tried them.

It’s very difficult to judge people from a CV and a paper proposal, so after an initial selection we ask a shortlist of applicants to prepare a short video or similar as a complementary way of showing themselves and their ideas. This isn’t judged on technical quality but on how the person and the idea come over (i.e. you don’t have to hire a specialist – phones and PCs have reasonable cameras). If the viewer can’t read the detail, then it shouldn’t be there!

And I wouldn’t ask anyone to do something I wouldn’t do myself. Two years ago I applied for a Shuttleworth Foundation fellowship, which required a 4.5-minute video. The timing caught me in the wilds of Washington State (PNNL) without a video camera. I re-used some existing footage and got enormous help from Jenny Molloy. She used the facilities of Oxford Computing Services – each night she would edit the latest version and send it to me (9 hours out of phase). So there was only one turnaround a day – I would make some written comments and we might have a short Skype call. I think it actually turned out quite well.

When I got back I found I had been selected for the next stage and had to make a different video. I couldn’t impose on Jenny, so I found out how to do it on a PC. And it was nowhere near as difficult as I thought. I used “Windows Live Movie Maker” as an editor (it came bundled with the machine). I’m not promoting this product but it worked for me in a rush. I could shoot, edit and release as WMV files (these cannot be read everywhere, and I have since moved to using the FLOSS HandBrake for converting to better, more compact, more universal formats such as MP4).

I didn’t get a Fellowship, but the experience was really valuable. It sharpened my ideas and my technology hugely. Most competitive grant applications are unsuccessful – this is due to competition and not necessarily because they are flawed. So if you are a current Panton applicant and aren’t selected, don’t regard your application as necessarily bad. And perhaps look for somewhere different to apply to with a (possibly modified) proposal.

With tools such as Doodle, Google Docs and Skype it’s easy to get synchronous and asynchronous communication over a few days. At least two and possibly more Board members are out of the country, so we have to aim for restricted hours for Skype – but we’ve done this before and it’s universal in OKFN activities.

Thanks again to all applicants.


#openscience in Oxford

I/we had a great evening on Wednesday at the Open Science meeting run by Jenny Molloy and colleagues (/pmr/2013/07/24/hack4ac-content-mining-and-open-science-in-oxford/ ). I was leading the session on “content mining” and we had about 12 attendees including bioscientists, librarians, physicists, informaticians, etc. It was very informal: we started by talking about our own interests and then I gave some demos and an introduction to content mining.

Jojo Scobie @paraphyso took this picture of Chuff @okfn_okapi in the pub

I was delighted to see the interest and involvement of the group in phylogenetics. At least half could be described as having a significant interest or practice in the area. So we were able to look in depth at the sort of science published and to explore the issues, both technical and organizational. And they were forgiving of my ignorance and spent a long time educating me!

I’ve discussed much of the background before, but in essence Ross Mounce and I will be extracting data from PDF publications and systematically publishing it. We looked at the things we would like to extract. There was an important discussion on whether extracting the single tree from a paper was valuable – authors should publish a much fuller set of data, so one tree isn’t always a good representation of the result. But we generally agreed it was a lot better than zero.

We discussed the value of indexing the literature by species and here there was great agreement – if the scientific literature were indexed by species (and possibly geodata and dates as well) that would be really valuable – and it’s technically about the simplest example of high-quality content-mining. A naive first pass is sketched below.
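As a toy illustration (my own sketch, not an existing tool), the crudest species indexer is just a regular expression for Latin binomials. A real pipeline would check candidates against a taxonomic dictionary, because this pattern also matches ordinary capitalised phrases:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpeciesFinder {
    // Naive binomial pattern: capitalised genus followed by a lowercase epithet.
    // Hypothetical illustration only - it will also match phrases like "The results".
    private static final Pattern BINOMIAL = Pattern.compile("\\b([A-Z][a-z]+)\\s([a-z]{3,})\\b");

    public static void main(String[] args) {
        String text = "We observed Okapia johnstoni and Corvus corax near the river.";
        Matcher m = BINOMIAL.matcher(text);
        while (m.find()) {
            System.out.println(m.group(1) + " " + m.group(2)); // candidate species name
        }
    }
}
```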

Our species are in danger – see http://www.zsl.org/conservation/regions/africa/okapi/ which reports “The workshop highlighted that the okapi is faring worse than scientists previously thought”. And there are 100+ species on the critically endangered list. Finding all the published information on a species is an essential (but not sufficient) activity, and it should be possible for anyone anywhere in the world to get all the peer-reviewed and grey literature on a species. Content-mining is a necessary approach.

There is obviously a critical mass of interest and expertise in Oxford – supported by Jenny’s tireless efforts. We proposed we should have a hackathon on “species” – we could make a lot of progress.


Making images Open can and should be routine

One of the many serious problems in re-using scientific data is that it often occurs in diagrams. Here’s a simple example taken from http://www.biomedcentral.com/1471-2180/11/174/ (BMC is an Open Access, CC-BY publisher so there’s no problem with anything in this blog post …)

These are diagrams which simply record x-y data pairs and an estimate of the relation between them (lines and curves). A reader might very well want to extract the data and reanalyse them. Or simply post them, perhaps to applaud them or to criticize them. After all, that’s science.

But images are copyright, aren’t they? A list of numbers is data, but an image? Might I get sued if I try to re-use the diagram without permission? Well, that’s what happened to Shelley Batts, when she did just that – she got a legal letter from Wiley (http://boingboing.net/2007/04/26/wiley-threatens-scie.html ). After huge blogosphere reaction, Wiley retracted the threat – to Shelley – but they have never said they won’t do it to someone else.

And there’s a good reason why they won’t: “The publishers own the copyright”. And they can make money by reselling images. Many publishers will charge to include an image in a publication (over 50 USD – see Springergate /pmr/2012/06/13/springergate-springer-replies/ where Springer claimed copyright on all images in their journals). The immediate effect of this is that authors can’t afford to re-use images, and so science suffers drastically.

Note, of course, that the publisher normally makes ZERO contribution to the creation of the image. (They may reject it for technical or commercial reasons – too “difficult”, or it doesn’t fit their publishing workflow.) But they “own” it.

Do scientists want this? If you are an author do you think:

“I want the publisher to make money from my images and prevent other people reusing them for legitimate purposes.”

If you do, stop reading…

But if you would like your images to be free for reuse, here’s a very simple thing. I built it on a 45-minute train journey to hack4ac – it can be improved easily. You just run a one-page Java program that merges your image with a small icon indicating that the image is free to re-use. It takes less than a second. It could be run as Software-as-a-Service (e.g. hosted in the cloud). Here are two examples of adding a simple tag (the first is deliberately large so you can see it).

Here the legend includes the authorship (this is trivial to customize).
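For anyone who wants to try, here is a minimal re-creation of the idea in Java (a sketch, not the actual hack4ac code; the three file paths – figure, icon, output – are assumed to come in as command-line arguments):

```java
import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;

public class LicenceStamper {
    // Hypothetical usage: java LicenceStamper figure.png cc0-icon.png stamped.png
    public static void main(String[] args) throws Exception {
        BufferedImage figure = ImageIO.read(new File(args[0]));
        BufferedImage icon = ImageIO.read(new File(args[1]));
        Graphics2D g = figure.createGraphics();
        // Draw the licence icon in the bottom-right corner with a small margin
        g.drawImage(icon, figure.getWidth() - icon.getWidth() - 5,
                figure.getHeight() - icon.getHeight() - 5, null);
        g.dispose();
        ImageIO.write(figure, "png", new File(args[2]));
    }
}
```

That really is all it takes – which is why it could trivially be a default in image-producing software.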

The important point is that it is immediately clear that the image is free for re-use. (We could use CC-BY, but CC0 is probably more suitable.) Note that nothing the publisher does, nothing you sign, takes away this right. The image carries its own permanent rights statement.

And moreover it’s trivially obvious to all readers. It spreads the word.

It will save millions (literally) in time and enable effortless re-use.

A simple and effective way for this to be encouraged and implemented would be for all image-producing software (e.g. ImageJ, phylogenetic tree software) to offer it as the default. If you WANT to give the publisher exclusive rights to resell and restrict your work, you can switch the default off.

And, of course, some legacy publishers might even welcome it. (Stop fantasizing, PMR!)


Hack4ac, Content-mining and open-science in Oxford

Jenny Molloy has invited me to introduce a session in Oxford this evening (monthly meetings held at the Oxford e-Research Centre, 7 Keble Road, 19:00-20:30 – see http://science.okfn.org/community/local-groups/oxford-open-science/ ) on Content Mining. It will be very informal – anyone can come and play.

The basic theme will be that:

  • Content mining is now routinely possible
  • YOU can do it
  • Your involvement will be a massive help

The idea is to build a community and collect/create community tools. There’s been a massive step forward with Jailbreaking the PDF (http://www.duraspace.org/jailbreaking-pdf-hackathon ) and I’ve blogged some of this (/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/ ). We also had a tremendous hackday in London (http://hack4ac.com/ ) (see Ross Mounce’s blog http://rossmounce.co.uk/2013/07/09/hack4ac-recap/ ). This was run by new-generation publishers (Ian Mulvany (eLife), Jason Hoyt (PeerJ)) and looked at how hacking can change scholarly publishing. We came up with several ideas (I’ll blog my own soon) and Ross proposed figures2Data – what can we extract from the *figures* in the literature (not just the text)?

We got a critical mass of 4-5 people and a great reservoir of knowledge and ideas. We made fantastic progress. This is a very difficult subject and I had assumed we wouldn’t manage much. However, we found communal resources showing it can be done relatively simply using classical Computer Vision (image analysis) such as Hough transforms and character recognition. I’d known that I’d have to hack the latter some time and was dreading it, but we found that Tesseract (http://en.wikipedia.org/wiki/Tesseract_%28software%29 ) would provide a huge amount out-of-the-box. This is a great example of stepping back from the problem and letting in fresh light. I’m now pretty confident that we can manage to hack a wide range of scientific diagrams. Here’s a diagram:

And here is what Thomas managed to extract (in an hour or two):


The OpenCV suite (http://en.wikipedia.org/wiki/OpenCV ) is tremendous and so powerful that navigation is a problem, but Thomas has found all the key components. I am confident that Tesseract will recognize the characters, and so we are well on the way to extracting all the information from this.
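To give a flavour of how little code the first step takes, here is a sketch using the OpenCV Java bindings (my illustration, not Thomas’s code; class names follow the current bindings – the 2.4-era ones used Highgui for image I/O – and the threshold parameters are guesses to tune per diagram):

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class AxisFinder {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME); // needs the OpenCV native library on the path
        Mat src = Imgcodecs.imread(args[0], Imgcodecs.IMREAD_GRAYSCALE);
        Mat edges = new Mat();
        Imgproc.Canny(src, edges, 50.0, 150.0); // edge detection
        Mat lines = new Mat();
        // Probabilistic Hough transform: finds straight segments (axes, ticks, tree branches)
        Imgproc.HoughLinesP(edges, lines, 1.0, Math.PI / 180, 80, 50.0, 5.0);
        for (int i = 0; i < lines.rows(); i++) {
            double[] l = lines.get(i, 0); // x1, y1, x2, y2
            System.out.printf("segment (%.0f,%.0f) -> (%.0f,%.0f)%n", l[0], l[1], l[2], l[3]);
        }
    }
}
```

The segments then have to be interpreted (which is the axis? which are data?), and Tesseract supplies the labels.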

As readers know, Ross and I are working on phylogenetic trees (hypothesising the formation of species). The great thing about this subject is that anyone interested in science should be able to understand the basic concepts. It’s particularly useful for conservationists, biodiversity etc. So we are seeing what can be extracted from papers on this subject. I’m told there may be some people tonight specifically interested in this. This is a great area in which to start practical open-science.

The main problem, alas, is that most publishers have put in place legal restrictions to stop us doing this. There is no scientific reason for this: it’s to “protect their commercial interests”. So Jenny, Diane Cabell and I reviewed this for the OpenForum Academy (http://www.openforumacademy.org/library/ofa-fellows-reference-library p57). There we put forward a manifesto for content-mining under the mantra:

The right to read is the right to mine.

Simply: if you have paid to be able to read science, your machines should also have the right to read it.

But no – the publishers are fighting this and trying to licence the “privilege”. It’s being debated in Europe (Licences 4 Europe) and Ross presented a superb set of slides (http://www.slideshare.net/rossmounce/content-mining ) which will give you a good indication of the technology and of the restrictive practices.

So I’m hoping that tonight will see an expansion of our content-mining community!


Wikipedia raises the awareness and need for #openaccess

I was alerted today to a Wikipedia initiative (http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness ) by Daniel Mietchen (the primary editor of this page, though all WP pages belong to the world). I think it could have enormous impact in and for #openaccess.

I shall blog about Open Access shortly but I’ll comment (and probably get attacked for it) that effectively the only people who know about Open Access are:

  • Universities and their staff (current and recent)
  • Scholarly publishing houses
  • Funders (research councils, Trusts)
  • Policy makers (governments and civil services)
  • People who have left one of these in the last five years

Beyond that I suspect that Open Access is unknown as a term and unknown as an issue in the wider population, whether in the rich West or elsewhere. Does your neighbour know what Open Access is? Or your parents? I’m guessing not. Open Access has (AFAIK) almost zero impact in the wider Internet population (please, please prove me wrong!). The average Net user will come across closed access as a paywall, but they won’t know it by that name – they’ll simply be offered the opportunity to pay 40 USD for 1 day’s reading – they’ll compare this with Amazon, eBooks, etc. and move on.

So where do they look for organized information?

Wikipedia. Everyone has heard of Wikipedia, haven’t they? And even if they have only heard of Google, the WP entry is usually in the top 2-3 of non-sponsored links.

I have just finished using WP to identify http://en.wikipedia.org/wiki/Diascia_%28plant%29 which we got as a present. And it’s fascinating – Diascia coevolved with its pollinators, so there should be some wonderful phylogenetics in the literature. There is! http://phylodiversity.net/dtank/Tank_Lab/Publications_files/Aust.%20Syst.%20Bot.%202006%20Tank.pdf . Quite by chance it’s a CSIRO publication (with whom I was in touch when in Melbourne).

El Grafo / CC-BY-SA-3.0 (via Wikimedia Commons).

I might want to add this article as a reference in the WP article. I would expect that many amateur gardeners could appreciate this article – there are no hairy concepts (after all *I* can understand it, and I’m a chemist).

But can I? Is it Open Access? *I don’t know*. It’s on a web page from one of the authors. Were they allowed to post it as “Green”? I have no idea. Is it permanent? Could the publishers force them to take it down? Might it decay? I don’t know.

And nor do the readers.

Daniel writes:

This page is about how Wikipedia pages could signal to readers whether a particular reference is open access or not. The main purpose of such signalling would be to spare them the disappointment of clicking through to the resource only to find out that they do not have access rights to read it. The scheme is also useful for Wikipedia editors who can see at a glance whether a given reference would be licensed in a way that allows for the images, media or even text to be reused in Wikipedia articles.

Exactly. The key words are “disappointment” and “reused”. If I click through to Tank’s paper I find some pictures – these could be very useful for me to re-use. And many phylogenetic trees. These could also be very useful. But can I re-use them?

Daniel’s idea is for WP contributors to label all the references to articles as follows:

  • Behind a paywall
  • Freely available but not free to re-use (as in Tank)
  • Certified as BOAI-compliant (e.g. with a CC-BY licence)

He notes that the padlock icon above was developed by PLoS to denote BOAI-compliance but is increasingly being used to mean simply free-to-read. He suggests, as I would, CC-BY icons.

So why am I so enthusiastic?

Because WP readers who try to use a reference will immediately be alerted to the issue. And it will be explained in very clear terms. So we shall rapidly increase the number of people, outside the self-interested ivory towers, who are aware of the #openaccess issue and of the injustice of making a business of forbidding access to information and knowledge.


Hack4ac

I’m going to http://hack4ac.com/ (Hacking academia better together) tomorrow in London. From the site:

Why?

We have two goals

  • Demonstrate the value of the CC-BY licence within academia. We are interested in supporting innovations around and on top of the literature.

PMR: Yes – this is critical. Only documents with CC-BY or CC0 can be fully and legitimately re-used. (All attempts to convince you that CC-NC or “viewable by humans” is enough are misguided, misinformed or deceitful.) This is the digital age, where we need to work with machines to increase our powers by an order of magnitude. CC-BY means better science, more community involvement, greater downstream value and much more.

  • Reach out to academics who are keen to learn or improve their programming skills to better their research. We’re especially interested in academics who have never coded before.

PMR: Yes. Coding is relatively unimportant. What matters is knowing what to do, finding material, organising it, evangelising and creating teams. You probably need to find a coder and feed them some pizza, but the other aspects are the important ones.

What do you mean by focussing on CC-BY licence within academia?

The CC-BY licence is the Creative Commons licence that allows for downstream remixing of the original work, so long as the original author is credited. It puts no other restrictions on what you can do with that work. The recent Research Councils UK Policy on Open Access prefers all new work that they fund to be published under this licence.

The hope is that by having access to remix and reinvent the scholarly literature we can create better tools on top of that literature. This hackday will explore ideas around what one can do with this kind of material.

PMR: Yes, Yes.

OK, what should I build?

We are just starting to gather ideas now, but how about a tool to help unlock all of the great material in institutional repositories? How about a tool to re-imagine what a journal article looks like? What about a tool to help gather real time metrics on topics of interest? What about a tool to data mine all of the CC-BY literature for trends? What about a tool to help identify whether a paper is CC-BY in the first place? There are many many exciting ideas, and we want to hear yours!

PMR: Yes. Hacking is often a question of finding a number of tools and glueing them together. (Cost: ca 2 pizzas)
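As an example of how small such a tool can start, here is a sketch of the “is this paper CC-BY?” idea from the list above (my own illustration using the modern java.net.http client; a real checker would parse the licence metadata rather than grep the HTML, since a CC link in the page is a strong but not perfect signal):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LicenceSniffer {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).build();
        String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // CC licence deeds live at well-known URLs, so look for a link to the BY deed
        boolean ccBy = html.contains("creativecommons.org/licenses/by/");
        System.out.println(ccBy ? "looks CC-BY" : "no CC-BY link found");
    }
}
```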

Who?

Jason (PeerJ) asked Ian (eLife), and he said yes.

PeerJ has created Charlie the blue monkey, so I’ll be bringing cardboard Charlie:

(When are we going to have a proper Charlie?)

How?

You don’t need to wait until the day of the event to get started, in fact you can start pitching your ideas now. We’ve created an ideas pitch page on the wiki for anyone to list their idea and what type of skills are needed to make it happen.

What happens on the day of the event?

On the actual day, anyone with an idea will have 90 seconds to pitch. If you don’t have an idea then that is fine too! Just look for a team that you want to join.

PMR: I’ll pitch something.

PMR: There are still a few tickets left – grab them now. It will be fun.


UPDATE: coast2coast, hack4ac, CICM, and Open

I very occasionally blog personal matters – I’ve been offline (literally) for ca 2-3 weeks doing the Coast2Coast walk (192 miles) across the top of England (the middle of Britain) and had a wonderful time for 16 full days. I was relatively unfamiliar with the Lake District and was able to do several of the additional routes, including the High Stile ridge and Helvellyn/Striding Edge.

We’ve blogged our exploits (https://oldgitswalking.wordpress.com/ ) – many of us are scientists and interested in digital freedom so we’ve had many in-depth conversations. The route is very popular with visitors to Britain so if any readers want to see a very exciting and varied section of the country it’s really worth doing (it was on Australian TV so there were a lot of Australians). There is NO climbing involved.

One of the guest houses:

Britain’s industrial revolution (10% of the world’s lead was extracted here and exported by steam train)

The 900-year old Richmond castle

The reformation (dissolution of the monasteries)

Botany and ornithology:

Dactylorhiza fuchsii (Common spotted orchid) and other spp.

Corvus corax (Raven) at 54° 31′ 38.03″ N, 3° 0′ 57.79″ W, 950 m


So back to the future:

  • Tomorrow we have an exciting day at “hack4ac” http://hack4ac.eventbrite.co.uk/ – an Open-access inspired hackathon to see how fully Open (CC-BY) papers can be re-used.
  • On Monday Dave M-R and I go to CICM 2013 (http://www.cicm-conference.org/2013/cicm.php – Conferences on Intelligent Computer Mathematics), where we are presenting our Declaratron (a semantic approach to re-usable, reproducible computation in science). More later.
  • And I will be thinking hard about the latest discussions of Open Access (cf. Mike Taylor).


Mike Taylor’s brilliant analysis of #openaccess

I have been off-net for some time but yesterday read Mike Taylor’s interview (poynder.blogspot.fi/2013/07/open-access-where-are-we-what-still.html ) with Richard Poynder on #openaccess. I agree with everything Mike says and it summarises (part of) my position almost exactly. It needs augmenting/annotating and I shall do that in a few days. Read Mike’s post – here I will just extract a few key thoughts (emphasized):

Those publishers are not our partners, they’re our exploiters. We don’t need to negotiate with them; we don’t even need to fight them. We just need to walk away.

PMR: the last 2-3 years have shown this absolutely. Those with political or spending power must abandon legacy publishers.

The term “open access” was given a perfectly good definition by the Budapest Open Access Initiative back when it was first coined: “free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose“. Immeasurable confusion has resulted from people proposing alternatives – either through ignorance or malice. Let’s stick with the original and best meaning of the term.

PMR: Exactly so. The problem has been that many “OA advocates” have failed to follow these. We now need a simple set of actions consequent from these principles. It’s amazing that this has not been done.

Open is so much more important than Green or Gold. [But when we come to the current RCUK policy on Open Access, the specific conception of Green OA that it requires is badly degraded, to the point where it’s not really open access at all. Green articles in the RCUK sense can be encumbered by non-commercial clauses, stripping them of much of their value to the taxpayer, and can be delayed by embargoes of up to two years – a truly disgraceful state of affairs given that the old RCUK policy only allowed six months.]

PMR: Exactly. From “Open” (as in the Open Definition) everything follows. Green and Gold are effectively a meaningless hodgepodge of terms and ideas.

So, no, I am not a fan of hybrid!

PMR: Nor am I.

I think it’s only gradually dawned on me just how many different ways the traditional academic publication game was broken – not just by publishers, but by administrators consistently rewarding the wrong things, and by researchers in every field and at every career stage finding special-pleading reasons why they can’t be expected to be the ones who break free of the system.

PMR: Exactly. It’s incredible that almost no head of a University has made a useful contribution to this field.

Still, there’s no question that we’re much further forward than even a short time ago, and we have a lot of momentum in mostly the right direction. …

PMR: I agree about the momentum. But we are effectively leaderless and policy-less. Without those there will be little planned progress. Disjointed initiatives (the current approach) will lead to more mess before it gets better.

The best numbers I have suggest that OA is going to cost us about 9% of what we’re currently paying in subscriptions.

PMR: Yes. It is still amazing that Universities still adopt the policy of asking the publishers how much they want to be paid and then trying to chip a few percent off. It’s OUR money they are pouring into publishers.

OA is cheaper, but that’s not why it matters. What counts is not that it has lower cost, but that it has higher value. The real cost in all this is the opportunity cost of not having universal open access.

PMR: The opportunity cost of the current mess probably runs into hundreds of billions. That’s the value that Universities have failed to deliver to the world.

PMR: Quite simply, Mike is the best and most coherent advocate for Open Scholarship that we have.

PMR: We now need a plan. I’ll be throwing out some thoughts in a few days.


Jailbreaking the PDF – 4; Making text from characters

In previous posts I have shown how we can, in most cases, create a set of Unicode characters from a PDF. If the original authors (e.g. many Government documents) were standards-compliant this is almost trivial. For scholarly publications, where the taxpayer/student pays 5000 USD per paper, the publishers refuse to use standards. So we have to use heuristics on this awful mess. (I have not yet found a scholarly publisher which is compliant and makes a syntactically manageable PDF – we pay them and they corrupt the information.) But we have enough experience that for a given publisher we are correct 99–99.999% of the time (depending on the discipline – maths is harder than narrative text).

So now we have pages, and on each page we have an UNORDERED list of characters. (We cannot rely on the order in which characters are transmitted – I spent two “wasted” months trying to use sequences and character groupings.) We have to reconstruct text from the following STANDARD information for each character:

  • Its XY coordinates (raw PDF uses complex coordinates, PDFBox normalises to the page (0-600, 0-800))
  • Its FontFamily (e.g. Helvetica). This is because semantics are often conveyed by fonts – monospace implies code or data. (I shall upset typographical purists, as I should use “typeface” (http://en.wikipedia.org/wiki/Typeface ) and not “font” or “font family”. But “FontFamily” is universal in PDF and computer terminology.)
  • Its colour. This can be moderately complex – a character has an outline (stroke) and body (fill) and there are alpha overlays, transparency, etc. But most of the time it’s black.
  • Its font Weight: Normal or Bold. It’s complicated when publishers use fonts like MediumBold (greyish).
  • Its Size. The size is the actual font-size in pixels and not necessarily the points as in http://en.wikipedia.org/wiki/Point_%28typography%29 .

    Characters in the same font have different extents because of ascenders and descenders:

  • Its width. Monospaced fonts (http://en.wikipedia.org/wiki/Monospaced_font ) have equal width for all characters:

    Note that “I” and “m” have the same width. Any deliberate spaces also have the same width. That makes it easy to create words. The example above would have words “Aa”, “Ee”, “Qd”. (A word here is better described as a space-separated token, but “word” is simpler. It doesn’t mean it makes linguistic or numeric sense.)

    If the font is not monospaced then we need to know the width. Here’s a proportional font (http://en.wikipedia.org/wiki/Typeface#Proportion ):

    See how the “P” is twice as wide as the “I” or “l” in the proportional font. We MUST know the width to work out whether there is a space after it. Because there are NO SPACES in PDFs.

  • Its style. Conflated with “slope”. Most scientists simply think “italic” (as in Java). But we find “oblique” and “underline” and many others. We need to project these to “italic” and “underline” as these have semantics.

Note that Normal|Bold, Normal|Italic and Normal|Underline can be multiplied to give 8 variants. Conformant PDF makes this easy – PDFBox has an API which includes:

  • public float getItalicAngle()
  • public float getUnderlineThickness()
  • public static boolean isBold(Font font)


If we have all this information then it isn’t too difficult to reconstruct:

  • words
  • Weight of words (bold)
  • Style of word (italic or underline)

Which already takes us a long way.
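To make the reconstruction concrete, here is a minimal sketch (my illustration, not the actual #AMI2 code; names are illustrative) of the core trick: sort the unordered characters into lines by Y, then along each line by X, and emit a word break wherever the gap after a character exceeds some fraction of its width:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A character extracted from a PDF with its coordinates and width. */
class Glyph {
    final char c; final double x, y, width;
    Glyph(char c, double x, double y, double width) { this.c = c; this.x = x; this.y = y; this.width = width; }
}

public class WordBuilder {
    /** Group glyphs into lines by Y (within a tolerance), sort by X, split words on large gaps. */
    static List<String> words(List<Glyph> glyphs, double yTolerance, double gapFactor) {
        glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.y).thenComparingDouble(g -> g.x));
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean newLine = prev != null && Math.abs(g.y - prev.y) > yTolerance;
            boolean gap = prev != null && !newLine && g.x - (prev.x + prev.width) > gapFactor * prev.width;
            if ((newLine || gap) && word.length() > 0) { out.add(word.toString()); word.setLength(0); }
            word.append(g.c);
            prev = g;
        }
        if (word.length() > 0) out.add(word.toString());
        return out;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>(List.of(
                new Glyph('H', 0, 0, 7), new Glyph('i', 7, 0, 3),
                new Glyph('P', 20, 0, 7), new Glyph('M', 27, 0, 9), new Glyph('R', 36, 0, 7)));
        System.out.println(words(glyphs, 2.0, 0.3)); // prints [Hi, PMR]
    }
}
```

The real heuristics are messier (baseline jitter, sub/superscripts, kerning), but this is the essence of why character width matters so much.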

Do scholarly publishers use this standard?

NO

(You probably guessed this.) For example, I cannot get the character width out of eLife, the new Wellcome/MPI/HHMI journal. This seems to be because eLife hasn’t implemented the standard. They launched in 2012. There is no excuse for a modern publisher not being standards-compliant.

So the last posts have shown non-compliance in eLife, PeerJ and BMC. Oh, and PLoS ONE also uses opaque fontFamilies (e.g. AdvP49811). So the Open Access publishers all use non-standard fonts.

Do you assume that because closed access publishers charge more, they do better?

I can’t answer that because they have more money to pay lawyers.

I’ll let you guess. Since #AMI2 is Open Source you can do it yourself.


“Licences4Europe” has not accepted “The Right to Read is the Right to Mine”

One sentence summary (this link has all the documentation)

Stakeholders representing the research sector, SMEs and open access publishers withdraw from Licences for Europe


I have formally been a member of EC-L4E-WG4, a working group of the European Commission concentrating on Text and Data Mining (TDM, though I prefer “Content Mining”). I haven’t attended meetings (due to date clashes) but Ross Mounce has stood in for me and given brilliant presentations. The initial idea of the WG was to facilitate TDM as an added value to conventional publications and other sources. (The current problem is that copyright can be interpreted as forbidding TDM.) When I and others joined this effort it was on the assumption that we would be looking for positive ways forward to encourage TDM.

When I buy a book I can do what I like with it. I can write in it (marginalia: http://en.wikipedia.org/wiki/Marginalia ). I can cut it up into bits. I can give/sell the book to someone else. I can give/sell the cut-out bits to someone else. I can stick the cut-out bits into a new book. I can transcribe the factual content. I can do almost anything other than copy non-facts.

With scholarly articles I can’t do any of this. I cannot own an article; I can only rent it. (Appalling concession #1 by Universities went completely unnoticed – I shall blog more.) I cannot extract facts from it. (Even more appalling concession #2 by Universities went completely unnoticed – I shall blog more.) So the publishers have dictated to Universities that we cannot do anything with the 10,000,000,000 USD we give to the publishers each year.

The publishers are now proposing that if we want to use any of OUR content (which we have already paid for) we should pay the publishers MORE. That TDM is an “added service” provided by publishers. It’s not. I can TDM without any help from the publishers. The only thing the publishers are doing is holding us to ransom.

If you don’t feel this is unjust and counterproductive stop reading. Back to “Licences for Europe”…

The L4E group has had no chance to set the group’s assumptions. From the outset the chair has insisted that this group is “L4E” – licences for Europe. The default premise is that document producers can and should add additional restrictions through licences. In short, we have fought this publicly and the chair has failed to listen to us, let alone consider our arguments. Who are we?

  • The Association of European Research Libraries (LIBER)
  • The Coalition for a Digital Economy
  • European Bureau of Library Information and Documentation Associations (EBLIDA)
  • The Open Knowledge Foundation
  • Communia
  • Ubiquity Press Ltd.
  • Trans‐Atlantic Consumer Dialogue
  • National Centre for Text Mining, University of Manchester
  • European Network for Copyright in support of Education and Science (ENCES)
  • Jisc

Not a lightweight list. Here’s the formal history:

We welcomed the orientation debate by the Commission in December 2012 and the subsequent commitment to adapt the copyright framework to the digital age. We believe that any meaningful engagement on the legal framework within which data driven innovation exists must, as a point of centrality, address the issue of limitations and exceptions.

We wrote expressing our concerns (March 14) – some sentences (highlighting is mine):

10. Data driven innovation requires the lowest barriers possible to reusing content. Requiring the relicensing of copyright works one already has lawful access to, for a non-competing use, is entirely disproportionate, and raises strong ethical questions as it will affect what computer-based medical and scientific research can and cannot be undertaken in the EU.

11. A situation where each proposed TDM-based research or use of content, to which one already has lawful access, has to be submitted for approval is unscalable, and will raise barriers to research and reduce online innovation. It will slow medical discoveries and data driven innovation inexorably, and will only serve to drive jobs, research, health and wealth-creation elsewhere.

12. For the full potential of data driven innovation to become a reality, a limitation and exception that allows text and data mining for any purpose, which cannot be over-ridden by private contracts, is required in EU law.

13. Subject to point 3, we must be able to share the results of text and data mining with no hindrances irrespective of copyright laws or licensing terms to the contrary.

14. In the European information society, the right to read must be the right to mine.

(I am particularly pleased that my phrase “the right to read must be the right to mine” expresses our message succinctly.)

Unfortunately the response (http://www.libereurope.eu/sites/default/files/130316-researchers-reply-signed.pdf ) was anodyne and platitudinous (“win-win solutions for all stakeholders”). It became clear that this group could not make any useful progress and at worst would legitimize the interests of the “content owners”.

So we have withdrawn.

Having placed licensing as the central pillar of the discussion, the “Licences for Europe” Working Group has not made this focused evaluation possible. Instead, the dialogue on limitations and exceptions is only taking place through the refracted lens of licensing. This incorrectly presupposes that additional relicensing of already licensed content (i.e. double licensing) – and by implication also licensing of the open internet – is the solution to the rapid adoption of TDM technology.

Therefore, we can no longer participate in the “Licences for Europe” process. We maintain that a vibrant internet and a healthy scholarly publishing community need not be at odds with a modern copyright framework that also allows for the barrier-free extraction of facts and data. We have already expressed this view sufficiently well within the Working Group.

And we have concerns about transparency.

We would like to reiterate our request for transparency around the “Licences for Europe” dialogue and kindly request that the following actions be taken:

  • That the list of organisations participating in all of the “Licenses for Europe” Working Groups be made publicly available on the “Licences for Europe” website;
  • That the date of withdrawal for organisations leaving the process is also recorded on this list;
  • That it is made clear on any final documents that the outputs from the working group on TDM are not endorsed by our organisations and communities.


If you feel that we have a right to mine our information, then help us fight for it. Because inaction simply hands our rights to vested interests.
