Monthly Archives: November 2012

#ami2 #opencontentmining: AMI reports progress on #pdf2svg and #svgplus: the “standard” of STM publishing

AMI has been making steady progress on two parts of AMI2:

  • PDF2SVG. A converter of PDF to SVG, eliminating all PDF-specific information. This has gone smoothly (AMI does not understand "good", so "steady" means a monotonically increasing number of non-failing JUnit tests). AMI has also distributed the code, first on Bitbucket at:

    http://www.bitbucket.org/petermr/pdf2svg

    and then on the Jenkins continuous integration tool on the PMR group machine in Cambridge:

    http://hudson.ch.cam.ac.uk – see https://hudson.ch.cam.ac.uk/job/pdf2svg/

    [Note: Hudson was Open Source but it became closed, so the community forked it and Jenkins is the new Open branch.] Jenkins is very demanding. AMI starts by developing tests on Eclipse, then runs these on Maven, and then on Jenkins. Things that work on Eclipse often fail on Maven, and things that work on Maven can fail on Jenkins.

    AMI has also created an Issue Tracker: https://bitbucket.org/petermr/pdf2svg/issues?status=new&status=open Here humans write issues which matter to them – bugs, ideas, etc. PMR tells AMI what the issues are and translates them into AMI-tasks, often called TODO. PMR tells AMI he is pleased that there is feedback from outside the immediate group.

  • SVGPlus. This takes the raw output of PDF2SVG and turns it into domain-agnostic semantic content. Most of this has already been done so it is a question of refactoring. AMI requires JUnit tests to drive the development. SVGPlus has undergone a lot of refactoring (AMI notes changes of package structure, deletion of large chunks and addition of smaller bits). The number of tests increases so AMI regards that as "steady progress".

AMI now has a lot of experience with PDFs from STM publishers and elsewhere. AMI works fastest when there is a clear specification against which she can write tests. AMI works much slower when there are no standards. PMR has to tell her how to guess ("heuristics"). Here's their conversation over the last few weeks.

AMI: Please write me some tests for PDF2SVG.

PMR: I can't.

AMI: Please find the standard for PDF documents and create documents that conform.

PMR. I could do that but it's no use. Hardly any of the STM publishers conform to any PDF standards.

AMI. If the deviations from the standard are small we can add some small exceptions.

PMR. The deviation from the standard is enormous.

AMI. If you read some of the documents we can create a de facto standard and code against that. It will be several times slower.

PMR. That won't be useful. Every publisher does things differently.

AMI. How many publishers are there?

PMR. Perhaps 100.

AMI. Then it will take 100 times longer to write PDF2SVG. Please supply me with the documentation for each of the publishers' PDFs.

PMR. There is no documentation for any of them.

AMI. Then there is no systematic quality way that I can write code.

PMR. Agreed. Any conversion is likely to have errors.

AMI. We may be able to tabulate the error frequency.

PMR. We don't know what the correct output is.

AMI. Then we cannot estimate errors properly.

PMR. Agreed. Maybe we can get help from crowdsourcing.

AMI. I do not understand.

PMR. More people, creating more exams and tests.

AMI. I understand.

PMR. I will have to make it easy for them.

AMI. In which case we may be able to work faster. We may also be able to output partial solutions. Can we identify how the STM publishers deviate from the standard?

PMR. Let's try.

AMI. Wikipedia has http://en.wikipedia.org/wiki/Portable_Document_Format . Is that what we want?

PMR. Yes

AMI. Is the standard Open?

PMR. Yes, it's ISO 32000-1:2008.

AMI. [reads]

ISO 32000-1:2008 specifies a digital form for representing electronic documents to enable users to exchange and view electronic documents independent of the environment they were created in or the environment they are viewed or printed in. It is intended for the developer of software that creates PDF files (conforming writers), software that reads existing PDF files and interprets their contents for display and interaction (conforming readers) and PDF products that read and/or write PDF files for a variety of other purposes (conforming products).

AMI. Does it make it clear how to conform?

PMR. Yes. It's well written.

AMI. Is it free to download?

PMR. Yes (Adobe provide a copy on their website)

AMI. Are there any legal restrictions to implementing it? [AMI understands that some things can't be done for legal reasons like patents and copyright.]

PMR. Not that we need to worry about.

AMI. Do the publishers have enough money to read it? [AMI knows that money may matter.]

PMR. It is free.

AMI. So we can assume the publishers and their typesetters have read it? And tried to implement it.

PMR. We can assume nothing. Publishers don't communicate anything.

AMI. I will follow the overview in Wikipedia:

File structure

A PDF file consists primarily of objects, of which there are eight types:[32]

  • Boolean values, representing true or false
  • Numbers
  • Strings
  • Names
  • Arrays, ordered collections of objects
  • Dictionaries, collections of objects indexed by Names
  • Streams, usually containing large amounts of data
  • The null object

Do the PDFs conform to that?

PMR: They seem to since PDFBox generally reads them

AMI. Fonts are important:

Standard Type 1 Fonts (Standard 14 Fonts)

Fourteen typefaces—known as the standard 14 fonts—have a special significance in PDF documents:

These fonts are sometimes called the base fourteen fonts.[34] These fonts, or suitable substitute fonts with the same metrics, must always be available in all PDF readers and so need not be embedded in a PDF.[35] PDF viewers must know about the metrics of these fonts. Other fonts may be substituted if they are not embedded in a PDF.

AMI: If a PDF uses the 14 base fonts, then any PDF software must understand them, OK?

PMR. Yes. But the STM publishers don't use the 14 base fonts.

AMI. What fonts do they use?

PMR. There are zillions. We don't know anything about most of them.

AMI. Then how do I read them? Do they use Type1 Fonts?

PMR. Sometimes yes and sometimes no.

AMI. A Type1Font must have a FontDescriptor. The FontDescriptor will tell us the FontFamily, whether the font is italic, bold, symbol etc. That will solve many problems.

PMR. Many publishers don't use FontDescriptors.

AMI. Then they are not adhering to standard PDF.

PMR. Yes.

AMI. Then I can't help.

PMR. Maybe we can guess. Sometimes the FontName can be interpreted. For example "Helvetica-Bold" is a bold Helvetica font.

AMI. Is there a naming convention for Fonts? Can we write a regular expression?

PMR. No. Publishers do not use systematic names.

AMI. I have just found some publishers use some fonts without FontNames. I can't understand them.

PMR. Nor can anyone.

AMI. So the PDF renderer has to draw the glyph as there is no other information.

PMR. That's right.

AMI. Is there a table of glyphs in these fonts?

PMR. No. We have to guess.

AMI. It will take me about 100 times longer to develop and write a correct PDF2SVG for all the publishers.

PMR. No, you can never do it because you cannot predict what new non-standard features will be added.

AMI. I will do what you tell me.

PMR. We will guess that most fonts use a Unicode character set. We'll guess that there are a number of non-standard, non-documented character sets for the others – perhaps 50. We'll fill them in as we read documents.

AMI. I cannot guarantee the results.

PMR. You have already done a useful job. We have had some positive comments from the community.

AMI. I don't understand words like "cool" and "great job".

PMR. They mean "steady progress".

AMI. OK. Now I am moving to SVGPlus.

PMR. We'll have a new blog post for that.
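
The FontName guessing PMR describes in the conversation can be sketched roughly as follows. This is an illustration in Python, not AMI's own code: the function name `guess_font_style`, the keywords it looks for and the fields it returns are all assumptions made for this sketch. The only parts taken from the PDF specification are the names of the standard 14 fonts and the six-capital-letter subset prefix.

```python
import re

# The standard 14 font names defined by the PDF specification.
STANDARD_14 = {
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
}

# Embedded font subsets are tagged with six capitals and "+",
# e.g. "ABCDEF+Helvetica-Bold".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def guess_font_style(font_name):
    """Guess family and style from a FontName alone (hypothetical helper).

    Returns None when there is no name to interpret at all.
    """
    if not font_name:
        return None
    name = SUBSET_PREFIX.sub("", font_name)
    # Everything before the first hyphen is treated as the family,
    # everything after it as a style suffix. This is only a heuristic.
    family, _, suffix = name.partition("-")
    suffix = suffix.lower()
    return {
        "family": family,
        "bold": "bold" in suffix,
        "italic": "italic" in suffix or "oblique" in suffix,
        "standard14": name in STANDARD_14,
    }
```

For "Helvetica-Bold" this reports a bold, non-italic Helvetica from the standard 14. A name with no recognisable suffix yields no style information, and a missing name yields nothing, so the guess only helps with publishers who happen to name their fonts sensibly.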

What’s this graph?

Rich Apodaca created an instructive approach to blogging information by deliberately cutting annotations off graphs to make the reader think about them (e.g. http://depth-first.com/articles/2010/10/26/name-that-graph/).

So, what's the following graph, and why does it matter? When I get an answer I'll reveal its source, and what it means. I shouldn't have to explain its importance:

UPDATE: Now we have had suggestions, I'll reveal. Thanks Tom and Alex for suggestions. Here it is: the inexorable march of Mickey Mouse.

The vertical axis is the number of years for copyright duration in the US. So something created now will remain in copyright until well after the end of this century. In the age of the Internet Enlightenment that is barbarism. [I have temporarily lost the attribution, sorry.]

Here's TechDirt: http://www.techdirt.com/articles/20121116/16481921080/house-republicans-copyright-law-destroys-markets-its-time-real-reform.shtml This is good stuff from the **Republicans**. They have realised that actually copyright is anti-free-market.

House Republicans: Copyright Law Destroys Markets; It's Time For Real Reform    

from the congress-wakes-up dept

Update: Wow. It took less than 24 hours for the RSC to fold to Hollywood pressure. They have now retracted the report and attempted to claim that it was not properly vetted.

Right after the Presidential election last week, Chris Sprigman and Kal Raustiala penned an opinion piece suggesting that one way the Republicans could "reset", and actually attract the youth vote, would be to become the party of copyright reform. We had actually wondered if that was going to happen back during the SOPA fight, when it was the Republicans who bailed on the bill, while most of those who kept supporting it were Democrats. Since then, however, there hadn't been much movement. Until now. Late on Friday, the Republican Study Committee, which is the caucus for the House Republicans, released an amazing document debunking various myths about copyright law and suggesting key reforms.

If you're used to Congress not understanding copyright, prepare to be surprised. It's clear, thorough and detailed about just how problematic copyright has become and why it needs to change. To give you a sense of where the document heads, note the final line:

Current copyright law does not merely distort some markets -- rather it destroys entire markets.

Yes.

In Science, STM publishers are destroying current and future markets. And, unlike Disney, they didn't even write or draw the stuff they stop us using.


Mapumental: How long is my public transport to work? a fun Open house-hunting site from MySociety.

My Society http://mysociety.org is a wonderful force for web-based democracy (a project of UK Citizens Online Democracy (UKCOD)). They change the world by building software for democracy. They've now released a version of their Mapumental software that covers the whole of the UK! Property (http://mapumental.com/ ). You simply type in your postcode and slide the time that you are prepared to travel by public transport. Here's chemistry in Cambridge (CB2 1EW) for 25 minutes to arrive by 0900. Stunning. And of course it all relies on Open Data (imagine if you had to ask the bus company for permissions).

And here's another of their great projects. FixMyStreet works for the UK but the software can be used anywhere.

FixMyStreet - anywhere in the world

FixMyStreet covers all of the UK. But we can't build street-reporting sites for the whole world, so friends abroad may like to know that we've made it easier than ever before to set up your own version.

If you've ever thought about building a site like FixMyStreet outside the UK, now is a great time. DIY mySociety is our ongoing project to make it easier for other people to build websites using our code - that's the first place to check in for help and support.

Creative Commons has played a major part in Openness; now it has a Science Advisory Board

 

At #okfest Puneet Kishor invited me to join the newly formed Science Advisory Board of Creative Commons, which has now been announced: http://creativecommons.org/science/board

Creative Commons' Science Advisory Board (SAB) guides its science program and provides overall strategic vision and focus. The SAB brings legal, institutional, and domain-specific knowledge in the use and sharing of scientific tools and data. Our SAB is made up eminent scholars and practitioners from different disciplines and four continents who have volunteered to provide us both the domain expertise as well as regional perspective to help create a truly globally responsive program.

I am both honoured and eager to get started.

First I must honour Creative Commons as one of the cornerstones of Openness in this century. Quite simply, without Creative Commons and its licences, Open Access and many aspects of Open Scholarship would be impossible, certainly in science. CC is one of the simple, clear guiding principles without which we would founder for want of direction. "Simple" does not mean trivial. CC has required a huge amount of work. But it has distilled many of the operational complexities into crystal clear concepts – legally enforceable licences.

Licences don't solve everything, and people who try to control community behaviour through licences alone will be disappointed. But where large amounts of money are at stake (STM output is worth hundreds of billions each year) licences are essential. And I've just written about the Open Goldberg Variations, licensed under CC0. Without such a licence it might not have been possible to create the OGV as such. It is a clear human and legal statement of the dedication into the public domain. Nor would our work on the Panton Principles have been possible without the involvement of John Wilbanks and the possibility of CC0 as one of the licences to define the output.

[battery running out…]

But CC is much more than a licensing system. It's the major exploration of how we make scholarly and cultural works available. Not all CC licences are Open, and I'm quite happy with that. It's a way for creative artists to state what can and what cannot be done with their works.

But restrictive CC licences (NC, ND) are not appropriate for published science. Once science is published (whether publicly funded or not) it only makes sense if everyone (including other publishers) can make unrestricted use of it. For that reason I am disturbed about publishers such as Nature who are requiring author charges (APCs) to be related to the type of CC licence – you have to pay more for CC-BY than CC-NC. IMO this is a clear violation of the spirit of licensing in science; it confuses and restricts – which I suspect is the intention. Science should only have the following CC licences: CC-BY for articles and CC0 for data. So this is an area that I shall want us to discuss on the SAB.

And it is frankly awful that so many publishers don't even use simple formal licences but have vague terms and conditions on scattered web pages which are often woolly and contradictory. With CC the formal basis of the discourse becomes clear, in a way that terms such as "Open Access" no longer (unfortunately) seem to be.

 

 

#openscholarship Bach + Kimiko Ishizaka and George Veletsianos + Royce Kimmons; Culture wants to be Open

 

One of the great by-products of invitations (e.g. to Perth) is that I am catalysed to write overview blog posts because I don't normally do conventional slides and because I don't know what I am going to say in detail (though I agonize sufficiently beforehand). For example in this case I had just come up on the bus where I met two very nice students from ECU who showed me where to get off the bus. One was a performing musician (ECU has the Western Australian Academy for the Performing Arts, WAAPA, http://www.waapa.ecu.edu.au/). So I asked if he had come across The Open Goldberg Variations http://www.opengoldbergvariations.org/ by Kimiko Ishizaka. No. So I told him about how this was completely open (CC0) including both the performance and the score. That means they have created an app which shows the score running alongside the recording. What a wonderful aid for teaching, learning and scholarship in general.

 

And how disappointing to read on the page:


Open Goldberg Variations

"Copyright abuse hurting musicians!" Does that headline get your attention? Well it's true. Kimiko Ishizaka, Pianist is the victim of copyright abuse. Despite her best efforts to make the Open Goldberg Variations free - for every person and every use possible - GEMA (Germany's artist collection society) is STILL claiming that the work falls under their jurisdiction and blocks YouTube videos from playing in Germany when they use Kimiko's music. Since the whole point of making the music free was to gain exposure to a larger audience, having the videos blocked hurts Kimiko Ishizaka as an artist, and undermines her efforts to share Bach. Outrageous.

 

This, as much as anything, indicates that we are in the middle of a titanic battle for our scholarship and creativity. What conceivable moral or ethical right can anyone have to block Bach's work 260 years after his death? Is his ghost demanding rights? Even Mickey Mouse is not 260 years old (although if we don't fight we still won't have access to him in 200 years' time).

 

So I started my talk by playing the Aria from OGV.

 

Because I blogged I got this great mail from George Veletsianos. Cameron Neylon emphasises networks and scale, and this is an ideal example of another link in the network – we use each other's ideas to reinforce and refine our own. And networks scale with N-squared (though I think it's even greater), so this isn't one link, it's increasing the power by N+M where these are the sizes of our local networks – we are linking our environments and because they may be partially independent we gain a great deal.

 

George Veletsianos says:

November 14, 2012 at 3:33 pm

I really appreciated reading your insights as I come from a different disciplinary background. I came across your blog because I try to keep up-to-date on the concept of open scholarship, and I was alerted through Google Alerts about it. This idea, of researchers connecting with researchers through the opportunities afforded to us by networked technologies, I think is one of the central characteristics of the notion of open scholarship and openness. A colleague and I have written about this recently and I think that you might enjoy our paper: http://www.irrodl.org/index.php/irrodl/article/view/1313/2304

Now that I also found your work, I look forward to following along!

George and Royce Kimmons have written a very good, comprehensive, well referenced paper on Open Scholarship. They come from a different angle, mainly teaching and learning, but they share the same ethical and political viewpoint. Read the paper, but they summarised their position in four principles (PMR highlighting and numbering), governed by a single guiding philosophy of Openness as an ethical pursuit for democratization, fundamental human rights, equality, and justice. This is essentially identical to the views of those I meet in the Open community (I have blogged that http://blogs.ch.cam.ac.uk/pmr/2011/09/30/access-to-scientific-publications-should-be-a-fundamental-right/ ). So George and Royce:

Given these examples of open scholarship, we should be able to recognize some common themes and assumptions about openness, sharing, and Internet technologies that unite such practices. First, open scholarship has a strong ideological basis rooted in an ethical pursuit for democratization, fundamental human rights, equality, and justice. As the Budapest Open Access Initiative (2002) explains, the aim of openness is "building a future in which research and education in every part of the world are … more free to flourish," thereby reflecting ideals of democracy, free speech, and equality. Caswell, Henson, Jensen, and Wiley (2008) further explain this ideological basis with a statement of belief:

  1. We believe that all human beings are endowed with a capacity to learn, improve, and progress. Educational opportunity is the mechanism by which we fulfill that capacity. Therefore, free and open access to educational opportunity is a basic human right, … [and] we have a greater ethical obligation than ever before to increase the reach of opportunity. (p. 26)

Directing these desires for ensuring basic human rights, transparency, and accountability is a sense of justice or fairness in scholarly endeavors. Based on this ideological foundation, openness and sharing in scholarship are seen as fundamentally ethical behaviors that stand as moral requirements for any who value ideals of democracy, equality, human rights, and justice.

  2. Secondly open scholarship emphasizes the importance of digital participation for enhanced scholarly outcomes. Arguments for openness tend to focus on addressing the short-comings and limitations of current institutionalized practices through faculty participation in online spaces. For instance, Greenhow, Robelia, and Hughes (2009, p. 253) argue that Web 2.0 "tools might positively affect—even transform—research, teaching, and service responsibilities—only if scholars choose to build serious academic lives online, presenting semipublic selves and becoming more invested in and connected to the work of their peers and students." Throughout these arguments for openness, the undesirable alternative is depicted as being "closed" or unresponsive to calls for equity, sharing, and transparency.
  3. Thirdly open scholarship is treated as an emergent scholarly phenomenon that is co-evolutionary with technological advancements in the larger culture. Though ideals espoused in the first assumption are not new developments, their reintroduction into and re-emphasis in discussions of scholarship come in conjunction with the development and diffusion of a variety of social technologies. As Wiley and Green (2012) point out, open practices "allow the full technical power of the Internet to be brought to bear on education" (p. 82), and though causal relationships between technology developments and social trends are multidimensional, historical precedents suggest that social trends evolve in conjunction with technology development in a negotiated and co-evolutionary manner (cf. Veletsianos & Kimmons, 2012; Binkley, 1935). Thus, when discussing openness in scholarship, technology must be seen as both being an actor (i.e., influencing changes in scholarly culture and thereby influencing cultural behaviors) and being acted upon (i.e., being influenced by scholarly and other cultures and thereby reflecting cultural behaviors).
  4. Finally, open scholarship is seen as a practical and effective means for achieving scholarly aims that are socially valuable. Such aims might range from ideological values (as mentioned above) to a variety of others including reduced cost of delivery, improved efficiency, greater accuracy, and so forth. For instance, one argument in favor of OA journals is that "the cost savings alone are likely to be sufficient to pay for open access journal publishing or self-archiving, independent of any possible increase in returns to R&D that might arise from enhanced access" (Houghton et al., 2009, p. XIX). Similar arguments have been made about improved research efficiency in sharing data sets (Trinidad et al., 2010), increasing the reach of universities via MOOCs (Carson & Schmidt, 2012), and using SNS for research purposes (Greenhow, 2009). Considering an educational perspective, such efficiency may also have pedagogical value because as Wiley and Green (2012) argue, "Education is a matter of sharing, and … [open practices] enable extremely efficient and affordable sharing" (p. 82). In their view, "those educators who share the most thoroughly of themselves with the greatest proportion of their students" are seen as successful (p. 82). From this perspective, openness is seen as an effective vehicle for achieving various scholarly goals like affordability, efficiency, accuracy, accessibility, sustainability, dissemination, and effective pedagogy.

#ami2 #opencontentmining AMI analyses more PDFs and gets useful help from StackOverflow and shapecatcher

In the previous post AMI2 showed the need and the problems of dealing with fonts. In summary, if all fonts carry Unicode codePoints it's trivial to extract the codepoints and assert they are Unicode. Since SVG uses Unicode by default, passing the codes will mean SVGPlus can both "understand" them and render them. Unfortunately many PDF files in the STM literature are difficult or ambiguous for AMI2 to understand.

Remember that AMI2 has no emotions or social judgment. She does not understand "bad", "awful". She uses words like "complicated", "fragile", "ambiguous", "impossible". If they are difficult it's more work, but possible. If they are ambiguous, AMI2 can sometimes be given rules, heuristics or even probabilistic methods. If they are impossible, AMI2 says she can't give an answer.

AMI2-PDF2SVG has now had a look at the following PDF sources:

  • Manuscripts on Arxiv. These are authors' preprints, although in many cases they don't differ significantly from the publisher's version of record. They are probably produced by practising scientists using LaTeX or Word. Most authors will have already created several previous documents so they will understand how to do it.
  • Theses from Institutional repositories. These are also mainly authored by graduate students using Word or LaTeX (though we don't know). They may be the first documents students author, though many will have published one or more scholarly articles.
  • Published PDFs from STM publishers. PMR is trying to find "Open Access" examples so they can be quoted from. We don't know how these are authored. However PMR's best guess is that the publisher receives the author's manuscript in LaTeX or Word, annotates or corrects it in some way, and sends it off to third party typesetting companies. (If any publisher wishes to correct this, please do). There is some evidence from the fonts used that some typesetters deal with more than one publisher – and I got a list of three from someone who knew the industry.

AMI2 is particularly concerned with "high" codepoints – i.e. above 255. These are used for symbols (e.g. Greek letters, maths symbols, graphical symbols (circles, squares, etc)) and certain punctuation (e.g. smart quotes and dashes). It really matters getting these right.
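
As a rough illustration of what "high" means here, the following Python sketch lists the characters in a piece of extracted text whose codepoints are above 255, together with their Unicode names. The function name `high_codepoints` and the `threshold` parameter are made up for this example; the name lookup uses Python's standard `unicodedata` module. It only helps once the text really is Unicode, which is exactly what cannot be assumed for many publisher PDFs.

```python
import unicodedata


def high_codepoints(text, threshold=255):
    """List (character, "U+XXXX", Unicode name) for each character above threshold."""
    found = []
    for ch in text:
        cp = ord(ch)
        if cp > threshold:
            # unicodedata.name raises ValueError for unnamed characters
            # unless a default is supplied, so supply one.
            found.append((ch, f"U+{cp:04X}", unicodedata.name(ch, "UNKNOWN")))
    return found
```

Plain ASCII text gives an empty list, while Greek letters, dashes and smart quotes are each reported with their codepoint and name, so they can be checked by eye against the rendered page.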

AMI2's initial findings (from a very small sample) are:

  • Arxiv manuscripts were easy to process and had few ambiguities and most of the symbols are correctly rendered. It appears that Unicode is commonly used. Math equations (especially from Word) may be problematic.
  • Theses are fairly similar (except where they include publishers' PDFs).
  • STM publishers' PDFs are very variable. Some are easy but many are complex, ambiguous and not infrequently impossible at this stage.

The problems across all PDFs are:

  • Codepoints are not Unicode and the encoding is unknown. It has to be guessed from the Font.
  • Many PDFonts do not have FontDescriptors. The FontDescriptor is the standard way of finding whether a character is italic, symbol, etc. and finding the FontFamily.
  • Some Fonts do not even have FontNames. Others have FontNames that are opaque (e.g. AdvP4C4E51).

In the problem cases the PDF only displays because the Glyphs are actually included. It's possible in PDFBox to ask for the glyphs and get bitmaps or outline (scalable) fonts. That's OK for sighted humans, but no one else.

So we have the following subproblems:

  • Map an undocumented codepoint in an undocumented font to Unicode.
  • Interpret a glyph bitmap/outline as a Unicode codepoint.

How do we solve this? We use the wonderful resource that is StackOverflow. Nearly 4 million questions have been asked on SO and most have had highly competent answers. So I asked SO about translating Mathematical-Pi-One points (e.g. "H11001") to Unicode:

http://stackoverflow.com/questions/13188587/conversion-of-mathematicalpi-symbol-names-to-unicode


Conversion of MathematicalPI symbol names to Unicode


I am processing PDF files and wish to convert characters to Unicode as far as possible. The MathematicalPI family of character sets appear to use their own symbol names (e.g. "H11001"). By exploration I have constructed a table (for MathematicalPI-One) like:

    <chars>
        <char charname="H11001" codepoint16="0x2B" codepoint="43" unicodeName="PLUS"/>
        <char charname="H11002" codepoint16="0x2D" codepoint="45" unicodeName="MINUS"/>
        <char charname="H11003" codepoint16="0xD7" codepoint="215" unicodeName="MULTIPLICATION SIGN"/>
        <char charname="H11005" codepoint16="0x3D" codepoint="61" unicodeName="EQUALS"/>
    </chars>

Can anyone point me to an existing translation table like this (ideally for all MathematicalPI sets). [I don't want a graphical display of glyphs as that means each has to be looked up as a Unicode equivalent.]
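
To show how a table like the one in the question could be used, here is a minimal Python sketch that reads the four entries and maps a MathematicalPI-One symbol name to a Unicode character. The names `MATH_PI_ONE`, `load_char_table` and `to_unicode` are invented for this example, and the table holds only the four entries quoted above; it is not a complete or authoritative mapping.

```python
import xml.etree.ElementTree as ET

# The four entries from the question, as well-formed XML.
MATH_PI_ONE = """
<chars>
  <char charname="H11001" codepoint16="0X2B" codepoint="43" unicodeName="PLUS"/>
  <char charname="H11002" codepoint16="0x2D" codepoint="45" unicodeName="MINUS"/>
  <char charname="H11003" codepoint16="0XD7" codepoint="215" unicodeName="MULTIPLICATION SIGN"/>
  <char charname="H11005" codepoint16="0X3D" codepoint="61" unicodeName="EQUALS"/>
</chars>
"""


def load_char_table(xml_text):
    """Build a dict mapping charname -> Unicode character."""
    table = {}
    for char in ET.fromstring(xml_text.strip()).iter("char"):
        decimal = int(char.get("codepoint"))
        hexadecimal = int(char.get("codepoint16"), 16)  # accepts the 0x/0X prefix
        # The two codepoint attributes are redundant, so check that they agree.
        if decimal != hexadecimal:
            raise ValueError(f"inconsistent codepoints for {char.get('charname')}")
        table[char.get("charname")] = chr(decimal)
    return table


def to_unicode(charname, table):
    """Return the mapped character, or None if the name is not in the table yet."""
    return table.get(charname)
```

With this table, "H11001" comes back as "+", while a name that has not been filled in yet, such as "H11017", comes back as None rather than as a guess.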

 

After 2 days no answers. This is unusual – you normally get answers within an hour or even minutes. So it's an obscure question with possibly no clear answer. But on SO you earn points for good questions and answers. Because I had 8500 points, I could offer a "bounty". This means that I give others points to answer my question. And I got answers. They are good, and are as good as possible as there is no Open translation table. But after a discussion it appeared that Adobe had published a table of MathPi code points and the glyphs. Here's a snippet:

So "H11001" looks like a plus and quacks like a plus (i.e. it is found where a plus sign makes sense). We can therefore write H11001 => U+002B. But what about H11017, for example? How do we find the Unicode point out of 100,000 possibilities?

That's where one of the answers was so valuable. They'd discovered "shapecatcher" – an excellent tool created by Benjamin Milde. Here's his thesis

And he has now developed shapecatcher using this technology. Shapecatcher has a large part of the Unicode character set along with (open) glyphs and he has developed the tool to map a bitmap onto possible characters, reporting the probability for each. Let's try it on H11017:

And here's my best effort (I'm in Perth airport without a mouse so it's straggly)

Brilliant. It has a clear Unicode point and my wobbly picture got it right.

I think Benjamin's technology has a great deal to offer.
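
To give a feel for the general idea of matching a drawn shape against known glyphs, here is a deliberately tiny Python sketch. It is a toy stand-in, not shapecatcher's method (which this post does not describe): it compares 5x5 bitmaps cell by cell and ranks three hand-made reference glyphs by Jaccard overlap. The names `REFERENCE_GLYPHS`, `ink` and `rank_candidates` are invented for the example, and the scores are overlaps, not probabilities.

```python
# Toy 5x5 reference glyphs; "#" is ink. Real glyphs would come from a font.
REFERENCE_GLYPHS = {
    "\u002B": ["..#..", "..#..", "#####", "..#..", "..#.."],  # PLUS SIGN
    "\u2212": [".....", ".....", "#####", ".....", "....."],  # MINUS SIGN
    "\u00D7": ["#...#", ".#.#.", "..#..", ".#.#.", "#...#"],  # MULTIPLICATION SIGN
}


def ink(bitmap):
    """Return the set of (row, column) cells that contain ink."""
    return {
        (r, c)
        for r, row in enumerate(bitmap)
        for c, cell in enumerate(row)
        if cell == "#"
    }


def rank_candidates(bitmap, references=REFERENCE_GLYPHS):
    """Rank reference characters by Jaccard overlap with the drawn bitmap."""
    drawn = ink(bitmap)
    scores = []
    for char, ref in references.items():
        ref_ink = ink(ref)
        union = drawn | ref_ink
        score = len(drawn & ref_ink) / len(union) if union else 0.0
        scores.append((char, score))
    # Best match first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# A slightly wobbly plus sign, missing one cell of its horizontal bar.
WOBBLY_PLUS = ["..#..", "..#..", "####.", "..#..", "..#.."]
```

Calling `rank_candidates(WOBBLY_PLUS)` puts the plus sign first, then the minus sign, then the multiplication sign. A real tool needs to cope with scale, position and stroke variation across thousands of characters, which is where the hard work lies.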

And it's an excellent example of why making theses Open is a Good Idea.

 

 

#openaccess #opendata; Cameron Neylon and I talk to Librarians in Perth; we must do it ourselves and make it Open

 

We had a marvellous day yesterday in Perth at the invitation of Constance Wiebrands, Toby Burrows, Sue Cook and others. Constance had heard I was in AU and invited me over. Toby put me up in Trinity College. At this season Perth erupts with jacaranda blossom over the whole city. Here's the view from my room:

 

By a remarkable coincidence Cameron Neylon was en route from ZA and visiting Perth – picked up on Twitter. So we had a double bill for about 2 hours with presentations from both of us and a lot of discussion. Very good attendance with representation from several libraries.

The theme was – roughly – how can libraries react to the opportunities and requirements of the change to Open. We are all sure it is coming – it may not be pretty. I've set out some of my thoughts in previous posts and emphasized inter alia:

  • The world is changing very rapidly – universities must reach out beyond their walls
  • Young people have a different attitude to information and knowledge
  • A huge range of tools and software glue is within our immediate reach. Very little is technically impossible.
  • We must be in control of our knowledge and our metadata. If we wait for external organizations (Google, Thomson-Reuters, etc.) to develop information products for academia we will get a poor subset, without any input or control. Google has no contract with us, the readers – they could switch off Scholar tomorrow. TR does not display its methodology and makes decisions without the input and the consent of those affected. (One participant likened it to Standard and Poor's assessment of financial institutions – yes). If we really wish to use our metadata for exploration, evaluation, etc. we must create it and we must control it.

And the primary way to get what we want is – paradoxically – to make it open.

Cameron stressed the importance of frictionless networks that scale. Friction is typified by missing licences, restrictive licences, CC-NC, mailing for permission, using proprietary APIs, closed formats, etc. etc. Only frictionless scales. We need filters but at reader-side, not distributor-side. We must decide what we want to receive and discard, not have it decided for us.

Constance was keen that we explore how libraries rise to meet the challenges and opportunities. There are huge opportunities because in the era of Open – which is coming – those who lead are those who have developed the infrastructure and combine it with trust. And trust is a major asset of universities. Without attribution here are some of the ideas that emerged.

  • Regular meet-ups over the Perth area (there are several universities). Find out what can be achieved by scale.
  • Reach out beyond the walls of academia. Involve the citizenry.
  • Get young people involved doing things that change the future of knowledge and libraries.
  • Use the library facilities for hackathons – they are often ideal.

It takes courage. How does a senior librarian work with the new technology and new society? The positive is that openness is inclusive – grey hair is not a barrier (personal experience). Entrenched attitudes in senior academics are harder to deal with (and we agreed that some can never be changed). But getting out the message of change, creating persuasive arguments and material, experimenting, must be an essential part of the present and future.

#openscholarship: Reclaim our Scholarship(III). How can we do this?

Constance Wiebrands, who has invited me to talk at ECU tomorrow, asks on this blog (see https://blogs.ch.cam.ac.uk/pmr/2012/11/14/openaccess-opendata-reclaiming-our-scholarship-ii-do-we-undervalue-it/ and previous):

It is probably more useful for [Librarians] to think about future steps and what libraries ought to be doing to support our academic colleagues. Have we (librarians) got the will to work on a large scale project to create and control an independent metadata record?

I hope we can have a general discussion about that tomorrow. But what I want to do now is to show the value and also to show that it is highly affordable. I'm an idealist and optimist so this colours the discussion. (However I have launched several successful software/informatics memes on the web, some succeeding in weeks, some taking decades, and I have watched others so I know that a great deal is possible).

First why? After all there are professional indexers of scholarly publication and their products cost a lot so surely we should choose them? PMR: No. The rapid change of modern technology means that products can become obsolete quickly. OK, so that's what Google has done with Google Scholar and MS with MS Academic search. And they are free. PMR: Yes. But they decide what to include and what not, in both breadth and depth. There's also no contract that says they will continue to exist. They may, they may not. I have seen umpteen free services disappear or become commercialized. So the major benefits are:

  • We know precisely what the basis is. We can inspect quality
  • We can re-use the metadata for whatever purpose we like without permission. We can use it to control assessment exercises, act as definitive bibliography for citations, be annotated for a whole host of purposes (is it Open, does it contain certain types of material – e.g. data, arguments, etc.). Current awareness… reading lists, etc.

Assuming that we do want to control our data, isn't it going to be very expensive? PMR: Technically, no. Indexing from the web is now very straightforward. There are already several collections of web-collected metadata in academic institutions. And I'll show you tomorrow some of the things we are doing. (Crystaleye, Bibsoup, etc.)

Yes but, it isn't really our job to be doing this.

Um… if libraries are going to continue (and I hope they are) they have to be about *creating* information and meta-information. Buying and distributing will be done in the cloud or wherever. Amazon can distribute eBooks and could do the same for journals. Institutions have to be in control of their information.

PMR: In the information age small formal resources can create huge results. The factors that drive this include:

  • High quality and ever-increasing Open technology. Open is being increasingly adopted by governments for both software and data
  • Involvement of citizens ("crowdsourcing"). The prime example is, of course, Wikipedia. (Later I'll mention Openstreetmap)

http://wikimediafoundation.org/wiki/2011-2012_Annual_Plan_Questions_and_Answers gives:

The 2010-11 plan called for us to increase revenue 28% from 2009-10, to $20.4 million, and to increase spending 124% from 2009-10, to $20.4 million. In fact, we significantly over-achieved from a revenue perspective, and we also underspent, resulting in a larger reserve than planned. We're projecting today that 2010-11 revenue will have actually increased 49% from 2009-10 actuals, to $23.8 million. Spending is projected to have increased 103% from 2009-10 actuals, to $18.5 million. This means we added $5.3 million to the reserve, for a projected end-of-year total of $19.5 million which represents 8.3 months of reserves at the 2011-12 spending level.

Let's say 20 million USD. That's a minute sum for academia as a whole (Univ College London alone pays Elsevier about 2 million dollars in subscriptions). So it's among the best scholarly value anywhere. And the reason is, of course, that it has found a new way of doing things where the citizenry are an integral part. (Do universities give citizens a real stake in them? They should if they wish to retain public support).

But it also shows the value of concerted action. The publishing tools and the information infrastructure that WP has developed are far better than what the STM publishers use. The use of versioning and annotation is absent in traditional scholarly pub – not because it isn't valuable but because it's too much effort for the publishers. And because they don't have the goodwill of the community in helping add it.

So why isn't academia working with Wikipedia? It's a natural partnership. (There are some instances such as biosciences – where data in databases is being actively linked but there could be much more). And if Wikipedia is using academic output (e.g. referring to papers) it's creating one of the largest and best reference lists in the world. We've gently explored whether they could use our Open Bibliography specifications.

But the alternative (often complementary) approach is automation. It's now easy to crawl the web. It's increasingly possible to extract content from what is retrieved. And to do this in a domain-specific manner. Now Google and Microsoft aren't interested in science – it's too small a market (they *are* interested in health). So no-one will index science for us for free. The severe danger is that someone will index the science and then sell it back to us.

Here are some questions that we ought to be able to answer:

  • Find all articles which discuss antimalarials and retrieve the chemical reactions used to make them.
  • Find organic crystal structures that have been tested for second harmonic generation
  • Find phylogenetic trees which contain diptera species.

All of these are within our current technology.

The major problem is that we may be challenged on the legal right to do this. Publishers want to restrict content mining. As long as we acquiesce in what publishers want we will get what publishers want us to have.

So where is the resource going to come from? Ultimately it has to come from reshaping the market. There is a huge amount of wasted money at present and the more that academia takes control of the market the more that costs are decided by academia.

But there are also the dynamics of the Net / Web 2.0. Many projects have made great impact with a committed human driving them, a free webserver, and a growing volunteer community. Examples include:

  • Openstreetmap whose volunteers have mapped much of the world at often submeter resolution.
  • GalaxyZoo which classified a million Galaxies in a year with hundreds of thousands of volunteers

And our own efforts where Nick Day's Crystaleye has indexed all public crystal structures in #scholpub on the web (ca 200,000).

So you will need a champion. It may be that much of this can be done with resource already in the system. For example:

  • Students on LIS courses should absolutely know this stuff and the best way to learn is by doing
  • MSc students in engineering and CompSci need projects and this is an ideal one
  • Undergraduate students can be financed, say, through GoogleSummerOfCode

And who knows – there are huge numbers of citizens interested in books and journals. I am sure some of them would be interested in volunteering. And these ideas are just the start.

 

 

 

#openaccess #opendata Reclaiming our scholarship (II). Do we undervalue it?

I am organising my thoughts for my talk tomorrow in blog form. This helps me to organize my thoughts and also creates a record for those at the talk and those not at it. I won't necessarily cover everything.

I've contended (https://blogs.ch.cam.ac.uk/pmr/2012/11/14/openaccess-opendata-in-perth-au-my-talk-on-open-scholarship-i-we-must-reclaim-our-scholarship/ ) that about 300 BILLION USD is committed each year to the funding of public STM research. And that the revenue in scholarly publishing is about 15 BILLION USD and that part of that is for metadata. What do we get for our money? Is it good value?

I am not an economist, so this is simplistic, but I hope valuable. The funding of STM research is primarily (I think) to create valuable knowledge. (There are side benefits such as highly trained humans, better support for university and other infrastructure, but the rationale for funding is the creation of valuable knowledge.) So what *IS* the value and do we get the best value?

It is extremely difficult to measure the value of something when it is not directly traded. Universities do not generally sell the main part of their output, nor do they put themselves up for sale. (Occasionally public research establishments get sold, which may give a measure). Here's an example of how we might address this.

What's the value of a human life? Wikipedia summarises various authorities and a rough figure in the US is 5 MILLION USD. In other words if one can save a life by spending less than that it is good value for the country and the world (as well as for the particular human and their social environment). Sometimes the scholarly medical literature can save a human life directly. I have found several anecdotes where committed patients or their relatives have trawled through the medical literature and diagnosed their condition, created an informed opinion about the appropriate treatment, or both. Taking a figure of – say – 100 papers read and allowing for the costs of the medical infrastructure, physicians, etc. it can certainly be claimed that each paper has contributed thousands of dollars' worth of knowledge. And, of course, the papers can be reused by other patients and physicians.

The figures shouldn't be quoted blindly but they should serve to show how valuable the scholarly literature is. There are many other areas where there is a fairly clear connection between knowledge and monetary value – here are a few:

  • Improving agriculture
  • Human safety
  • Predicting and preparing for natural disasters
  • Building social and physical infrastructures
  • Providing knowledge for new businesses
  • Avoiding costly decisions

I was recently asked by the UK government to estimate the value of relaxing the restrictions on using scholarly publications for contentmining (and I'll be talking about some of this tomorrow). In my response (http://blogs.ch.cam.ac.uk/pmr/2012/03/21/my-response-to-hargreaves-on-copyright-reform-i-request-the-removal-of-contractual-restrictions-and-independent-oversight/ ) I analysed chemistry worldwide and came up with an estimate of the lost value due to restrictions on contentmining as "low BILLIONS" (annually, chemistry, global). I looked at:

  • Better and cheaper abstraction of the literature
  • A new generation of tools, vendors and information suppliers
  • Better decision-making (e.g in pharma)

When something is restricted (and has been for years) there is a major opportunity cost. People don't do things because they can't, so the type of innovation seen in Silicon Valley and Cambridge is impossible. Whole new communities of practice are possible through new knowledge.

The value of the knowledge produced from a piece of research must on average be more than the cost or it wouldn't be worth funding. I use a figure of "a few" times, though I have no justification for this other than a few cases where Opening knowledge has been measured. It's particularly difficult because the funders of the research may not be the direct beneficiaries. Thus work funded in the UK might directly benefit countries in the tropics or knowledge collected from them directly benefit the UK. In this sense publishing research into an electronically connected world is a valuable contribution towards everyone's wellbeing and – although I cannot justify this – could be an effective form of "aid" – it is at least very cheap to deliver!

The major problem in disseminating this information is the current publishing system which is primarily designed to communicate only with rich universities (i.e. only those which can afford subscriptions). Even in the rich countries most publications can only be read (legally) by 1 percent of the population – those employed at universities or research establishments. Others have to pay 40 USD to read one article for one day and generally you have to read many before you find the one you want. In non-rich countries it's obviously even worse. I call the disadvantaged "the scholarly poor".

The problem is that the market is so broken we have no indication of cost or value – we only have price (and even then many of the prices are required to be confidential). Let me estimate some of these:

  • The price of a journal per year per institution can run into thousands of dollars. Given that there are tens of thousands of journals even rich universities can't afford all of them. This is already a serious impediment.
  • The cost of processing an article is difficult to determine because publishers don't make their costs public. Some informal estimates put this at 250-500 USD. Acta Cryst E (which has built its own processing system) has a cost of about 140 USD – the price is, I think, 160 USD. NaturePG has said it "costs 10000 dollars to publish an article". It doesn't. If anything this is the average revenue expected per published article (i.e. price).
  • The value of an article is also difficult to determine. I've tried to estimate the value to the world as "many thousands of dollars". The value to the author and institutions can be even higher – it can make the difference between a career and a grant or none. (But this is no absolute value except in the plumage war market).

So the problem is that those to whom the absolute value comes – world citizens – have no say in the market. There are therefore no pressures on individual costs. Because of that we have a very inefficient process and one that often creates a low quality unimaginative uninnovative result.

 

#openaccess #opendata in Perth (AU). My talk on Open Scholarship (I). We must reclaim our scholarship

I have been invited to speak to a group of Librarians and Researchers (2012-11-15) by Constance Wiebrands of Edith Cowan University (http://www.ecu.edu.au/ ), possibly at their Mount Lawley Campus. This is a great honour – to be invited all across the continent – and I am working out what I shall say. I want to give an idea of the potential for Open Scholarship but also have to make it clear that this needs positive and courageous action. First an (Australian) picture – I'll explain later …

 

Male Southern Elephant Seals fighting on Macquarie Island for the right to mate (from http://en.wikipedia.org/wiki/Sexual_selection ).

The creation of knowledge in academia (a general term for Universities, Colleges, Research Institutions, Charities) is a major global business. I have been unable to get exact figures (and they will depend on definitions) but the annual cost of publicly funded STM (Scientific Technical Medical) research is about

300 BILLION DOLLARS

The error margin is probably a factor of 3 (so it doesn't matter whether they are AU or US Dollars, Euros or Pounds). This comes from Research funders (government, charities, foundations and trusts and industry (where the results are intended to be public)). I'd welcome better figures – I have guestimated by 2-3 methods and they end up in the 100-1000 Billion dollar range.

A more accurate figure is the scholarly publishing industry (journals, textbooks, electronic services, etc.) at about

15 BILLION DOLLARS

This is a very profitable industry: the costs of raw materials are near zero (donated by academics), the cost of services is low (donated by academics) and profits can be large (15-40%). It's an unregulated industry, with no price or quality control other than that applied internally and non-transparently by individual publishers. There is no direct price competition as one good is not substitutable by another (a paper in Nature and a paper in Acta Cryst E are both unique and non-equivalent in terms of their knowledge). There *is* a cumulative price limit in that most of the income is provided by University subscriptions to journals. Over the last decade this was funded on a significant annual increase in real terms (perhaps 8%) in library payments but there is public agreement that this cannot and will not continue. Subscribers have to make impossible decisions.
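To see why an annual increase of that size cannot continue, it helps to compound it. This is only a rough sketch: it assumes the "perhaps 8%" figure applies uniformly every year for ten years, which is an illustrative simplification and not measured data.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Return the overall multiplier after compounding annual_rate for `years` years."""
    return (1.0 + annual_rate) ** years


# Assumed inputs: an 8% real-terms increase per year, sustained for a decade.
multiplier = compound_growth(0.08, 10)
print(f"Library payments after 10 years: {multiplier:.2f}x the starting level")
```

On those assumptions library payments slightly more than double, in real terms, over the decade.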

Similarly the consumers ("readers") have no financial involvement. An STM scholar (I'll probably lapse into calling them scientists) does not actually pay for the goods (most of which are anyway rented) but puts internal pressure on the University/Library to prefer their requirements to their colleagues'. This is not related to the value of the goods but to internal political power. The effect of this process is that some subscriptions are cancelled but no market pressure is put on the value and quality of the individual subscriptions. If a powerful researcher says they "must have" subscription X1, then publisher X can charge more or less what they want. There is very little collective bargaining where the subscribers as a whole unite to challenge prices and quality.

This is already a fragile and dysfunctional market, but there are more problems. In a similar way that derivatives can influence sums far larger than the investment, the publication market affects decisions at the funding level. The STM meta-publishing market (e.g. citation indexes) has evolved not simply as a factual record of bibliographic citations but also as a means of evaluating the worth of a publication. I don't know what the metapublishing market is but let's say it's part of the 15 Billion so

3 BILLION DOLLARS

The metapublishing market is now used to control decisions in the funding of STM. By creating numerical values for individuals, departments, institutions and even whole subject areas, decisions can be made simply by running computation. For example the Australian ERA exercise (http://www.timeshighereducation.co.uk/story.asp?storycode=411286 , now scrapped) is being followed by QMC London (http://qmucu.wordpress.com/2012/02/04/save-sbcs/ ), whose criteria rely heavily on numbers. They include tables such as:

_______________________

[1] The proposed redundancy criteria relate to the period Jan 2008 to Dec 2011, during which many staff were told to concentrate on teaching. These criteria are based on the median achieved by the top 20% of UK researchers; that is, half of the researchers in the top fifth of the country failed to achieve these targets.

Category of staff | No. of papers | No. of high quality papers [2] | Income (£) | PhD studentships
Biology and Chemistry Professor | 13 | 3 | 500000 (at least 200000 as PI) | 2
Reader | 10 | 2 | 375000 (at least 150000 as PI) | 2
Senior Lecturer | 8 | 2 | 300000 (at least 120000 as PI) | 1

 

In other words if a researcher does not publish a given amount of given measured quality and attract a given income each year they will be fired.

The problem with numbers is not only that they are simplistic but that they also depend on who compiles them. If the term "high quality paper" is to be used then it might be expected that it would be controlled by universities, or funders or a public council or…

But in fact the numbers are created by a non-transparent process by commercial organizations whose only motive is to maximize profit (simplistically income over costs). This means selling as much metadata as possible for the highest prices and making the costs as low as possible (in this case by automation). In essence academic and funding decisions are taken by using numbers created by Elsevier or Thomson-Reuters in non-transparent processes. Two examples of problems are:

  • WHAT IS A CITATION? (this is a non-transparent process and no two "authorities" agree)
  • WHAT IS CITABLE? Again non-transparent

As an example of the problem, I was told by the International Union of Crystallography that their Acta Crystallographica E is no longer indexed by TR. Now IMO Acta Cryst E is an outstanding journal. It's the best data journal in the world. The authoring quality is required to be technically very high, the data are reviewed by humans and machines and the result is a semantic publication. But when Acta Cryst asked TR "Why?" they were apparently told "too many self-citations".

The point is that a non-answerable commercially-driven organization is making these decisions, without reference to the much larger community of researchers and publishers. For me a citation is an objective scholarly object and good software should be able to make independent decisions about it. Self-citations are part of the scholarly record. They may have an important role in establishing quality and credibility of metadata – who knows? It's possible to filter them out downstream if required. But since TR don't make available their raw citations (and the reason can only be commercial) we cannot make informed decisions.
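To illustrate what "filter them out downstream" could look like if raw citations were available, here is a minimal Python sketch. Everything in it is hypothetical: the Citation record, its field names and the sample data are invented for illustration, and the two rules shown (shared author, same journal) are only two possible definitions of "self-citation", not TR's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """One raw citation link; all field names here are hypothetical."""
    citing_paper: str
    citing_authors: frozenset
    citing_journal: str
    cited_paper: str
    cited_authors: frozenset
    cited_journal: str


def is_author_self_citation(c: Citation) -> bool:
    """True if at least one author appears on both papers."""
    return bool(c.citing_authors & c.cited_authors)


def is_journal_self_citation(c: Citation) -> bool:
    """True if the citing and cited papers are in the same journal."""
    return c.citing_journal == c.cited_journal


def partition(citations, is_self):
    """Split citations into (kept, removed) using a caller-chosen rule."""
    kept, removed = [], []
    for c in citations:
        (removed if is_self(c) else kept).append(c)
    return kept, removed


# Invented sample data: three papers citing the same earlier paper p0.
p0_authors = frozenset({"A. Author", "B. Writer"})
raw = [
    Citation("p1", frozenset({"A. Author"}), "Journal X", "p0", p0_authors, "Journal X"),
    Citation("p2", frozenset({"C. Scholar"}), "Journal Y", "p0", p0_authors, "Journal X"),
    Citation("p3", frozenset({"D. Reader"}), "Journal X", "p0", p0_authors, "Journal X"),
]

kept_by_author, removed_by_author = partition(raw, is_author_self_citation)
kept_by_journal, removed_by_journal = partition(raw, is_journal_self_citation)
print(len(kept_by_author), len(kept_by_journal))
```

The raw list is never altered, so anyone can apply their own rule and compare the results. That is only possible when the raw record is open.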

So here is my zeroth recommendation for the scholarly community.

  0. WE MUST CHALLENGE FOR OUR PUBLIC RIGHT TO OUR INFORMATION

What is "our right" is a political judgment. For me and many others in the Open movement we see the knowledge transmitted to publishers as being owned by the community with the publishers being paid for added value. But most publishers see this as "their property" with an increasing number looking to re-exploit in new ways (such as selling images from publications or charging fees for contentmining). Unfortunately we have had two decades of supine acquiescence to appropriation of rights by publishers and the reclamation of our own property is now much harder than if we had started it in (say) 1995. I have no examples of heads of Universities speaking out and claiming rights to scholarly output (mandates for "Open Access" don't count as they do not reclaim serious rights).

So here is my first recommendation to the academic community:

  1. WE MUST CREATE AND CONTROL AN INDEPENDENT RECORD OF METADATA (bibliography and citations) OF SCHOLARLY PUBLICATIONS

That is the most important thing we can do for the first step. We need to know objectively and publicly who published what and who cited what. It's technically fairly easy to do for modern electronic publications. We and others have software which can crawl publications on the open web, index their bibliographic data and extract the citations. The only problem is that publishers will claim that they "own" the citation data. This claim can have no moral or ethical justification. So the effect is that universities pay large amounts of money for metadata of unknown quality on which they then base their judgments. If Google can index the academic literature so can we. And it need not be expensive.
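As a rough illustration of how small the core of such a record is, here is a Python sketch of "who published what and who cited what". It assumes the crawling and extraction have already produced one record per publication; the record layout, the identifiers and the function name are all hypothetical, and it does not describe any existing software of ours.

```python
from collections import defaultdict


def build_citation_index(records):
    """Build a bibliography and a reverse citation index.

    `records` is an iterable of dicts with the (hypothetical) keys
    'id', 'title', 'authors' and 'references' (a list of cited ids).
    """
    bibliography = {}             # id -> bibliographic metadata
    cited_by = defaultdict(set)   # cited id -> ids of the papers citing it
    for rec in records:
        bibliography[rec["id"]] = {
            "title": rec["title"],
            "authors": list(rec["authors"]),
        }
        for ref in rec["references"]:
            cited_by[ref].add(rec["id"])
    return bibliography, dict(cited_by)


# Invented sample records standing in for crawled publications.
sample = [
    {"id": "paper-a", "title": "A", "authors": ["A. Author"], "references": []},
    {"id": "paper-b", "title": "B", "authors": ["B. Writer"], "references": ["paper-a"]},
    {"id": "paper-c", "title": "C", "authors": ["C. Scholar"], "references": ["paper-a", "paper-b"]},
]

bibliography, cited_by = build_citation_index(sample)
for paper_id in sorted(cited_by):
    print(paper_id, "is cited by", sorted(cited_by[paper_id]))
```

The hard parts are the crawling, the extraction and the legal challenge, not the index itself. Because both structures are plain and inspectable, anyone can check exactly what was counted and why.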

The key point to realise is that we are in a large, dysfunctional, fragile market.

The elephant seals and Bird-Of-Paradise (http://en.wikipedia.org/wiki/Goldie%27s_Bird-of-paradise) expend a considerable amount of effort in intraspecies competition (males attracting females). I am not a theoretical biologist but this energy does not help to protect the survival of the species per se (as opposed to the predator-prey (http://en.wikipedia.org/wiki/Predation) process). In fact growing long feathers may be a serious disadvantage when a new predator emerges.

The academic community spends a huge amount of money and effort in intra-research competition. While some of this is healthy it is unclear that the heights we have reached (huge competition to publish in certain journals, huge rejection rates, huge inefficiency) lead to better science. It's probably neutral. What is clear is that when something changes the evolutionary niche there will be massive disruption. The major places this might come from include:

  • New disruptive technology
  • New unconventional players in the market (e.g. mobile providers)
  • New political pressures (e.g. from funders)
  • New attitudes in researchers allied to the Open movement
  • Involvement of the citizenry

Because it's disruptive it is not predictable. However there *are* things academia should be doing anyway and I'll continue in the next blogs.