#ami2 Can only academics understand scientific papers? Or can the #scholarlypoor be scientists as well? We need us

 

A FORB (Wikipedia)

One of the arguments in scholarly publishing is that it is for “academics to publish to academics”. Even Open Access advocates such as Stevan Harnad have stated this publicly. I find this arrogant and unacceptable – I think with modern resources such as Wikipedia and Internet search engines much of science is accessible to a huge number of the #scholarlypoor (people outside rich universities with no access to closed publications).

I am trained as a chemist, crystallographer, self-taught computer-scientist and I have no formal biology training. But Ross Mounce and I are working on liberating the world’s phylogenetic trees. DON’T switch off at “phylogenetic” – like many scientific terms you know much about http://en.wikipedia.org/wiki/Phylogenetic_tree already. Can you understand:

A phylogenetic tree or evolutionary tree is a branching diagram or “tree” showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor.

I think anyone with high school education will (or should!) be familiar with everything here. The only difficult words are “entities” (posh word for “thing”) and “taxa” which is either fairly obvious or you can look up. Again from Wikipedia:

A taxon (plural: taxa) is a group of one (or more) populations of organism(s), which a taxonomist adjudges to be a unit. Usually a taxon is given a name and a rank, although neither is a requirement. Defining what belongs or does not belong to such a taxonomic group is done by a taxonomist with the science of taxonomy. It is not uncommon for one taxonomist to disagree with another on what exactly belongs to a taxon, or on what exact criteria should be used for inclusion.

And here is the tree. You may not understand all the names (*I* don’t!) but you can see “Bacteria”, “Animals”, “fungi”, “plants”, etc. I don’t need to understand everything – because I have colleagues such as Ross Mounce and Matthew Wills at Bath I am working with.

So here is a page of BMC Evolutionary Biology that #AMI2 has turned into HTML. Can you understand it? (It’s a LOT easier than understanding domestic energy tariffs in UK):

From this interaction follows that divergent selection between ecological niches is a major driving force differentiating lineages until reproductive isolation occurs [17]. Ecologically divergent pairs of populations will show higher levels of reproductive incompatibility and lower levels of gene flow than ecologically more similar population pairs [29]. A resulting corollary is that ecological speciation is more likely to arise in regions with patchworks of contrasting habitats and/or distinct environmental gradients.

PMR. Some of the long words are precise terms but I think this could be written in simpler language.

The number of taxa within the insect order Coleoptera exceeds that of any known plant or animal group [30]. More than half of the beetles are phytophagous, including the species rich superfamilies Curculionoidea and Chrysomeloidea, of which a majority feeds on angiosperms [31]. The increase in phytophagous beetle diversity was facilitated by the rise of flowering plants [31]. The family Chrysomelidae currently consists of more than thirty-five thousand recognized species including economically important pest species such as the Colorado potato beetle ( Leptinotarsa decemlineata), the Northern corn rootworm ( Diabrotica virgifera), the Cereal leaf beetle ( Oulema melanopus), and the Striped turnip flea beetle ( Phyllotreta nemorum). The biological and economic importance of the superfamily Chrysomeloidea make it vital to understand the factors that drive diversification in this group.

Here, we present a case of ecological niche differentiation in the alpine leaf beetle Oreina speciosissima that may represent the early stages of ecological speciation. The genus Oreina currently includes twenty-eight species, of which only seven early-diverging taxa do not exclusively occur in high forbs (i.e. five develop in stone run vegetation and two can be found in both high forbs and stone runs) [32]. According to current knowledge [34], the most parsimonious explanation is that high forbs vegetation is the ancestral niche for the remaining twenty-one Oreina lineages, among which only our focal taxon Oreina speciosissima shows a partial reversal, since it is found both in high forbs and stone run vegetation.

Oreina speciosissima is distributed across nearly the entire range of the genus Oreina (from the Pyrenees in the west to the Carpathian Mountains in the east) through a wide altitudinal gradient (ranging from 800 to 2700 m above sea level). At lower elevations it generally colonizes the very abundant high forbs vegetation whereas at higher elevations it is found in stone run habitats across a small portion of its distribution range [unpublished observations MB, TVN][32]. Kippenberg [32] and personal observations suggest that Oreina speciosissima feeds exclusively on Asteraceae ( Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio and Tussilago) and colonizes four distinct habitats

Did you understand it’s about how beetles in European mountains evolve? You may very probably know about European biology (when at school I used to travel to the Alps to identify and photograph alpine plants and to ring (band) birds). That was before I became an “academic”. But I knew all the binomials of European birds and plants I had seen. If you are similar you are entitled to be part of open scholarship. There are words I don’t know: “forbs”

[“A forb (sometimes spelled phorb) is a herbaceous flowering plant that is not a graminoid (grasses, sedges and rushes). The term is used in biology and in vegetation ecology, especially in relation to grasslands[1] and understory.” From Wikipedia]

And I didn’t know “stone run” either:

A stone run (called also stone river, stone stream or stone sea[1]) is a conspicuous rock landform, result of the erosion of particular rock varieties caused by myriad freezing-thawing cycles taking place in periglacial conditions during the last Ice Age.[2]

But I am sure you understood it!

We have the equipment to open scholarship to the world. Let’s embrace and use it.

 

Posted in Uncategorized | Leave a comment

#ami2 and @tabula : collaboration vs competition; #scholrev


Oreina gloriosa from Wikipedia (you’ll see why)

I’ve read 25+ academic papers about extraction of information from PDFs and only 1 of those makes any mention of availability of code. These papers are published to announce a new (usually incremental or even repeated) advance and the main driving force is academic glory and reward. I don’t blame the authors in most cases – that’s how the academic system works (and that’s what I blame). But it means that when I, for example, wanted to create #ami2 I had to start from scratch.

No, that’s completely unfair to PDFBox and Apache on which #ami2 is based. But in terms of analysing scientific PDFs I had to start from PDFBox. No existing code to help with tables, graphs, trees, text, etc. And although I have heard many presentations by academics there is very little re-usable code – so I had to write my own.

Not MY own. OUR own. Because everything I do is for US. That’s what works in the Blue Obelisk. (I was delighted to hear yesterday that Jmol now has a completely JavaScript version. That means I don’t have to write a JavaScript viewer for 3D chemistry.) And it is what will work in #scholrev. A community approach to building the tools for open scholarship.

Today I got a tweet that a group (@Tabula) was working on extraction of tables from PDFs – the area I am spending a lot of my time in. A typical academic reaction might be “Blast. We’ve been scooped”. Because that means we couldn’t publish anything on extraction of tables. (That’s not true, of course; duplicate work often gets published – just not in the glamour mags. And duplication – within reason – is good because it cross-fertilizes and acts as a check).

So MY/OUR reaction was Great! I don’t have to do tables. I can use @Tabula instead. Now let’s see what it does. I haven’t yet corresponded with @tabula folks but it’s related to Mozilla and anyway it’s under an Open licence and invites collaboration. So I know I can use it – only question is what sort of technology – static/dynamic link, web service, or even translating code. (Of course this would be done with agreement and acknowledgement.)

Let’s have a look: http://source.mozillaopennews.org/en-US/articles/introducing-tabula/

A table with ruling lines


A fully lined table.

Tables without row or column graphic separators are also common. For these type of tables, we cluster together the words that vertically overlap each other. The row boundaries are the bounding boxes of each detected cluster of words.

A table without graphic separators


Detected row boundaries in a table without graphic separators.

An analogous procedure is then carried out for detecting column boundaries. Tabula clusters together words that overlap horizontally. The bounding boxes of those clusters are the column boundaries.
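The clustering described above can be sketched in a few lines. This is only an illustration of the idea under stated assumptions, not Tabula's actual code: the `Word` class, the function names and the coordinate convention (origin at top left, y increasing downwards) are all hypothetical, and each word is assumed to arrive with a bounding box already extracted from the PDF.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word with its bounding box (hypothetical structure, not Tabula's)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def cluster_by_overlap(words, lo, hi):
    """Merge words whose [lo, hi) intervals overlap.

    Returns a list of (start, end, members) tuples, one per cluster,
    where start/end is the extent of the cluster along the chosen axis.
    """
    clusters = []
    for word in sorted(words, key=lo):
        if clusters and lo(word) < clusters[-1][1]:
            # Overlaps the current cluster: extend it.
            start, end, members = clusters[-1]
            clusters[-1] = (start, max(end, hi(word)), members + [word])
        else:
            # No overlap: start a new cluster.
            clusters.append((lo(word), hi(word), [word]))
    return clusters


def row_boundaries(words):
    """Rows: clusters of words that overlap vertically."""
    return [(s, e) for s, e, _ in
            cluster_by_overlap(words, lambda w: w.y0, lambda w: w.y1)]


def column_boundaries(words):
    """Columns: clusters of words that overlap horizontally."""
    return [(s, e) for s, e, _ in
            cluster_by_overlap(words, lambda w: w.x0, lambda w: w.x1)]


if __name__ == "__main__":
    page = [
        Word("Code", 10, 100, 40, 110),
        Word("Population", 60, 101, 130, 111),
        Word("A1", 10, 120, 25, 130),
        Word("Bern", 60, 121, 90, 131),
    ]
    print(row_boundaries(page))
    print(column_boundaries(page))
```

On a real page the column pass usually needs more care, for example when a caption or a header spans several columns and glues the clusters together.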

Wow. This looks exactly complementary to #ami2-svg2xml. Here’s where we have got to with #AMI2 – chopping up the page. A table from BMC Evolutionary Biology. (BMC is a commercial Open Access CC-BY publisher who WANT you to re-use material, unlike most mainstream “closed publishers” who make it extremely difficult).

 

#AMI has chopped the page into bits (this is not all of it) and has identified the Table because it says T-a-b-l-e. (We have to teach AMI every word). The “Table” consists of a box with (a) a caption, (b) a table body and (c) a footer. The table body has column headers (e.g. Code, Population). AMI2 does not yet understand what these actually mean – but we shall teach her.

I haven’t yet tried out @Tabula but I am very hopeful it will manage the body of the table.

When it does we then have to find out what the columns mean. I expect that words like “Coordinates” and “Year” will be very common and we can develop heuristics or machine learning. The format of the columns also contains vital information. Note that the altitudes are all > 1000 m so we have an alpine context.
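A first heuristic of this kind might look like the sketch below. It is not part of #ami2; the hint words, the regular expressions and the function name `guess_column_type` are assumptions chosen for illustration, and a usable version would need far more patterns (or a trained classifier).

```python
import re

# Hypothetical header keywords mapped to a column type.
HEADER_HINTS = {
    "year": "year",
    "coordinates": "coordinates",
    "altitude": "altitude",
    "elevation": "altitude",
}

# Hypothetical cell formats, tried in order when the header gives no hint.
CELL_PATTERNS = [
    ("year", re.compile(r"(19|20)\d{2}")),
    ("altitude", re.compile(r"\d{3,4}\s*m")),
    ("coordinates", re.compile(r"\d{1,3}°\s*\d{1,2}['′].*[NSEW]")),
]


def guess_column_type(header, cells):
    """Guess what a table column means from its header, then its cell format."""
    lowered = header.lower()
    for hint, kind in HEADER_HINTS.items():
        if hint in lowered:
            return kind
    for kind, pattern in CELL_PATTERNS:
        # Only accept a format if every cell in the column matches it.
        if cells and all(pattern.fullmatch(cell.strip()) for cell in cells):
            return kind
    return "unknown"


if __name__ == "__main__":
    print(guess_column_type("Year", ["2008", "2009"]))
    print(guess_column_type("Alt.", ["1850 m", "2100 m"]))
    print(guess_column_type("Code", ["A1", "B2"]))
```

Checking the header first and the cell format second is a design choice, not a requirement: the header is cheap and usually reliable when present, while the format check covers columns whose headers are abbreviated or missing.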

What’s it about? “Sampled population of …” suggests population studies. And we can look in the text:

“Oreina speciosissima” occurs in italics. This is suggestive of a binomial organism name. Here’s NL Wikipedia http://nl.wikipedia.org/wiki/Oreina_speciosissima . A web search gives us http://www.biol.uni.wroc.pl/cassidae/European%20Chrysomelidae/oreina%20speciosissima.htm where we have pictures (they are copyright but very beautiful). I’ll give you an http://en.wikipedia.org/wiki/Oreina_speciosa instead

I hope you can see how all this links together. Beetles, places, mountains, dates, etc. A new type of science.

And why I am so ANGRY about mainstream publishers preventing us doing this.


The Lancet’s new #openaccess policy. Do they/Elsevier take me for an (April) Fool?


#openaccess The current standard of “debate” is unacceptable; arrogant and ignorant

I have my head down and am trying to write code – to liberate knowledge (and I haven’t forgotten #scholrev!) but occasionally have to break off and blog. Simply: the standard of debate (if it can be called such) in #openaccess is appalling. Either non-existent or fuelled by prejudice and ignorance. This matters because (a) many of the “debaters” are academics, to whom we might look for clarity, fairness and guidance, and (b) we are losing billions (sic) by not getting our act together.

Twitter retweeted

An interestingly well-balanced critique of the dash for #openaccess

Linked to http://www.psa.ac.uk/political-insight/blog/open-access-and-psa (PSA = Political Studies Association).

It was neither well-balanced nor correct. Given that #openaccess is now a matter of debate in universities we are seeing a large number of new commentators, many from Humanities and Social Sciences (HSS). This is to be welcomed as these subjects have much to offer and it is sad that they have been relatively silent up to now. However there has been a dash to publish and much of it is rubbish. A particular criticism is of the CC-BY licence (Attribution, and re-use permission for any legal purpose, by anyone), which seems to be held up as destroying academic freedom. In the last week or two the rubbish has included “CC-BY allows discussion by Neo-Nazis” (Nottingham Trent University) , “CC-BY makes possible to create dangerous drugs” and many examples of loss of control. The present article illustrates the problem.

Before we start let me point out that I am (a) a member of CC Science Advisory Board (b) a scientist and (c) a member of the Open Knowledge Foundation Advisory Board. This means I know quite a lot, and don’t know a lot more. When I don’t know, I don’t make up answers but I post the questions on lists where others may know better.

In addition to this, the PSA has also given evidence to the RCUK call for pre-consultation and that of HEFCE. Both organisations were criticised by the Lords for their previous lack of consultation. They had both come out strongly in favour of Gold OA for all publicly funded research and the use of the CC-BY license for the copyright of that material. The PSA (along with a large number of learned societies) took the view that there had been insufficient consultation and that the CC-BY license has defects when applied to HSS subjects that lack the patents most STEM research obtains prior to publication; in other words the intellectual property of authors is insufficiently protected. CC-BY effectively means authors lose control over their product once it has been published.

This is completely wrong. Most STEM research does not obtain patents prior to publication – I’d be surprised if 1% of STEM papers were associated with patents. And patents are orthogonal to copyright (which is what CC addresses). If you want a patent you apply for it before publishing – if you announce your work publicly then it almost certainly invalidates the patent. And CC-NC-ND does not protect a patent. CC-BY has been used for many years by BMC, PLoS, and other CC-BY organs with no problems for patents. (Authors might have to decide whether to publish and forgo patentability, or to delay publication until patent rights are obtained – but this does not depend on the CC licence).

In other words, academics are expected to pay an article processing charge (APC) to a publisher of up to $3,000, and then there is the possibility of additional page charges. After that, under CC-BY they have no further claim on that work and it is up to others to commercially exploit it if they wish.

Any copyright protection is simply for the object that is licensed – not the ideas within it. In general this is not a problem in STEM where the actual paper as such is never commercially exploited (e.g republished by another publisher). The purpose of publication is to publish facts, ideas and opinions and expect others to build on them.

This includes overseas institutions, individuals and entrepreneurs. CC-BY is the product of the Creative Commons, a US organisation that was established to provide the product of research free at the point of delivery.

Although its headquarters are in the US, CC is international with a European presence http://europe.creativecommons.org/mission . It was not established to “provide the product of research free …”

It has a liberal libertarian perspective in much of what it does. Its founders were heavily influenced by the STEM model and it fails to take full account of the different perspective in social science.

Creative Commons covers many fields of creative work and has at least as much history of supporting creative arts as STEM (http://wiki.creativecommons.org/FAQ#What_is_Creative_Commons_and_what_do_you_do.3F ):

Creative Commons is a global nonprofit organization that enables sharing and reuse of creativity and knowledge through the provision of free legal tools. CC has affiliates all over the world who help ensure our licenses work internationally and who raise awareness about our work. Our legal tools help those who want to encourage reuse of their works by offering them for use under generous, standardized terms, those who want to make creative uses of works, and those who want to benefit from this symbiosis. Our vision is to help others realize the full potential of the internet.

Although Creative Commons is best known for licenses, our work extends beyond just providing copyright licenses. CC offers a number of other legal and technical tools that also facilitate sharing and discovery of creative works. Unlike other public legal tools, Creative Commons’ licenses and tools were designed specifically to work with the web, which makes content that is offered under their terms easy to search for, discover and use. CC also offers other legal tools, such as CC0, a public domain dedication for rightsholders who wish to put their work into the public domain in advance of the expiration of applicable copyright, and the Public Domain Mark, a tool for marking a work that is in the worldwide public domain. Additionally, Creative Commons makes available tools used by scientific communities, such as standard materials transfer agreements.

 


A sign in a pub in Granada notifies customers that the music they are listening to is freely distributable under a Creative Commons license. (from http://en.wikipedia.org/wiki/Creative_Commons )

For example, it does not take account of third party usage. It is likely that international publishers, especially those of US journals will not take UK papers shackled with a CC-BY license, thus preventing UK academics from accessing world-class journals. It is even likely UK based journals owned by learned societies will similarly respond in order to protect their intellectual property and that of their authors

I don’t even understand this. I think it is the concern that if someone publishes a scholarly article then they cannot then turn it into a book because others could do the same. If so it’s possibly worth debating, though no-one can do this without fully attributing the original author.

Might as well have the Neo-Nazis here: http://www.socialsciencespace.com/2012/10/why-open-access-is-good-news-for-neo-nazis/ (Robert Dingwall is a consulting sociologist, providing research and advisory services particularly in relation to organizational strategy, public engagement and knowledge transfer.  He is also a part-time professor in the School of Social Sciences at Nottingham Trent University.)

With a CC-BY licence, however, nothing stops the group taking hold of the paper, editing it down and using it as a recruitment tool: “Famous professor says we are just ordinary people responding in a reasonable way to the problems of our community…” You cannot pick and choose users: free access for Big Pharma is also free access for neo-Nazis.

These two are excesses, but they don’t get criticized by “mainstream” #openaccess. So there has been virtually no useful DEBATE on #openaccess. There have been assertions, evangelism, guesses, pontifications, and worse. The main mailing list GOAL http://mailman.ecs.soton.ac.uk/pipermail/goal/ is populated with a number of posters who are not prepared to welcome viewpoints other than their own. I occasionally post when I feel something has to be challenged, but only to try to put the record right, not because I hope for informed debate. In particular if we try to define what we are talking about we are likely to be told it’s irrelevant. There are few other places to debate in a proper manner – which is why we set up http://lists.okfn.org/pipermail/open-access/ where we take #openaccess as inspired by and consistent with the BOAI declaration (which is also consistent with CC-BY and not other CC licences such as CC-ND, CC-NC). On that list we welcome constructive debate.

The problem is now that because there has been no debate it is very difficult to develop a reasonable way forward. The field is highly factional. Ignorance (especially about copyright, and licences) is frequent.

For that reason some of us are trying to build new ways of #openaccess and STM has been in the lead, with BMC, PLoS, eLife, SCOAP3 and PeerJ. They’ve all actively required CC-BY licences, while mainstream publishers have been dragging their heels, prevaricating, fudging, obscuring and talking rubbish. Because they know they can get away with it. They are the only ones who benefit from the mess. And they are benefitting to the tune of hundreds of millions of dollars because we don’t hold them to account.

If a scientist talks rubbish then sooner or later the laws of physics or bioscience will prove them wrong. I cannot make energy from nothing. But in #openaccess you can talk rubbish, even if your students would be ashamed of you.


#openaccess; Let’s get rid of “Green” “Gold” and use precise language such as “CC-BY”. And be joyous.

Cameron Neylon has written a compelling article http://www.timeshighereducation.co.uk/comment/opinion/lets-get-this-straight/2002789.article on why we should get rid of “Green” “Gold” “Open Access” as meaningful labels. Because they no longer mean anything. They are as useful as “healthy” in a burger advertisement. I’m not going to repeat Cameron’s arguments – just read them yourself and redistribute.

Most publishers now produce inconsistent quasi-legal rubbish on their web pages. They try to write terms and conditions that are meaningful and normally they aren’t. They are almost an insult to readers (most of whom are actually intelligent knowledgeable humans). There is a spectrum of rubbish, varying from specialist departments of “Universal Access” whose business is in producing platitudes and not answering questions, to others that think that “all-rights-reserved” means something.

I was alerted to an article in IOP http://iopscience.iop.org/1367-2630/15/3/033037/article (Don’t switch off – it’s about building Klingon-like cloaking devices)

New Journal of Physics
Volume 15 March 2013

J C Soric et al 2013 New J. Phys.
15 033037 doi:10.1088/1367-2630/15/3/033037

Demonstration of an ultralow profile cloak for scattering suppression of a finite-length rod in free space

 

And I could READ it! It proclaims:

Great – it’s CC-BY. I can download it and feed it to #ami2 – our semantic program for extracting science from PDFs. But can #ami2 use it? I’d better check…

I look for the terms that refer to an individual http://iopscience.iop.org/page/terms_individual like me – and my #ami2. I don’t seem to have many rights (my emphasis):

You may access, download, store, search and print hard copy of text.  Copying must be limited to making a single printed copy or electronic copies of a reasonable number of individual articles or other similar items.  No text accessed via the Service may be made available to a third party, either for commercial reward or free of charge, except that for inter-library loan purposes a single paper copy of an electronic original may be made and sent non-digitally to a library in the same country as you under fair dealing/use exemptions.  In addition, for inter-library loan purposes, you may make a single paper copy of an electronic original available to a library in the same country by secure transmission using Ariel (or its equivalent) whereby that electronic file is deleted immediately after printing.  Such supply must be for the purpose of research or private study and not for commercial use or onward transmission or distribution.  In the USA, such copies may only be made in compliance with Section 108 of the Copyright Act of the USA and within CONTU guidelines.

[#ami2 asks me what an “Interlibrary loan” is. I tell her it’s a piece of paper. She crashes.]

So these TaC forbid me to (say) redistribute this article by posting it in a text-corpus – on Bitbucket – for mining. (That’s a really important activity, BTW).

We have a contradiction. And physics hates contradictions. I have always thought of the IoP as reasonably good guys (not all scientific societies fall into this classification). I think something needs fixing.

There is a spectrum of publisher attitudes to licences. At one end we have BMC, PLoS, eLife, PeerJ Charlie, and Tim Gowers initiatives and Ubiquity Press and… They positively WANT people to re-use material. It’s honest. At the other end we have unnamed (because I will get sued) publishers who state they are “incredibly helpful” to people like me and somehow seem to make re-use impossible through fudge, inconsistency, deliberately unhelpful licences, bad or non-existent labelling etc. Phrases on Open Access papers like “This journal is Copyright XYZ”. Yes, the *journal* is copyright but the paper is APC-paid Open Access and you haven’t the decency to tell the world. That’s weasel words and an insult to the authors and readers. Be honest and say

“This article is CC-BY”. Revere the authors. They want you to acknowledge them and use the article or bits of it for anything anywhere for any legal purpose and they rejoice in people making money out of it without their explicit permission because the more this happens the prouder they feel and the more others value them.

So maybe we need a joyous declaration on scholarly papers. After all Open Access is good and wonderful.

A: Open access means people can live and make a better planet. Not-A: Closed access means people die. A OR not-A ?

I agree there are technical difficulties in some of this. So why doesn’t OASPA produce a simple template for its OA publishers (the ones that actually believe in OA) making a clear positive statement that can be stuck on web pages? You are welcome to mine as a starting point.

 

 

 


#openaccess Can I use Wiley’s “Open Access” for teaching? NO

Wiley has an “Open Access” offering. I couldn’t find any papers so I tweeted and got:

“gold padlocks” and “purple padlocks”. “free” and “open”. Words and images that can mean anything. No idea whether it’s usable for teaching. Another tweet:

So off I go to the URL, find a paper on chemistry (there aren’t many, of course):

Is it actually Open? I find

So NO. I can’t use it for teaching (which is a commercial activity). I look for permissions:

And I get back

Which is useless.

So Wiley would like to hear from me, it says.

OK Wiley – I don’t think you are really trying hard enough. Open Access is about helping people get material, not making a trail of difficulties through purple and gold and open and half open and …

You’re actually telling us we don’t matter.

Just do the honourable thing like BMC, PLoS, eLife and PeerJ and make it

CC-BY

That’s simple. It’s BLACK but it reads the same in any colour


#scholrev: Strategy and decentralisation

I have already suggested our #scholrev should be decentralized (/pmr/2013/03/21/scholrev-why-are-we-doing-this-and-immediate-thoughts-on-how-to-proceed/, /pmr/2013/03/22/scholrev-revolutionising-scholarship-shape-of-the-community-and-practice/, /pmr/2013/03/24/scholrev-decentralized-open-infrastructure-an-example-from-the-blue-obelisk/) – now I’ll say why and suggest how we proceed.

Those of us in #scholrev are disillusioned enough that we want to do something different. Perhaps the best promoted was “an alternative to Google Scholar” (http://www.force11.org/node/4291 ) by Stian Håklev.

We need an open alternative to Google Scholar (like OSM [OpenStreetMap] is to GMaps). Imagine OJS/EPrints/DSpace pinging a central server with bibliographic metadata whenever a new article is published (like blogs pinging pingomatic), letting users contribute their own bibliographies. Every article would have a unique ID, enabling easy citation in any setting (a simple API would give citations in any format given the identifier, would also let you look a PDF file based on its hash, like MusicBrainz, or search). The database would be available for bulk download and data mining. Strongly integrated into all OA tools/citation managers, etc.
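To make the proposal concrete, here is a toy, in-memory sketch of such an index. Everything in it is an assumption for illustration: the class and method names (`OpenIndex`, `ping`, `lookup_pdf`, `cite`, `dump`) are invented, SHA-256 is just one possible choice of hash, and a real service would sit behind a network API with persistent storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    """Minimal bibliographic record (hypothetical schema)."""
    article_id: str  # the unique ID; a DOI is used here as an example
    authors: list
    title: str
    journal: str
    year: int


class OpenIndex:
    """Toy central index that repositories 'ping' with new metadata."""

    def __init__(self):
        self._by_id = {}
        self._id_by_pdf_hash = {}

    def ping(self, article, pdf_bytes=None):
        """Register an article and, optionally, the hash of its PDF."""
        self._by_id[article.article_id] = article
        if pdf_bytes is not None:
            digest = hashlib.sha256(pdf_bytes).hexdigest()
            self._id_by_pdf_hash[digest] = article.article_id

    def lookup_pdf(self, pdf_bytes):
        """Find the record for a PDF from the hash of its bytes, or None."""
        digest = hashlib.sha256(pdf_bytes).hexdigest()
        return self._by_id.get(self._id_by_pdf_hash.get(digest))

    def cite(self, article_id):
        """Return a plain-text citation for a known identifier."""
        a = self._by_id[article_id]
        return f"{', '.join(a.authors)} ({a.year}). {a.title}. {a.journal}."

    def dump(self):
        """Bulk download: the whole database as JSON, ready for mining."""
        return json.dumps([asdict(a) for a in self._by_id.values()])


if __name__ == "__main__":
    index = OpenIndex()
    index.ping(
        Article("10.1088/1367-2630/15/3/033037", ["J C Soric et al"],
                "Demonstration of an ultralow profile cloak", 
                "New Journal of Physics", 2013),
        pdf_bytes=b"stand-in for the real PDF bytes",
    )
    print(index.cite("10.1088/1367-2630/15/3/033037"))
    print(index.lookup_pdf(b"stand-in for the real PDF bytes"))
```

The `dump` method is the part that matters most for openness: an index that can only be queried one record at a time cannot be mined in bulk.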

Why hasn’t this happened already? Because libraries would rather buy things than build them. That gets us locked into an increasing cycle of deprivation – the more we buy the less capacity we have for building. And every year it gets worse. We already see that institutional repositories look 10 years out of date – they aren’t full, no-one wants to put things in, they can’t be searched etc. Compare that with Stackoverflow, Github and Bitbucket, OpenStreetMap, etc. and you can escape the sense of frustration.

We want to do our own thing.

So for me, #scholrev has the following drivers:

  • Innovation
  • Social justice
  • Cost-effectiveness
  • Democracy

How to proceed? We have a lot of ground to catch up. But if OSM could change the world in 5 years so can we. We face two main problems:

  • The indifference and possibly hostility of universities
  • Lawyers and vested interests

The first problem just requires courage and determination (Wikipedia was trashed by Universities until they couldn’t ignore it). The second is a real problem and we have to minimise it. But both suggest that we should have some or all of our work outside the current academic infrastructure. If we are to reach out to the #scholarlypoor (the global South, SMEs everywhere, patients, etc.) we cannot do this through centralised mechanisms. Wikipedia and OSM had single clear goals initially (an open encyclopedia of everything, and an open map of the world). Our task is more varied. The grand visions for reforming scholarship include (and you will think of more) :

  • Machine semantic Indexing/access to some/all of the literature (“some” if the lawyers stop us doing “all”)
  • Democratising scholarship
  • Creative approaches to combining scholarship and authoring
  • Intelligent machines for reading and interpreting the literature
  • Alternatives to monographs

(these are all impossible at present).

These visions are too large and varied to plan top-down and must be bottom-up. They are also too large to coordinate at a detailed level. However #scholrev has shown there are lots of groups starting to do-their-own-thing. The history of the web shows that some of these will flourish and others won’t. This isn’t an absolute judgment; it’s more that the time is right for some and not for others (it’s taken us 20 years to get semantic Chemistry moving). So we shouldn’t judge new developments too quickly but give them time to flourish.

What about duplication and waste? Wouldn’t (say) 20 independent authoring systems be worse than none at all? Shouldn’t we coordinate this centrally and have just one? In fact both are problematic. In the Blue Obelisk (v.i.) we’ve effectively solved this by constantly keeping in touch and watching what others do. For example I once spent a lot of time on developing a graphical display for chemistry. It wasn’t very good. And then I saw Jmol (http://jmol.org) and realised that *I* didn’t need to do it all myself.

I junked my code. A year’s worth. And rejoiced. From there we went on to the Blue Obelisk and now we have this great ecosystem. A few partial duplicates – but that’s useful for checking correctness, different platforms. And because we have legitimised the idea of components that interoperate the world has come to understand and respect what we have done.

That’s the key step. We don’t have to boil the ocean by ourselves. Or even in our groups. We build components. It’s the right way to build.

Can you publish components in high-impact closed journals?

Probably not. But that is not why we are building them. By building components we can reach out well beyond academia. An open scholarly indexer does not have to be built solely, or even at all, by academics. Let’s get software engineers and journalists and graphic designers involved. And patients.

We couldn’t have done this 5 years ago. We can now. What’s happened?

  • Wikipedia, OSM have shown that grand visions can be accomplished
  • GalaxyZoo has shown that meaningful subtasks can be created and that huge numbers of citizens can take part, bringing their own innovation and enhancing the process.
  • StackOverflow has shown that social tools can be compelling and exciting
  • Github and Bitbucket have shown how to create repositories that people want to put things in because these repos do something useful
  • New lightweight tools such as NoSQL, d3.js, and HTML5 have become available
  • (and in Open Knowledge Foundation) we see the world outside academia adopting new ideas by the week.

So we can’t tell where and how the new things will happen. Something that looked impossible 2 years ago may now be very tractable. Glueing distributed systems together is far easier than it used to be.

So a distributed system is now a positive asset, not a problem to be solved by aggregation and central control. In the same way the communities can be glued by modern approaches and culture. That’s why I’m suggesting we should be distributed but communicating.

There are only a few basic rules:

  • Respect others
  • Try to work with people rather than compete
  • Keep everything completely open. An open API is problematic if the data can’t be dumped. Code relying on a closed component will crash when that component disappears.
  • Creating and giving are critically important. Some jobs are boring, tedious and necessary. We must find social ways of making them worthwhile.

So we can have more than one discussion list. More than one wikipage. Let’s first see what we can offer rather than what we want to accomplish. (Doesn’t have to be gold-plated.) And make these creations and their creators easy to find.

To start the process here’s some of what I and my collaborators can Openly offer:

  • A PDF2XHTML converter for scholarly articles and converters
  • Pubcrawler to discover and collect bibliographic metadata
  • Semantic scientific units of measurement
  • Semantic tools for physical science (especially chemistry) (useful for indexing and transforming)

So let’s see what we want to bring to and get from our marketplace of tools and ideas.

 


A scholarly rather sick Puzzle

What’s this? It’s topical. It’s 20% of a page.

It is possible to answer this from the information given. Why is 20% important? And why am I angry?

Answer tomorrow.


#scholrev: #BTPDF2, FORCE11, Current position and ways forward

I’m blogging some of my ideas about the Scholarly Revolution (#scholrev) and how it should proceed. I’ve already said I think it should be decentralised and I’ll explain what that means and why. I’m going to concentrate on the underlying social and political aspects – the technologies will follow. But first I am recapping where it started.

To recap, we gathered for an ad hoc meeting at Beyond the PDF 2 at lunchtime on Wednesday (2013-03-20) and we’ve been blogging and tweeting since then. #scholrev would not have happened without the wider meeting which Maryann Martone has described at http://www.force11.org/node/4326 .

I think that I share with many the recollection that BtPDF1 was a unique and transformative event. It was the first venue where many different groups with clearly a lot of pent up frustration with the current state of scholarly communication and a lot of tools and ideas for moving us beyond the pdf (including new types of pdf’s) came together. Unlike most conferences where there were a few polite questions, the discussion was lively and uninhibited. I’d been to conferences where hash tags were posted, but few people used them beyond a few graduate students. Here the twitter stream regularly exploded and discussion lists were used well before and after the conference. Many of the audience were clearly masters of new modes of feedback and communication and weren’t afraid to use them. Indeed, it was the level of enthusiasm and the quality of the discussion that led to formation of FORCE11, because we wanted a vehicle for capturing and focusing the energies on display. FORCE11 and its Manifesto was produced by the follow up workshop at Dagstuhl later that year. But I consider the first BtPDF conference the beginning of the movement, if we can call it one.

PMR: Agreed. I was at BTPDF1 and felt it was transformative and exciting. I wasn’t at Dagstuhl so I can’t comment on the atmosphere. But yes, I hoped that something would come from BTPDF1 and looked for it in BTPDF2.

Looking over the program from the first BtPDF, we clearly continue to struggle with some of the same issues: semantic mark-up, authoring tools, data, nano-publications. But a lot of work has been done and a lot of progress made. … Open Access is being openly debated and supported by funding agencies, institutional repositories and researchers.

PMR: I was particularly concerned about authors at BTPDF1 and the way they are treated in the current system. They have no effective voice and are largely pawns in any debate of #openaccess (and there is little constructive debate). The authors should be a major part of new communications, as should the readers and both groups have been marginalised. Personally I feel no excitement for the current approaches (“Green” and “Gold”) which both allow injustice, vested interests and massive waste to continue.

So are we done yet?  I would say no.  By and large, I would say, we still have failed to deliver tools and convincing use cases to the larger scholarly community, who are still locked in old modes of publishing and evaluation.  All one has to do to have one’s enthusiasm on the state of scholarly publication dampened is to sit on a promotion committee or a meeting of an editorial board. 

And that’s the problem. Put simply, while rich (Northern) academics debate evaluation, publications are still locked by publishers.

And lack of publications means people die. I’ve said that before and Eve Gray said it very clearly at the meeting. The first three presentations at BTPDF2 addressed the inequities, but the meeting slipped into cosy introspection during the rest. The meeting should have been angry at injustice. It wasn’t.

Data:  where to put it, what to do with it, when to do it, and who will do it, still looms over everything. 

YES. And unless we do something different we’ll end up with the same mistakes – data publication controlled by vested interests. (Some weeks ago a librarian came to me and said: “Isn’t it wonderful, we can now buy a data citation index”. I screamed).

The scholarly corpus in biomedical science is still fractionated, with no global access to the entire biomedical literature by automated agents. The inefficiencies of spending large amounts of time and money to turn complex research objects into digestible narratives and then an equally large amount of money trying to extract and recover the research objects from the narrative still need to be overcome. And, as will be explored in the business case, we still haven’t figured out the model that will pay for it all.

Biomedical science would be automated if it was legally allowed by the publishers (I sit on EuropePMC and we could index the whole literature technically.) Get angry, for goodness sake!

But I am confident that change is a comin’ and I look forward to BtPDF2 as an incubator and catalyst for that change. 

I looked for fundamental changes at BTPDF2 and I didn’t see them. A great deal of incremental stuff about how we could tinker with the current system. Very little about its fundamental sickness. About how we could revise the scholarly monograph (i.e. books) – we had a good lead on that but no follow-up. Well the world is reinventing the book and academics don’t seem to realise that books are for reading, not primarily for generating an ivory tower reputation.

Which is why we so rapidly gathered a group under the banner of “Scholarly Revolution”. I don’t know whether BTPDF2 will generate revolution, but it’s got to start doing it soon or not at all. It needs to tap into the twenty-first century and here are some ideas:

  • Make change, don’t just talk about it. I’m now so used to hackathons that I find 2 days sitting and listening to people talking makes my fingers twitchy. I and others said that next time there must be a hackathon where we create something new.
  • Bring in the outside world and listen to them. Academia is behind the times, not in front of them. At our hackdays we get journalists, medics, banks, creatives, central and local government and much more. They’re not hung-up about impact factors – they want to see information developed into communal knowledge. Tools that promote democracy. New ways of working.
  • Trust the young. The new world is a young world, not a continuation of the existing one. I was pleased to see special representation of young people and I hope they are brave enough to say what they want.
  • Fight injustice. The current system is seriously unjust – to the world.

I’m grateful to #BTPDF2 organizers for the meeting. Maryann has rightly asked that we link to http://force11.org and they have highlighted #scholrev. I’m very happy for FORCE11 to provide resources for #scholrev. But they will only keep connected if they each tap into the other’s social and political dynamics.


#scholrev: Decentralized Open infrastructure: an example from The Blue Obelisk

I have suggested that the Scholarly Revolution should be decentralized and communicating and this post gives an example of why and how. “Decentralized” means that no one person or subgroup is critical to its operation and more importantly its continued operation. It also means that we do not have to agree on everything (and we certainly shan’t – the mess in “Open Access” should be a clear warning). We should not have to rely on key components – for example building the roads before the houses and then finding people don’t want to live where the roads are but where they can cross the river.

The good news is that information infrastructure can be very cheap and – certainly at an early stage – can be radically altered (refactored) if the community wants. The key thing is COMMUNICATION. As long as we know what other people are doing and saying many of the difficulties are solved.

So here’s an example of a bottom-up community that works. It costs me 20 pounds a year to run – that’s less than a dinner. It’s growing and it’s changing the world of chemical science. It would continue to run and flourish if I weren’t able to be involved. Everyone has their own homestead (http://en.wikipedia.org/wiki/Homesteading_the_Noosphere ) but there is also a commons. It’s a bazaar (http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar – if you don’t know this, read it – it’s Open). There are other similar metaphors – “cooperative”, “tietotalkoot” (http://p2pfoundation.net/Rural_Cooperation_and_the_Online_Swarm ), “marketplace”, etc.

I’ve written it as part of a chapter we’ve offered for the http://www.openforumacademy.org/ because it’s more important to spread new ideas than gain impact factor.

Bottom-up Open Chemistry – the Blue Obelisk

 

Chemical software and data is a major activity, almost certainly exceeding 1 billion USD per year. But almost all of it is Closed, represented mainly by domain-specific software companies and traditional STM publishers. This is often aggressively protected; when the NIH set up an Open[*] database of chemicals and compounds the American Chemical Society (ACS) lobbied politically to have this curtailed and threatened Wikipedia with legal action for publishing the widely used CAS identifiers for chemicals. A major software producer will take legal action against licensees who publish program output, including bugs.

A number of independent, often unfunded, chemical hacker activities grew up during the 1990s and by 2000 a handful of codes were available but there was little continuity or coordination. We used to meet occasionally at ACS meetings and in 2005 we met in a bar near the large Blue Obelisk in Horton Plaza, San Diego. We felt that we had a consensus of philosophy, that the world undervalued our software and that we had the potential to change the future. We then agreed to loosely coordinate (not pool) our efforts. I suggested the name “Blue Obelisk” and our mantra ODOSOS – “Open Data, Open Standards, Open Source”. To support this we created a Wiki, a mailing list and agreed to meet for dinner whenever we had a critical mass. There is no budget, no membership, no formal mechanisms – the mantra is our collective and very powerful DNA.

This has proved extremely successful and might work in other disciplines. We have about twenty projects which are happy to be counted as Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk ) and which fit into our criteria of ODOSOS. Our dinners are open to all – and closed source providers have attended and been relaxed. In 2006 we published a paper outlining our components. Recently we reviewed this in a 2011 paper with about 20 groups as authors.

When a person or organization does something meritorious (normally an identifiable software product or data resource) I award a quartz Blue Obelisk (remarkably these are common and inexpensive). These loose traditions work. We now have software components in most of the chemical infrastructure for pharmaceuticals and increasingly in materials. The biggest problem is data – chemists do not publish machine computable data (though they should), instead embedding a subset in formal (Closed Access) publications. We have machine extraction software but risk being prosecuted for extracting data.

Governance is minimal and we have been blessedly spared from either factionalism or imperialism. Each project is self-contained but uses other B/O libraries where possible or more recently runs them as web services. The main language is Java, followed by Python and C(++) – with some historical FORTRAN. There is generally a leader to each project and while the Benevolent Dictator for Life (BDFL) occurs, the commonest model is “Doctor Who”, where the Doctor hands on to a successor at irregular intervals.

Originally dismissed as cranks, we are now taken seriously. Companies (e.g. Kitware, NY, and CCG) contribute significant amounts of code and, as importantly, the critical mass of internal and external confidence. National labs (e.g. PNNL in US) have been awarded a Blue Obelisk for collaborating on Open Source. We know that our code is widely used in pharma companies but we have few metrics (a common problem of Open Source in secretive industries).

As with all volunteer Open Source projects we do not have clear timelines, but progress over the last 5 years has been very good. It’s possible to find high-quality components in most subdomains, including unit and regression testing.

The main problem we face is that chemistry (surprisingly) often does not engineer its own solutions but prefers to buy them. This puts a value on shrink-wrapping and hand-held maintenance which gratis Open Source cannot easily provide. Academics producing new code often get little credit and it’s worse when they reengineer existing solutions, even when the result is markedly superior. It’s also difficult to get funding (“it’s a solved problem”). The fragmented nature of the commercial domain makes semantic interoperability very difficult – companies protect legacy walled garden approaches. The internal messes created by unvalidated variants of legacy files in the pharma industry (e.g. when the result of a merger requires data reconciliation) have probably cost well over 100 million dollars in human effort, while the B/O could have provided common semantics.

However I think we are approaching a breakthrough. Chemical software has made few objective advances in the last 10-15 years so that we now have implemented most of the major algorithms. For an organization which takes a responsible view of costs and values innovation, the Blue Obelisk can be an attractive part of a solution.

References

http://en.wikipedia.org/wiki/Avogadro_%28software%29 [CC-BY-SA]

^ Guha, R; Howard, MT; Hutchison, GR; Murray-Rust, P; Rzepa, H; Steinbeck, C; Wegner, J; Willighagen, EL (2006). “The Blue Obelisk-interoperability in chemical informatics”. Journal of Chemical Information and Modeling 46 (3): 991–8. doi:10.1021/ci050400b. PMID 16711717. [for bean counters: cited 281]

^ O’Boyle, N; Guha, R; Willighagen, EL; Adams, SE; Alvarsson, J; Bradley, JC; Filippov, IV; Hanson, RM et al. (2011). “Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on”. Journal of Cheminformatics 3. doi:10.1186/1758-2946-3-37. PMC 3205042. PMID 21999342.

 

 
