#rds2013 Managing Data and Liberation Software; we must remember Aaron Swartz

Something is seriously wrong with our current values in academia.

 

The world is changing, and the tension between digital openness and digital possession-for-power-and-gain grows stronger daily. We cannot manage research data unless we manage our values first. In January this year the broken values were made very public with the death of Aaron Swartz.


[picture and text from EFF https://www.eff.org/deeplinks/2013/01/farewell-aaron-swartz ]

 

Aaron did more than almost anyone to make the Internet a thriving ecosystem for open knowledge, and to keep it that way. His contributions were numerous, and some of them were indispensable. When we asked him in late 2010 for help in stopping COICA, the predecessor to the SOPA and PIPA Internet blacklist bills, he founded an organization called Demand Progress, which mobilized over a million online activists and proved to be an invaluable ally in winning that campaign.

 

I’ve blogged before on Aaron. I never met him, though I know people who did. I can’t write with authority, so I’ll quote Tim Berners-Lee, who tweeted:

“Aaron dead. World wanderers, we have lost a wise elder. Hackers for right, we are one down. Parents all, we have lost a child. Let us weep.”

 

And I had the privilege of hearing Tim speak two weeks ago, and the central, passionate, coherent part of his speech was about Aaron.

Why does this matter to #rds2013? Because if we take the simple step of recognizing that knowledge must be Open, and fight daily to make it happen – politically, socially, technically, financially – then the tools, the protocols, the repositories, the ontologies follow.

I’ve said that access to public knowledge is a fundamental human right. And now I find myself in remarkably strange company – with Darrell Issa [the proposer of the anti-NIH Research Works Act] [I don’t understand US politics]:

“[Aaron] and I probably would have found ourselves at odds with lots of decisions, but never with the question of whether information was in fact a human right … Ultimately knowledge belongs to all the people of the world — unless there’s a really valid reason to restrict it.” [PMR emphasis]

If we take that axiom, then we have to build the global knowledge commons. It’s an imperative. And tomorrow I shall announce my own, initially very small, tribute to Aaron. I’ll finish with [parts of] his Guerilla Open Access Manifesto [2008]. Ultimately this is not about technology; it’s about fairness and justice. [my emphases]

Information is power. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations. Want to read the papers featuring the most famous results of the sciences? You’ll need to send enormous amounts to publishers like Reed Elsevier.

 

Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It's outrageous and unacceptable.

"but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it's perfectly legal — there's nothing we can do to stop them." But there is something we can, something that's already being done: we can fight back.

Those with access to these resources — students, librarians, scientists — you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out.

But sharing isn't immoral — it's a moral imperative. Only those blinded by greed would refuse to let a friend make a copy.

Large corporations, of course, are blinded by greed. The laws under which they operate require it — their shareholders would revolt at anything less. And the politicians they have bought off back them, passing laws giving them the exclusive power to decide who can make copies.

There is no justice in following unjust laws. It's time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.

 

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that's out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access.

 

With enough of us, around the world, we'll not just send a strong message opposing the privatization of knowledge — we'll make it a thing of the past. Will you join us?

 

PMR: I join. #ami2 is developed as Liberation Software. We need software just as revolutions need arms and transport. Liberation software is designed, at least in part, to make knowledge free.

And “free” means completely free – free-as-in-speech and free throughout the world.

*** And completely coincidentally, the following news broke just after I had written this: http://crooksandliars.com/emptywheel/doj-used-open-access-guerilla-manifesto#comment-2225359

It appears that, only by researching the Manifesto, a First Amendment protected publication that largely espoused legal information sharing, did the government even get around to treating [the JSTOR action] as a crime.

 


#rds2013: Why academia must look outward; “closed data means people die”

@McDawg (Graham Steel), indefatigable fighter for openness and patients’ rights, has just blogged a powerful story (http://figshare.com/blog/Open_Access_Is_Not_Just_For_Scientists_It%27s_For_Everyone/72 ) of how a teenager made a medical breakthrough despite the publishing industry’s paywalls. From Jack Thomas Andraka:

 

“After a close family friend died from pancreatic cancer, I turned to the Internet to help me understand more about this disease that had killed him so quickly. I was 14 and didn’t even know I had a pancreas but I soon educated myself about what it was and started learning about how it was diagnosed. I was shocked to discover that the current way of detecting pancreatic cancer was older than my dad and wasn’t very sensitive or accurate. I figured there had to be a better way!”

He began to think of various ways of detecting and preventing cancer growth, and of terminating the growth before the cancer cells become pervasive. Andraka’s breakthrough nearly didn’t happen: he asked around 200 scientists for help with his research and was turned down every time. (PMR: Of course lots of professional scientists get turned down all the time, and I admire Jack’s perseverance.)

Luckily he eventually established contact with Dr. Anirban Maitra, a Professor of Pathology and Oncology at Johns Hopkins University, who provided him with lab space and served as a mentor during the test’s development.


“This was the [paywall to the] article I smuggled into class the day my teacher was explaining antibodies and how they worked. I was not able to access very many more articles directly. I was 14 and didn’t drive and it seemed impossible to go to a University and request access to journals”.

In an interview with the BBC, Andraka said the idea for his pancreatic cancer test came to him while he was in biology class at North County High School, drawing on the class lesson about antibodies and the article on analytical methods using carbon nanotubes he was surreptitiously reading at the time. Afterwards, he followed up with more research using Google Search on nanotubes and cancer biochemistry, aided by online Open Access scientific journals.

Earlier this month, Andraka had a guest post published on the PLOS Student Blog entitled Why Science Journal Paywalls Have to Go.

Can you read the next sentence without a feeling of anger?

“I soon learned that many of the papers I was interested in reading were hidden behind expensive paywalls. I convinced my mom to use her credit card for a few but was discouraged when some of them turned out to be expensive but not useful to me. She became much less willing to pay when she found some in the recycle bin!”

An encapsulation of why we are suffering. No-one should have to buy articles to find out they aren’t any use (and, if you have just come into this area, many papers cost over 50 USD).

Also earlier this month, Andraka was invited to the State of the Union Address, where he met and spoke with President Obama and Michelle Obama about his work. See ‘Mr. Speaker, The President of the United States…and Jack Andraka!‘ for more details.

“Open access would be an important first step. I would love to see research that is publicly funded by taxes to be publicly available. It would make it so much easier for people like me to find the information they need. If I can create a sensor to detect cancer using the Internet, imagine what you can do”.

And meanwhile academics and publishers collude in paying huge amounts of money into the Academic–STM-Publisher complex, which generates reputations for academics by preventing access to publicly funded information. I’ve written earlier that “Closed Access means people die”. There was a howl of protest from some academics – I didn’t have evidence. Well, Jenny Molloy did. And the statement is self-evidently true: one story like this should be enough.

Closed data from public research is immoral, unethical and unacceptable. Don’t argue. Just fix it. (and I shall do my best).


 


#rds2013 Current Problems in Managing Research Data

I am going through the various sections in my presentation to http://cdrs.columbia.edu/cdrsmain/2013/01/esearch-data-symposium-february-27-2013/ . I’ve got to “Problems in Managing Research Data”. Warning: This section is uncomfortable for some. In rough order (I might swap 1 and 2):

 

  • Vested Commercial interests. There are at least these problems:
    • STM publishers. I’ll concentrate on this, because until we have an Open Data Commons we can’t work out how to manage it. STM publishers not only stop me and others getting the data, they stifle innovation. That leaves STM about 15 years behind commerce, civics, and the Open movement in terms of technologies, ontologies, and innovation.
    • Instrument manufacturers. Many instruments produce encrypted or proprietary output which cannot be properly managed. In many cases this is deliberate, to create lock-in.
    • Domain software. Some manufacturers (e.g. of computational chemistry software) legally forbid the publication of results, probably to prevent benchmarking of performance and of the correctness of the science.
    • Materials. Many suppliers will not say what is in a chemical, what the properties of a material are, etc. You cannot build ontologies on guesswork, nor create reliable metadata.
  • Academic apathy and misplaced values. I continue to be appalled by the self-centeredness of academia. The debate on data is “how can my data be given metrics”, not “how can I make data available for the good of humanity”. Yes, I’m an idealist, but it hasn’t always been this way. It’s possible to do good scholarship that is useful and that is recognised. But academia is devising systems based on self-glorification. With different values, the publisher problem would disappear. The Super Happy Block Party Hackathon (Palo Alto) shows how academia should be getting out and working for the community.
  • Intrinsic difficulty. Some research data is hard – a lot of bioscience, for example. But the bioscientists solve that by having meetings all the time on how to describe the data. You can’t manage data that you can’t describe. I’ve been working with Dave M-R on integrating the computational declarative semantics of chemistry and mathematics. That’s completely new ground and it’s hard. It’s essential for reproducible computational chemistry (a billion-USD+ activity). Creating chemical ontologies (ChemAxiom, Nico Adams) is hard. Computational ontologies (OWL) stretch my brain to its limits. Materials science is hard. To understand piezoelectricity you have to understand a 6×6 tensor.

    But that’s what the crystallographers have been doing for 30 years. And they have built the knowledge engines.

  • Finance. The least problem. If we want to do it, the cost is a very small proportion of total research funding, and a minuscule amount of what we pay the STM publishers. OpenStreetMap was built without a bank balance.

It’s simple. You have to WANT TO DO IT. The rest follows.


#rds2013: My reply from Elsevier on publishing supplemental data

Two weeks ago I wrote to Elsevier’s Director of Universal Access about making research data Openly available: /pmr/2013/02/11/i-request-elsevier-to-make-experimental-data-cc0-and-release-crystallography-from-ccdc-monopoly/ – the title is fairly self-explanatory. I have just got her reply, which I publish in full below. My request was “rather long and involved” and “its description of various events do not always align with ours”. (The latter statement is meaningless, as there were no events involved.) To save readers referring back, the essence of my mail was that:

Many closed-access publishers (ACS, RSC, Nature) publish authors’ supplemental data under an apparently licence-free/PD/CC0 approach (though not always explicitly). Elsevier either puts data behind a paywall or sends it to the closed database of CCDC, based on a subscription model. I asked (I thought clearly):

I am therefore asking you to do the following:

  • Announce that all supplemental data accompanying Elsevier papers IS licensed as CC0.

  • Require the CCDC to make all primary CIF data from Elsevier publications CC0 (the authors’ raw depositions, not CCDC’s derivative works).

  • Extend this policy to all other experimental data published in Elsevier journals (in chemistry this would be records of synthesis, spectra, analytical data, computational chemistry, etc.). When you agree to this I can give public advice as to the best way to achieve it.

I leave you to judge whether Elsevier has answered any of my requests or (as I read it) sidestepped them and added a list of platitudes. But I am probably very slightly biased, as I have tried for 4 years to get straight answers out of Elsevier. I might just fail to hide my bias when I speak tomorrow.

Dear Peter,

Thank you for your message.  It is rather long and involved, and its description of various events do not always align with ours, but it is an important issue that you raise and I am very happy to respond on behalf of Elsevier. Datasets are sometimes published as supplementary information to journal articles. Authors provide Elsevier with only a non-exclusive license to publish/promote these supplementary datasets and so only the authors can decide to use a CC0 license for these datasets.   

This having been said Elsevier shares your vision for open data and a future in which data are much more broadly managed, preserved, and reused for the advancement of science.  Professional curation and preservation of data is, like professional publishing, neither easy nor inexpensive.  The grand challenge is to develop approaches that maximise access to data in ways that are sustained over time, ensure the quality of the scientific record, and stimulate innovation.

Here at Elsevier we:

  • believe rich interconnections between publications and scientific data are important to support our customers to advance science and health
  • work with others to identify, if needed develop, and deploy standard approaches for linking publications and data.
  • encourage authors to document their data and to deposit their data with an appropriate open
    data centre or service and to make their data available for reuse by others, ideally prior to publication of articles based on analysis of these data, and with a permanent standard identifier to link from the publication to the dataset. 
  • recognise that scientists’ invest substantially in creating and interpreting data, and their intellectual and financial contributions need to be recognised and valued
  • believe data should be accompanied by the appropriate metadata to enable it to be understood and reused. 
  • help to communicate the benefits of data curation and reuse for different stakeholders in the scholarly communication landscape including authors, funders, publishers, researchers, and university administrators.
  • encourage authors to cite datasets that have been used in their research and that are available for reuse via a data curation center or service.
  • deploy our expertise in certification, indexing, semantics, and linking to add value to data
  • champion the importance of long term preservation of data, and accreditation systems/standards for digital curation services. 

You and your readers might find this short video by my colleague, IJsbrand Jan Aalbersberg, of interest.  It is a 5-minute flash presentation from a recent STM Innovation seminar on this topic: http://www.youtube.com/watch?v=3KuBToc4Nv0 .

Last but not least, our policies in this space are similar to those of other publishers.  There are two industry position statements that many of us adhere to, and which your readers may find of interest.  They are: http://www.stm-assoc.org/2006_06_01_STM_ALPSP_Data_Statement.pdf and http://www.stm-assoc.org/2012_12_04_STM_on_Data_and_IP_For_Scholarly_Publishers.pdf

In closing, we at Elsevier welcome your thoughts and are committed to working with researchers to realize our shared vision for open data.  I will post this response to your blog comment stream as well.

With very kind wishes,

Alicia

 
 

Dr Alicia Wise

Director of Universal Access

Elsevier | The Boulevard | Langford Lane | Kidlington | Oxford | OX5 1GB

M: +44 (0) 7823 536 826 | E: a.wise@elsevier.com

Twitter: @wisealic

 
 


#rds2013: #okfn Content-mining: Europe MUST legitimize it.

I’m on an EC committee looking at how to make content available for mining. (At least I thought that was the point – it seems it isn’t).

“Licences for Europe –A Stakeholder Dialogue”

Working Group 4: Text and Data Mining

Unfortunately I haven’t been able to attend the first meeting as I have been in Australia, but @rmounce has stood in and done a truly exceptional job. The WGs are looking at licences; WG4 is on content mining. Ross reported back on Saturday and was disappointed. It seems that WG4 has been told it has no course of action other than to accept that licences are the way forward.

This is unacceptable in a democratic system. It is difficult enough for us volunteers to compete against the rich media and publisher community. If I go to Brussels I have to find the money. These WGs meet monthly. That’s a huge personal cost in time and money. The asymmetry of fighting for digital rights is a huge burden. Note also that it’s a huge drain in opportunity costs: rather than writing innovative code we have to write letters to Brussels. And that’s what we have done (I’m not on the letter, but I would have been). Here’s our letter:

http://www.libereurope.eu/news/licences-for-europe-a-stakeholder-dialogue-text-and-data-mining-for-scientific-research-purpose

Quotes:

We write to express our serious and deep-felt concerns in regards to Working Group 4 on text and data mining (TDM).  Despite the title, it appears the research and technology communities have been presented not with a stakeholder dialogue, but a process with an already predetermined outcome –namely that additional licensing is the only solution to the problems being faced by those wishing to undertake TDM of content to which they already have lawful access. Such an outcome places European researchers and technology companies at a serious disadvantage compared to those located in the United States and Asia.

 

The potential of TDM technology is enormous. If encouraged, we believe TDM will within a small number of years be an everyday tool used for the discovery of knowledge, and will create significant benefits for industry, citizens and governments. McKinsey Global Institute reported in 2011 [1] that effective use of ‘big data’ in the US healthcare sector could be worth more than US$300 billion a year, two-thirds of which would be in the form of a reduction in national health care expenditure of about 8%. In Europe, the same report estimated that government expenditure could be reduced by €100 billion a year. TDM has already enabled new medical discoveries through linking existing drugs with new medical applications, and uncovering previously unsuspected linkages between proteins, genes, pathways and diseases [2]. A JISC study on TDM found it could reduce “human reading time” by 80%, and could increase efficiencies in managing both small and big data by 50% [3]. However at present, European researchers and technology companies are mining the web at legal and financial risk, unlike their competitors based in the US, Japan, Israel, Taiwan and South Korea who enjoy a legal limitation and exception for such activities.

Given the life-changing potential of this technology, it is very important that the EU institutions, member state governments, researchers, citizens, publishers and the technology sector are able to discuss freely how Europe can derive the best and most extensive results from TDM technologies. We believe that all parties must agree on a shared priority, with no other preconditions – namely how to create a research environment in Europe with as few barriers as possible, in order to maximise the ability of European research to improve wealth creation and quality of life. Regrettably, the meeting on TDM on 4th February 2013 had not been designed with such a priority in mind. Instead it was made clear that additional relicensing was the only solution under consideration, with all other options deemed to be out of scope. We are of the opinion that this will only raise barriers to the adoption of this technology and make computer-based research in many instances impossible.

We believe that without assurance from the Commission that the following points will be reflected in the proceedings of Working Group 4, there is a strong likelihood that representatives of the European research and technology sectors will not be able to participate in any future meetings:

  1. All evidence, opinions and solutions to facilitate the widest adoption of TDM are given equal weighting, and no solution is ruled to be out of scope from the outset;
  2. All the proceedings and discussions are documented and are made publicly available;
  3. DG Research and Innovation becomes an equal partner in Working Group 4, alongside DGs Connect, Education and Culture, and MARKT – reflecting the importance of the needs of research and the strong overlap with Horizon 2020.

The annex to this letter sets out five important areas (international competitiveness, the value of research to the EU economy, conflict with Horizon 2020, the open web, and the extension of copyright law to cover data and facts) which were raised at the meeting but were effectively dismissed as out of scope. We believe these issues are central to any evidence-based policy formation in this area and must, as outlined above be discussed and documented.

We would be grateful for your response to the issues raised in this letter at the earliest opportunity and have asked susan.reilly@kb.nl (Ligue des Bibliothèques Européennes de Recherche) to act as a coordinator on behalf of the signatories outlined below.

 
 

Participants:

Sara Kelly, Executive Director, The Coalition for a Digital Economy

Jonathan Gray, Director of Policy and Ideas, The Open Knowledge Foundation

John McNaught, National Centre for Text Mining, University of Manchester

Aleks Tarkowski,  Communia

Klaus-Peter Böttger, President, European Bureau of Library Information and Documentation Associations (EBLIDA)

Paul Ayris, President, The Association of European Research Libraries (LIBER)

Brian Hole, CEO, Ubiquity Press Ltd.

David Hammerstein, Trans-Atlantic Consumer Dialogue 

 
 

PMR: I and colleagues are now technically able to mine the scientific literature in vast amounts. #ami2 takes about 2 seconds per page on my laptop. Given 1 year × 10 million papers × 10 pages, that’s 2.0E+8 – 200 million – CPU-seconds. That means a handful of CPUs – a trivial amount – can mine and index this data at the rate it appears, and we get machine-readable tables, graphs, trees, chemistry, maps and masses else. It’s a revolution.
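To make the arithmetic explicit, here is a minimal back-of-envelope sketch; the 2 seconds/page is the measured #ami2 figure, the rest are the round numbers above:

```python
# Back-of-envelope: CPUs needed to mine a year's scientific literature.
SECONDS_PER_PAGE = 2           # measured #ami2 throughput on a laptop
PAPERS_PER_YEAR = 10_000_000   # rough size of the annual literature
PAGES_PER_PAPER = 10

cpu_seconds = SECONDS_PER_PAGE * PAPERS_PER_YEAR * PAGES_PER_PAPER
seconds_per_year = 365 * 24 * 3600        # ~3.15e7 wall-clock seconds

print(f"{cpu_seconds:.1e} CPU-seconds per year")            # 2.0e+08
print(f"{cpu_seconds / seconds_per_year:.1f} CPUs needed")  # ~6, i.e. a handful
```

A few commodity CPUs running flat out keep pace with the entire published literature.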

I am legally allowed to read these papers.

But if I try to mine them I will be sued.

The planet and humanity desperately need this data. It does not belong to “publishers”. It’s the world’s right to mine this.

 


#rds2013: Managing Research Data : Ideas from Ranganathan

Ranganathan is one of the great visionaries of the C20: over 80 years ago he created the http://en.wikipedia.org/wiki/Five_laws_of_library_science. These are as true today as ever. I’ve urged libraries and academics to understand the true point of Ranganathan’s laws – they aren’t business rules, they are rules for a fair social system for information. In their simple form:

  1. Books are for use.
  2. Every reader his [or her] book.
  3. Every book its reader.
  4. Save the time of the reader.
  5. The library is a growing organism.

I spoke on these 3 years ago and proposed 12 action points – somewhat off the cuff. There’s a good report here: http://frommelbin.blogspot.com/2009/10/peter-murray-rusts-12-point-action-plan.html . I hoped they might spark discussion – but very little came back to me (Melbin says “I am surprised that there has not been more debate about his address”). Here are the 12 points, in no particular order.

1. We should act as citizen librarians towards a common or shared goal.

2. Post all academic output publicly: ignore Copyright.

3. Text mine everything.

4. Put 2nd year students in charge of developing educational technology resources

5. Actively participate in obtaining science grants

6. Actively participate in the scientific publishing process.

7. Close the science library and move it all to the departments.

8. Hand over all purchasing to national Rottweiler publishing officers.
9. Set up a new type of university press.

10. We should develop our own metrics system.

11. We should publicly campaign for openness.

12. We should make the library an addictive “game”.

Most are still true, though given the lack of response I think I’d regard 9 as a lost cause. I wrote 2 before I knew of Aaron Swartz. 1, 3, 4, 10 and 11 are key. Academic libraries have very little time left: 5, 6, 7, 8 and 12 will be irrelevant if we have no libraries. So here’s another interpretation of Ranganathan in the data age.

  1. Data belongs to the world. We are on a sick planet and data is a critical part of any solution. Data should not belong to people or institutions but to the people of the world and their machines.
  2. Data is for use. I wish this was self-evident.
  3. Every reader their data. I don’t have a good modern word: I am using “reader” to encompass humans and machines. This means that a reader should be able to access any data they need.
  4. Every data its reader. This means that there is potentially at least one person/machine interested in data that you might produce.
  5. Save the time of the reader. Make it as easy as possible to discover, understand and use data. Make it as easy as possible to create data.
  6. The data community is a growing organism. This is excitingly true, though not generally in Universities.

The word “reader” is asymmetric, so I’d like to add another law, such as:

  7. Every reader is an author and every author a reader. This was not true in Ranganathan’s time – books were physical objects requiring much effort. But now everyone can take part at every level.

#ami2 @rmounce gets AMI2 award for Liberation Software

#ami2 is our program for liberating scientific content by transforming PDFs into semantic documents. People who make important contributions to this – code, data, testing – are awarded little AMIs. Murray Jensen has written critical code for AMI and got a little AMI two months ago. Ross Mounce has done a huge amount in providing data and testing: over a thousand PDFs (yes, we are all legally allowed to read them). Here’s the award at #opendataday @okfn in London yesterday.


#rds2013: We should demand* a Global Knowledge Commons

I am developing ideas for my 15-minute presentation at #rds2013 http://cdrs.columbia.edu/cdrsmain/2013/01/esearch-data-symposium-february-27-2013/ . I spent yesterday at a wonderful hackathon run by OKF (Ross Mounce and Rufus Pollock) as part of the worldwide OpenDataDay. http://okfn.org/events/open-data-day-2013/

(Hacking for Health, photo: Ross Mounce)

http://opendataday.org/ asks “Who is the event for?” and answers:

EVERYONE

If you have an idea for using open data, want to find an interesting project to contribute towards, learn about how to visualize or analyze data or simply want to see what’s happening, then definitely come participate! No matter your skillset or interests, we are encouraging organizers to foster opportunities for you to learn and help the global open data community grow.

Developers

We need computer cowboys and cowgirls like yourself to wrangle data into something useful. That means visualization, notification, integration, etc., all in the name of doing something crazy and fantastic.

Designers

We need people like you to make everything look amazing, feel intuitive, and have a smooth user experience. The best application in the world that no one can use… isn’t much use! You know the drill.

Librarians

I heard you folks like books and eat catalogs of data for breakfast. You beautiful people are going to scour the earth for interesting data, help the rest of us figure out what’s important, and generally be useful.

Statisticians

YES! YOU ARE SO NEEDED. Seriously. While we can find it, blow it up, calculate it, and make it look pretty, we needs us some mean number crunchin’ to present meaningful visualizations. Join up.

Citizens

We need you the most. If it weren’t for you, this whole thing wouldn’t be happening. We need ideas, cheerleaders, and friends…

It was worldwide: countries, cities… We were collecting data on budgets, transport, everything. Here’s Palo Alto: http://data.cityofpaloalto.org/

The City of Palo Alto teams with Stanford University to complete the City’s first hack-a-thon. The challenge: build an application in twenty-four hours to utilize geographical information system data provided by the City.

So I’ve come up with a vision for managing research data.

We should create a Global Knowledge Commons

From that everything that I’m suggesting follows. Obviously there are pieces of our world and practice that should not be global. We’ve been discussing this on the OKF open-science list and I have changed my views as a result of this discussion. John Wilbanks (http://lists.okfn.org/pipermail/open-science/2013-February/002197.html ) makes it clear that we have a difficult balance between privacy and availability of information. He argues that we could start from a default of openness:

… the privacy element is a poor reason to argue against open data, and arguing the risks of secrecy as a counterargument is an elegant formation that I think we should embrace…

There is data that is difficult to manage – some is technical (size, complexity, semantics, provenance) and some carries societal and other problems. These add an order of magnitude greater effort.

But much data – probably the majority – can and should belong to the world. Reductionist bioscience, astronomy, crystallography have already made this commitment to a Global Knowledge Commons. So have Wikipedia and Open Streetmap. These communities have created a vision and the practice and tools have followed naturally. It’s happened in software, with Open examples such as Sourceforge, StackOverflow, Bitbucket, and many others created by the community. The community creates what the community wants, needs and values.

My assertion, therefore, is that a Global Commons will drive what we need. It will allow the innovation and the diversity that iterate towards common practice. “Managing research data” is then just part of creating the commons.

But academia, where most of the research funding goes, is behind the rest of the world in modern thinking. It was noticeable that at the Open Data Day with 40 participants there was no one from academia. Health, finance, government, transport, etc – committed and excited.

Academia needs to get out into the real world and invite the real world in. If not, it will not learn how to manage its data (OUR data) in a modern manner.

*title strengthened after Glyn Moody’s tweet!


#rds2013 Managing Research Data. “Where are we at? And who are ‘we’?”

I’m actively putting together what I intend to present at #rds2013 – Managing Research Data – on Feb 27th, next Wednesday. It’s difficult to present in 15 minutes something whose details need years to work out, and something where there is currently no clear consensus. In previous posts I have brain-dumped some of the things I want to say.

I am going to try a different slice through the problem. Even if I don’t use it, it’s helpful.

  • Where are we at?
  • What do we want to do?
  • What are the obstacles (UPDATE)
  • How can we do it?

I’ll deal with the first in this post…

  • Where are we at? And who are “we“?

     

    Domains. I’m going to make some restrictions – not to talk about BIG SCIENCE (e.g. the http://en.wikipedia.org/wiki/Square_Kilometre_Array ). Big Science plans its data capture, management, release and access. I’d like to know what percentage of a Big Science project’s spend is data management – it would be a very useful indicator. In the same vein I shan’t cover data on human subjects and other sensitive areas – that rules out much social science and medical research – and since I don’t practice either I’ll primarily stick to the long tail of the biosciences and physical science. I hope, however, that some of the principles are still valid and useful. Note that the long tail of science is very long – it’s not a second-class citizen. It’s just that its data are very badly managed at present. And I shall also stick to “potentially public data” – data which has an impact on publicly visible research, not necessarily publicly funded. So, if a pharma company does cheminformatics research and publishes the results, the data on which the research is based is potentially public data. There is also the requirement, IMO, to make available any data on which public decisions are to be made. If a drug company wants to license a drug, then the supporting data is potentially public data (PPD). There might be reasons for not publishing it, but they must be clear and independent.

     

    Players. We can list:

  1. Researchers. This is where the research originates. (Researchers want to be paid to do it | get the facilities to do it | tell the world about it | get credit for it | make money out of it – select one or more.) Data management is a chore – they don’t get paid for it; it’s tedious; there are no decent tools. Data management is left to the lab head, who is probably just too old to have any idea how to do it properly. Many researchers are highly protective of “their” data.
  2. Academia. They have a responsibility for managing the data in their institutions (the UK FOI Act has made that clear). They haven’t addressed the problem, and probably don’t know who or what could help. They don’t cost data. Almost all the scientific data their researchers collect is outside their control. When something goes wrong (e.g. ClimateGate) they pick up a lot of flak.
  3. Government. Increasingly realises data is important and makes noises about it. They make their own data public. They cost data. Probably manage it somewhere on the fairly-poor to fairly-good scale.
  4. Funders. They know data is important. Some manage it well and have data centres. Others leave it to researchers.
  5. Publishers. They generally haven’t much clue about managing data (there are honourable exceptions – IUCr, EGU). Many refuse to publish any data with their papers. Some have data collection and resale products. This is almost all after-the-fact. Some would love to own and control this space just as they have with publishing – after all, academia doesn’t know the value of its output; getting them to give it away and buy it back would work very nicely here.
  6. Informatics industry. Big ones (Google, Microsoft, Yahoo, probably IBM, SAS) not interested – small fragmented market. Real scope here for newcomers to possess and resell data.
  7. Science Industry. Does data quite well. Recognizes it as a cost and an asset. Often secretive (e.g. pharma) so incurs major inefficiency costs as there is no pre-competitive support and many software companies are poor.
  8. Citizens. Excluded by academia. Many are committed to new generation approaches (Wikipedia, Open Streetmap). They contribute a lot and get little in return.

 

Scale. Enormous. I calculate publicly funded research is of the order of 300,000,000,000 USD/year – maybe 500 billion if you include work published by industry and government. Assume 10% of this is data management, and assume half of that is Big Science: we still end up with 15 billion USD per year for data in the long tail. That 15 billion is an unrecognised and unsupported cost. It should be recognised – it’s not responsible just to hope it will manage itself on a PC and Excel.
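A minimal sketch of that sum, using only the assumptions just stated (the 10% data share and the fifty-fifty Big Science split are the text’s assumptions, not measurements):

```python
# Rough cost of long-tail research data management (assumptions from the text).
public_research_usd = 300e9    # publicly funded research/year; ~500e9 with industry/government
data_fraction = 0.10           # assume 10% of spend is (or should be) data management
big_science_share = 0.5        # assume half of that sits in planned Big Science

long_tail_usd = public_research_usd * data_fraction * (1 - big_science_share)
print(f"Long-tail data management: {long_tail_usd / 1e9:.0f} billion USD/year")  # 15
```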

 

What’s the Value? The human genome had a multiplier of 180: for each dollar put in, the human race got 180 dollars back. That won’t apply to all data, but I’d settle for a multiplier of at least 5. So if a project costs 200,000 USD it generates 1 million in value. Of that, at least 10% is data – i.e. 100,000 USD. So a project that fails to publish or manage its data is throwing away huge amounts of value.
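The same arithmetic as a sketch, with the multiplier and the 10% data share as the stated assumptions:

```python
# Value lost when one project's data goes unpublished (assumptions from the text).
project_cost_usd = 200_000   # a typical project
multiplier = 5               # conservative return; the human genome managed 180
data_share = 0.10            # at least 10% of the value sits in the data

total_value_usd = project_cost_usd * multiplier   # 1,000,000 USD of value
data_value_usd = total_value_usd * data_share     # 100,000 USD from the data alone
print(f"Value thrown away with the data: {data_value_usd:,.0f} USD")
```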

 

Example. Computational chemistry can predict the behaviour of matter (crystals, proteins, etc.). I estimate that >> 1 billion USD is spent on PPD, and its value is several times this. None of this is published. None. This is certainly the single most valuable area where we would benefit from good modern data management. The lost value to the world, especially industry (semiconductors, carbon storage, energy, medical proteins, etc.), is certainly billions.

 

Organizations. Who can help? (NOT the publishers, as they have shown themselves to act against the interests of anyone other than their shareholders.)

International bodies: ICSU/CODATA, UNESCO, learned societies (if decoupled from commercial publishing). Funders (NSF, NIH, RCUK). National Laboratories. Governments. JISC. NCBI, EBI, UKPMC/PubMedCentral.

OKFN, SPARC, OSI, OSF, Foundations. All very sympathetic.

 

Training. Very little. Sophie Kershaw (our Panton Fellow) is changing that through a novel graduate course, and she’ll be feeding me material. It’s probably the most important thing – I’ll say more under “what we can do”.

 

UPDATES

Current Practice. There is very little consensus on how or what to do, and widespread apathy. There are no positive pointers from #openaccess. University repositories have spent perhaps 2 billion USD worldwide and have provided very little of any kind for science. They are not designed for science data and should never be used for it.

The best examples of practice come from software – GitHub/Bitbucket and a large number of innovative and highly useful tools. If we could emulate these repositories for science, both in their value and their relative openness, that could be massive. But data are not software, and it’s hard.


Why should we continue to pay typesetters/publishers lots of money to process (and even destroy) science? And a puzzle for you.

The average large STM publisher receives several thousand USD (either in subscriptions or in author processing charges (APCs)) to process an article. This huge and unjustified sum covers not only obscene profits (30–40%) but also gross inefficiencies and, often, the destruction of scientific content. The point is that publishers decide what THEY want to do, irrespective of what authors or sighted human readers want (let alone unsighted readers or machines).

I am highly critical of publishers’ typesetting. Almost everything I say is conjecture, as typesetting occurs in the publisher’s backroom process, possibly including outsourcing; I can only guess what happens. I’ve tried to estimate the per-page cost and it’s about 8–15 USD, so the cost per paper in typesetting alone is well over 100 USD. And the result is often awful.

The purpose of typesetting is to further the publisher’s business. I spoke to one publisher and asked them, “Why don’t you use standard fonts such as Helvetica (Arial) and Unicode?” Their answer (unattributed) was:

“Helvetica is boring”.

I think this sums up the whole problem. A major purpose of STM typesetting is for one publisher to compete against the others. For me, scientific typography was solved (brilliantly) by Don Knuth with TeX and Metafont about 40 years ago. Huge numbers of scientists (especially physicists) and mathematicians communicate in (La)TeX without going near a typesetter. When graduate students write their theses they use LaTeX.

Or Word. I’ve read many student theses in Word and it’s perfectly satisfactory. Why shouldn’t we use LaTeX and Word for scientific communication?

Well, of course we do. That’s what arXiv takes. If scientists ran publishing it would cost a fraction as much and would increase scientific communication. That’s what Tim Gowers and colleagues plan to do – dump the publishers completely. I completely support this, of course.

The publishers have grown an artificial market in creating “final PDFs”. These are often double-column. Why!? Because they can be printed. We are in the C21 and publishers are creating print. Double-column PDF is awful on my laptop: wrong aspect ratio, and I have to scroll back for every page. It’s destroying innovation and we are paying hundreds of millions for it.

And of course every publisher has to be different. It would make sense for authors to use a standard way of submitting bibliography. There is one, of course: http://en.wikipedia.org/wiki/BibTeX. Free as in beer and free as in speech, and it has been going for nearly 30 years.

But its main problem is that it makes publishing too democratic and easy; publishers would have nothing to create gatekeeper rituals with. So instead they ask authors to destroy their bibliographic information by perpetuating arcane ways of “formatting references”. Here’s an example from Dino Mike’s paper (https://peerj.com/articles/36.pdf ) in PeerJ – and remember PeerJ is only a month or two old, yet it still uses the same 100-year-old approach to typesetting. Here’s a “citation” from Mike’s bibliography to a PLoS paper:

This is an example of the publisher-imposed wastage of human effort. What is the 5(10)? I actually have no idea (I can guess: volume(issue)). It’s not needed – the DOI is a complete reference. It’s a redundant and error-prone librarian approach. But Mike had to type these meaningless numbers into the paper. And, not surprisingly, he got it wrong (or at least something is wrong), because the target paper says:

“5(11)”. It’s a waste of Mike’s time to ask him to do this. And the reason is simple: publishers and metapublishers make huge amounts of money by managing bibliography, and they do it in company-specific ways. So the whole field is far more complex and error-prone than it should be. Muddle the field and create a market (or, historically, perpetuate a muddled market). It’s criminal that bibliography is still a problem.
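And the DOI alone really is enough: standard DOI content negotiation (supported by CrossRef and DataCite) hands back the full machine-readable reference, volume and issue included. A minimal sketch – the DOI below is a placeholder for illustration, substitute any valid one:

```python
# Resolve a DOI to a BibTeX record via DOI content negotiation.
import urllib.request

doi = "10.1371/journal.pone.0000001"  # placeholder DOI - use any valid DOI
req = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/x-bibtex"},  # ask the resolver for BibTeX
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # author, title, volume, number, pages, ...
```

No author should ever have to hand-type “5(10)” when the machine can fetch it.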

Back to the typography:

  1. Mike wrote his paper in Word, not LaTeX (see comments). He used ticks (U+2713). [See comments after my blog post for what follows.] His Unicode ticks were converted by the publisher (Kaveh) into “Dingbats”. (Dingbats is not a standard PDF font – ZapfDingbats is – and I have to guess they are the same; there are mutants of Dingbats.)
  2. I haven’t met Kaveh, but I have considerable respect for his company. However, I do not accept the argument with which he justifies typesetting: “But the primary reason for ‘typesetting’ is to produce an attractive ‘paginated’ form of the information, according to traditions of hundreds of years”.

     

It is precisely “hundreds of years” that is the problem. We should be looking to the present and future. And I will show you a typical example of the irresponsible destruction of scientific information. But first I will comment that IMO publishers treat unsighted humans with apparent disregard. In analysing several thousand papers I have come across huge numbers of characters which I am absolutely sure could not be interpreted by current tools. Here’s an example that I would be amazed if any machine-reader in the world could interpret correctly. It ought to speak “left-paren epsilon subscript 400 equals 816 plus-or-minus 56 cm superscript minus-1” – that is, (ε₄₀₀ = 816 ± 56 cm⁻¹).

 

But it doesn’t get the “epsilon”. The true interpretation is so horrible that I hesitate to post it – and will leave it as an exercise for you to guess. I am almost sure that the horror has been introduced by the publisher, as they use a special font (AdvOTb92eb7df.I) which neither I nor anyone else has ever heard of.

Who’s the publisher? Well it’s one with a Directorate of Universal Access.

But that doesn’t seem to provide accessibility for unsighted humans or machines. And it’s even beaten our current pdf2svg software – it needs matrix algebra to solve.

UPDATE:

Villu has got it right. “So this “epsilon” is actually “three (italic)”, which is first flipped horizontally and then vertically? The italic-ness can be determined after the name of the font (ends with “.I”), and the number of flips by observing the fat tail of the stroke (should be at the lower left corner, but is at the upper right corner)?” PMR: It’s only flipped once – see chunk below where you can see that FakeEpsilon and 3 are the same way up.

PMR: The fontMatrix is actually FLIP * SKEW (where d is the skew angle), which gave me a negative font size (so the glyph failed to show in the output). Could a speech reader detect that and convert the “3” to an italic epsilon? Not a chance.
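To make the matrix algebra concrete, here is a minimal sketch. The flip-times-skew structure comes from the post; the exact numbers are illustrative assumptions, not values read out of the PDF:

```python
# A vertical flip composed with a skew yields a transform whose vertical
# scale is negative - which a naive PDF parser reports as a negative font size.
import math

d = math.radians(15)                      # assumed skew (italic) angle
FLIP = [[1.0, 0.0], [0.0, -1.0]]          # mirror about the horizontal axis
SKEW = [[1.0, math.tan(d)], [0.0, 1.0]]   # italic-style shear

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

font_matrix = matmul(FLIP, SKEW)
print(font_matrix)        # [[1.0, 0.2679...], [0.0, -1.0]]
print(font_matrix[1][1])  # -1.0: the "negative font size"
```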

For the record, Unicode has every character you could conceivably want in normal scientific use – including a whole family of epsilons (see http://www.fileformat.info/info/unicode/ ):
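A minimal sketch that pulls them straight from the Unicode character database, rather than from a font hack:

```python
# Enumerate every Unicode codepoint whose official name mentions EPSILON.
import sys
import unicodedata

for cp in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(cp), "")  # "" for unassigned codepoints
    if "EPSILON" in name:
        print(f"U+{cp:04X}  {chr(cp)}  {name}")

# Dozens of hits: GREEK SMALL LETTER EPSILON (U+03B5), GREEK LUNATE EPSILON
# SYMBOL (U+03F5), MATHEMATICAL ITALIC SMALL EPSILON (U+1D700), and more.
```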

 

I’m getting bored. There are several more pages of Unicode epsilons. It’s inconceivable that one of those wouldn’t be suitable. But no, the publisher takes our money to produce an abomination.

 

 

 
