US requires Open Access to Scientific Research – huge progress

[I have been relatively quiet recently because I am in Lithuania working flat out to liberate Crystallographic Data and make it Open – expect several posts in the near future.]

Seven years ago I approached SPARC to suggest that together we ran an “Open Data” mailing list – it was one of the first times the term “Open Data” had been used; now, of course, it’s everywhere. I’m delighted to repost the following item from the list:

https://groups.google.com/a/arl.org/forum/#!topic/sparc-opendata/b2Qkwx5K-nA

In essence it says that the US is going to make Open Data in science actually happen. My thanks to SPARC and the many others who have pushed the cause. There’s a lot more that needs to happen, but notice the clause allowing Content Mining, which I have highlighted.

 

For Immediate Release

Thursday, January 16, 2014


Contact: Ranit Schmelzer                                                           

202-538-1065                                                                                                

sparcmedia@arl.org


PUBLIC ACCESS TO SCIENTIFIC RESEARCH ADVANCES

Omnibus Appropriations Bill Codifies White House Directive


Washington, DC – Progress toward making taxpayer-funded scientific research freely accessible in a digital environment was made today with congressional passage of the FY 2014 Omnibus Appropriations Act.  The bill requires federal agencies under the Labor, Health and Human Services, and Education portion of the Omnibus bill with research budgets of $100 million or more to provide the public with online access to articles reporting on federally funded research no later than 12 months after publication in a peer-reviewed journal.


“This is an important step toward making federally funded scientific research available for everyone to use online at no cost,” said Heather Joseph, Executive Director of the Scholarly Publishing and Academic Resources Coalition (SPARC).  “We are indebted to the members of Congress who championed open access and worked tirelessly to ensure that this language was included in the Omnibus.  Without the strong leadership of the White House, Senator Harkin, Senator Cornyn, and others, this would not have been possible.”


The additional agencies covered ensure that approximately $31 billion of the total $60 billion annual US investment in taxpayer-funded research is now openly accessible.


SPARC strongly supports the language in the Omnibus bill, which affirms the strong precedent set by the landmark NIH Public Access Policy, and more recently by the White House Office of Science and Technology Policy (OSTP) Directive on Public Access.  At the same time, SPARC is pressing for additional provisions to strengthen the language – many of which are contained in the Fair Access to Science and Technology Research (FASTR) Act – including requiring that articles are:

·      Available no later than six months after publication;

·      Available through a central repository similar to the National Institutes of Health’s (NIH) highly successful PubMed Central, a 2008 model that opened the gateway to the human genome project and more recently the brain mapping initiative.  These landmark programs demonstrate quite clearly how opening up access to taxpayer funded research can accelerate the pace of scientific discovery, lead to both innovative new treatments and technologies, and generate new jobs in key sectors of the economy; and

·      Provided in formats and under terms that ensure researchers have the ability to freely apply cutting-edge analysis tools and technologies to the full collection of digital articles resulting from public funding.

“SPARC is working toward codifying the principles in FASTR and is working with the Administration to use PubMed Central as the implementation model for the President’s directive,” said Joseph.  “Only with a central repository and the ability to fully mine and reuse data will we have the access we need to really spur innovation and job creation in broad sections of the economy.”


Background


Every year, the federal government uses taxpayer dollars to fund tens of billions of dollars of scientific research that results in thousands upon thousands of articles published in scientific journals.  The government funds this research with the understanding that it will advance science, spur the economy, accelerate innovation, and improve the lives of our citizens.  Yet most taxpayers – including academics, students, and patients – are shut out of accessing and using the results of the research that their tax dollars fund, because it is only available through expensive and often hard-to-access scientific journals.


By any measure, 2013 was a watershed year for the Open Access movement:  in February, the White House issued the landmark Directive; a major bill, FASTR, was introduced in Congress; a growing number of higher education institutions – including the University of California System, Harvard University, MIT, the University of Kansas, and Oberlin College – actively worked to maximize access to and sharing of research results; and, for the first time, state legislatures around the nation began debating open access policies supported by SPARC.


Details of the Omnibus Language


The Omnibus language (H.R. 3547) codifies a section of the White House Directive requirements into law for the Departments of Labor, Health and Human Services, and Education, the Centers for Disease Control (CDC), and the Agency for Healthcare Research and Quality (AHRQ), among other smaller agencies.


Additional report language included throughout the bill directs agencies – including the US Department of Agriculture, the Department of the Interior, the Department of Commerce, and the National Science Foundation – and OSTP to keep moving on the Directive policies.


President Obama is expected to sign the bill in the coming days.


###


SPARC®, the Scholarly Publishing and Academic Resources Coalition, is an international alliance of academic and research libraries working to correct imbalances in the scholarly publishing system.  Developed by the Association of Research Libraries, SPARC has become a catalyst for change.  Its pragmatic focus is to stimulate the emergence of new scholarly communication models that expand the dissemination of scholarly research and reduce financial pressures on libraries.  More information can be found at www.arl.org/sparc and on Twitter @SPARC_NA.


ePUB is a revolution in scholarly publishing. We explore BioMedCentral’s offering.

Scholarly publishing is one of the least enlightened industries of this century. Retail, travel, entertainment, medicine, democracy, and much more have all either been completely transformed or show the potential to be. Yet scholarly publishing idolises the print image (PDF) coupled to a mark of esteem (the Journal Impact Factor) as worth paying 3000 USD for 10 pages. It’s absurd to argue that humans were born to read double-column PDF on landscape screens and that this is the pinnacle of graphic presentation. It’s not. And the idea that readers love the variety of presentational brands from different journals – that variety serves the marketers, not the readers.

Now some publishers are starting to use a new format, ePUB. BioMedCentral have pioneered this (let me know of others) http://blogs.biomedcentral.com/bmcblog/2012/12/11/biomed-central-now-publishes-in-epub-format/ . This deserves much more praise and use than I think it has had, and I’ll explain why.

I, and I suspect many other readers/consumers, hate the way that science is presented by scholarly publishers. My remarks apply to almost all publishers, independently of whether they are “open access” or not. Simply, no publishers care much about their readers. There are three main approaches – some publishers offer only one, some all three.

  • PDF – a sighted-human-only format. Its proponents argue it’s beautiful. But it’s inaccessible to machines and unsighted humans and scholarly publishing produces the worst technical PDF I have come across. It’s taken me a year to build a system to read it semantically and we’re not finished. It cannot adapt to different needs – size of screens, lack of eyes, etc. It has no place in modern science. Its continued use is a severe discouragement to any form of innovation in communication.
  • HTML. This was designed by TimBL to be a simple, powerful, way of communicating science. And it is. I can do almost anything I want with HTML. It interoperates with markup languages, SVG, etc.

    Except when it’s created by most scholarly publishers. Much of their HTML cannot be formally read. Here’s a typical example of total garbage (slightly truncated):

    <meta name=”dc.description” content=”The increase in … with a technique we call ” range=”” modulation”.=”” with=”” modulation,=”” sobp.”=””>


What’s happened is that the original HTML had a quoted section. Something like:

The increase in … with a technique we call “range modulation”. With modulation …

The publisher – or more accurately the contractor they have outsourced to – uses a garbage toolkit. There’s a standard way to treat quotes (&quot;) and it has been used for 20 years. It’s actually easier to produce conformant HTML than garbage, and there are zillions of free tools. But the crap above crashes all the HTML parsers I have used.
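For the record, here is how little it takes to get this right. A minimal sketch (mine, not any publisher’s toolchain) of the standard attribute escaping:

    // minimal sketch: escape an attribute value before writing it into HTML
    public class AttrEscape {
        static String escapeAttr(String s) {
            return s.replace("&", "&amp;")   // must come first
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
        }
        public static void main(String[] args) {
            String desc = "we call \"range modulation\". With modulation ...";
            // prints one well-formed attribute that any HTML parser can read:
            // <meta name="dc.description" content="we call &quot;range modulation&quot;. With modulation ...">
            System.out.println("<meta name=\"dc.description\" content=\""
                    + escapeAttr(desc) + "\">");
        }
    }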


People have paid for this – maybe thousands of dollars. Libraries should cancel/reduce subscriptions. Authors should refuse to pay APCs. If a garage fills your brake fluid reservoir with screenwash they can be prosecuted. But no one cares about the quality of publishing.

  • XML. This is the engineering solution. It’s useful for people who want to re-use the content. Perhaps for content-mining. Perhaps for a different presentational approach. Perhaps for computation (e.g. maths). Perhaps for data (chemistry). It’s routinely provided by PubMed (EuropePMC), and because NIH have a commitment to accuracy and quality the XML is of good quality. Publishers often re-offer the XML, but much of that is probably limited to biomedicine. (Most publishers use XML internally, but they don’t publish it because they don’t want readers having free access to something useful that they can charge extra for.)
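And here is why well-formed XML pays off. A minimal sketch, assuming you have saved a JATS article (e.g. from EuropePMC) locally as article.xml – the stock JDK parser reads it with no heuristics at all:

    // minimal sketch: pull the title out of a locally saved JATS XML article
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class ReadJats {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("article.xml"));   // assumed local file
            // JATS marks the title up as <article-title>; assumes one exists
            System.out.println(doc.getElementsByTagName("article-title")
                    .item(0).getTextContent());
        }
    }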


So now we have ePUB: https://en.wikipedia.org/wiki/EPUB . Potentially ePUB takes us a giant step forward. HTML suffers because, while it is good at representing semantics and document structure and can be repurposed, it’s terrible as a container format (you cannot easily distribute images with HTML). Conversely PDF has no semantics and cannot be repurposed, but it is a tolerable container format for images.

ePUB supports both. It’s a standard format (ZIP) that contains a series of components (HTML, images, XML, etc.). It’s SVG-friendly (which in itself can revolutionise science). So I have downloaded a sample from BMC and started to analyse it. If a publisher WANTS the reader to read the content, then it’s a good start. For publishers who do not want people to read their output and (even worse) reuse it, it’s a major threat (except – salvation – DRM can stop anyone reading it). (When we see the first DRM’ed ePUBs we’ll know it’s 1984.)
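Because ePUB is just ZIP, you can look inside one with nothing but the JDK. A minimal sketch, assuming you have saved the BMC sample below as 1471-2105-14-172.epub:

    // minimal sketch: list the components inside an ePUB (it's just a ZIP)
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ListEpub {
        public static void main(String[] args) throws Exception {
            try (ZipFile epub = new ZipFile("1471-2105-14-172.epub")) {
                Enumeration<? extends ZipEntry> entries = epub.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry e = entries.nextElement();
                    // expect HTML, images, the OPF manifest, etc.
                    System.out.println(e.getName() + "  (" + e.getSize() + " bytes)");
                }
            }
        }
    }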

Here’s the example (http://www.biomedcentral.com/content/epub/1471-2105-14-172.epub). [It won’t do anything in your browser unless you have an ePUB reader, but it’s easy to add one in Mozilla – and I suspect others (see https://addons.mozilla.org/en-US/firefox/addon/epubreader/ and chrome://epubreader/locale/welcome.html).]

  • Go to the webpage with the ePub download link and click it. If you have already downloaded the ePub file to your PC, just open it via the “File/File open” menu or drag the file onto the Firefox window.
  • Now EPUBReader starts working: it downloads/opens the ePub file, uncompresses it and does some other processing. At the end it presents the ePub file, ready to read, immediately!
  • EPUBReader creates a page listing all the ePub files you have downloaded/opened. You can open this page from four locations:
    • EPUBReader adds a bookmark to this page called “ePub-Catalog”, which you can find at the end of your bookmark list.
    • There is a new menu item in the Firefox “Tools” menu, called “ePub-Catalog”.
    • You can add a button to your Firefox toolbar. If you use Firefox 4.0 or later, the button is added automatically.
    • When you are reading an ePub file, there is a button in the bottom toolbar.

If you want to have more information about EPUBReader, please read the manual or the FAQ.

Enjoy it!

Here’s the display. It fits the screen precisely. No vertical scrolling required. Horizontal scrolling, because we’re on a horizontal device. No need for typesetting!!


Isn’t that already a zillion times better than PDF? Yes, so why don’t you demand it?

But that is only 10% of the value of ePUB – I’ll discuss the rest in a later post.



Happy New Year from #animalgarden

#animalgarden wish everyone a very happy time interval 2014-01-01T00:00:00.000Z/2014-12-31T23:59:59.999Z

#animalgarden have made a number of contributions to Open-ness this year.

#animalgarden continue to welcome new members – here a Christmas present of Rattus norvegicus (front row, albino form). The picture shows unique species only, so there are penguins, bears, okapis and kiwis who play important roles but aren’t shown. R. norvegicus (https://en.wikipedia.org/wiki/Brown_rat) has been bred to become https://en.wikipedia.org/wiki/Laboratory_rat . We’ll be helping with slides, blogs and videos on mining content for the benefit of the world.


Content Mine: Sunlight in California – can AMI help make Spending data Open?

Marc Joffe of http://www.publicsectorcredit.org/ has an ambition – to make Open vast amounts of spending data in California. The Sunlight Foundation has funded Marc (http://sunlightfoundation.com/blog/2013/07/05/opengov-voices-local-government-financial-transparency-scalling-it-up/ ) – the problem is that the data is present in PDFs. So Marc mailed #ami2 a typical document and asked if she could understand it:

She can get the pictures out easily, but that’s not what Marc wants – he wants the data. Like this:

AMI thinks she can find some time to tackle this and help Marc. She’s not interested in money (she has the emotional age of a FORTRAN compiler) but she needs to hack tables for science, and this one should be possible. (Of course #animalgarden is working with @TabulaPDF as well.)

Marc’s blogged about it at http://blog.okfn.org/2013/12/19/pdf-liberation-hackathon-january-18-19/.

Open government data is valuable only to the extent that it can be used cost-effectively. When governments provide “open data” in the form of voluminous PDFs they offer the appearance of openness without its benefits. In this situation, the open government movement has two options: demand machine-readable data or hack the PDFs – using technology to liberate the interesting data from them. The two approaches are complementary; we can pursue both at the same time.

 

Whether your motive is to improve government, lower the cost of data journalism or free scientific data, you are welcome to join The PDF Liberation Hackathon on January 18-19, 2014 – sponsored by The Sunlight Foundation, Knight-Mozilla OpenNews and others. We’ll have hack sites at the NYU-Poly Incubator in New York, Chicago Community Trust, Sunlight’s Washington DC office and at RallyPad in San Francisco (one or two locations will have an opening social on the evening of the 17th). Developers can also join remotely because we will publish a number of clearly specified PDF extraction challenges before the hackathon.

 

PMR will be in Lithuania liberating crystallography but hopes to connect in.

And we hope you will too.

 


OKFN’s Open Science Working group is Global!

Jenny Molloy and Michelle Brooks have done a wonderful job in driving forward Open Science within the OKFN. Science is a huge topic and there are many areas where the Open philosophy and tools are now critical. Here’s Jenny’s end-of-year report – the easiest way to get involved is to subscribe to the mailing list and take it from there. Or contact any of us.

YOU DON’T HAVE TO BE A “PROFESSIONAL SCIENTIST” TO BE AN OPEN SCIENTIST! We are very keen that citizens of any nation, whatever they do, are involved.

Hi All

 

I thought as we get to the end of 2013 it would be good to let you know where the working group currently stands, and to let you have your say on how we might proceed in our mission to open up science during the next year!

We have grown to over 630 members (!) on the mailing list and now have local groups or representation in:

Sweden
Finland
UK (Oxford, London, Cambridge)
Australia
Austria
France
Brazil
US

 

Which is fantastic, but we’d love to expand further! If you would like to be an open science ambassador for your region/country/city/university then get in touch.

To help us keep track of all these activities and make sure we’re being as effective as possible at providing a space for discussions and collaboration around open science, it would be great to hear from anyone who would like to be more involved in managing the working group (also, I have a thesis due in 9 months so any and all help is very much welcome from my perspective!).

 

We would like to put together a group of community organisers to contribute to a variety of roles. There can be a name for this group but we haven’t settled on one yet – suggestions welcome. The time commitment will be flexible and relatively low, but it will make a big difference to have someone keeping an eye on specific areas. The roles and tasks might include:

 

Organising working group meetings

Planning open science events at OKFest 2014

Documenting events and updates from the working group

Coordinating specific projects or documents

Blog Editing

Tech/Dev Liaison

Event Organisation

Designing publicity materials and logos etc.

 

Do get in touch if any of this sounds of interest, even if you are only able to contribute a small amount of time. If you have any questions, please let Zara, Michelle and me know – we look forward to hearing from you! Look out for another email about specific projects, events and 2014 working group activities coming shortly.

Jenny


_______________________________________________
open-science mailing list
open-science@lists.okfn.org
https://lists.okfn.org/mailman/listinfo/open-science


Content-mining: #animalgarden write a crawler

 

Chuff: Hello Gulliver! I’m the OKFN OKAPI.

Gulliver: Hello! I’m Gulliver, the BMC turtle.

C: I know – you’ve got a blog http://gulliverturtle.wordpress.com/ where you tell the world about #openaccess. What’s #ami2 doing?

G: She’s crawling BioMedCentral content.

C: Gosh! It looks painful.

G: It’s very painful for humans, but as you know #ami2 has no emotions. So it’s not a problem. She’s good at it.

C: Yes, she doesn’t get tired, angry, bored. She does exactly what she is told by PMR. So how does it work?

G: Here’s my content page: http://www.biomedcentral.com/content – that tells you where all the articles are.

AMI: PMR told me to read each bibliodata and follow the link. I have to do that for all the papers.

C: Well there are only 25 so it won’t take too long.

G: We’re MUCH more popular than that. This is one PAGE. There are nearly 6000 pages.

C: Wow! Is that because you’re Open Access?

G: Yes. PMR publishes many of his papers with me.

C: Let’s see. 25 * 6000 = 150,000. Wow! What a lot of Open Access papers. Can #ami2 do them all?

G: Yes. We’ll tell her how long she has to wait between reading each paper.

C: But #ami2’s very fast. She doesn’t need to wait.

G: If she tries to download papers too quickly – like 1000 per second – it might confuse our servers, because that might look like a hostile robot.

C: But #ami2 IS a robot.

G: Yes, but she’s a friendly robot. We’ll tell her what the maximum speed is.

C: She told me it was 6 seconds for PLoSONE. (I wish PLoSONE would get a mascot – all #openaccess publishers should have animals).

G: Yes. PeerJ has Charlie the monkey. Anyway let’s do the sums. 6 secs is 10 papers per minute. We need 150,000 / 10 minutes, which is 15,000 minutes.

C: Which is about 10 days. That’s to get the backlog. How many would it be per day?

G: I think about 150. I’ll have to check with Amye. That’s BMC Amye!

C: Yes, I met her at the @OpenDrinks last week. That’s about the same number as PLoSONE. 150 articles is 15 minutes.

G: And you can do all of BMC at the same time as PLoSONE, because you can alternate requests every 3 seconds.
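[PMR interjects: the essence of what the animals have agreed is a polite download loop. Here’s a minimal sketch – the URLs and the 6-second delay are the examples from this conversation, not a published BMC policy.]

    // minimal sketch: fetch a list of pages politely, one request per DELAY_MS
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;

    public class PoliteFetcher {
        static final long DELAY_MS = 6000;   // 6 s => 10 papers per minute

        public static void main(String[] args) throws Exception {
            List<String> urls = Arrays.asList(
                "http://www.biomedcentral.com/content/?page=1&itemsPerPage=100",
                "http://www.biomedcentral.com/content/?page=2&itemsPerPage=100");
            int i = 0;
            for (String u : urls) {
                try (InputStream in = new URL(u).openStream()) {
                    Files.copy(in, Paths.get("page" + (++i) + ".html")); // fails if file exists
                }
                Thread.sleep(DELAY_MS);      // be a friendly robot, not a hostile one
            }
        }
    }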

C: That’s clever. Wow – there’s even a journal for data mining: http://www.biodatamining.org/. How big is an article?

G: Depends what you want. http://www.biodatamining.org/content/pdf/1756-0381-6-21.pdf is about 6 MBytes. But there’s also HTML and XML.

C: Are they all the same?

G: Not quite. The XML and HTML are quite similar. #ami2 can read them easily. They don’t have any pictures.

C: But I like pictures.

G: The pictures are there, but separate. You have to follow links.

C: #ami2 can do that. We have to organize it… But the PDF may contain vector graphics. PMR loves vector graphics because he can teach #ami2 to build real science from the pictures. He’s not so keen on PNGs and GIFs and TIFFs and JPEGs. But #ami2 isn’t perfect at reading PDFs. No one is perfect at reading PDFs except sighted humans. It’s hard to teach animals like #ami2. But she’s improving.

G: OK – sounds like you want the PDF AND the XML AND the HTML.

C: Sounds like it. So we have to let #ami2 know where they are. Trouble is every publisher does it differently. PMR’s just written AbstractCrawler.java.

G: So will that read BMC?

C: No. It won’t read anything. It only downloads things.

G: So will it download my articles?

C: Not yet. We have to write a special crawler for each publisher. PMR’s written PlosoneRecentCrawler, which extends AbstractCrawler. He’s tried to make it easy to include a new publisher.
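[PMR interjects: the post doesn’t show AbstractCrawler’s real interface, so treat this as a hypothetical sketch of the shape – one small subclass per publisher, supplying its content-page URL pattern and polite delay. The real class will differ.]

    // hypothetical sketch: the real AbstractCrawler is in the codebase and will differ
    public class BmcCrawlerSketch {
        public static void main(String[] args) {
            AbstractCrawler c = new BmcRecentCrawler();
            System.out.println(c.pageUrl(1));   // first content page
        }
    }

    abstract class AbstractCrawler {
        protected abstract String pageUrl(int page);    // content-page URL pattern
        protected abstract long politeDelayMillis();    // publisher's crawl delay
    }

    class BmcRecentCrawler extends AbstractCrawler {
        @Override protected String pageUrl(int page) {
            // BMC's content pages, as discussed below
            return "http://www.biomedcentral.com/content/?page=" + page
                    + "&itemsPerPage=100";
        }
        @Override protected long politeDelayMillis() { return 6000; }
    }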

G: Please speak nicely to him and ask him to write a Gulliver Crawler.

C: I will try. He’s a bit tired. Humans need to rest.

G: How boring and inconvenient!

C: He sometimes writes code during the cricket. But he’s upset about the cricket…

G: I suppose #ami2 is pleased about the cricket?

C: No – remember she has the emotional apparatus of a FORTRAN compiler. PMR will point her at the BMC content page. She will start at the first article and then go on to the next. Let’s start at:

http://www.biomedcentral.com/content/?page=1&itemsPerPage=100

I *think* that means that we start at page 1. If we go to the next page we find:

http://www.biomedcentral.com/content/?page=2&itemsPerPage=100

G: That looks promising. But remember that these pages are updated – their content could change as we add new papers.

C: Oh my stripes and paws! Let’s ask Amye. She wrote to PMR:

We have a search API, an OAI-PMH API and an FTP location with a zip file of all our XML. The OAI-PMH API allows for retrieval of article metadata and fulltext XML for all articles or specific subsets. It also allows flexible date stamp restriction, so that should meet your needs. We also have a feed API (giving the latest articles, picks and most viewed) and a case report API for our Cases Database.

G: Well done. I think it will take another mail or skype to clarify…
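[PMR interjects: OAI-PMH is plain HTTP, so the harvesting side is simple. A minimal sketch – verb, metadataPrefix and from are standard OAI-PMH parameters, but the base URL below is a placeholder until BMC confirm their endpoint.]

    // minimal sketch: an OAI-PMH ListRecords request; the base URL is a PLACEHOLDER
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class OaiHarvest {
        public static void main(String[] args) throws Exception {
            String base = "http://example.org/oai";   // NOT BMC's real endpoint
            URL url = new URL(base + "?verb=ListRecords"
                    + "&metadataPrefix=oai_dc&from=2013-12-01");
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = r.readLine()) != null) System.out.println(line);
            }
        }
    }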

C: Goodbye for now, and goodbye from #ami2



IFLA supports copyright exceptions for Text and Data Mining

Yesterday the International Federation of Library Associations and Institutions (IFLA) issued a welcome statement on TDM: http://www.ifla.org/publications/ifla-statement-on-text-and-data-mining-2013. Snippets:

IFLA maintains that legal certainty for text and data mining (TDM) can only be achieved by (statutory) exceptions. As an organization committed to the principle of freedom of access to information, and the belief that information should be utilised without restriction in ways vital to the educational and cultural well-being of communities, IFLA believes TDM to be an essential tool to the advancement of learning, and new forms of creation.


We live in an era of “Big Data”. OECD figures show that more digital information was created between 2008 and 2011 than in all previous recorded history (World Economic Forum (2012) ‘Global Information Technology Report: living in a hyper-connected world’ p.59, http://www3.weforum.org/docs/Global_IT_Report_2012.pdf). No human can read such vast volumes of information, which is why “computer-based reading”, using tools such as text and data mining, is so important.


Research organisations see TDM as an engine to improve the performance of science by speeding up new potential discoveries based upon existing literature without the need for further laboratory based research.  TDM is a tool also increasingly being used by researchers and creators in the arts and humanities fields, to offer new interpretations of history, literature and art. Libraries are also increasingly undertaking TDM themselves, to improve information services and offer new insights into their collections. Government data sets are also increasingly being made available to researchers, archives and libraries undertaking TDM, as they offer much potential economic value in an era of Big Data. Commercial innovators are also utilising TDM.


The technical act of copying involved in the process of TDM falls by accident, not intention, within the complexity of copyright laws – in fact analysis of facts and data has been the basis of learning for millennia. As TDM simply employs computers to “read” material and extract facts one already has the right as a human to read and extract facts from, it is difficult to see how the technical copying by a computer can be used to justify copyright and database laws regulating this activity.

“That these new uses happen to fall within the scope of copyright regulation is essentially a side effect of how copyright has been defined, rather than being directly relevant to what copyright is supposed to protect.” (Hargreaves Review of Intellectual Property and Growth (2011), UK Intellectual Property Office, http://www.ipo.gov.uk/ipreview.htm)  

TDM is one of several new tools in the digital environment to which copyright norms devised 300 years ago do not readily apply.


Solution

Researchers must be able to share the results of text and data mining, as long as these results are not substitutable for the original copyright work – irrespective of copyright law, database law or contractual terms to the contrary. Without this right, legal uncertainty may prevent important research and data-driven innovation, putting researchers, institutions and innovators at risk.

IFLA does not support licensing as an appropriate solution for TDM. If a researcher or research institution, or another user accessing information through their library, has lawfully acquired digital content, including databases, the right to read this content should encompass the right to mine it. Further, the sheer volume and diversity of information that can be utilised for text and data mining, which extends far beyond already-licensed research databases and which is not viewed in silos, makes a licence-driven solution close to impossible.


Content Mining: Recent progress

A lot has happened in the last month and it’s kept me so busy that I haven’t blogged as much as I would have liked.

The simple message is that we are starting to mine the scholarly literature on a global scale, starting with the (sociopolitically) easiest areas. We’ve started to build the community, build the tools and deploy the results. I am not frightened by scale, as there are in-place solutions.

The most important thing is community. If there’s a perceived need then CM will happen, and fast. And on Tuesday we made massive community progress.

We met PLoS in the Haymakers pub (Cambridge) and talked about how they could help us crawl PLoS daily. Then PLoS held an OpenDrinks in King’s Cross, London. Everyone was excited about the way scholarly communication could – and in some cases will – open up. BioMedCentral (AmyeK), CrossRef (GeoffB) and lots of OKFers were there. I came away with the strong feeling that we agree on the Why and Whether of CM and have now moved to the How.

We’re doing a “soft launch” of the Content Mine. Something new every day. Advocacy and news from all sectors of the community. Debugging as we go. So we’ve started, not with a bang, but a snowflake. The avalanche will come.

One of the most important things is that we have set up an OKFN mailing list for CM. https://lists.okfn.org/mailman/listinfo/open-contentmining . Mailing lists are one of the best ways of collecting ideas, resources, community. If you have questions, offers of help, insights – please post. It’s a friendly community.

Some community milestones

 

2013-11-14 I was invited to present at UK Serials Group (core of librarians, publishers, university admin) and that gave me the chance to put slides together http://www.slideshare.net/petermurrayrust/the-content-mine-presented-at-uksg. It was well received and the delegates were interested in CM. UKSG recorded a video “Scientific Data costs billions but most is thrown away – what should be done?”. It’s at http://www.youtube.com/watch?v=LHkHGgYfaP0 (ca 25 mins). Many thanks.

2013-11-27 Open-science Oxford. http://science.okfn.org/community/local-groups/oxford-open-science/content-mining-scholarly-data-liberation-workshop/ . Wonderful event run by Jenny Molloy. It meant I had to get a portable demo ready (more below).

 

Tools

2013-11-25 CKAN/the Datahub. I decided we should use CKAN (OKFN) for the extracted content. CKAN is an open system for managing metadata and URL-based data storage. I think it will do very well for us. It’s got a vibrant developer and user community. Mark Wainwright gave us an excellent intro: /pmr/2013/11/26/content-mining-the-scientific-literature-into-ckan/ . We’ve learned how to use it – having people online really helps – and we are doing our bit by contributing back a revised CKANClient-J. You are welcome to browse, e.g. http://datahub.io/dataset/species-in-articles-from-biomedcentral but please realise that this is Open Science – we are building it as we go. For example “In vivo” isn’t a species – it’s a false positive, and we are refining the filters daily.
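If you’d rather poke at the Datahub programmatically than browse, CKAN’s standard action API serves JSON over plain HTTP. A minimal sketch using the package_show action against the dataset above (we use CKANClient-J for the heavy lifting, so this is just the idea):

    // minimal sketch: fetch a dataset's metadata from the Datahub via CKAN's
    // action API (returns JSON; parse it with your favourite JSON library)
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class CkanPeek {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://datahub.io/api/3/action/package_show"
                    + "?id=species-in-articles-from-biomedcentral");
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = r.readLine()) != null) System.out.println(line);
            }
        }
    }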

We’ve been joined by Mark Williamson and Andy Howlett in the Unilever Centre. They are doing a great job in helping refactor the existing code and framework. Andy’s working on a plugin mechanism so that if YOU want to – say – search for galaxies, we can make it easy to insert an Astro plugin. Mark has done a huge job on making the system robust and distributable – the command-line interface and deployment we used in Oxford. We are aiming at a system which is very easy to deploy, so that when we run workshops it will be easy for all participants.

The latest tools are all on Bitbucket:

  • https://bitbucket.org/petermr/xhtml2stm-dev/wiki/Home
    Visitor (plugin) architecture for adding discipline-specific analyses.
  • https://bitbucket.org/petermr/crawlerrepo. A crawler architecture (currently PLoSOne and soon BMC), plus a CKAN Datahub repo client. I hope to make this very general so it’s easier to create crawlers. Crawlers are never fun. They reflect the horrible effects of creating information for sighted humans only. But they are an excellent place for crowd contributions – one crawler per publisher/journal.
  • https://bitbucket.org/petermr/imageanalysis/wiki/Home . Computer vision for scientific diagrams. All the basic technology exists, with Java solutions for almost all. I’ve been pleasantly surprised how well it performs. It’s experimental. I did a lot with Hough, but it looks like thinning and segmentation are actually better. OCR is a problem – Tesseract is C++ and very messy, JavaOCR ought to be the answer but is impenetrable, and lookup (cross-correlation) is not quite what I want. I think I’m going to have to write a scientific OCR. Anyone interested in joining is very welcome – it won’t be as hairy as it sounds, I have some ideas. (There’s a sketch of the simplest step below.)
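Here’s that simplest step – binarisation – as a minimal sketch. A fixed global threshold is cruder than what the imageanalysis code does, but it shows the idea; diagram.png is an assumed input file:

    // minimal sketch: binarise a scientific diagram with a fixed global threshold
    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.ImageIO;

    public class Binarise {
        public static void main(String[] args) throws Exception {
            BufferedImage in = ImageIO.read(new File("diagram.png"));  // assumed input
            BufferedImage out = new BufferedImage(in.getWidth(), in.getHeight(),
                    BufferedImage.TYPE_BYTE_BINARY);
            for (int y = 0; y < in.getHeight(); y++) {
                for (int x = 0; x < in.getWidth(); x++) {
                    int rgb = in.getRGB(x, y);
                    // mean of R, G, B as a crude luminance estimate
                    int lum = ((rgb >> 16 & 0xff) + (rgb >> 8 & 0xff) + (rgb & 0xff)) / 3;
                    out.setRGB(x, y, lum < 128 ? 0xff000000 : 0xffffffff);
                }
            }
            ImageIO.write(out, "png", new File("diagram-bw.png"));
        }
    }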

Framework

 

I am a strong believer in http://en.wikipedia.org/wiki/Convention_over_configuration – create a standard, simple way of doing things so the number of options you HAVE to specify is small. So, for example, if no --output is given the system puts the files in a known place. That also helps community – you can Skype: “can you find plos/2013-12-13/daily.log in your /extracted folder?”
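A minimal sketch of that convention (the --output flag and the default path here mirror the examples in this post, not necessarily the shipped code):

    // minimal sketch: default the output directory when --output isn't given
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.LocalDate;

    public class OutputDir {
        static Path outputDir(String[] args) {
            for (int i = 0; i < args.length - 1; i++) {
                if ("--output".equals(args[i])) return Paths.get(args[i + 1]);
            }
            // convention: a known place, so collaborators can find each other's files
            return Paths.get("extracted", "plos", LocalDate.now().toString());
        }
        public static void main(String[] args) {
            System.out.println(outputDir(args));
        }
    }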

And documentation

We like the Bitbucket Wiki and its markdown. It’s relatively easy to record what we have done and what we want people to do.

So – watch out for content appearing flake by flake.

 

And join in…

 


Content-mining: how can I help?

I got a request today offering help with CM. Great! CM isn’t a single activity – ideally it’s a community of collaborating people and organizations combining resources. The first thing you can do is join, and post to, https://lists.okfn.org/mailman/listinfo/open-contentmining. Here’s what I replied:

I’m delighted to have had an enquiry of help for content-mining. The good news is:

*Everyone has a role to play in content-mining*

Here are some important areas – please submit others. There are lots of micro-tasks that everyone can become involved in.

==project==

* identifying a need

* coordinating a community effort

* summarising current practice (e.g. rights, barriers, resources)

* creating resources (e.g. corpora)

* running a project

==crawling==

* identifying sites to mine

* collecting bibliographic metadata (e.g. tables of content)

* agreeing web-friendly protocols (e.g. delay times)

* writing or finding crawlers

* creating or deploying crawl scripts

* managing workflow manually or automatically

* recording crawl log

* saving crawled materials

==document==

* formalising structure of document (e.g. sections)

* creating or finding vocabularies for annotation

==generic tools==

* crawlers

* PDF readers

* flat text readers

* graphics analyzers

* image analyzers

==databases==

* customization

==natural language==

 

* collection of NLP tools

* vocabularies

* corpora for training

* training

* testing

* domain tools

== graphics==

* reconstruction of diagrams from primitives

* SVG tools

==images==

* selection

* cropping

* binarisation

* edge detection/segments

* optical character recognition

==text==

* fonts

==tables==

* reconstruction

* interpretation

==audio==

==video==

==semantics==

* annotation

* links

==domain==

* maths

* chemistry

* geo

* dates

* units of measurement

==argumentation==

* document structure

* sentiment analysis

==documentation==

==sociopoliticolegal==

==community==

* mailing lists

* crowdcrafting



Remember Aaron Swartz in January and join the New Hampshire walk

Larry Lessig has mailed today to remember Aaron Swartz and to join a many-day walk through New Hampshire. Is anything planned in UK/Europe?

Seven years ago this January, Aaron Swartz visited me in Berlin to convince me to give up my work on intellectual property and take up the fight against corruption. Nothing sane would get done within IP — or anywhere else — he believed, until we released our political system from the stranglehold of moneyed interests and corporate influence. He convinced me. 

On January 11, the anniversary of Aaron’s death, we will begin a 185-mile walk along the length of New Hampshire, to continue the fight that Aaron began. Along the way, we will recruit as many New Hampshirites as we can to the battle against corruption. We want every presidential candidate at every New Hampshire primary event to be asked just one question: “How will you fix the corruption in Washington?” 

Click here to show your support for this historic march — and help fight back against political corruption.

Aaron brought a contagious idealism to everything he did. That idealism inspired thousands to join the battle against SOPA and PIPA. But as he believed, it was the thousands — and not him — that won that fight. As he said in the last speech he gave, “We won this fight because everyone made themselves the hero of their own story. Everyone took it as their job to save this crucial freedom and threw themselves into it.”  

His idealism continues in the work that the incredible Demand Progress community has done over the last few years. Now we need that idealism, and this community, to spread to the fight against corruption. 

We’re marching from Dixville Notch (the place the New Hampshire primary begins) to Nashua to save our democracy. Click here to join — or support us however you can.

If you can walk, join us. Even for just a day, join us. If you can’t walk with us, then help us by spreading the word. Tell your New England friends about this insane idea. And if you don’t have any friends in the northeast, then you can pay the NE-Friendless Tax by chipping in whatever you can — from $10 to whatever.

Please do as Aaron convinced me to do 7 years ago: join this fight too. Two years ago, in the SOPA fight, you took on DC’s most powerful lobby, and beat it. If we can recruit 10 times the support, we can take on the power of all DC lobbyists, and restore a democracy to this Republic.

Whether you lend your feet, your voice or your dollars, we need you in this fight. This January, and every January, until we win.

See you on the road. 

– Lessig

Demand Progress and Rootstrikers

PS — If you’d like to sponsor me on the walk, you can do that here.


Paid for by Demand Progress (DemandProgress.org) and not authorized by any candidate or candidate’s committee. Contributions are not deductible as charitable contributions for federal income tax purposes.


One last thing — Demand Progress’s small, dedicated, under-paid staff relies on the generosity of members like you to support our work. Will you click here to chip in $5 or $10? Or you can become a Demand Progress monthly sustainer by clicking here. Thank you!


