Building an OKFN model for reproducible economics; why we need it (and a puzzle for you).

On Saturday we are having an economics hackathon in London. I’d love to be there but unfortunately I am going to the European Semantic Web Conference in Montpellier. The hackathon is run by Velichka and colleagues – here’s the sort of reason why it matters (from the OKFN blog).

Velichka Dimitrova revisited the discredited Reinhart-Rogoff paper on austerity economics – a perfect illustration of the need for open data in economics – and her post was picked up by the London School of Economics and the New Scientist.

The point is that the economists made very serious mistakes and that proper management of the data and tools could have prevented them. We have to work towards reproducible computation in the sciences and in economics. From Velichka’s blog (and then I set you a puzzle at the end):

Another economics scandal made the news last week. Harvard Kennedy School professor Carmen Reinhart and Harvard University professor Kenneth Rogoff argued in their 2010 NBER paper that economic growth slows down when the debt/GDP ratio exceeds the threshold of 90 percent of GDP. These results were also published in one of the most prestigious economics journals – the American Economic Review (AER) – and had a powerful resonance in a period of serious economic and public policy turmoil when governments around the world slashed spending in order to decrease the public deficit and stimulate economic growth.

Yet they were proven wrong. Thomas Herndon, Michael Ash and Robert Pollin from the University of Massachusetts (UMass) tried to replicate the results of Reinhart and Rogoff and criticised them on three grounds:

  • Coding errors: due to a spreadsheet error five countries were excluded completely from the sample, resulting in significant errors in the average real GDP growth and the debt/GDP ratio in several categories
  • Selective exclusion of available data and data gaps: Reinhart and Rogoff exclude Australia (1946-1950), New Zealand (1946-1949) and Canada (1946-1950). This exclusion alone is responsible for a significant reduction in the estimated real GDP growth in the highest public debt/GDP category
  • Unconventional weighting of summary statistics: the authors do not discuss their decision to weight equally by country rather than by country-year, which could be arbitrary and ignores the issue of serial correlation.
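The third point (weighting) is worth a concrete illustration. Here is a minimal sketch – with invented numbers, not the Reinhart-Rogoff data – of how weighting equally by country rather than by country-year can flip the sign of an average:

```python
# Hypothetical illustration: country A has four country-years of steady
# growth in the high-debt category; country B has a single bad year.
growth_by_country = {
    "A": [2.0, 2.5, 3.0, 2.5],  # four country-years
    "B": [-7.6],                # one country-year
}

# Equal weight per country: first average within each country.
country_means = [sum(v) / len(v) for v in growth_by_country.values()]
by_country = sum(country_means) / len(country_means)

# Equal weight per country-year: pool all observations.
all_years = [g for v in growth_by_country.values() for g in v]
by_country_year = sum(all_years) / len(all_years)

print(f"by country:      {by_country:+.2f}%")       # -2.55%
print(f"by country-year: {by_country_year:+.2f}%")  # +0.48%
```

Under country weighting, B’s single year counts as much as A’s four years – which is why the choice needs to be discussed openly rather than made silently.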

The implications of these results are that countries with high levels of public debt experience only “modestly diminished” average GDP growth rates, and, as the UMass authors show, there is a wide range of GDP growth performances at every level of public debt among the twenty advanced economies in the Reinhart-Rogoff survey. Even though a negative trend is still visible in the UMass results, the data fit it very poorly: “low debt and poor growth, and high debt and strong growth, are both reasonably common outcomes.”


Source: Herndon, T., Ash, M. & Pollin, R., “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff”, Political Economy Research Institute at University of Massachusetts Amherst, Working Paper Series, April 2013.

What makes it even more compelling news is that it is all a tale from the state of Massachusetts: distinguished Harvard professors (#1 university in the US) challenged by empiricists from the less well-known UMass (#97 university in the US). And despite the excellent AER data availability policy – which acts as a role model for other journals in economics – the AER failed to enforce it and make the data and code of Reinhart and Rogoff available to other researchers.

Coding errors happen, yet the greater research failing was not making the data openly available so that other researchers could review and replicate the results. If the data and code had been made available upon publication in 2010, it might not have taken three years to prove these results wrong – years during which public policy around the world may have been pushed towards stricter austerity measures. Sharing research data makes replication and discussion possible, enabling the scrutiny of research findings as well as the improvement and validation of research methods through more scientific enquiry and debate.

So Saturday’s hackathon (I might manage to connect in from the Eurostar?) is about building reliable semantic models for reporting economic analyses. Since economics is about numbers and chemistry is about numbers, there’s a lot in common, and the tools we’ve developed for Chemical Markup Language might have some re-usability. So this morning Velichka, Ross Mounce and I had a Skype call to look at some papers.

We actually spent most of the time on one:


And here’s one example of a data set in the paper. (Note that it’s behind a paywall (JSTOR), I haven’t asked permission, and I don’t need to tell you what happened between Aaron Swartz and JSTOR. But I argue that these are facts and fair criticism):


The authors regressed the dependent variable (log GDP) against the other two. My questions, as a physical scientist, are:

  • What are the units of GDP? After all, someone might use different ones. And I personally cannot understand the values.
  • What is “main mortality estimate”? If you guess without reading the paper you will almost certainly be wrong. You have to read the paper very carefully and even then take a lot on trust.

I’m not suggesting that this research should be reproducible just from this table (although it should be possible to regenerate the same results). I’m arguing that data of this sort (which I excuse as being 12 years old) is not acceptable any more. Data must be unambiguously labelled with units and described so that the source and results are reproducible.

Posted in Uncategorized | 2 Comments

JailBreaking the PDF

The Scholarly Revolution #scholrev is forging ahead. Alexander Garcia Castro is running a fantastic hackathon in Montpellier immediately after the SePublica Polemics workshop.

 

Join us in Montpellier for a one-day event to hack on scholarly PDFs!

Do you have tools that may help us to extract information from PDFs? Send us an email so that we can include them in the hackathon.

Would you like to extract citations from existing PDFs?

Wouldn’t it be cool if we, scholars, did not have to pay for citation data? What about author disambiguation?

Are you interested in identifying and extracting meaningful parts from PDFs?

Would you like to have XML/RDF for scholarly PDFs? What if you could have access to the actual content of the PDF for supporting the Web of Data?

We are interested in all of these issues – send us your tools, ideas and comments, and join us in Montpellier. We also support remote participation in the hackathon via Hangout and WebEx.

Visit us at http://scholrev.org/hackathon/

casey.mclaughlin@cci.fsu.edu
alexgarciac@gmail.com


Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac


One of the important aspects of a revolution is having the right tools, and this hackathon will collect what we’ve got and work out how to deploy it. “Jailbreaking” PDFs is not easy. It’s complex and it’s messy. But we are getting to the stage where we have the tools to:

  • Download PDFs from the open web.
  • Turn them into semantic form
  • Filter the semantics and repurpose them – everything from metadata to citations to chemistry to phylogenetic trees
  • Build a community
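As a sketch of the “filter and repurpose” step, assuming the PDF text has already been extracted (e.g. by the tools above), even a crude regular expression can pull out citation identifiers. The DOI pattern below is a simplification of the real syntax, and the sample references are invented:

```python
import re

# Simplified DOI matcher: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

def extract_dois(text):
    """Return DOI-like strings found in extracted text."""
    return DOI_PATTERN.findall(text)

sample = """References:
[1] A. Author, doi:10.1234/example.5678
[2] B. Author, https://doi.org/10.5555/demo.001
"""
print(extract_dois(sample))  # ['10.1234/example.5678', '10.5555/demo.001']
```

Citation extraction from real PDFs is much messier than this, of course – which is exactly why we need the hackathon.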

And since we work with open source, everything we do is a step forward. Once we have solved a problem it can’t be unsolved (unlike commercial closed tools, which are often withdrawn or locked). There’s a great deal we can do with collaborative action (each person can add a stone to the building).

All we have to do is care enough.

Posted in Uncategorized | Leave a comment

SePublica: Polemics in the Semantic Web (ESWC) – we need “the crazy ones”!

I have been very honoured to be invited to lead off a workshop session at the European Semantic Web Conference (ESWC). This workshop is a radical initiative to change the way we think about information. Here’s the description: http://sepublica.mywikipaper.org/drupal/

There is much controversy in the world of publishing and semantic publishing needs to both create waves in publishing and to ride the waves of change approaching in the world of publishing. We therefore invite statements for presentation at a discussion session at SePublica 2013 at ESWC in Montpellier on 26 May 2013.

We want radical, controversial and polemical positions to be articulated about semantic publishing and how we should achieve semantic publishing of scholarly works, data and all sorts of stuff. To be presented, statements must be relevant, legal and not too offensive (as judged by the workshop organisers).


All accepted statements will be presented. Submission will be through EasyChair; all accepted polemics will be published before the meeting on the Knowledgeblog platform (http://www.knowledgeblog.org), where they will be permanently archived and open for public comments. Submissions should be limited to 500 words. We can accept submissions in most formats, including Word, simple HTML (nothing in the header, no active content) or LaTeX (again, the simpler the better). Presentations on the day will be restricted to one slide that will be presented for two minutes (we will do this via timed slides) – all slide presentations must be submitted in advance. Presentations will be followed by a lively discussion.

Illustrating what we would like to have…

Here’s To The Crazy Ones. The misfits. The rebels. The trouble-makers. The round pegs in the square holes. The ones who see things differently. They’re not fond of rules, and they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them.

About the only thing you can’t do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world – are the ones who DO!” (I [AlexanderGC?] believe this is from Steve Jobs, but I am not sure about the correct attribution.)

Welcome to SEPUBLICA 2013

For over 350 years, scientific publications have been fundamental to advancing science. Since the first scholarly journals, Philosophical Transactions of the Royal Society (of London) and the Journal des Sçavans, scientific papers have been the primary, formal means by which scholars have communicated their work, e.g., hypotheses, methods, results, experiments, etc. Advances in technology have made it possible for the scientific article to adopt electronic dissemination channels, from paper-based journals to purely electronic formats. However, in spite of improvements in the distribution, accessibility and retrieval of information, little has changed in the publishing industry so far. The Web has succeeded as a dissemination platform for scientific and non-scientific papers, news, and communication in general; however, most of that information remains locked up in discrete digital documents that are replicates of their print ancestors; without machine-interpretable content they cannot be exploited in the ways we have begun to expect for other data. Semantic enhancements to scholarly works would expose both the content of those works and the implicit discourse between those works. Scholarly data and documents are of most value when they are interconnected rather than independent.

This is a tremendous vision and I am deeply honoured to be asked to spark it off. Over the next 3-4 days I’ll try to indicate some avenues. A polemic (https://en.wikipedia.org/wiki/Polemic) is:

a contentious argument that is intended to establish the truth of a specific understanding and the falsity of the contrary position. Polemics are mostly seen in arguments about very controversial topics.

My current title is

“How do we make Science Semantic?”

But even as I write I am seeing new challenges and opportunities and these posts are exploring this.

So we are challenging the way that we communicate (“publishing”) and there is much to challenge. Not many areas have been unaffected by the Web revolution, but scholarly publishing is one of them (the publishers have simply shipped the printing bill to the readers). In 1994 I was privileged to hear TimBL at CERN/WWW1 setting out the semantic web vision, and it transformed my life. I assumed it would transform science, but it hasn’t. And that’s my first and explicit polemic.

Science, with the exception of parts of bioscience, has not adopted semantics even after 20 years of opportunity. I’m not sure why, though I have revised my ideas (downward) about conservatism in academic institutions. There is a glowing opportunity – Tim can see it, I can see it, and a number of my collaborators can see it – but the vast bulk of science is untouched. Ironic that CERN was the birthplace of the Web.

It becomes clear that semantics is about revolution. The semantic web potentially empowers the individual over top-down organizations. Semantics creates human-machine organisms that communicate with other human-machine organisms. That changes the structure of society and the nature of humanity. And every year that revolution is stalled is a year of building tensions.

The primary theme is publishing. TimBL envisaged a system where everyone could be author, publisher and reader. Pre-1993, electronic (or any) publishing was an arcane art. In 1993 NCSA changed that, with the Mosaic browser and, even more importantly, NCSA HTTPd. My web server became my own personal radio station – I could publish to the world, and my only challenge (a fair one) was whether the world would listen. We see this now in blogs, of course, but blogs do not capture the true essence of the semantic revolution. They are critical in establishing the new democracy and reshaping society, but in a relatively conventional technical manner.

But today the critical polemic is digital freedom or digital slavery. There are huge interests attempting to control us – to limit our activities, to tell us what to think, to filter what we say. And for this reason much of the semantic web is stalled. For me the biggest developments in semantic information have come from Wikipedia, OpenStreetMap and other extra-academic organizations. And, of course, the Open Knowledge Foundation, where semantic information is a core part of our practice.

And yes, we must have the crazies. Socrates was a crazy. Aaron Swartz was a crazy. TimBL was a crazy.

The most important message is that single people with a passion can change the world. It’s never been easier. Crazies don’t need confidence – they already have it. But they need help, and if I can persuade people they should follow crazies, then I will have succeeded.

If you sit back and wait for the world to change, it won’t be your world.

[NOTE: I have been very busy hacking AMI2 – a PDF2Semantic tool – and hope to show at least some of it. It’s taken just over a year so far. I must be really crazy. But I can afford to be and I have a duty to be.]

Posted in Uncategorized | Leave a comment

#ignorantchemist Typographical amusement #ami2

We are doing well at reconstructing semantic material from PDFs (#AMI2) but the challenges we are thrown are considerable. Here’s today’s amusement:

#AMI2 can reconstruct most of this perfectly, but she doesn’t know what to do with a hyphenated subscript. Nor do I, but I’m just an ignorant chemist. The publishing industry tells us that they need our money to produce beautiful, easily readable typeset documents. So here’s an example of human readability from the same paper:

#AMI2 can read this, but can you? Wouldn’t it be easier to typeset it as equations? But that would take up an awful lot of space, and as we know journals have to reduce the space (I never understand why).

I have a plane journey so AMI and I can do some real hacking. We hope to release an alpha version RSN.

Posted in Uncategorized | 2 Comments

#openaccess: American Chemical Society charge additional 1000 USD for Creative Commons Licences

From the start of this month all RCUK-funded researchers will have to publish “Open Access”. Exactly what this means has been the subject of a messy set of polemics. But on the assumption that authors wish to publish under a CC-BY licence (effectively the only one compliant with the BOAI declaration – free to copy, use, re-use and redistribute), are they able to?

I’ve taken a prominent journal – Journal of the American Chemical Society – in which I have previously published. Can I publish “Open Access” and comply with the RCUK requirements?

There’s a useful tool http://www.sherpa.ac.uk/fact/

Many publishers have been extremely poor at providing simple information for readers and authors. Often you have to chase round the buttons on the site (avoiding the (self-)advertising). Sometimes I get the impression that the publishers aren’t really trying to be helpful. Ross Mounce has done a great job of trying to winkle out licence and price information, and SHERPA have now done much of the grunt work of systematizing this and providing the right button to click. So I can go straight to the key info:

What’s “Author Choice”? It’s ACS-specific and it’s some form of “Open Access” (according to the ACS). Many of these publisher-specific labels ((Author|Reader|Free|Open)(Access|Choice|Article)) have fuzzy words and fuzzy conditions.

But we have Creative Commons (and without CC we would be in an awful mess). CC provide a range of licences. ONLY CC-BY (CC0, and possibly CC-BY-SA) fit the BOAI definition of open access. Only CC-BY allows copying, re-use and redistribution.

Which, simply, is what Science is about.

Any restriction of access or re-use is anti-scientific.

It may be good business, but it harms science.

So it is possible to use a CC-BY licence when publishing with the ACS. But ONLY by paying an extra 1000 USD.

Does it COST this much to add a CC-BY licence?

Of course not. It shouldn’t cost anything (it’s a standard 50 characters on a page and a hyperlink).

It’s effectively a ransom from the publisher to raise extra revenue. The publishers can make up any set of charges they like. And the authors will either pay it or hide their publication behind an embargo-wall (say for 1-2 years).

Is this good for science? Of course not. It makes it harder to detect bad science. Humans and machines can validate or invalidate science if they are allowed to read the full text.

Very few publishers have earned respect during the evolution of Open Access. Most have been seen to value commerce above other considerations. There is no price pressure on OA.

And many “open access advocates” have actually welcomed non-CC-BY and embargoed green OA – which has led us to these huge APCs for BOAI Open Access.

To fight this we need strength from the funders and unanimity of purpose.

And we have this and it’s the primary redeeming feature in Open Access.

We need tools for uniform practice – what does a publisher offer? And we are getting them (kudos in UK to JISC, SHERPA, and Ross) and they are cutting through the fuzz.

We need tools for measuring author compliance, because many authors simply don’t care about the funders’ requirements and will still publish in a completely closed manner so as to advance their careers and funding prospects. And we are getting them.

The organizations that have let us down are the Universities and their libraries. They don’t really care. They could have fought this battle 10 years ago instead of waiting for the funders to do it. They accept whatever prices the publishers charge for OA APCs and route tax-payer money or student fees to the publishers…

But that’s another blog post. Soon…

Posted in Uncategorized | 4 Comments

Update: The struggle continues… #ami2 would like alpha testers

A quick update. I’ve been spending most of my time on #ami2 which is now at raw alpha (see below). Other items of note include:

  • Mendeley is now owned by Elsevier. I shall blog this. If you care about Open scholarship you have to be seriously concerned.
  • Open Data Workshop (http://blog.okfn.org/2013/02/27/open-data-on-the-web-workshop-april-2013/, http://www.w3.org/2013/04/odw/). Really exciting to see the concentration of interest. There was a pre-workshop evening run by OKFN with lightning talks (I gave a short one (3-4 mins) on #ami2 and the problems of scientific data). Many international visitors came.
  • Ross and Avril got married (@rmounce) – the 2nd of their 3 weddings. Great occasion – thanks all.
  • Went to talk by Glyn Moody on Copyright.
  • Meeting run by JISC/Cameron on tools to determine the openness of licences in scholarly publications.
  • Opening of Materials centre at QMU (Martin Dove). CML continues to be valuable.
  • Good progress on CML dictionaries for compchem.
  • We keep fighting for “the right to read is the right to mine” in Brussels (Licences for Europe). Do university libraries care? They’d rather buy things than fight.

Overall I worry seriously about Open Scholarship. The universities and their libraries don’t care and are giving it away and then buying it back. It’s getting worse not better. We should be fighting for our rights.

#ami2 is at raw alpha. That means that it can do useful stuff if you know what you are doing and know the limitations. We are not appealing for volunteers yet but if you want to be involved please let me know. You will need to be able to:

  • Run Maven and Java.
  • Use Bitbucket.
  • Get excited about really boring stuff (like errors in fonts, pagination etc.)
  • Sort problems yourself/communally.
  • Want to liberate information from PDFs.
  • Have a few minable papers (“Open” in some sense).
  • Be patient.
  • Respect copyright.

Currently there are no proper metrics but:

  • Ca. 1 sec per page
  • Useful compression for text-only documents (images are already compressed, of course).

Mail me or leave a message here or simply use Bitbucket (http://www.bitbucket.org/petermr/svg2xml-dev ) and give feedback.

Posted in Uncategorized | Leave a comment

#animalgarden Bottom-up Ontologies in Physical Science

On Thursday (2013-04-11) I was invited by Fiona McNeill to give a 5-minute talk on ontologies at Edinburgh (http://dream.inf.ed.ac.uk/events/ukont-13/2013_workshop_program.html ). The workshop aims included:

Amongst other areas of interest, there will be a particular focus on creating and using open data. The program and audience is intentionally very diverse; the aim is to cover areas from many disciplines. We are particularly interested in bringing together those creating and developing the technology with those using the technology in industry, government and public organisations.

A short talk requires special preparation. There is no point in trying to prove theorems in first-order logic. In fact I argue that this is far too complicated and unnecessary for physical science. So #animalgarden offered to make a presentation. (They didn’t have time for a proper shoot so they have re-used old slides and there’s no music yet.) The slides are at http://www.slideshare.net/petermurrayrust/ontologies-in-physical-science – there are a few snapshots here. (Conventional chemists can read the words – which are deadly serious – and ignore the animals.)

The problem is that much of physical science doesn’t even use common identifiers or vocabularies. So the problems are people-problems, not technical ones.

There are a very few chemical ontologies but few people use them and this is even more problematic in materials science. This domain is probably the easiest of all sciences to create ontologies for but paradoxically it hasn’t happened. Crystallography (www.iucr.org/cif) is a shining exception but computational chemistry has nothing.

So a number of us are joining together to create “bottom-up ontologies”. First, small coherent groups systematize the description of what they do in semantic form. Computational chemistry is particularly well suited to this – the programs (codes) have implicit semantics (because the code works and gives the right answers)! Then the community looks at the resultant collection of ontologies and systematizes them where they have the same concepts. In these cases there is a common entry in a communal ontology.

When this isn’t possible the ontologies create machine-readable conventions.

But few computational codes have explicit ontologies. Some define a few of the terms in their manuals, but they aren’t linked to the programs. We’ve developed Chemical Markup Language (CML), which does exactly this. Each code (NWChem, Hyperchem, DLPOLY…) creates its own ontology using a common syntax (CML) but its own identifiers.
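The per-code dictionary pattern can be sketched like this – the element and attribute names below are purely illustrative, not the actual CML dictionary schema: a common syntax shared by all codes, with code-specific identifiers and definitions.

```python
import xml.etree.ElementTree as ET

def make_dictionary(code_name, entries):
    """Build a schematic dictionary: common syntax, code-specific terms."""
    d = ET.Element("dictionary", {"code": code_name})
    for term_id, term, definition in entries:
        e = ET.SubElement(d, "entry", {"id": term_id, "term": term})
        e.text = definition
    return d

# Hypothetical entries a computational-chemistry code might define.
nwchem = make_dictionary("nwchem", [
    ("e.total", "total energy", "Total SCF energy of the system"),
    ("basis", "basis set", "Name of the Gaussian basis set used"),
])
print(ET.tostring(nwchem, encoding="unicode"))
```

Two codes that both define “total energy” can then be mapped to a single entry in the communal top-level dictionary.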

There are immediate benefits – the program output becomes semantic and can be re-used for analysis, aggregation, etc. If two groups have ontologies they can compare notes and create a top-level dictionary. As more groups join, the top-level dictionary gains more knowledge and acceptance from the community. And everyone has a feeling of ownership.

We are delighted that Hyperchem http://www.hyper.com/ have recently offered to join in the communal effort. See /pmr/2011/11/02/searchable-semantic-compchem-data-quixote-chempound-fox-and-jumbo/ for an overview of the collaboration with PNNL. And /pmr/2013/02/03/topics-and-links-for-my-talk-on-semantic-web-for-materials/ for work with CSIRO. And some idea of the great contribution from Kitware /pmr/2013/03/01/liberation-software/

The slides are CC-BY. I need to add this.

Posted in Uncategorized | 1 Comment

#ami2 #ukont2013 15-min demonstration of AMI2 (and maybe OPSIN and ChemicalTagger)

I’m demoing after lunch to the 2nd UK Ontology Network Workshop in Edinburgh and it’s billed as AMI2 (our content-mining software for #scholpub and related documents). Why content-mining at an ontology meeting? Because many ontologies are created “bottom-up” from the language we use. This post is just to announce what I am going to show (hopefully) and also to give URLs.

  • AMI2 will read PDFs and convert them to XHTML (prior to creating domain-specific XML). AMI2 is at: https://bitbucket.org/petermr/pdf2svg (for converting PDF to SVG) and https://bitbucket.org/petermr/svg2xml (for converting SVG to XML). Use https://bitbucket.org/petermr/pdf2svg-dev and https://bitbucket.org/petermr/svg2xml-dev for the bleeding-edge versions (I’ll be demoing the latter, using Maven from the command line). We’re beginning to get collaborators – we recently started working with Renaud Richardet at EPFL Lausanne, for example.

    For newcomers: AMI2 reads a PDF using PDFBox and uses PDF2SVG to interpret STM publisher characters (which usually are not Unicode). That creates a raw SVG made up of single characters, discrete paths and images. Then she uses SVG2XML to create running text and separate out figures and tables. We’ll show how species can be extracted. That’s where today stops. (In the final phase, AMI2-Aaron – in memory of Aaron Swartz – we shall support domain-specific plugins.)

  • Then we’ll show OPSIN as an example of a domain-specific plugin that translates chemical names to Chemical Markup Language.
  • Lastly we’ll show Chemical Tagger (http://chemicaltagger.ch.cam.ac.uk/ ) which uses Natural Language Processing to create semantic chemistry (using CML/XML ontology).

PARTICIPANTS: PLEASE LET AMI2 HAVE SOME PDFs TO EAT!

Posted in Uncategorized | 5 Comments

#openaccess Who owns the Law? Who owns scholarship? You must listen to Ed Walters

IF YOU HAVE ANY INTEREST IN OPENACCESS spend 15 Minutes on http://vimeo.com/63123518 “Ed Walters – Who Owns The Law?” It’s worth the time.

 

In a chillingly precise, well-researched piece Ed shows how US states have handed over the ownership of their law to commercial publishing companies: Elsevier and Thomson Reuters.

Heard of them? Yes, the same companies that publish Scopus and Web of Science.

I don’t want to take away the chilling effect of Ed’s presentation – so listen. And be outraged.

And then realise that the same thing is happening in science and that naïve Open Access is making it worse. We assume that other people will look after our rights, and meanwhile we hand over our freedom. It’s happening right now.

And unless we wake up and challenge, it will be too late.

I’ll blog in more detail after you’ve watched Ed’s video.

 

Posted in Uncategorized | Leave a comment

Teaching #ami2 to recognize biological names (binomial)

 

Erithacus rubecula (Wikimedia Commons) “the Robin”

 

#ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we’ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search it with XPath. XPath is one of the simplest and most powerful search tools for moderate chunks of information. Here she searches a page for italic phrases containing at least one space, e.g.:

I heard an Erithacus rubecula today. (Originally mistyped as “Erithacus Rubecula” – @rmounce points out the capitalization!)

AMI has extracted the HTML (<i>…</i> means italics)

<p>I heard an <i>Erithacus rubecula</i> today.</p>

Now she creates an XPath:

.//html:i[contains(.,' ')]

This means:

  • .// anywhere in the document (we can increase the precision later)
  • html:i a chunk of italics
  • contains(.,' ') the chunk itself (.) contains a space (' ')

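The same query can be sketched in Python. The stdlib ElementTree supports only a subset of XPath (no contains()), so the predicate is applied in Python rather than in the path; a full XPath engine such as Jaxen (which AMI uses) or lxml would run the expression directly:

```python
import xml.etree.ElementTree as ET

xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>I heard an <i>Erithacus rubecula</i> today.</p>
    <p>See <i>Nature</i> for details.</p>
  </body>
</html>"""

NS = {"html": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(xhtml)

# Equivalent of .//html:i[contains(.,' ')]: italic chunks with a space.
hits = [i.text for i in root.iterfind(".//html:i", NS) if " " in (i.text or "")]
print(hits)  # ['Erithacus rubecula'] - single-word italics are skipped
```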
It’s not flowing prose but it’s trivial for AMI. And the result (using Jaxen query() in XOM) is:

  • & Evolution
  • 16S, COI
  • 16S, COI, COII
  • 16S, P
  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Advances in Chrysomelidae Biology 1.
  • Ae. triuncialis
  • Aegilops geniculata
  • Annals of the Entomological Society of
  • Annals of the Entomological Society of America
  • Annual Review of Ecology and
  • Applied Statistics
  • BMC Bioinformatics
  • BMC Evolutionary Biology
  • Bioinformatics 2005, 21(24):4423-4424. 69. Sikes DS, Lewis PO: PAUPRat: PAUP implementation of the parsimony ratchet.
  • Biological Journal
  • Biology and Evolution
  • Boston University, Boston,
  • COI (13 PPIc among 16 polymorphic sites) and
  • COII, P
  • Cladistics-the International Journal of the Willi Hennig Society
  • Current Biology
  • Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs
  • Diabrotica virgifera
  • Die Käfer Mitteleuropas.
  • Doronicum clusii
  • Doronicum grandiflorum

Clearly not all italics are organisms. Many are bibliographic indicators. There are two simple ways to improve the precision:

  • Remove false positives. We can probably remove most of the bibliography by context (they occur on title pages and in references)
  • Include only known species. This is probably the best way forward and we have an excellent Open Source tool (Linnaeus) from Casey Bergman and colleagues at Manchester with > 10000 of the commonest species.
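A whitelist filter along those lines is easy to sketch. The genus list here is a tiny stand-in for a real resource such as the Linnaeus dictionary:

```python
# Toy stand-in for a species dictionary; a real one has thousands of genera.
KNOWN_GENERA = {"Achillea", "Adenostyles", "Aegilops", "Diabrotica", "Doronicum"}

def looks_like_species(phrase):
    """Binomial test: known capitalised genus + lower-case epithet."""
    words = phrase.split()
    return (len(words) >= 2
            and words[0] in KNOWN_GENERA
            and words[1][0].islower())

candidates = [
    "Annals of the Entomological Society of America",  # journal title
    "Aegilops geniculata",
    "Doronicum clusii",
    "BMC Evolutionary Biology",
]
print([c for c in candidates if looks_like_species(c)])
```

Abbreviated genera such as “Ae. triuncialis” would need an extra rule mapping abbreviations back to full genus names.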

There are other ways:

  • Morphology and lexical analysis of digraphs (the letter frequency in organism names is very different from English prose – higher vowel frequency, for example).
  • Local context (include Hearst patterns … but hey, I have to go…)
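The lexical signal is simple enough to sketch: Latin binomials tend to have a higher vowel fraction than English bibliographic prose. Any classification threshold would need calibrating on real data; this just computes the fraction:

```python
def vowel_fraction(phrase):
    """Fraction of alphabetic characters that are vowels."""
    letters = [c for c in phrase.lower() if c.isalpha()]
    return sum(c in "aeiou" for c in letters) / len(letters) if letters else 0.0

print(round(vowel_fraction("Erithacus rubecula"), 2))                   # 0.47
print(round(vowel_fraction("Annals of the Entomological Society"), 2))  # 0.42
```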

So we easily get:

  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Ae. triuncialis
  • Aegilops geniculata
  • Diabrotica virgifera
  • Doronicum clusii
  • Doronicum grandiflorum

So I hope you are now clear about how powerful content-mining is, how it will revolutionise science, and why it is a crime against human knowledge to restrict its deployment.

Posted in Uncategorized | 2 Comments