<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>petermr&#039;s blog</title>
	<atom:link href="http://blogs.ch.cam.ac.uk/pmr/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.ch.cam.ac.uk/pmr</link>
	<description>A Scientist and the Web</description>
	<lastBuildDate>Tue, 21 May 2013 14:30:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Building an OKFN model for reproducible economics; why we need it (and a puzzle for you).</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/building-an-okfn-model-for-reproducible-economics-why-we-need-it-and-a-puzzle-for-you/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/building-an-okfn-model-for-reproducible-economics-why-we-need-it-and-a-puzzle-for-you/#comments</comments>
		<pubDate>Tue, 21 May 2013 14:30:49 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4979</guid>
		<description><![CDATA[On Saturday we are having an economics hackathon in London. I&#8217;d love to be there but unfortunately am going to the Eur Sem Web Conf in Montpellier. It&#8217;s run by Velichka and colleagues – here&#8217;s the sort of reason why (from OKFN blog) Velichka Dimitrova revisited the disgraced Reinhart-Rogoff paper on austerity economics, the perfect [...]]]></description>
			<content:encoded><![CDATA[<p>On Saturday we are having an economics hackathon in London. I&#8217;d love to be there but unfortunately am going to the Eur Sem Web Conf in Montpellier. It&#8217;s run by Velichka and colleagues – here&#8217;s the sort of reason why (from OKFN blog)
</p><p style="margin-left: 36pt">Velichka Dimitrova <a href="http://blog.okfn.org/2013/04/22/reinhart-rogoff-revisited-why-we-need-open-data-in-economics/">revisited the disgraced Reinhart-Rogoff paper on austerity economics</a>, the perfect evidence of <strong>the need for open data in economics</strong> – and was picked up by the <a href="http://blogs.lse.ac.uk/impactofsocialsciences/2013/04/24/reinhart-rogoff-revisited-why-we-need-open-data-in-economics/">London School of Economics</a> and <a href="http://www.newscientist.com/article/dn23448-how-to-stop-excel-errors-driving-austerity-economics.html">the New Scientist</a>.
</p><p>The point is that economists made very serious mistakes and that proper management of the data and tools could have prevented it. We have to work towards reproducible computation in sciences and economics. From Velichka&#8217;s blog (and then I set you a puzzle at the end):
</p><p style="margin-left: 36pt"><span style="font-family:Times New Roman;font-size:12pt">Another economics scandal made the news last week. Harvard Kennedy School professor Carmen Reinhart and Harvard University professor Kenneth Rogoff argued in <a href="http://www.nber.org/papers/w15639"><span style="color:blue;text-decoration:underline">their 2010 NBER paper</span></a> that economic growth slows down when the debt/GDP ratio exceeds the threshold of 90 percent of GDP. <a href="http://www.aeaweb.org/articles.php?doi=10.1257/aer.100.2.573"><span style="color:blue;text-decoration:underline">These results were also published</span></a> in one of the most prestigious economics journals – <a href="http://www.aeaweb.org/aer/index.php"><span style="color:blue;text-decoration:underline">the American Economic Review (AER)</span></a> – and had a powerful resonance in a period of serious economic and public policy turmoil when governments around the world slashed spending in order to decrease the public deficit and stimulate economic growth.
</span></p><p style="margin-left: 36pt"><span style="font-family:Times New Roman;font-size:12pt">Yet, they were proven wrong. Thomas Herndon, Michael Ash and Robert Pollin from the University of Massachusetts (UMass) <a href="http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf"><span style="color:blue;text-decoration:underline">tried to replicate the results of Reinhart and Rogoff</span></a> and criticised them on the basis of three reasons:
</span></p><ul style="margin-left: 72pt"><li><span style="font-family:Times New Roman;font-size:12pt"><strong>Coding errors:</strong> due to a spreadsheet error five countries were excluded completely from the sample resulting in significant error of the average real GDP growth and the debt/GDP ratio in several categories
</span></li><li><span style="font-family:Times New Roman;font-size:12pt"><strong>Selective exclusion of available data and data gaps:</strong> Reinhart and Rogoff exclude Australia (1946-1950), New Zealand (1946-1949) and Canada (1946-1950). This exclusion is alone responsible for a significant reduction of the estimated real GDP growth in the highest public debt/GDP category
</span></li><li><span style="font-family:Times New Roman;font-size:12pt"><strong>Unconventional weighting of summary statistics:</strong> the authors do not discuss their decision to weight equally by country rather than by country-year, which could be arbitrary and ignores the issue of serial correlation.
</span></li></ul><p><span style="font-family:Times New Roman;font-size:12pt">The implications of these results are that countries with high levels of public debt experience only &#8220;modestly diminished&#8221; average GDP growth rates and as the UMass authors show there is a wide range of GDP growth performances at every level of public debt among the twenty advanced economies in the survey of Reinhart and Rogoff. Even if the negative trend is still visible in the results of the UMass researchers, the data fits the trend very poorly: <a href="http://www.bloomberg.com/news/2013-04-17/reinhart-rogoff-on-debt-and-growth-fake-but-accurate-.html"><span style="color:blue;text-decoration:underline">&#8220;low debt and poor growth, and high debt and strong growth, are both reasonably common outcomes.&#8221;</span></a>
		</span></p><p style="margin-left: 36pt"><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/05/052113_1430_BuildinganO1.jpg" alt="" /><span style="font-family:Times New Roman;font-size:12pt">
		</span></p><p style="margin-left: 36pt">Source: Herndon, T., Ash, M. &amp; Pollin, R., &#8220;Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff&#8221;, Political Economy Research Institute at University of Massachusetts: Amherst Working Paper Series. April 2013.<span style="font-family:Times New Roman;font-size:12pt">
		</span></p><p style="margin-left: 36pt"><span style="font-family:Times New Roman;font-size:12pt">What makes it even more compelling news is that it is all a tale from the state of Massachusetts: distinguished Harvard professors (<a href="http://colleges.usnews.rankingsandreviews.com/best-colleges/harvard-university-2155"><span style="color:blue;text-decoration:underline">#1 university in the US</span></a>) challenged by empiricists from the lesser-known UMass (<a href="http://colleges.usnews.rankingsandreviews.com/best-colleges/university-of-massachusetts-amherst-2221"><span style="color:blue;text-decoration:underline">#97 university in the US</span></a>). Then despite the excellent <a href="http://www.aeaweb.org/aer/data.php"><span style="color:blue;text-decoration:underline">AER data availability policy</span></a> – which acts as a role model for other journals in economics – the AER has failed to enforce it and make the data and code of Reinhart and Rogoff available to other researchers.
</span></p><p style="margin-left: 36pt"><span style="font-family:Times New Roman;font-size:12pt">Coding errors happen, yet the greater research misconduct was not allowing other researchers to review and replicate the results through making the data openly available. If the data and code were made available upon publication in 2010, it may not have taken three years to prove these results wrong, which may have influenced the direction of public policy around the world towards stricter austerity measures. Sharing research data means a possibility to replicate and discuss, enabling the scrutiny of research findings as well as improvement and validation of research methods through more scientific enquiry and debate.
</span></p><p><span style="font-family:Times New Roman;font-size:12pt">So Saturday&#8217;s hackathon (I might manage to connect in on Eurostar?) is about building reliable semantic models for reporting economics analyses. Since economics is about numbers and chemistry is about numbers there&#8217;s a lot in common and the tools we&#8217;ve developed for Chemical Markup Language might have some re-usability. So this morning Velichka, Ross Mounce and I had a skype to look at some papers.
</span></p><p><span style="font-family:Times New Roman;font-size:12pt">We actually spent most of the time on one:
</span></p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/05/052113_1430_BuildinganO2.png" alt="" /><span style="font-family:Times New Roman;font-size:12pt">
		</span></p><p><span style="font-family:Times New Roman;font-size:12pt">And here&#8217;s one example of a data set in the paper. (Note it&#8217;s behind a paywall (JSTOR) and I haven&#8217;t asked permission and I don&#8217;t need to tell you about what happened between Aaron Swartz and JSTOR. But I argue that these are facts and fair criticism):
</span></p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/05/052113_1430_BuildinganO3.png" alt="" /><span style="font-family:Times New Roman;font-size:12pt">
		</span></p><p>The authors regressed the dependent variable (log GDP) against the other two. My questions, as a physical scientist are:
</p><ul><li>What are the units of GDP? After all someone might use different ones. And I personally cannot understand the values.
</li><li>What is &#8220;main mortality estimate&#8221;? If you guess without reading the paper you will almost certainly be wrong. You have to read the paper very carefully and even then take a lot on trust.
</li></ul><p>I&#8217;m not suggesting that this research is reproducible just from this table (although it should be possible to regenerate the same results). I&#8217;m arguing that data of this sort (I excuse this example as it is 12 years old) is not acceptable any more. Data must be unambiguously labelled with units and described so the source and data are replicable.</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/building-an-okfn-model-for-reproducible-economics-why-we-need-it-and-a-puzzle-for-you/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JailBreaking the PDF</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/jailbreaking-the-pdf/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/jailbreaking-the-pdf/#comments</comments>
		<pubDate>Tue, 21 May 2013 13:31:46 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4974</guid>
		<description><![CDATA[The Scholarly Revolution #scholrev is forging ahead. Alexander Garcia Castro is running a fantastic hackathon in Montpellier immediately after the SePublica Polemics workshop. Join us in Montpellier for a one-day event to hack on scholarly PDFs! Do you have tools that may help us to extract information from PDFs? Send us an email so that we can [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-family:Arial">The Scholarly Revolution #scholrev is forging ahead. Alexander Garcia Castro is running a fantastic hackathon in Montpellier immediately after the SePublica Polemics workshop.
</span></p><p>
 </p><p style="margin-left: 36pt"><span style="font-family:Arial"><em>Join us in Montpellier for a one-day event to hack on scholarly PDFs!<br /><br />Do you have tools that may help us to extract information from PDFs?<br />send us an email so that we can include them in the hackathon.<br /><br />Would you like to extract citations from existing PDFs?<br /><br />Wouldn&#8217;t it be cool if we, scholars, did not have to pay for citation<br />data? What about author disambiguation?<br /><br />Are you interested in identifying and extracting meaningful parts from PDFs?<br /><br />Would you like to have XML/RDF for scholarly PDFs? What if you could<br />have access to the actual content of the PDF for supporting the Web of<br />Data?<br /><br />We are interested in all of these issues, send us your tools, ideas,<br />comments and join us in Montpellier. We are also supporting remote<br />participation to the hackathon -hangout and webex.<br /><br />Visit us at <a href="http://scholrev.org/hackathon/" target="_blank"><span style="color:blue;text-decoration:underline">http://scholrev.org/hackathon/</span></a><br /><br /><a href="mailto:casey.mclaughlin@cci.fsu.edu"><span style="color:blue;text-decoration:underline">casey.mclaughlin@cci.fsu.edu</span></a><br /><a href="mailto:alexgarciac@gmail.com"><span style="color:blue;text-decoration:underline">alexgarciac@gmail.com</span></a><br /><br />&#8211;<br />Alexander Garcia<br /><a href="http://www.alexandergarcia.name/" target="_blank"><span style="color:blue;text-decoration:underline">http://www.alexandergarcia.name/</span></a><br /><a href="http://www.usefilm.com/photographer/75943.html" target="_blank"><span style="color:blue;text-decoration:underline">http://www.usefilm.com/photographer/75943.html</span></a><br /><a href="http://www.linkedin.com/in/alexgarciac" target="_blank"><span style="color:blue;text-decoration:underline">http://www.linkedin.com/in/alexgarciac</span></a>
			</em></span></p><p style="margin-left: 36pt"><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/05/052113_1331_JailBreakin1.gif" alt="" /><span style="font-family:Times New Roman;font-size:12pt"><em>
			</em></span></p><p>One of the important aspects of a revolution is having the right tools and this hackathon will collect what we&#8217;ve got and work out how to deploy them. &#8220;Jailbreaking&#8221; PDFs is not easy. It&#8217;s complex and it&#8217;s messy. But we are getting to the stage where we have the tools to:
</p><ul><li>Download PDFs from the open web.
</li><li>Turn them into semantic form
</li><li>Filter the semantics and repurpose them – everything from metadata to citations to chemistry to phylogenetic trees
</li><li>Build a community
</li></ul><p>And since we work with open source, everything we do is a step forward. Once we have solved a problem it can&#8217;t be unsolved (unlike commercial closed tools which are often withdrawn or locked). There&#8217;s a great deal we can do with collaborative action (each person can add a stone to the building). 
</p><p>All we have to do is care enough.</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/jailbreaking-the-pdf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SePublica: Polemics in the Semantic Web (ESWC) – we need “the crazy ones”!</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/sepublica-polemics-in-the-semantic-web-sewc-we-need-the-crazy-ones/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/sepublica-polemics-in-the-semantic-web-sewc-we-need-the-crazy-ones/#comments</comments>
		<pubDate>Tue, 21 May 2013 09:24:34 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4969</guid>
		<description><![CDATA[I have been very honoured to be invited to lead off a workshop session at the European Semantic Web Conference (ESWC). This workshop is a radical initiative to change the way we think about information. Here&#8217;s the description: http://sepublica.mywikipaper.org/drupal/ There is much controversy in the world of publishing and semantic publishing needs to both create [...]]]></description>
			<content:encoded><![CDATA[<p>I have been very honoured to be invited to lead off a workshop session at the European Semantic Web Conference (ESWC). This workshop is a radical initiative to change the way we think about information. Here&#8217;s the description: <a href="http://sepublica.mywikipaper.org/drupal/">http://sepublica.mywikipaper.org/drupal/</a>
	</p><p style="margin-left: 36pt"><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">There is much controversy in the world of publishing and semantic publishing needs to both create waves in publishing and to ride the waves of change approaching in the world of publishing. We therefore invite statements for presentation at a discussion session at SePublica 2013 at ESWC in Montpellier on 26 May 2013.</span><br /></span><br /><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">We want radical, controversial and polemical positions to be articulated about semantic publishing and how we should achieve semantic publishing of scholarly works, data and all sorts of stuff. To be presented, statements must be relevant, legal and not too offensive (as judged by the workshop organisers).</span><br /><br /></span><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">All accepted statements will be presented. Submission will be through EasyChair; all accepted polemics will be</span><br /></span><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">published before the meeting on the Knowledgeblog platform (<a href="http://www.knowledgeblog.org" title="http://www.knowledgeblog.org">http://www.knowledgeblog.org</a>), where they will be permanently archived, and open for public comments. Submissions should be limited to 500 words. We can accept submissions in most formats, including Word, simple HTML (nothing in the header, no active content) or LaTeX (again the simpler the better). Presentations on the day will be restricted to one slide that will be presented for two minutes (we will do this via timed slides) &#8211; all slide presentations must be submitted in advance. 
Presentations will be followed by a vivid discussion.</span><br /></span><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">Illustrating what we would like to have&#8230;</span><br /></span><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">&#8220;Here&#8217;s To The Crazy Ones. The misfits. The rebels. The trouble-makers. The round pegs in the square holes. The ones who see things differently. They&#8217;re not fond of rules, and they have no respect for the status-quo. You can quote them, disagree with them, glorify, or vilify them.</span><br /></span><br /><span style="color:#222222;font-family:Arial"><span style="background-color:white">About the only thing you can&#8217;t do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world &#8211; are the ones who DO !&#8221;</span> (<span style="background-color:white">I [AlexanderGC?] believe this is from Steve Jobs, but I am not sure about the right attribution of the sentence.)</span></span>
	</p><p style="margin-left: 36pt"><strong>Welcome to SEPUBLICA 2013</strong>
	</p><p style="margin-left: 36pt"><span style="font-family:Times New Roman">For over 350 years, scientific publications have been fundamental to advancing science. Since the first scholarly journals, Philosophical Transactions of the Royal Society (of London) and the Journal des Sçavans, scientific papers have been the primary, formal means by which scholars have communicated their work, e.g., hypotheses, methods, results, experiments, etc. Advances in technology have made it possible for the scientific article to adopt electronic dissemination channels, from paper-based journals to purely electronic formats. However, in spite of improvements in the distribution, accessibility and retrieval of information, little has changed in the publishing industry so far. The Web has succeeded as a dissemination platform for scientific and non-scientific papers, news, and communication in general; however, most of that information remains locked up in discrete digital documents that are replicates of their print ancestors; without machine-interpretable content they lack the exploitation we have begun to expect from other data. Semantic enhancements to scholarly works would expose both the content of those works and the implicit discourse between those works. Scholarly data and documents are of most value when they are interconnected rather than independent. </span>
	</p><p>This is a tremendous vision and I am deeply honoured to be asked to spark it off. I&#8217;ll try and indicate over the next 3-4 days some avenues. Polemics (<a href="https://en.wikipedia.org/wiki/Polemic">https://en.wikipedia.org/wiki/Polemic</a> ) are:
</p><p style="margin-left: 36pt"><em> a contentious <a href="https://en.wikipedia.org/wiki/Argument" title="Argument">argument</a> that is intended to establish the truth of a specific understanding and the falsity of the contrary position. Polemics are mostly seen in arguments about very controversial topics.
</em></p><p>My current title is 
</p><p><strong>&#8220;How do we make Science Semantic&#8221;?
</strong></p><p>But even as I write I am seeing new challenges and opportunities and these posts are exploring this.
</p><p>So we are challenging the way that we communicate &#8220;publishing&#8221; and there is much to challenge. Not many areas have been unaffected by the Web revolution, but scholarly publishing is one of those (the publishers have simply shipped the printing bill to the readers). In 1994 I was privileged to hear TimBL at CERN/WWW1 setting out the semantic web vision and it transformed my life. I assumed it would transform science, but it hasn&#8217;t. And that&#8217;s my first and explicit polemic.
</p><p>Science, with the exception of parts of bioscience, has not adopted semantics even after 20 years of opportunity. I&#8217;m not sure why, though I have revised my ideas (downward) about conservatism in academic institutions. There is a glowing opportunity – Tim can see it, I can see it, and a number of my collaborators can see it, but the vast bulk of science is untouched. Ironic that CERN was the birthplace of the Web.
</p><p>It becomes clear that semantics is about revolution. The semantic web potentially empowers the individual over top-down organizations. Semantics creates human-machine organisms that communicate with other human-machine organisms. That changes the structure of society and the nature of humanity. And every year that revolution is stalled is a year of building tensions.
</p><p>The primary theme is publishing. TimBL envisaged a system where everyone could be author, publisher and reader. Pre-1993 electronic (or any) publishing was an arcane art. In 1993 NCSA changed that, with the Mosaic browser and even more importantly NCSA HTTPd. My web server became my own personal radio station – I could publish to the world and my only challenge – a fair one &#8211; was whether the world would listen. We see this now in blogs, of course, but blogs do not capture the true essence of the semantic revolution. They are critical in establishing the new democracy and reshaping society, but in a relatively conventional technical manner.
</p><p>But today the critical polemic is digital freedom or digital slavery. There are huge interests attempting to control us – to limit our activities, to tell us what to think, to filter what we say. And for this reason much of the semantic web is stalled. For me the biggest developments in semantic information have been with Wikipedia, OpenStreetMap and other extra-academic organizations. And, of course, the Open Knowledge Foundation, where semantic information is a core part of our practice.
</p><p>And yes, we must have the crazies. Socrates was a crazy. Aaron Swartz was a crazy. TimBL was a crazy. 
</p><p>The most important message is that single people with a passion can change the world. It&#8217;s never been easier. Crazies don&#8217;t need confidence – they already have it. But they need help, and if I can persuade people they should follow crazies, then I will have succeeded.
</p><p>If you sit back and wait for the world to change, it won&#8217;t be your world.
</p><p>[NOTE: I have been very busy hacking AMI2 – a PDF2Semantic tool – and hope to show at least some of it. It's taken just over a year so far. I must be really crazy. But I can afford to be and I have a duty to be.]</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/05/21/sepublica-polemics-in-the-semantic-web-sewc-we-need-the-crazy-ones/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#ignorantchemist Typographical amusement #ami2</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/30/ignorantchemist-typographical-amusement-ami2/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/30/ignorantchemist-typographical-amusement-ami2/#comments</comments>
		<pubDate>Tue, 30 Apr 2013 05:54:16 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4964</guid>
		<description><![CDATA[We are doing well at reconstructing semantic material from PDFs (#AMI2) but the challenges we are thrown are considerable. Here&#8217;s today&#8217;s amusement: #AMI2 can reconstruct most of this perfectly, but she doesn&#8217;t know what to do with a hyphenated-subscript. Nor do I, but I&#8217;m just an ignorant chemist. The publishing industry tells us that they [...]]]></description>
			<content:encoded><![CDATA[<p>We are doing well at reconstructing semantic material from PDFs (#AMI2) but the challenges we are thrown are considerable. Here&#8217;s today&#8217;s amusement:
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/043013_0554_ignorantche11.png" alt="" />
	</p><p>#AMI2 can reconstruct most of this perfectly, but she doesn&#8217;t know what to do with a <strong>hyphenated-subscript</strong>. Nor do I, but I&#8217;m just an ignorant chemist. The publishing industry tells us that they need our money to produce beautiful easily readable typeset documents. So here&#8217;s an example of human readability from the same paper:
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/043013_0554_ignorantche21.png" alt="" />
	</p><p>#AMI2 can read this, but can <strong>you</strong>? Wouldn&#8217;t it be easier to typeset it as equations? But that would take up an awful lot of space, and as we know journals have to reduce the space (I never understand why).
</p><p>I have a plane journey so AMI and I can do some real hacking. We hope to release an alpha version RSN.</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/30/ignorantchemist-typographical-amusement-ami2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>#openaccess: American Chemical Society charge additional 1000 USD for Creative Commons Licences</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/openaccess-american-chemical-society-charge-additional-1000-usd-for-creative-commons-licences/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/openaccess-american-chemical-society-charge-additional-1000-usd-for-creative-commons-licences/#comments</comments>
		<pubDate>Thu, 25 Apr 2013 08:55:00 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4959</guid>
		<description><![CDATA[From the start of this month all RCUK-funded researchers will have to publish &#8220;Open Access&#8221;. Exactly what this means has been the subject of a messy set of polemics. But on the assumption that authors wish to publish under a CC-BY licence (effectively the only one compliant with the BOAI declaration – free to copy, [...]]]></description>
			<content:encoded><![CDATA[<p>From the start of this month all RCUK-funded researchers will have to publish &#8220;Open Access&#8221;. Exactly what this means has been the subject of a messy set of polemics. But on the assumption that authors wish to publish under a CC-BY licence (effectively the only one compliant with the BOAI declaration – free to copy, use, re-use and redistribute), are they able to? 
</p><p>I&#8217;ve taken a prominent journal – Journal of the American Chemical Society – in which I have previously published. Can I publish &#8220;Open Access&#8221; and comply with the RCUK requirements? 
</p><p>There&#8217;s a useful tool <a href="http://www.sherpa.ac.uk/fact/">http://www.sherpa.ac.uk/fact/</a>
	</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/042513_0854_openaccessA1.png" alt="" />
	</p><p>Many publishers have been extremely poor at providing simple information for readers and authors. Often you have to chase round the buttons on the site (avoiding the (self-)advertising). Sometimes I get the impression that the publishers aren&#8217;t really trying to be helpful. Ross Mounce has done a great job of trying to winkle out licence and price info, and SHERPA have now done much of the grunt work to systematize this as well, providing the right button to click. So I can go straight to the key info:
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/042513_0854_openaccessA2.png" alt="" />
	</p><p>What&#8217;s &#8220;Author Choice&#8221;? It&#8217;s ACS-specific and it&#8217;s some form of &#8220;Open Access&#8221; (according to the ACS). Many of these publisher-specific labels ((Author|Reader|Free|Open)(Access|Choice|Article)) have fuzzy words and fuzzy conditions. 
</p><p>But we have Creative Commons (and without CC we would be in an awful mess). CC provide a range of licences. ONLY CC-BY (along with CC0, and possibly CC-BY-SA) fits the BOAI definition of open access. Only CC-BY allows copying, re-use and redistribution.
</p><p>Which, simply, is what Science is about.
</p><p>Any restriction of access or re-use is anti-scientific.
</p><p>It may be good business, but it harms science.
</p><p>So it is possible to use a CC-BY licence when publishing with the ACS. But ONLY by paying an extra 1000 USD.
</p><p>Does it COST this much to add a CC-BY licence?
</p><p>Of course not. It shouldn&#8217;t cost anything (it&#8217;s a standard 50 characters on a page and a hyperlink).
</p><p>It&#8217;s effectively a ransom from the publisher to raise extra revenue. The publishers can make up any set of charges they like. And the authors will either pay it or hide their publication behind an embargo-wall (say for 1-2 years).
</p><p>Is this good for science? Of course not. It makes it harder to detect bad science. Humans and machines can validate or invalidate science if they are allowed to read the full text. 
</p><p>Very few publishers have earned respect during the evolution of Open Access. Most have been seen to value commerce above other considerations. There is no price pressure on OA. 
</p><p> And many &#8220;open access advocates&#8221; have actually welcomed non-CC-BY and embargoed green OA – which has led us to these huge APCs for BOAI Open Access.
</p><p>To fight this we need strength from the funders and unanimity of purpose.
</p><p>And we have this and it&#8217;s the primary redeeming feature in Open Access.
</p><p>We need tools for uniform practice – what does a publisher offer? And we are getting them (kudos in UK to JISC, SHERPA, and Ross) and they are cutting through the fuzz. 
</p><p>We need tools for measuring author compliance. Because many authors simply don&#8217;t care about the funders&#8217; requirements and will still publish in a completely closed manner so as to advance their careers and funding prospects. And we are getting them.
</p><p>The organizations that have let us down are the Universities and their libraries. They don&#8217;t really care. They could have fought this battle 10 years ago instead of waiting for the funders to do it. They accept whatever prices the publishers charge for OA APCs and route tax-payer money or student fees to the publishers… 
</p><p>But that&#8217;s another blog post. Soon…
</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/openaccess-american-chemical-society-charge-additional-1000-usd-for-creative-commons-licences/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Update: The struggle continues… #ami2 would like alpha testers</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/update-the-struggle-continues-ami2-would-like-alpha-testers/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/update-the-struggle-continues-ami2-would-like-alpha-testers/#comments</comments>
		<pubDate>Thu, 25 Apr 2013 08:06:49 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4954</guid>
		<description><![CDATA[A quick update. I&#8217;ve been spending most of my time on #ami2 which is now at raw alpha (see below). Other items of note include: Mendeley is now owned by Elsevier. I shall blog this. If you care about Open scholarship you have to be seriously concerned. Open Data Workshop (http://blog.okfn.org/2013/02/27/open-data-on-the-web-workshop-april-2013/, http://www.w3.org/2013/04/odw/ ). Really exciting [...]]]></description>
			<content:encoded><![CDATA[<p>A quick update. I&#8217;ve been spending most of my time on #ami2 which is now at raw alpha (see below). Other items of note include:
</p><ul><li>Mendeley is now owned by Elsevier. I shall blog this. If you care about Open scholarship you have to be seriously concerned.
</li><li>Open Data Workshop (<a href="http://blog.okfn.org/2013/02/27/open-data-on-the-web-workshop-april-2013/">http://blog.okfn.org/2013/02/27/open-data-on-the-web-workshop-april-2013/</a>, <a href="http://www.w3.org/2013/04/odw/">http://www.w3.org/2013/04/odw/</a> ). Really exciting to see the concentration of interest. There was a pre-workshop evening run by OKFN with lightning talks (I gave a short one (3-4 mins) on #ami2 and the problems of scientific data). Many international visitors came.
</li><li>Ross and Avril got married (@rmounce) – their 2<sup>nd</sup> of 3 weddings. Great occasion – thanks all.
</li><li>Went to talk by Glyn Moody on Copyright. 
</li><li>Meeting by JISC/Cameron on tools to determine openness of licences in scholpubs.
</li><li>Opening of Materials centre at QMU (Martin Dove). CML continues to be valuable.
</li><li>Good progress on CML dictionaries for compchem.
</li><li>We keep fighting for &#8220;the right to read is the right to mine&#8221; at Brussels (Licences for Europe). Do university libraries care?? They&#8217;d rather buy things than fight.
</li></ul><p>Overall I worry seriously about Open Scholarship. The universities and their libraries don&#8217;t care and are giving it away and then buying it back. It&#8217;s getting worse not better. We should be fighting for our rights. 
</p><p>#ami2 is at raw alpha. That means that it can do useful stuff <strong>if you know what you are doing and know the limitations</strong>. We are not appealing for volunteers yet but if you want to be involved please let me know. You will need to be able to:
</p><ul><li>Run Maven and Java.
</li><li>Use Bitbucket.
</li><li>Get excited about really boring stuff (like errors in fonts, pagination etc.) 
</li><li>Sort problems yourself/communally.
</li><li>Want to liberate information from PDFs.
</li><li>Have a few minable papers (&#8220;Open&#8221; in some sense).
</li><li>Be patient.
</li><li>Respect copyright.
</li></ul><p>Currently there are no proper metrics but:
</p><ul><li>Ca. 1 sec per page
</li><li>Useful compression for text-only (images can&#8217;t compress, of course).
</li></ul><p>Mail me or leave a message here or simply use Bitbucket (<a href="http://www.bitbucket.org/petermr/svg2xml-dev">http://www.bitbucket.org/petermr/svg2xml-dev</a> ) and <strong>give feedback</strong>.
</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/25/update-the-struggle-continues-ami2-would-like-alpha-testers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#animalgarden Bottom-up Ontologies in Physical Science</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/14/animalgarden-bottom-up-ontologies-in-physical-science/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/14/animalgarden-bottom-up-ontologies-in-physical-science/#comments</comments>
		<pubDate>Sun, 14 Apr 2013 08:30:48 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4951</guid>
		<description><![CDATA[On Thursday (2013-04-11) I was invited by Fiona McNeill to give a 5-minute talk on ontologies at Edinburgh (http://dream.inf.ed.ac.uk/events/ukont-13/2013_workshop_program.html ). The workshop aims included: Amongst other areas of interest, there will be a particular focus on creating and using open data. The program and audience is intentionally very diverse; the aim is to cover areas [...]]]></description>
			<content:encoded><![CDATA[<p>On Thursday (2013-04-11) I was invited by Fiona McNeill to give a 5-minute talk on ontologies at Edinburgh (<a href="http://dream.inf.ed.ac.uk/events/ukont-13/2013_workshop_program.html">http://dream.inf.ed.ac.uk/events/ukont-13/2013_workshop_program.html</a> ). The workshop aims included:  
</p><p style="margin-left: 36pt"><em>Amongst other areas of interest, there will be a particular focus on creating and using <strong>open data</strong>. The program and audience is intentionally very diverse; the aim is to cover areas from many disciplines. We are particularly interested in bringing together those creating and developing the technology with those using the technology in industry, government and public organisations.
</em></p><p>A short talk requires special preparation. No point in trying to prove theorems in first-order logic. In fact I argue that this is far too complicated and unnecessary for physical science. So #animalgarden offered to make a presentation. (They didn&#8217;t have time to have a proper shoot so they have re-used old slides and there&#8217;s no music yet). The slides are at <a href="http://www.slideshare.net/petermurrayrust/ontologies-in-physical-science">http://www.slideshare.net/petermurrayrust/ontologies-in-physical-science</a> &#8211; there are a few snapshots here. (Conventional chemists can read the words – which are deadly serious &#8211; and ignore the animals &#9785; )
</p><p>
		<img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/041413_0830_animalgarde1.jpg" alt="" />
	</p><p>The problem is that much of physical science doesn&#8217;t even use common identifiers or vocabularies. So the problems are people-problems, not technical ones.
</p><p>There <em>are a very few</em> chemical ontologies but few people use them and this is even more problematic in materials science. This domain is probably the easiest of all sciences to create ontologies for but paradoxically it hasn&#8217;t happened. Crystallography (<a href="http://www.iucr.org/cif">www.iucr.org/cif</a>) is a shining exception but computational chemistry has nothing.
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/041413_0830_animalgarde2.jpg" alt="" />
	</p><p>So a number of us are joining together to create &#8220;bottom-up ontologies&#8221;. Firstly a small coherent group systematizes the description of what they do in semantic form. Computational chemistry is particularly well suited to this – the programs (codes) have implicit semantics (because the code works and gives the right answers)! Then the community looks at the resultant collection of ontologies and systematizes them where they have the same concepts. In these cases there is a common entry in a communal ontology. 
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/041413_0830_animalgarde3.jpg" alt="" />
	</p><p>When this isn&#8217;t possible the ontologies create machine-readable <strong>conventions.
</strong></p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/041413_0830_animalgarde4.jpg" alt="" />
	</p><p>But few computational codes have explicit ontologies. Some define a few of the terms in their manuals, but they aren&#8217;t linked to the programs. We&#8217;ve developed Chemical Markup Language which does exactly this. Each code (NWChem, Hyperchem, DLPOLY…) creates its own ontology using a common syntax (CML) but its own identifiers. 
</p><p>There are immediate benefits – the program output becomes semantic and can be re-used for analysis, aggregation, etc. If two groups have ontologies they compare notes and create a top-level dictionary. As more groups join, the top-level dictionary gains more knowledge and acceptance from the community. And everyone has a feeling of ownership.
</p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/041413_0830_animalgarde5.jpg" alt="" />
	</p><p>We are delighted that Hyperchem <a href="http://www.hyper.com/">http://www.hyper.com/</a> have recently offered to join in the communal effort.  See <a href="http://blogs.ch.cam.ac.uk/pmr/2011/11/02/searchable-semantic-compchem-data-quixote-chempound-fox-and-jumbo/">http://blogs.ch.cam.ac.uk/pmr/2011/11/02/searchable-semantic-compchem-data-quixote-chempound-fox-and-jumbo/</a> for an overview of the collaboration with PNNL. And <a href="http://blogs.ch.cam.ac.uk/pmr/2013/02/03/topics-and-links-for-my-talk-on-semantic-web-for-materials/">http://blogs.ch.cam.ac.uk/pmr/2013/02/03/topics-and-links-for-my-talk-on-semantic-web-for-materials/</a> for work with CSIRO.  And some idea of the great contribution from Kitware <a href="http://blogs.ch.cam.ac.uk/pmr/2013/03/01/liberation-software/">http://blogs.ch.cam.ac.uk/pmr/2013/03/01/liberation-software/</a>
	</p><p>The slides are CC-BY. I need to add this.</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/14/animalgarden-bottom-up-ontologies-in-physical-science/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>#ami2 #ukont2013 15-min demonstration of AMI2 (and maybe OPSIN and ChemicalTagger)</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/11/ami2-ukont2013-15-min-demonstration-of-ami2-and-maybe-opsin-and-chemicaltagger/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/11/ami2-ukont2013-15-min-demonstration-of-ami2-and-maybe-opsin-and-chemicaltagger/#comments</comments>
		<pubDate>Thu, 11 Apr 2013 12:08:50 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4944</guid>
		<description><![CDATA[I&#8217;m demoing after lunch to the 2nd UK Ontology Network Workshop in Edinburgh and it&#8217;s billed as AMI2 (our content-mining software for #scholpub and related documents). Why content-mining at an ontology meeting? Because many ontologies are created &#8220;bottom-up&#8221; from the language we use. This post is just to announce what I am going to show [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m demoing after lunch to the 2<sup>nd</sup> UK Ontology Network Workshop in Edinburgh and it&#8217;s billed as AMI2 (our content-mining software for #scholpub and related documents). Why content-mining at an ontology meeting? Because many ontologies are created &#8220;bottom-up&#8221; from the language we use. This post is just to announce what I am going to show (hopefully) and also to give URLs.
</p><ul><li><div>AMI2 will read PDFs and convert them to XHTML (prior to creating domain-specific XML). AMI2 is at: <a href="https://bitbucket.org/petermr/pdf2svg">https://bitbucket.org/petermr/pdf2svg</a> (for converting PDF to SVG) and <a href="https://bitbucket.org/petermr/svg2xml">https://bitbucket.org/petermr/svg2xml</a> (for converting SVG to XML). Use <a href="https://bitbucket.org/petermr/pdf2svg-dev">https://bitbucket.org/petermr/pdf2svg-dev</a> and <a href="https://bitbucket.org/petermr/svg2xml-dev">https://bitbucket.org/petermr/svg2xml-dev</a> for the code for the bleeding edge versions (I&#8217;ll be demoing the latter, using Maven from the commandline).  We&#8217;re beginning to get collaborators – recently AMI2 started working with Renaud Richardet in EPFL Lausanne, for example.
</div><p>For newcomers, AMI2 reads a PDF using PDFBox, and uses PDF2SVG to interpret STM publisher characters (which usually are not Unicode). That creates a raw SVG made up of single characters and discrete paths and images. Then she uses SVG2XML to create running text and separate figures and tables. We&#8217;ll show how species can be extracted. That&#8217;s where today stops. (In the final phase, AMI2-Aaron (in memory of Aaron Swartz), we shall support domain-specific plugins).
</p></li><li>Then we&#8217;ll show OPSIN to show an example of a domain-specific plugin that translates chemical names to Chemical Markup Language.
</li><li>Lastly we&#8217;ll show Chemical Tagger (<a href="http://chemicaltagger.ch.cam.ac.uk/">http://chemicaltagger.ch.cam.ac.uk/</a> ) which uses Natural Language Processing to create semantic chemistry (using CML/XML ontology).
</li></ul><p>PARTICIPANTS: PLEASE LET AMI2 HAVE SOME PDFs TO EAT!</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/11/ami2-ukont2013-15-min-demonstration-of-ami2-and-maybe-opsin-and-chemicaltagger/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>#openaccess Who owns the Law? Who owns scholarship? You must listen to Ed Walters</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/06/openaccess-who-owns-the-law-who-owns-scholarship-you-must-listen-to-ed-walters/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/06/openaccess-who-owns-the-law-who-owns-scholarship-you-must-listen-to-ed-walters/#comments</comments>
		<pubDate>Sat, 06 Apr 2013 16:21:42 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4941</guid>
		<description><![CDATA[IF YOU HAVE ANY INTEREST IN OPENACCESS spend 15 Minutes on http://vimeo.com/63123518 &#8220;Ed Walters &#8211; Who Owns The Law?&#8221; It&#8217;s worth the time.  In a chillingly precise, researched piece Ed shows how US states have handed over the ownership of their Law to commercial publishing companies. Elsevier and Thomson Reuters. Heard of them? Yes, the same [...]]]></description>
			<content:encoded><![CDATA[<h1><span style="font-size:12pt">IF YOU HAVE ANY INTEREST IN OPENACCESS spend 15 Minutes on <a href="http://vimeo.com/63123518"><span style="color:blue;text-decoration:underline">http://vimeo.com/63123518</span></a> &#8220;Ed Walters &#8211; Who Owns The Law?&#8221;  It&#8217;s worth the time.
</span></h1><p>
 </p><p>In a chillingly precise, researched piece Ed shows how US states have handed over the ownership of their Law to commercial publishing companies.  Elsevier and Thomson Reuters.
</p><p>Heard of them? Yes, the same companies that publish Scopus and Web of Science. 
</p><p>I don&#8217;t want to take away the chilling effect of Ed&#8217;s presentation – so listen. And be outraged.
</p><p>And then realise that the same thing is happening in Science and that naïve Open Access is making it worse. Assuming that other people will look after our rights, and meanwhile handing over our freedom. It&#8217;s happening right now.
</p><p>And unless we wake up and challenge, it will be too late.
</p><p>I&#8217;ll blog in more detail after you&#8217;ve watched Ed&#8217;s video.
</p><p>
 </p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/06/openaccess-who-owns-the-law-who-owns-scholarship-you-must-listen-to-ed-walters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Teaching #ami2 to recognize biological names (binomial)</title>
		<link>http://blogs.ch.cam.ac.uk/pmr/2013/04/04/teaching-ami2-to-recognize-biological-names-binomial/</link>
		<comments>http://blogs.ch.cam.ac.uk/pmr/2013/04/04/teaching-ami2-to-recognize-biological-names-binomial/#comments</comments>
		<pubDate>Thu, 04 Apr 2013 11:17:39 +0000</pubDate>
		<dc:creator>pm286</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blogs.ch.cam.ac.uk/pmr/?p=4935</guid>
		<description><![CDATA[  Erithacus rubecula (Wikimedia Commons) &#8220;the Robin&#8221;  #ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we&#8217;ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search [...]]]></description>
			<content:encoded><![CDATA[<p>
 </p><p><img src="http://blogs.ch.cam.ac.uk/pmr/files/2013/04/040413_1318_Teachingami1.jpg" alt="" />
		</p><h1>Erithacus rubecula (Wikimedia Commons) &#8220;the Robin&#8221;
</h1><p>
 </p><p>#ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we&#8217;ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search it with XPath. XPath is one of the simplest and most powerful search tools for moderate chunks of information. Here she searches a page for italic phrases with at least one space, e.g.: 
</p><p>I heard an <span style="text-decoration:line-through"><em>Erithacus Rubecula</em></span>
		<em>Erithacus rubecula</em> today.  (@rmounce points out the capitalization!)
</p><p>AMI has extracted the HTML (<strong>&lt;i&gt;…&lt;/i&gt;</strong> means italics):
</p><p><span style="font-family:Courier New"><strong>&lt;p&gt;I heard an &lt;i&gt;Erithacus rubecula&lt;/i&gt; today.&lt;/p&gt;</strong></span>
	</p><p>Now she creates an XPath:
</p><p style="margin-left: 18pt"><span style="font-family:Courier New"><strong>&#8220;.//html:i[contains(.,' ')]&#8221;
</strong></span></p><p>This means:
</p><ul><li><span style="font-family:Courier New"><strong>.//</strong></span> anywhere in the document (we can increase the precision later)
</li><li><span style="font-family:Courier New"><strong>html:i</strong></span> a chunk of italics
</li><li><span style="font-family:Courier New"><strong>contains(.,' ')</strong></span> which (.) contains a space (' ')
</li></ul><p>It&#8217;s not flowing prose but it&#8217;s trivial for AMI. And the result (using Jaxen query() in XOM) is:
</p><ul><li><span style="font-family:Times New Roman;font-size:12pt">&amp; Evolution
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    16S, COI
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    16S, COI, COII
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    16S, P
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Achillea macrophylla, Adenostyles alliarae
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Advances in Chrysomelidae Biology 1.
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Ae. triuncialis
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Aegilops geniculata
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Annals of the Entomological Society of
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Annals of the Entomological Society of America
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Annual Review of Ecology and
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Applied Statistics
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    BMC Bioinformatics
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    BMC Evolutionary Biology
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Bioinformatics 2005, 21(24):4423-4424. 69. Sikes DS, Lewis PO: PAUPRat: PAUP implementation of the parsimony ratchet.
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Biological Journal
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Biology and Evolution
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Boston University, Boston,
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    COI (13 PPIc among 16 polymorphic sites) and
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    COII, P
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Cladistics-the International Journal of the Willi Hennig Society
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Current Biology
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Diabrotica virgifera
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Die Käfer Mitteleuropas.
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Doronicum clusii
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Doronicum grandiflorum</span>
		</li></ul><p><span style="font-family:Times New Roman;font-size:12pt">Clearly not all italics are organisms. Many are bibliographic indicators. There are two simple ways to improve the precision:
</span></p><ul><li>Remove false positives. We can probably remove most of the bibliography by context (they occur on title pages and in references)
</li><li>Include only known species. This is probably the best way forward and we have an excellent Open Source tool (Linnaeus) from Casey Bergman and colleagues at Manchester covering the &gt; 10000 commonest species.
</li></ul><p>There are other ways:
</p><ul><li>Morphology and lexical analysis of digraphs (the letter frequency in organisms is very different from English prose – higher vowel frequency for example).
</li><li>Local context (include Hearst patterns … but hey, I have to go…)
</li></ul><p>So we easily get:
</p><ul><li><span style="font-family:Times New Roman;font-size:12pt">    Achillea macrophylla, Adenostyles alliarae
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Ae. triuncialis
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Aegilops geniculata
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Diabrotica virgifera
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Doronicum clusii
</span></li><li><span style="font-family:Times New Roman;font-size:12pt">    Doronicum grandiflorum</span>
		</li></ul><p>So I hope you are now clear about how powerful content-mining is, how it will revolutionise science and how it is a crime against human knowledge to restrict its deployment.
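</p><p>The italic search and the known-species filter described above can be sketched in a few lines of Python. This is only a minimal illustration, not AMI2&#8217;s own code (AMI2 is Java and queries through XOM). The sample paragraph is built in memory, the function names are made up for this sketch, and the tiny KNOWN_SPECIES set is a hypothetical stand-in for a real dictionary such as Linnaeus. Python&#8217;s ElementTree does not support the XPath contains() function, so the space test is done in plain Python after a simple .//html:i search.</p>
<pre>
```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"html": XHTML}

# Hypothetical whitelist standing in for a real species dictionary.
KNOWN_SPECIES = {"Erithacus rubecula", "Aegilops geniculata"}


def build_sample():
    """Build a tiny XHTML paragraph in memory (illustrative data only)."""
    p = ET.Element("{%s}p" % XHTML)
    p.text = "I heard an "
    bird = ET.SubElement(p, "{%s}i" % XHTML)
    bird.text = "Erithacus rubecula"
    bird.tail = " today, as noted in "
    journal = ET.SubElement(p, "{%s}i" % XHTML)
    journal.text = "BMC Bioinformatics"
    journal.tail = " and in "
    single = ET.SubElement(p, "{%s}i" % XHTML)
    single.text = "Nature"
    single.tail = "."
    return p


def italic_phrases(root):
    """Mimic .//html:i[contains(.,' ')]: italic chunks whose text has a space."""
    phrases = []
    for node in root.findall(".//html:i", NS):
        text = "".join(node.itertext())
        if " " in text:
            phrases.append(text)
    return phrases


def known_species_only(phrases, known):
    """Keep only the phrases that appear in the set of known species."""
    return [p.strip() for p in phrases if p.strip() in known]


phrases = italic_phrases(build_sample())
print(phrases)  # ['Erithacus rubecula', 'BMC Bioinformatics']
print(known_species_only(phrases, KNOWN_SPECIES))  # ['Erithacus rubecula']
```
</pre>
<p>On the sample paragraph the search keeps the two italic phrases that contain a space (the journal title is a false positive), and the filter then keeps only the one that is in the species set. A real run would load the species list from a dictionary rather than hard-code it.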
</p>]]></content:encoded>
			<wfw:commentRss>http://blogs.ch.cam.ac.uk/pmr/2013/04/04/teaching-ami2-to-recognize-biological-names-binomial/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
