Scholarly HTML hackfest


We are gearing up for the weekend scholarly hackfest in Cambridge. Like all hackfests it is organised chaos. But we are assembling a range of top-class creators. They include:

  • Peter Sefton (USQ, ICE, HTML)
  • Martin Fenner (Hannover, WordPress)
  • Brian McMahon (Int. Union of Crystallography – publishing, validation, dictionaries…)
  • Mark MacGillivray (Edinburgh, Open bibliography)
  • Dan Hagon (ace hacker)
  • JISC (Simon Hodson will be here on Friday)
  • PM-R group (Sam Adams, Joe Townsend, Nick England, David Jessop, Lezan Hawizy, Brian Brooks, PM-R, Daniel Lowe; Lensfield, JUMBO, OSCAR, Chemical Tagger, etc…)

So far the following themes are emerging:

  • Data publication. How do we take a semantic data object and publish it? Currently we are looking at chemistry (crystallography and compchem) and general scientific numeric data
  • Bibliography. How can we regain control of bibliography? WE authors need the tools to create what WE want to say – not to have to waste time creating something that sucks (“Harvard style” rather than BibTeX) and whose sole purpose is to save the publishers money.
  • A general flexible authoring platform under our control.

Here’s Martin. Some excerpts…

Unfortunately almost all bibliographies are in the wrong format. What you want is at least a direct link to the cited work using the DOI (if available), and a lot of journals do that. You don’t want to have a link to PubMed using the PubMed ID as the only option (as in PubMed Central), as this requires a few more mouseclicks to get to the fulltext article. And you don’t want to go to an extra page, then use a link to search the PubMed database, and then use a few more mouseclicks to get to the fulltext article (something that could happen to you with a PLoS journal).

A bibliography should really be made available in a downloadable format such as BibTeX. Unfortunately journal publishers – including Open Access publishers – in most cases don’t see that they can provide a lot of value here without too much extra work. One of the few publishers offering this service is BioMed Central – feel free to mention other journals that do the same in the comments.


My idea for the hackfest is a tool that extracts all links (references and weblinks) out of an HTML document (or URL) and creates a bibliography. The generated bibliography should be both in HTML (using the Citation Style Language) and BibTeX formats, and should ideally also support the Citation Typing Ontology (CiTO) and COinS – a standard to embed bibliographic metadata in HTML. I will use PHP as a programming language and will try to build both a generic tool and something that can work as a WordPress plugin. Obviously I will not start from scratch, but will reuse several already existing libraries. Any feedback or help for this project is much appreciated.

If I had a tool with which I could create my own bibliographies (and in the formats I want), I would no longer care so much about journals not offering this service. One big problem would still persist, and that is that most subscription journals wouldn’t allow the redistribution of the bibliographies to their papers. A single citation can’t have a copyright, but a compilation of citations can. I’m sure we will also discuss this topic at the workshop, as Peter Murray-Rust is one of the biggest proponents of Open Bibliographic Data.
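
To make the idea in Martin's excerpt concrete, here is a minimal sketch of the first step: pull every link out of an HTML document and emit a rough BibTeX entry for each. It is not Martin's tool. He plans to use PHP and existing libraries, whereas this uses Python and only its standard library, and it ignores CSL, CiTO and COinS entirely. The names (LinkCollector, extract_links, to_bibtex), the DOI pattern (a common heuristic, not an official grammar) and the sample URLs are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: a widely used heuristic for spotting DOIs inside URLs.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>?#]+')


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def to_bibtex(links):
    """Render each link as a bare @misc entry, adding a doi field if one is found."""
    entries = []
    for n, (href, text) in enumerate(links, start=1):
        fields = [("title", text or href), ("url", href)]
        match = DOI_RE.search(href)
        if match:
            fields.append(("doi", match.group(0)))
        body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields)
        entries.append(f"@misc{{ref{n},\n{body}\n}}")
    return "\n\n".join(entries)


# Hypothetical input: one DOI link and one ordinary weblink.
sample = (
    '<p>See <a href="https://doi.org/10.1000/xyz123">a cited paper</a> '
    'and <a href="http://example.org/notes">some notes</a>.</p>'
)
print(to_bibtex(extract_links(sample)))
```

The entries this produces are only stubs: the link text stands in for the title, and nothing is escaped for BibTeX. A real tool would presumably look up authors, journal and year from the DOI, which is where the existing libraries Martin mentions would come in.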

We are able to support this through an EPSRC “follow-up” grant – Pathways to Impact – whose purpose is to disseminate what we have already achieved. This hackfest builds on OSCAR and several JISC projects (who are also supporting some of the group at Cambridge).




17 Responses to Scholarly HTML hackfest

  1. Chris Rusbridge says:

    Good luck with the Hackfest; sorry I can’t be there, although my coding skills are no longer at a relevant level! My concern re Scholarly HTML relates to the difficulty in saving an article in HTML and then accessing it later (one of PDF’s big advantages is that it does this simply and reliably). Safari does this well with .webarchive but it’s non-standard and no other browsers support this. Most use a really clumsy convention with a .html file and an associated directory; not very easy to manage.
    So my suggestion to help Scholarly HTML would be a plugin (eg for Firefox) or filter that converts Scholarly HTML articles to .epub (eg see mfenner’s plugin for epub from WordPress).

  2. I agree that there are situations where you want to have the HTML and all associated files (figures, CSS, supplementary information) in one single package – e.g. for submission to a repository or journal, or for archiving. This is the reason I became interested in ePub. I’m sure we will have a discussion about packaging formats over the weekend.

  3. Pingback: Hacking towards Scholarly HTML « ptsefton's Anotar discussion blog

  4. Pingback: Hacking towards Scholarly HTML « ptsefton

    • Dan Hagon says:

      Partially inspired by discussion between Chris and Martin above, earlier this week I wrote a blog post that (amongst other things) looks at the issue of packaging, or what is the same thing: trying to fit an inherently dynamic object, i.e. some flavour of ScholarlyHTML, into an antiquated static model of scholarly communication derived from print media. Simply packaging a static collection of files into a tar-zipped archive will inherently involve loss of important (contextual) information that future scholars won’t have access to (see the Wayback machine for an example of the difficulties).
      Similarly, such dynamic objects don’t admit simple descriptions (e.g. “I was looking at that webpage at this time with all the dynamic data it loaded from all those other resources, which have since changed themselves, and the reason I was doing this was …”). Bibliography is not just DOIs; see McKerrow’s book for some of the gory details – I shall bring my copy to the hackfest :).
      These are difficult problems to solve but they are interesting and challenging, which is why it’s worthwhile and fun to tackle them.

  5. Ed Chamberlain says:

    Skills and schedule are in the way of this sadly, but you may want to take a look at this open source ebook publishing framework.
    Appears to be tackling the container issue as well as means to exploit rich content with HTML 5. Appears to be platform specific (iPad+webkit) so may not be of use.

  6. Mr. Gunn says:

    Hi Martin, PMR, et al –
    Just wanted to make sure you folks know about the Mendeley news – they’re offering a $10,001 cash prize for great research apps – so that’s something to keep in your mind on the day. I know your aspirations are bigger and you’re in this to try to create some real change in the scholarly process, but that amount of money would go a long way towards getting a service up and going.
    Post here:

    • pm286 says:

      For me the main thing that Mendeley could do is to clarify whether any of their data is Open according to the Open Knowledge Definition. If Mendeley can provide an Open reference bibliography that will be useful to the community. If you think there is a chance of getting an answer I will use the OKF’s IsItOpenData query tool.

  7. Paul Thompson says:

    It is very easy to convert an .html only document to a .pdf. It involves defining a fake .pdf printer.
    Follow these steps:
    1) Download the Docucom pdf driver. It is available free at various places.
    2) Add the driver to your printer list
    3) When you wish to convert a .html file to a .pdf, simply “print” it to a file using the docucom printer driver. You give it a name, and VOILA! You have converted an .html to a .pdf

    • pm286 says:

      As I mentioned in your other comment, creating PDFs is easy. It’s READING them into machines and preserving semantics that is desperately difficult.

  8. Paul Thompson says:

    I made a .pdf out of this page, but don’t know how to post it. Can I email it to somebody? I did it using the “print to file” approach that I mention in the post (as yet uncleared) above this one.
    And, BTW, you can process bibliographies using either perl or even emacs. I download bibliographies from various places, using emacs to pick them apart, and store them in BibTeX format. I then read them into JabRef and convert them into other formats from there. Of course, you need a separate macro for each publication style, and it is easy to get tripped up.

  9. Ed Chamberlain says:

    One other consideration is that of digital preservation. Libraries and archives have been developing container formats such as METS for storing and preserving complex objects and metadata standards such as PREMIS to record activity.
    This is certainly outside the scope of the initial weekend but development with an eye to output eventually being converted or encapsulated into such a format may be advantageous.

  10. Pingback: Scholarly HTML | Open Bibliography and Open Bibliographic Data

  11. Pingback: Scholarly HTML | cottage labs

  12. Chris Morris says:

    One merit of scholarly HTML will be during collaborative editing of the paper. Writing a paper by emailing .docs is painful. Source code repositories work well for anything that can be diff/merged, including HTML. At present, my first choice of format is Open Office HTML – but invariably some other author converts it to a .doc.

  13. Pingback: You want it, you pay for it | Naturally Selected
