ETD2009: Faculty and Libraries are failing student expectations

I’ve been at ETD2009 #etd09 for half a day, and given my presentation. I’m going to feed back some impressions and some calls for action. It’s not based on much evidence, just a general feeling.

First, what I said (on The Semantic Scientific Thesis). I had 25 mins, and managed to give very brief accounts of SPECTRa-T, where we (especially Joe Townsend) explored the extraction of chemistry from theses. Very simply:

  • It’s hard and messy but not impossible

  • PDF makes it much much worse

and TheOREm (ORE for e-Theses), where the message is simple:

  • Do it.

  • You don’t have to give up your precious PDF. (There was a workshop by Adobe on how Adobe is wonderful and ideal for eTheses; I didn’t go. And there was a presentation by a graduate student who thanked Adobe profusely.) But just add semantic material in a sensible format AS WELL: Word2007 for the structure, and domain markup languages where possible, else ASCII files.

I also said that unless faculty and libraries get their act together they will lose the opportunity. The current generation of students is getting tired of their inaction and will hold them responsible for the failures of Openness. More later.

I was followed by Julia Blixrud of ARL (and sometime SPARC), who made the same point much better and at greater length. Students want action now. They are getting tired of faculty inaction and library dithering. Get students into a position where they can make an effective contribution.

SPARC is actively encouraging student action through its Sparky awards.

In the evening we had presentations of ETD awards and again evidence of the great creativity that students are putting into their theses; one student thanked her faculty for allowing her to present material in non-page form. But there’s clearly a large block of conservatism which stands in the way of this innovation.

So, faculty and libraries, move quickly. The digital age moves in minutes, not years.


ETD2009: TheOREm; Submitting a semantic thesis in JISCWorld

Last year we were funded by JISC to explore ORE, the new RDF-based approach to managing web objects. We decided to apply it to theses, and JISC liked the idea. So we built an electronic scenario where a university (in JISCWorld, for Pratchett fans) has a Graduate Office which manages the submission of semantic theses. And, because ORE allows for separated components, the thesis could be submitted in bits and published in bits. The work was done by Nick Day and Jim Downing, with help from Joe Townsend and later Peter Sefton and the Toowoombans (in ICE-TheOREm).

The project built a great wiki and I’ll simply link to it.

There is a major effort worldwide to capture student theses as born-digital objects. Much of the effort to date has concentrated on PDF documents as a proxy for the printed version, but PDF cannot achieve the full potential value of born-digital theses. Publication of properly structured theses can allow rapid exploration of material, for example down to individual diagrams or the sub-subsection level. Individual citations can be extracted automatically. Supporting data can be associated with the thesis directly rather than reduced to tabular or graphical form in appendices. Future readers will want not only the document-level descriptive metadata (author, institution, title, keywords, dates) but also access to subcomponents (data, tables, references, diagrams, etc.). For example, a reader might wish to survey all the analytical instrumentation used within the last year, and therefore only need access to the Materials and Methods sections, but want this for every thesis. Similarly a bibliographer might wish to analyse communities of practice from citations: what proportion of references are identical to those in earlier theses from the same institution? Longitudinal studies may be possible – retrieve all images of protein gels over the last 10 years and see whether the quality has improved.
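To make the machine side concrete: if theses were described in RDF, such a survey would be a single query. Here is a minimal sketch, assuming a hypothetical section-typing vocabulary (th:) and an RDF dump of thesis descriptions; neither is prescribed by TheOREm, both are purely illustrative.

```python
# A sketch only: queries hypothetical ORE/RDF descriptions of theses for
# their Materials and Methods sections. The th: vocabulary is invented.
from rdflib import Graph

QUERY = """
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX th:  <http://example.org/thesis-terms/>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>

SELECT ?thesis ?section
WHERE {
  ?thesis  a ore:Aggregation ;
           ore:aggregates ?section .
  ?section th:sectionType "MaterialsAndMethods" ;
           dc:date ?date .
  FILTER (?date >= "2008-01-01"^^<http://www.w3.org/2001/XMLSchema#date>)
}
"""

g = Graph()
g.parse("theses.rdf")   # an illustrative dump of thesis resource maps
for thesis, section in g.query(QUERY):
    print(thesis, section)
```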

Current package-based approaches to the transfer and publication of complex digital objects result in problems of balkanisation due to the multiplicity of available packaging standards that complicate interoperability between systems and the discovery and reuse of complex object components. They usually use a pass-by-value paradigm of data transfer, which creates additional complexities in data duplication, trust / access control and revisioning. The ORE pass-by-reference approach allows a much more flexible, disaggregated approach to complex object description, access and transfer.
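To illustrate pass-by-reference, here is a minimal ORE resource map built with rdflib. All the URIs are invented; the point is that the aggregation merely points at components wherever they live on the web, rather than copying them into a package.

```python
# A sketch: a minimal ORE resource map for a thesis. URIs are invented.
from rdflib import Graph, Namespace, URIRef, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

rem = URIRef("http://example.org/thesis/1/rem")   # the resource map
agg = URIRef("http://example.org/thesis/1")       # the aggregation
parts = [
    URIRef("http://example.org/thesis/1/chapter1.doc"),
    URIRef("http://example.org/thesis/1/nmr-spectrum.jcamp"),
    URIRef("http://example.org/thesis/1/crystal.cml"),
]

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
for p in parts:
    g.add((agg, ORE.aggregates, p))   # references, not copies

print(g.serialize(format="turtle"))
```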

There is a growing realisation in the repository community that the web is not only an essential part of content delivery, but also an underused and undervalued architectural substrate for the repository ecosystem that offers easier interoperability between systems within our domain, and with the web community at large. ORE works within the constraints of the web architecture, therefore it will be possible to combine it with other web standards (e.g. Atom Publishing Protocol, SWORD, XACML) to integrate large scale infrastructures.
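As a rough sketch of what such a combination might look like in practice, here is a SWORD-style deposit of a zipped thesis package over plain HTTP. The endpoint, credentials and packaging URI are placeholders, not a real service; this illustrates the pattern, not TheOREm’s actual implementation.

```python
# A hedged sketch of a SWORD-style deposit: POST a package to a collection.
# The URL, credentials and packaging identifier are placeholders.
import requests

SWORD_COLLECTION = "http://repository.example.org/sword/theses"

with open("thesis-package.zip", "rb") as fh:
    resp = requests.post(
        SWORD_COLLECTION,
        data=fh,
        headers={
            "Content-Type": "application/zip",
            # SWORD 1.3-style packaging header; the URI is illustrative.
            "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        },
        auth=("student", "secret"),
    )
print(resp.status_code)   # expect 201 Created on success
```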

Project Aims

The general aims of the project are as follows:

  • Test the applicability of the ORE standard in a realistic scholarly setting – thesis description, submission and publication.

  • Demonstrate the advantages of the ORE approach in complex object publication, by combining it with existing web-standards-compliant technologies.

  • Provide examples to fully exercise the ORE specification in order to provide validation and future direction.

Project Description

The experimentation in TheOREm will contain two main strands:

  • We will create a small corpus of ideal born-digital theses based on real theses (see this page) and describe these as completely as possible using ORE (see this page).

  • We will define a realistic scholarly scenario in which such theses might be handled, and implement demonstrators for each component system in the scenario in order to show the capabilities and limitations of ORE (see scenario-creation for details of the scenario and notes on implementation).

We will also investigate how ORE and modern web technologies can help with embargo; a sketch of one possible approach follows.
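One possible model (a sketch only; the embargo property is invented, not part of the ORE vocabulary) attaches an embargo date to individual aggregated components and filters them at dissemination time:

```python
# A sketch: per-component embargo filtering over an ORE-style aggregation.
# The embargo field is hypothetical; ORE itself defines no such term.
from datetime import date

components = [
    {"uri": "http://example.org/thesis/1/chapter1.doc", "embargo": None},
    {"uri": "http://example.org/thesis/1/data.cml", "embargo": date(2011, 6, 1)},
]

def visible(component, today):
    """A component is disseminable if it has no embargo, or it has expired."""
    return component["embargo"] is None or component["embargo"] <= today

public_view = [c["uri"] for c in components if visible(c, date(2009, 6, 12))]
print(public_view)   # only the unembargoed chapter appears before 2011
```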

Project Discussion

Click here to view the wiki pages detailing current discussion about the TheOREm project. The ICE team’s trac also has project documentation and progress.

In it you will find the complete scenario for the submission of a thesis. I hope to take the ETD2009 audience through it. I think the pictures are beautiful.

Like the current submission scenario, the updated version consists of 6 parties, which are:

  • The student doing the degree.

  • The student’s supervisor.

  • The student’s viva examiners.

  • A departmental thesis management system.

  • The Graduate Studies Office.

  • An Institutional Repository.

These are shown in the diagram below. The pages linked further down the page contain similar images describing the interactions during that part of the submission process. The position of the parties is kept constant, though not all parties appear in every image.

[Diagram: the six parties in the thesis submission scenario]

The updated submission process itself can be broken down into 6 sections:

  • Authoring of the thesis and embargo.

  • Initial submission.

  • Arranging the examiners and sending them the thesis.

  • The viva examination, fixing any corrections and resubmitting.

  • Approval of the final submission of the thesis and award of the degree.

  • Management of embargoed data.

ENJOY


ETD2009: SPECTRa-T capturing chemistry; use Word, not PDF

In my talk at #ETD2009 I will make reference to SPECTRa-T, a JISC-funded project that looked at the extraction of semantic information from chemistry theses. It was a first attempt, and we were grateful to Steve Ley and his group for giving us access to about 12 theses. These were in Word, without which we could not have done anything very useful.

I’m showing the poster that the group created to present their work. I know it’s a bit small, but it’s mainly to acknowledge them and to show the complex workflows that we built. It’s not necessary to read the small print, just to see that we could run theses through the system.

[Poster: the SPECTRa-T workflows]

Here are some bits magnified (I am still learning the best way to use this blog):

[Figure: OSCAR output] This shows how OSCAR (our journal-eating robot) is able to extract the chemistry from the text of the thesis and put it in the repository.

The next picture shows how we can extract semantic chemistry if the author has embedded chemical-specific objects (OLE, images) in the thesis. This is hard work for us but it can work:

[Figure: semantic chemistry extracted from embedded chemical objects]

Here we show how both graphical chemistry (the molecular picture) and the chemical name can be turned into real semantic chemistry. The graph is a complete network of all the chemical reactions in the thesis, created automatically by machine. Even the author has never seen this graph.
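For flavour, the data structure behind such a network is just a directed graph of reactant-to-product edges. A toy sketch (the molecules here are invented; the real graph was assembled by machine extraction from the thesis text):

```python
# A toy sketch of a thesis-wide reaction network as a directed graph.
# Molecule names are invented, not taken from any real thesis.
import networkx as nx

g = nx.DiGraph()
reactions = [
    ("benzaldehyde", "benzyl alcohol"),    # a reduction
    ("benzyl alcohol", "benzyl bromide"),  # a bromination
    ("benzyl bromide", "product A"),       # a coupling
]
for reactant, product in reactions:
    g.add_edge(reactant, product)

# Everything reachable from the starting material:
print(nx.descendants(g, "benzaldehyde"))
```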

But it’s hard and lossy.

Word2003 allowed us to capture much of the chemistry.

PDF was much worse.

Much worse.

Wouldn’t it be better if the chemist included all that in the thesis when it was submitted? The information all exists; all they have to do is zip it up and deposit it.
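To underline how little work this is, here is the whole “zip it up” step, with illustrative paths and file names:

```python
# A sketch: bundle the supporting data files for deposit alongside a thesis.
# Paths and file names are illustrative.
import zipfile
from pathlib import Path

with zipfile.ZipFile("thesis-data.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in Path("thesis-data").rglob("*"):   # spectra, structures, logs...
        if f.is_file():
            zf.write(f, f.relative_to("thesis-data"))
```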


ETD2009: The Semantic Scientific Thesis

I am giving a talk at ETD2009 (Electronic Theses and Dissertations) in Pittsburgh, PA tomorrow on the Semantic Scientific Thesis. I now like to try to blog the key points of my presentation, since I shall show a number of demos which aren’t easy to capture.

My argument will be that almost all theses are created with the book or journal metaphor and based on flat text and flat images, with little if any linking or semantics. This is an increasingly outdated way of communicating to humans in today’s pervasive web, where we learn to interact with information from the cradle upwards.

It’s even worse for machines. Most scientific theses contain a wealth of data which can be used for data-driven science. When we combine observations and conclusions from different sources, or different times, we can often get radically new insights.

What can we do with an electronic thesis in PDF? Print it. Because PDF has been designed to talk to printers, not humans. It has no semantics (in its usual form). If we are lucky we can search it for concepts in the text, but we cannot search the diagrams or the tables.

We also cannot use it for input into new science. It’s now quite common for a chemist to compute the properties of a molecule using quantum mechanics. But this normally means she has to retype all the data (either from a journal or a thesis) to do the calculation. And yet, if theses were machine friendly, we could do this for thousands of chemistry theses a year.
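To make “machine friendly” concrete: a molecule stored as CML can be converted mechanically into coordinates for a quantum-mechanics calculation. A minimal sketch (the CML fragment is hand-made, and the output is bare XYZ rather than any particular program’s input deck):

```python
# A sketch: read atoms from a small hand-written CML fragment and emit
# XYZ-style coordinates, the usual starting point for QM input.
import xml.etree.ElementTree as ET

CML = """<molecule xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.000"/>
    <atom id="a2" elementType="H" x3="0.757" y3="0.586" z3="0.000"/>
    <atom id="a3" elementType="H" x3="-0.757" y3="0.586" z3="0.000"/>
  </atomArray>
</molecule>"""

NS = {"cml": "http://www.xml-cml.org/schema"}
atoms = ET.fromstring(CML).findall(".//cml:atom", NS)

print(len(atoms))                   # XYZ header: atom count
print("water, from a CML example")  # comment line
for a in atoms:
    print(a.get("elementType"), a.get("x3"), a.get("y3"), a.get("z3"))
```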

So how do we create semantic theses? It’ll take us 10 years or more to work through the process, and it needs the creation of ontologies as we go. But there are simple first steps:

  • preserve the actual document that the student authored. This is normally either Word/OpenOffice or LaTeX

  • preserve the key data files in the work.

These are easy to state and easy to do technically. Let’s not worry at this stage about exactly what the semantics of the data files are; we shouldn’t worry about migration or semantic preservation for civilisations who dig up our digital artefacts. Let’s just get everyone into the habit of saving their data. In many subjects there are de facto standards, and even where there are many, it’s a lot better than nothing.

So, Graduate Offices and Repository Rats, just allow data to be deposited alongside the theses. The tools are coming (I shall show TheOREm, where we have used the ORE RDF technology to manage components of a thesis).

Let each student answer the question:

If someone in my discipline or in my lab wanted to build on my work next year, have I made it easy for her to use my data?

And let every examiner and every board make that a prerequisite of graduating.


Chemistry in the Digital Age at PSU

I’m talking to a group of mainly chemists and IT people at Penn State University (PSU) in Karl Mueller’s workshop on Chemistry in the Digital Age. We’ve already had three wonderful days here: two hacking PDF2Chemistry and one on OREChem. I never know what I am going to say in detail, but I shall be trying to relate to the students who will be there, masters and PhD, because they are the people who will change our chemical future.

I’m calling my talk The Chemical Semantic Web. This is based on TimBL’s ideas of a semantic web where all information is formally defined, labelled and linked. Some people use terms like Giant Global Graph and others Linked Open Data. There’s lots of excitement in the SW community and lots of new tools emerging, such as triplestores.

So I’ll point back to some previous blog posts on this subject. They’ll contain thoughts and commentary. The actual presentation will contain live demos which are very difficult to point to.

Just over a month ago I spoke at BioIT on Linked Semantic Open Data. That’s not quite the same title but it’s near enough, and these posts can help:

BioIT in Boston: What I shall say and How I shall say it (slightly moderated by the occasion so skip this if it doesn’t make sense)

BioIT in Boston: What is Open?

(Very important. If data is not Open we cannot use it in the current century. Distinguish between free-as-in-beer (gratis, not good enough) and free-as-in-speech (libre).)

BioIT 2009 – What is semantic? (Discusses in medium depth what you need)

BioIT 2009 – chemical semantic concepts (introduces the idea of ontologies)

BioIT 2009 – What is data? – 1 (chemistry is built in part on data; this data must be Open, so what is it?)

BioIT 2009 – Where do we get Semantic Open Data?

This should give you some background into the ideas.


OREChem: PDFHamburger to Chemistry revealed (a bit)

More on the PDFHamburger2Chemistry story. It’s making progress, but progress is like walking through glue. Yesterday we had an illuminating and at times depressing insight into the ghastly innards of the PDF hamburger. As you know, hamburgers contain all sorts of hidden horrors: sawdust, fat, water, etc., and we had an insight into the PDF equivalents.

The story so far: Bill Brouwer is a physicist in the PSU chemistry department who has enthusiastically contributed to the OREChem project, working closely with Karl Mueller and Lee Giles’ team (known for CiteSeer and ChemXSeer). One of the goals for OREChem is to extract chemistry from conventional sources (e.g. theses, articles) and convert it to CML and then RDF. Bill and I have worked in parallel for the last 5 months on different parts of this technology.

Bill had to get deeply involved in PDF technology. Has anyone heard of ASCII85? It’s a ghastly way of transmitting binary information to a printer, and it pervades PDF technology. So, for example, rather than transmitting a simple character (O) to a printer, the PDF abattoir will convert the character to a bitmap, then encode the bitmap in ASCII85 and transmit this multicharacter string to the printer. (It helps to remember that the main purpose of PDF is to talk to printers, not humans; which is why it carries so little useful semantic information and so many horrors. ASCII85 is not the only one.) Anyway, Bill was able to hack much of this and actually use the components of the PDF (features) to try to detect which bits were molecules, which were spectra, etc.
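Python’s standard library happens to speak the Adobe variant of ASCII85, so you can see the encoding for yourself:

```python
# ASCII85 round-trip with the stdlib; adobe=True adds the <~ ~> framing
# used by PDF streams.
import base64

raw = b"O"                                    # one printable character
encoded = base64.a85encode(raw, adobe=True)   # framed as b'<~...~>'
print(encoded)
print(base64.a85decode(encoded, adobe=True))  # b'O' again
```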

Then Mark Borkum came out to PSU 3 weeks ago, has taken over from Bill and made spectacular progress. Mark’s a first-year PhD computer scientist now working in Jeremy Frey’s group. Mark’s continued interpretation of PDF has allowed us to design a 3-part system:

  • PDF2SVG (Mark). Mark wants to do this really properly (the current tools do a good, but not lossless, conversion). And it needs semantics adding, such as superscripts. (PDF has no idea of superscripts: simply draw a large character, then change to a smaller font, increase y and x, and draw. That’s a sub- (or super-)script.) Remember that printers are dumb and PDF talks to printers, so it’s not trivial to reconstruct subscripts (you did remember the thousands of kerning character pairs, didn’t you?). The idea at the end of this is that Mark will have split up the document into semantic components; we won’t know what all the semantics are, but we can guess some. (A sketch of this heuristic appears after this list.)

  • SVG2Chemistry (PMR). SVG is a good technology for reconstructing semantic objects from graphics. This has to be heuristic: after all, what does two crossed lines (+) mean? It could be a plus, or it could be tetramethylmethane. SVG and PDF don’t know. However, with good SVG, including subscripts and text runs, this is very promising.

  • Spectral deconvolution and analysis (Bill). Bill has been interpreting the spectra in terms of components. He started doing this in the PDF analysis, but it makes much more sense to do it at the end.
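A crude sketch of the sub/superscript heuristic mentioned above, assuming we already have glyphs with positions and font sizes out of the PDF (the thresholds are invented; the real problem, kerning and all, is far messier):

```python
# A sketch of sub/superscript detection from glyph geometry alone.
# Thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float
    y: float     # baseline position (y increases upwards, as in PDF)
    size: float  # font size in points

def classify(prev: Glyph, cur: Glyph) -> str:
    """Guess whether cur is a super- or subscript relative to prev."""
    smaller = cur.size < 0.8 * prev.size
    if smaller and cur.y > prev.y + 0.2 * prev.size:
        return "superscript"
    if smaller and cur.y < prev.y - 0.2 * prev.size:
        return "subscript"
    return "normal"

# An 'H' at 10 pt followed by a smaller, lowered '2' (as in H2O):
print(classify(Glyph("H", 0.0, 0.0, 10.0), Glyph("2", 6.0, -2.5, 6.0)))
```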

Note the value of modularisation. It allows each person to concentrate on the bits they are expert in. And note how SVG and CML represent formal contracts for handing over information. It makes unit testing and integration much easier.

We’ll be working out today at the meeting how we take this forward.

But even when we are successful:

NO MORE PDF CHEMISTRY HAMBURGERS PLEASE…

BTW: What Mark is doing is generic and others must have struggled with this. There is a range of PDF2Foo tools (pdf2txt, pstoedit, PDFBox, etc.). I congratulate all those who have waded through the same bog. I am sure Mark would welcome any help and experience here.


Turning Spectral PDF Hamburgers into Semantic Cows

I’m at Penn State University in the middle of Pennsylvania for two meetings:

The second, on Thursday, is run by Karl Mueller of PSU on chemistry in the digital age. I am kicking off with a talk on The Chemical Semantic Web. I’ve used the title before but the content keeps changing.

because of, among other things …

our OREChem project, funded by Lee Dirks of Microsoft Research. We are having a review on Wednesday. For an overview see Carl Lagoze’s paper: http://journal.webscience.org/112/2/websci09_submission_10.pdf

There are five groups covering chemistry and informatics, sometimes integrated, sometimes in different departments: Cornell, Indiana, Southampton, Penn State and Cambridge. OREChem is about using ORE to add RDF to chemical information. It’s all about linking and interoperability.

Many multi-site projects don’t really do more than fund individual sites. OREChem is not like them. Because we are digital, and because we face a common problem (there isn’t enough information), there’s a strong sense of purpose. And that has crystallised in a subproject to liberate chemical information.

There are lots of problems in this liberation, most of them political, but this is tackling the hamburger/cow problem (see, for example, a blog discussion last year). Turning a PDF into XML is like turning a hamburger back into a cow. Turning PDF into CML is even worse, and into OREChem worse again.

It would be so simple for chemists to archive their spectra as JCAMP files, but they and their publishers (who are the primary problem) are so far totally uninterested. They spend time and effort turning digital spectral cows into appalling epaper hamburger PDFs. A chemist could send their data to the publisher (it’s less than a megabyte) and this could be mounted as supplemental data. It would take a few minutes at each end. But no, the chemist has to create a paginated document in PDF, which must take ages. Days, at least in some cases.
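For contrast, here is how little a JCAMP-DX file asks of anyone. This sketch writes a minimal (and deliberately tiny) spectrum with invented data points; a real file would carry thousands, and still weigh far less than a megabyte:

```python
# A sketch: write a minimal JCAMP-DX file for an invented IR spectrum.
points = [(400.0, 0.01), (401.0, 0.02), (402.0, 0.05)]  # (wavenumber, absorbance)

lines = [
    "##TITLE=example spectrum (invented data)",
    "##JCAMP-DX=4.24",
    "##DATA TYPE=INFRARED SPECTRUM",
    "##XUNITS=1/CM",
    "##YUNITS=ABSORBANCE",
    f"##NPOINTS={len(points)}",
    "##XYPOINTS=(XY..XY)",
]
lines += [f"{x}, {y}" for x, y in points]
lines.append("##END=")

with open("spectrum.jdx", "w") as fh:
    fh.write("\n".join(lines) + "\n")
```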

Here’s a typical example. As the publisher has COPYRIGHTED THE DATA I can’t reproduce it in this blog. Yes, we are even restricted in disseminating our own science, but that’s a different topic. So here’s a link which you can follow without paying or getting lawyers’ letters. The point is that if you scroll through this you will see zillions of lovely cows transformed into ugly epaper hamburgers.

And trust me, PDF is really, really horrible.

But Southampton, PSU and we are fighting back in this project. If (with a very few exceptions) the chemical community is not forward-looking enough to embrace semantic chemistry enthusiastically, we are going to have to do it anyway. And we have developed a hamburger2cow processor, like one of those sci-fi movies (the curry monster in Red Dwarf). Bill Brouwer and I started this last December and we’ve been doing our own thing in parallel. We also met yesterday with Mark, who is doing a PhD at Southampton, and found that our bits fitted perfectly. We can turn a PDF hamburger into a cow. Not all hamburgers immediately. And some hamburgers are so awful they are past redemption. But certain types of hamburger. Enough to show the chemists what they are missing. And to highlight the lack of value that the publishing process currently adds, and to encourage/shame them into embracing the semantic chemistry age.

Details will be revealed at the meeting on Thursday.


Open Rights Group: more Web Democracy

I’ve woken up early (I am in PA, US; more later) and just discovered a tweet which pointed me to the recent action of the Open Rights Group on digital democracy: democracy about digits, and digits supporting the democratic process. First the ORG, and then the contents of the tweet (about MEPs’ and candidates’ stances):

Politicians and the media don’t always understand new technologies, but comment and legislate anyway. The result can be ill-informed journalism and dangerous laws.

The Open Rights Group is a grassroots technology organisation which exists to protect civil liberties wherever they are threatened by the poor implementation and regulation of digital technology. We call these rights our digital rights.

In 2005, a community of 1,000 digital rights enthusiasts came together to create the Open Rights Group. Since then, ORG has spoken out on copyright term extension, DRM and the introduction of electronic voting in the UK. We have informed the debate on data protection, freedom of information, data retention and the surveillance state.

These are issues that affect all of us. Together, our community, which includes some of the UK’s most renowned technology experts, works hard to raise awareness about them. If these are issues that you care about, become part of our community and support the Open Rights Group today.

Not surprisingly there is an overlap of people with mySociety and the Open Knowledge Foundation. It’s heartening to see these grassroots movements growing so widely and rapidly. As a commenter replied to me: democracy is alive and well; it’s politics that is sick.

The ORG has run a campaign to find out MEPs’ views (Do your MEP candidates care about digital rights?) and has summarised the issues (Voting in the EU today: why it matters for digital rights). The tweet pointed me to Do Your MEP Candidates Agree with ORG?, with summaries for the regions (http://euelection.openrightsgroup.org/constituency/eastern). It shows me that the Lib Dem candidates (you will remember I wrote to Andrew Duff) have 57% support across the headings Reform Copyright, Privacy Online, Surveillance State and Open Internet. (Follow the links to find out what these headings mean; in general, agree means they support ORG.)

The table is slightly misleading, as it conflates the percentage response with the number of agreements in those responses. So the Lib Dem figure is better expressed as 57% recall (i.e. 4/7 responded) and 100% agreement. The Conservative 0% on copyright is 100% recall and total disagreement with statements such as:

Europe should not extend copyright terms as longer terms damage innovation and reward the estates of deceased artists rather than working creators.

Labour had 0% recall on all issues. What really gets to many of the electorate in the UK is that they see evidence of politicians who simply don’t care. If the candidates cannot take the trouble to reply to well-presented, important questions of policy, then what right have they to ask us to support them?

This is real democracy. I know what my candidates stand for. And whether they care.


Arun's evangelism for Open Access in India and globally

I am really delighted to promote the new blog that Arun (Subbiah Arunachalam) has started. Arun is a polymath who has worked in chemistry and information science:

Subbiah Arunachalam (known to friends as Arun) started his career as a research chemist, but found his calling in information science. In the past four decades, he has been a student of chemistry, a laboratory researcher (at the Central Electrochemical Research Institute and the Indian Institute of Science), an editor of scientific journals (at the Publications and Information Directorate of the Council for Scientific and Industrial Research and the Indian Academy of Sciences), the secretary of a scholarly academy of sciences (IASc), a teacher of information science (at the Indian National Scientific Documentation Centre), and a development researcher (at the M.S. Swaminathan Research Foundation and the Indian Institute of Technology Madras). While working with the M.S. Swaminathan Research Foundation, he initiated the South-South Exchange Traveling Workshop to facilitate hands-on cross-cultural learning for knowledge workers from Africa, Asia and Latin America engaged in ICT-enabled development.

I first met Arun at a meeting on publishing in the Third World (I can’t remember the exact title), which opened my eyes to the role of scholarly publications. (I think I contributed little except a stereotyped neo-colonial and chauvinistic approach, but I learnt a lot.)

Since then we have been correspondents at irregular intervals. He frequently asks if we can help to evangelise Openness in India, and would send discussion documents on all sorts of areas in this field. Now he has moved to this blog, which is an ideal way to spread his message.

The first day (June 8th) had over 50 posts. Substantial posts. I’m hoping that these represent a backlog (many have dates up to 5 years back) and that it will settle down to a few a day, because otherwise, Arun, you will overwhelm us!

I cannot myself urge scientists and information specialists in the third world (I hope that is still the correct term) to take any particular action.

But I can meta-urge them to read Arun’s blog and to mail him (there are no comment facilities?) on areas and actions.


Origins of CML – 0

[I am starting to blog on CML. Since some readers of this blog may be only interested in the chemistry and not the wider aspects of science and the web I am going to wean them over to the CML blog. I’ll make a full post on that blog and just a few sentences here with a link to the post. This also means that I’ll start to remove the technical chemistry from this blog]

I shall be writing this blog mainly in the first person, but you should realise that CML is the joint product of Henry Rzepa and myself over many years. Simply, CML would not have happened without Henry. Perhaps in 2009 I am the more active contributor, but it’s a joint creation. So, if ever I write things that appear to be just due to me, please mentally replace this by PMRz, what I call our symbiote.

What are the origins of CML? I think I go back to ca 1980, when I was writing code to extend Sam Motherwell’s great FORTRAN toolkit for the Cambridge database: BIBSER (bibliographic search) and CONNSER, the first and greatest chemical substructure algorithm, …

continued on http://wwmm.ch.cam.ac.uk/blogs/cml/?p=38
