petermr's blog

A Scientist and the Web


Archive for the ‘semanticWeb’ Category

libraries of the future – Ithaka report

Monday, March 23rd, 2009

In preparing for LOTF09 I asked the organizers for some guidance and got a helpful reply:

Hi Peter

I would like your talk to include the research/scientific needs, as this is an important perspective. In terms of questions, we will be driven by the questions, although on the day anyone can post to the blog and the twitter feeds, which are going to be monitored live, and you will be able to see all of this. You will also have an opportunity to review all the questions posed by the audience and, at the pre-meeting, indicate which ones you feel you would like to answer.

Your blog is very interesting. You may have seen this already, but this report explores the relationship between faculty and the library in the States.
I have just posted a short review on the Libraries of the future blog. By the way, the tag for the event is #LOTF09.

Dicky Maidment-Otlet
Communications Manager

JISC Executive

PMR: I hadn’t seen it – I read nothing (the information universe is infinite – I am finite – so I read 0% of all information – thanks to H2G2). It confirms my rough impressions:

The (In)visibility of the library
An important lesson is that the library is in many ways falling off the radar screens of faculty. Although
scholars report general respect for libraries and librarians, the library is increasingly disintermediated
from their actual research process. Many researchers circumvent the library in doing their research,
preferring to access resources directly. Researchers no longer use the library as a gateway to information,
and no longer feel a significant dependence on the library in their research process. Although the library
does play essential roles in this process, activities like paying for the resources used are largely invisible
to faculty. In short, although librarians may still be providing significant value to their constituency, the
value of their brand is decreasing.

This is an area of concern for all those concerned with the information strategy of the modern campus, but
is of particular importance to the library itself; if attention and support fades from the library, its ability to
contribute to the intellectual work of the campus diminishes, and its continuing institutional well-being
may be threatened. Libraries should be aware of this decreasing visibility and take steps to improve the
value of their brand by offering more value-added services to raise their profile on campus. It is essential
to their long-term viability that libraries maintain the active support of faculty on their campuses, a factor
which will be most effectively obtained by playing a prominent, valued, and essential role in the research
process. By understanding the needs and research habits of scholars in different disciplines, libraries can
identify products and services which would be appreciated by and of use to these scholars. Such efforts to
be involved in the research process offer benefits to scholars, by providing them with services to improve
their efficiency and effectiveness, as well as to libraries, recapturing the attention of scholars and
contributing to a general awareness of and respect for the library’s contributions.


…And of Science in Particular

The information age has most significantly impacted the sciences, which are experimenting with a wide
range of new models of scholarship and communication, and demanding an increasing level of campus
support. Serving the information needs of cutting-edge scientists for tools and infrastructure requires a
coherent strategic approach, aligning the expertise of academic administrators, technologists, librarians,
and others on campus. As our findings make clear, however, despite this growing significance of
information to scientists, the role of the library is diminishing in importance fastest amongst this group.
Libraries are providing these high-growth fields value in the acquisition of resources – for example in
licensing costly journal collections – but otherwise have been relatively absent from the workflow of
these high-growth fields, with an associated decline in perceived value. Some efforts have been made by
research libraries to engage more deeply in the broader workflow of scientific research, but at the system
level these efforts have been marginal, while commercial providers are making a major push to interject
themselves throughout the scientific research value stream. Deep consideration of how the library
community can best serve scientists and preserve scholarly values in the face of a rapidly changing and
increasingly commercial ecosystem is needed, both on the local and the system level.

That was 2.5 years ago. It’s worse now. I hate to say it, but the scientific library of the future is PRISM.

Unless we stop them.

Which we can, but only if we become revolutionaries.

I haven’t heard anything in the last few days from ULibrarians (apart from Dorothea who has blogged the issue, so they can’t plead ignorance)…. Please say something.

librarians of the future – part II

Saturday, March 21st, 2009

Continuing my very personal selection of “digital librarians” and “digital libraries”. I stress that these are people and organizations that make a real difference to me as a scientist. This could be directly by providing material I use, making major changes in digital scholarly infrastructure, or acting as inspiration in scientific information provision. They also have to be things that I can explain or market to my colleagues. And they are personal to me or the sciences with which I interact.

My list is not specifically aimed to exclude conventional Libraries and Librarians, it’s just that with one or two exceptions they have little impact on what I and my colleagues do. I acknowledge the important work done in building resources and that these are valuable. I could present LOCKSS, SHERPA, UKOLN, NSDL, OAI-ORE, … but I’d be unlikely to get much other than a polite ear from my scientific colleagues.

However, I can wax lyrical about DBPedia, which for me has a real “wow” factor and which I feel has a real chance of coming to their attention. They are more likely to be motivated by the Protein Data Bank (PDB) than by Project Gutenberg. Both were started in the early 1970s, and could rightly occur in my list; but the bioscience digital libraries are so numerous that none gets a mention. I could, perhaps, have chosen Gutenberg over Perseus – both were pioneers in different ways. So if you or your project is omitted, it signifies nothing. Although I haven’t tried to create a deliberately diverse list, there are several places where one person or one project represents a genre. At least I hope they make you think “hmm” or “wow”.

  • DBPedia This is a great concept – it takes the infoboxes in Wikipedia and turns them into RDF. (What’s RDF? Another of TBL’s ideas – a protocol which creates a global semantic web by linking all information with triples. Triples? A simple statement: subject-verb-object.) The infobox information “Sodium StandardAtomicWeight 22.98976928(2) g·mol−1″ gets broken down into triples which relate values to ontologies and to units. This is all very new and will change rapidly. From WP: “DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows users to ask expressive queries against Wikipedia and to interlink other datasets on the Web with DBpedia data.” “As of November 2008, the DBpedia dataset describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,000 companies.” The future of the web is Linked Data – DBPedia is a marvellous start.
  • Harvard University Harvard gets the accolade for being the first major academic institution to start to reclaim its lost scholarship. It says – our scholarly output belongs to us, not the publishers. And, effectively, it takes the burden of fighting uncooperative publishers away from individual authors and shoulders the responsibility. They’ve been followed by Stanford, and recently MIT (MIT adopts a university-wide OA mandate … the MIT faculty unanimously adopted a university-wide OA mandate. … thanks to Hal Abelson, MIT professor of computer science and engineering, who chaired the committee to formulate it: ”

    [MIT] is committed to disseminating the fruits of its research and scholarship as widely as possible. … Each Faculty member grants to the [MIT] nonexclusive permission to make available his or her scholarly articles and to exercise the copyright in those articles for the purpose of open dissemination.

    PMR: note yet another initiative from the Computer Scientists. Indeed much of the impetus for digital libraries comes not from Libraries, but from Computer Science departments. They often lead and they care deeply about the ownership of scholarship.

  • Tony Hey Yet another computer scientist – at Southampton, which has an enviable record in the new digital libraries – Tony led the UK eScience program with great energy and charisma. I understand “eScience” as the infrastructure – and sometimes content – for the digital needs of the scientist, so in that sense a digital library. eScience stressed “Grids” – the power of distributed computing resources to provide on-demand compute power from anywhere in the world to anywhere. It was a time of experiments, some of which have lasting legacies and others not – which is as it should be. We were grateful to the eScience program for funding us – and we have repaid that by donating our work to the world through digital libraries (but not Libraries). Tony has now moved to Microsoft Research where he has developed a program specifically to promote Open Scholarship and is working with groups in academia, Libraries, etc. (including us).
  • The JISC. This is the funding body for digital libraries in the UK, and from which we receive funding. It’s not because of that, nor because they have sponsored the discussion in Oxford (LOTF09), that I include them, but because there is a real sense of adventure in their program. They have funding for high-risk projects. They sponsor events like the “Developer Heaven” where all the geeks congregate and hack. Unfortunately I missed it – but Jim and Nico went from our group. They require collaboration. They go beyond the standard report which ends “this area needs much more research”. They are looking 10+ years into the future.
  • Brian McMahon of the International Union of Crystallography. Scientists all have their own pet discipline or disciplines and this is one of mine. Science has evolved a series of bodies which oversee and coordinate activities, and the extent of this varies wildly. Crystallography has been a tight-knit, friendly discipline for nearly a century with many Nobel laureates who have had major input into how the discipline is organized and it’s truly International. Other disciplines, but not many, achieve this sense of common purpose through their national and international bodies (it does not happen in chemistry).

    The IUCr has Commissions, which include Data and Publications. They run a series of journals which make a surplus, including some Open Access ones. They have always had a major concern about the quality of scientific experiments and how these are reported, and were (I think) the first ISU to move towards effectively mandating that experimental data must be included with publications. By doing this some of the journals are effectively semantic documents which in aggregate form a digital library of crystallography.

    Many deserve credit, but I have picked out Brian, who has acted as amanuensis to this vision for two decades. He maintains the communal dictionaries (effectively an ontology), and much of the social and technical infrastructure of the projects. He and his colleagues have supported us at Cambridge through summer students.
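The IUCr’s semantic publication rests on the CIF (Crystallographic Information File) format – a plain-text tag-value syntax. As a sketch of why this counts as “semantic”, here is a toy Python reader for the simplest subset of CIF (real CIF also has loop_ tables and multi-line values, which this ignores; the cell values below are invented for illustration):

```python
# A minimal sketch of reading key-value pairs from a CIF
# (Crystallographic Information File), the IUCr's data format.
# Real CIF also has loop_ constructs and multi-line values;
# this toy parser handles only simple "_tag value" lines.

def parse_simple_cif(text):
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, _, value = line.partition(" ")
            data[tag] = value.strip()
    return data

# Invented example values, in the standard CIF tag vocabulary:
cif = """\
_cell_length_a 10.234
_cell_length_b 7.801
_symmetry_space_group_name_H-M 'P 21/c'
"""

print(parse_simple_cif(cif)["_cell_length_a"])
```

Because every datum is named by a tag from the communal dictionaries, a machine can aggregate thousands of such files into exactly the digital library described above.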

There are about 8–9 not yet listed (including a Librarian), so this will take one or two more posts. If you send suggestions I will consider them. [I have been asked to use LOTF09 as the tag for the Oxford event]
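The triple model mentioned under DBPedia above can be sketched in a few lines. A toy in-memory triple store in Python (the URIs and blank-node naming are illustrative, not DBpedia’s actual vocabulary):

```python
# A toy subject-predicate-object store, illustrating the RDF triple
# idea behind DBPedia. The predicate names and the "_:w1" blank node
# are illustrative assumptions, not DBpedia's real vocabulary.

triples = set()

def add(subject, predicate, obj):
    """Store one subject-predicate-object statement."""
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the non-None parts (a toy SPARQL)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# The infobox fact "Sodium StandardAtomicWeight 22.98976928 g/mol"
# broken into triples relating the value to its unit:
add("dbpedia:Sodium", "dbp:standardAtomicWeight", "_:w1")
add("_:w1", "rdf:value", "22.98976928")
add("_:w1", "dbp:unit", "g/mol")

print(query(subject="dbpedia:Sodium"))
```

Linking happens because a second dataset can assert new triples about “dbpedia:Sodium” and the queries just keep working – that is the whole Linked Data trick.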

librarians of the future – part I

Thursday, March 19th, 2009

In collecting my thoughts for “the library of the future” (The JISC + The Bodleian) in Oxford on April 2 I’m thinking of those people or organizations that I look to for a combination of resources, philosophy, and advocacy that support my scholarship (my current working definition of the “library of the future”). I’ve come up with about 15 entries, mostly people, with one or two organizations (and I may add 2 or 3 more as I think of them). They are in roughly alphabetical order and almost all have entries in Wikipedia (“…” indicates quotes). In some cases a person represents a larger organization in which many people deserve high credit – so it’s a bit arbitrary. However these really influence the way I work, and I don’t think I’m unique in looking to them. Most would own to supporting Openness in one way or another (Open Access, Open Data, Open Source, Open Knowledge).

  • Anurag Acharya. Google Scholar. What is important here is not just that Google has revolutionized our thinking about information, but that they have implicitly and explicitly broken the for-profit information culture. Google Scholar came out of the “20%” sparetime activity allowed at Google and it shows what can be done with modest resources. It highlights the bloatedness of formal fee-charging metric services.
  • Tim Berners-Lee. Remember that TBL developed his ideas while supporting the information need of scientists at CERN. No more comment needed.
  • Steve Coast. OpenStreetMap. “OpenStreetMap (OSM) was founded in July 2004 by Steve Coast. In April 2006, a foundation was established with the aim of encouraging the growth, development and distribution of free geospatial data and providing geospatial data for anybody to use and share.” This project is remarkable in that it has revolution in its bones – “The initial map data was all built from scratch by volunteers performing systematic ground surveys using a handheld GPS unit and a notebook or a voice recorder, data which was then entered into the OpenStreetMap database from a computer.” In the UK most map data (and metadata) is strictly controlled by the Ordnance Survey – a government agency which is required to recover costs and which has an effective stranglehold on the re-use of maps. OSM has challenged this monopoly, spread to other countries, and just announced its 100,000th participant. It’s an epitome of the many sites which have sprung up to free information.
  • Greg Crane. The Perseus project “is a digital library project of Tufts University that assembles digital collections of humanities resources. It is hosted by the Department of Classics”… “The editor-in-chief of the project is Gregory Crane, the Tufts Winnick Family Chair in Technology and Entrepreneurship. He has been editor-in-chief since the founding of the Perseus Project [1987].” This project is remarkable for the early vision and the way that it has liberated classical scholarship from the restrictive practices arising from lack of access to critical scholarly objects (editions, etc.). Greg is a frequent collaborator in designing future cyberacademia.
  • Paul Ginsparg. “The arXiv is an archive for electronic preprints of scientific papers in the fields of mathematics, physics, computer science, quantitative biology and statistics which can be accessed via the Internet. In many fields of mathematics and physics, almost all scientific papers are placed on the arXiv. As of 3 October 2008, it passed the half-million article milestone, with roughly five thousand new e-prints added every month.” Another example of a pioneer who has worked outside the system to create something that is universally regarded as essential and outstanding.

So these are some of the librarians of the future. They build vital, communal, information resources. They invite collaboration, either directly or implicitly. They overthrow conventional wisdom and entrenched systems and interests.

More will follow tomorrow. These probably weren’t what you were expecting, but although you may disagree with individual choices and assessments, you will agree that all have a major impact on our scholarship.

The challenge is how formal academia should rise to meet this creative energy. Will it do this through Libraries? That is for Librarians to think about.

The library of the future – Guardian of Scholarship?

Thursday, March 19th, 2009

I am still working out my message for JISC on April 2nd on “The library of the future”. I’ve had suggestions that I should re-ask this as “What are librarians for?” (Dorothea Salo) and “what can a library do?” (Chris). Thanks, and please keep the comments coming, but I am currently thinking more radically.

When I started in academia I got the impression (I don’t know why) that Libraries (capital-L = formal role in organization) had a central role in guiding scholarship. That they were part of the governance of the university (and indeed some universities have Librarian positions which have the rank of senior professor – e.g. Deans of faculties). I have held onto this idea until it has become clear that it no longer holds. Libraries (and/or Librarians) no longer play this central role. That’s very serious and seriously bad for academia as it has left a vacuum which few are trying to address and which is a contributor to the current problems.

I currently see very few – if any – Librarians who are major figures in current academia. Maybe there never was a golden age, but without such people the current trajectory of the Library is inexorably downward. I trace this decline to two major missed opportunities where, if we had had real guardians of scholarship, we would not be in the current mess – running scared of publishers and lawyers.

The first occasion was about 1972 (I’d be grateful for exact dates and publishers). I remember the first time I was asked to sign over copyright (either to the International Union of Crystallography or the Chemical Society (now RSC)). It looked fishy, but none of my colleagues spoke out. (Ok, no blogosphere, but there were still ways of communicating rapidly – telephones). The community – led by the Librarians – should (a) have identified the threats (b) mobilised the faculty. Both would have been easy. No publisher would have resisted – they were all primarily learned societies then – no PRISM. If the Universities had said (“this is a bad idea, don’t sign”) we would never have had Maxwell, never had ownership of scholarship by for-profit organizations. Simple. But no-one saw it (at least enough to have impacted a simple Lecturer).

The second occasion was the early 1990s – let’s say 1993, when Mosaic trumpeted the Web. It was obvious to anyone who thought about the future that electronic publication was coming. The publishers were scared – they could see their business disappearing. Their only weapon was the transfer of copyright. The ghastly, stultifying Impact Factor had not been invented. People actually read papers to find out the worth of someone’s research rather than getting machines to count ultravariable citations.

At that stage the Universities should have re-invented scholarly publishing. The Libraries and Librarians should have led the charge. I’m not saying they would have succeeded, but they should have tried. It was a time of optimism on the Web – the dotcom boom was starting. The goodwill was there, and the major universities had publishing houses. But they did nothing – and many contracted their University Presses.

There is still potential for revolution. But at every missed opportunity it’s harder. All too many Librarians have to spend their time negotiating with publishers, making sure that students don’t take too many photocopies, etc. If Institutional Repositories are an instrument of revolution (as they should have been) they haven’t succeeded.

So, simply, the librarian of the future must be a revolutionary. They may or may not be Librarians. If Librarians are not revolutionaries they have little future.

In tomorrow’s post I shall list about 10 people who I think are currently librarians of the future.

Please send us your Vistas

Thursday, March 19th, 2009

I recently got an invitation to speak (anonymized as I don’t want to fall out) which included:

“I would very much appreciate a copy of your presentation in advance of the event in Windows XP format as the venue is not migrated to Vista. ”

This is yet another example of technology driving scholarly communication. Increasingly we are asked “please send us your Powerpoints”.

I shall use modulated sound waves for the body of my message and I am tempted not to bring any visual material. But I shall, because I need to communicate what we are actually doing by showing it. Scientists, of course, frequently need to show images of molecules, animals, stars, clouds, etc. and this will and should be the mainstream.

But there is little need to echo the words that I speak by showing them on the screen. When speaking to an international audience it can be very useful to have text on the screen, but this is a UK event and, whatever else, I speak clearly and loudly and should be comprehensible.

What is inexcusable is how often conference organizers fail to provide any connection to the Internet – and some of these meetings are *about* the Internet and the digital age. Even more inexcusable are those venues which charge 100 USD per day for connections.

(BTW I am still fighting WordPress which loses paragraph breaks regularly…)

Closed Data at Chemical Abstracts leads to Bad Science

Tuesday, March 17th, 2009

I had decided to take a mellow tone on re-starting this blog, and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry, and non-chemists can understand everything necessary. The work has been reviewed in Wired, so it has achieved high prominence (CAS display it on their splashpage). There are so many unsatisfactory things I don’t know where to begin…

I was alerted by Rich Apodaca, who blogged…

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR - rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.

Thanks Rich. Now the paper has just appeared in a journal published by the ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed.) Because the ACS is a Closed publisher the paper is not normally Openly readable, but papers often get their full text exposed early on and may then become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quickly.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal of Cheminformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I’d do what I could to help clean up chemoinformatics, and therefore take a critical view of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all the boxes.

[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I'd welcome correction and 'll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical frameworks. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (Power laws are ubiquitous – in typesetting, web caches, the sizes of research laboratories, etc. There is nothing unusual in finding one.) The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
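The “rich-get-richer” process the authors invoke is easy to simulate, which is part of why finding a power law proves so little. A toy preferential-attachment sketch in Python (the probability of inventing a new framework and all other parameters are my own invented illustrations, not taken from the paper):

```python
import random
from collections import Counter

# A toy "rich-get-richer" (preferential attachment) process: each new
# compound reuses an existing framework with probability proportional
# to that framework's current count, or invents a new framework with
# small probability p_new. Parameters are illustrative assumptions.

def simulate(n_compounds, p_new=0.05, seed=42):
    rng = random.Random(seed)
    frameworks = [0]        # one entry per compound, labelled by framework id
    next_id = 1
    for _ in range(n_compounds - 1):
        if rng.random() < p_new:
            frameworks.append(next_id)   # a brand-new framework
            next_id += 1
        else:
            # uniform choice over compounds = count-weighted choice over frameworks
            frameworks.append(rng.choice(frameworks))
    return Counter(frameworks)

counts = simulate(10000)
# A few frameworks dominate; most occur once or twice - a heavy tail.
print(counts.most_common(3))
```

Any such mechanism yields a heavy-tailed distribution, so the observed power law is consistent with the authors’ story but hardly evidence for it.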

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical “descriptors”
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analysing the findings and presenting the results

My basic contention is that unless these steps are (a) based on non-negotiable, communally accepted procedures and (b) reproducible in whole, chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. (1) is by far the most serious for Open Data advocates, so I shall return to it below.
(2) There was no description of how connection tables (molecular graphs) were created. These molecules apparently included inorganic compounds, and the creation of CTs for such molecules is wildly variable and often not attempted. This immediately means that millions of data in the sample are meaningless. The authors also describe an “algorithm” for finding frameworks which is woolly and badly reported. Such algorithms are common – and many are Open, as in CDK and JUMBO. The results of the study will depend on the algorithm, and the textual description is completely inadequate to recode it. Example – is B2H6 a framework? I would have no idea.
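For what it’s worth, the core of any framework-finding step is ring detection on the molecular graph. A hedged sketch (my own toy, not the paper’s algorithm): call a molecule “framework-bearing” if and only if its bond graph contains a cycle – which immediately shows how much hangs on unstated choices, since a reproducible algorithm would have to say exactly this kind of thing:

```python
# Toy framework test: a molecule "has a framework" here iff its bond
# graph contains a ring. This is my own illustrative criterion, not
# the paper's (unrecoverable) algorithm.

def has_ring(adjacency):
    """Detect a cycle in an undirected graph given as {atom: [neighbours]}."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        stack = [(start, None)]
        while stack:
            node, parent = stack.pop()
            if node in seen:
                return True        # reached an already-visited atom: a ring
            seen.add(node)
            stack.extend((nbr, node) for nbr in adjacency[node]
                         if nbr != parent)
    return False

# Benzene's six carbons form a ring; n-butane's chain does not.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(has_ring(benzene), has_ring(butane))
```

Even this toy forces decisions the paper never states – how bonds are perceived in the first place, whether B2H6’s bridging hydrogens count as ring bonds – which is exactly the reproducibility problem.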

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data, but this paper is an exception – I have no idea why). The data have been destroyed into PDF graphs (yes – this is why PDF corrupts – if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?
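To make the SVG point concrete: vector graphics keep the data points as markup, so any XML parser can recover them. A sketch in Python (the plot fragment and its coordinates are invented for illustration; nothing like this is recoverable from the paper’s PDF):

```python
import xml.etree.ElementTree as ET

# Why SVG beats a rasterised PDF graph: the plotted points survive as
# markup. This fragment and its coordinates are invented stand-ins for
# the kind of rank/frequency plot the paper flattened into PDF.

svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <circle cx="1" cy="5400" r="2"/>
  <circle cx="2" cy="1300" r="2"/>
  <circle cx="3" cy="610" r="2"/>
</svg>"""

root = ET.fromstring(svg)
ns = {"svg": "http://www.w3.org/2000/svg"}

# Recover the raw (rank, frequency) pairs directly from the markup:
points = [(float(c.get("cx")), float(c.get("cy")))
          for c in root.findall("svg:circle", ns)]
print(points)
```

A dozen lines and the dataset is back; with a bitmap inside a PDF, it is gone for good.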

But by far the most serious concern is (1). How were the data selected?

The data come – according to the authors – from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
  • CAS sells a licence to academia (SciFinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail, but I am 100% certain that I cannot download and use a significant part of the database for research and publish the results. Therefore I cannot – under any circumstances – attempt to replicate the work. If I attempted it I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper – such as they are – depend completely on the selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6-, 5-, and 6-5 rings being present in enormous numbers). The authors may say – if they reply – that it’s “obvious” that “substances” (with < 253 atoms) excluded these – but that is a consequence of bad writing, poor methodology, and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique – satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent – but loosening – stranglehold on chemical data). If they wish to collect our data and sell it back to us, that is a tenable business model – and I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.

How can we publish semantic chemical documents?

Monday, March 16th, 2009

Tobias Kind has submitted a very thoughtful comment (in reply to Approaches to compound documents – ORE, PDF, DOCX) which deserves printing and commenting.

TK: Hello Peter,
thanks for your thoughts. The more I read the more complex and frustrating it gets. I was just reading your comments about Adobe Acrobat; I would assume that everybody in the chemistry world has an Acrobat Full license. But I recognize that’s not the case. Furthermore there are people who have problems opening a ZIP file, so one can not assume that everybody is operating at the same level of tools.

PMR: I personally do not have an Acrobat full licence. That’s not religious – I just don’t have one. Maybe the University has a site licence – I don’t know. FWIW I send manuscripts as *.doc. I realise this is also proprietary and yes, it has to be paid for. It’s just that I happen to have it painlessly. In contrast it is possible to get Open Source/free tools for ZIP (though only after the infamous patent ran out).

TK: That’s the problem with the long tail. According to the power law it is probably safe to assume that the majority of chemists don’t even care if there is chemical semantics lurking in a document. Not to offend the majority of chemists, but at the end of the day it’s only the number of publications on the CV that counts (well, quality of course).

PMR: agreed

TK: Tim,
RSC with Project Prospect, and Nature with journals that annotate structures and submit them to PubChem, are probably top notch regarding semantics. And as I said before, yes, PDF can include metadata with XMP, but as long as there are no easy (free) tools out there it’s hard to push semantics from the PDF side. Acrobat Reader 8 did not know XMP, and yes, one can attach XML using the full Acrobat. But the mentioned ExifTool is not a commodity tool for most chemists.

PMR: agreed. I am a pragmatist in that I can reasonably persuade chemists to include various bits of information into Word2007, but not into LaTeX or Acrobat. If everyone used semantic Acrobat instead of Word I’d probably be suggesting that. I’m pleased to see that Writer is better than Open Office, but it doesn’t (I think) solve the semantic packaging problem seamlessly.
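TK’s XMP point can be made concrete: an XMP packet is plain XML embedded verbatim in the PDF byte stream, so even without Acrobat a short script can fish it out. A minimal sketch in Python (the sample bytes and the dc:title content are invented for illustration):

```python
import re

# An XMP packet is XML embedded in a PDF's byte stream between
# <?xpacket ...?> markers, so it can be located without Acrobat.
# fake_pdf stands in for a real PDF; the title text is invented.

fake_pdf = (
    b"%PDF-1.4 ...binary stream..."
    b"<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
    b"<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
    b"<dc:title>Scaffold frequencies</dc:title>"
    b"</x:xmpmeta>"
    b"<?xpacket end=\"w\"?>"
    b"...more binary..."
)

# Grab everything between the begin and end packet markers:
match = re.search(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=", fake_pdf, re.S)
xmp_xml = match.group(1).decode("utf-8")
print(xmp_xml[:40])
```

This only demonstrates that the metadata is reachable; TK’s real complaint stands – without commodity tools to *write* such packets, chemists will never put semantics there in the first place.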

TK: But then again the whole semantics train currently depends on the journal itself, or the editorial board, or single people, or innovative groups on the publisher’s side. And there is certainly the tools side, so for Web 2.0 in chemistry only a broad range of software tools can act as an enabler for chemical semantics.

PMR: Completely agreed. This is why the International Union of Crystallography deserves praise – it designs and requires semantic data publication for many of its journals.

TK: I would not go so far as “PDF corrupts and restricts thought”. Chemists cannot make third parties responsible for the current mess of missing annotations and data exchange. Most of the better chemistry and life sciences journals allow supporting info, so what speaks against attaching the source HTML or DOC as a supplement? Yes, it’s redundant, but as long as publishers do not convert supporting data into bitmap PDF it is not a problem.

PMR: Like Tufte I allow myself some hyperbole. I would certainly say that “in a digital age where many new forms of information and publication are possible, a universally used format whose primary purpose is to allow printing of documents onto paper is an active restriction on the imagination”.

PMR: As an example, if you go to ACS Journal of Proteome Research, you can find some of the evil PDFs, and even the evil flat 2D PDF attachments including molecular spectra or information. But a few publications also include supplementary RAW data (as XLS, MDB or ZIP) and even PDB codes. So I assume that if the authors and reviewers insist on publishing metadata in the supplement in a specific format the journal would agree. Well, then there is that unholy ACS supplement data copyright. But there are also ways to submit data on personal websites. For instance you could find the ACS journal supplement data for “T.IMPAFIFEHIIK.R” also on Google: “Powered by Yates Bioinformatics Team; This is ongoing project with preliminary results” – OK, copyrighted by the Yates group itself ;-)

PMR: If you look at ACS J. Org. Chem. you will see that almost every paper has a large supplement. This is almost always in PDF. It has clearly taken a lot of work to create. The information was, originally, semantic, and the publication process has encouraged the community to turn it into PDF. The spectra were JCAMP files (or could have been), the molecules were CDX or MOL, the reactions were RXN, and so on. All have been steamrollered into flat PDFs.

PMR: The exceptions are the CIFs, designed, advocated, and managed by the IUCr. They have shone as an example to the rest of the chemical world.

TK: For example some of our public US taxpayer funded metabolomics data sets are fully available via our SetupX LIMS and study design database:
For those public studies people can download all the raw data, all the annotated and result data, and even the underlying software. Not all research data is open access and publicly available, and yes, we are also guilty of publishing flat PDFs without any semantics. But we allow people to reproduce some of our experiments and to download raw and processed data and all the needed software – and that can only be topped by Open Notebook Science, the purest form of scientific reporting.

PMR: This again is the influence of the bioscience community. It makes me envious.

TK: Tobias Kind

It’s technically trivial – yes, trivial – to publish molecules and spectra. If a journal said “no need to write 200 pages of supplemental info in PDF, just publish the *.cdx, *.mol, *.jdx”. That’s all. But where is the editorial push for this? Will any chemical editors (technical, management, academic) step up and say “this journal will require authors to deposit semantic chemistry in … months/years”? That’s all it takes. There wouldn’t even be much resistance – probably rejoicing.
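To show how little machinery this needs: a JCAMP-DX spectrum is a plain-text file of “labelled data records” – lines of the form ##LABEL=value followed by numeric data tables. A hedged sketch (the sample spectrum header is invented for illustration, and this is a header extractor, not a full JCAMP parser):

```python
# Pull the ##LABEL=value header records out of a JCAMP-DX file.
# Illustrative only: real parsers must also handle the compressed
# numeric data tables (XYDATA etc.), which this skips.
def parse_jcamp_headers(text):
    headers = {}
    for line in text.splitlines():
        if line.startswith("##") and "=" in line:
            label, _, value = line[2:].partition("=")
            headers[label.strip().upper()] = value.strip()
    return headers

sample = """##TITLE=ethanol, 1H NMR (illustrative data)
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##XYDATA=(X++(Y..Y))
##END=
"""

hdrs = parse_jcamp_headers(sample)
print(hdrs["TITLE"], hdrs["XUNITS"])
```

A format this simple carries its semantics on its sleeve – which is exactly what is destroyed when the file is flattened into a PDF page.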

The good news is that we have an Open Source infrastructure that can convert all of these legacy formats into semantic chemistry (Chemical Markup Language, CML) essentially automatically. We’ve done it for crystallography in the chemistry department here and the issues are not technical but things like embargoes.
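To give a flavour of what such a conversion looks like, here is a minimal sketch that reads the atom block of a V2000 MOL file and emits CML-style XML. The element and attribute names (molecule, atomArray, atom, elementType, x2/y2) follow published CML conventions, but this is an illustration of the idea, not the actual CML toolchain, and the methanol MOL block is hand-written for the example:

```python
import xml.etree.ElementTree as ET

# Convert the atom block of a V2000 MOL file into CML-style XML.
# Illustrative sketch: bonds, charges, and 3D coordinates are ignored.
def molblock_to_cml(molblock, mol_id="m1"):
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])           # V2000 counts line, atom count
    molecule = ET.Element("molecule", id=mol_id)
    atom_array = ET.SubElement(molecule, "atomArray")
    for i, line in enumerate(lines[4:4 + n_atoms]):
        x, y = float(line[0:10]), float(line[10:20])   # fixed-width fields
        symbol = line[31:34].strip()                    # element symbol
        ET.SubElement(atom_array, "atom", id="a%d" % (i + 1),
                      elementType=symbol, x2=str(x), y2=str(y))
    return ET.tostring(molecule, encoding="unicode")

# Minimal hand-written MOL block for methanol (illustrative coordinates):
methanol = """methanol
  sketch
comment
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.4000    0.0000    0.0000 O   0  0
  1  2  1  0
M  END
"""

print(molblock_to_cml(methanol))
```

The legacy file is fixed-width and implicit; the CML output makes every atom, element and coordinate an explicit, named piece of XML that generic tools can consume.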

You don’t even need to know about CML.

I’ll be explaining in future posts how it is now conceptually simple to publish chemical data in semantic form. I’d like to work with, not against, publishers. And, with some like IUCr and RSC we do.

the future of the library – Balliol College

Sunday, March 15th, 2009

In preparation for my presentation to The JISC and The Bodleian on 2009-04-02 in Oxford I’m continuing a somewhat flippant random walk through web pages that might tell me what a library is and what it is for. So I have visited the web pages of Balliol College – where 3 generations of Murray-Rust have studied…

For the library I find

Balliol Library is ‘the jewel in the crown’ of Balliol College. From an early stage the Scholars had a ‘library’ of books in common; a manuscript bequeathed before May 1276 is still with us. The collections of manuscripts and books have continued to grow by gift, bequest and purchase. We serve both the Balliol community of undergraduates, graduates and Fellows, and the worldwide scholarly community who come to study our special collections of early printed books and early and modern manuscripts.

The Library aims to provide the books which undergraduates need for their weekly work, to keep multiple copies of standard texts, and to respond quickly to urgent requests for book purchase (books can often be bought the same day). There are ample funds for the purchase of books, and the modern sections cover all the main subjects of undergraduate study. The ‘in-depth’ collections of 19th and 20th century scholarly books are of interest to graduate students. Balliol was among the first Oxford colleges to begin to computerise its library catalogue. The Library has a microfilm/fiche reader-printer and computer terminals on which the Oxford automated union catalogue (including the Bodleian Library’s and Faculty Libraries’ holdings), and external databases can be consulted.

PMR: So, this seems to be consistent with what we have found so far in the Bodleian and the University:

  • collect and make available books and manuscripts
  • provide services for undergraduates

Can we find a deeper purpose for the college or its library in the statutes of Balliol? I’ve managed to find three resources by Googling:

“Early history of Balliol College”. This appears to be an OCRed creation and it is a nice challenge to see whether the incomprehensibility (e.g.

 Ear/)' //lstor.J' of .I]a//io/ Co//c, qz

) comes from the 13th or 21st centuries. It’s not all as bad as that – the running chapter titles are the worst. Typical extract relating to the library:

' Thus farre concerning y library y' now ttands, w'
ye Coll: had before I find little or noe mention, they
reposing their books in it, only soe farr y' seuerall y'
had bin Oxford Scolars left in their wills books to y*
Coll without any mention of a library viz among y
rett was m r Simon de Bredon ye worthieR mathe-
matician of his rime who a ° 137 -, left seall books
of Afronomie & lathematicks therto. \Vill Rede
Bishop of Chiceire, o books, c  i money & one
siluer cup 382 & Roger whelpdale Bishop of Carlile
S' AutIê de Ciuitate dei 422.'

The second consists of a catalogue of early and later manuscripts such as A. STATUTES, FOUNDATION DEEDS AND CHARTERS 13th – 20th centuries. This is essentially metadata – pointers to manuscripts and printed books – highly worthy but of no immediate use to me.

[In passing I am very pleased to see the College is a strong supporter of Freedom of Information - which is highlighted in the sidebar. Indeed, until I discovered the next resource I contemplated a formal FOI request for the Charter.]

Finally I came across a document representing the modern statutes which gives me exactly what I need.

It does not explicitly (on cursory reading) give the purpose of the college or its library, but states – inter alia -

In elections to the Mastership the electors shall choose the person who is, in their judgement, most fit for the government of the College as a place of religion, learning, and education. [PMR: this document is modern (2008) and still apparently stresses religion.]

So I have had fun on a Sunday morning browsing round the online resources for academic libraries. No very clear idea emerges as to what a library is for. Is it more important for a college to have a library or a chapel?

If we are to look to the future of libraries we have to know what they are for. Flippancy apart, I don’t now know what a library is for. I’d be grateful for suggestions, including formal statements of purpose.

the library of the future – Oxford 2009-04-02

Sunday, March 15th, 2009

In this and subsequent posts I shall explore some ideas on the library of the future, being catalyzed by the following invitation from Rachel Bruce of The JISC:

…I’m now writing on behalf of the JISC and the Bodleian Library to invite you to speak at an event that will explore the future library on 2 April 2009.

The event will be in the style of a question time panel, before questions are put to the panel a number of key stakeholders will present their perspective on their requirements. I thought you’d be a [...] speaker and member of the panel able to give the perspective of a researcher. … we’d like you to speak about your information needs, how you undertake your research and what you, as a researcher, need to remain relevant and to produce new and innovative research.

The event will run from 2pm – 6.30pm and should have an audience of 150 -200. It will be held at the University of Oxford.

The purpose of the event is to consider some of the key challenges that will shape the library of the future – in effect, key issues libraries need to respond to if they are to survive. The types of issues we expect to be raised include: skills for the future librarian, from marketing to data curation; the need to foster partnerships between public and private sectors as well as working across the organisation (university); the need for a heightened understanding of the changing user base and the increasingly diverse needs of users; the future information needs of researchers, what they will need to undertake their research, and how to serve the citizen.

We are hoping this event, with the aid of high profile speakers, will serve to make a high profile statement to libraries about how they need to respond to support research and society more generally into the future and in the digital age.

I’m very excited about this and I’m starting to think and do some browsing (I won’t call it “research”). I shall blog from time to time as I go through – I shall be provocative but, I hope, constructive.

The main question has to be “what is a library?”, moderated by “what is it for?” **in the current century**. Unless we can answer those questions, and the second one in a constructive manner – then the rest of the discussion is likely to be ill-directed.

So I have started by trying to ask “what is the Bodleian Library for?” I may try to moderate it by looking at Oxbridge colleges, specifically Balliol and Churchill. Both have archives, but with a wide difference in content and approach.

I’m taking a pragmatic approach.

If, as a citizen of the world with no special privileges, I can’t find a resource on the web within 5 minutes then it doesn’t exist.

I am not a historical researcher who can travel to read medieval documents – I require them to be online and transcribed into accessible twenty-first century documents. And although in practice I would probably enlist the actual help of librarians/archivists at Bodley, Balliol and Churchill I am doing this deliberately blind. I ask forbearance from anyone whose collections I may apparently criticize – I have unreserved admiration for all who curate the past and present and know how difficult this is with limited resources.

I am a scientist so will start with a hypothesis:

“The stated purpose of libraries at Oxford and Cambridge is to glorify God and promote His Kingdom on earth”. This purpose has not been formally modified.

Since Cambridge and Oxford are about 800 years old (Cambridge celebrates its 800th anniversary this year) there may have been minor deviations from this purpose (kings and prime ministers have sometimes tried to steer away from it) but our charters and other founding documents will confirm the hypothesis. (I do not have enough resource to do a proper study, so in the spirit of the collaborative electronic age I will be delighted to see whether the blogosphere can help.)

Let’s start with the history and statutes…

Earth’s proud empires

Saturday, March 14th, 2009
Rich Apodaca commented:

Peter, IMHO being funded by Microsoft is neither inconsistent with advocating Open Data nor with advocating Open Source. Microsoft isn’t evil – it’s just increasingly irrelevant.

The marketplace is currently dealing Microsoft what it deserves. Its customers now have choices like never before and they’re increasingly saying “no” to overdesigned products, planned obsolescence, and the general arrogance and disinterest in customers that monopoly breeds.

One of the things Microsoft’s former customers are turning to is a Web-centric way of working. Google docs is one example, but companies like 37signals, Firewheel Design, and a host of others are showing that the number of situations in which a desktop application is necessary is smaller than many would have predicted. Many of them charge for their services and a good number are profitable. Nothing wrong with that.

As long as Microsoft’s money is there, I’d take it without the slightest reservation. But I’d also try to make sure that what I’m being paid to do had some relevance to people doing their work on the Web.

Very much my own feelings as far as Microsoft Corporation is concerned. At school we used to sing “like earth’s proud empires pass away” and this holds well for the ICT industry. The exciting thing about writing computer systems is that it’s possible to create something where you can see the contribution to change – albeit slight. When working in the W3C XML group we could see how this was going to change the world – and interestingly some people in MS (e.g. Jean Paoli) understood that at a very early stage. Whatever else, I have MS in part (along with Jon Bosak/Sun) to thank for the emergence of XML.

MS has to reinvent itself. I remember MS’s challenge to IBM – how could IBM possibly fall from grace? But in the mid-90s it was an ailing company. Now MS is going through the same process, being challenged by Google. The process will continue 10 years down the line…

I’m working with MS Research. I’m not exactly sure what part MS Research plays with respect to the main company. That’s true of many research divisions of companies. It’s probably clearer in traditional medicinal chemistry (where I worked) in that the drugs then had to come through the system. Now it may be different. The same with Unilever Research (who now pay me). Companies don’t necessarily want their R&D divisions to create new products – or even new ideas for products. Many want them to keep the company connected and agile in an ever-changing world. To link into universities and complementary businesses. To funnel the best graduates into the company.

The people in MSR with whom I work know the company has to change. They’re headed by Tony Hey, who pioneered the eScience (== cyberinfrastructure) program in the UK. And one major change – at least in MSR – is the need to espouse Open approaches. To do that they need a window onto that world – Open Source, Open Data, Open Standards, Open Access. MSR (sic) is involved in sponsoring all of these. They are members of Apache.

Critics may say that this is an inexpensive way of buying goodwill in a world which has shown them that monopolies – and especially arrogant monopolies – will not always prosper. History will tell. But if MSR is going to change it is going to need to do the Open things that will be a major part of the future. If so, they are going about it constructively.

Is our work with MS relevant? I try extremely hard to make it so. We have had to work in a .NET environment and coming from our Open Java Blue Obelisk community that has been very painful. There’s been a lot to learn which has been necessary just so we can code collaboratively.

Has it been worth it? We are 9 months in and it’s too early to say. We (through Joe Townsend) are fairly up to speed with .NET/WPF/XAML etc. The .NET environment is (I think) ultimately a lot better for creating graphics than Swing (which has cost me a lot of blood). I like the integration of WPF with XAML so that many aspects of screen display can be created though external XML. Is that better than the newer GUIs coming from the Web-centric players? No opinion yet.

There’s been a major benefit in working in a fully XML-compliant environment, and with people who are as enthusiastic as us. Chem4Word is completely based on XML both at the chemical end and on the screen. That leads to a much higher semantic coherence than traditional legacy systems, where there is transduction from legacy to internal data structure and back. In C4W there is only one representation of the chemistry…

But I am getting ahead. We’ll be telling you more later about the details and how we want to explore collaboration.

Microsoft may have owned our bodies – companies like Google will end up owning our souls. We need constant vigilance.