Category Archives: semanticWeb


libraries of the future - Ithaka report

In preparing for LOTF09 I asked the organizers for some guidance and got a helpful reply:

Hi Peter

I would like your talk to include the research/scientific needs as this is an important perspective, in terms of questions - we will be driven by the questions although on the day anyone can post to the blog and the twitter feeds which are going to be monitored live and you will be able to see all of this - you will also have an opportunity to review all the questions posed by the audience and at the pre meeting indicate which ones you feel you would like to answer.

Your blog is very interesting - you may have seen this already but this report?  -it explores the relationship between faculty and the library in the states -
I have just posted a short review on the Libraries of the future blog - By the way the tag for the event is #LOTF09, -

Dicky Maidment-Otlet
Communications Manager

JISC Executive

PMR: I hadn't seen it - I read nothing (the information universe is infinite - I am finite - so I read 0% of all information - thanks to H2G2). It confirms my rough impressions:

The (In)visibility of the library
An important lesson is that the library is in many ways falling off the radar screens of faculty. Although
scholars report general respect for libraries and librarians, the library is increasingly disintermediated
from their actual research process. Many researchers circumvent the library in doing their research,
preferring to access resources directly. Researchers no longer use the library as a gateway to information,
and no longer feel a significant dependence on the library in their research process. Although the library
does play essential roles in this process, activities like paying for the resources used are largely invisible
to faculty. In short, although librarians may still be providing significant value to their constituency, the
value of their brand is decreasing.

This is an area of concern for all those concerned with the information strategy of the modern campus, but
is of particular importance to the library itself; if attention and support fades from the library, its ability to
contribute to the intellectual work of the campus diminishes, and its continuing institutional well-being
may be threatened. Libraries should be aware of this decreasing visibility and take steps to improve the
value of their brand by offering more value-added services to raise their profile on campus. It is essential
to their long-term viability that libraries maintain the active support of faculty on their campuses, a factor
which will be most effectively obtained by playing a prominent, valued, and essential role in the research
process. By understanding the needs and research habits of scholars in different disciplines, libraries can
identify products and services which would be appreciated by and of use to these scholars. Such efforts to
be involved in the research process offer benefits to scholars, by providing them with services to improve
their efficiency and effectiveness, as well as to libraries, recapturing the attention of scholars and
contributing to a general awareness of and respect for the library’s contributions.


…And of Science in Particular

The information age has most significantly impacted the sciences, which are experimenting with a wide
range of new models of scholarship and communication, and demanding an increasing level of campus
support. Serving the information needs of cutting-edge scientists for tools and infrastructure requires a
coherent strategic approach, aligning the expertise of academic administrators, technologists, librarians,
and others on campus. As our findings make clear, however, despite this growing significance of
information to scientists, the role of the library is diminishing in importance fastest amongst this group.
Libraries are providing these high-growth fields value in the acquisition of resources – for example in
licensing costly journal collections – but otherwise have been relatively absent from the workflow of
these high-growth fields, with an associated decline in perceived value. Some efforts have been made by
research libraries to engage more deeply in the broader workflow of scientific research, but at the system
level these efforts have been marginal, while commercial providers are making a major push to interject
themselves throughout the scientific research value stream. Deep consideration of how the library
community can best serve scientists and preserve scholarly values in the face of a rapidly changing and
increasingly commercial ecosystem is needed, both on the local and the system level.

That was 2.5 years ago. It's worse now. I hate to say it, but the scientific library of the future is PRISM.

Unless we stop them.

Which we can, but only if we become revolutionaries.

I haven't heard anything in the last few days from ULibrarians (apart from Dorothea who has blogged the issue, so they can't plead ignorance).... Please say something.

librarians of the future - part II

Continuing my very personal selection of "digital librarians" and "digital libraries". I stress that these are people and organizations that make a real difference to me as a scientist. This could be directly by providing material I use, making major changes in digital scholastic infrastructure, or acting as inspiration in scientific information provision. They also have to be things that I can explain or market to my colleagues. And they are personal to me or the sciences with which I interact.

My list is not specifically aimed to exclude conventional Libraries and Librarians, it's just that with one or two exceptions they have little impact on what I and my colleagues do. I acknowledge the important work done in building resources and that these are valuable. I could present LOCKSS, SHERPA, UKOLN, NSDL, OAI-ORE, ... but I'd be unlikely to get much other than a polite ear from my scientific colleagues.

However, I can wax lyrical about DBPedia, which for me has a real "wow" factor and which I feel has a real chance of coming to their attention. They are more likely to be motivated by the Protein Data Bank (PDB) than by Project Gutenberg. Both were started in the early 1970s, and could rightly occur in my list; but the bioscience digital libraries are so numerous that none gets a mention. I could, perhaps, have chosen Gutenberg over Perseus - both were pioneers in different ways. So if you or your project is omitted it's nothing special. Although I haven't tried to create a deliberately diverse list, there are several places where one person or one project represents a genre. At least I hope they make you think "hmm" or "wow".

  • DBPedia This is a great concept - it takes the infoboxes in Wikipedia and turns them into RDF. (What? Well, RDF is another of TBL's ideas - a data model which creates a global semantic web by linking all information with triples. Triples? Simple statements of the form subject-verb-object.) The infobox information "Sodium StandardAtomicWeight 22.98976928(2) g·mol−1" gets broken down into triples which relate values to ontologies to units. This is all very new and will change rapidly. From WP: "DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows users to ask expressive queries against Wikipedia and to interlink other datasets on the Web with DBpedia data." "As of November 2008, the DBpedia dataset describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,000 companies." The future of the web is Linked Data - DBPedia is a marvellous start.
  • Harvard University Harvard gets the accolade for being the first major academic institution to start to reclaim its lost scholarship. It says - our scholarly output belongs to us, not the publishers. And, effectively, it takes the burden of fighting uncooperative publishers away from individual authors and shoulders the responsibility. They've been followed by Stanford, and recently MIT (MIT adopts a university-wide OA mandate ... the MIT faculty unanimously adopted a university-wide OA mandate. ... thanks to Hal Abelson, MIT professor of computer science and engineering, who chaired the committee to formulate it: "

    [MIT] is committed to disseminating the fruits of its research and scholarship as widely as possible. ... Each Faculty member grants to the [MIT] nonexclusive permission to make available his or her scholarly articles and to exercise the copyright in those articles for the purpose of open dissemination.

    PMR: note yet another initiative from the Computer Scientists. Indeed much of the impetus for digital libraries comes not from Libraries, but from Computer Science departments. They often lead and they care deeply about the ownership of scholarship.

  • Tony Hey Yet another computer scientist - at Southampton, which has an enviable record in the new digital libraries - Tony led the UK eScience program with great energy and charisma. I understand "eScience" as the infrastructure - and sometimes content - for the digital needs of the scientist, so in that sense a digital library. eScience stressed "Grids" - the power of distributed computing resources to provide on-demand compute power from anywhere in the world to anywhere. It was a time of experiments, some of which have lasting legacies and others not - which is as it should be. We were grateful to the eScience program for funding us - and we have repaid that by donating our work to the world through digital libraries (but not Libraries). Tony has now moved to Microsoft Research where he has developed a program specifically to promote Open Scholarship and is working with groups in academia, Libraries, etc. (including us).
  • The JISC. This is the funding body for digital libraries in the UK, and from which we receive funding. It's not because of that, nor because they have sponsored the discussion in Oxford (LOTF09), that I include them but because there is a real sense of adventure in their program. They have funding for high-risk projects. They sponsor events like the "Developer Heaven" where all the geeks congregate and hack. Unfortunately I missed it - but Jim and Nico went from our group. They require collaboration. They go beyond the standard report which ends "this area needs much more research". They are looking 10+ years into the future.
  • Brian McMahon of the International Union of Crystallography. Scientists all have their own pet discipline or disciplines and this is one of mine. Science has evolved a series of bodies which oversee and coordinate activities, and the extent of this varies wildly. Crystallography has been a tight-knit, friendly discipline for nearly a century with many Nobel laureates who have had major input into how the discipline is organized and it's truly International. Other disciplines, but not many, achieve this sense of common purpose through their national and international bodies (it does not happen in chemistry).

    The IUCr has Commissions, which include Data and Publications. They run a series of journals which make a surplus, including some Open Access ones. They have always had a major concern about the quality of scientific experiments and how these are reported and were (I think) the first ISU to move towards effectively mandating that experimental data must be included with publications. By doing this, some of the journals have become semantic documents which in aggregate are effectively a digital library of crystallography.

    Many deserve credit but I have picked out Brian, who has acted as amanuensis to this vision for two decades. He services the communal dictionaries (effectively an ontology), and much of the social and technical infrastructure of the projects. He and his colleagues have supported us at Cambridge through summer students.
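To make the DBPedia entry above a little more concrete, here is a minimal sketch of how the sodium infobox statement might be broken into triples. It uses plain Python rather than a real RDF library, and every identifier in it (the `ex:` prefix, the `_:w1` node, the predicate names, the `match` helper) is invented for illustration; none of them are actual DBPedia terms.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One subject-verb-object statement."""
    subject: str
    predicate: str
    obj: str


# Hypothetical decomposition of
# "Sodium StandardAtomicWeight 22.98976928(2) g·mol−1".
# The "ex:" names and the "_:w1" node are made up for this sketch.
triples = [
    Triple("ex:Sodium", "ex:standardAtomicWeight", "_:w1"),
    Triple("_:w1", "ex:value", "22.98976928"),
    Triple("_:w1", "ex:uncertaintyInLastDigits", "2"),
    Triple("_:w1", "ex:unit", "ex:gramPerMole"),
]


def match(store: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in store
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]


# Follow the link from sodium to its weight node, then list what we know.
weight_node = match(triples, s="ex:Sodium", p="ex:standardAtomicWeight")[0].obj
for t in match(triples, s=weight_node):
    print(t.predicate, t.obj)
```

The point of the intermediate node is that the value, its uncertainty and its unit become separately addressable, so a query can ask for "everything measured in grams per mole" without parsing strings.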

There are about 8-9 not yet listed (including a Librarian), so this will take one or two more posts. If you send suggestions I will consider them. [I have been asked to use LOTF09 as the tag for the Oxford event]

librarians of the future - part I

In collecting my thoughts for "the library of the future" (The JISC + The Bodleian) in Oxford on April 2 I'm thinking of those people or organizations that I look to for a combination of resources, philosophy, advocacy that support my scholarship (my current working definition of the "library of the future"). I've come up with about 15 entries, mostly people, with one/two organizations (and I may add 2 or 3 more as I think of them). They are in roughly alphabetical order and almost all have entries in Wikipedia ("..." indicates quotes). In some cases a person represents a larger organization in which many people deserve high credit - so it's a bit arbitrary. However these really influence the way I work, and I don't think I'm unique in looking to them. Most would own to supporting Openness in one way or another (Open Access, Open Data, Open Source, Open Knowledge).

  • Anurag Acharya. Google Scholar. What is important here is not just that Google has revolutionized our thinking about information, but that they have implicitly and explicitly broken the for-profit information culture. Google Scholar came out of the "20%" sparetime activity allowed at Google and it shows what can be done with modest resources. It highlights the bloatedness of formal fee-charging metric services.
  • Tim Berners-Lee. Remember that TBL developed his ideas while supporting the information need of scientists at CERN. No more comment needed.
  • Steve Coast OpenStreetMap. "OpenStreetMap (OSM) was founded in July 2004 by Steve Coast. In April 2006, a foundation was established with the aim of encouraging the growth, development and distribution of free geospatial data and providing geospatial data for anybody to use and share." This project is remarkable in that it has revolution in its bones - "The initial map data was all built from scratch by volunteers performing systematic ground surveys using a handheld GPS unit and a notebook or a voice recorder, data which was then entered into the OpenStreetMap database from a computer." In the UK most map data (and metadata) is strictly controlled by the Ordnance Survey - a government agency which is required to recover costs and which has an effective stranglehold on the re-use of maps. OSM has challenged this monopoly, spread to other countries, and just announced its 100,000th participant. It's an epitome of the many sites which have sprung up to free information.
  • Greg Crane. The Perseus project "is a digital library project of Tufts University that assembles digital collections of humanities resources. It is hosted by the Department of Classics"... "The editor-in-chief of the project is Gregory Crane, the Tufts Winnick Family Chair in Technology and Entrepreneurship. He has been editor-in-chief since the founding of the Perseus Project [1987].". This project is remarkable for the early vision and the way that it has liberated classical scholarship from the restrictive practices arising from lack of access to critical scholarly objects (editions, etc.). Greg is a frequent collaborator in designing future cyberacademia.
  • Paul Ginsparg "The arXiv is an archive for electronic preprints of scientific papers in the fields of mathematics, physics, computer science, quantitative biology and statistics which can be accessed via the Internet. In many fields of mathematics and physics, almost all scientific papers are placed on the arXiv. As of 3 October 2008, [the arXiv] passed the half-million article milestone, with roughly five thousand new e-prints added every month". Another example of a pioneer who has worked outside the system to create something that is universally regarded as essential and outstanding.

So these are some of the librarians of the future. They build vital, communal, information resources. They invite collaboration, either directly or implicitly. They overthrow conventional wisdom and entrenched systems and interests.

More will follow tomorrow. These probably weren't what you were expecting, but although you may disagree with individual choices and assessments, you will agree that all have a major impact on our scholarship.

The challenge is how formal academia should rise to meet this creative energy. Will it do this through Libraries? That is for Librarians to think about.

The library of the future - Guardian of Scholarship?

I am still working out my message for JISC on April 2nd on "The library of the future". I've had suggestions that I should re-ask this as "What are librarians for?" (Dorothea Salo) and "what can a library do?" (Chris). Thanks, and please keep the comments coming, but I am currently thinking more radically.

When I started in academia I got the impression (I don't know why) that Libraries (capital-L = formal role in organization) had a central role in guiding scholarship. That they were part of the governance of the university (and indeed some universities have Librarian positions which have the rank of senior professor - e.g. Deans of faculties). I have held onto this idea until it has become clear that it no longer holds. Libraries (and/or Librarians) no longer play this central role. That's very serious and seriously bad for academia as it has left a vacuum which few are trying to address and which is a contributor to the current problems.

I currently see very few - if any - Librarians who are major figures in current academia. Maybe there never was a golden age, but without such people the current trajectory of the Library is inexorably downward. I trace this decline to two major missed opportunities where, if we had had real guardians of scholarship, we would not be in the current mess - running scared of publishers and lawyers.

The first occasion was about 1972 (I'd be grateful for exact dates and publishers). I remember the first time I was asked to sign over copyright (either to the International Union of Crystallography or the Chemical Society (now RSC)). It looked fishy, but none of my colleagues spoke out. (Ok, no blogosphere, but there were still ways of communicating rapidly - telephones). The community - led by the Librarians - should (a) have identified the threats (b) mobilised the faculty. Both would have been easy. No publisher would have resisted - they were all primarily learned societies then - no PRISM. If the Universities had said ("this is a bad idea, don't sign") we would never have had Maxwell, never had ownership of scholarship by for-profit organizations. Simple. But no-one saw it (at least enough to have impacted a simple Lecturer).

The second occasion was the early 1990s - let's say 1993, when Mosaic trumpeted the Web. It was obvious to anyone who thought about the future that electronic publication was coming. The publishers were scared - they could see their business disappearing. Their only weapon was the transfer of copyright. The ghastly, stultifying, Impact Factor had not been invented. People actually read papers to find out the worth of someone's research rather than getting machines to count ultravariable citations.

At that stage the Universities should have re-invented scholarly publishing. The Libraries and Librarians should have led the charge. I'm not saying they would have succeeded, but they should have tried. It was a time of optimism on the Web - the dotcom boom was starting. The goodwill was there, the major universities had publishing houses. But they did nothing - and many contracted their University Presses.

There is still potential for revolution. But at every missed opportunity it's harder. All too many Librarians have to spend their time negotiating with publishers, making sure that students don't take too many photocopies, etc. If Institutional Repositories are an instrument of revolution (as they should have been) they haven't succeeded.

So, simply, the librarian of the future must be a revolutionary. They may or may not be Librarians. If Librarians are not revolutionaries they have little future.

In tomorrow's post I shall list about 10 people who I think are currently librarians of the future.

Please send us your Vistas

I recently got an invitation to speak (anonymized as I don't want to fall out) which included:

"I would very much appreciate a copy of your presentation in advance of the event in Windows XP format as the venue is not migrated to Vista. "

This is yet another example of technology driving scholarly communication. Increasingly we are asked "please send us your Powerpoints".

I shall use modulated sound waves for the body of my message and I am tempted not to bring any visual material. But I shall, because I need to communicate what we are actually doing by showing it. Scientists, of course, frequently need to show images of molecules, animals, stars, clouds, etc. and this will and should be the mainstream.

But there is little need to echo the words that I speak by showing them on the screen. When speaking to an international audience it can be very useful to have text on the screen, but this is a UK event and whatever else I speak clearly and loudly and should be comprehensible.

What is inexcusable is how often conference organizers fail to provide any connection to the Internet, and some of these meetings are *about* the Internet and digital age. Even more inexcusable are those places which charge 100USD d^-1 for connections.

(BTW I am still fighting WordPress which loses paragraph breaks regularly...)

Closed Data at Chemical Abstracts leads to Bad Science

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don't know where to begin...

I was alerted by Rich Apodaca, who blogged...

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn't often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as a building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR - rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we'll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed).  Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I've managed to read it from home, so if you don't subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper as I am sure they are copyright ACS so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal Of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I'd do what I could to help clean up chemoinformatics and therefore take a critical role of papers which I feel are non-novel, badly designed, irreproducible, and badly written. This paper ticks all boxes.

[If I am factually wrong on any point of Chemical Abstracts, Amer. Chemical Soc. policies etc. I'd welcome correction and I'll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical formula. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (PLs are ubiquitous - in typesetting, web caches, size of research laboratories, etc. There is nothing unusual in finding one). The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
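For readers who have not met power laws, here is a rough sketch of what "fitted a power law" can mean in practice: sort the counts, plot log(frequency) against log(rank), and read the exponent off the slope. This is not the paper's procedure (which I cannot reproduce), the rank-frequency form is my assumption, the function name is invented, and the data below are synthetic numbers made up to exercise the code. A straight-line fit on log-log axes is also a crude estimator, so treat it as illustration only.

```python
import math


def fit_power_law(frequencies):
    """Fit frequency ~ c * rank**(-alpha) by least squares on log-log axes.

    Returns (alpha, c). All frequencies must be positive.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic "framework counts" that follow an exact power law.
synthetic = [1000.0 * rank ** -1.5 for rank in range(1, 51)]
alpha, c = fit_power_law(synthetic)
print(f"alpha = {alpha:.3f}, c = {c:.1f}")
```

Almost any heavy-tailed set of counts gives a tolerably straight line this way, which is one reason why finding a power law is, by itself, not a result.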

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical "descriptors"
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analysing the findings and presentation

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole - chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. (1) is by far the most serious for Open Data advocates, so I'll come to it last and first simply say that:
(2) There was no description of how connection tables (molecular graphs) were created. These molecules apparently included inorganic compounds and the creation of CTs for these molecules is wildly variable or often non-attempted. This immediately means that millions of data in the sample are meaningless. The authors also describe an "algorithm" for finding frameworks which is woolly and badly reported. Such algorithms are common - and many are Open as in CDK and JUMBO. The results of the study will depend on the algorithm and the textual description is completely inadequate to recode it. Example - is B2H6 a framework? I would have no idea.
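To show why the B2H6 question matters, here is a minimal sketch of one possible framework definition: keep stripping atoms that have fewer than two neighbours until nothing more can be removed, so that only rings and the linkers between them survive. This is emphatically not the paper's algorithm (which I cannot recode from the text), nor the CDK or JUMBO implementation; the atom labels and both diborane connection tables are my own hypothetical encodings.

```python
def framework(bonds):
    """Return the atoms left after repeatedly pruning atoms with < 2 neighbours.

    `bonds` is a list of (atom, atom) pairs, i.e. a bare connection table.
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    changed = True
    while changed:
        changed = False
        for atom in list(neighbours):
            if len(neighbours[atom]) < 2:
                # Detach the terminal atom from whatever it was bonded to.
                for other in neighbours[atom]:
                    neighbours[other].discard(atom)
                del neighbours[atom]
                changed = True
    return set(neighbours)


# Toluene (heavy atoms only): a six-membered ring plus one methyl carbon.
toluene = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
           ("C4", "C5"), ("C5", "C6"), ("C6", "C1"), ("C1", "C7")]

# Diborane drawn with two bridging hydrogens (B-H-B bonds).
b2h6_bridged = [("B1", "H1"), ("B1", "H2"), ("B2", "H3"), ("B2", "H4"),
                ("B1", "Hb1"), ("B2", "Hb1"), ("B1", "Hb2"), ("B2", "Hb2")]

# Diborane drawn naively as a B-B bond with six terminal hydrogens.
b2h6_naive = [("B1", "B2"), ("B1", "H1"), ("B1", "H2"), ("B1", "H3"),
              ("B2", "H4"), ("B2", "H5"), ("B2", "H6")]

print(sorted(framework(toluene)))
print(sorted(framework(b2h6_bridged)))
print(sorted(framework(b2h6_naive)))
```

Under this definition the bridged table yields a four-membered B-H-B-H ring and the naive table yields nothing at all. Same substance, two answers, and the difference lies entirely in how the connection table was created.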

(4) There are no useful results. No supplemental data is published (JOC normally requires supplemental data but this is an exception - I have no idea why not). The data have been destroyed into PDF graphs (yes - this is why PDF corrupts - if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?
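As a hedged illustration of the SVG remark: in an SVG graph the plotted points are ordinary attribute text, so a few lines of standard-library Python can pull them out. The tiny SVG document below is invented for this sketch (the paper supplies no such file), the function name is mine, and the parser assumes the simple "x,y x,y" spelling of the `points` attribute, which is only one of the forms SVG allows.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def polyline_points(svg_text):
    """Return one list of (x, y) floats for each <polyline> in the SVG text."""
    root = ET.fromstring(svg_text)
    series = []
    for line in root.iter(SVG_NS + "polyline"):
        points = []
        for pair in line.get("points", "").split():
            x, y = pair.split(",")
            points.append((float(x), float(y)))
        series.append(points)
    return series


# Made-up example: a single curve drawn as a polyline.
example_svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <polyline points="10,90 20,60 30,45 40,38" fill="none" stroke="black"/>
</svg>"""

for curve in polyline_points(example_svg):
    print(curve)
```

The numbers recovered are drawing coordinates, so turning them back into data values still needs the axis scales, but that is a far smaller problem than reading points off a rasterized or flattened PDF graph.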

But by far the most serious concern is (1). How were the data selected?

The data come - according to the authors - from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:

  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable "sweat of brow" in collating it, but it's "our data".
  • CAS sells a licence to academia (Scifinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail but I am 100% certain that I cannot download and use a significant part of the database for research, and publish the results. Therefore I cannot - under any circumstances - attempt to replicate the work. If I attempted I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper - such as they are - depend completely on selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6, 5, and 6-5 rings being present in enormous numbers). The authors may say - if they reply - that it's "obvious" that "substance" (with < 253 atoms) excluded these - but that is a consequence of bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique - satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent - but loosening - stranglehold on chemical data). If they wish to collect our data and sell it back to us it's a tenable business model, though I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.