libraries of the future – FriendFeed

In looking at the library of the future, we must be aware that the web will develop without any regard to what academia and academic Librarians think is the right way to do things. Most new initiatives fail. I have seen 2-3 groups/companies who wanted to set up social communities for scientists and they have mostly not worked out. That’s not their fault – the recipe for success depends on a critical combination of need, utility, timing and community vision. Wikipedia was not the first community encyclopedia; Google was not the first search engine…

So, Libraries will do well to go with the web. Where the web is going is hard to guess in advance, but not so difficult to track and influence as it happens. Many academics (I don’t know about Librarians) denigrated Wikipedia (and many still do). They have to be wrong. Wrong because the momentum is huge. Wrong because they should be working out how to help it, rather than naysaying. And the same goes for much else. The main trouble is that as an early adopter you have to scratch things that don’t work out on the web. That’s painful, but normally a price worth paying.

This post highlights FriendFeed and Twitter. My colleagues all twitter – I am not sure how many use FriendFeed. I was introduced to FF last October – I can’t remember by whom – I think it was Jean-Claude Bradley. I hadn’t used it since, until today. From WP:
FriendFeed is a feed aggregator that consolidates the updates from social media and social networking websites, social bookmarking websites, blogs and micro-blogging updates, as well as any other type of RSS/Atom feed. Users can use this stream of information to create customized feeds to share (and comment) with friends.[1] The goal of FriendFeed according to their website is to make content on the Web more relevant and useful for you by using your existing social network as a tool for discovering interesting information. Users can be an individual, business or organization. Bloggers writing about FriendFeed have said that this service addresses the shortcomings of social media services which exclusively facilitate tracking of their own members’ social media activities on that particular social media service, whereas FriendFeed provides the facility to track these activities (such as posting on blogs, Twitter and Flickr) across a broad range of different social networks.[2] Some (but not all) bloggers are concerned about readers commenting on their posts inside FriendFeed instead of on their blogs, resulting in fewer page views for the blogger[3].

So after starting to blog the “libraries of the future” someone alerted me to a FF post where my ideas were being discussed. There were probably ca 12 posts/comments – somewhere between 2 and 6 lines – so longer than Twitter, but only exposed to me because I was a Friend of the poster. I shan’t reproduce them because I assume they are confidential to a subgroup of Friends.

The message is clear – if I want to know what is going on, I need to work through FF. I shall need a day or two to adjust and find the best place to take as the “centre”. Maybe I shall end up reading blogs through FF rather than Feedreader – don’t yet know.

And I joined twitter – someone seems already to have taken “petermr” (unless I joined earlier and forgot) so I have a different name. I will blog about this later.

So it’s clear that any library of the future should embrace these approaches and should be leading. Blogs, FF, twitter all have their place. They are increasingly the mainstream of scholarly communication (apart from mindless metrics).

BTW – I was asked to use tag #LOTF09 for these posts. Is the “#” required and if so does it differ between blogs and twitter?

Posted in Uncategorized | Tagged | 2 Comments

librarians of the future – part II

Continuing my very personal selection of “digital librarians” and “digital libraries”. I stress that these are people and organizations that make a real difference to me as a scientist. This could be directly by providing material I use, making major changes in digital scholarly infrastructure, or acting as inspiration in scientific information provision. They also have to be things that I can explain or market to my colleagues. And they are personal to me or the sciences with which I interact.

My list is not specifically aimed to exclude conventional Libraries and Librarians, it’s just that with one or two exceptions they have little impact on what I and my colleagues do. I acknowledge the important work done in building resources and that these are valuable. I could present LOCKSS, SHERPA, UKOLN, NSDL, OAI-ORE, … but I’d be unlikely to get much other than a polite ear from my scientific colleagues.

However I can wax lyrical about DBPedia, which for me has a real “wow” factor and which I feel has a real chance of coming to their attention. They are more likely to be motivated by the Protein Data Bank (PDB) than by Project Gutenberg. Both were started in the early 1970s, and both could rightly occur in my list; but the bioscience digital libraries are so numerous that none gets a mention. I could, perhaps, have chosen Gutenberg over Perseus – both were pioneers in different ways. So if you or your project is omitted, it’s nothing personal. Although I haven’t tried to create a deliberately diverse list, there are several places where one person or one project represents a genre. At least I hope they make you think “hmm” or “wow”.

  • DBPedia This is a great concept – it takes the infoboxes in Wikipedia and turns them into RDF. (RDF? Well, RDF is another of TBL’s ideas – a protocol which creates a global semantic web by linking all information with triples. Triples? A simple statement: subject-verb-object.) The infobox information “Sodium StandardAtomicWeight 22.98976928(2) g·mol−1” gets broken down into triples which relate values to ontologies and units. This is all very new and will change rapidly. From WP: “DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows users to ask expressive queries against Wikipedia and to interlink other datasets on the Web with DBpedia data.” “As of November 2008, the DBpedia dataset describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,000 companies.” The future of the web is Linked Data – DBPedia is a marvellous start.
  • Harvard University Harvard gets the accolade for being the first major academic institution to start to reclaim its lost scholarship. It says: our scholarly output belongs to us, not the publishers. And, effectively, it takes the burden of fighting uncooperative publishers away from individual authors and shoulders the responsibility. They’ve been followed by Stanford, and recently MIT (MIT adopts a university-wide OA mandate … the MIT faculty unanimously adopted a university-wide OA mandate … thanks to Hal Abelson, MIT professor of computer science and engineering, who chaired the committee to formulate it):

    [MIT] is committed to disseminating the fruits of its research and scholarship as widely as possible. … Each Faculty member grants to the [MIT] nonexclusive permission to make available his or her scholarly articles and to exercise the copyright in those articles for the purpose of open dissemination.

    PMR: note yet another initiative from the Computer Scientists. Indeed much of the impetus for digital libraries comes not from Libraries, but from Computer Science departments. They often lead and they care deeply about the ownership of scholarship.

  • Tony Hey Yet another computer scientist – at Southampton, which has an enviable record in the new digital libraries – Tony led the UK eScience program with great energy and charisma. I understand “eScience” as the infrastructure – and sometimes content – for the digital needs of the scientist, so in that sense a digital library. eScience stressed “Grids” – the power of distributed computing resources to provide on-demand compute power from anywhere in the world to anywhere. It was a time of experiments, some of which have lasting legacies and others not – which is as it should be. We were grateful to the eScience program for funding us – and we have repaid that by donating our work to the world through digital libraries (but not Libraries). Tony has now moved to Microsoft Research where he has developed a program specifically to promote Open Scholarship and is working with groups in academia, Libraries, etc. (including us).
  • The JISC. This is the funding body for digital libraries in the UK, and from which we receive funding. It’s not because of that, nor because they have sponsored the discussion in Oxford (LOTF09), that I include them, but because there is a real sense of adventure in their program. They have funding for high-risk projects. They sponsor events like the “Developer Heaven” where all the geeks congregate and hack. Unfortunately I missed it – but Jim and Nico went from our group. They require collaboration. They go beyond the standard report which ends “this area needs much more research”. They are looking 10+ years into the future.
  • Brian McMahon of the International Union of Crystallography. Scientists all have their own pet discipline or disciplines and this is one of mine. Science has evolved a series of bodies which oversee and coordinate activities, and the extent of this varies wildly. Crystallography has been a tight-knit, friendly discipline for nearly a century with many Nobel laureates who have had major input into how the discipline is organized and it’s truly International. Other disciplines, but not many, achieve this sense of common purpose through their national and international bodies (it does not happen in chemistry).

    The IUCr has Commissions, which include Data and Publications. They run a series of journals which make a surplus, including some Open Access ones. They have always had a major concern about the quality of scientific experiments and how these are reported, and were (I think) the first international scientific union to move towards effectively mandating that experimental data must be included with publications. By doing this some of the journals are effectively semantic documents which in aggregate form a digital library of crystallography.

    Many deserve credit but I have picked out Brian, who has acted as amanuensis to this vision for two decades. He services the communal dictionaries (effectively an ontology), and much of the social and technical infrastructure of the projects. He and his colleagues have supported us at Cambridge through summer students.
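The subject-verb-object idea described under DBpedia above can be sketched in a few lines. This is a toy triple store with illustrative URIs of my own – not DBpedia’s actual vocabulary or pipeline:

```python
# A minimal triple store: each fact is a (subject, predicate, object)
# tuple. The prefixed names below are illustrative, not DBpedia's own.
triples = [
    ("dbr:Sodium", "rdf:type", "dbo:ChemicalElement"),
    ("dbr:Sodium", "dbo:atomicWeight", "22.98976928"),
    ("dbr:Sodium", "dbo:symbol", "Na"),
]

def query(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What is the atomic weight of sodium?"
print(query(triples, s="dbr:Sodium", p="dbo:atomicWeight"))
# [('dbr:Sodium', 'dbo:atomicWeight', '22.98976928')]
```

Real triple stores answer such patterns with SPARQL over millions of triples, but the data model is exactly this simple.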

There are about 8-9 not yet listed (including a Librarian), so this will be one or two more posts. If you send suggestions I will consider them. [I have been asked to use LOTF09 as the tag for the Oxford event]

Posted in semanticWeb, Uncategorized | Tagged | 2 Comments

librarians of the future – your feedback

I’ve had two pieces of feedback. Dorothea Salo scripsit:

Provocative statements

[ …] I neither endorse nor decry these [PMR’s views]. I merely want to call attention to the fact that Peter Murray-Rust is one of the people we serve, and he’s not sure what we’re for.
If that doesn’t provoke something in us—and I don’t mean merely dismay—we deserve to go down to a dusty death.

PMR: Thanks Dorothea – who has valuably used the blogosphere to praise and criticize aspects of librariana academica. The purpose of my talk at Oxford is not to criticize libraries but to paint a picture of what – if anything – the words “library” and “librarian” should or will mean in the future. I am giving a personal perspective – as I was asked to. Blogging a talk before it’s given is a great way to work out ideas, to correct inaccuracies and to get major positive input.

PMR: I want libraries to succeed and prosper and be a vital part of the defence of scholarship. I am not a library-naysayer in the way that Francis Crick (at Churchill College) was a chapel-basher (more later).
…and from James

Hi Peter. Thanks for your post. You speak truth to power about libraries and missed opportunities to be driving forces in the scholarship process. I have often thought that libraries should have been more forceful about continuing to build collections when journal literature went digital. Instead, they largely left the field to commercial publishers. We have been dealing with that mistake for the last 15 years as journal prices have skyrocketed.

PMR: Journal prices are a problem, but not the most important one. The primary question is whether academia owns its scholarship.

By way of defending libraries, might I add a bit of context to your cogent analysis? While libraries have, as you rightly pointed out, been somewhat risk-averse, there are plenty of examples of libraries creating and/or participating in projects to increase access to scholarly information — JSTOR, Project Muse, OAIster, LC’s American Memory Project, Open Access, Internet Archive and the Open Content Alliance, just to name a few.

PMR: I need to make it very clear that lots of libraries are doing great work (and Librarian(s) appear on my list). I’ll therefore skip Project Muse and the American Memory Project. But I am writing as a scientist – that is what I was asked to do. I’ll cover Open Access later. Could I and my chemist colleagues do our current work without JSTOR or OAIster? I doubt if any of them have heard of them.

I will absolutely add

Brewster Kahle who (from WP) is a U.S. internet entrepreneur, activist and digital librarian. Kahle graduated from MIT in 1982 with a BS degree in Computer Science & Engineering where he was a member of the Chi Phi Fraternity. The emphasis of his studies was artificial intelligence; he studied under Marvin Minsky and W. Daniel Hillis.
He was an early member of the Thinking Machines team, where he invented the WAIS system. He later started WAIS, Inc. (sold to AOL), the nonprofit Internet Archive, and the related for-profit Alexa Internet (sold to Amazon.com). He continues as Director of the Internet Archive as of 2007. He is a member of the Board of Directors of the Electronic Frontier Foundation and a key supporter of the Open Content Alliance. His stated goal is “Universal Access to all Knowledge”.

In his TED Talk on building a free digital library[2], Kahle describes his vision of a free digital library, which contains books, free music concerts, TV programs, snapshots of the internet, etc.

PMR: I’m not aware of the complete history but this sounds as if it sprang from a librarian but not from a Library? Good that Libraries took this up, but I think we need to do more than wait to pick up new ideas.

I’m also not sure that your estimation about academics unwittingly giving up their copyright is completely accurate. I agree that librarians should have been — and should continue to be — more vocal about copyright (I’m playing my part by working with questioncopyright.org). However, I think academics have to take some responsibility for that as well.

PMR: Academics ultimately take complete responsibility, and the more senior, the more responsibility. Much of the current mess in scholarship is because Principals and Presidents failed to spot the trends and defend their own. But Libraries should act as a touchstone of the problems and be constantly alerting management to the absolute necessity to defend scholarship.

My takeaway from your post is that librarians and academics need to work more closely and communicate better so as to build systems like institutional repositories, share information about the best ways to facilitate research, control information and make sure it is always put under a system that facilitates and expands scholarship. I take up your challenge. I hope other readers of this post will do the same.

PMR: many thanks, I hope they will.

Posted in Uncategorized | Tagged | Leave a comment

librarians of the future – part I

In collecting my thoughts for “the library of the future” (The JISC + The Bodleian) in Oxford on April 2 I’m thinking of those people or organizations that I look to for a combination of resources, philosophy, advocacy that support my scholarship (my current working definition of the “library of the future”). I’ve come up with about 15 entries, mostly people, with one/two organizations (and I may add 2 or 3 more as I think of them). They are in roughly alphabetical order and almost all have entries in Wikipedia (“…” indicates quotes). In some cases a person represents a larger organization in which many people deserve high credit – so it’s a bit arbitrary. However these really influence the way I work, and I don’t think I’m unique in looking to them. Most would own to supporting Openness in one way or another (Open Access, Open Data, Open Source, Open Knowledge).

  • Anurag Acharya. Google Scholar. What is important here is not just that Google has revolutionized our thinking about information, but that they have implicitly and explicitly broken the for-profit information culture. Google Scholar came out of the “20%” spare-time activity allowed at Google and it shows what can be done with modest resources. It highlights the bloatedness of formal fee-charging metric services.
  • Tim Berners-Lee. Remember that TBL developed his ideas while supporting the information need of scientists at CERN. No more comment needed.
  • Steve Coast OpenStreetMap. “OpenStreetMap (OSM) was founded in July 2004 by Steve Coast. In April 2006, a foundation was established with the aim of encouraging the growth, development and distribution of free geospatial data and providing geospatial data for anybody to use and share.” This project is remarkable in that it has revolution in its bones – “The initial map data was all built from scratch by volunteers performing systematic ground surveys using a handheld GPS unit and a notebook or a voice recorder, data which was then entered into the OpenStreetMap database from a computer.” In the UK most map data (and metadata) is strictly controlled by the Ordnance Survey – a government agency which is required to recover costs and which has an effective stranglehold on the re-use of maps. OSM has challenged this monopoly, spread to other countries, and just announced its 100,000th participant. It’s an epitome of the many sites which have sprung up to free information.
  • Greg Crane. The Perseus project “is a digital library project of Tufts University that assembles digital collections of humanities resources. It is hosted by the Department of Classics”… “The editor-in-chief of the project is Gregory Crane, the Tufts Winnick Family Chair in Technology and Entrepreneurship. He has been editor-in-chief since the founding of the Perseus Project [1987].” This project is remarkable for the early vision and the way that it has liberated classical scholarship from the restrictive practices arising from lack of access to critical scholarly objects (editions, etc.). Greg is a frequent collaborator in designing future cyberacademia.
  • Paul Ginsparg “The arXiv is an archive for electronic preprints of scientific papers in the fields of mathematics, physics, computer science, quantitative biology and statistics which can be accessed via the Internet. In many fields of mathematics and physics, almost all scientific papers are placed on the arXiv. As of 3 October 2008, arXiv.org passed the half-million article milestone, with roughly five thousand new e-prints added every month”. Another example of a pioneer who has worked outside the system to create something that is universally regarded as essential and outstanding.

So these are some of the librarians of the future. They build vital, communal, information resources. They invite collaboration, either directly or implicitly. They overthrow conventional wisdom and entrenched systems and interests.

More will follow tomorrow. These probably weren’t what you were expecting, but although you may disagree with individual choices and assessments, you will agree that all have a major impact on our scholarship.

The challenge is how formal academia should rise to meet this creative energy. Will it do this through Libraries? That is for Librarians to think about.

Posted in semanticWeb, Uncategorized | Tagged | Leave a comment

The library of the future – Guardian of Scholarship?

I am still working out my message for JISC on April 2nd on “The library of the future”. I’ve had suggestions that I should re-ask this as “What are librarians for?” (Dorothea Salo) and “what can a library do?” (Chris). Thanks, and please keep the comments coming, but I am currently thinking more radically.

When I started in academia I got the impression (I don’t know why) that Libraries (capital-L = formal role in organization) had a central role in guiding scholarship. That they were part of the governance of the university (and indeed some universities have Librarian positions which have the rank of senior professor – e.g. Deans of faculties). I have held onto this idea until it has become clear that it no longer holds. Libraries (and/or Librarians) no longer play this central role. That’s very serious and seriously bad for academia as it has left a vacuum which few are trying to address and which is a contributor to the current problems.

I currently see very few – if any – Librarians who are major figures in current academia. Maybe there never was a golden age, but without such people the current trajectory of the Library is inexorably downward. I trace this decline to two major missed opportunities where, if we had had real guardians of scholarship, we would not be in the current mess – running scared of publishers and lawyers.

The first occasion was about 1972 (I’d be grateful for exact dates and publishers). I remember the first time I was asked to sign over copyright (either to the International Union of Crystallography or the Chemical Society (now RSC)). It looked fishy, but none of my colleagues spoke out. (OK, no blogosphere, but there were still ways of communicating rapidly – telephones). The community – led by the Librarians – should (a) have identified the threats and (b) mobilised the faculty. Both would have been easy. No publisher would have resisted – they were all primarily learned societies then – no PRISM. If the Universities had said (“this is a bad idea, don’t sign”) we would never have had Maxwell, never had ownership of scholarship by for-profit organizations. Simple. But no-one saw it (at least not enough to have impacted a simple Lecturer).

The second occasion was the early 1990s – let’s say 1993, when Mosaic trumpeted the Web. It was obvious to anyone who thought about the future that electronic publication was coming. The publishers were scared – they could see their business disappearing. Their only weapon was the transfer of copyright. The ghastly, stultifying Impact Factor had not been invented. People actually read papers to find out the worth of someone’s research rather than getting machines to count ultravariable citations.

At that stage the Universities should have re-invented scholarly publishing. The Libraries and Librarians should have led the charge. I’m not saying they would have succeeded, but they should have tried. It was a time of optimism on the Web – the dotcom boom was starting. The goodwill was there, and the major universities had publishing houses. But they did nothing – and many contracted their University Presses.

There is still potential for revolution. But at every missed opportunity it’s harder. All too many Librarians have to spend their time negotiating with publishers, making sure that students don’t take too many photocopies, etc. If Institutional Repositories are an instrument of revolution (as they should have been) they haven’t succeeded.

So, simply, the librarian of the future must be a revolutionary. They may or may not be Librarians. If Librarians are not revolutionaries they have little future.

In tomorrow’s post I shall list about 10 people who I think are currently librarians of the future.

Posted in semanticWeb, Uncategorized, XML | Tagged | 3 Comments

Please send us your Vistas

I recently got an invitation to speak (anonymized as I don’t want to fall out) which included:

“I would very much appreciate a copy of your presentation in advance of the event in Windows XP format as the venue is not migrated to Vista. ”

This is yet another example of technology driving scholarly communication. Increasingly we are asked “please send us your Powerpoints”.

I shall use modulated sound waves for the body of my message and I am tempted not to bring any visual material. But I shall, because I need to communicate what we are actually doing by showing it. Scientists, of course, frequently need to show images of molecules, animals, stars, clouds, etc. and this will and should be the mainstream.

But there is little need to echo the words that I speak by showing them on the screen. When speaking to an international audience it can be very useful to have text on the screen, but this is a UK event and whatever else I speak clearly and loudly and should be comprehensible.

What is inexcusable is how often conference organizers fail to provide any connection to the Internet, and some of these meetings are *about* the Internet and digital age. Even more inexcusable are those places which charge 100USD d^-1 for connections.

(BTW I am still fighting WordPress which loses paragraph breaks regularly…)

Posted in open notebook science, programming for scientists, semanticWeb, Uncategorized | Leave a comment

Journal of Cheminformatics and Blue Obelisk

Christoph Steinbeck has posted to the Blue Obelisk List:

I’m delighted to announce that the first open access journal of our
field, the Journal of Cheminformatics, is now live and has published its
first articles. Journal of Cheminformatics is a new open access journal
from Chemistry Central publishing peer-reviewed research in all aspects
of cheminformatics and molecular modelling. It is run by
Editors-in-Chief David J. Wild (Indiana University) and myself (European
Bioinformatics Institute).

Amongst the launch articles are an Editorial by David J. Wild on Grand
challenges for cheminformatics, a Commentary by Steven M. Bachrach on
Chemistry publication – making the revolution and last but not least an
article by Tony Williams and coauthors on Computer Assisted Structure
Elucidation (CASE), one of my own fields of research.

You can view articles and submit your manuscripts at
http://www.jcheminf.com. Please share this information with your
colleagues working in the field of chemical information who may be
interested in this new journal.

I have accepted an invitation to be on the editorial board. I take this position with some trepidation as I have grave reservations about the current practice of cheminformatics. It suffers from closed data, closed source and closed standards, and thereby generally poor experimental design, poor metrics and almost always irreproducible results and conclusions which are based on subjective opinions.

I hope to be able to add some weight to improving this situation. The key goal must be that work is novel, useful and reproducible. Too often a paper reports work that cannot be transported outside an institution committed to non-release of code, data or protocols, and often acts primarily as an advert for code, data or people.

The Blue Obelisk is starting to change this. It has consistently argued for Open Data, Open Source and Open Standards (ODOSOS). It’s meeting next week at the ACS – I am very sorry I shan’t be there.

Open Data (and I shall write more later) is starting to take off as a concept in chemistry. (It’s much better established in bioscience, astronomy, HEP, etc.) Part of this comes from culture (such as Open Notebook Science and the Chemspider community) and part from better tools for creating or collecting information (e.g. NMRShiftDB, CrystalEye). We also expect that the OREChem project (we meet next week in Redmond) will have news of Open chemistry data – but I won’t prejudge.

Open Source also continues steadily. Jmol, OpenBabel, CDK and JUMBO are increasingly used as tools in applications. Frameworks such as Bioclipse (and JUMBO-converters) can help integrate these. OSRA looks like being a promising addition for retrieving structures from images, and in our own Centre Daniel Lowe has been making very good progress with OPSIN (a name2structure tool which can now be reasonably compared with commercial offerings).

The attraction of Open Source is that it continues to grow and that new helpers come from unexpected places. I will predict that shortly OSRA and OPSIN will start to become integrated in many laboratories and, when they do, even more effort will be added.

Open Standards are also beginning to be seen as important. We were very pleased that Microsoft saw CML and RDF as a future design of chemistry. Open Standards are the most boring, most unrewarding of the three components, but again they continue to grow, however slowly. It is unlikely there will be effective standards bodies in chemistry for some time, so we have to use the power of web tools such as XML and RDF to help create effective convergence in semantics and ontology. I’ll be blogging more of this later.

I wish J.ChemInf all the best. I know that it took Bioinformatics 5 years to achieve a respectable Impact Factor. I shall publish in J. ChemInf and I urge the referees to be strict – it’s tempting to let through early papers to fill the issues but it mustn’t happen.

Posted in Uncategorized | 2 Comments

Closed Data at Chemical Abstracts leads to Bad Science

I had decided to take a mellow tone on re-starting this blog and I was feeling full of the joys of spring when I read a paper I simply have to criticize. The issues go beyond chemistry and non-chemists can understand everything necessary. The work has been reviewed in Wired so achieved high prominence (CAS display this on their splashpage). There are so many unsatisfactory things I don’t know where to begin…

I was alerted by Rich Apodaca  who blogged…

A recent issue of Wired is running a story about a Chemical Abstracts Service (CAS) study on the distribution of scaffold frequencies in the CAS Registry database.

Cheminformatics doesn’t often make it into the popular press (or any other kind of press for that matter), so the Wired article is remarkable for that aspect alone.
From the original work (free PDF here):

It seems plausible to expect that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. If many compounds derived from a framework have already been synthesized, these derivatives can serve as a pool of potential starting materials for further syntheses. The availability of published schemes for making these derivatives, or the existence of these derivatives as commercial chemicals, would then facilitate the construction of more compounds based on the same framework. Of course, not all frameworks are equally likely to become the focus of a high degree of synthetic activity. Some frameworks are intrinsically more interesting than others due to their functional importance (e.g., as building blocks in drug design), and this interest will stimulate the synthesis of derivatives. Once this synthetic activity is initiated, it may be amplified over time by a rich-get-richer process. [PMR – rich-get-richer does not apply to pharma or publishing industries but to an unusual exponent in the power law].

With the appearance of dozens of chemical databases and services on the Web in the last couple of years, the opportunities for analyses like this (and many others) can only increase. Who knows what we’ll find.

Thanks Rich. Now the paper has just appeared in a journal published by ACS (American Chemical Society, of which Chemical Abstracts (CAS) is a division). (There is no criticism of the ACS as publisher in my post, other than that I think the paper is completely flawed).  Because ACS is a Closed publisher the paper is not normally Openly readable, but papers often get the full text exposed early on and then may become closed. I’ve managed to read it from home, so if you don’t subscribe to ACS/JOC I suggest you read it quick.

I dare not reproduce any of the graphs from the paper, as I am sure they are copyright ACS, so you will have to read the paper quickly before it disappears.

Now I have accepted a position on the board of the new (Open) Journal of Chemoinformatics. I dithered, because I feel that chemoinformatics is close to pseudo-science along the lines of others reported by Ben Goldacre (Bad Science). But I thought on balance that I'd do what I could to help clean up chemoinformatics, and therefore take a critical view of papers which I feel are non-novel, badly designed, irreproducible and badly written. This paper ticks all the boxes.

[If I am factually wrong on any point of Chemical Abstracts or American Chemical Society policies etc., I'd welcome correction and I'll respond in a neutral spirit.]

So to summarize the paper:

The authors selected 24 million compounds (substances?) from the CAS database and analysed their chemical structures. They found that the frequency of frameworks (e.g. benzene, penicillin) fitted a power law. (Power laws are ubiquitous – in typesetting, web caches, the size of research laboratories, etc. There is nothing unusual in finding one.) The authors speculate that this distribution is due to functional importance stimulating synthetic activity.
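To show how little a power-law fit proves by itself, here is a minimal sketch of the kind of analysis involved. The framework counts below are synthetic (generated from an exact power law), NOT CAS data; the exponent is then recovered by least squares on the log-log plot, which is all such a fit amounts to.

```python
import math

# Hypothetical framework frequencies following an exact power law
# f(r) = C * r^(-alpha) over ranks r = 1..100 (synthetic, NOT CAS data).
alpha_true = 2.0
ranks = range(1, 101)
counts = [1_000_000 * r ** (-alpha_true) for r in ranks]

# Estimate the exponent from the slope of the log-log plot:
# log f = log C - alpha * log r.
xs = [math.log(r) for r in ranks]
ys = [math.log(c) for c in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
alpha_est = -slope
print(round(alpha_est, 3))  # recovers alpha = 2.0 for this noiseless data
```

The fit says nothing about mechanism: many generative processes (preferential attachment, aggregation, even mixtures of exponentials) produce similar slopes, which is why the "rich-get-richer" interpretation needs independent evidence.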

I shall post later about why most chemoinformatics is flawed and criticize other papers. In general chemoinformatics consists of:

  1. selection of data sets
  2. annotating these data sets with chemical “descriptors”
  3. [optionally] using machine learning algorithms to analyse or predict
  4. analysis of the findings and presentation of results
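The four steps above can be sketched as a toy pipeline. Everything here is illustrative: the "dataset" is three invented formulae, and the "descriptor" is just an element count parsed from a molecular formula string – a deliberately trivial stand-in for real chemical descriptors.

```python
import re
from collections import Counter

def formula_counts(formula: str) -> Counter:
    """Toy 'descriptor': element counts parsed from a molecular formula.

    Handles simple formulas like 'C6H6' or 'B2H6'; no brackets, charges
    or isotopes -- a sketch, not a production parser.
    """
    counts = Counter()
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(num) if num else 1
    return counts

dataset = ["C6H6", "B2H6", "C2H6O"]                 # step 1: (toy) selection
descriptors = [formula_counts(f) for f in dataset]  # step 2: annotation
# step 4: a trivial analysis -- total carbon count across the set
print(sum(d["C"] for d in descriptors))
```

The point of writing even a toy pipeline down is that every choice in it (which compounds, which parser, which analysis) is explicit and re-runnable – exactly what the paper under discussion fails to provide.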

My basic contention is that unless these steps are (a) based on non-negotiable communally accepted procedures (b) reproducible in whole – chemoinformatics is close to pseudoscience.

This paper involved steps 1, 2 and 4. Step (1) is by far the most serious for Open Data advocates, so I'll return to it below.
(2) There was no description of how connection tables (molecular graphs) were created. These molecules apparently included inorganic compounds, and the creation of CTs for these is wildly variable or often not attempted. This immediately means that millions of data points in the sample are meaningless. The authors also describe an “algorithm” for finding frameworks which is woolly and badly reported. Such algorithms are common – and many are Open, as in CDK and JUMBO. The results of the study will depend on the algorithm, and the textual description is completely inadequate to recode it. Example – is B2H6 a framework? I would have no idea.
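To make the ambiguity concrete, here is ONE possible framework algorithm (roughly Murcko-like: repeatedly prune terminal atoms so that only ring systems survive). I have no idea whether it matches the paper's, which is precisely the problem. The graphs below are hand-built toy connection tables; note that whether B2H6 has a framework at all depends entirely on whether the bridging B–H–B bonds appear in the CT.

```python
def framework(adj):
    """Crude framework extraction: repeatedly delete atoms of degree <= 1.

    adj: {atom_id: set(neighbour_ids)}. What survives is the ring system;
    an acyclic molecule yields an empty framework. This is one possible
    algorithm among many -- results will differ under other definitions.
    """
    adj = {a: set(ns) for a, ns in adj.items()}  # copy, don't mutate input
    changed = True
    while changed:
        leaves = [a for a, ns in adj.items() if len(ns) <= 1]
        changed = bool(leaves)
        for a in leaves:
            for n in adj[a]:
                adj[n].discard(a)
            del adj[a]
    return set(adj)

# Benzene ring (6 carbons, H omitted): the ring survives pruning.
benzene = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
# An acyclic 3-atom chain: pruned to nothing, so no framework at all.
chain = {0: {1}, 1: {0, 2}, 2: {1}}
print(len(framework(benzene)), len(framework(chain)))  # prints: 6 0
```

Two chemists encoding B2H6 with different bonding conventions would get different answers from the same code – so without a published algorithm and published connection tables, the paper's framework counts cannot be checked.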

(4) There are no useful results. No supplemental data are published (JOC normally requires supplemental data but this is an exception – I have no idea why). The data have been destroyed into PDF graphs (yes – this is why PDF corrupts; if the graphs had been SVG I could have extracted the data). Moreover the authors give no justification for their conclusion that frequency of occurrence is due to synthetic activity or interesting systems. What about natural products? What about silicates?
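The SVG point is worth demonstrating. Because SVG is XML, a plot's data points remain machine-readable; the sketch below (with made-up data points) writes a minimal SVG scatter and then recovers the exact values by parsing it – something a rasterised PDF graph does not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical data points that, rasterised into a PDF, would be lost.
points = [(1, 120), (2, 45), (3, 17)]

svg = '<svg xmlns="http://www.w3.org/2000/svg">' + "".join(
    f'<circle cx="{x}" cy="{y}" r="2"/>' for x, y in points
) + "</svg>"

# Because SVG is XML, the plotted values can be parsed straight back out.
root = ET.fromstring(svg)
ns = "{http://www.w3.org/2000/svg}"
recovered = [
    (int(c.get("cx")), int(c.get("cy"))) for c in root.iter(ns + "circle")
]
print(recovered == points)  # prints: True
```

(In a real figure the cx/cy values would be scaled screen coordinates, so one would also need the axis transform – but that too is recoverable from the SVG, unlike from a bitmap.)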

But by far the most serious concern is (1). How were the data selected?

The data come – according to the authors – from a snapshot of the CAS registry in 2007. I believe the following to be facts, and offer to stand corrected by CAS:
  • The data in CAS is based almost completely on data published in the public domain. I agree there is considerable “sweat of brow” in collating it, but it’s “our data”.
  • CAS sells a licence to academia (SciFinder) to query their database. This does not allow re-use of the query results. Many institutions cannot afford the price.
  • There are strict conditions of use. I do not know what they are in detail, but I am 100% certain that I cannot download and use a significant part of the database for research and publish the results. Therefore I cannot – under any circumstances – attempt to replicate the work. If I attempted it I would expect to receive legal threats or worse. Certainly the University would be debarred from using CAS.

The results of the paper – such as they are – depend completely on the selection of the data. There are a huge number of biological molecules (DNA, proteins) in CAS and I would have expected these to bias the analysis (with 6-, 5- and 6-5 rings being present in enormous numbers). The authors may say – if they reply – that it's “obvious” that “substance” (with < 253 atoms) excluded these – but that is a consequence of bad writing, poor methodology and the knowledge that whatever they put in the paper cannot be verified or challenged by anyone else on the planet.

There are many data sources which are unique – satellite, climate, astronomical, etc. The curators of those work very hard to provide universal access. Here, by contrast, we have a situation where the only people who can work with a dataset are the people we pay to give us driblets of the data at extremely high prices.

This post is not primarily a criticism of CAS per se (though from time to time I will publish concerns about their apparent – but loosening – stranglehold on chemical data). If they wish to collect our data and sell it back to us, that is a tenable business model, but I shall continue to fight it.

But to use a monopoly to do unrefereeable bad science is not worthy of a learned society.