Unilever Centre for Molecular Informatics
 

petermr's blog

A Scientist and the Web

 

Archive for June, 2007

Mathematical Knowledge Management 2007

Sunday, June 24th, 2007
I have been invited to give a lecture at the Mathematical Knowledge Management 2007 meeting next week in Hagenberg, Austria. My talk is entitled Mathematics and scientific markup. I am both excited and apprehensive about this – what is a chemist (whose level of mathematics finishes at Part IA for scientists in Cambridge) doing talking to experts in the field? However, in the spirit of the new Web I’m blogging my thoughts before the meeting. This serves several purposes:
  • helps me get my ideas in order
  • gets feedback from anyone who may have an interest
  • identifies other people who may also be blogging about the meeting
  • acts as a public resource from which I can give my talk if I have problems with my machine.
The conference topic …
Mathematical Knowledge Management is an innovative field in the intersection of mathematics and computer science. Its development is driven on the one hand by the new technological possibilities which computer science, the internet, and intelligent knowledge processing offer, and on the other hand by the need for new techniques for managing the rapidly growing volume of mathematical knowledge. The conference is concerned with all aspects of mathematical knowledge management. A (non-exclusive) list of important areas of current interest includes:
  • Representation of mathematical knowledge
  • Repositories of formalized mathematics
  • Diagrammatic representations
  • Mathematical search and retrieval
  • Deduction systems
  • Math assistants, tutoring and assessment systems
  • Mathematical OCR
  • Inference of semantics for semi-formalized mathematics
  • Digital libraries
  • Authoring languages and tools
  • MathML, OpenMath, and other mathematical content standards
  • Web presentation of mathematics
  • Data mining, discovery, theory exploration
  • Computer Algebra Systems
  • Collaboration tools for mathematics

Invited Speakers:

Neil J. A. Sloane, AT&T Shannon Labs, Florham Park, NJ, USA: The On-Line Encyclopedia of Integer Sequences
Peter Murray Rust, University of Cambridge, Dep. of Chemistry, UK: Mathematics and scientific markup
What has molecular informatics to do with this? More than it appears. Chemistry overlaps considerably with mathematics and here formal systems are important. It should be possible to explore the formal representation of thermodynamics or material properties in semantic form (though I may find that my use of “semantics” is imprecise or even “wrong”). Repositories are an obviously exciting area – can we find mathematical objects either by form or by metadata? OCR is important for all content-rich disciplines – see below. Inference and semantics are becoming increasingly important in the emerging web. And so I tick about half the topics above – not in mathematical detail, of course, but in the general approach to the problems.
As an example, what objects contain enough structure and canonicalized content that they act as their own discovery metadata? Most objects need a human or a lookup table to add the metadata for their web discovery. For example you need to know the names of humans – you cannot work these out by looking at them. But in chemistry we can describe a molecule by its InChI – a canonical representation of the connection table (which is not easily human-interpretable). This is both its content and its discovery metadata. You can search Google for molecules through their InChIs, which will find instances of those molecules on the web.
I wondered what other objects could be identified just by their textual content. Perhaps a poem (although it won’t tell you who wrote it). I started typing lists of numbers into Google and suddenly found I was getting hits on Neil Sloane’s Encyclopedia. In this a sequence can be identified by its content – search Google for “1,3,6,10,15” and you get A000217 in The On-Line Encyclopedia of Integer Sequences. I had a chat with a well-known computer scientist and ex-mathematician at WWW2007 and he bet that I couldn’t tell him the next term in a sequence within 5 minutes. I bet him the drinks that I could. So as we had wireless in the bar I searched Google and immediately found the answer – he was astounded – and bought the drinks.
So many of the problems are generic between domains. Searching for MKM2007 I found this paper on how to extract mathematics from PDF (Retro-enhancement of Recent Mathematical Literature). It’s better than recreating cows from hamburgers as they have some access to source – but there are similarities to what we are trying to recover from PDF.
I shall use the tag mkm2007 for this and subsequent posts (in which I’ll explore things like the difference between top-down and bottom-up management systems). No one has yet used it but maybe someone will find it – let’s see.
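The idea above, that a sequence can act as its own discovery metadata, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `INDEX` dictionary is a hypothetical stand-in for a real service such as the Encyclopedia (it holds a single entry, computed locally from the formula for the triangular numbers), and `find_run` and `next_term` are made-up helper names, not part of any existing API.

```python
# Hypothetical local index: identifier -> leading terms of the sequence.
# The single entry holds the triangular numbers n*(n+1)/2 for n = 0..11.
INDEX = {
    "A000217": tuple(n * (n + 1) // 2 for n in range(12)),
}


def find_run(known, terms):
    """Return the position where `terms` occurs as a contiguous run in `known`, or -1."""
    n = len(terms)
    for i in range(len(known) - n + 1):
        if known[i:i + n] == terms:
            return i
    return -1


def next_term(terms):
    """Identify a sequence purely from its content and return (identifier, next term).

    Returns None when no indexed sequence contains the run, or when the run
    sits at the very end of the stored terms.
    """
    terms = tuple(terms)
    for name, known in INDEX.items():
        i = find_run(known, terms)
        if i >= 0 and i + len(terms) < len(known):
            return name, known[i + len(terms)]
    return None


print(next_term([1, 3, 6, 10, 15]))
```

The query needs no title, author or keyword: the terms themselves are the search key, which is the same property that makes an InChI usable as a search string for a molecule.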

The NIHghts who say ‘no’ – to chemoinformatics

Saturday, June 23rd, 2007
A recent post from The Sceptical Chymist: The NIHghts who say ‘no’ [1]

The NIHghts who say ‘no’

Apologies to our international readers for the U.S.-centric post, but the National Institutes of Health announced earlier today that PAR-07-353, a grant involving Cheminformatics Research Centers, has been canceled for “programmatic reasons.” For those of you who haven’t heard of the Cheminformatics Research Centers, they are part of the Molecular Libraries Roadmap Program (MLP), which is
an integrated set of initiatives aimed at developing and using selective and potent chemical probes for basic research … [The MLP] was proposed to introduce high-throughput screening approaches to small molecule discovery, formerly limited to the pharmaceutical research industry, into the public sector… [and] is made up of the following major components: (1) access to a library of compounds (Molecular Libraries Small Molecule Repository); (2) access to bioassays provided by the larger research community; (3) support for the development of breakthrough instrumentation technologies; (4) access to a network of screening and chemical probe generation centers (MLPCN) where assays are screened and probe development is undertaken; (5) Pubchem, the primary portal through which the screening results of the MLPCN are made public and (6) the Cheminformatics Research Centers (CRCs) with multiple roles focused on high-level data analysis and dissemination with a focus on developing new understanding of the cellular processes (genes and pathways).
One reason why this is so surprising is because the grants were due next week (June 28th). I imagine the timing of this decision (and the decision itself) is bound to upset a number of people in this community, especially since many applicants were probably working around the clock to get their grant submitted before the (now non-existent) deadline… Does anyone know more about this story or why the grant was canceled?
Joshua
Joshua Finkelstein (Senior Editor, Nature)
First, Joshua, no apologies needed – this affects world science not just US chemoinformatics. (And a reminder that Nature was active in helping to report the activities of those of us who wished to promote the value of the Pubchem effort). In Cambridge we are (or were) working with potential applicants in this program and were in the process of preparing material. We were informed:
There is a notice published today in the NIH Guide
that cancels the “Preapplication for Cheminformatics Research Centers
(X02)” PAR-07-353. Here is the essential element of that notice:
“This Notice is to inform the scientific community of the cancellation
of the PAR-07-353 entitled, Preapplication for Cheminformatics Research
Centers (X02), due to programmatic reasons. Applications should not be
submitted for the June 28, 2007 date. Any application submitted will not
be assigned or reviewed. NIH intends to re-issue this announcement at a
later date.”
The point is that NIH is (or was until yesterday) reaching out to the world community. Chem(o)informatics is a key tool in understanding biology and in discovering new leads for pharma. There was a real chance to revitalise chemoinformatics, which has been languishing for many years, bedevilled by lack of:
  • Open Data
  • Open software
  • Open processes
  • a modern approach to information
  • reproducible science
Just last week we had a 3-day workshop at the Unilever Centre on Machine Learning, aimed primarily at chemoinformatics. One of the speakers, a statistician, has published concerns about the quality of science in many chemoinformatics publications – they suffer from every problem in the list above.
The NIH program would have given a major boost to Open reproducible science. The data would have been fully Open and reusable, software would have been Open and modular, and chemistry would have had a major example of how data could be re-used for science rather than being aggregated and resold. Whether or not our group had been funded I was looking forward to this program as it would have been a highly cost-effective use of funds. And it could have shown the pharma industry, which relies heavily on this approach but does so little to encourage good practice in it, a way forward. Will any pharma CEOs speak out? Or do anything to help a science on which they depend?
It would be irresponsible to speculate on what “programmatic reasons” means. It could be that, like Britain, the US wants to spend its income on wars rather than health. But unfortunately not every US organization approves of the NIH’s funding of Pubchem and related projects, which are often seen as “socialist” and as the government competing against the “private sector”. Will all scientific journalists highlight the major damage to chemistry that this cancellation causes?
So, Joshua, please keep investigating. Maybe there is a need for another scientific Woodward-Bernstein.
[Added subsequently - my private email favours cock-up rather than conspiracy. But it's still indefensible to pull grants 6 days before the deadline. And my concern in the previous paragraph still holds - chemoinformatics should have proper public support to support proper science, not languish.]
[1] From Monty Python. In the current case the humour is ironic.
[This is the second grant I have failed to get today. At least the other one - for FP7 - got to the submission stage and was - presumably - reviewed]

Data and Institutional Repositories

Saturday, June 23rd, 2007
One of the themes of ETD2007 was a strong emphasis on institutional repositories (IRs). Not surprising, since they are topical and a natural place to put theses and dissertations. Almost everyone there – many from the Library and Information Services (LIS) community – had built, or was building, IRs. I asked a lot of people why. Many were doing it because everyone else was, or there was funding, or similar pragmatic motives. But beyond that the motives varied considerably. They included:
  • To promote the institution and its work
  • To make the work more visible
  • To manage business processes (e.g. thesis submission, or research assessment exercises).
  • To satisfy sponsors and funding bodies
  • To preserve and archive the work
  • To curate the work
and more. The point is that there is no single purpose and therefore IR software and systems have to be able to cope with a lot of different demands. The first generation IRs (ePrints, DSpace, Fedora) addressed the reposition of single eManuscripts (“PDFs”) with associated metadata. This now seems to work quite well technically, although there are few real metrics about whether they enhance exposure and there is poor compliance in many institutions. There is also a major problem with some Closed Access publishers (e.g. in chemistry) formally forbidding Open reposition. So the major problems are social.
Recently the LIS community has started to highlight the possibility of repositing data. This is welcome, but needs careful thought – here are a few comments. Many scholars produce large amounts of valuable data. In some cases the data are far more important than the full text. For example Nick Day’s crystallographic repository CrystalEye contains 100,000 structures from the published literature and, although it links back to the text, can be used without needing to. This is also true of crystallographic data collected from departmental services as in our SPECTRa system. With the right metadata, data can often stand alone.
All scientists – and especially those with sad experiences of data loss (i.e. almost all) – are keen for their data to be stored safely and indefinitely. And most would like other scientists to re-use their data. This needs a bit of courage; the main drawbacks are:
  • re-analysis could show the data or the conclusions were flawed
  • re-analysis could discover exciting science that the author had missed
  • journals could refuse to publish the work (isn’t it tedious to have to mention this in every post :-( )
But many communities have faced up to it. The biosciences require deposition of many sorts of data when articles are published. A good example is the RCSB Protein Data Bank which has a very carefully thought-out and tested policy and process. If you are thinking of setting up a data repository this document (and many more like it) should be required reading. It works, but it’s not trivial – 205 pages. It requires specialist staff and constant feedback to and from the community. Here’s a nice, honest, chunk:
If it is so simple, why is this manual so long? In the best of worlds all information would be exchanged using both syntactically and semantically precise protocols. The reality of the current state of information exchange is somewhat different. Most applications still rely on the poorly defined semantics of the PDB format. Although there has been significant effort to define and standardize the information exchange in the crystallographic community using mmCIF, the application of this method of exchange is just beginning to be employed in crystallographic software. Currently, there is wide variation in specification of deposited coordinate and structure factor data. Owing to this uncertainty, it is not practical to attempt fully automated and unsupervised processing of coordinate data. This unfortunately limits the functionality that can be presented to the depositor community through Web tools like ADIT, as it would be undesirable to have depositors deal with the unanticipated software failures arising from data format ambiguities. Rather than provide for an automated pipeline for processing and validating coordinate data, coordinate data processing is performed under the supervision of an annotator. Typical anomalies in coordinate data can be identified and rectified by the annotator in a few steps permitting subsequent processing to be performed in an automated fashion. The remainder of this manual describes in detail each of the data processing steps. The data assembly step in which incoming information is organized and encoded in a standard form is described in the next section.
So data reposition is both highly desirable and complex. I’m not offering easy solutions. But what is not useful is the facile idea that data can simply be reposited in current IRs. (I heard this suggested by a commercial repository supplier in the Digital Scholarship meeting in Glasgow last year. He showed superficial slides of genomes, star maps, etc. and implied that all this could be reposited in IRs. It can’t and I felt compelled to say so.) At ETD2007 we had a panel session on repositing data. Lars Jensen gave a very useful overview of bioscientific data – here are my own points from it:
  • There is a huge amount of bioscience data in repositories.
  • These are specialist sites, normally national or international
  • There is a commitment to the long-term
  • much bioscience is done from the data in these repositories
  • the data in them is complex
  • the community puts much effort into defining the semantics and ontologies
  • specialist staff are required to manage the reposition and maintenance
  • there are hundreds of different types of data – each requires a large amount of effort
  • the relationships between the data are both complex and exceedingly valuable
All bioscientists are aware of these repositories (they don’t normally use this term – often “data bank”, “gene bank”, etc. are used). They would always look to them to deposit their data. Moreover the community has convinced the journals to enforce the reposition of data by authors. Some other disciplines have similar approaches – e.g. astronomers have the International Virtual Observatory Alliance. But most don’t.
So can IRs help? I’d like to think they can, but I’m not sure. My current view is that data (and especially metadata) – at this stage in human scholarship – have to be managed by the domains, not the institution. So if we want chemical repositories the chemical community should take a lead. Data should firstly be captured in departments (e.g. by SPECTRa) because that is where the data are collected, analysed, and – in the first instance – re-used. For some other domains it’s different – perhaps it might be at a particular large facility (synchrotron, telescope, outstation, etc.).
Some will argue that chemistry already operates this domain-specific model. Large abstracters aggregate our data (which is given for free) and then sell it back to us. In the 20th C this was the only model, but in the distributed web it breaks. It’s too expensive and does not allow for community ontologies to be developed (the only Open ones in chemistry are developed by biologists). And it’s selective and does not help the individual researcher and department.
Three years ago I thought it would be a great idea to archive our data in our DSpace repository. It wasn’t trivial to put in 250,000 objects. It’s proving even harder to get them out (OAI-PMH is not designed for complex and compound objects). Joe Townsend, who works with me, will submit his thesis very shortly. He wants to preserve his data – 20 GBytes. So do I. I think it could be very useful for other chemists and eScientists. But where to put it? If we put it in DSpace it may be preserved but it won’t be re-usable. If he puts it on CD it requires zillions of actual CDs. And they will decay. We have to do something – and we are open to suggestions.
So we have to have a new model – and funding. Here are some constraints in chemistry – your mileage may vary:
  • there must be support for the development of Open ontologies and protocols. One model is to encourage groups which are already active and then transfer the maintenance to International Unions or Learned Societies (though this can be difficult when they are also income-generating Closed Access Publishers)
  • funders must make sure that the work they support is preserved. Hundreds of millions or more are spent in chemistry departments to create high-quality data and most of these are lost. It’s short-sighted to argue that the data only live until the paper publication
  • departments must own the initial preservation of this data. This costs money. I think the simplest solution is for funders to mandate the Open preservation of data (cf. Wellcome Trust).
  • the institutions must support the generic preservation process. This requires the departments actually talking to the LIS staff. It also requires LIS staff to be able to converse with scientists on equal terms. This is hard, but essential.
Where the data finally end up is irrelevant as long as they are well managed. There may, indeed, be more than one copy. Some could be tuned for discoverability. So the simple message is:
  • save your data
  • don’t simply put it in your repository
I wish I could suggest better how to do this well.

CML on ICE – towards Open chemical/scientific authoring

Saturday, June 23rd, 2007
Because WWMM had outages my blogging is behind and I’d written a post on Peter Sefton’s ICE. Peter and I met at ETD2007 and immediately clicked. But WWMM went to sleep and I haven’t reposted. Peter has beaten me to it.
ICE is a content authoring tool based on Open Office. It works natively with XML and Subversion. So it adds a dramatic aspect to document authoring – versioning with full community access and collaboration (if required). For example, if Peter and I write a paper about this we’d use the ICE server at University of Southern Queensland to store the versions. And of course as it’s Open Source anyone can set one up – it would be ideal for the Blue Obelisk community to author papers with.
But what catalysed this was the possibility of authoring theses. Students are looking for imaginative approaches and many will be happy to be early adopters of this new technology. If the domain-specific components are in XML (or other standards) it becomes easy to integrate them into ICE. And it is fantastic to be able to revert to previous versions – I find Subversion easier than Word change management, for example. So some points from PeterS’s post:

View this page as PDF

I mentioned before that at the ETD 2007 conference I met Prof Peter Murray-Rust. [1] We’re going to collaborate on adding support for CML the Chemical Markup Language to ICE, so that people can write research publications that include ‘live’ data.
[1] I’m just Petermr or PMR :-)
Here’s a quick demo of the possibilities. I went to the amazing Crystaleye service.
PMR: This is Nick Day’s site. We’d hoped to announce it formally a week or so ago but machine problems kept us back. But we’ll get some posts out this coming week. [We thank Acta Cryst/IUCr for a summer studentship which helped greatly to get it off the ground.]
The aim of the CrystalEye project is to aggregate crystallography from web resources, and to provide methods to easily browse, search, and to keep up to date with the latest published information. http://wwmm.ch.cam.ac.uk/crystaleye/
Crystaleye automatically finds descriptions of crystals in web-accessible literature, turns them into CML and builds pages like Acta Crystallographica Section B, 2007, issue 03-00. From that page I grabbed this two dimensional image of (C6H15N4O2)2(C4H4O6-2).

PMR: Minor point: This is just the anion – there is a separate image for the cation (the 3D structure below displays the cations as well).

There’s a Java applet on the page that lets you play with the crystal in 3d. Here’s a screenshot of the 3d rendering.

There’s lots more work to be done, but I thought I’d show how easy it is to make an ICE document that shows the 2d view for print, with the 3d view for the web, via the applet. Be warned, this may not work for you. The applet refuses to load in Firefox 2 for me, but it does work in Safari on Mac OS X. If you follow the ‘view this page in PDF’ link above you’ll see just the picture.
PMR: image and applet deleted here …
What’s happening here? My initial hack is really simple. I grab the image and paste it into ICE like any other image, but then I link it to the CML source. I wrote a tiny fragment of Python in my ICE site to go through every page, and if it finds a link to a CML file containing an image, it adds code to load the CML into the Jmol applet. This is a kind of integration-by-convention, AKA microformat.
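PMR: The fragment itself is not shown in the post, so here is a minimal sketch of what such a page rewrite might look like. It is not the code from the ICE site. It assumes the page is available as an HTML string, that the convention is an anchor whose `href` ends in `.cml` and which wraps an `<img>`, and that the page already loads Jmol.js so a `jmolApplet(size, script)` call is available. The function name `add_jmol_loaders` and the default applet size are hypothetical.

```python
import re

# Assumed convention: <a href="something.cml"><img ...></a>
CML_IMAGE_LINK = re.compile(
    r'<a\s[^>]*href="(?P<href>[^"]+\.cml)"[^>]*>\s*<img\b[^>]*>\s*</a>',
    re.IGNORECASE,
)


def add_jmol_loaders(html, size=300):
    """Append a Jmol applet call after every image that links to a CML file.

    The original link and image are kept, so print/PDF output still shows
    the 2D picture; only the HTML view gains the 3D applet.
    """
    def with_applet(match):
        href = match.group("href")
        script = (
            '<script type="text/javascript">'
            'jmolApplet(%d, "load %s");'
            '</script>' % (size, href)
        )
        return match.group(0) + script

    return CML_IMAGE_LINK.sub(with_applet, html)


page = '<p><a href="files/crystal.cml"><img src="crystal.png"/></a></p>'
print(add_jmol_loaders(page))
```

A regular expression is enough for a sketch like this because the convention is so narrow; a real site would probably want a proper HTML parser, since attribute order and quoting vary between pages.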
The main bit of programming only took a few minutes, but sorting out where to put the CML files and the Jmol applet, and integrating the changes into this blog took ages. I ended up putting the files here on my web site which meant putting a big chunk of stuff into subversion, something that should have been done ages ago, but the version of svn that runs on my other server refuses to do large commits over HTTPS ‘cos of some SSL bug and I can’t figure out how to update it which meant switching the repository to use plain HTTP, and so on. It wasn’t made easier by me mucking around with the Airport Extreme router and our ADSL modem at the same time, halting internet access at home for a couple of hours. To make this integration a bit more usable and robust we want to:
  • Work out a workflow that lets you keep CML files in ICE and easily drop images in to your documents, letting ICE render using the applet when it makes HTML.
  • Integrate forthcoming work from Peter & team that will provide high quality vector graphics instead of the PNG files I’m using now.
PMR: I have now hacked JUMBO so it generates SVG images of 2D (and soon 3D) molecules. Note that this then allows automatic generation of molecular images in PDF files (through FOP/SVG).
  • Investigate embedding CML in an image format such as EPS that word processors understand.
  • Generalize this approach for other e-scholarship applications. We’re working with the Alive team at USQ on this.
  • Talk to the DART & ARCHER teams.
I am also extremely keen to talk to these teams – as they are doing very similar and complementary work to our SPECTRa and SPECTRa-T projects in capturing scientific data at source. I am impressed by the Australian commitment to Open Access, Open Data and collaborative working. ICE is an excellent example of how we can split the load. ICE likes working with the technical aspects of documents (I don’t really, though I have to). The Blue Obelisk likes working with XML in chemistry. The two components naturally come together. This is something I have been waiting for for about 12 years. We haven’t got there yet, but we are well on the way.

Author Choice in Chemistry at ACS – and elsewhere?

Saturday, June 23rd, 2007
A number of closed access journals/publishers have brought out “Author Choice” and similar approaches where authors pay publishers for “open access”. The details probably vary from publisher to publisher and I have been idly looking for examples in chemistry. It is actually extremely difficult to find articles on the basis of their access rights and you might think that I am weird to take this approach – surely the content is more important than the metadata? But I am interested in material that my robots can do exciting things with and I will take just about anything – chemistry is a desert for open text-mining (BJOC and Chemistry Central and a few others excepted).
I have found the first example of this in Chemistry, alerted – of course – by the blogosphere – in this case The ChemBlog. The post casually announces JACs ASAP AUTHOR CHOICE which points to a graphical abstract. The abstract links to a further fuller abstract, the HTML and the PDF. It also announces that this is an author choice paper and links to:
ACS AuthorChoice
Articles bearing the ACS AuthorChoice logo have been made freely available to the general public through the ACS AuthorChoice option. The ACS AuthorChoice option establishes a fee-based mechanism for individual authors or their research funding agencies to sponsor the open availability of their articles on the Web at the time of online publication. Under this policy, the ACS as copyright holder will enable unrestricted Web access to a contributing author’s publication from the Society’s website, in exchange for a fixed payment from the sponsoring author. ACS AuthorChoice also enables such authors to post electronic copies of published articles on their own personal websites and institutional repositories for non-commercial scholarly purposes.
and in further information:
The American Chemical Society’s Publications Division is pleased to announce an important new publishing option in support of the Society’s journal authors who wish or need to sponsor open access to their published research articles. The ACS AuthorChoice option establishes a fee-based mechanism for individual authors or their research funding agencies to sponsor the open availability of their articles on the Web at the time of online publication. Under this new policy, to be implemented later this Fall [2006, PMR], the ACS as copyright holder will enable unrestricted Web access to a contributing author’s publication from the Society’s website, in exchange for a fixed payment from the sponsoring author. ACS AuthorChoice will also enable such authors to post electronic copies of published articles on their own personal websites and institutional repositories for non-commercial scholarly purposes. The base fee for the ACS AuthorChoice option will be set at $3,000 during 2006-2007, with significant discounts applied for contributing authors who are members of the American Chemical Society and/or who are affiliated with an ACS subscribing institution. The fee structure will be as follows: $3,000: Base Fee (authors who are neither ACS members nor affiliated with an ACS subscribing institution)
I first went to the abstract:
J. Am. Chem. Soc., ASAP Article 10.1021/ja070003c S0002-7863(07)00003-0
Web Release Date: June 22, 2007
Copyright © 2007 American Chemical Society
A Red Cy3-Based Biarsenical Fluorescent Probe Targeted to a Complementary Binding Peptide
Haishi Cao, Yijia Xiong, Ting Wang, Baowei Chen, Thomas C. Squier, and M. Uljana Mayer*
Cell Biology and Biochemistry Group, Pacific Northwest National Laboratory, Richland, Washington 99352
uljana.mayer@pnl.gov
Received January 1, 2007
Abstract: We have synthesized a red …peptide motifs.
[Full text in html] [Full text in pdf]
There are several aspects of this:
  • The abstract (and the full text) is copyright ACS.
  • There is no mention in the abstract that this is an Author Choice publication
  • or in the HTML or PDF full text
  • the material cannot be used for commercial purposes
  • The link from the abstract to the HTML links to the toll-access login – i.e. this link is closed.
  • as is the PDF
Indeed I thought the whole paper was closed until I realised that the Open Access was possible only through the graphical abstract. (To be fair this is what the DOI – at present – points to so the open access version would be found in Google through the DOI). I cannot see any reason why the full abstract should only point to closed versions of the paper. Indeed I cannot see any reason why there are closed versions of the paper at all. I hope this is an oversight rather than deliberate.
I know and respect people at ACS and I know they read these messages. But the value that they have offered to the authors – who may have paid 3000 USD – seems minimal. They have insisted that the authors donate their intellectual property to the ACS, advertised the Openness of the article only in one place (and not in the article), and restricted the use of the article (copyright probably restricts my holding the article on my machine for serious computation). It is not in the spirit or letter of the Budapest Declaration.
I have seen very little evidence of “Open Choice” and other schemes having any impact in chemistry. Technically the publishers can claim that they offer “Green” Open Access for a fee. I suspect – though I don’t know – that other commercial publishers of chemistry such as Springer and Wiley have similar low uptakes of the schemes. I have no idea whether this worries them – they get the subscription income anyway. Certainly I don’t get the impression that they intend to change the publishing model this way. I have written to Springer asking for details on openness of data in their Open Choice policy and what the take-up is in chemistry but haven’t heard back – maybe this blog will elicit a response. And, indeed, I’d be delighted to hear from any other closed access publishers – they will get a considered response.
Although I support OpenAccess my energies are in Open Data, so I remain fairly quiet on the mechanisms and business models whereby OA may be achieved. The level of OA offered by these mechanisms is totally unsatisfactory for the modern semantic world. (That’s why I am turning to theses). [NOTE: If you reply please use a good ratio of text to links. I get 500 spam per day and although Akismet is excellent it occasionally consigns posts with a high link density to the spambin.]

The power of the scientific eThesis

Friday, June 15th, 2007
This is the summary of a presentation I am giving tomorrow at ETD2007 (run by the Networked Digital Library of Theses and Dissertations). I’m blogging this as the simplest way of (a) reminding me what I am going to say and (b) acting as a very rough record of what I might have presented. (My talks are chosen from a menu of 500+ possible slides and demos and I don’t know which at the start of the presentation, so it’s very difficult to have a historical record. The blog carries the main arguments.) Main themes (many of which have been blogged recently):
  • the thesis need not be a dull record of a final result but a creative work which lives and evolves until and beyond the “final submission”
  • theses should be semantic and interactive, supported by ontologies and go beyond “hamburger PDF”. Theses are computable.
  • We must develop communal semantic authoring/creation environments and processes.
  • the process should move rapidly towards embracing open philosophies and methodology. Metadata and ontologies should be open.
  • young people should be actively involved in all parts of managing the thesis process. (Harvard Free Culture)
  • “Web 2.0” will transform society and therefore the academic process. We must be prepared for this.
  • It is not clear that current approaches to “repositories” will help rather than hinder innovation and dissemination of eTheses. They will only be useful for preservation if they are semantic.
In detail scientific theses need support for authoring and validating:
  • thesis structure (templating) – e.g. USQ’s Integrated Content Environment (ICE) system which supports XML/“Word”
  • MathML
  • SVG (graphics)
  • CML (Chemistry)
  • GML (maps)
  • Numeric data (various, including CML)
  • graphs (various, including CML)
  • tables (various, including CML)
  • scientific units (various, including CML)
  • ontologies and dictionaries (various, including CML)
Some exciting thesis projects:
Why PDF is so awful:
Organic Theses: Hamburger or Cow?
Subversion (CML project)
Wikipedia – caffeine – (info boxes)
GoogleInChI – semantic chemical search without Google knowing
The power of the semantic Web – dbpedia.org – Using Wikipedia as a Web Database
Chemical blogspace – overview of exciting developments in chemistry
Local demos including analysis of theses:
What should institutions and NDLTD do to promote this vision?
  • involve young people in all parts of the process – understand Web 2.0 culture and democracy. Be brave
  • help promote their vision against the conservatism of institutions, learned societies and commercial interests
  • promote thesis creation as a complete part of the research process. Start on day 0 with tools, encouragement. Get students from year+1 to explain the vision
  • Harness the power of social computing (Google, Flickr, Wikipedia, etc.). You will have to anyway. Give credit for innovation in this area
  • Co-develop semantic authoring tools, including scientific languages. Use rich clients for display.
  • Promote the use of ontologies and similar resources as integral parts of the scholarly process. Insist on marked up information and entities
  • Use software to validate data in theses. Give these tools to examiners.
  • insist that data belongs to the scientific community. Use Creative Commons licenses from day 0.
and overall… Use the power of the scholarly community to show that they can communicate science far better than the absurd e-paper, unacceptable business models, and repression of innovation that is forced on us by the commercial and pseudo-commercial publishers. Destroy the pernicious pseudo-science of citation metrics. Reclaim our scholarship.

“open access” – some central questions

Tuesday, June 12th, 2007
I am grateful for the recent correspondence from Peter Suber and Stevan Harnad as it helps me get my thoughts in order for ETD2007. In response to Stevan’s post, Open Access: What Comes With the Territory, Peter has analysed the central question very clearly (as always).
I expect that all of us will agree with the analysis below. The position each of us takes may vary:
Summary [of Stevan's post]: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author’s own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author’s (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia – all the wrong stuff and irrelevant to OA.
Comments
  • Stevan is responding to Peter Murray-Rust’s blog post from June 10. But since I agreed with most of what Peter MR wrote, I’ll jump in.
  • Stevan isn’t saying that OA doesn’t or shouldn’t remove permission barriers. He’s saying that removing price barriers (making work accessible online free of charge) already does most or all of the work of removing permission barriers and therefore that no extra steps are needed.
  • The chief problem with this view is the law. If a work is online without a special license or permission statement, then either it stands or appears to stand under an all-rights-reserved copyright. The only assured rights for users are those collected under fair use or fair dealing. These rights are far fewer and less adequate than OA contemplates, and in any case the boundaries of fair use and fair dealing are vague and contestable.
  • This legal problem leads to a practical problem: conscientious users will feel obliged to err on the side of asking permission and sometimes even paying permission fees (hurdles that OA is designed to remove) or to err on the side of non-use (further damaging research and scholarship). Either that, or conscientious users will feel pressure to become less conscientious. This may be happening, but it cannot be a strategy for a movement which claims that its central practices are lawful.
  • This doesn’t mean that articles in OA repositories without special licenses or permission statements may not be read or used. It means that users have access free of charge (a significant breakthrough) but are limited to fair use.
PMR: “The chief problem with this view is the law”. That puts it precisely, and that’s where Stevan and I differ. At the moment I think we have to work within the law, and I think the law debars me from crunching. There may come a time where we feel that civil disobedience is unavoidable but it hasn’t arrived yet – if it does I shall be there. And some comments on other parts of Stevan’s post:

Get the Institutional Repository Managers Out of the Decision Loop

The trouble with many Institutional Repositories (IRs) (besides the fact that they don’t have a deposit mandate) is that they are not run by researchers but by “permissions professionals,” accustomed to being mired in institutional author IP protection issues and institutional library 3rd-party usage rights rather than institutional author research give-aways.

PMR: I have had similar thoughts. I got the distinct impression that some IRs are run like Victorian museums – look but don’t touch. The very word “repository” suggests a funereal process – it’s no surprise that having put much of my stuff into DSpace I find it’s an enormous effort to get it out. Why don’t we build “disseminatories” instead? [Stevan's analysis of how we should deposit papers omitted. I don't disagree - I'm just more interested in data at present.]

Now, Peter, I counsel patience! You will immediately reply: “But my robots cannot crunch Closed Access texts: I need to intervene manually!” True, but that problem will only be temporary, and you must not forget the far larger problem that precedes it, which is that 85% of papers are not yet being deposited at all, either as Open Access or Closed Access. That is the inertial practice that needs to be changed, globally, once and for all.

PMR: Here we differ. In many fields there has been little movement and no Green journals. We could wait another five years for no effect. But my main concern is the balance between Green access and copyrighted data. The longer we fail to address the copyrighting of data the worse the situation will become. Publishers are not stupid – they have revenue-oriented business people working out how to make money out of our data – Wiley told me so. Imagine, for example, that a publisher says “I will make all our journals green as long as we retain copyright. And we’ll extend the paper to cover the whole of the scientific record”. That would be wonderful for Stevan and a complete disaster for paper-crunchers. We can’t afford to wait for that to happen.

Just as I have urged that Gold OA (publishing) advocates should not over-reach (“Gold Fever”) – by pushing directly for the conversion of all publishers and authors to Gold OA, and criticizing and even opposing Green OA and Green OA mandates as “not enough” – I urge the advocates of automatized robotic data-mining to be patient and help rather than hinder Green OA and Green OA (and ID/OA) mandates.

PMR: I am not – I hope – hindering Green access. I am not personally agitating for Green or Gold – my energies go into arguing that the experimental process must not be copyrighted by the publisher or anyone else. And that institutional repositories should start to be much much more proactive and actively support the digital research process.

Stevan Harnad on “open access”

Tuesday, June 12th, 2007
Stevan Harnad – a tireless evangelist of OA – has replied to my points. He has been consistent in arguing the logic below and I agree with the logic. The problem is that few people believe that this allows us to act as he suggests. Stevan argues that current Green Open Access allows us to do all we wish with the exposed material without permission. However when I spoke to several repository managers at the JISC meeting all were clear that I could not have permission to do this with their current content. I asked “Can my robots download and mine the content in your current open access repository of theses?” – No. “Can you let me have some chemistry theses from your open access collection so I can data-mine them?” – No – you will have to ask the permission of each author individually. So Stevan’s views on what I can do seem not to be – unfortunately – widely held.
  1. Stevan Harnad Says: June 12th, 2007 at 3:37 am Open Access: What Comes With the Territory. Peter Murray-Rust’s worries about OA are groundless. Peter worries he can’t be sure that:
    “I can save my own copy (the MIT [site] suggests you cannot print it and may not be allowed to save it)”
    Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]
    “that it will be available next week”
    It will. The University OA IRs all see to that. That’s why they’re making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]
    “that it will be unaltered in the future or that versions will be tracked”
    Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.
    “that I can create derivative works”
    You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).
    “that I can use machines to text- or data-mine it”
    Yes, you can. Download and crunch away. This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia – all the wrong stuff and irrelevant to OA. Stevan Harnad, American Scientist Open Access Forum
Specific issues: My concern was not just with material in repositories but elsewhere. Some publishers allow green open access posting on web sites but debar it from repositories. So the concerns remain. The MIT repository deliberately adds technical restrictions to prevent printing of their theses, and this also technically prevents data and text mining. There are some hacks possible to get round this but they come close to dishonesty and illegality. “Derivative works” is a phrase that doesn’t work well in the data-rich subjects and we need something better. But it’s what the licenses use at present. In data-rich subjects linking to repositories is often of little use. I need thousands of texts on specialist machines accessed with high frequency and bandwidth. My problem is not with Stevan’s views but that few others give positive support to them, particularly not the repository managers. Maybe I’m too cautious…

More on “open access”

Tuesday, June 12th, 2007
I recently posted my concern about the use of “open access” as a phrase which is sufficiently broad to be confusing, and Peter Suber has created a thoughtful and useful reply. I agree in detail with all his analysis and any differences are probably in emphasis and strategy. Peter Murray-Rust, “open access” is not good enough, A Scientist and the Web, June 10, 2007. Excerpt: Comments [PeterS]
  • I agree with much but not all of what Peter MR says.  I’m responding at length because I’ve often had many of the same thoughts.
  • I’m the principal author of the BOAI definition of OA, and I still support it in full.  Whenever the occasion arises, I emphasize that OA removes both price and permission barriers, not just price barriers.  I also emphasize that the other major public definitions of OA (from Bethesda and Berlin) have similar requirements.
PMR: Agreed. PeterS continually and consistently asserts this – I am arguing that the level of emphasis throughout the community should be higher.
  • I don’t agree that the term “open access” on its own, or apart from its public definitions, highlights the removal of price barriers and neglects the removal of permission barriers.  There are many ways to make content more widely accessible, or many digital freedoms, and the term “open access” on its own doesn’t favor or disfavor any of them.  Even at the BOAI meeting we realized that the term was not self-explanatory and would need to be accompanied by a clear definition and education campaign.
  • The same, BTW, is true for terms like “open content”, “open source”, and “free software”.  If “open source” is better understood than “open access”, it’s because its precise definition has spread further, not because the term by itself is self-explanatory or because “open access” lacks a precise definition.
PMR: I accept this. In which case I think we have to look for additional tools of discourse. If “open access” serves an important current purpose in a broad sense it should continue to be used in that way but we should not expect it to deliver precision.
  • I do agree that many projects which remove price barriers alone, and not permission barriers, now call themselves OA. I often call them OA myself. This is only to say that the common use of the term has moved beyond the strict definitions. But this is not always regrettable. For most users, removing price barriers alone solves the largest part of the problem with non-OA content, and projects that do so are significant successes worth celebrating. By going beyond the BBB definition, the common use of the term has marked out a spectrum of free online content, ranging from that which removes no permission barriers (beyond those already removed by fair use) to that which removes all the permission barriers that might interfere with scholarship. This is useful, for we often want to refer to that whole category, not just to the upper end. When the context requires precision we can, and should, distinguish OA content from content which is merely free of charge. But we don’t always need this extra precision.
PMR: agreed. But “we often need the extra precision” is also valid.
  • In other words:  Yes, most of us are now using the term “OA” in at least two ways, one strict and one loose, and yes, this can be confusing.  But first, this is the case with most technical terms (compare “evolution” and “momentum”).  Second, when it’s confusing, there are ways to speak more precisely.  Third, it would be at least as confusing to speak with this extra level of precision –distinguishing different ways of removing permission barriers from content that was already free of charge– in every context.  (I’m not saying that Peter MR thought we should do the latter.)
  • One good way to be precise without introducing terms that might baffle our audience is to use a license. Each of the CC licenses, for example, is clear in its own right and each removes a different set of permission barriers. The same is true for the other OA-friendly licenses. Like Peter MR, I encourage providers to remove permission barriers and to formalize this freedom with a license. Even if we multiplied our technical terms, it will usually be more effective to point to a license than to a technical term when someone wonders exactly what we mean by OA for a given piece of work.
This is the central and simple point on which we are agreed – some of our problems we can solve without extra tools if we put our minds and energy into it. We aren’t yet doing that sufficiently. Part of the problem arises because in the Green approach to “open access” there is often an implicit trade-off between price freedom and permission freedom. There is toll-free access at the expense of having no permissions other than human readability – all the permissions (other than “fair use”) remain with the publisher. Many people may feel that this is a reasonable compromise in journal publishing at the present stage. Some may feel that 100% Green open access is an acceptable endpoint. But I think it comes with a cost to those of us who wish to develop digital scholarship – the use of the information in scholarship by machines as well as humans. As an example, the JISC meeting on institutional repositories I have just been at was called “Digital Repositories – Dealing with the Digital Deluge”. This is an emotive phrase – but it’s currently misleading. In many subjects there is a complete Digital Drought. And unless the permissions issue is dealt with there will continue to be. Permission freedom is essential for digital scholarship. My concern is that unless we address the permission issue much more actively we shall slide into the acceptance that permission freedom is the exception or less important than price. The one area where we have the power to act unilaterally is those parts of our own scholarship over which we have effective control – theses, data in repositories, teaching/learning materials, technical reports, etc. Let us work to make these 100% permission free. My immediate urgency is fueled by the ETD2007 meeting tomorrow. I hope that we can find consensus on this issue.

More Open Thesis heroes

Monday, June 11th, 2007
I have continued to try to find full OpenAccess theses and encountered considerable difficulty. The main problem is that universities and their repositories do not help readers to find theses with OpenAccess licenses and in many cases they do not give any license information at all. Anyway the story… I searched Google for “open access creative commons thesis” and found Mathias Klang’s thesis on Disruptive Technology. Mathias claims this is the first thesis in Sweden to be issued under CC, so I mailed and asked whether he had information from other countries about earlier theses. He mailed back:
Oleg Evnin at Caltech (successfully defended May 26, 2006) [PMR: blogged by Peter Suber]
…a number of CC-licensed ETDs at the U of Edinburgh and that the earliest seems to be by Magnus Hagdorn, submitted on March 4, 2004.
Many thanks Mathias, and I shall enjoy reading your thesis – this whole area needs some disruptive technology – I am finding that approaches to repositories still look conservative and based on outdated models of thought. I can’t comment in detail on the science but the format of Magnus’ thesis is an excellent example of what a modern thesis should contain – it’s 400Mbyte zipped but contains splendid animations and data of glaciation – worth a look. But the problem with the repositories is that there is no indication that the actual thesis is OpenAccess. The Edinburgh repository announces:
All items in ERA are protected by copyright, with all rights reserved. Copyright for this page [1] belongs to The University of Edinburgh [1] i.e. the metadata splash page
which discourages the visitor from looking for an Open License within the thesis. I’m sure this isn’t deliberate, but, repository managers, here is a very simple idea: Add dc:rights to the splash page and metadata and proudly proclaim in large letters: THIS THESIS CARRIES A CREATIVE COMMONS LICENCE – ENJOY!
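To show how little work this might be, here is a rough sketch in Python of generating the extra splash-page markup. It is not how ERA or any particular repository software does it: the function names, the licence name and the licence URL are all my own illustrative assumptions. It follows the usual convention for embedding Dublin Core in HTML (a schema.DC link plus a DC.rights meta element) and adds a rel="license" link so that robots can find the licence too.

```python
from html import escape

# Namespace of the Dublin Core element set, which defines dc:rights.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dc_rights_meta(license_name, license_url):
    """Return HTML head lines declaring a thesis licence (hypothetical helper)."""
    name = escape(license_name, quote=True)
    url = escape(license_url, quote=True)
    return "\n".join([
        f'<link rel="schema.DC" href="{DC_NAMESPACE}">',
        f'<meta name="DC.rights" content="{name}">',
        f'<link rel="license" href="{url}">',
    ])


def splash_banner(license_name):
    """Return the large-letters banner for the splash page (hypothetical helper)."""
    return f"THIS THESIS CARRIES A {license_name.upper()} LICENCE – ENJOY!"


if __name__ == "__main__":
    # Example values only; a real repository would take these from its own records.
    print(dc_rights_meta("Creative Commons Attribution 2.5",
                         "http://creativecommons.org/licenses/by/2.5/"))
    print(splash_banner("Creative Commons"))
```

The point is that the licence is stated once, in a form that both human visitors and text-mining robots can read, rather than being buried inside the thesis itself.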