digitizing theses

During my recent visit to Caltech I was able to see some of its thesis digitization at first hand. Caltech has an impressive program of putting its theses online, but of course many of these are not “born digital” and require conversion. Here’s a very famous one, which has been converted to PDF. Here the PDF is simply a digital record of the thesis – it’s not easy to extract any textual information. Note that in this case the thesis simply consists of published papers cut and pasted (manually) into the thesis (it was before photocopiers, of course). It was also before handing over copyright to publishers – would he have been able to do this today? (Yes, but…) Interestingly there is only a title page and the pasted articles.
Very recent theses are born-digital (i.e. completely composed on a machine, machine-readable though NOT necessarily machine-understandable). For the earlier ones, the whole thesis is scanned (although the actual paper quality of some makes them almost unreadable to humans, let alone machines). Then the abstract is OCR’ed and corrected by humans. Here’s part of the abstract of the thesis I quoted in my presentation at Caltech:

NOTE: Text or symbols not renderable in plain ASCII are indicated by […]. Abstract is included in .pdf document.
High valent middle and later transition metal centers tend to oxidatively degrade their ligands. A series of ligand structural features that prevent discovered decomposition routes is presented. The result of the iterative design, synthesis, and testing process described are the macrocyclic tetraamides H4MAC* and H4DEMAMPA-DCB. H4MAC* and H4DEMAMPA-DCB are the parent acids of the macrocyclic tetraamido-N ligands […] and […], which are shown to stabilize high valent middle and later transition metal complexes unavailable in other systems. The crystal structures of H4MAC* and a copper complex of one of its synthetic precursors reveal intramolecular and intermolecular hydrogen-bonding patterns which are relevant to recent developments in the ordering effects of hydrogen-bonding on solution and solid state structures. The synthetic value of these ordering effects is discussed.

PMR: We spent some time discussing how we could capture non-ASCII symbols and I’d be very grateful for suggestions. Some questions:

  • how easy is it for OCR software to capture non-ASCII characters (e.g. Greek letters)?
  • how should these be captured? Unicode?
  • what should be done about sub/superscripts? should we use HTML?
  • should we try to extend some of this to MathML?

In science and technology many concepts are represented by single non-ASCII symbols. Do we have a way forward? A rough sketch of the candidate encodings is below.
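Here is a minimal sketch (my own illustration, assuming plain Python 3 – nothing that any OCR pipeline actually mandates) of how the three candidate encodings compare for a superscripted Greek delta, the kind of symbol an abstract might contain:

```python
# Three candidate encodings for a superscripted Greek delta (δ+).
delta = "\u03b4"  # Unicode code point: GREEK SMALL LETTER DELTA
print(delta)      # -> δ  (Unicode captures the character itself)

# HTML keeps the baseline text searchable while rendering the superscript
html_fragment = f"{delta}<sup>+</sup>"

# MathML adds machine-understandable structure, at the cost of verbosity
mathml_fragment = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    f"<msup><mi>{delta}</mi><mo>+</mo></msup>"
    "</math>"
)
print(html_fragment)
print(mathml_fragment)
```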

Posted in Uncategorized | Leave a comment

caltech: (talk on) the power of the scientific e-thesis

Many thanks to Eric Van de Velde and colleagues for inviting me to Caltech to give a talk on the scientific e-thesis. Besides being excited about going to Caltech, I am delighted that they wished to record the presentation. (I gather this was technically successful.) This is of enormous value to me as it is the only true immediate record of what I have said. When I give a presentation I organize my material as a large number (perhaps 10,000) of HTML pages with associated images and possibly movies. To this I add various demos on my own PC – things like OSCAR, Bioclipse, etc. (I didn’t get as far as showing Bioclipse this time). The pages are roughly arranged in a two-level tree (category/page) but I cannot remember what all of them are.
The actual presentation depends on the makeup of the audience (I ask a few people, or ask for a show of hands, to get an idea of the topics that will be most interesting). I have a chunk of topics I am definitely going to cover but otherwise it is often what comes to mind as I go through the talk. (There is always far too much that I want to say.) So there is no clear navigation path. I rehearse the major components to make sure they are still working. Unfortunately every browser “upgrade” destroys more demos. (At some stage I expect my SVG to stop working.) I have to use both IE (for the SVG animations) and Firefox (for DBpedia). I had hoped we had left “best viewed in browser X” in the 1990s but no such luck.
So there is no simple linear record in the slides. It isn’t easy (or often meaningful) to leave “a copy of the presentation” as it consists of thousands of slides which make little sense without my commentary and a record of what was shown in what order. That’s why it is so valuable to have regular recordings using video and audio, as Caltech have provided. So many thanks.
FWIW I try to add something new each talk – each one is a moving average of past and future. For this talk I very happily discovered a Caltech thesis which used Jack Dunitz’s ideas and work extensively. From my introduction:

Jack Dunitz postdoc’ed at Caltech 1948-1951 with Linus Pauling and Verner Schomaker. He pioneered the use of very accurate crystal structures to give an insight into chemistry. Here is an example of a Caltech thesis (1992) which uses his ideas. See Chapter 9.
In 1973-4 I spent a sabbatical with Jack in Zurich showing how data from the literature could be combined to show new phenomena.

PMR: So I found a rather nice timeline. In the early 1970s Jack and co-workers would investigate chemistry by looking at very accurate crystal structures. These could take a month to complete. The thesis refers to their work on non-planar amide bonds, where they would design a molecule, persuade a colleague to make it, and do the structure. It could take years to investigate an idea. When I visited in 1974 Jack and Hans-Beat Buergi had shown that it was possible to use structures already in the literature. I used this idea to look at the distortions of tetrahedral molecules (simulating the SN1 reaction). There were over 100 data points, and each required me to locate a (physical) paper in a journal with the geometry of the molecule. Often I would find a paper that looked promising only to discover that there were no data or they were too inaccurate. So it took several months to find enough papers with good enough data. I got to know the system of document retrieval rather well! Now, 30 years on, we have all the information Openly gathered in CrystalEye. We can pull out tetrahedral molecules in seconds.
That still relies on the conventional publication process. If we can do the same thing for theses we shall have unlocked an enormous source of data.
I look forward to the recording and will alert you when it appears. I warned the audience that details were unpredictable and the machine would get tired towards the end. So in the spirit of Openness you will see it all as it happened!

Posted in data | 4 Comments

scifoo: data-driven science and storage

I managed to get out to a few sessions at scifoo outside my immediate concerns, of which two were on the Large Synoptic Survey Telescope and Google’s ability and willingness to manage scientific data. They come together because the astronomers are producing hundreds of terabytes every day(?) and academia isn’t always the most suitable place to manage the data. So some of them have considered/started shipping it to Google. Obviously it has to be Open Data; there cannot be human-related restrictions that require management.
Everyone thinks they are being overwhelmed with data. Where to keep it temporarily? Can we find it in a year’s time? Should we expect CrystalEye data to remain on WWMM indefinitely? But our problems are minute compared with the astronomers’, which are probably three orders of magnitude greater.
How would you obtain bandwidth to ship data to someone like Google? Remarkably, the fastest way to transmit it is on hard disk. Four 750 GB disks (i.e. 3 TB) fit nicely into a padded box and can be shipped by any major shipping company. And disk storage cost is decreasing at 78% per year.
I’m tempted to start putting our data into the “cloud” in this way. It’s Open, so we don’t mind what happens to it (as long as we are recognised as the original creators). It’s peanuts for the large players. If we allocate a megabyte for each new published compound (structure, spectra, crystallography, computation, links, etc. and the full-text if we are allowed) and assume a million compounds a year, that is just ONE terabyte. The whole of the world’s new chemical data each year can fit on a single disk! What the astronomers collect in one minute! (The arithmetic is sketched below.)
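A back-of-envelope check of both figures (the overnight transit time is my assumption; the rest are the estimates above):

```python
# Effective bandwidth of shipping a padded box of disks ("sneakernet")
payload_bits = 4 * 750e9 * 8   # four 750 GB disks, in bits
transit_seconds = 24 * 3600    # assumption: overnight courier
print(f"{payload_bits / transit_seconds / 1e6:.0f} Mbit/s")  # ~278 Mbit/s

# A year of new chemical data at 1 MB per compound
compounds_per_year = 1_000_000
print(f"{compounds_per_year * 1e6 / 1e12:.0f} TB/year")      # 1 TB
```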
But before we all rush off to do this we must think about semantics and metadata. The astronomers have been doing this for years. They haven’t solved it fully, but they’ve made a lot of progress and have some communal dictionaries and ontologies.
So we could have all the world’s chemical information on our desktops or access it through GYM (Google/Yahoo/Microsoft).
I wonder why we don’t.

Posted in data, open issues, scifoo | 7 Comments

scifoo: Open Science

One of the themes at scifoo was “Open Science” or “Open Notebook Science” – the latter term coined by Jean-Claude Bradley: the idea that science is publicly recorded as it is done. The very first bottom-up session (i.e. Saturday morning) was run by J-CB and Bora Zivkovic of PLoS ONE. Here are two comments:

Corie Lok et al., Scifoo: day 1; open science

It’s late and so I’ll keep this short. I’ll write more detailed accounts of Scifoo soon, but here are some highlights so far.
My day today started off with a contentious talk about open science. It quickly veered off into a complaint session about how the slow publication process in biology and the fear of not being credited and of being scooped are hindering open science (putting prepublication info and data online). But the physicists in the room quickly got annoyed by the complaining (not exactly new complaints either) and so the discussion got back on track to focus on current efforts to put more data and discussion of prepublication research online (such as Jean-Claude Bradley’s open notebook efforts). The session set the stage for several other related ones later in the day. It also spawned one taking place tomorrow about the culture of fear among young scientists: fear of doing open science, at the risk of jeopardizing career prospects. I’ll definitely be at that one. For another perspective on this session, check out Anna’s post on it.

PMR: here’s Anna’s post:

Swimming in the Ocean

Date:
Saturday, 04 Aug 2007 – 22:46 GMT
Have you heard the expression “small fish in a big pond”? I have an updated version. How about, “plankton in an ocean”? That’s me. I am the plankton, spending the weekend with CEOs of major corporations, editors-in-chief, a couple of Nobel prize winners, people advancing science and media in ways I can hardly comprehend… and Martha Stewart. That, in a nutshell (or an ocean, as the case may be) is Science Foo Camp, where I am currently sitting with mouth hanging open and ears open wide.
One of the major themes of this free-form gathering has been open access publishing. In a group discussion led by Bora Zivkovic of PLoS ONE, tempers flared (which made it even more fun than staring at science celebrities), and the many complications, pros and cons of open access were raised. Does the term “open access” refer to pre- or post-publication open access? Is it open, non-peer-reviewed publication of articles or even complete lab notebooks, or access to reviewed, published articles free of charge? That aside, will open access publishing negatively affect the hiring potential of young faculty looking for tenure-track positions or funding from organizations such as the Wellcome Trust and the NIH?
What about intellectual property? How does one protect findings aired in a public forum? One attendee replied that you don’t, it doesn’t matter, it should all be free and open. As much as I personally admire this free love, Birkenstock/Woodstock approach to science and research, I do not believe it to be feasible at the moment. Science is run by money. In order to get money or funding, one must publish. The changes and minor revolutions that need to occur in publishing before the concept of the science paper becomes obsolete are staggering. They are also occurring as we speak.
Back to gaping at people far smarter than me.

PMR: and comments to Anna (so far):

Comments

  • Bora Zivkovic said:
    Small fish? No way – I was very excited to get to meet you in person.
  • Anna Kushnir said:
    The pleasure was all mine. I am happy I got the chance to meet you!
  • Jean-Claude Bradley said:
    Concerning the question of intellectual property, I am guessing that you are referring to my comment. I was not saying that all research should be open and free – just that people who are interested in intellectual property protection should probably not do Open Notebook Science. And this is no different than in the traditional publication process. People who are interested in intellectual property should not publish manuscripts without filing a patent (at least a provisional US patent). This is an expensive route and completely unrealistic for most scientific research projects. Money is not the sole motivation of scientists. If that were the case who would study fields like archaeology and cosmology? I wish that we had more time to discuss these issues during the session.
  • Deepak Singh said:
    I think the IP issue didn’t get brought up enough, especially with the peer2patent and other IP types there. In many cases the flaws are not in intent, but in the system itself. That said, I think as a community, we know what the problems are. We should just focus on solutions rather than trying to go into what’s wrong in excruciating detail 🙂

PMR: and here is Duncan Hull’s account:



9.30am: open science 2.0: where we are, where we’re going

After breakfast at Googley’s, I head off to a session on Open Science 2.0. This session is a game of two halves; in the first half there is much talk of how publishing is a roadblock to many things we would like to achieve with science on the web. Peter Murray-Rust talks of “conservative chemistry”, where (un-named) publishers are the problem, not the solution, and block the whole of the University of Cambridge from accessing content in unapproved ways (text-mining). Paul Sereno and chemist Carl Djerassi discuss the importance of publications in getting jobs and tenure at Stanford. There is talk of the dangerous power of editors of journals, who ultimately decide careers that they are blind to. They don’t just accept papers when they publish, they make and break people’s livelihoods. Andrew Walkingshaw tells of a common perception amongst young scientists about the importance of the h-index and other publication metrics. Eric Lander points out that publication isn’t everything for young scientists; a lot of it comes down to letters of recommendation in job applications, and this fact is often overlooked by young scientists. Pamela Silver talks of how the publish-or-perish mentality is slow like molasses, and sends many talented young scientists at Harvard running and screaming from academia into the arms of anywhere else that will have them, which is a great loss to science. We move on to Open Access; Tim Hubbard, head of informatics at Sanger, tells how the Wellcome Trust insists any publications that arise from its funded research projects must be freely available within six months after publication. Jonathan Eisen talks of different types of open access, which is not just about reading papers for free, but reusing them for free too, as in Creative Commons. Somebody, possibly Richard Jefferson, talks of a reputation engine called Carmleon? (not sure of spelling).
All of this makes young scientists risk-averse and paranoid, which is bad. The only people who can take risks are established scientists, which is a shame. But the discussion takes a u-turn when Paul Ginsparg (arXiv.org) and Dave Carlson point out we should be having fun, not moaning about publishing. We didn’t all come here to whinge; we should be talking about the technology that will enable us to break the publishing roadblock and make science a better place to live, work and play. On this note, Bora Zivkovic tells of publication turnaround times at PLOS, which are now “9 weeks not 9 months”. This is great for young scientists, who often don’t have time to wait for the glacial turnaround times of many publishing companies. He asks what cyberinfrastructure would look like in 2015. Jean-Claude Bradley gives a demo of Usefulchem (see for example this experiment); tools like blogs and wikis will make an important contribution in this area.

Summary

Science is becoming more open, but it will be a slow evolution, not a rapid revolution. We’re heading in the right direction, and some of the tools for doing it are beginning to work. PLOS asks people to be courageous and send their papers in; this can be a gamble, when scientists often favour the old favourites of Nature, Science and PNAS. This session was typical of scifoo: it’s a mashup of different ideas from very different people working in different areas. It doesn’t always summarise neatly, but that’s life. A session on this came later, called the Culture of Fear, led by Andrew Walkingshaw and Alex Palazzo.

PMR: The session didn’t go as planned – JCB had produced material to demonstrate and didn’t get to show it till near the end. The meeting got hijacked by the theme of Open Access, and I helped in the hijack when I probably should have stayed quiet. It meant that we didn’t explore the bright future but reiterated the less inspiring present. But somehow that was the burden that a lot of people had brought with them. Scifoo doesn’t run on predictable lines, and one good thing was that Alex and Andrew were inspired to run a session (young scientists and the culture of fear) they hadn’t planned to when they came.
“Open Science” is a concept whose time has arrived. I prefer “Open Notebook Science” because there is less chance of confusion with other terms which have nothing to do with the concept. Under Open Research WP has a stub which lists a few examples – add some more.

Posted in open issues, scifoo | 1 Comment

scifoo: young scientists and the culture of fear

On the last day, inspired by the previous sessions and the community atmosphere of the meeting, Andrew Walkingshaw and Alex Palazzo ran a session on the problems of being a postdoc under the pressure of having to publish in high-impact journals. They explained how the very high sense of competition and the pressure to conform to a single measure of success constrained innovation – their sense of concern came through very clearly. Here are their blog entries (AW first):

The Scifoo nature

Scifoo was a blast.
Alex Palazzo and I ran a session today on the politics of scientific communication/open access, particularly for young scientists: he writes about our thoughts here. I was really delighted with how it went; many people, including some very successful academics and the editor-in-chief of Nature, Philip Campbell, came along and shared their thoughts.
There’ll be more on what we actually discussed in due course, but the thing happening was itself staggering; from half-formed idea to a really deep round-table discussion in less than forty-eight hours. Creating a space where that can happen is priceless; I can’t thank the organisers enough for inviting me, and, equally importantly, everyone there for their generosity of spirit and openness.

PMR: Then AP. Read this in full, and also the commentary it has generated (and may continue to generate):

Scifoo – Day 3 (well that was yesterday, but I just didn’t have the time …)

Category: art, food, music, citylife and other mental stimuli
Posted on: August 6, 2007 10:46 AM, by Alex Palazzo
Our session on Scientific Communication and Young Scientists, the Culture of Fear, was great. Many bigwigs in the scientific publishing industry were present and a lot of ideas were pitched around. I would also like to thank Andrew Walkingshaw who co-hosted the session, Eric Lander for encouraging us to pursue this discussion, Pam Silver for giving a nice perspective on the whole issue, and all the other participants for giving their views.
Now someone had asked that we vlog the session; we actually tried to set it up but we didn’t have the time. In retrospect I’m glad we didn’t. This became clear at the last session of scifoo, where attendees voiced their comments on the logistics of scifoo: many conference-goers preferred to keep video and audio recording devices away from the sessions as they impede open discussion. Conversations off the record can be more honest and more productive.
So about the session …
The main point that we wanted to make was that there are problems with the current way that we are communicating science, and due to developments with Web 2.0 applications there is a big push to change how this is done. But we must keep in mind the anxieties and fears of scientists. How we communicate not only determines how information is disseminated but also affects the careers of the scientists who produce content. Until there is general consensus from the scientific publishing industry, the major funding institutions, and the higher echelons of academia (for example junior faculty search committees), junior scientists are unlikely to participate in novel and innovative modes of scientific communication. The bottom line is that it is just too risky to do so.
There are two main areas that remain to be clarified by the scientific establishment.
1) Credit. How do we ascertain who deserves credit for an original idea, model or piece of data?
2) Peer-review. Although most scientists and futurists who promote much of the open-access model of scientific publishing support some type of peer-review, where the science or consistency of a particular body of work is evaluated, there remains some confusion as to whether peer-review should continue to assess the “value” of a particular manuscript. Right now, manuscripts that are submitted to any scientific publication must attain some level of importance that is at least equal to the standards of that particular journal. When evaluating the scientific contribution of any given scientist, close attention is paid to their publication record and particularly where their manuscripts are published. Now whether we should continue to follow this model, where editors and the senior scientists determine the scientific validity of any given manuscript, is being questioned. In an alternative model many technologists are pushing post-publication evaluation processes which assess the importance of any single manuscript after it has been released with minimal peer-review. These include not only citation indices but also newer metrics that are currently being developed by many information scientists. There are many problems with these systems; the most critical is that most of the value cannot be assessed until many years after the publication date. An important piece of work may take years to have an impact in a particular field. Until the scientific establishment reaches a consensus as to whether these post-publication metrics are indeed useful for determining the credentials of a scientist in the shorter term (<2 years post-publication) it is unlikely that any scientist would risk publishing their findings in a minimally peer-reviewed journal.
There was a strong feeling that the top journals do provide a valuable filtering service. They go through all the crap in order to publish the best work. OK, they don’t always succeed, but competition between all the big journals promotes a high standard. And many scientists are reluctant to give up this model. Journals also help to improve the quality of the published manuscripts; this function would be lost if all we had was PLoS One and Nature Precedings. To all those who think that journals must be eliminated in favour of an ArXiv.org model: you are now warned.

PMR: I kept quiet during this session – I have no easy answer. It’s clear that the pressure to get scientific jobs is increasing – whereas not so long ago institutions could choose from those they knew (with all the pluses and minuses) now they try to create a “level playing field”. And what measure do they have when everyone has rave references? It’s difficult not to count the numbers. We did hear that one leading systems biology lab did not simply look at publications but wanted to choose people who could provide a major shift in emphasis and might have a relatively unconventional paper trail. But it’s not common.
Much credit to Alex and Andrew for their bravery in running this session, and to scifoo for it being the sort of place where it could happen.

Posted in scifoo | 3 Comments

scifoo: blogsession

As I’ve mentioned, at scifoo the programme was evolved by the participants in a first-come, first-accepted process whereby we signed up for free slots. It was hardly surprising that the blogosphere gained a slot, and on Sunday we found a community of about 10-15 bloggers discussing how and why they did it. Here are some of the blogs that scifoo members have created, some of which were at the session. (Andrew Walkingshaw created a PlanetScifoo, an aggregation of the blogs updating every half-hour.) Nothing special about my selection… they weren’t all at the session.

So we spent an hour talking about why we did it, what we got out of it, etc. At one end are the compulsive writers – Henry Gee explained how he couldn’t help blogging – it was in the journalistic tradition. I sometimes feel like this, but not to the extent of being driven to communicate something, whatever it is. Many of us feel we have an “audience”, community, whatever, with whom we have a fragile rapport. Some bloggers get a lot of feedback, others very little. Often we are dependent on real-life contacts for feedback (I generally get little unless I unwittingly or otherwise press the “outrage button” and find out who is at the other end). Many bloggers who act as transducers for the immediate are appreciated by their following – the stream of consciousness of unprepared commentary on the world makes contact.
Some – such as Richard Akerman – have been blogging for years; others, like me, have yet to reach their first blogversary. Some, especially those in clear employment (e.g. publishers), have boundaries that should not be overstepped. Where those boundaries lie is not always clear. Some have more than one blog – a day blog and a more anonymous non-work one. Some feel “soft constraints”, especially when they are partially hosted by – say – a publisher’s umbrella. But I think most would be prepared to speak their mind – here (A Letter to Martha) is Anna Kushnir criticising Martha Stewart for failing to live up to the promise of Scifoo. (I wasn’t there, but it sounds like a valid comment.)
So blogs are of all sorts. Mine has a life of its own.

Posted in scifoo | 2 Comments

scifoo: immediate impressions

Like many other bloggers I’m contributing my impressions. (technorati aggregates all blogs which contain the tag – or even the word – “scifoo”). There will be so much to read that I don’t need to add detail.
The format is self-organising: all campers write ideas on a large board – there is no restriction on what they can suggest – and hope the idea is appealing enough that others come to their sessions, or add additional ideas. Google has a number of small rooms and two large ones, but it seemed to work out pretty well and you could carry in folding chairs. (The whole atmosphere was very relaxed – the Google inmates looked after us very well – there was a huge team.)
I’d come expecting a vision of the future and there were many instances of that. But there was also an urgency to sort out the present that was more intense than I had expected, and I found that almost all the sessions I attended were about the here-and-now. Publishing (in all aspects, including blogging and virtual communities) was more important than I had expected, so I felt compelled to go to many sessions like that. As a result I didn’t go to hear, or even meet, Neal Stephenson, Greg Bear or Kim Stanley Robinson. Or hear about the future of biological manipulation. So some slight regrets.
But only because with 200+ people (I don’t know the number) and only 6 mealtimes you can’t do everything and you can’t meet everyone. So I’ll be concentrating on the issues which are close to my daily interests – blogging, publications, semantic science.
I’m now at Caltech – Pasadena – in the Einstein Suite (yes, he stayed in it and there are pictures on the walls). I’ve never been here, though it has had a large influence on my scientific life through my mentor Jack Dunitz, who worked here with Linus Pauling and Ken Trueblood on the structure of molecules, using the power of crystallography and the power of the human brain.
Tomorrow I talk on Science and digital repositories. I think that scifoo will influence what I say considerably.

Posted in scifoo | 1 Comment

scifoo: Testing on the Toilet and other events

Among the informalities (or formalities) at Google is a series of pamphlets in the loo. As you are spending your time there – standing or sitting – you see “Testing on the Toilet”, a series of daily inspirational aids for better software practice. The current theme – in Java – is testing. “Debugging sucks, testing rocks” (a sentiment I agree with, though I had thought to escape it in some areas of my existence). They even have an internal url: http://tott.
And – to show I am here – with Bora and Professor Steve from PLoS.
And you can get a good selection of scifoo blogosphere here.

Posted in scifoo | Leave a comment

Update from Scifoo

I’m sitting in the Google tent after a very full day – more to come – talking with Foocampers about how blogs work.
More later

Posted in scifoo | Leave a comment

International harvesting of OA ETD repositories

From Peter Suber’s blog:


Leading the way with a European e-Theses demonstrator project, a press release from the Dutch SURF Foundation, July 31, 2007. Excerpt:

The organisations JISC (UK), the National Library of Sweden and the Dutch SURFfoundation have tested the interoperability of repositories for e-theses. The result is a freely accessible European e-Theses portal providing access to over 10,000 doctoral theses.
For the first time ever, various local repositories containing doctoral e-theses have been harvested on an international scale. Five countries were involved in the project: Denmark, Germany, the Netherlands, Sweden and the United Kingdom.
Doctoral theses contain some of the most current and valuable research produced within universities. Still, they are underused as research resources. Nowadays, theses and dissertations no longer have to gather dust in attics or on the shelves of university libraries. By making them available on the Internet, both the author and the university can showcase their research, benefiting not only fellow scientists, but a broad public as well. And when they are publicly available, they are used many times more often than printed theses available only at libraries or by inter-library loan.
The result of this pilot project is described in the report A Portal for Doctoral e-theses in Europe; Lessons Learned from a Demonstrator Project. The report gives practical recommendations to improve the interoperability between the service provider and the data supplier. The recommendations are entirely in line with the guidelines of the DRIVER project.
The report may be useful for institutions that wish to show the world the results of their research. By making their material accessible in a standardised manner and using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), they can reach beyond any boundary….

PMR: This is very exciting. We have estimated that most of the raw data connected with chemistry is never published. But where is the pressure to publish data greatest? In theses, of course. If you fail to show your data, then the examiners can rightly ask you to find them or, worse, remeasure them if you have lost them. So it should be standard that at the end of a thesis there is a full set of (at least partially) inspected and validated data. Yet most of this is subsequently lost. So we really welcome this – not least because we have a JISC project (SPECTRa-T : Submission, Preservation and Exposure of Chemistry ...) to extract chemistry (meta)data from theses.
As I am sure everyone else is, I am excited by this, so can I make the following suggestions? If we want to do them, they shouldn’t be too difficult:

  • Where possible, text-based versions should be available. I know that many historical theses may only be present as bitmaps (TIFFs, etc.) but it’s really valuable to have searchable text. And even if the OCR isn’t 100% it’s possible to do a lot with slightly imperfect scans, and even to suggest corrections in some cases. Maybe if people are interested in a thesis they could OCR it, correct it, and resubmit it.
  • Content should be freely text- and data-minable. Now I know that copyright can be a slight problem here, but can we try to find creative ways round it? Every graduate student I have spoken to wants their thesis to be read and none have any problems with it being data-mined. But when I produced my thesis no-one had thought of text-mining. I actually have no idea whether I hold the copyright – I expect so. I don’t think there is actually very much worth mining, as nearly all the data has got into the public domain in publications. But who knows. Please don’t let 20-year-old “copyright” serve as an unnecessary barrier to text-mining – certainly not in the sciences.
  • There should be a way of communicating such material back so that theses can be annotated. That may be more difficult, but it is not inconceivable.

and so gradually we build up a resource which our robots, as well as we humans, can read – a fully searchable resource.
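To make the harvesting step concrete, here is a minimal sketch of pulling thesis metadata over OAI-PMH – the protocol the SURF report recommends. The endpoint URL is hypothetical; the verb and metadata prefix are the standard ones, and only the Python 3 standard library is used:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint: any OAI-PMH-compliant e-theses repository
# exposes the same interface.
BASE_URL = "https://repository.example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(BASE_URL + "?" + urllib.parse.urlencode(params)) as resp:
    tree = ET.parse(resp)

# Print the Dublin Core title of each harvested thesis record
DC = "{http://purl.org/dc/elements/1.1/}"
for title in tree.iter(DC + "title"):
    print(title.text)
```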

Posted in Uncategorized | Leave a comment