digitizing theses

During my recent visit to Caltech I was able to see some of the digitization of theses. Caltech has an impressive program of putting its theses online, but of course many of these were not “born digital” and require conversion. Here’s a very famous one, which has been converted to PDF. Here the PDF is simply a digital record of the thesis, and it’s not easy to extract any textual information from it. Note that in this case the thesis consists only of a title page and published papers, cut and pasted (manually) into the thesis (it was before photocopiers, of course). It was also before handing over copyright to publishers. Would he have been able to do this today? (Yes, but…)
Very recent theses are born-digital (i.e. composed entirely on a computer, so machine-readable though NOT necessarily machine-understandable). For the earlier ones, the whole thesis is scanned (although the paper quality of some makes them almost unreadable to humans, let alone to machines). Then the abstract is OCR’ed and corrected by humans. Here’s part of the abstract of the thesis I quoted in my presentation at Caltech:

NOTE: Text or symbols not renderable in plain ASCII are indicated by […]. Abstract is included in .pdf document.
High valent middle and later transition metal centers tend to oxidatively degrade their ligands. A series of ligand structural features that prevent discovered decomposition routes is presented. The result of the iterative design, synthesis, and testing process described are the macrocyclic tetraamides H4MAC* and H4DEMAMPA-DCB. H4MAC* and H4DEMAMPA-DCB are the parent acids of the macrocyclic tetraamido-N ligands […] and […], which are shown to stabilize high valent middle and later transition metal complexes unavailable in other systems. The crystal structures of H4MAC* and a copper complex of one of its synthetic precursors reveal intramolecular and intermolecular hydrogen-bonding patterns which are relevant to recent developments in the ordering effects of hydrogen-bonding on solution and solid state structures. The synthetic value of these ordering effects is discussed.

PMR: We spent some time discussing how we could capture non-ASCII symbols and I’d be very grateful for suggestions. Some questions:

  • how easy is it for OCR software to capture non-ASCII characters (e.g. Greek symbols)?
  • how should these be captured? Unicode?
  • what should be done about sub/superscripts? should we use HTML?
  • should we try to extend some of this to MathML?
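
To make the Unicode and HTML options concrete, here is a minimal Python sketch of how a name from the abstract, such as H4MAC*, might be stored. It is only an illustration under stated assumptions: the function names are hypothetical, and the rule "digits that directly follow a letter or closing bracket are subscripts" is a naive guess that suits H4MAC* but would certainly mis-handle other chemical names.

```python
import html
import re
import unicodedata

# Unicode has dedicated subscript digits at U+2080..U+2089.
SUBSCRIPT_DIGITS = {ord(str(d)): 0x2080 + d for d in range(10)}

# ASSUMPTION: digits directly after a letter or closing bracket are subscripts.
# This is a naive illustrative rule, not a general chemistry parser.
_COUNT = re.compile(r"(?<=[A-Za-z)\]])(\d+)")


def to_html(formula: str) -> str:
    """Mark up assumed subscripts with HTML <sub> elements."""
    escaped = html.escape(formula, quote=False)  # protect &, <, >
    return _COUNT.sub(r"<sub>\1</sub>", escaped)


def to_unicode(formula: str) -> str:
    """Replace assumed subscripts with Unicode subscript digits."""
    return _COUNT.sub(lambda m: m.group(1).translate(SUBSCRIPT_DIGITS), formula)


def to_ascii_safe(text: str) -> str:
    """Keep a plain-ASCII field lossless by using numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def describe_non_ascii(text: str) -> list:
    """List each non-ASCII character with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    print(to_html("H4MAC*"))
    print(to_unicode("H4MAC*"))
    print(to_ascii_safe("\u03b7-bound ligand"))
    print(describe_non_ascii("\u03b7-bound ligand"))
```

The trade-off the sketch exposes is that the Unicode form survives in plain text but only covers a limited set of subscript and superscript characters, while the HTML form can mark up any run of text but needs an HTML-aware renderer. Numeric character references would at least let an ASCII-only abstract field keep Greek letters instead of replacing them with […].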

In science and technology many concepts are represented by single non-ASCII symbols. Do we have a way forward?
