Unit tests or get a life

My unit tests have taken over my life this weekend. This post is a brief respite…
Unit tests are one of the great successes of modern approaches to programming. I’ve written briefly about them before, and Egon has an introduction to how they are used in Eclipse. Unit tests are often seen as part of “eXtreme Programming”, though it’s possible to use them more or less independently. Anyway, I was hoping to add some molecule-building extensions to JUMBO, and started on the CMLMolecule class. Even if you have never used unit tests, the mental attitude may be useful.
I then discovered that only half the unit tests for this class were written. Purists will already have judged me on this sentence – the philosophy is to write the tests before the code. For the CMLMolecule class this makes excellent sense. Suppose you need a routine:

Point3 molecule.getCentroid3()

then you first write a test. You might create a test molecule (formally a “fixture”), something like:

CMLMolecule testMolecule = makeTestMolecule();
Point3 centroid = testMolecule.getCentroid3();
Point3Test.assertEquals("centroid",
    new Point3(0.1, 0.2, 0.3), centroid, EPSILON);

I haven’t written the actual body of getCentroid3(). I now run the test and it fails the assertion (because the method doesn’t yet do anything). The point is that I have a known molecule and a known answer. (I’ve also had to create a set of tests for Point3 so I can compare the results.)
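For readers who have never met JUnit, here is a minimal, self-contained sketch of the same workflow. It is not JUMBO code (the class, coordinates and EPSILON below are invented stand-ins for CMLMolecule, Point3 and the real fixture), but it shows the shape of a test written against a known answer, JUnit 4 style:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CentroidTest {

    private static final double EPSILON = 1.0e-8;

    // Invented stand-in for getCentroid3(): average each coordinate.
    // In the test-first workflow this body is written only after the test below exists.
    static double[] centroid(double[][] coords) {
        double[] sum = new double[3];
        for (double[] xyz : coords) {
            for (int i = 0; i < 3; i++) {
                sum[i] += xyz[i];
            }
        }
        for (int i = 0; i < 3; i++) {
            sum[i] /= coords.length;
        }
        return sum;
    }

    @Test
    public void testCentroid() {
        // the "fixture": known coordinates with a known centroid (0.1, 0.2, 0.3)
        double[][] coords = { { 0.0, 0.0, 0.0 }, { 0.2, 0.4, 0.6 } };
        double[] c = centroid(coords);
        assertEquals(0.1, c[0], EPSILON);
        assertEquals(0.2, c[1], EPSILON);
        assertEquals(0.3, c[2], EPSILON);
    }
}

Run it with centroid() stubbed out (say, returning new double[3]) and the assertion fails; fill in the body and it passes.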
You won’t believe it till you start doing it but it saves a lot of time and raises the quality. No question. It’s simple, relatively fun to do, and you know when you have finished. It’s tempting to “leave the test writing till later” – but you mustn’t.
Now why has it taken over my weekend? Because I’m still playing catch-up. At the start of the year JUMBO was ca 100,000 lines of code and there were no tests. I have had to retro-fit them. It hasn’t been fun, but there is a certain satisfaction. Some of it is mindless, but that has the advantage that you can at least watch the football (== soccer) or cricket. (Cricket is particularly good because the gap between actions (ca 1 minute) often coincides with the natural edit/compile/test cycle.)
It’s easy to generate tests – Eclipse does it automatically and has produced 2135 at present. If I don’t add the test code I will get 2135 failures. Now JUnit has a green bar (“keep the bar green to keep the code clean”) which tells you when the tests all pass. Even 1 failure gives a brown bar. The green bar is very compelling. It gives great positive feedback. It’s probably the same limbic pathways as Sudoku. But I can’t fix 2135 failures at one sitting, so I can bypass them with an @Ignore. This also keeps the bar green, but it’s a fudge. I know those tests will have to be done some day.
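For anyone who has not seen the annotation, this is roughly what one of those bypassed tests looks like in JUnit 4 (the class and method names here are invented, not actual JUMBO tests):

import org.junit.Ignore;
import org.junit.Test;

public class MoleculeToolStubTest {

    @Test
    @Ignore("not yet written - delete this line to turn the failure back on")
    public void testSomeMoleculeOperation() {
        // body still to be written; while @Ignore is present JUnit skips the
        // method entirely, so the bar stays green even though nothing is tested
    }
}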
So 2 days ago JUMBO had about 240 @Ignores. Unfortunately many were in the CMLMolecule and MoleculeTool classes. And the more tests I created, the more I unearthed other @Ignores elsewhere.
So I’m down to less than 200 @Ignores now. I’ve found some nasty bugs. A typical one is writing something like:

if (a == null)

when I mean

if (a != null)

This is exactly the sort of thing that Unit tests are very good at catching.
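As a hedged illustration (the method and test are invented, not JUMBO code), here is how even a trivial test exposes an inverted null check the moment it runs:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class NullGuardTest {

    // Buggy version: the guard is inverted, so a real name falls through
    // to the "unnamed" branch (and a null name would throw NullPointerException).
    static String describe(String name) {
        if (name == null) {            // should be: if (name != null)
            return "molecule: " + name.trim();
        }
        return "unnamed molecule";
    }

    @Test
    public void testDescribeNamedMolecule() {
        // fails against the buggy version above: the actual result is "unnamed molecule"
        assertEquals("molecule: benzene", describe("benzene"));
    }
}

Flip the condition and the test (and the bar) goes green.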
So here’s my latest test:
[screenshot: unit0.png]
oh dear! it’s brown. I’ll just @Ignore that one test and…
[screenshot: unit1.png]
and it’s GREEN!
(But I haven’t fooled anyone. The @Ignore is just lying in wait to bite me in the future…)

Posted in programming for scientists | 3 Comments

Do we really need discovery metadata?

Many of the projects we are involved in and interact with are about systematising metadata for scientific and other scholarly applications. There are several sorts of MD; I include at least rights, provenance, semantics/format, and discovery. I’ll go along with the first three – I need to know what I can do with something, where it came from and how to read and interpret it. But do we really need discovery metadata?
Until recently this has been assumed as almost an axiom – if we annotate a digital object with domain-specific knowledge then it should be easier to find and there should be fewer false positives. If I need a thesis on the synthesis of terpenes then surely it should help if it is labelled “chemistry”, “synthesis” and “terpenes”. And it does.
But there are several downsides:

  • It’s very difficult to agree on how to structure metadata. This is mainly because everyone has a (valid) opinion and no-one can quite agree with anyone else. So the mechanism involves either interminable committees in “smoke-filled rooms” (except without the smoke) or self-appointed experts who make it up themselves. The first is valuable if we need precise definitions and possibly controlled vocabularies, but is not normally designed for discovery. The second, as is happening all over the world, leads to collisions and even conflict.
  • Authors won’t comply. They either leave the metadata fields blank, make something up to get it over with, or simply abandon the operation.
  • It’s extremely expensive. If a domain expert is required to reposit a document it doesn’t scale.

So is it really necessary? If I have a thesis I can tell without metadata, just by looking, whether it’s about chemistry (whatever language it’s in), whether it involves synthesis and whether it contains terpenes. And so can Google. I just type “terpene synthesis” and everything on the first page is about terpene synthesis.
The point is that indexing full text (or the full datument 🙂 ) is normally sufficient for most of our content discovery. Peter Corbett has implemented Lucene (a free-text indexer) and done some clever things with chemistry and chemical compounds. That means his engine is now geared up to discover chemistry on the Web from its content. I’ll speculate that it’s more powerful than the existing chemical metadata…
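To give a flavour of how little machinery full-text discovery needs, here is a minimal sketch using Lucene’s standard API (a recent 8.x/9.x release is assumed; the field name and text are made up, and this is emphatically not Peter Corbett’s code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FullTextSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();

        // index the raw text of a (made-up) thesis abstract - no discovery metadata at all
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "Stereoselective synthesis of monoterpenes from citral ...",
                    Field.Store.YES));
            writer.addDocument(doc);
        }

        // search the text directly; no subject keywords required
        Query query = new QueryParser("body", analyzer).parse("synthesis AND monoterpenes");
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("body"));
        }
    }
}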
So it was great to see this on Peter Suber’s blog (can’t stop blogging it!):

Full-text cross-archive search from OpenDOAR

OpenDOAR has created a Google Custom Search engine for the 800+ open-access repositories in its directory. From today’s announcement:

OpenDOAR – the Directory of Open Access Repositories – is pleased to announce the release of a trial search service for academics and researchers around the world….
OpenDOAR already provides a global Directory of freely available open access repositories that hold research material: now it also offers a full-text search service from this list of quality-controlled repositories. This trial service has been made possible through the recent launch by Google of its innovative and exciting Custom Search Engine, which allows OpenDOAR to define a search service based on the Directory holdings.
It is well known that a simple full-text search of the whole web will turn up thousands upon thousands of junk results, with the valuable nuggets of information often being lost in the sheer number of results. Users of the OpenDOAR service can search through the world’s open access repositories of freely available research information, with the assurance that each of these repositories has been assessed by OpenDOAR staff as being of academic value. This quality controlled approach will help to minimise spurious or junk results and lead more directly to useful and relevant information. The repositories listed by OpenDOAR have been assessed for their full-text holdings, helping to ensure that results have come from academic repositories with open access to their contents.
This service does not use the OAI-PMH protocol to underpin the search, or use the metadata held within repositories. Instead, it relies on Google’s indexes, which in turn rely on repositories being suitably structured and configured for the Googlebot web crawler. Part of OpenDOAR’s work is to help repository administrators improve access to and use of their repositories’ holdings: advice about making a repository suitable for crawling by Google is given on the site. This service is designed as a simple and basic full-text search and is intended to complement and not compete with the many value-added search services currently under development.
A key feature of OpenDOAR is that all of the repositories we list have been visited by project staff, tested and assessed by hand. We currently decline about a quarter of candidate sites as being broken, empty, out of scope, etc. This gives a far higher quality assurance to the listings we hold than results gathered by just automatic harvesting. OpenDOAR has now surveyed over 1,100 repositories, producing a classified Directory of over 800 freely available archives of academic information.

Comment. This is a brilliant use of the new Google technology. When searching for research on deposit in OA repositories, it’s better than straight Google, by eliminating false positives – though straight Google is better if you want to find OA content outside repositories at publisher or personal websites. It’s potentially better than OAIster and other OAI-based search engines, by going beyond metadata to full-text – though not all OA repositories are configured to facilitate Google crawling. If Google isn’t crawling your repository, consult OpenDOAR or try these suggestions.

I agree! Let’s simply archive our full texts and full datuments. We’ll never be able to add metadata by human means – let the machines do it. And this has enormous benefits for subjects like chemistry – Peter’s OSCAR3 can add chemical metadata automatically in great detail.
So now all we want is chemistry theses… and chemistry papers … in those repositories. You know what you have to do…

Posted in chemistry, open issues | 3 Comments

Build your own Institutional Repository

I have alluded to Institutional Repositories (IRs) before. Although I am an enthusiast and early adopter (having reposited 250,000 digital objects) a year ago I would have said they were still a minority activity. Not now. Universities and related HEIs are all implementing IRs, even though I suspect the majority of staff in some have never heard of them (we’ll reveal details for chemistry later). As well as the (IMO seriously overhyped) commercial tools there are several Open Source tools (EPrints, DSpace, Fedora (not the Linux one)) and hosted services (e.g. from BioMed Central). There are reasonable paths for institutions of different sizes to take the plunge. So it was nice to see on Peter Suber’s blog:

Implementing an institutional repository

Meredith Farkas has blogged some notes on Roy Tenant’s talk on institutional repositories at Internet Librarian 2006 (Monterey, October 23-25, 2006). Excerpt:

I knew that Roy would be likely to give a very practical nuts-and-bolts introduction to developing institutional repositories and I was certainly not disappointed.
Why do it?

  • Allows you to capture the intellectual output of an institution and provide it freely to others (pre-prints, post-prints, things that folks have the rights to archive). Many publishers allow authors to publish their work in archives either as a pre-print or after the fact.
  • To increase exposure and use of an institution’s intellectual capital. It can increase their impact on a field. More citations from open access and archived materials.
  • To increase the reputation of your institution.

How do you do it? …
Software options….
Key decisions

  • What types of content do you want to accept (just documents? PPT files, lesson plans, etc?)
  • How will you handle copyright?
  • Will you charge for service? Or for specific value-added services?
  • What will the division of responsibilities be?
  • What implementation model will you adopt?
  • You will need to develop a policy document that covers these issues and more.

Implementation models

  • Self archiving – ceaselessly championed by Stevan Harnad. Authors upload their own work into institutional repositories. Most faculty don’t want to do this.
  • Overlay – new system (IR) overlays the way people normally do things. Typically faculty give their work to an administrative assistant to put it on the Web. Now, the repository folks train the admin assistant to upload to the repository instead. Content is more likely to be deposited than if faculty have to do it….
  • Service provider – not a model for a large institution. Library will upload papers for faculty. The positive is that works are much more likely to be deposited. The negative is that it’s a lot of work and won’t scale….

Discovery options: Most traffic comes from Google searches, but only for repositories that are easily crawlable and have a unique URL for each document. OAI aggregators like OAIster.org have millions and millions of records. They harvest metadata from many repositories. Some may come direct to the repository, but most people will not come there looking for something specific. Citations will drive traffic back to the repository.
Barriers to success:

  • Lack of institutional commitment
  • Faculty apathy (lack of adoption and use)
  • If it is difficult to upload content, people won’t use it.
  • If you don’t implement it completely or follow through it will fail.

Strategies for Success

  • Start with early adopters and work outward.
  • Market all the time. Make presentations at division meetings and stuff
  • Seek institutional mandates
  • Provide methods to bulk upload from things already living in other databases
  • Make it easy for people to participate. Reduce barriers and technical/policy issues.
  • Build technological enhancements to make it ridiculously easy for people to upload their content….

This is a good summary. I’d add that much of the early work in IRs has come from subjects where the “fulltext” is seen as the important repositable [almost a neologism!]. We’re concerned with data, and my repositions have been computations on molecules. I also admit that even as an early adopter I don’t self-archive much. This is mainly because the publishers don’t allow me to. In some cases I cannot even read my own output on the publisher’s website, as Cambridge doesn’t subscribe to the journal online.
I have just realised what I have written! The publisher does not allow me to read my own work! We accept this?
The message from Open Scholarship was that voluntary repositing doesn’t work. There has to be explicit carrot and/or stick.
So while you are implementing your own IR make sure that you can reposit data as well as fulltext. This will be a constant theme of this blog!

Posted in open issues | Leave a comment

Open Map Data?

From Peter Suber’s Blog:

Mike Cross, Ordnance Survey in the dock again, The Guardian, October 26, 2006. Excerpt:

On one side of an electoral boundary, people might buy sun-blushed tomatoes; on the other, economy baked beans. Retailers like to know such things, so data from the 2001 census is of great commercial interest – and also the subject of the latest controversy in the Free Our Data debate.
Last week, the Association of Census Distributors filed a complaint against a state-owned entity, Ordnance Survey, over the conditions placed on the re-use of intellectual property in census data. It is the second time this year that the national mapping agency has been the subject of a complaint to the government’s Office of Public Sector Information…..
Technology Guardian’s Free Our Data campaign proposes that the best way to avoid such disputes is for basic data sets collected at taxpayers’ expense to be made freely available for any purpose (subject to privacy and national security constraints). While this would involve more direct funding for agencies such as Ordnance Survey, the economy as a whole would gain. At the moment, says [Peter Sleight of Target Marketing Consultancy], the national good is compromised because of a single trading fund’s commercial needs.

For non-UK readers: the Guardian is a liberal national newspaper (sharing with this blog a reputation for typos). The Ordnance Survey is the Government organization responsible for UK maps.
In our monthly CB2 meetings, where we work out how to put the world to rights, freedom of map data is a frequent topic. I regard maps as “open data” – they are part of the public infrastructure of existence. It was interesting that at the UK eScience (== Grid, == cyberinfrastructure) meeting a lot of the applications were map-based and almost all used the GoogleMap API. So it makes absolute sense for maps to be part of the Open Data definition.
It’s also good to see a newspaper championing freedom – we can almost prove it makes economic sense to remove this part of the anticommons.

Posted in open issues | 4 Comments

Commons in the pharma industry?

I was excited to see the following in Peter Suber’s Open Access Blog:


17:54 24/10/2006, Peter Suber, Open Access News
Pfizer is exploring data sharing with Science Commons. There are no details in this interview with David de Graaf of Pfizer’s Research Technology Center, but it’s a promising prospect to watch. Here’s the key passage:

When you encounter a knotty problem or roadblock in terms of your work in systems biology, who do you call among your peers in the industry?
…Everybody keeps running into the same toxicity and we can’t solve it. Actually putting our heads together and, more importantly, putting our data together may be something that’s worthwhile, and we’re exploring that together with the folks at Science Commons right now, as well as the folks at Teranode.

I have argued elsewhere that the current model of pharmaceutical sponsorship depletes the scientific Commons – this is a wonderful opportunity to change the model and enhance it. (I haven’t read the interview – the link seems broken.) I used to work in the pharma industry – it is well known that it is very difficult to discover safe and useful drugs. Each company tackles very similar problems to the others and each runs into the same problems. Most drug projects fail. If one company has slightly fewer failures they might count that as a competitive advantage, but if we look at it from the global view of the commons (post) – or the anticommons (post) – of the industry it is still a tragedy.
In some industries (e.g. luxury goods) a failure only costs the shareholders; in pharma it often results in poor or unsafe drugs. The drug companies have to collect safety information (I have worked with WHO on this issue) but much of this is secret. Since it is us who are the test vehicles for new compounds, is there not an overwhelming case for making toxicity information public and seeing this as a pre-competitive activity?
Posted in open issues | 2 Comments

Rich Apodaca: Closed Chemical Publishing and Disruptive Technology

Rich Apodaca, a founder member of the Blue Obelisk, has a thoughtful blog, DepthFirst. Besides the interesting stuff on programming – especially Ruby – there are useful injections from outside chemistry and IT. Here’s one:

The Directory of Open Access Journals (DOAJ) currently lists 2420 Open Access scholarly journals. Of these, 52 currently fall under the category of chemistry. Although the organic chemistry subcategory only currently lists three journals, the general chemistry category actually contains several journals containing organic chemistry content, such as the Bulletin of the Korean Chemical Society, Chemical and Pharmaceutical Bulletin, and Molbank.
Clearly, the chemistry journals included in DOAJ’s listings would not be considered to be in “the mainstream” by experts in the field. And that’s exactly the point. Innovation always happens at the margins.
As Clayton Christensen puts it in his landmark book, The Innovator’s Dilemma:

As we shall see, the list of leading companies that failed when confronted with disruptive changes in technology and market structure is a long one. … One theme common to all of these failures, however, is that the decisions that led to failure were made when the leaders in question were widely regarded as among the best companies in the world.

Replacing the word “company” with “scientific journal” leads to an important hypothesis about the future of scientific publishing.
And on the subject of disruptive innovation itself, Christensen writes:

Occasionally, however, disruptive technologies emerge: innovations that result in worse product performance, at least in the near-term. Ironically, in each of the instances studied in this book, it was disruptive technologies that precipitated the leading firms’ failure.

It seems very unlikely that scientific publishing operates according to a different set of rules than any other technology-driven business. The coming wave of disruptive innovation will be dramatic, and the outcome completely predictable.

PMR: and elsewhere he points to a possible disruptive technology…
Like everything else in information technology, the costs of setting up and maintaining a scientific journal are rapidly approaching zero. A growing assortment of Open Source journal management systems is available today. Recently, I was introduced to one of these packages by Egon Willighagen as part of my involvement with CDK News.

Open Journal Systems

Open Journal Systems (OJS) automates the process of manuscript submission, peer review, editorial review, article release, and article indexing. All of these elements are, of course, cited as major costs by established publishers intent on maintaining their current business models.
OJS appears to work in much the same way as automated systems being run by major publishers. In fact, OJS is already in use by more than 800 journals written in ten languages worldwide.
Did I mention that OJS is free software – as in speech? The developers of OJS have licensed their work under the GPL, giving publishers the ability to control every aspect of how their journal management system operates. Standing out from the crowd will no doubt be an essential component of staying competitive in a world in which almost anyone can start their own journal.

Alternatives

And there’s even better news: OJS has competition. Publishers can select from no fewer than seven open source journal management systems: DPubs; OpenACS; GAP; HyperJournal; SciX; Living Reviews ePubTk; and TOPAZ.

The Last Word

Open Source tools like Open Journal Systems have the potential to radically change the rules of the scientific publication game. By slashing the costs of both success and failure in scientific publication to almost zero, these systems are set to unleash an unprecedented wave of disruptive innovation – and not a moment too soon. What are the true costs of producing a quality Open Access scientific publication – and who pays? Will the idea of starting your own Open Access journal to address deficiencies with existing offerings catch on, especially in chemistry, chemical informatics, and computational chemistry? Before long, we will have answers to these questions.

PMR: Yes – these ideas are looking increasingly relevant and believable. In the same vein Steve Heller has wittily and irreverently shown the immense power of disruptive technology. Two years ago, when the Blue Obelisk was founded, it probably looked like the margins. Does it still? Many will think yes – I don’t 🙂
Posted in "virtual communities", chemistry, open issues | Leave a comment

Silicos contributes Commercial Open Source – thank you

It is very uncommon for commercial organizations in chemoinformatics to make any of their material Open Source (unlike the contributions of many IT companies – e.g. Eclipse, NetBeans, etc.). So I was very pleased to see an announcement of Open Source [BSD] chemoinformatics software on the Blue Obelisk list:

SiMath is Silicos’ open source library for the manipulation of data matrices and the subsequent mathematical and statistical modeling. SiMath tries to provide a simple C++ API that is data matrix centered and that covers the model building procedure from data preprocessing to training and evaluation of the model. The goal is to provide a library that can be easily integrated into standalone applications.
The rationale of SiMath is not to invent the wheel again but to integrate available open source packages and also newly implemented algorithms into one comprehensive library. Several well established libraries exist nowadays, but they all have a different interface and work with their own matrix representation. These tools are incorporated into SiMath and adapted such that their interface is consistent over all tools. For instance, all clustering algorithms are initiated by defining a set of parameters and the actual clustering is done by calling the cluster method with the data matrix.
Currently, SiMath contains modules for PCA (or SVD), matrix discretisation, SVM training and evaluation, several clustering algorithms, self-organizing maps and several general mathematical utilities.
More information about SiMath and how to download the source code can be found […]
Silicos is a chemoinformatics-based biotechnology company empowering virtual screening technologies for the discovery of novel compounds in a variety of disease areas.

This makes sense. The technology here is common to many applications and, as Hans De Winter says, it is foolish to reinvent the wheel. These are exactly the sort of components we need in the discipline. Because they are in C++ and many of us use Java, it may make sense to expose them as Web services (REST), as the message overhead is likely to be smaller than the computational cost.
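As a sketch of what calling such a service from Java might look like (everything here is hypothetical: the endpoint, the payload format and the clustering route do not exist; only the java.net.http client API is real):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SiMathClientSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical REST front-end to a SiMath clustering routine:
        // post a small data matrix, get cluster assignments back as JSON
        String matrixJson = "{\"rows\":[[0.1,0.2],[0.9,1.1],[0.15,0.25]],\"k\":2}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/simath/cluster"))  // invented endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(matrixJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}

The computation on the server dominates the cost, so the HTTP round trip should be negligible for anything non-trivial.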
The Blue Obelisk mantra – Open Data, Open Source, Open Standards welcomes contributions in any of these areas.

Posted in "virtual communities", chemistry, open issues, programming for scientists | 1 Comment

Chemistry Theses: How do you write them?

As I have shown it is hard and lossy to recover information from theses (or anything else!) written in PDF. In unfavourable cases it fails completely. I have a vision which I’ll reveal in future posts, but here I’d like to know how you wrote, write (or intend to write) your theses. This is addressed to synthetic chemists, but other comments would be useful. I have a real application and potential sponsorship in mind.
Firstly I guess that most of you write using Word. Some chemists use LaTeX (Joe, who is just writing up, told us that the most important thing he would do differently if he were starting his PhD again would be to use LaTeX). I would generally agree with this, although I am keen to see – in the future – what can be done with Open Office and Open Document tools, which will use XML as the basis. The unpredictable thing is how quickly OO arrives and what authoring support it has.
A main reason for using Word is that it supports third-party tools whose results can be embedded in a Word document. The most important of these are molecular editors (such as ChemDraw (TM) and ISIS/Draw (TM)). These are commercial, closed-source products. They also generally use binary formats which are difficult to untangle. (When these formats are embedded in Word they are impossible to decode – the Word binary format is not documented and efforts to decipher it are incomplete.) In some cases I could extract many (but not all) of the ChemDraw files in a document. There are also MS tools such as Excel.
I’d be interested to know if OO and/or the release of MS’s XML format has changed things, and what timescales we can reasonably expect for machine-processable compound documents. But for the rest of the discussion I’ll assume that current practice is Word + commercial tools. (In later posts I shall try to evangelise a brighter future…)
The typical synthetic chemistry thesis contains inter alia:

  • discursive free text describing what was done and why
  • enumerated list of compounds (often 200+) with full synthetic details and analytical data.

The free text looks like this:
[screenshot: disstext.PNG]
or:
[screenshot: diss47etc.PNG]
The compound information looks like this:
[screenshot: diss38b1.PNG]
Note that compounds are identified by a bold identifier (e.g. 38) which normally increases in serial order throughout the text. This is fragile, in that the insertion of a new number requires manual editing throughout the text (this is confirmed by various chemical gurus). Compounds are drawn in the middle of free-text sections, and again in the compound information. There are no tools to enforce consistency between the numbering and the diagrams. Moreover, information such as reagents, yields, and physical and analytical data is repeated in several places. It has to be manually transcribed, and (unless you tell me differently) this is a tedious, frustrating and error-prone process.
Moreover at this stage of writing the thesis the student has to assemble all the data for the 200 compounds. Are they all there? Could any of the spectra be muddled? Is that figure in the lab book a 2 or a 7? Heaven help if a spectrum is missing and the compound has now decomposed into a brown oil or got lost in the great laboratory flood. Of course none of this ever happens…
So are you all happy with how you authored, or will author, your thesis? (I haven’t even touched on how peaks are transcribed from spectra and how the rigmarole of spectral peak listings has to be authored and formatted.) If so, I’ll shut up. If not, I will make some serious and positive suggestions in a later blog.

Posted in chemistry | 3 Comments

Inorganic InChIs

Mark Winter – who has done an enormous amount to promote web-based chemistry such as WebElements – makes an important point:

  1. Mark Winter Says:
    October 18th, 2006 at 10:18 am
    OK – having carefully and rather too obviously written in InChI and SMILES strings in a story about ozone at nexus.webelements.info, and being an inorganic chemist who might want to write about a few inorganic species, I wondered how to write strings for, say, metal coordination complexes like the salt [Cr(OH2)6]Cl3. This compound is listed at PubChem at http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=104957
    but shows a nonsense structure, and not being a fluent InChI reader I therefore distrusted the InChI string on that page. I looked at the above mentioned carcinogenic potency database and found
    http://potency.berkeley.edu/chempages/COBALT%20SULFATE%20HEPTAHYDRATE.html
    where again the chemical structure drawn is nonsense and so again I have little confidence in the InChI string on that page.
    So how does one proceed for such species?

The structure in PubChem (CrCl3.6H2O) does not accurately reflect our current knowledge of the compound (though it was probably OK in 1850). It should be Cr(OH2)6(3+).3Cl-. InChI does not have any built-in chemical knowledge and calculates what it is given. It sometimes points out potential valence errors (e.g. CH5) but, since it is capable of representing unusual chemistry, it doesn’t throw actual errors. So this particular problem is PubChem’s, not InChI’s. (Note that there is a small fraction of errors in PubChem of many sorts – there is inconsistency in structural representation and some blatant errors. For those who like an amusing name, try CID: 27 and similar.) PubChem accepts contributions from many places and does not check chemical “validity”. (These problems are well addressed by social computing…)
There is a more difficult problem for compounds without an agreed connection table. How do we represent “glucose”? It can have an open form and four ring forms (furanose and pyranose, alpha and beta). Similarly “aluminium chloride” can be AlCl3, Al2Cl6 or Al3+.3Cl-, etc. InChI represents all of these faithfully but does not provide means of navigating between them. And coordination compounds may be represented differently by different humans – there is clearly no simple approach here.
But InChI takes a useful intermediate approach – it can disconnect the metal from the ligands. While this reduces the amount of information, it will provide better chances of finding isomers in a search – it should be fairly easy to sort them out.

Posted in chemistry | Leave a comment

Organic Theses: Hamburger or Cow?

This is my first attempt to see if a chemistry thesis in PDF can yield any useful machine-processable information. I thank Natasha Schumann from Frankfurt for the thesis (see below for credits).
A typical chemical synthesis looks like this (screenshot of PDF thesis).
[screenshot: diss38b.PNG]
For non-chemists this consists of the name and number of the compound, a recipe for the synthesis, a structural diagram (picture), the number, and analytical data (Mass Spec, Infrared and Ultraviolet). This is a very standard format and style. The ugly passive (“To a solution of X was added Y”) is unfortunately universal (cf. “To my dog was donated a bone by me”). The image is not easily deconstructed (note the badly placed label “1” and “+” making machine interpretation almost impossible – that is where we need InChIs).
I then ran PDFBox on the manuscript. This does as good a job as can be expected and produces the ASCII representation.
[screenshot: diss38c.PNG]
This is not at all bad (obviously the diagram is useless) and the Greek characters are trashed, but the rest is fairly good. I fed this to OSCAR1; it took about 10 seconds to process the whole thesis. You can try this as well!
[screenshot: diss38.PNG]
OSCAR has parsed most of the text (obviously it can’t manage the diagram labels, but the rest is sensible). It has extracted much of the name (fooled a bit by the Greek characters) and pulled out everything in the text it was trained to find (nature, yield, melting point). It cannot manage the analytical data because the early OSCAR only read RSC journals, but OSCAR3 will do better and can be relatively easily trained to manage this format.
So first shots are better than I have got in the past. OSCAR found data for 40 compounds – ca. 4 per second. Assuming that there are many similar theses there is quite a lot it can do. But not all have PDF that behaves this well…
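For anyone who wants to reproduce the extraction step, it really is only a few lines with PDFBox (a 2.x release is assumed, and the file name is of course made up):

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ThesisToText {
    public static void main(String[] args) throws IOException {
        // load the PDF and dump its text; as noted above, the diagrams are lost
        // and the Greek characters tend to be mangled, but the prose survives
        try (PDDocument doc = PDDocument.load(new File("thesis.pdf"))) {
            System.out.println(new PDFTextStripper().getText(doc));
        }
    }
}

Output of roughly this form is what was fed to OSCAR above.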
===================
Acknowledgement (from PDF)
Chiral Retinoid Derivatives:
Synthesis and Structural Elucidation of a New Vitamin A Metabolite
Dissertation approved by the Faculty of Life Sciences (Fakultät für Lebenswissenschaften) of the Technische Universität Carolo-Wilhelmina zu Braunschweig for the award of the degree of Doktorin der Naturwissenschaften (Dr. rer. nat.)
by Madalina Andreea Stefan from Ploiesti (Romania)

Posted in chemistry, data, XML | Leave a comment