#ILI2009: The challenge of searching in Institutional Repositories

In preparation for my challenge to Internet Librarian #ili2009 I am going to explain what I want to do. I want to search for a common and important chemical. I want to be able to search University Repositories for this. If I can do this for several hundred repositories simultaneously I will be moderately pleased.

Even if you are not a scientist, read on because I will explain why Google/Bing works and Institutional Repositories don’t. You MUST understand why IRs are failing: they don’t use the web properly.

I have chosen to search for 4-amino-benzoic acid because:

  • it’s relatively unambiguous (I’m glossing over some problems of chemical names for non-chemists)

  • it’s common in several fields (e.g. chemical research, bioscience, healthcare, medicine, industry)

  • many people buy it every day in sunscreens (even if they don’t know it)

  • it’s unlikely to have been specifically indexed by a human metadata librarian

  • and I know it’s in some repositories because I have looked manually

If I asked any undergraduate student to find out about 4-amino-benzoic acid they would go to Google. (I also include Bing because I want to be nice to our Microsoft funders). This is not the most accurate way as they will miss synonyms and they will get some noise, but it’s pretty good. (Science librarians will say they ought to go to Chemical Abstracts Scifinder, but that costs a lot of money, and doesn’t contain much of the information you will see below – and I will get flak for saying this and I’ll deal with it). And no doubt Wolfram will improve its (currently awful) chemistry. So here’s what they get from Google (the first 9 out of 194000). I add comments as [FOO]

4-Aminobenzoic acid – Wikipedia, the free encyclopedia

4-Aminobenzoic acid (also known as para-aminobenzoic acid or PABA) is an organic compound with the molecular formula C7H7NO2. PABA is a white crystalline
en.wikipedia.org/wiki/4-Aminobenzoic_acid – [WIKIPEDIA]

p-AMINOBENZOIC ACID (PABA)

FORMULA, H2NC6H4COOH. MOL WT. 137.14. H.S. CODE, 3922.49. TOXICITY, Oral rat LD50: 6000 mg/kg. SYNONYMS, 4Aminobenzoic Acid, 4Amino-Benzoesaeure;
chemicalland21.com/…/p-AMINOBENZOIC%20ACID.htm – [A BROKER/PORTAL FOR SUPPLIERS OF THE REAL STUFF]

File:4-Aminobenzoic acid.svg – Wikimedia Commons

English: 4-Aminobenzoic acid; p-aminobenzoic acid; Hachemina; Paraminol. Deutsch: p-Aminobenzoesäure; PABA; 4-Aminobenzoesäure; p-Carboxyanilin
commons.wikimedia.org/wiki/File:4-Aminobenzoic_acid.svg – [WIKIMEDIA, the structural formula as an image, presumably because it’s linked from Wikipedia]

4 Aminobenzoic Acid – a comprehensive view – Wellsphere

Expert articles, personal stories, blogs, Q&A, news, local resources, pictures, video and a supportive community. 4 Aminobenzoic Acid – Health Knowledge
www.wellsphere.com/wellpage/4-aminobenzoicacid – [A HEALTHFOOD SUPPLIER; this community is often not evidence-based]

4-aminobenzoic acid (CHEBI:30753)

29 Sep 2008 Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical
www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30753 – [CHEBI; ONE OF THE MAIN ONTOLOGIES FOR CHEMISTRY]

IngentaConnect Aminaftone, a Derivative of 4-Aminobenzoic Acid

Aminaftone, a Derivative of 4-Aminobenzoic Acid, Downregulates Endothelin-1 Production in ECV304 Cells: An In Vitro Study. Authors: Scorza, Raffaella1;
www.ingentaconnect.com/content/adis/rdd/2008/…/art00005
by R Scorza – 2008 – [A PUBLICATION IN A CLOSED ACCESS SCHOLARLY PUB; a bargain at only 55 USD for human readers; I didn’t pay it]

A9878 4-Aminobenzoic acid 99%

A9878 4-Aminobenzoic acid 99% Linear Formula: H2NC6H4CO2H. Molecular Weight: 137.14. Beilstein Registry Number: 471605. EC Number: 205-753-0
www.sigmaaldrich.com/catalog/ProductDetail.do?… – [A MAJOR SUPPLIER OF CHEMICALS FOR RESEARCH SCIENTISTS]

Intermediates /4-Aminobenzoic Acid ( P-Aminobenzoic Acid, PABA

China Intermediates /4-Aminobenzoic Acid ( P-Aminobenzoic Acid, PABA) and China 4-hydroxyindole, 3-Aminobenzoic acid, 4-aminobenzamide, 4-nitrobenzamide,
www.made-in-china.com/…/China-Intermediates-4-AminobenzoicAcid-P-AminobenzoicAcid-PABA-.html – [ANOTHER SUPPLIER; CHINA IS A MAJOR SOURCE OF FINE CHEMICALS]

Safety (MSDS) data for 4-aminobenzoic acid

20 Aug 2003 Safety (MSDS) data for 4-aminobenzoic acid. Synonyms: p-aminobenzoic acid, PABA, vitamin BX, anticanitic vitamin
msds.chem.ox.ac.uk/AM/4-aminobenzoic_acid.html – [SAFETY DATA FOR 2000 SUBSTANCES COLLECTED BY OXFORD UNIVERSITY; on the department web site (that’s where students look, not in the IR)]

I have also used Bing, the new Microsoft engine. It returns several of these sites; it doesn’t get ChEBI but gets:

1-10 of 9,520 results

ABAH

Acronym Finder: ABAH stands for 4-Aminobenzoic Acid Hydrazide

www.acronymfinder.com/4_AminobenzoicAcid-Hydrazide-(ABAH).html

4-Aminobenzoic acid – definition from Biology-Online.org

Definition and other additional information on 4-Aminobenzoic acid from Biology-Online.org dictionary. [A GLOSSARY]

AccessMedicine | 4-aminobenzoic acid

Table 57-5 FDA Category 1 Monographed Sunscreen Ingredients a Harrison’s Online > Chapter 57. Photosensitivity and Other Reactions to Light > Photoprotection [HEALTHCARE – sunscreen]

AccessMedicine | Mechanisms of Action of Antimicrobial Drugs

Topics Discussed: 4-aminobenzoic acid; aminoglycosides; antimicrobials; azalides; beta-lactam antibiotics; beta-lactamase; cell membrane transport; cell wall biosynthesis … [MEDICINE – infection]

4-Aminobenzoic acid

PubChem Substance (SID) 152180 3847 PubChem Compound (CID) 978 KEGG Compound ID C00568 CAS Registry IDs 150-13-0 8014-65-1 Miscellaneous Databases and IDs 30753- CHEBI 7627 – NSC 6840 – HSDB 6209 – CCRIS 4-27-00-07875 – Beilstein Handbook Reference 205-753-0 – EINECS Natural Isotopic Abundance Mass 137.1359800000 Mono-Isotopic Molecular Masses

Biological Magnetic Resonance Data Bank A Repository for Data from NMR Spectroscopy on Proteins, Peptides, Nucleic Acids, and other Biomolecules [CHEMICAL DATA (NMR)]

web.grcc.cc.mi.us

web.grcc.cc.mi.us/Pr/msds/physicalscience/2006/4AminobenzoicAcid99percent.pdf [SAFETY – MSDS]

To sum up.

This is a very good place to start from. Wikipedia has a good overview and several useful links. ChEBI has all the links to Open sites that you could want. PubChem has comprehensive but variable (author-supplied) information. I haven’t looked at Google Scholar yet. A student will conclude (correctly) that Wikipedia and Bing provide useful high-quality information.

So what if I want to get 4-amino-benzoic acid out of Institutional Repositories? I can’t, or at least I don’t know how to. I know it’s in those temples but I can’t get at it.

So why do Google/Bing work so well at finding what people want?

It’s about the hyperlinks.

Uhh?

It’s about the hyperlinks.

Google collects the information about which documents link to which other documents. The links are based on HTML, which contains a special tag (<a href>) to point to other documents. Google collects all these hyperlinks and builds a giant network. It then computes the eigenvectors. Don’t switch off, I only put this in to show that there is a clear algorithm for deciding the relative popularity of various sites. In very simple terms the sites which are most linked to are given the highest rank.
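For anyone who wants to see what "computes the eigenvectors" might look like, here is a minimal sketch of a PageRank-style power iteration. It is an illustration only and not Google's actual implementation: the function name `pagerank`, the damping value of 0.85, the iteration count and the toy link graph (including the page called `thesis_pdf`) are all my assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate a PageRank-style score by power iteration.

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are assumed values.
    """
    # Every page that appears as a source or a target is part of the network.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Each page starts the round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nothing links to the thesis PDF, and it links to nothing.
toy_web = {
    "wikipedia": ["chebi", "pubchem"],
    "chebi": ["wikipedia", "pubchem"],
    "pubchem": ["wikipedia"],
    "supplier": ["wikipedia"],
    "thesis_pdf": [],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page:12s} {score:.3f}")
```

In this toy network the page with the most incoming links ends up at the top, and the unlinked thesis sits at the bottom with only its baseline score.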

This ranking is based on exposing static HTML pages with hyperlinks to the search engines. If you don’t expose HTML pages you don’t get indexed. If you expose a database interface (e.g. a form) you don’t get indexed. (There are other methods, and Google will trawl OAI-PMH, but the primary linking is through HTML.)
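To make the difference concrete, here is a rough sketch of what a link-following crawler can pull out of a static page compared with a search form. It uses Python's standard `html.parser`; the class name `LinkCollector`, the helper `extract_links` and both sample pages (with their URLs) are hypothetical, and real crawlers do a great deal more than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the hyperlinks a simple crawler could follow from this page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical static listing page: every thesis is reachable by a plain link.
static_page = (
    '<html><body><p>See <a href="/theses/123.html">thesis 123</a> and '
    '<a href="/theses/124.html">thesis 124</a>.</p></body></html>'
)

# Hypothetical repository front end: the theses sit behind a search form.
form_page = (
    '<html><body><form action="/search" method="get">'
    '<input name="q"><input type="submit"></form></body></html>'
)

print(extract_links(static_page))
print(extract_links(form_page))
```

The static page gives the crawler two links to follow. The form gives it none, because there is nothing to follow until somebody types a query.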

Theses are reposited in PDF so they don’t contain hyperlinks. So a thesis doesn’t produce GoogleJuice. Theses are exposed through forms so they don’t get indexed that way.

So generally a scientific thesis in an IR is largely invisible to the main web. I am happy to modify this statement if anyone can provide evidence that a significant number of scientific theses have been discovered by Google and indexed.

So my question is simple:

How do I search for all occurrences of 4-amino-benzoic acid in theses worldwide? A simple, useful request. I don’t believe I can do it. If I still can’t do it by October (#ILI2009) I will highlight the issues.


7 Responses to #ILI2009: The challenge of searching in Institutional Repositories

  1. Klaus Graf says:

    I do not think that Google makes a difference between articles in repositories and theses.
    From my OA publications as historian there are 38 fulltexts in Freidok of University Freiburg. All PDFs have an E-Text layer under the facsimile (one article is E-text only).
    According to http://tinyurl.com/lqzcdx Google has indexed 28 of the 38 PDFs (all files are some months old; you must subtract 8 files from other authors from the total number of 36) – I feel this is a “significant number”.
    BTW: Bing doesn’t have all my PDFs, see http://tinyurl.com/n2rbah

  2. Hector says:

    I am glad that there is the beginning of chatter from scientists and academics on how our current library/archiving systems are failing us. Having worked on a 60 yr old project for my PhD, I spent endless, wasted hours worming my way through moldy basements and archives to retrieve articles which might or might not have relevance to my work.
    It is interesting that if I have to look for a topic related to my science, Google scholar is my first choice of search engine. Combining Google and Google reader I now perform searches for topics on a set schedule which I can go back and thumb through at my own leisure.
    That being said, the university library system at my institution has been investing in improving the electronic data bases for thesis and internal document searches. They have been doing this through the development of D-space (http://dspace.mit.edu/) and through the WorldCat project (http://mit.worldcat.org/). Performing the same search, 4-amino-benzoic acid, in the MIT library system, D-space did not return a result, but using WorldCat I retrieved 180 results from libraries participating in the WorldCat project. Not the results that you get with Google or Google scholar, but it is better than what was available just a short while ago.
    It is important to remind our academic institution administrators that the research performed at their institutions is only as relevant as the more-used search engines’ results. This can mean the difference between being referenced in a paper or being glossed over because no one has search engine access to the work performed by researchers. It should be a prerequisite that librarians and institutional repository administrators have a crash course in search engine optimization and other technologies which will put their institutions at the forefront of the dissemination of scientific information before they are rendered useless.

  3. Klaus Graf says:

    OpenDOAR is useless for scholarly purposes because of arbitrarily omitted results, see http://archiv.twoday.net/stories/5776766/

    • pm286 says:

      @Klaus thanks. I am not surprised. That means that there are no useful engines that search the academic web that I know of.
