In preparation for my challenge to Internet Librarian #ili2009 I am going to explain what I want to do. I want to search for a common and important chemical, and I want to be able to search University Repositories for it. If I can do this for several hundred repositories simultaneously I will be moderately pleased.
Even if you are not a scientist, read on because I will explain why Google/Bing works and Institutional Repositories don’t. You MUST understand why IRs are failing because they don’t use the web properly.
I have chosen to search for 4-amino-benzoic acid because:
it’s relatively unambiguous (I’m glossing over some problems of chemical names for non-chemists)
it’s common in several fields (e.g. chemical research, bioscience, healthcare, medicine, industry)
many people buy it every day in sunscreens (even if they don’t know it)
it’s unlikely to have been specifically indexed by a human metadata librarian
and I know it’s in some repositories because I have looked manually.
If I asked any undergraduate student to find out about 4-amino-benzoic acid they would go to Google. (I also include Bing because I want to be nice to our Microsoft funders.) This is not the most accurate way, as they will miss synonyms and they will get some noise, but it’s pretty good. (Science librarians will say they ought to go to Chemical Abstracts SciFinder, but that costs a lot of money and doesn’t contain much of the information you will see below – and I will get flak for saying this and I’ll deal with it.) And no doubt Wolfram will improve its (currently awful) chemistry. So here’s what they get from Google (the first 9 out of 194000). I add comments as [FOO].
4-Aminobenzoic acid (also known as para-aminobenzoic acid or PABA) is an organic compound with the molecular formula C7H7NO2. PABA is a white crystalline …
en.wikipedia.org/wiki/4-Aminobenzoic_acid – Cached – Similar – [WIKIPEDIA]
FORMULA, H2NC6H4COOH. MOL WT. 137.14. H.S. CODE, 3922.49. TOXICITY, Oral rat LD50: 6000 mg/kg. SYNONYMS, 4–Amino–benzoic Acid, 4–Amino-Benzoesaeure; …
chemicalland21.com/…/p-AMINOBENZOIC%20ACID.htm – Cached – Similar – [A BROKER/PORTAL FOR SUPPLIERS OF THE REAL STUFF]
English: 4-Aminobenzoic acid; p-aminobenzoic acid; Hachemina; Paraminol. Deutsch: p-Aminobenzoesäure; PABA; 4-Aminobenzoesäure; p-Carboxyanilin …
commons.wikimedia.org/wiki/File:4-Aminobenzoic_acid.svg – Cached – Similar – [WIKIMEDIA, the structural formula as an image, presumably because it’s linked from Wikipedia]
Expert articles, personal stories, blogs, Q&A, news, local resources, pictures, video and a supportive community. 4 Aminobenzoic Acid – Health Knowledge …
www.wellsphere.com/wellpage/4-aminobenzoic–acid – Cached – Similar – [A HEALTHFOOD SUPPLIER; this community is often not evidence-based]
29 Sep 2008 … Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical …
www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30753 – Cached – Similar – [CHEBI; ONE OF THE MAIN ONTOLOGIES FOR CHEMISTRY]
Aminaftone, a Derivative of 4-Aminobenzoic Acid, Downregulates Endothelin-1 Production in ECV304 Cells: An In Vitro Study. Authors: Scorza, Raffaella1; …
www.ingentaconnect.com/content/adis/rdd/2008/…/art00005 – Similar –
by R Scorza – 2008 – Related articles – All 3 versions [A PUBLICATION IN A CLOSED ACCESS SCHOLARLY PUB – a bargain at only 55 USD for human readers; I didn’t pay it]
A9878 4-Aminobenzoic acid 99% … Linear Formula: H2NC6H4CO2H. Molecular Weight: 137.14. Beilstein Registry Number: 471605. EC Number: 205-753-0 …
www.sigmaaldrich.com/catalog/ProductDetail.do?… – Cached – Similar – [A MAJOR SUPPLIER OF CHEMICALS FOR RESEARCH SCIENTISTS]
China Intermediates /4-Aminobenzoic Acid ( P-Aminobenzoic Acid, PABA) and China 4-hydroxyindole, 3-Aminobenzoic acid, 4-aminobenzamide, 4-nitrobenzamide, …
www.made-in-china.com/…/China-Intermediates-4-Aminobenzoic–Acid-P-Aminobenzoic–Acid-PABA-.html – Cached – Similar – [ANOTHER SUPPLIER; CHINA IS A MAJOR SOURCE OF FINE CHEMICALS]
20 Aug 2003 … Safety (MSDS) data for 4-aminobenzoic acid. … Synonyms: p-aminobenzoic acid, PABA, vitamin BX, anticanitic vitamin …
msds.chem.ox.ac.uk/AM/4-aminobenzoic_acid.html – Cached – Similar – [SAFETY DATA FOR 2000 SUBSTANCES COLLECTED BY OXFORD UNIVERSITY; on the department web site (that’s where students look, not in the IR)]
I have also used BING the new Microsoft engine. It returns several of these sites, doesn’t get ChEBI but gets:
Acronym Finder: ABAH stands for 4-Aminobenzoic Acid Hydrazide
Definition and other additional information on 4-Aminobenzoic acid from Biology-Online.org dictionary. [A GLOSSARY]
Table 57-5 FDA Category 1 Monographed Sunscreen Ingredients a Harrison’s Online > Chapter 57. Photosensitivity and Other Reactions to Light > Photoprotection [HEALTHCARE – sunscreen]
Topics Discussed: 4-aminobenzoic acid; aminoglycosides; antimicrobials; azalides; beta-lactam antibiotics; beta-lactamase; cell membrane transport; cell wall biosynthesis … [MEDICINE – infection]
PubChem Substance (SID) 152180 3847 PubChem Compound (CID) 978 KEGG Compound ID C00568 CAS Registry IDs 150-13-0 8014-65-1 Miscellaneous Databases and IDs 30753- CHEBI 7627 – NSC 6840 – HSDB 6209 – CCRIS 4-27-00-07875 – Beilstein Handbook Reference 205-753-0 – EINECS Natural Isotopic Abundance Mass 137.1359800000 Mono-Isotopic Molecular Masses
Biological Magnetic Resonance Data Bank A Repository for Data from NMR Spectroscopy on Proteins, Peptides, Nucleic Acids, and other Biomolecules [CHEMICAL DATA (NMR)]
web.grcc.cc.mi.us/Pr/msds/physicalscience/2006/4AminobenzoicAcid99percent.pdf [SAFETY – MSDS]
To sum up.
This is a very good place to start from. Wikipedia has a good overview and several useful links. ChEBI has all the links to Open sites that you could want. PubChem has comprehensive but variable (author-supplied) information. I haven’t looked at Google Scholar yet. A student will conclude (correctly) that Wikipedia and Bing provide useful high-quality information.
So what if I want to get “4-amino-benzoic acid” out of Institutional Repositories? I can’t, or at least I don’t know how to. I know it’s in those temples but I can’t get at it.
So why do Google/Bing work so well at finding what people want?
It’s about the hyperlinks.
Google collects the information about which documents link to which other documents. The links are based on HTML, which contains a special tag (<a href>) to point to other documents. Google collects all these hyperlinks and builds a giant network. It then computes the eigenvectors. Don’t switch off; I only put this in to show that there is a clear algorithm for deciding the relative popularity of various sites. In very simple terms, the sites which are most linked to are given the highest rank.
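For those who did not switch off, here is a minimal sketch of the idea, not Google's actual code. It assumes a PageRank-style power iteration with a damping factor of 0.85 (my assumption, a commonly quoted value), and the toy web of five sites is invented purely for illustration. The thing to notice is where the unlinked thesis ends up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its links.

    `links` maps a page name to the list of pages it links to.
    This is a simplified illustration, not any search engine's real code.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally popular.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: names and links are made up for this example.
toy_web = {
    "wikipedia": ["chebi", "pubchem"],
    "chebi": ["wikipedia", "pubchem"],
    "pubchem": ["wikipedia"],
    "supplier": ["wikipedia"],
    "thesis_pdf": [],  # sits in a repository: links to nothing, linked from nothing
}

ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:12s} {ranks[page]:.3f}")
```

In this toy network the much-linked "wikipedia" page comes out on top, and the thesis, which nobody links to, gets only the baseline score.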
This ranking is based on exposing static HTML pages with hyperlinks to the search engines. If you don’t expose HTML pages you don’t get indexed. If you expose a database interface (e.g. a form) you don’t get indexed. (There are other methods, and Google will trawl OAI-PMH, but the primary linking is through HTML.)
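To make the difference concrete, here is a rough sketch of what a very naive crawler sees. It uses Python's standard html.parser module; the two page snippets and their URLs are invented for illustration, and real crawlers are of course far more sophisticated. The static page yields links to follow; the search form yields nothing.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag, as a naive crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical static page: plain hyperlinks that a crawler can follow.
static_page = (
    '<html><body><p>See <a href="http://example.org/thesis/42.html">this thesis</a> '
    'on <a href="/compounds/paba.html">PABA</a>.</p></body></html>'
)

# Hypothetical repository front end: a search form and no hyperlinks at all.
form_page = (
    '<html><body><form action="/search" method="post">'
    '<input name="q"><input type="submit"></form></body></html>'
)

static_links = extract_links(static_page)
form_links = extract_links(form_page)
print(static_links)
print(form_links)
```

The crawler has no way of knowing what to type into the form, so everything behind it stays out of the link network.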
Theses are reposited in PDF so they don’t contain hyperlinks, so a thesis doesn’t produce GoogleJuice. Theses are exposed through forms, so they don’t get indexed that way either.
So generally a scientific thesis in an IR is largely invisible to the main web. I am happy to modify this statement if anyone can provide evidence that a significant number of scientific theses have been discovered by Google and indexed.
So my question is simple:
How do I search for all occurrences of “4-amino-benzoic acid” in theses worldwide? A simple, useful request. I don’t believe I can do it. If I still can’t do it by October (#ILI2009) I will highlight the issues.