#solo10 An introduction to textmining and data extraction

Scraped/typed into Arcturus

But now we’ll show what we can get out of patents. Even if you aren’t a chemist you should be able to follow this. It’ll show you what text-mining is about and how we are looking for greenness.

Here’s a typical report in PDF (I have cut and pasted it so that’s why it looks tacky – but that’s what you get with PDF):

A) 10-Octadecyl-_1,4,7,10-tetraazacyclododecane-_1,4,7-triacetic acid

_[0065] A mixture of 1,4,7,10-_tetraazacyclododecane-_1,4,7-_triacetic acid tris_(1,1-_dimethylethyl) ester (37.5 g; 72.8

mmol) and 1-_bromooctadecane (24.5 g; 73.5 mmol) in CH3CN (500 mL) was heated to reflux. After 2 h the reaction

mixture was evaporated and the residue was dissolved in CHCl3 and a portion of CF3COOH was added. After 16 h at

room temperature the reaction mixture was evaporated and the oily residue dissolved in CF3COOH. After 3 days at

room temperature, the solution was evaporated, the residue taken up in CHCl3 and the solution evaporated. This operation

was repeated three times. The oily residue was purified by flash chromatography as follows:_

Eluents:_

(a) CH2Cl2/_MeOH = 3/1 (v/v) 3 litres

(b) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres

(c) CH2Cl2/_MeOH/NH4OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres

_[0066] The product was dissolved in H2O and acidified with 6N HCl; then, the solution was loaded onto an AmberliteK

XAD-_8 resin column and eluted with a CH3CN/H2O gradient. The product started eluting with 20% CH3CN.

Because PDF is non-semantic we have lost some of the formatting, but it doesn’t matter. The XML is MUCH more useful. That’s why you should author XML as well as PDF (or even instead of it). Don’t switch off just because it’s XML. The key phrases we shall use are highlighted…

<p id="p0065" num="0065">A mixture of 1,4,7,10-tetraazacyclododecane-1,4,7-triacetic acid tris(1,1-dimethylethyl) ester (37.5 g; 72.8 mmol) and 1-bromooctadecane (24.5 g; 73.5 mmol) in CH<sub>3</sub>CN (500 mL) was heated to reflux. After 2 h the reaction mixture was evaporated and the residue was dissolved in CHCl<sub>3</sub> and a portion of CF<sub>3</sub>COOH was added. After 16 h at room temperature the reaction mixture was evaporated and the oily residue dissolved in CF<sub>3</sub>COOH. After 3 days at room temperature, the solution was evaporated, the residue taken up in CHCl<sub>3</sub> and the solution evaporated. This operation was repeated three times. The oily residue was purified by flash chromatography as follows:<br/>

Eluents:

<ul id="ul0004" list-style="none" compact="compact">

<li>(a) CH<sub>2</sub>Cl<sub>2</sub> / MeOH = 3/1 (v/v) 3 litres</li>

<li>(b) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 12/4/1 (v/v/v) 12 litres</li>

<li>(c) CH<sub>2</sub>Cl<sub>2</sub> / MeOH / NH<sub>4</sub>OH 25% (w/w) = 6/3/1 (v/v/v) 2 litres</li>

</ul></p>

PMR: notice how the subscripts and the list have been properly captured. Anyway, it’s the text that matters. The key phrases for the Green Chain Reaction are in bold.
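To make "it’s the text that matters" concrete, here is a minimal sketch of pulling the plain text out of a paragraph like the one above. It assumes Python’s standard `xml.etree.ElementTree`, plain ASCII quotes in the attributes, and a shortened copy of the paragraph; the helper name `flatten` is made up for illustration and is not part of any Green Chain Reaction code.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the patent paragraph (first sentence plus one eluent).
SNIPPET = (
    '<p id="p0065" num="0065">A mixture of 1,4,7,10-tetraazacyclododecane-'
    '1,4,7-triacetic acid tris(1,1-dimethylethyl) ester (37.5 g; 72.8 mmol) '
    'and 1-bromooctadecane (24.5 g; 73.5 mmol) in CH<sub>3</sub>CN (500 mL) '
    'was heated to reflux.<br/>'
    '<ul id="ul0004" list-style="none" compact="compact">'
    '<li>(a) CH<sub>2</sub>Cl<sub>2</sub> / MeOH = 3/1 (v/v) 3 litres</li>'
    '</ul></p>'
)


def flatten(element):
    """Join all text inside an element, dropping markup such as <sub>."""
    return "".join(element.itertext())


para = ET.fromstring(SNIPPET)

# The paragraph number survives as an attribute, which PDF does not give us.
paragraph_number = para.get("num")

# Each <li> becomes one clean line of text with the subscripts folded in.
eluents = [flatten(li) for li in para.iter("li")]
```

The subscripts collapse back into ordinary formula strings (CH3CN, CH2Cl2), which is the form the later phrase matching works on.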

dissolved in CF<sub>3</sub>COOH is a classic linguistic template. This tells us that CF3COOH is a solvent! I don’t know how green it is or isn’t, but here are its hazards:

R-phrases: R20, R35, R52/53
S-phrases: S9, S26, S27, S28, S45, S61

(http://en.wikipedia.org/wiki/Trifluoroacetic_acid )
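The "dissolved in X" template can be sketched as a regular expression over the flattened text. This is only a rough illustration, assuming the solvent is written as a single formula token straight after the phrase; the names `DISSOLVED_IN` and `solvents_from_dissolved_in` are invented here, and real text-mining tools do far more than this.

```python
import re

# "dissolved in X": capture the single token that follows the phrase.
DISSOLVED_IN = re.compile(r"\bdissolved in ([A-Za-z0-9]+)")

# Two sentences from the patent paragraph, with subscripts flattened.
TEXT = (
    "After 2 h the reaction mixture was evaporated and the residue was "
    "dissolved in CHCl3 and a portion of CF3COOH was added. After 16 h at "
    "room temperature the reaction mixture was evaporated and the oily "
    "residue dissolved in CF3COOH."
)


def solvents_from_dissolved_in(text):
    """Return every token that appears right after 'dissolved in'."""
    return DISSOLVED_IN.findall(text)


found = solvents_from_dissolved_in(TEXT)
```

Note that the first CF3COOH ("a portion of CF3COOH was added") is not picked up, because that sentence does not use the template; only the two "dissolved in" phrases match.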

in CH<sub>3</sub>CN (500 mL) is also a classic template. The (number + mL) tells us it’s a liquid (if it were a solid it would have grams (g) or milligrams (mg) as units). We also know that CH3CN is a liquid by looking it up:

http://en.wikipedia.org/wiki/Acetonitrile :
EU classification: Flammable, harmful
R-phrases: R11, R20/21/22, R36
S-phrases: (S1/2), S16, S36/37
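The "in X (number + unit)" template can be sketched the same way. The rule that a volume unit means liquid and a mass unit means solid is the heuristic described above, not a chemical fact, and the names `AMOUNT`, `VOLUME_UNITS` and `amounts` are made up for this example.

```python
import re

# "in X (number unit)": name, amount and unit of whatever follows "in".
AMOUNT = re.compile(r"\bin ([A-Za-z0-9]+) \((\d+(?:\.\d+)?) (mL|L|g|mg)\)")

# Heuristic only: volume units suggest a liquid, mass units a solid.
VOLUME_UNITS = {"mL", "L"}

# Tail of the first sentence of the patent paragraph, subscripts flattened.
TEXT = (
    "and 1-bromooctadecane (24.5 g; 73.5 mmol) in CH3CN (500 mL) "
    "was heated to reflux."
)


def amounts(text):
    """Return (name, amount, unit, guessed_state) for each template match."""
    results = []
    for name, value, unit in AMOUNT.findall(text):
        state = "liquid" if unit in VOLUME_UNITS else "solid"
        results.append((name, float(value), unit, state))
    return results


matches = amounts(TEXT)
```

The reagent "1-bromooctadecane (24.5 g; 73.5 mmol)" is skipped on purpose: it is not introduced by "in", so it does not fit this particular template.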

dissolved in CHCl<sub>3</sub>. http://en.wikipedia.org/wiki/Chloroform … says:

The US National Toxicology Program’s eleventh report on carcinogens[20] implicates it as reasonably anticipated to be a human carcinogen, a designation equivalent to International Agency for Research on Cancer class 2A. It has been most readily associated with hepatocellular carcinoma.[21][22] Caution is mandated during its handling in order to minimize unnecessary exposure; safer alternatives, such as dichloromethane, have resulted in a substantial reduction of its use as a solvent.

So this is where we start to see the point of the GreenChainReaction… Let’s see what the ratio of use of chloroform to dichloromethane is over time.
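As a rough sketch of what that ratio calculation might look like once solvents have been extracted, the function below counts mentions per year. The input format of (year, formula) pairs and the function name are assumptions, and the sample values are made up purely to exercise the code; they are not real patent data.

```python
from collections import Counter


def chloroform_to_dcm_ratio(mentions):
    """Map each year to (CHCl3 mentions) / (CH2Cl2 mentions).

    mentions is an iterable of (year, solvent_formula) pairs.
    Years with no CH2Cl2 mentions map to None rather than dividing by zero.
    """
    counts = Counter(mentions)
    years = sorted({year for year, _ in counts})
    ratios = {}
    for year in years:
        dcm = counts[(year, "CH2Cl2")]
        chloroform = counts[(year, "CHCl3")]
        ratios[year] = chloroform / dcm if dcm else None
    return ratios


# Made-up mentions, NOT real patent data.
SAMPLE = [
    (2000, "CHCl3"), (2000, "CHCl3"), (2000, "CH2Cl2"),
    (2001, "CHCl3"), (2001, "CH2Cl2"), (2001, "CH2Cl2"),
    (2002, "CHCl3"),
]

ratios = chloroform_to_dcm_ratio(SAMPLE)
```

A falling ratio over the years would be the kind of signal we are hoping to see, though counting mentions says nothing about the volumes actually used.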

So if we can extract the solvents out of every patent reaction we can get an indication of the greenness. The next post will show how we do this.
