Typed into Arcturus
The first pass of the automatic extraction of chemical information from patents is going well on a mechanical level.
- One weekly index has 30-200 appropriate patents. Each patent has between 0 and 1500 images of chemical relevance.
- Each index therefore has ca 10,000 images, almost all of chemical compounds or general formulae or reactions.
- We use OSRA (Open Source, NIH) to interpret the images. It takes about 1-30 seconds per image, and the first index will complete in ca 24 hours. Ten years is roughly 520 weekly indexes at about a day each, so we could do this task for the last 10 years in ca 500 days of compute, distributed across many machines. I’d like to do that before #solo10. (I could do it all at Cambridge, but I’d rather it were citizen-science.)
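For anyone who wants to check the arithmetic behind that estimate, here is a rough back-of-the-envelope sketch in Python. The figures are the approximate ones quoted above; the function names and the assumption of 52 indexes per year are mine, purely for illustration, and not part of the actual pipeline.

```python
# Back-of-the-envelope estimate of OSRA processing time.
# All inputs are the rough figures from the post; names are illustrative only.

SECONDS_PER_DAY = 24 * 60 * 60


def index_days(images_per_index, seconds_per_image):
    """Days of single-machine compute needed for one weekly index."""
    return images_per_index * seconds_per_image / SECONDS_PER_DAY


def backlog_days(years, days_per_index, indexes_per_year=52):
    """Total machine-days for a backlog, assuming one index per week."""
    return years * indexes_per_year * days_per_index


# Bounds for one index of ca 10,000 images at 1 to 30 seconds per image.
fastest = index_days(10_000, 1)
slowest = index_days(10_000, 30)

# Average time per image implied by "ca 24 hours" for one index.
implied_seconds_per_image = SECONDS_PER_DAY / 10_000

# Ten years of weekly indexes at about one day each.
total = backlog_days(10, 1.0)

print(f"one index: {fastest:.2f} to {slowest:.2f} days")
print(f"implied average: {implied_seconds_per_image:.2f} s per image")
print(f"ten-year backlog: {total:.0f} machine-days")
```

The ten-year figure comes out slightly above 500 machine-days, so spread over, say, 50 volunteer machines it would be on the order of ten days each.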
So far the “record” is a patent with 1500 images. Here’s one image (EP_2050749A1/0026imgb0032.tif):
Could someone please tell me what the InChI or SMILES or CML is for this compound?
I am now working on the text-mining. More later today.