I got a request today offering help with content-mining (CM). Great! CM isn’t a single activity – ideally it’s a community of collaborating people and organizations combining resources. The first thing you can do is join, and post to, https://lists.okfn.org/mailman/listinfo/open-contentmining. Here’s what I replied:
I’m delighted to have had an offer of help with content-mining. The good news is:
*Everyone has a role to play in content-mining*
Here are some important areas – please submit others. There are lots of micro-tasks that everyone can become involved in.
==project==
* identifying a need
* coordinating a community effort
* summarising current practice (e.g. rights, barriers, resources)
* creating resources (e.g. corpora)
* running a project
==crawling==
* identifying sites to mine
* collecting bibliographic metadata (e.g. tables of content)
* agreeing web-friendly protocols (e.g. delay times)
* writing or finding crawlers
* creating or deploying crawl scripts
* managing workflow manually or automatically
* recording crawl logs
* saving crawled materials
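As an illustration of the "delay times" and "crawl log" items above, here is a minimal sketch in Python. It is not an existing content-mining tool: the function name `polite_crawl`, the `fetch` callable and the 5-second default delay are all assumptions made for the example, and a real project would agree its own values with the sites being mined.

```python
import time
from typing import Callable, Dict, List


def polite_crawl(
    urls: List[str],
    fetch: Callable[[str], bytes],
    delay_seconds: float = 5.0,  # assumed default; agree the real value per site
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, object]]:
    """Fetch each URL in turn, pausing between requests, and return a crawl log."""
    log: List[Dict[str, object]] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait between consecutive requests
        entry: Dict[str, object] = {"url": url, "started": time.time()}
        try:
            body = fetch(url)
            entry["status"] = "ok"
            entry["bytes"] = len(body)
        except Exception as error:  # record the failure and carry on
            entry["status"] = "error"
            entry["error"] = str(error)
        log.append(entry)
    return log
```

The `fetch` argument is deliberately left open: it could wrap `urllib.request.urlopen(url).read()`, or any other downloader the community prefers. Saving the crawled material itself is left out of the sketch.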
==document==
* formalising structure of document (e.g. sections)
* creating or finding vocabularies for annotation
==generic tools==
* crawlers
* PDF readers
* flat text readers
* graphics analyzers
* image analyzers
==databases==
* customization
==natural language==
* collection of NLP tools
* vocabularies
* corpora for training
* training
* testing
* domain tools
==graphics==
* reconstruction of diagrams from primitives
* SVG tools
==images==
* selection
* cropping
* binarisation
* edge detection/segments
* optical character recognition
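To give a feel for how small some of these image micro-tasks can be, here is a rough sketch of cropping and binarisation in plain Python. It assumes a greyscale image held as a list of rows of integers from 0 to 255, and a fixed threshold of 128; both are simplifications chosen for the example, and real tools would normally use an imaging library and an adaptive threshold.

```python
from typing import List

Image = List[List[int]]  # rows of greyscale values, 0 (black) to 255 (white)


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the rectangle of the given size whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def binarise(image: Image, threshold: int = 128) -> Image:
    """Map each pixel to 1 if it is at or above the threshold, otherwise 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]
```

For example, `binarise(crop(image, 0, 1, 2, 2))` would keep a 2x2 region starting at the second column of the first row and reduce it to 0s and 1s, ready for later steps such as edge detection or character recognition.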
==text==
* fonts
==tables==
* reconstruction
* interpretation
==audio==
==video==
==semantics==
* annotation
* links
==domain==
* maths
* chemistry
* geo
* dates
* units of measurement
==argumentation==
* document structure
* sentiment analysis
==documentation==
==sociopoliticololegal==
==community==
* mailing lists
* crowdcrafting