Topic Modeling
06 Jul 2015 (Under Construction)
In Week 2 of the Text Mining and Analytics MOOC, we examine topic modeling, which aims to take a collection of documents and determine what topics those documents are about. The ideas are presented incrementally, starting with mining a corpus of one document for a single topic, and ending with a description of the PLSA and LDA algorithms.
Topic models all have the same assumption at their core: a document covers \(k\) different topics in different amounts, given by the topic coverage \(\pi_{d,j}\), which is the proportion of topic \(j\) covered in document \(d\). Each topic \(\theta_j\) is itself a word distribution, in which each word in the vocabulary has a probability of being generated by that topic. The overall probability of generating a word, given the word distributions and the document's topic coverage, is
\[p(w\mid\lbrace\theta_j\rbrace,\lbrace\pi_{d,j}\rbrace) = \sum_{j=1}^k \pi_{d,j}p(w\mid\theta_j)\]
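To make this concrete, here is a minimal Python sketch of the mixture formula, using an invented four-word vocabulary, two toy topic word distributions, and made-up coverage values:

```python
import numpy as np

vocab = ["text", "mining", "the", "model"]

# theta[j][i] = p(w_i | theta_j): each row is one topic's word distribution.
theta = np.array([
    [0.4, 0.4, 0.1, 0.1],   # topic 0 (toy values)
    [0.1, 0.1, 0.3, 0.5],   # topic 1 (toy values)
])

# pi[j] = pi_{d,j}: how much of document d each topic covers.
pi = np.array([0.7, 0.3])

def p_word(word):
    """p(w | {theta_j}, {pi_{d,j}}) = sum_j pi_{d,j} * p(w | theta_j)."""
    i = vocab.index(word)
    return np.dot(pi, theta[:, i])

print(p_word("mining"))  # 0.7*0.4 + 0.3*0.1 = 0.31
```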
To get a better grasp of the concepts, it is instructive to first imagine mining one document for one topic (i.e. \(k = 1\)). That document can be represented as \(d = x_1 x_2 \ldots x_{\lvert d \rvert}\), where each word \(x_i\) comes from a vocabulary \(V = \lbrace w_1, \ldots, w_M \rbrace\).
We can then choose to assume that the document is generated from a single Unigram Language Model, in which words are generated independently. This means the joint probability of generating a sequence of words is simply the product of the parameters associated with each word.
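As a sketch of this single-document, single-topic case (with an invented toy document): the maximum-likelihood unigram model is just the normalized word counts, and the document's joint probability is a product over its words, computed here in log space to avoid underflow:

```python
from collections import Counter
import math

doc = "text mining is fun text mining is useful".split()  # toy document

counts = Counter(doc)
total = len(doc)
theta = {w: c / total for w, c in counts.items()}  # p(w | theta) = c(w, d) / |d|

# Joint log-probability of the document under this unigram model:
# log p(d | theta) = sum over words of log p(x_i | theta).
log_p = sum(math.log(theta[w]) for w in doc)
print(theta)
print(log_p)
```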
Two constraints are important to keep in mind: each document's topic coverages must sum to one, and each topic's word distribution must sum to one over the vocabulary. \[\forall d \in C,\ \sum_{j=1}^k \pi_{d,j} = 1\] \[\forall j \in [1,k],\ \sum_{i=1}^M p(w_i\mid\theta_j) = 1\]
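A quick numeric check of both constraints, on arbitrary toy arrays:

```python
import numpy as np

# pi[d][j] = pi_{d,j}: topic coverage, one row per document.
pi = np.array([[0.7, 0.3],    # document 0
               [0.2, 0.8]])   # document 1

# theta[j][i] = p(w_i | theta_j): word distribution, one row per topic.
theta = np.array([[0.4, 0.4, 0.1, 0.1],   # topic 0
                  [0.1, 0.1, 0.3, 0.5]])  # topic 1

assert np.allclose(pi.sum(axis=1), 1.0)     # sum_j pi_{d,j} = 1 for each d
assert np.allclose(theta.sum(axis=1), 1.0)  # sum_i p(w_i | theta_j) = 1 for each j
```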
If you have any questions or comments, please post them below.