# Topic Modeling

Under Construction

In Week 2 of the Text Mining and Analytics MOOC, we examine topic modeling, which takes a collection of documents $C$ and determines what topics those documents are about. The ideas are presented incrementally, starting with mining a corpus of one document for a single topic, and ending with a description of the PLSA and LDA algorithms.

Topic models all share the same core assumption: a document covers different topics in different proportions, given by the topic coverage $\pi_{d,j}$, the proportion of topic $j$ covered in document $d$. Each of the $k$ topics is represented by a word distribution $\theta_j$, wherein each word $w$ in the vocabulary has a probability $p(w\mid\theta_j)$ of being generated by that topic. The overall probability of generating a word, given the word distributions and a document's topic coverage, is

$p(w\mid\lbrace\theta_j\rbrace,\lbrace\pi_{d,j}\rbrace) = \sum_{j=1}^k \pi_{d,j}p(w\mid\theta_j)$
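A minimal Python sketch of this mixture computation (the names `word_probability`, `topics`, and `pi_d`, and the toy numbers, are illustrative, not from the course):

```python
def word_probability(word, topic_word_probs, coverage):
    """Mixture probability p(w | {theta_j}, {pi_{d,j}}) for one document.

    topic_word_probs: list of k dicts, each mapping word -> p(w | theta_j)
    coverage: list of k topic proportions pi_{d,j}, summing to 1
    """
    return sum(pi * theta.get(word, 0.0)
               for pi, theta in zip(coverage, topic_word_probs))

# Hypothetical example: k = 2 topics over a tiny vocabulary
topics = [{"data": 0.5, "mining": 0.5}, {"sports": 0.9, "data": 0.1}]
pi_d = [0.8, 0.2]
print(word_probability("data", topics, pi_d))  # 0.8*0.5 + 0.2*0.1 = 0.42
```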

To get a better grasp of the concepts, it is instructive to first imagine mining one document for one topic (i.e. $k=1$). That document can be represented as $d = (x_1, x_2, \ldots, x_{\mid d \mid})$, where $x_i \in \lbrace w_1, w_2, \ldots, w_M \rbrace$ and $M$ is the vocabulary size.

We can then choose to assume that documents are generated from a single Unigram Language Model, wherein words are generated independently: the joint probability of generating a sequence of words is simply the product of the probabilities of the individual words.
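For this single-document, single-topic case, the maximum likelihood estimate of the unigram model is just the normalized word counts, $p(w\mid\theta) = c(w,d)/\mid d \mid$. A minimal sketch, assuming whitespace tokenization:

```python
from collections import Counter

def mle_unigram(document):
    """MLE of a unigram LM: p(w | theta) = count(w, d) / |d|."""
    tokens = document.split()
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

theta = mle_unigram("text mining text analytics")
print(theta)  # {'text': 0.5, 'mining': 0.25, 'analytics': 0.25}
```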

It is important to note the following constraints:

$\forall d \in C, \sum_{j=1}^k \pi_{d,j} = 1$

$\forall j \in [1,k], \sum_{i=1}^M p(w_i\mid\theta_j) = 1$
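In code, these constraints amount to requiring that each document's coverage vector and each topic's word distribution sum to one. A small check with hypothetical parameters (the arrays here are made up for illustration):

```python
import numpy as np

pi_d = np.array([0.7, 0.3])          # topic coverage pi_{d,j} for one document d
theta = np.array([[0.5, 0.3, 0.2],   # word distribution for topic 1 (M = 3 words)
                  [0.1, 0.1, 0.8]])  # word distribution for topic 2

assert np.isclose(pi_d.sum(), 1.0)          # coverage sums to 1
assert np.allclose(theta.sum(axis=1), 1.0)  # each theta_j sums to 1
```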