Topic Modeling

06 Jul 2015

Under Construction

In Week 2 of the Text Mining and Analytics MOOC, we examine topic modeling, which aims to take a collection of documents $C$ and determine which topics these documents are about. The ideas are presented incrementally, starting with mining a corpus of one document for only one topic, and ending with a description of the PLSA and LDA algorithms.

Topic models all share the same core assumption: a document covers different topics in different amounts, given by the topic coverage $\pi_{d,j}$, which is the proportion of topic $j$ covered in document $d$. Each of the $k$ topics is represented by $\theta_j$, which is itself a word distribution, wherein each word $w$ in the vocabulary has a probability $p(w\mid\theta_j)$ of being generated by that topic. The overall probability of generating a word, given multiple word distributions and document topic coverages, is

\[p(w\mid\lbrace\theta_j\rbrace,\lbrace\pi_{d,j}\rbrace) = \sum_{j=1}^k \pi_{d,j}p(w\mid\theta_j)\]
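To make the mixture formula concrete, here is a minimal sketch in plain Python. The two topics, the four-word vocabulary, and the coverage values are made-up numbers for illustration only (they are not from the course), and `word_probability` is a hypothetical helper name.

```python
# Hypothetical toy model: k = 2 topics over a 4-word vocabulary.
# Each topic theta_j is a word distribution p(w | theta_j).
topics = [
    {"text": 0.4, "mining": 0.4, "food": 0.1, "health": 0.1},  # theta_1
    {"text": 0.1, "mining": 0.1, "food": 0.4, "health": 0.4},  # theta_2
]

# Assumed topic coverage pi_{d,j} for a single document d.
coverage = [0.75, 0.25]


def word_probability(word, coverage, topics):
    """Return sum_j pi_{d,j} * p(word | theta_j)."""
    return sum(pi * theta.get(word, 0.0) for pi, theta in zip(coverage, topics))


p_text = word_probability("text", coverage, topics)
print(p_text)  # 0.75 * 0.4 + 0.25 * 0.1
```

Because the document leans mostly on the first topic, words that are likely under $\theta_1$ end up more likely overall.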

To get a better grasp of the concepts, it is instructive to first imagine mining a single document for a single topic (i.e. $k=1$). That document can be represented as a sequence of words $d = (x_1, x_2, \ldots, x_{\mid d \mid})$, where each $x_i \in \lbrace w_1, w_2, \ldots, w_M \rbrace$ is a word from a vocabulary of size $M$.

We can then choose to assume that the document is generated from a single unigram language model, in which words are generated independently. This means the joint probability of generating a sequence of words is simply the product of the parameters (the word probabilities) associated with each word.
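The independence assumption can be sketched in a few lines of Python. The word distribution `theta` and the three-word document below are invented for illustration, and the function names are hypothetical. The log version is included because products of many small probabilities underflow quickly on real documents.

```python
import math

# Hypothetical unigram language model: p(w | theta) for a tiny vocabulary.
theta = {"text": 0.5, "mining": 0.3, "data": 0.2}

# A made-up document, represented as a sequence of words x_1, ..., x_|d|.
document = ["text", "mining", "text"]


def joint_probability(document, theta):
    """Product of p(x_i | theta) over all word positions in the document."""
    prob = 1.0
    for word in document:
        prob *= theta[word]
    return prob


def log_likelihood(document, theta):
    """Sum of log p(x_i | theta); numerically safer for long documents."""
    return sum(math.log(theta[word]) for word in document)


p_doc = joint_probability(document, theta)
log_p_doc = log_likelihood(document, theta)
print(p_doc, log_p_doc)
```

Note that word order does not matter under this model: any permutation of the same words gets the same probability.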

It is important to note the constraints that

\[\forall d \in C, \sum_{j=1}^k \pi_{d,j} = 1\]

\[\forall j \in [1,k], \sum_{i=1}^M p(w_i\mid\theta_j) = 1\]
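As a rough sanity check, the two constraints can be verified programmatically. This is only a sketch: `is_distribution` and `is_valid_model` are hypothetical helper names, and the numbers are invented.

```python
import math


def is_distribution(values):
    """True if the values are non-negative and sum to 1 (within tolerance)."""
    values = list(values)
    return all(v >= 0 for v in values) and math.isclose(sum(values), 1.0)


def is_valid_model(coverages, topics):
    """Check both constraints.

    coverages: one list of pi_{d,j} values per document d.
    topics: one dict {word: p(word | theta_j)} per topic j.
    """
    coverage_ok = all(is_distribution(pi_d) for pi_d in coverages)
    topics_ok = all(is_distribution(theta.values()) for theta in topics)
    return coverage_ok and topics_ok


# Made-up example: two documents, two topics.
example_coverages = [[0.75, 0.25], [0.5, 0.5]]
example_topics = [
    {"text": 0.6, "mining": 0.4},
    {"text": 0.2, "mining": 0.8},
]
valid = is_valid_model(example_coverages, example_topics)
print(valid)
```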

If you have any questions or comments, please post them below.
