Mining Word Associations

I am taking the Text Mining and Analytics MOOC. The first week, we learned about mining paradigmatic and syntagmatic relations.

Next, the idea of conditional entropy is introduced as a means of mining for syntagmatic relations (or co-occurring terms). Entropy measures the randomness of a random variable: \[H(X_w) = \sum_{v\in\{0,1\}} -p(X_w=v)\cdot\log_2[p(X_w=v)]\]

An entropy of 1, the maximum for a binary variable, means that the random variable \( X_w \) is hard to predict, while an entropy of 0 means that its value is certain (and hence very easy to predict). Here \( X_w \) is defined as 1 if the word \( w \) is in the document, and 0 if it is not.
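
As a rough illustration of the definition above, here is a minimal Python sketch that estimates \( H(X_w) \) from document counts. The toy corpus and the function name `word_entropy` are made up for this example, and p(X_w=1) is simply taken to be the fraction of documents containing the word.

```python
import math

def word_entropy(docs, word):
    """Entropy (in bits) of the binary variable X_w: 1 if `word` is in a document, else 0."""
    # Estimate p(X_w = 1) as the fraction of documents containing the word.
    p1 = sum(word in doc for doc in docs) / len(docs)
    h = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:  # by convention, 0 * log2(0) contributes 0
            h -= p * math.log2(p)
    return h

# Hypothetical toy corpus; each document is a set of lowercase tokens.
docs = [
    {"the", "cat", "eats", "fish"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

print(word_entropy(docs, "the"))  # 0.0: "the" is in every document, so X_w is certain
print(word_entropy(docs, "cat"))  # 1.0: "cat" is in exactly half of the documents
```

A word that appears in every document (or in none) has entropy 0, while a word that appears in half of them reaches the maximum of 1 bit.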

Conditional entropy has the same form, except the probabilities are conditioned on knowledge of one of the two words: \[H(X_{w_1}|X_{w_2}) = \sum_{u\in\{0,1\}}\lbrace p(X_{w_2}=u)\cdot H(X_{w_1}|X_{w_2}=u)\rbrace\] \[= \sum_{u\in\{0,1\}}\lbrace p(X_{w_2}=u)\cdot [\sum_{v\in\{0,1\}} -p(X_{w_1}=v|X_{w_2}=u)\cdot\log_2[p(X_{w_1}=v|X_{w_2}=u)]]\rbrace\]

If a pair of words has low conditional entropy, knowledge of one increases our ability to predict the other, and hence the words may have a semantic relation.
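
The conditional entropy formula above can be sketched in the same way. This is only an illustration under simple assumptions: the corpus is a made-up toy example, `cond_entropy` is a hypothetical helper name, and all probabilities are plain relative frequencies with no smoothing.

```python
import math

def cond_entropy(docs, w1, w2):
    """H(X_w1 | X_w2) in bits, with probabilities estimated from document counts."""
    n = len(docs)
    total = 0.0
    for u in (0, 1):
        # Documents in which X_w2 = u.
        subset = [d for d in docs if (w2 in d) == bool(u)]
        if not subset:
            continue  # p(X_w2 = u) = 0, so this term contributes nothing
        p_u = len(subset) / n
        # p(X_w1 = 1 | X_w2 = u), estimated within the subset.
        p_v1 = sum(w1 in d for d in subset) / len(subset)
        h_u = 0.0
        for p in (p_v1, 1.0 - p_v1):
            if p > 0:  # 0 * log2(0) is treated as 0
                h_u -= p * math.log2(p)
        total += p_u * h_u
    return total

# Hypothetical toy corpus; each document is a set of lowercase tokens.
docs = [
    {"the", "cat", "eats", "fish"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

print(cond_entropy(docs, "cat", "the"))   # 1.0: "the" tells us nothing about "cat"
print(cond_entropy(docs, "fish", "cat"))  # 0.5: knowing "cat" makes "fish" easier to predict
```

In this toy corpus, conditioning on an uninformative word such as "the" leaves the entropy of "cat" unchanged at 1 bit, whereas conditioning "fish" on "cat" lowers its entropy from about 0.81 bits to 0.5 bits.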

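The issue raised is that conditional entropies can only be compared when they share the same root word (the word being predicted), because \( H(X_{w_1}|X_{w_2}) \) is bounded above by \( H(X_{w_1}) \), and that bound differs from word to word. That is, \( H(X_{w_1}|X_{w_2}) \) and \( H(X_{w_1}|X_{w_3}) \) are comparable; \( H(X_{w_1}|X_{w_2}) \) and \( H(X_{w_2}|X_{w_3}) \) are not.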

This is where the idea of Mutual Information is introduced, which measures the reduction of entropy of the root word with knowledge of some second term. \[I(X_{w_1};X_{w_2})= H(X_{w_1}) - H(X_{w_1}|X_{w_2})\]

Mutual Information (MI) is symmetric, and can be compared across different term pairs to identify stronger semantic/syntagmatic relations. MI can also be written in terms of KL-divergence:

\[ I(X_{w_1};X_{w_2}) = \sum_{u\in\{0,1\}}\sum_{v\in\{0,1\}} p(X_{w_1}=v,X_{w_2}=u) \cdot \log_2 \frac{p(X_{w_1}=v,X_{w_2}=u)}{p(X_{w_1}=v)p(X_{w_2}=u)} \]

The ratio in this expression for MI compares the observed joint distribution of the two words to the distribution we would expect if they were independent.
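
To make the KL-divergence form concrete, here is a small Python sketch that computes MI directly from the joint distribution of the two presence indicators. As before, the corpus is a made-up toy example, `mutual_information` is a hypothetical name, and probabilities are unsmoothed relative frequencies, so treat it as an illustration rather than a reference implementation.

```python
import math

def mutual_information(docs, w1, w2):
    """I(X_w1; X_w2) in bits, from the joint distribution of the two presence indicators."""
    n = len(docs)
    # Joint counts indexed by (v, u) = (X_w1, X_w2).
    joint = {(v, u): 0 for v in (0, 1) for u in (0, 1)}
    for d in docs:
        joint[(int(w1 in d), int(w2 in d))] += 1
    mi = 0.0
    for (v, u), count in joint.items():
        if count == 0:
            continue  # a zero joint probability contributes 0 to the sum
        p_vu = count / n
        p_v = (joint[(v, 0)] + joint[(v, 1)]) / n  # marginal p(X_w1 = v)
        p_u = (joint[(0, u)] + joint[(1, u)]) / n  # marginal p(X_w2 = u)
        # Observed joint probability versus the product expected under independence.
        mi += p_vu * math.log2(p_vu / (p_v * p_u))
    return mi

# Hypothetical toy corpus; each document is a set of lowercase tokens.
docs = [
    {"the", "cat", "eats", "fish"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

print(mutual_information(docs, "cat", "the"))  # 0.0: the two indicators are independent
print(mutual_information(docs, "fish", "cat") == mutual_information(docs, "cat", "fish"))  # True
```

On this corpus, MI between "fish" and "cat" comes out to roughly 0.31 bits, which matches \( H(X_{w_1}) - H(X_{w_1}|X_{w_2}) \) for that pair. One caveat worth noting: with binary presence indicators, MI is also high for words that never appear together (such as "cat" and "dog" here), since it measures dependence in either direction.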

