Probabilistic latent semantic indexing (PLSI) is an approach to document retrieval that models the joint probability of words and documents as a mixture of multinomial distributions conditioned on latent semantic classes. The model rests on two independence assumptions. First, observation pairs of a document and a word are assumed to be generated independently of each other. Second, conditioned on the latent class, words are generated independently of the specific document identity. Since the number of classes is smaller than the number of documents, each class acts as a bottleneck variable in predicting the distribution of words conditioned on documents.
Technical details
Given a word $w$ and a document $d$, their joint probability distribution is modeled as
$$P(d,w) = P(d)P(w|d), \tag{1}$$
$$P(w|d) = \displaystyle\sum\_{z\in Z} P(w|z)P(z|d), \tag{2}$$
where $z$ denotes a latent class.
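As a quick sanity check of (1) and (2), here is a minimal sketch of how the class-conditional distributions combine into $P(w|d)$ and the joint $P(d,w)$. The toy numbers, matrix names, and dimensions are assumptions for illustration only.

```python
import numpy as np

# Toy setup (made-up numbers): D=2 documents, W=3 words, Z=2 latent classes.
p_w_z = np.array([[0.7, 0.2, 0.1],   # P(w|z=0)
                  [0.1, 0.3, 0.6]])  # P(w|z=1)
p_z_d = np.array([[0.9, 0.1],        # P(z|d=0)
                  [0.2, 0.8]])       # P(z|d=1)
p_d   = np.array([0.5, 0.5])         # P(d)

p_w_d = p_z_d @ p_w_z                # (2): P(w|d) = sum_z P(w|z) P(z|d)
p_dw  = p_d[:, None] * p_w_d         # (1): P(d,w) = P(d) P(w|d)
print(p_dw.sum())                    # sums to 1 over all (d, w) pairs
```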
Following the maximum likelihood principle, one determines the distributions in (1) and (2) by maximizing the log-likelihood function
$$\mathcal{L} = \displaystyle\sum\_{d \in D} \displaystyle\sum\_{w \in W} n(d,w) \log P(d,w)$$
The maximization is carried out with the Expectation-Maximization (EM) algorithm: the E-step computes the posterior $P(z|d,w)$, and the M-step re-estimates $P(w|z)$ and $P(z|d)$ from the resulting expected counts.
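Below is a minimal sketch of these EM iterations for the asymmetric parameterization in (1) and (2). The function name `plsi_em`, the count-matrix layout, the random initialization, and the fixed iteration count (rather than a likelihood-convergence test) are assumptions, not part of the original formulation.

```python
import numpy as np

def plsi_em(n_dw, n_topics, n_iters=100, seed=0):
    """Fit P(d,w) = P(d) * sum_z P(w|z) P(z|d) by EM.

    n_dw: (D, W) matrix of word counts n(d, w).
    Returns P(w|z) with shape (Z, W) and P(z|d) with shape (D, Z).
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # Random initialization; each conditional distribution is normalized to sum to 1.
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior P(z|d,w) proportional to P(w|z) P(z|d), shape (D, W, Z).
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12

        # M-step: re-estimate the conditionals from expected counts n(d,w) P(z|d,w).
        weighted = n_dw[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T                          # P(w|z), shape (Z, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)                            # P(z|d), shape (D, Z)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    # P(d) is proportional to the document lengths sum_w n(d,w) and is not updated by EM.
    return p_w_z, p_z_d
```

Each iteration is guaranteed not to decrease the log-likelihood $\mathcal{L}$, so in practice one would monitor $\mathcal{L}$ and stop when its improvement falls below a tolerance instead of running a fixed number of iterations.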
Results
![](http://1.bp.blogspot.com/-eSKjS0950ac/VUHFJ2Vv7fI/AAAAAAAAA4U/bMGoVNh5_O0/s1600/result.png)