Paper: Latent Dirichlet Allocation
Authors: David M Blei, Andrew Y Ng, Michael I Jordan
STA663: Yuan Gao, Kevin Liang
In natural language processing, Latent Dirichlet Allocation (LDA) is a widely used topic model proposed by David Blei, Andrew Ng, and Michael Jordan. It automatically discovers the topics that the documents in a corpus contain and explains similarities between documents. LDA is intriguing to us because it is a three-level hierarchical Bayesian model, and topic modeling is a classic problem in natural language processing.
In the following report, we first describe the mechanism of Latent Dirichlet Allocation. We then use two methods to implement LDA: Variational Inference and Collapsed Gibbs Sampling. Next, we try to optimize the performance of our implementation with Cython. Finally, we generate a test data set based on different topics and visualize the result of topic discovery.
LDA uses a generative model to explain how the observed words in the documents of a corpus are generated from latent variables. The following shows the graphical model representation of LDA:
The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents, N the number of words in a document, and V the size of the vocabulary of the corpus. We define the following terms: a word $w$ is an item from a vocabulary indexed by $\{1,\dots,V\}$; a document is a sequence of $N$ words; a corpus is a collection of $M$ documents. The parameters $\alpha$ and $\beta$ are corpus-level quantities, $\theta$ is a document-level variable, and $z_n$ and $w_n$ are word-level variables.
LDA assumes the following generative process for each document m in a corpus: (1) choose the document length $N \sim \text{Poisson}(\xi)$; (2) choose topic proportions $\theta \sim \text{Dir}(\alpha)$; (3) for each of the $N$ words, choose a topic $z_n \sim \text{Multinomial}(\theta)$ and then choose the word $w_n$ from $p(w_n|z_n,\beta)$, a multinomial probability conditioned on the topic $z_n$.
Note that the length of each document $N$ does not interact with any of the other variables and is thus just assumed to be known.
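To make the process concrete, the following minimal sketch simulates a single document; the values of K, V, $\alpha$, $\beta$, and the Poisson rate are arbitrary choices for illustration, not values from the paper.
import numpy as np

np.random.seed(0)
K, V = 3, 8                                  # assumed number of topics and vocabulary size
alpha = np.ones(K)                           # assumed symmetric Dirichlet prior over topics
beta = np.random.dirichlet(np.ones(V), K)    # assumed K x V topic-word distributions
N = max(1, np.random.poisson(10))            # document length
theta = np.random.dirichlet(alpha)           # per-document topic proportions
words = []
for n in range(N):
    z_n = np.random.choice(K, p=theta)       # choose a topic for word n
    w_n = np.random.choice(V, p=beta[z_n])   # choose a word from that topic's distribution
    words.append(w_n)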
The Dirichlet distribution is the multivariate generalization of the beta distribution; it is a distribution over discrete probability distributions. Dirichlet distributions are often used as conjugate priors for the categorical and multinomial distributions in Bayesian statistics.
A k-dimensional Dirichlet random variable $\theta$ can take values in the (k-1)-simplex (a k-vector $\theta$ lies in the (k-1)-simplex if $\theta_i \ge 0$ and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on the simplex:
$$p(\theta|\alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $z$, and a set of $N$ words $w$ is given by:
$$p(\theta, z, w|\alpha, \beta) = p(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \beta)$$
where $p(z_n|\theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:
$$p(w|\alpha, \beta) = \int p(\theta|\alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\, p(w_n|z_n, \beta) \right) d\theta$$
Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:
$$p(D|\alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d|\alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn}|\theta_d)\, p(w_{dn}|z_{dn}, \beta) \right) d\theta_d$$
Inferring the latent parameters is a problem of Bayesian inference. Next, we use Variational Inference and Gibbs Sampling to estimate them.
In the original paper, Blei, Ng and Jordan (2003) gave a variational inference approximation of the posterior distribution, because the exact posterior is intractable. Since then, Gibbs Sampling has also become a common way to infer the latent parameters of LDA. Here, we implement LDA with collapsed Gibbs Sampling. Gibbs Sampling is a Markov Chain Monte Carlo method in which the next state of the chain is reached by sampling each variable in turn from its full conditional distribution given all other variables and the data.
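As a generic illustration of the Gibbs mechanism (unrelated to LDA, and assuming a standard bivariate normal with correlation $\rho$), each variable is repeatedly drawn from its conditional distribution given the other:
import numpy as np

rho = 0.8                      # assumed correlation for this toy example
x, y = 0.0, 0.0
samples = []
for _ in range(5000):
    x = np.random.normal(rho * y, np.sqrt(1 - rho**2))   # draw x | y
    y = np.random.normal(rho * x, np.sqrt(1 - rho**2))   # draw y | x
    samples.append((x, y))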
Since the Dirichlet distribution is the conjugate prior of the multinomial distribution, the posteriors of $\theta_i$ and $\beta_i$ also follow Dirichlet distributions. Their posterior means are:
$$\theta_{i,k} = \frac{n_i^k + \alpha_k}{\sum_{k=1}^{K} (n_i^k + \alpha_k)}$$
$$\beta_{k,w} = \frac{n_w^k + \beta_w}{\sum_{w=1}^{V} (n_w^k + \beta_w)}$$
where $n_i^k$ is the number of words in document $i$ that have been assigned to topic $k$, and $n_w^k$ is the total number of times word $w$ is assigned to topic $k$ among all documents in the corpus.
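As a small worked sketch with made-up counts (K = 3 topics and a symmetric prior $\alpha_k = 0.5$, both assumptions for illustration), the posterior mean of $\theta_i$ follows directly from the counts:
import numpy as np

n_i = np.array([5, 2, 1])                          # hypothetical counts n_i^k for k = 1, 2, 3
alpha_k = 0.5                                      # assumed symmetric Dirichlet prior
theta_i = (n_i + alpha_k) / np.sum(n_i + alpha_k)
# theta_i is approximately [0.58, 0.26, 0.16]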
Clearly, the inference of $\theta$ and $\beta$ depends only on the assignment of each word to a topic, $z_i$; we can therefore focus on estimating $z_i$ alone. We define the following counts, all computed with the current word excluded: $n_{mz}$, the number of words in document $m$ assigned to topic $z$; $n_m$, the total number of words in document $m$; $n_{zw}$, the number of times word $w$ is assigned to topic $z$; and $n_z$, the total number of words assigned to topic $z$.
Then, the full conditional distribution of the topic assignment of word $i$ is:
$$p(z_i = j|z_{-i}, w) \propto \frac{n_{zw} + \phi}{n_z + V\phi} \cdot \frac{n_{mz} + \alpha}{n_m + K\alpha}$$
where $z_{-i}$ denotes the topic assignments of all words other than word $i$, and $\alpha$ and $\phi$ are the symmetric Dirichlet prior parameters on the document-topic and topic-word distributions, respectively. With this full conditional, we can implement LDA by Gibbs Sampling. We use the following notation:
document: $m = 1,...,M$
topic: $z = 1,...,K$
word position in a document: $n = 1,...,N_m$
vocabulary: $v = 1,...,V$
$z_{mw}$: topic assigned to word $w$ in document $m$
$\theta: M \times K$ matrix; $Multinomial(\theta_m)$ is the distribution over topics for document $m$
$\beta: K \times V$ matrix; $Multinomial(\beta_z)$ is the distribution over words for topic $z$
$n_m$, $n_{mz}$, $n_{zw}$, $n_z$: as defined above
import numpy as np
from numpy import sqrt,mean,square
from scipy.special import digamma, polygamma
def words_count_doc(corpus):
"""
Count the total number of words in each document in corpus.
Parameters
----------
corpus : a list-like structure, contains bag-of-words of each document
Returns
-------
n_m : a np.array, shape(M)
the total number of words in each document
"""
n_m = []
for i in range(len(corpus)):
n_m.append(np.sum(corpus[i], axis = 0)[1])
return np.array(n_m)
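For example, with a tiny hypothetical corpus in the same (word_id, count) format that gensim's doc2bow produces later in this report:
toy_corpus = [[(0, 2), (3, 1)], [(1, 4), (2, 1)]]   # two hypothetical documents
words_count_doc(toy_corpus)                          # array([3, 5])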
def empty_parameters(corpus, K, V):
"""
Initialize empty parameters n_mz, n_zw, n_z.
Parameters:
-----------
K : int, the number of topics
V : int, the number of vocabulary
Returns:
--------
z_mw : the topic of word w in document m
n_mz : the number of words from document m assigned to topic z
n_zw : the number of words assigned topic z
n_z : the total number of words assigned to topic z
"""
z_mw = []
n_mz = np.zeros((len(corpus), K))
n_zw = np.zeros((K, V))
n_z = np.zeros(K)
return z_mw, n_mz, n_zw, n_z
def initial_parameters(corpus, K, V):
"""
Initialize parameters for the corpus
Parameters:
-----------
corpus: a list-like structure, contains bag-of-words of each document
K : int, the number of topics
V : int, the size of the vocabulary
Returns:
--------
z_mw : the topic of word w in document m
n_mz : the number of words from document m assigned to topic z
n_zw : the number of words assigned topic z
n_z : the total number of words assigned to topic z
"""
z_mw, n_mz, n_zw, n_z = empty_parameters(corpus, K, V)
z_mw = []
for m, doc in enumerate(corpus):
z_n = []
for n, t in doc:
z = np.random.randint(0, K)
z_n.append(z)
n_mz[m, z] += t
n_zw[z, n] += t
n_z[z] += t
z_mw.append(np.array(z_n))
return z_mw, n_mz, n_zw, n_z
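Continuing the hypothetical toy corpus above (with an assumed K = 2 topics and V = 4 vocabulary terms), the count arrays can be initialized with random topic assignments:
z_mw, n_mz, n_zw, n_z = initial_parameters(toy_corpus, K=2, V=4)
# n_mz has shape (2, 2), n_zw has shape (2, 4), n_z has shape (2,);
# the random topic assignments themselves differ from run to run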
def gibbs_sampling(corpus, max_iter, K, V, n_zw, n_z, n_mz, n_m, z_mw, alpha, phi):
    beta_gibbs = []
    theta_gibbs = []
    np.random.seed(1337)
    for i in range(max_iter):
        if i % 1000 == 0:
            print(i)
        for m, doc in enumerate(corpus):
            for n, (w, t) in enumerate(doc):
                # exclude the current word from the counts
                z = z_mw[m][n]
                n_mz[m, z] -= t
                n_m[m] -= t
                n_zw[z, w] -= t
                n_z[z] -= t
                new_z = sample_topic(K, V, n_zw, n_z, n_mz, n_m, alpha, phi, w, m)
                # include the current word with its new topic
                z_mw[m][n] = new_z
                n_mz[m, new_z] += t
                n_zw[new_z, w] += t
                n_z[new_z] += t
                n_m[m] += t
        # update beta
        beta_gibbs.append(update_beta(V, n_zw, n_z, phi))
        # update theta
        theta_gibbs.append(update_theta(K, n_mz, n_m, alpha))
    return beta_gibbs, theta_gibbs

def sample_topic(K, V, n_zw, n_z, n_mz, n_m, alpha, phi, w, m):
    """
    Sample a new topic for the current word from its full conditional distribution.
    """
    p_z = np.zeros(K)
    for j in range(K):
        p_z[j] = ((n_zw[j, w] + phi)/(n_z[j] + V * phi)) * ((n_mz[m, j] + alpha)/(n_m[m] + K * alpha))
    new_z = np.random.multinomial(1, p_z/p_z.sum()).argmax()
    return new_z

def update_beta(V, n_zw, n_z, phi):
    """
    Update beta (topic-word distributions) from the current counts.
    """
    beta = (n_zw + phi)/(n_z[:, None] + V * phi)
    return beta

def update_theta(K, n_mz, n_m, alpha):
    """
    Update theta (document-topic distributions) from the current counts.
    """
    theta = (n_mz + alpha)/(n_m[:, None] + K * alpha)
    return theta
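As a usage sketch (the burn-in length is our own assumption, not something prescribed in the report), point estimates of $\beta$ and $\theta$ can be formed by averaging the per-iteration draws after discarding an initial burn-in period; this assumes the corpus and count arrays have been prepared with the initialization functions above.
beta_draws, theta_draws = gibbs_sampling(corpus, 1000, K, V, n_zw, n_z, n_mz, n_m, z_mw, alpha, phi)
burn_in = 500                                        # assumed burn-in length
beta_est = np.mean(beta_draws[burn_in:], axis=0)     # K x V point estimate
theta_est = np.mean(theta_draws[burn_in:], axis=0)   # M x K point estimate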
Latent Dirichlet Allocation was also implemented using variational inference. Variational inference is typically used when the posterior is intractable to compute directly. In the case of LDA, the posterior $p(\theta,z|w,\alpha,\beta)$ is difficult to compute, so it is instead approximated with the variational distribution:
$$q(\theta, z|\gamma, \phi) = q(\theta|\gamma) \prod_{n=1}^{N} q(z_n|\phi_n)$$
Using Jensen's inequality, it can be shown that the difference between the log likelihood of the data and the variational lower bound is the KL-divergence between the variational distribution and the true posterior. In other words:
$$\log p(w|\alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + D(q(\theta, z|\gamma, \phi)\,||\,p(\theta, z|w, \alpha, \beta))$$
We can choose to either minimize the KL-divergence or, equivalently, maximize the lower bound $L(\gamma, \phi; \alpha, \beta)$; here, the latter approach is taken. Factoring the bound appropriately, we can write the following:
$$L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta|\alpha)] + E_q[\log p(z|\theta)] + E_q[\log p(w|z, \beta)] - E_q[\log q(\theta)] - E_q[\log q(z)]$$
This bound, which we loosely refer to below as the log likelihood, is maximized through Expectation-Maximization (EM). During the expectation step, the variational parameters $\phi$ and $\gamma$ are optimized by maximizing the bound with respect to each in turn. During the maximization step, the bound is then maximized with respect to the model parameters $\alpha$ and $\beta$. This process is outlined below.
document: $m = 1,...,M$
topic: $z = 1,...,k$
word: $w = 1,...,N_m$
vocabulary : $v = 1,...,V$
$\alpha: 1 \times k$ Model parameter - Dirichlet prior over each document's topic proportions
$\beta: v \times k$ Model parameter - matrix of word probabilities for each topic (column $i$ is the word distribution of topic $i$)
$\phi: M \times N_m \times k$ Variational parameter - topic probabilities for each word in each document
$\gamma: M \times k$ Variational parameter - Dirichlet parameters over topics for each document
By taking the derivative of the log likelihood with respect to $\phi$ and setting the result to zero, we find the maximizing value of $\phi$:
$$\phi_{ni} \propto \beta_{iv} \exp\left(\Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right)\right)$$
where $\beta_{iv} = p(w_n^v = 1|z_n = i)$ and $\Psi$ is the digamma function (the derivative of the log gamma function $\log\Gamma$). As $\phi$ represents the probability of each latent topic for each word in a document, these values must be normalized so that each row, which corresponds to a word position within a document, sums to 1.
In a similar fashion, it can be shown that $\gamma$ is maximized at:
$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$$
## Optimize variational parameter phi
def opt_phi(beta, gamma, words, M, N, k):
    # Build phi as a list of N_m x k matrices, one per document
    phi = [np.zeros((N[m], k)) for m in range(M)]
    for m in range(M):
        for n in range(N[m]):
            for i in range(k):
                phi[m][n, i] = beta[int(words[m][n]), i] * np.exp(digamma(gamma[m, i]) - digamma(np.sum(gamma[m, :])))
            # Normalize across topics so phi represents a probability over topics for each word
            phi[m][n, :] = phi[m][n, :] / np.sum(phi[m][n, :])
    return phi
## Optimize variational parameter gamma
def opt_gamma(alpha, phi, M):
    gamma = np.tile(alpha, (M, 1)) + np.array(list(map(lambda x: np.sum(x, axis=0), phi)))
    return gamma
By taking the derivative of the log likelihood and applying the appropriate Lagrange multipliers to ensure that probabilities sum to 1, we find that $\beta$ is maximized at:
$$\beta_{ij} \propto \sum_{m=1}^{M} \sum_{n=1}^{N_m} \phi_{mni} w_{mn}^j$$
where $w_{mn}^j = 1$ if the $n^{th}$ word of document $m$ is equal to $j$, and 0 otherwise. Since the columns of $\beta$ represent the probability of each word given the topic of that particular column, they must be normalized to sum to 1.
Taking the derivative of the log likelihood with respect to $\alpha$ yields:
$$\frac{\partial L}{\partial \alpha_i} = M\left(\Psi\left(\sum_{j=1}^{k} \alpha_j\right) - \Psi(\alpha_i)\right) + \sum_{m=1}^{M}\left(\Psi(\gamma_{mi}) - \Psi\left(\sum_{j=1}^{k} \gamma_{mj}\right)\right)$$
Because it is difficult to find the zero of this derivative analytically, $\alpha$ is instead maximized numerically with the Newton-Raphson method. The Hessian has the form:
$$\frac{\partial^2 L}{\partial \alpha_i \partial \alpha_j} = M\left(\Psi'\left(\sum_{j=1}^{k} \alpha_j\right) - \delta(i, j)\, \Psi'(\alpha_i)\right)$$
Note: This is slightly different from what is stated in the paper, which has a couple of errors in the reported form of the Hessian.
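Because the Hessian is a diagonal matrix plus a constant, the Newton step $H^{-1}g$ can be computed in linear time; this is the matrix-inversion identity used by the est_alpha code below:
$$(H^{-1}g)_i = \frac{g_i - c}{h_i}, \qquad c = \frac{\sum_{j=1}^{k} g_j/h_j}{z^{-1} + \sum_{j=1}^{k} h_j^{-1}}$$
where $h_i = -M\Psi'(\alpha_i)$ is the diagonal component and $z = M\Psi'(\sum_{j=1}^{k}\alpha_j)$ is the constant component.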
## Optimize beta
def est_beta(phi, words, k, V):
    beta = np.zeros((V, k))
    for j in range(V):
        # Construct the indicator w_mn == j with the same shape as each phi[m]
        w_mnj = [np.tile((word == j), (k, 1)).T for word in words]
        beta[j, :] = np.sum([np.sum(p * w, axis=0) for p, w in zip(phi, w_mnj)], axis=0)
    # Normalize across words so each column of beta is a probability distribution over the vocabulary
    for i in range(k):
        beta[:, i] = beta[:, i] / np.sum(beta[:, i])
    return beta
## Optimize alpha
# (Newton-Raphson method, for a Hessian with special structure)
def est_alpha(alpha, gamma, M, k, nr_max_iters=1000, tol=10**-2.0):
    for it in range(nr_max_iters):
        alpha_old = alpha
        # Calculate gradient
        g = M*(digamma(np.sum(alpha)) - digamma(alpha)) + np.sum(digamma(gamma) - np.tile(digamma(np.sum(gamma, axis=1)), (k, 1)).T, axis=0)
        # Calculate Hessian diagonal component
        h = -M*polygamma(1, alpha)
        # Calculate Hessian constant component
        z = M*polygamma(1, np.sum(alpha))
        # Calculate constant
        c = np.sum(g/h)/(z**(-1.0) + np.sum(h**(-1.0)))
        # Update alpha
        alpha = alpha - (g - c)/h
        # Check convergence
        if sqrt(mean(square(alpha - alpha_old))) < tol:
            break
    return alpha
To optimize the performance of Gibbs Sampling, in this section we use Cython to speed up the algorithm.
%load_ext cython
%%cython
import numpy as np
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def gibbs_sampling_cy(corpus, int max_iter, int K, int V, n_zw, n_z, n_mz, n_m, z_mw, double alpha, double phi):
    cdef int i, m, n, w, t
    np.random.seed(1337)

    def sample_topic(int K, int V, n_zw, n_z, n_mz, n_m, double alpha, double phi, int w, int m):
        """
        Sample a new topic for the current word from its full conditional distribution.
        """
        p_z = np.zeros(K)
        cdef int j
        for j in range(K):
            p_z[j] = ((n_zw[j, w] + phi)/(n_z[j] + V * phi)) * ((n_mz[m, j] + alpha)/(n_m[m] + K * alpha))
        new_z = np.random.multinomial(1, p_z/p_z.sum()).argmax()
        return new_z

    def update_beta(int V, n_zw, n_z, double phi):
        """
        Update beta (topic-word distributions) from the current counts.
        """
        beta = (n_zw + phi)/(n_z[:, None] + V * phi)
        return beta

    def update_theta(int K, n_mz, n_m, double alpha):
        """
        Update theta (document-topic distributions) from the current counts.
        """
        theta = (n_mz + alpha)/(n_m[:, None] + K * alpha)
        return theta

    beta_gibbs = []
    theta_gibbs = []
    for i in range(max_iter):
        for m, doc in enumerate(corpus):
            for n, (w, t) in enumerate(doc):
                # exclude the current word from the counts
                z = z_mw[m][n]
                n_mz[m, z] -= t
                n_m[m] -= t
                n_zw[z, w] -= t
                n_z[z] -= t
                new_z = sample_topic(K, V, n_zw, n_z, n_mz, n_m, alpha, phi, w, m)
                # include the current word with its new topic
                z_mw[m][n] = new_z
                n_mz[m, new_z] += t
                n_zw[new_z, w] += t
                n_z[new_z] += t
                n_m[m] += t
        # update beta
        beta_gibbs.append(update_beta(V, n_zw, n_z, phi))
        # update theta
        theta_gibbs.append(update_theta(K, n_mz, n_m, alpha))
    return beta_gibbs, theta_gibbs
To test whether our implementation of Latent Dirichlet Allocation with variational inference works, we first generate some toy data. This toy data set consists of 300 documents, each with a uniform random length between 150 and 200 words. The vocabulary size is set to 30, and the documents are assumed to be generated from 10 topics.
np.random.seed(1337)
M = 300
k = 10
N = np.random.randint(150,200,size=M)
V = 30
The documents are then generated one by one according to the LDA model (see the algorithm description in Section 2). Three distinct groups of documents are generated: the first 100 have a strong preference for topics 1, 2, and 3; the second 100 for topics 4, 5, and 6; and the last 100 for topics 7, 8, 9, and 10. Furthermore, each topic has a strong preference for 3 words, such that each word is prevalent in one topic. The structure of the resulting parameters is shown in Figures 1 and 2 below.
# Create 3 groups of documents, each with a topic preference
alpha_gen1 = np.array((20,15,10,1,1,1,1,1,1,1))
alpha_gen2 = np.array((1,1,1,10,15,20,1,1,1,1))
alpha_gen3 = np.array((1,1,1,1,1,1,10,12,15,18))
# Arbitrarily choose each topic to have 3 very common words
beta_probs = np.ones((V,k)) + np.array([np.arange(V)%k==i for i in range(k)]).T*19
beta_gen = np.array(list(map(lambda x: np.random.dirichlet(x),beta_probs.T))).T
w_struct = list()
theta = np.empty((M,k))
# Generate each document
for m in range(M):
    # Draw topic distribution for the document
    if m < M/3:
        theta[m, :] = np.random.dirichlet(alpha_gen1, 1)[0]
    elif m < 2*M/3:
        theta[m, :] = np.random.dirichlet(alpha_gen2, 1)[0]
    else:
        theta[m, :] = np.random.dirichlet(alpha_gen3, 1)[0]
    doc = np.array([])
    for n in range(N[m]):
        # Draw topic according to the document's topic distribution
        z_n = np.random.choice(np.arange(k), p=theta[m, :])
        # Draw word according to the topic
        w_n = np.random.choice(np.arange(V), p=beta_gen[:, z_n])
        doc = np.append(doc, w_n)
    w_struct.append(doc)
The model and variational parameters are then randomly initialized to reasonable values:
alpha = 100*np.random.dirichlet(10*np.ones(k),1)[0]
beta = np.random.dirichlet(np.ones(V),k).T
phi = [np.ones((N[m], k))/k for m in range(M)]  # one N_m x k matrix per document (a list, since document lengths differ)
gamma = np.tile(alpha,(M,1)) + np.tile(N/k,(k,1)).T
The variational parameter $\gamma$ contains the approximate topic weights of every document and thus indicates to which group a document is likely to belong. The convergence criterion was therefore chosen to monitor this parameter: the root-mean-square of the change in $\gamma$ is calculated on every iteration of EM and compared against a tolerance parameter.
def converged(gamma, gamma_old, convergence):
    #print(sqrt(mean(square(gamma-gamma_old))))
    return sqrt(mean(square(gamma - gamma_old))) < convergence
Expectation-Maximization is carried out by consecutively maximizing the log likelihood with respect to each of the four parameters $\alpha$, $\beta$, $\phi$ and $\gamma$ until either the convergence criterion has been met or a maximum number of iterations has been reached.
convergence = 5*10**(-2.0)
successfully_Converged = False
max_iters = 10**3
for iters in range(max_iters):
    #print(iters)
    gamma_old = gamma
    ## Expectation step: Update variational parameters
    phi = opt_phi(beta, gamma, w_struct, M, N, k)
    gamma = opt_gamma(alpha, phi, M)
    ## Maximization step: Update model parameters
    beta = est_beta(phi, w_struct, k, V)
    alpha = est_alpha(alpha, gamma, M, k)
    if converged(gamma, gamma_old, convergence):
        successfully_Converged = True
        break
After running until the root-mean-square of the change in $\gamma$ dropped below 0.05, the algorithm was considered converged and terminated. The results are visualized in Figures 3 and 4 below. Since the model parameter $\beta$ estimated by the algorithm should correspond with the $\beta$ used to generate the data, and the inferred variational parameter $\gamma$ should correspond to $\theta$, Figures 3 and 4 should resemble Figures 1 and 2 respectively.
At first glance, Figures 1 and 2 do not appear to match Figures 3 and 4. However, the individual topic identities have no specific relation to their indices; in other words, there is a non-identifiability issue at play here. It is nevertheless apparent in Figure 4 that the algorithm correctly identifies 3 separate groups of documents of the right size. By visually inspecting $\gamma$ and re-arranging the order of the topics, we get Figures 5 and 6 below:
While not perfect, the visualizations of $\beta$ and $\gamma$ with the topics re-arranged now exhibit patterns similar to the original structures visible in Figures 1 and 2. In particular, the three diagonals in $\beta$ representing the three preferred words of each topic can be clearly seen in Figure 5, and the three boxes corresponding to the preferred topic distributions of the three groups of documents are also apparent in Figure 6. With more data (either more documents or more words in each document), these structures are likely to be recovered with even higher accuracy.
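One simple way to automate this re-ordering (an alternative to the visual inspection we used, and assuming each estimated topic matches a distinct generating topic) is to align each generating topic with the estimated topic whose word distribution is closest to it:
# For each generating topic j, find the estimated topic closest in Euclidean distance
order = [int(np.argmin(np.linalg.norm(beta - beta_gen[:, [j]], axis=0))) for j in range(k)]
beta_reordered = beta[:, order]
gamma_reordered = gamma[:, order]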
To test Gibbs Sampling and compare the performance of the Cython version, we use the NLTK, stop_words, and gensim packages in Python to build and clean a small toy corpus. The cleaning steps are: tokenization, removal of English stop words, stemming with the Porter stemmer, and conversion to a bag-of-words representation.
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Batman became popular soon after his introduction and gained his own comic book title, Batman, in 1940."
doc_b = "In 1971, Trump moved to Manhattan, where he became involved in larger construction projects, and used attractive architectural design to win public recognition."
doc_c = "Batman is, in his everyday identity, Bruce Wayne, a wealthy American business magnate living in Gotham City."
doc_d = "In 2001, Trump completed Trump World Tower, a 72-story residential tower across from the United Nations Headquarters."
doc_e = " Unlike most superheroes, Batman does not possess any superpowers; rather, he relies on his genius intellect, physical prowess, martial arts abilities, detective skills, science and technology, vast wealth, intimidation, and indomitable will. "
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [tok for tok in tokens if tok not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(tok) for tok in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
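Each entry of corpus is then a list of (token_id, count) pairs; the exact ids depend on the dictionary built above, so the values shown in the comments are only illustrative:
print(dictionary.token2id)   # mapping from stemmed token to integer id (dictionary-dependent)
print(corpus[0][:5])         # e.g. [(0, 1), (1, 1), ...] (illustrative)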
In the corpus, there are 2 topics: Batman and Donald Trump, and the vocabulary contains 69 words. To fit the LDA model to the corpus, we set the priors $\alpha$ and $\phi$ to 0.5, and run the Gibbs Sampler for 1000 iterations.
# corpus : bag-of-words representation of each document
# K : number of topics
# V : vocabulary size
K = 2
V = 69
alpha = 0.5
phi = 0.5
max_iter = 1000
# initialize parameters
n_m = words_count_doc(corpus)
z_mw, n_mz, n_zw, n_z = initial_parameters(corpus, K, V)
Timing functions
import time
def timer(f, *args, **kwargs):
    start = time.perf_counter()
    ans = f(*args, **kwargs)
    return ans, time.perf_counter() - start

def report(fs, *args, **kwargs):
    # Time each function and report its speed relative to the first one
    ans, t = timer(fs[0], *args, **kwargs)
    print('%s: %.1f' % (fs[0].__name__, 1.0))
    for f in fs[1:]:
        ans_, t_ = timer(f, *args, **kwargs)
        print('%s: %.1f' % (f.__name__, t/t_))
report([gibbs_sampling, gibbs_sampling_cy], corpus, max_iter, K, V, n_zw, n_z, n_mz, n_m, z_mw, alpha, phi)
After running Gibbs Sampling for 1000 iterations with both the original version and the Cython version and timing them, we find that Cython compilation improved the performance of Gibbs Sampling by about 20%.
We implemented Latent Dirichlet Allocation in Python via Gibbs Sampling and Variational Inference. Using the Dirichlet distribution as a prior, the goal was to discover latent topics among documents in a corpus. We simulated a data set according to the generative process in Blei et al.'s paper, then used LDA to estimate the latent variables and compared them with the true latent variables of the generating process. As the visualizations show, the estimated values were quite close to the true values. We also improved the performance of our Gibbs Sampling method with Cython, which sped up the algorithm by roughly 20%. In the future, we can further optimize our algorithm, as we did not tailor the data structures to the Cython implementation; once we do, the performance of the algorithm will likely improve substantially.
[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
[2] Griffiths, Tom. "Gibbs sampling in the generative model of Latent Dirichlet Allocation." (2002).