1. Algorithm Derivation

1.1 E-step

2. Algorithm Implementation
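import numpy as np
from collections import Counter
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# `documents` (list of raw text strings) and `stopwords` (collection of words
# to drop) are assumed to be defined earlier.
words = set()
word_counts = []
for document in documents:
    seglist = word_tokenize(document)
    wordlist = []
    for word in seglist:
        synsets = wordnet.synsets(word)
        if synsets:
            # Replace the word with the first lemma of its first WordNet synset.
            # The stopword check is applied to this lemma, not to the original
            # word, so e.g. "at" -> "astatine" is not filtered out.
            syn_word = synsets[0].lemmas()[0].name()
            if syn_word not in stopwords:
                wordlist.append(syn_word)
        else:
            if word not in stopwords:
                wordlist.append(word)
    words = words.union(wordlist)
    word_counts.append(Counter(wordlist))

word2id = {word: idx for idx, word in enumerate(words)}
id2word = dict(enumerate(words))
N = len(documents)  # number of documents
M = len(words)      # number of words
X = np.zeros((N, M))  # document-word count matrix
for i in range(N):
    for key, count in word_counts[i].items():
        X[i, word2id[key]] = count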
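To make the shape of X concrete, here is a minimal sketch of the same counting logic on documents that are already tokenized, so it runs without NLTK. The helper name `build_count_matrix` and the toy tokens are illustrative assumptions, and the vocabulary is sorted only to make the column order reproducible.

```python
from collections import Counter

import numpy as np


def build_count_matrix(tokenized_docs):
    """Build an N x M document-word count matrix from lists of tokens."""
    # Sorting is an assumption made here for a reproducible column order.
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    word2id = {word: idx for idx, word in enumerate(vocab)}
    X = np.zeros((len(tokenized_docs), len(vocab)))
    for i, doc in enumerate(tokenized_docs):
        for word, count in Counter(doc).items():
            X[i, word2id[word]] = count
    return X, vocab


# Hypothetical toy input: two documents, three distinct words.
toy_docs = [["topic", "model", "topic"], ["model", "word"]]
X_toy, vocab_toy = build_count_matrix(toy_docs)
```

Each row of `X_toy` is one document and each column one vocabulary word, which is the layout the E-step and M-step functions expect.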
def E_step(lam, theta):
    # lam: N * K, theta: K * M, p: K * N * M
    N = lam.shape[0]
    M = theta.shape[1]
    lam_reshaped = np.tile(lam, (M, 1, 1)).transpose((2, 1, 0))      # K * N * M
    theta_reshaped = np.tile(theta, (N, 1, 1)).transpose((1, 0, 2))  # K * N * M
    temp = lam @ theta  # N * M, mixture probability of each word in each document
    p = lam_reshaped * theta_reshaped / temp  # posterior of each topic
    return p
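The E-step computes the posterior p[k, i, j] = lam[i, k] * theta[k, j] / sum over k' of lam[i, k'] * theta[k', j]. As a sketch, the same quantity can be obtained with NumPy broadcasting instead of np.tile. The function name `e_step_broadcast` and the random test inputs are assumptions for illustration, not part of the original project.

```python
import numpy as np


def e_step_broadcast(lam, theta):
    """Posterior over topics, shape K x N x M, computed via broadcasting."""
    # lam.T[:, :, None] is K x N x 1 and theta[:, None, :] is K x 1 x M.
    joint = lam.T[:, :, np.newaxis] * theta[:, np.newaxis, :]  # K x N x M
    # lam @ theta is N x M and broadcasts over the leading topic axis.
    return joint / (lam @ theta)


# Hypothetical inputs: 3 documents, 2 topics, 5 words, rows normalized to 1.
rng = np.random.default_rng(0)
lam_demo = rng.random((3, 2))
lam_demo /= lam_demo.sum(axis=1, keepdims=True)
theta_demo = rng.random((2, 5))
theta_demo /= theta_demo.sum(axis=1, keepdims=True)
p_demo = e_step_broadcast(lam_demo, theta_demo)
```

For every document-word pair the posteriors sum to 1 over the topic axis, which is a quick sanity check for either version.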
def M_step(p, X):
    # p: K * N * M, X: N * M, lam: N * K, theta: K * M
    # update lam
    lam = np.sum(p * X, axis=2)       # K * N
    lam = lam / np.sum(lam, axis=0)   # normalization for each column
    lam = lam.transpose((1, 0))       # N * K
    # update theta
    theta = np.sum(p * X, axis=1)     # K * M
    theta = theta / np.sum(theta, axis=1)[:, np.newaxis]  # normalization for each row
    return lam, theta
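Alternating the E-step and the M-step gives the EM loop. The driver below is a minimal sketch of how the two steps might be wired together. The names `fit_plsa`, `n_iter`, `tol`, `seed`, and `eps`, the random initialization, the stopping rule, and the toy matrix are all assumptions rather than the original project's code. The `eps` clamp is also an addition: it avoids 0/0 and log(0) when a mixture probability underflows.

```python
import numpy as np


def e_step(lam, theta, eps=1e-300):
    joint = lam.T[:, :, None] * theta[:, None, :]   # K x N x M
    return joint / np.maximum(lam @ theta, eps)     # clamp is an assumption


def m_step(p, X):
    weighted = p * X                                 # K x N x M
    lam = weighted.sum(axis=2)                       # K x N
    lam = (lam / lam.sum(axis=0)).T                  # N x K, rows sum to 1
    theta = weighted.sum(axis=1)                     # K x M
    theta = theta / theta.sum(axis=1, keepdims=True) # rows sum to 1
    return lam, theta


def log_likelihood(X, lam, theta, eps=1e-300):
    return np.sum(X * np.log(np.maximum(lam @ theta, eps)))


def fit_plsa(X, K, n_iter=50, tol=1e-6, seed=0):
    """Run EM from a random start; return lam, theta, log-likelihood history."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    lam = rng.random((N, K))
    lam /= lam.sum(axis=1, keepdims=True)
    theta = rng.random((K, M))
    theta /= theta.sum(axis=1, keepdims=True)
    history = [log_likelihood(X, lam, theta)]
    for _ in range(n_iter):
        p = e_step(lam, theta)
        lam, theta = m_step(p, X)
        history.append(log_likelihood(X, lam, theta))
        if history[-1] - history[-2] < tol:  # stop once improvement is tiny
            break
    return lam, theta, history


# Hypothetical 4-document, 4-word count matrix with two rough word groups.
X_em = np.array([[3.0, 2.0, 1.0, 0.0],
                 [2.0, 3.0, 0.0, 1.0],
                 [1.0, 0.0, 4.0, 2.0],
                 [0.0, 1.0, 2.0, 4.0]])
lam_em, theta_em, history_em = fit_plsa(X_em, K=2)
```

Since EM never decreases the likelihood, `history_em` should be non-decreasing up to rounding error, which makes it a convenient check that both steps are implemented consistently.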
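def LogLikelihood(p, X, lam, theta):
    # p: K * N * M, X: N * M, lam: N * K, theta: K * M
    # p is not used here; the log-likelihood depends only on X, lam and theta.
    res = np.sum(X * np.log(lam @ theta))  # sum over the N * M entries, scalar
    return res

3. Result Analysis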
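The topic distribution of each document is saved in DocTopicDistribution.csv, and the word distribution of each topic is saved in TopicWordDistribution.csv. The 9 most probable words of each topic are saved in topics.txt, as shown in the figure below. The most probable words of the four topics are astatine, network, Associate_in_Nursing, and algorithm, which correspond to physics, computer science, statistics, and mathematics, respectively. This supports the effectiveness of the PLSA method. Note that astatine and Associate_in_Nursing are most likely the WordNet lemmas that the preprocessing step substitutes for the tokens "at" and "an", so they are not topic-specific terms in themselves.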
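As a sketch of how the top words per topic can be read off the topic-word matrix, the snippet below sorts each row of theta and maps the leading column indices back to words. The function name `top_words`, the toy matrix, and the toy vocabulary are illustrative assumptions. How the original project writes topics.txt is not shown, so no file output is attempted here.

```python
import numpy as np


def top_words(theta, id2word, n_top=9):
    """For each topic (row of theta), return its n_top most probable words."""
    # Negate so that argsort orders each row from largest to smallest.
    order = np.argsort(-theta, axis=1)[:, :n_top]
    return [[id2word[int(j)] for j in row] for row in order]


# Hypothetical 2-topic, 4-word distribution; each row sums to 1.
theta_top = np.array([[0.50, 0.30, 0.10, 0.10],
                      [0.05, 0.15, 0.20, 0.60]])
id2word_top = {0: "network", 1: "algorithm", 2: "matrix", 3: "sample"}
topics_top = top_words(theta_top, id2word_top, n_top=2)
```

With the real model, `theta` and `id2word` from the implementation section would take the place of the toy inputs, with `n_top=9` to match the description above.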
This project is open-sourced as kungfu-crab/PLSA: A python implementation for PLSA(Probabilistic Latent Semantic Analysis) using EM algorithm. It is intended for study and discussion only; reposting and plagiarism are prohibited.
[1] Hofmann, T. (1999). Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 289-296). Morgan Kaufmann Publishers Inc.