Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, while "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).

Generative models (i.e., the statistical models used for topic modelling):

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
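
As a rough illustration of what "generative" means here, the sketch below samples documents in the LDA style: draw a topic mixture for the document, then for each word pick a topic and draw a word from it. This is a toy under stated assumptions, not a fitted model or any library's implementation. The `TOPICS` table, its probabilities, and the function names are all hypothetical, chosen to mirror the dog/cat example above.

```python
import random

# Hypothetical topic-word probabilities, hand-picked for illustration only.
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.3, "walk": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "nap": 0.2},
}

def sample_topic_mixture(alpha, rng):
    """Draw per-document topic proportions from a symmetric Dirichlet(alpha)."""
    # Normalized independent Gamma(alpha, 1) draws follow a Dirichlet distribution.
    draws = {name: rng.gammavariate(alpha, 1.0) for name in TOPICS}
    total = sum(draws.values())
    return {name: value / total for name, value in draws.items()}

def sample_document(mixture, n_words, rng):
    """For each word: pick a topic from the mixture, then a word from that topic."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=weights)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

rng = random.Random(0)

# A document whose topic mixture is itself random, as in LDA.
mixture = sample_topic_mixture(alpha=0.5, rng=rng)
random_doc = sample_document(mixture, 20, rng)

# A document that is mostly about dogs (mixture fixed by hand for clarity).
dog_doc = sample_document({"dogs": 0.9, "cats": 0.1}, 200, rng)
dog_share = sum(word in TOPICS["dogs"] for word in dog_doc) / len(dog_doc)
print(mixture, dog_share)
```

Fitting a topic model runs this process in reverse: given only the words, it infers plausible topic-word tables and per-document mixtures.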

858 questions
42
votes
6 answers

Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix: corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) corpus <-…
Bill M
  • 661
  • 1
  • 6
  • 8
39
votes
2 answers

LDA topic modeling - Training and testing

I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can…
tan
  • 1,419
  • 3
  • 14
  • 29
32
votes
2 answers

Simple Python implementation of collaborative topic modeling?

I came across these 2 papers which combined collaborative filtering (Matrix factorization) and Topic modelling (LDA) to recommend users similar articles/posts based on topic terms of post/articles that users are interested in. The papers (in PDF)…
jxn
  • 6,325
  • 21
  • 75
  • 140
29
votes
5 answers

Understanding LDA implementation using gensim

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following: Define the dataset documents = ["Apple is releasing a new product", "Amazon sells many things", …
visakh
  • 2,333
  • 6
  • 25
  • 50
25
votes
10 answers

How to print the LDA topics models from gensim? Python

Using gensim I was able to extract topics from a set of documents in LSA, but how do I access the topics generated from the LDA models? When printing lda.print_topics(10), the code gave the following error because print_topics() returns a…
alvas
  • 94,813
  • 90
  • 365
  • 641
24
votes
2 answers

What's the disadvantage of LDA for short texts?

I am trying to understand why Latent Dirichlet Allocation (LDA) performs poorly in short text environments like Twitter. I've read the paper 'A biterm topic model for short text'; however, I still do not understand "the sparsity of word…
Shuguang Zhu
  • 243
  • 1
  • 2
  • 5
23
votes
2 answers

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics. I have divided my corpus into ten batches and set aside one batch for a holdout…
user37874
  • 345
  • 1
  • 3
  • 11
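
For readers comparing the two metrics named in the question above, perplexity is usually defined as a transform of the held-out log-likelihood: exp(-log-likelihood / number of held-out tokens), so lower is better. The sketch below assumes natural-log likelihoods. The `held_out` values and `n_tokens` are made-up placeholders, not results from any model, and `perplexity` is a hypothetical helper rather than a library function.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log_likelihood / n_tokens); lower is better."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)

# Placeholder held-out log-likelihoods per candidate topic count (not real results).
held_out = {10: -7200.0, 20: -6900.0, 30: -7000.0}
n_tokens = 1000  # placeholder count of tokens in the held-out fold

scores = {k: perplexity(ll, n_tokens) for k, ll in held_out.items()}
best_k = min(scores, key=scores.get)  # topic count with the lowest perplexity
print(scores, best_k)
```

Because perplexity is a monotone function of the per-token log-likelihood, both metrics rank candidate topic counts the same way on a given held-out fold. In cross-validation, the score for each candidate would be averaged over the folds before comparing.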
20
votes
2 answers

Gensim: KeyError: "word not in vocabulary"

I have a trained Word2vec model using Python's Gensim library. I have a tokenized list as below. The vocab size is 34, but I am giving just a few of the 34: b = ['let', 'know', 'buy', 'someth', 'featur', 'mashabl', 'might', 'earn', 'affili', …
Krishnang K Dalal
  • 1,671
  • 7
  • 24
  • 40
20
votes
5 answers

Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third,…
emiguevara
  • 1,299
  • 11
  • 25
19
votes
3 answers

Using Word2Vec for topic modeling

I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA). However, I am interested whether it is a good idea to try out topic modeling with Word2Vec as it clusters…
user1814735
  • 241
  • 1
  • 2
  • 5
19
votes
3 answers

LDA with topicmodels, how can I see which topics different documents belong to?

I am using LDA from the topicmodels package. I have run it on about 30,000 documents, acquired 30 topics, and got the top 10 words for the topics; they look very good. But I would like to see which documents belong to which topic with the…
d12n
  • 821
  • 2
  • 10
  • 19
18
votes
1 answer

Export pyLDAvis graphs as standalone webpage

I am analysing text with topic modelling, using Gensim and pyLDAvis. I would like to share the results with distant colleagues without requiring them to install Python and all the required libraries. Is there a way to export interactive…
Darius
  • 466
  • 1
  • 6
  • 21
18
votes
1 answer

Predicting LDA topics for new data

It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking…
David
  • 8,565
  • 3
  • 37
  • 39
17
votes
4 answers

LDA model generates different topics everytime i train on the same corpus

I am using Python's gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate…
alvas
  • 94,813
  • 90
  • 365
  • 641
14
votes
1 answer

How to interpret LDA components (using sklearn)?

I used Latent Dirichlet Allocation (the sklearn implementation) to analyse about 500 scientific article abstracts, and I got topics containing the most important words (in German). My problem is interpreting the values associated with the most…
LSz
  • 151
  • 1
  • 6