Title: | Topic-Specific Diagnostics for LDA and CTM Topic Models |
---|---|
Description: | Calculates topic-specific diagnostics (e.g. mean token length, exclusivity) for Latent Dirichlet Allocation and Correlated Topic Models fit using the 'topicmodels' package. For more details, see Chapter 12 of Airoldi et al. (2014, ISBN:9781466504080), pp. 262-272 of Mimno et al. (2011, ISBN:9781937284114), and Bischof et al. (2014) <arXiv:1206.4631v1>. |
Authors: | Doug Friedman [aut, cre] |
Maintainer: | Doug Friedman <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1.9000 |
Built: | 2024-11-01 03:55:07 UTC |
Source: | https://github.com/doug-friedman/topicdoc |
Calculates the Hellinger distance between the token probabilities (betas) for each topic and the overall token probabilities for the corpus.
dist_from_corpus(topic_model, dtm_data)
topic_model | a fitted topic model object from one of the following: |
dtm_data | a document-term matrix of token counts coercible to |
A vector of distances with length equal to the number of topics in the fitted model
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
dist_from_corpus(lda, AssociatedPress[1:20,])
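The distance itself is the standard Hellinger distance between two probability vectors. A minimal base-R sketch on toy data (the matrices, names, and the `hellinger` helper below are illustrative assumptions, not the package's internals):

```r
# Hellinger distance between two discrete probability distributions
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

# Toy data: 2 topics x 4 tokens (each row of beta sums to 1),
# and a toy document-term count matrix (documents x tokens)
beta <- rbind(c(0.4, 0.3, 0.2, 0.1),
              c(0.1, 0.2, 0.3, 0.4))
dtm  <- rbind(c(2, 1, 0, 1),
              c(0, 3, 2, 1))

corpus_dist <- colSums(dtm) / sum(dtm)   # overall token probabilities
dists <- apply(beta, 1, hellinger, q = corpus_dist)
```

Each element of `dists` lies in [0, 1]; larger values flag topics whose token distribution departs further from the corpus as a whole.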
Calculate the document prominence of each topic in a topic model, based either on the number of documents with an estimated gamma probability above a threshold or on the number of documents where a topic has the highest estimated gamma probability.
doc_prominence(
  topic_model,
  method = c("gamma_threshold", "largest_gamma"),
  gamma_threshold = 0.2
)
topic_model | a fitted topic model object from one of the following: |
method | a string indicating which method to use, either "gamma_threshold" or "largest_gamma"; the default is "gamma_threshold" |
gamma_threshold | a number between 0 and 1 indicating the gamma threshold to be used with the gamma threshold method; the default is 0.2 |
A vector of document prominences with length equal to the number of topics in the fitted model
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
doc_prominence(lda)
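Both counting rules can be sketched directly from a document-topic (gamma) matrix. A base-R illustration on made-up gammas (the variable names are assumptions, not the package's internals):

```r
# Toy gamma matrix: 4 documents x 2 topics (rows sum to 1)
gamma <- rbind(c(0.9, 0.1),
               c(0.6, 0.4),
               c(0.3, 0.7),
               c(0.5, 0.5))

# "gamma_threshold": count documents whose gamma for the topic
# exceeds the threshold (here 0.2)
prom_threshold <- colSums(gamma > 0.2)

# "largest_gamma": count documents where the topic has the highest gamma
top_topic <- max.col(gamma, ties.method = "first")
prom_largest <- tabulate(top_topic, nbins = ncol(gamma))
```

On this toy input the threshold method yields c(4, 3) and the largest-gamma method c(3, 1); the two rules can disagree because a topic can be prominent in many documents without ever being the single largest.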
Using the N highest-probability tokens for each topic, calculate the average token length for each topic.
mean_token_length(topic_model, top_n_tokens = 10)
topic_model | a fitted topic model object from one of the following: |
top_n_tokens | an integer indicating the number of top words to consider; the default is 10 |
A vector of average token lengths with length equal to the number of topics in the fitted model
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
mean_token_length(lda)
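The calculation reduces to ranking each topic's tokens by probability and averaging the character lengths of the top N. A base-R sketch on an invented beta matrix (the token names and probabilities below are made up):

```r
# Toy beta: 2 topics x 5 named tokens (probabilities per topic)
beta <- rbind(c(0.4, 0.3, 0.15, 0.1, 0.05),
              c(0.05, 0.1, 0.15, 0.3, 0.4))
colnames(beta) <- c("tax", "budget", "senate", "ballgame", "inning")

top_n <- 3
mean_len <- apply(beta, 1, function(p) {
  # take the top-n tokens for this topic and average their lengths
  top_tokens <- colnames(beta)[order(p, decreasing = TRUE)[1:top_n]]
  mean(nchar(top_tokens))
})
```

Very short average lengths can indicate a topic dominated by stopword-like or fragmentary tokens.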
Using the N highest-probability tokens for each topic, calculate the Hellinger distance between the token frequencies and the document frequencies.
tf_df_dist(topic_model, dtm_data, top_n_tokens = 10)
topic_model | a fitted topic model object from one of the following: |
dtm_data | a document-term matrix of token counts coercible to |
top_n_tokens | an integer indicating the number of top words to consider; the default is 10 |
A vector of distances with length equal to the number of topics in the fitted model
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
tf_df_dist(lda, AssociatedPress[1:20,])
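One plausible reading of this diagnostic, sketched in base R on toy data: for each topic's top-N tokens, normalize the raw token counts (term frequencies) and the counts of documents containing each token (document frequencies) into probability vectors, then take their Hellinger distance. The matrices, names, and `hellinger` helper below are illustrative assumptions, not the package's internals:

```r
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

# Toy data: 3 documents x 4 named tokens, and a beta matrix for 2 topics
dtm <- rbind(c(2, 0, 1, 0),
             c(1, 3, 0, 0),
             c(0, 1, 2, 2))
colnames(dtm) <- c("tax", "budget", "game", "inning")
beta <- rbind(c(0.5, 0.3, 0.1, 0.1),
              c(0.1, 0.1, 0.4, 0.4))
colnames(beta) <- colnames(dtm)

top_n <- 2
tf_df <- apply(beta, 1, function(p) {
  top <- order(p, decreasing = TRUE)[1:top_n]
  tf <- colSums(dtm[, top, drop = FALSE])       # term frequencies
  df <- colSums(dtm[, top, drop = FALSE] > 0)   # document frequencies
  hellinger(tf / sum(tf), df / sum(df))
})
```

A large distance suggests a topic whose top tokens are heavy in a few documents rather than spread evenly across them.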
Using the N highest-probability tokens for each topic, calculate the topic coherence for each topic.
topic_coherence(topic_model, dtm_data, top_n_tokens = 10, smoothing_beta = 1)
topic_model | a fitted topic model object from one of the following: |
dtm_data | a document-term matrix of token counts coercible to |
top_n_tokens | an integer indicating the number of top words to consider; the default is 10 |
smoothing_beta | a numeric indicating the value used to smooth the document frequencies in order to avoid taking the log of zero; the default is 1 |
A vector of topic coherence scores with length equal to the number of topics in the fitted model
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
McCallum, Andrew Kachites. 2002. "MALLET: A Machine Learning for Language Toolkit." https://mallet.cs.umass.edu
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
topic_coherence(lda, AssociatedPress[1:20,])
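The Mimno et al. (2011) coherence score sums, over ordered pairs of a topic's top tokens, the log ratio of the smoothed co-document frequency to the document frequency of the higher-ranked token. A base-R sketch for a single topic on toy counts (the matrix and token names are made up):

```r
# Toy data: 4 documents x 3 named tokens; the topic's top tokens,
# most probable first
dtm <- rbind(c(1, 1, 0),
             c(2, 1, 1),
             c(0, 1, 1),
             c(1, 0, 0))
colnames(dtm) <- c("tax", "budget", "senate")
top_tokens <- c("tax", "budget", "senate")
smoothing_beta <- 1

present <- dtm[, top_tokens] > 0   # document presence indicators
coherence <- 0
for (i in 2:length(top_tokens)) {
  for (j in 1:(i - 1)) {
    co_df <- sum(present[, i] & present[, j])  # co-document frequency
    df_j  <- sum(present[, j])                 # document frequency
    coherence <- coherence + log((co_df + smoothing_beta) / df_j)
  }
}
```

Scores are non-positive in the typical case; values closer to zero indicate top tokens that tend to co-occur in the same documents.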
Generate a dataframe containing the diagnostics for each topic in a topic model
topic_diagnostics(
  topic_model,
  dtm_data,
  top_n_tokens = 10,
  method = c("gamma_threshold", "largest_gamma"),
  gamma_threshold = 0.2
)
topic_model | a fitted topic model object from one of the following: |
dtm_data | a document-term matrix of token counts coercible to |
top_n_tokens | an integer indicating the number of top words to consider for mean token length |
method | a string indicating which method to use, either "gamma_threshold" or "largest_gamma" |
gamma_threshold | a number between 0 and 1 indicating the gamma threshold to be used with the gamma threshold method; the default is 0.2 |
A dataframe where each row is a topic and each column contains the associated diagnostic values
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
topic_diagnostics(lda, AssociatedPress[1:20,])
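Conceptually the wrapper just collects the per-topic diagnostic vectors into one row-per-topic table. A minimal sketch of that assembly, with made-up diagnostic values standing in for the real function outputs:

```r
# Hypothetical per-topic diagnostic vectors for a 2-topic model
# (the numbers are invented for illustration)
topic_size        <- c(120.5, 98.2)
mean_token_length <- c(5.0, 6.7)
dist_from_corpus  <- c(0.17, 0.24)

diag_df <- data.frame(
  topic_num = seq_along(topic_size),
  topic_size = topic_size,
  mean_token_length = mean_token_length,
  dist_from_corpus = dist_from_corpus
)
```

Having all diagnostics in one dataframe makes it easy to sort or plot topics by any single measure.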
Using the N highest-probability tokens for each topic, calculate the exclusivity for each topic.
topic_exclusivity(topic_model, top_n_tokens = 10, excl_weight = 0.5)
topic_model | a fitted topic model object from one of the following: |
top_n_tokens | an integer indicating the number of top words to consider; the default is 10 |
excl_weight | a numeric between 0 and 1 indicating the weight to place on exclusivity versus frequency in the calculation; the default is 0.5 |
A vector of exclusivity values with length equal to the number of topics in the fitted model
Bischof, Jonathan, and Edoardo Airoldi. 2012. "Summarizing topical content with word frequency and exclusivity." In Proceedings of the 29th International Conference on Machine Learning (ICML-12), eds. John Langford and Joelle Pineau. New York, NY: Omnipress, 201-208.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
topic_exclusivity(lda)
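One way to sketch the Bischof and Airoldi (2012) FREX idea in base R: a token's exclusivity is its probability under a topic relative to its total probability across topics; the FREX score is a weighted harmonic mean of the within-topic empirical-CDF ranks of exclusivity and frequency. The toy beta, the summation over the top tokens, and all names below are illustrative assumptions, not the package's exact implementation:

```r
# Toy beta: 2 topics x 4 tokens
beta <- rbind(c(0.4, 0.3, 0.2, 0.1),
              c(0.1, 0.1, 0.3, 0.5))
excl_weight <- 0.5
top_n <- 2

# Exclusivity: a token's probability under one topic relative to its
# total probability across topics
excl <- sweep(beta, 2, colSums(beta), "/")

frex <- sapply(seq_len(nrow(beta)), function(k) {
  e_rank <- ecdf(excl[k, ])(excl[k, ])   # within-topic exclusivity rank
  f_rank <- ecdf(beta[k, ])(beta[k, ])   # within-topic frequency rank
  score <- 1 / (excl_weight / e_rank + (1 - excl_weight) / f_rank)
  top <- order(beta[k, ], decreasing = TRUE)[1:top_n]
  sum(score[top])                        # summarize over the top tokens
})
```

The harmonic mean rewards tokens that rank highly on both axes; a token that is frequent but shared across topics, or exclusive but rare, scores poorly.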
Calculate the size of each topic in a topic model based on the number of fractional tokens found in each topic.
topic_size(topic_model)
topic_model | a fitted topic model object from one of the following: |
A vector of topic sizes with length equal to the number of topics in the fitted model
Jordan Boyd-Graber, David Mimno, and David Newman, 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
topic_size(lda)
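One common reading of "fractional tokens", sketched in base R: each document's tokens are split across topics in proportion to its estimated gamma values, and a topic's size is the sum of those fractional counts. The toy gammas and document lengths below are made up, and this is only one plausible interpretation of the diagnostic, not the package's exact formula:

```r
# Toy data: document-topic gammas (3 docs x 2 topics) and document lengths
gamma <- rbind(c(0.8, 0.2),
               c(0.5, 0.5),
               c(0.1, 0.9))
doc_lengths <- c(10, 20, 30)

# Fractional token count per topic: each document's tokens are split
# across topics in proportion to its gamma values
sizes <- colSums(gamma * doc_lengths)
```

Note that the per-topic sizes sum back to the total token count of the corpus, so very small sizes flag topics that absorb little of the data.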