There are two ways to use the topic model diagnostics included in `topicdoc`. You can calculate all the topic diagnostics at once using `topic_diagnostics`, or use the other functions to calculate the diagnostics individually.

The only prerequisites for using `topicdoc` are that your topic model is fit using the `topicmodels` package and that your document-term matrix (DTM) is `slam` coercible. This includes DTMs created through popular text mining packages like `tm` and `quanteda`.
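As a rough way to check the DTM prerequisite, the sketch below builds a small DTM with `tm` and coerces it with `slam::as.simple_triplet_matrix()`. The three example sentences and the variable names are made up for illustration; this is not a check that `topicdoc` itself performs.

```r
library(tm)
library(slam)

# Hypothetical toy corpus, for illustration only
texts <- c("the cat sat on the mat",
           "the dog chased the cat",
           "stocks fell as markets closed")

# Build a document-term matrix with tm
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Coerce to slam's sparse format; an error here would suggest the
# object is not slam coercible
stm <- as.simple_triplet_matrix(dtm)

is.simple_triplet_matrix(stm)
dim(stm)  # documents x terms
```

If the coercion succeeds, the DTM should be usable wherever the functions below expect one.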
For this example, the Associated Press dataset from `topicmodels` is used. It contains a DTM created from a series of AP articles from 1988.
library(topicdoc)
library(topicmodels)
data("AssociatedPress")
lda_ap4 <- LDA(AssociatedPress,
               control = list(seed = 33), k = 4)
# See the top 10 terms associated with each of the topics
terms(lda_ap4, 10)
#> Topic 1 Topic 2 Topic 3 Topic 4
#> [1,] "i" "percent" "bush" "soviet"
#> [2,] "people" "million" "i" "government"
#> [3,] "two" "year" "president" "united"
#> [4,] "police" "billion" "court" "president"
#> [5,] "years" "new" "federal" "people"
#> [6,] "new" "market" "new" "police"
#> [7,] "city" "company" "house" "military"
#> [8,] "time" "prices" "state" "states"
#> [9,] "three" "stock" "dukakis" "party"
#> [10,] "like" "last" "campaign" "two"
Here’s how you would run all the diagnostics at once.
topic_diagnostics(lda_ap4, AssociatedPress)
#> topic_num topic_size mean_token_length dist_from_corpus tf_df_dist
#> 1 1 3476.377 4.1 0.3899012 24.08191
#> 2 2 1910.153 5.6 0.5044673 26.67523
#> 3 3 2504.622 5.4 0.3830014 26.46131
#> 4 4 2581.848 6.5 0.3988826 25.52163
#> doc_prominence topic_coherence topic_exclusivity
#> 1 1053 -81.83339 7.813034
#> 2 598 -79.50691 9.560433
#> 3 783 -106.40062 9.162590
#> 4 775 -84.46149 9.058854
Here’s how you would run a few of them individually.
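The sketch below calls four of the functions listed in the table at the end of this section. It assumes that `topic_size` and `mean_token_length` need only the fitted model, while `dist_from_corpus` and `topic_coherence` also take the DTM as a second argument, mirroring the `topic_diagnostics` call above; check each function's help page to confirm. The model is refit only if `lda_ap4` is not already in the workspace.

```r
library(topicdoc)
library(topicmodels)

# Reuse the model from the example above, or refit it if it is missing
if (!exists("lda_ap4")) {
  data("AssociatedPress")
  lda_ap4 <- LDA(AssociatedPress,
                 control = list(seed = 33), k = 4)
}

# Diagnostics assumed to need only the fitted model
sizes <- topic_size(lda_ap4)
token_lengths <- mean_token_length(lda_ap4)

# Diagnostics assumed to also need the DTM the model was fit on
corpus_dist <- dist_from_corpus(lda_ap4, AssociatedPress)
coherence <- topic_coherence(lda_ap4, AssociatedPress)

sizes
token_lengths
corpus_dist
coherence
```

Each call should return one value per topic, and with default arguments the values should correspond to the matching columns of the `topic_diagnostics` output above.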
A full list of the included diagnostics is provided below.
| Diagnostic/Metric | Function | Description |
|---|---|---|
| topic size | `topic_size` | Total (weighted) number of tokens per topic |
| mean token length | `mean_token_length` | Average number of characters for the top tokens per topic |
| distance from corpus distribution | `dist_from_corpus` | Distance of a topic’s token distribution from the overall corpus token distribution |
| distance between token and document frequencies | `tf_df_dist` | Distance between a topic’s token and document distributions |
| document prominence | `doc_prominence` | Number of unique documents where a topic appears |
| topic coherence | `topic_coherence` | Measure of how often the top tokens in each topic appear together in the same document |
| topic exclusivity | `topic_exclusivity` | Measure of how unique the top tokens in each topic are compared to the other topics |
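To make one of these descriptions concrete, the mean token length can be reproduced by hand in base R from the top-10 terms printed earlier. This is a by-hand check under the description in the table, not the package's internal implementation; the variable names are illustrative.

```r
# Top 10 terms for Topics 1 and 2, copied from the terms(lda_ap4, 10) output above
topic1_terms <- c("i", "people", "two", "police", "years",
                  "new", "city", "time", "three", "like")
topic2_terms <- c("percent", "million", "year", "billion", "new",
                  "market", "company", "prices", "stock", "last")

# Mean token length: average number of characters across the top terms
mean_len_topic1 <- mean(nchar(topic1_terms))
mean_len_topic2 <- mean(nchar(topic2_terms))

mean_len_topic1
#> [1] 4.1
mean_len_topic2
#> [1] 5.6
```

These agree with the `mean_token_length` column reported for topics 1 and 2 in the `topic_diagnostics` output.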