--- title: "Basic usage" author: "Doug Friedman" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Basic usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction There are two ways to use the topic model diagnostics included `topicdoc`. You can calculate all the topic diagnostics at once using `topic_diagnostics` or use the other functions to calculate the diagnostics individually. The only prerequisite for using `topicdoc` is that your topic model is fit using the `topicmodels` package and that your document-term matrix (DTM) is `slam` coercible. This includes DTMs created through popular text mining packages like `tm` and `quanteda`. ## Example For this example, the Associated Press Dataset from topicmodels is used. It contains a DTM created a series of AP articles from 1988. ```{r example_setup} library(topicdoc) library(topicmodels) data("AssociatedPress") lda_ap4 <- LDA(AssociatedPress, control = list(seed = 33), k = 4) # See the top 10 terms associated with each of the topics terms(lda_ap4, 10) ``` Here's how you would run all the diagnostics at once. ```{r all_at_once} topic_diagnostics(lda_ap4, AssociatedPress) ``` Here's how you would run a few of them individually. ```{r one_at_a_time} topic_size(lda_ap4) mean_token_length(lda_ap4) ``` ## Diagnostics Included A full list of the diagnostics included are provided below. | Diagnostic/Metric | Function | Description | |:-----------------------------------------------:|:-------------------:|:-------------------------------------------:| | topic size | `topic_size` | Total (weighted) number of tokens per topic | | mean token length | `mean_token_length` | Average number of characters for the top tokens per topic | | distance from corpus distribution | `dist_from_corpus` | Distance of a topic's token distribution from the overall corpus token distribution | | distance between token and document frequencies | `tf_df_dist` | Distance between a topic's token and document distributions | | document prominence | `doc_prominence` | Number of unique documents where a topic appears | topic coherence | `topic_coherence` | Measure of how often the top tokens in each topic appear together in the same document | | topic exclusivity | `topic_exclusivity` | Measure of how unique the top tokens in each topic are compared to the other topics |