Exploring word frequency distributions

A common task in text analysis is to explore the distribution of words (or terms) in a text collection. There are a number of ways in which a researcher can operationalize frequency, and the choice can change the results quite dramatically. In this case study, I will demonstrate the difference between two frequency measures: raw counts (n) and term frequency–inverse document frequency (tf-idf). I will use the latter to explore the similarity between written genres (categories) in the Brown Corpus of Written American English.

Data

Let’s access a curated version of this data through the analyzr package. First, install and load the package along with the main tidyverse tools.

devtools::install_github("WFU-TLC/analyzr")
library(analyzr)
library(tidyverse)

Let’s take a look at the brown_words dataset.
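A quick way to inspect the structure of the dataset is with dplyr’s glimpse(), which prints the columns and a preview of their values without assuming anything about the column names:

```r
library(analyzr)  # provides the brown_words dataset
library(dplyr)

# Preview the columns and first few values of each
glimpse(brown_words)
```
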

To find out more about the data, we can consult the data dictionary provided in the analyzr package with ?brown_words.

Case study

Prepare the variables

The first step is to calculate the relevant frequency metrics. Each measure will be grouped by category to highlight the similarities and differences between the categories.
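A minimal sketch of this step, assuming brown_words contains one row per word token with category and word columns (the column names are assumptions based on the description above):

```r
library(analyzr)
library(dplyr)

# Raw counts (n) of each word within each category
word_counts <- brown_words %>%
  count(category, word, sort = TRUE)
```
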

As we can see, the most frequent words in terms of raw number of occurrences are very similar across the categories. This is expected: natural language shows a highly skewed (Zipfian) frequency distribution, with a small set of primarily grammatical words making up the majority of word tokens in any (sizable) corpus. To distinguish between the words that form the scaffolding of language (grammatical words) and words of importance (content words), we will use term frequency–inverse document frequency (tf-idf). This measure takes into account the distribution of terms across documents, downweighting those terms that occur in many documents (such as those in the raw frequency plot above). On the whole, tf-idf attempts to strike a balance between common grammatical terms and content terms.
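One common way to compute this in the tidyverse ecosystem is tidytext’s bind_tf_idf(). The sketch below treats each category as a “document,” which is a simplification of the document-level weighting described above; the column names are assumptions:

```r
library(analyzr)
library(dplyr)
library(tidytext)

# tf-idf per word, treating each category as a document
word_tfidf <- brown_words %>%
  count(category, word) %>%
  bind_tf_idf(term = word, document = category, n = n)
```

Words that appear in every category receive an idf of zero and drop out, so what remains foregrounds category-distinctive content terms.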

Analysis

Category similarity

Now that we have a measure that helps us get at the content of the categories, let’s find out which categories tend to be similar to each other in text content. We want the pairwise correlation between word frequencies across categories. The widyr package provides a key function for this task: pairwise_cor(). We will use the tf_idf score to focus in on the distribution of words from an importance-based perspective.
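A self-contained sketch of this step, again assuming category and word columns and treating each category as a document for the tf-idf weighting:

```r
library(analyzr)
library(dplyr)
library(tidytext)
library(widyr)

# Correlate categories by the tf-idf profiles of their words
category_cors <- brown_words %>%
  count(category, word) %>%
  bind_tf_idf(term = word, document = category, n = n) %>%
  pairwise_cor(item = category, feature = word, value = tf_idf, sort = TRUE)
```
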

To appreciate the relationships between the categories, we will plot a network graph. This requires the network-visualization packages ggraph and igraph.
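A sketch of the network plot, built on the pairwise correlations computed from the tf-idf scores. The correlation threshold here is illustrative, not a value from the original analysis:

```r
library(analyzr)
library(dplyr)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)

set.seed(1234)  # fixed seed so the graph layout is reproducible

brown_words %>%
  count(category, word) %>%
  bind_tf_idf(term = word, document = category, n = n) %>%
  pairwise_cor(item = category, feature = word, value = tf_idf) %>%
  filter(correlation > 0.1) %>%              # keep the stronger links (threshold is illustrative)
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```
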

Summary

From this exploratory approach, we can gather that there are three groupings of categories that show some overlap in content. This finding could be harnessed to decide whether or not to collapse some categories.

For more ideas on text exploration, see Silge and Robinson (2017).

References

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media.