In this vignette I will provide an overview of some of the more common strategies that you will use to manipulate and organize your data for subsequent analysis. We will be working with two packages that are part of the tidyverse collection of packages. The first, tidyr, provides a number of functions for reorganizing variables between long and wide format, as well as for separating out new variables based on the values of other variables. The second, dplyr, is used for manipulating data (selecting, filtering, sorting, etc.) and for transforming values, either through recoding or some other operation.
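To make the long/wide idea concrete before we turn to real data, here is a minimal sketch with tidyr. The two small tables are made up for illustration and are not part of any package. I use gather(), spread() and separate(), which match the tidyr version loaded in this vignette; more recent tidyr releases also offer pivot_longer() and pivot_wider() for the same job.

```r
library(tidyr)
library(tibble)

# Toy data, made up for illustration: one row per speaker, one column per year
wide <- tibble(
  speaker = c("A", "B"),
  `1990` = c(10, 4),
  `1991` = c(12, 6)
)

# Wide to long: collect the year columns into a key column and a value column
long <- gather(wide, key = "year", value = "calls", -speaker)

# Long back to wide: spread the key column out into separate columns again
wide_again <- spread(long, key = year, value = calls)

# Split one column into two new variables based on a separator
ids <- tibble(id = c("4325_A", "4325_B"))
ids_split <- separate(ids, id, into = c("doc_id", "speaker"), sep = "_")
```

Printing long shows four rows (two speakers by two years), and ids_split has the columns doc_id and speaker in place of id.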
Let’s take a look at a dataset included in the analyzr package. First, install and load the package, along with the main tidyverse tools.
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
#> ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
#> ✔ tidyr 0.8.3 ✔ stringr 1.4.0
#> ✔ readr 1.3.1 ✔ forcats 0.4.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
library(analyzr)
Let’s take a look at the sdac dataset.
glimpse(sdac)
#> Observations: 223,606
#> Variables: 20
#> $ doc_id <chr> "4325", "4325", "4325", "4325", "4325", "4325"…
#> $ damsl_tag <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad",…
#> $ speaker <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "…
#> $ turn_num <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "…
#> $ utterance_num <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "…
#> $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "Wha…
#> $ speaker_id <dbl> 1632, 1632, 1519, 1632, 1519, 1632, 1519, 1519…
#> $ pin <dbl> 7713, 7713, 775, 7713, 775, 7713, 775, 775, 77…
#> $ target <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "…
#> $ sex <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMAL…
#> $ birth_year <dbl> 1962, 1962, 1971, 1962, 1971, 1962, 1971, 1971…
#> $ dialect_area <chr> "WESTERN", "WESTERN", "SOUTH MIDLAND", "WESTER…
#> $ education <dbl> 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2…
#> $ ti <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ payment_type <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH"…
#> $ amt_pd <dbl> 10, 10, 4, 10, 4, 10, 4, 4, 4, 4, 4, 10, 10, 4…
#> $ con <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
#> $ remarks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ calls_deleted <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ speaker_partition <chr> "UNC", "UNC", "UNC", "UNC", "UNC", "UNC", "UNC…
This dataset is in the tidy format. Take a look at the R documentation for this dataset with ?sdac.
There are a few tidyverse verbs that are very commonly used to manipulate data frames.
select() allows you to select a subset of columns
sdac %>%
select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
head()
#> speaker_id damsl_tag birth_year
#> 1 1632 o 1962
#> 2 1632 qw 1962
#> 3 1519 qy^d 1971
#> 4 1632 + 1962
#> 5 1519 + 1971
#> 6 1632 qy 1962
#> utterance_text
#> 1 Okay. /
#> 2 {D So, }
#> 3 [ [ I guess, +
#> 4 What kind of experience [ do you, + do you ] have, then with child care? /
#> 5 I think, ] + {F uh, } I wonder ] if that worked. /
#> 6 Does it say something? /
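select() accepts more than a plain list of names. The sketch below shows a few variations on a small stand-in table; the toy tibble only borrows four column names from sdac, and its values are made up so the example runs without analyzr installed.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  speaker_id = c(1632, 1519),
  damsl_tag = c("o", "qy^d"),
  birth_year = c(1962, 1971),
  utterance_text = c("Okay. /", "I guess, +")
)

# Drop a column with a minus sign
no_text <- toy %>% select(-utterance_text)

# Select a range of adjacent columns
first_three <- toy %>% select(speaker_id:birth_year)

# Select columns by name pattern with a helper function
speaker_cols <- toy %>% select(starts_with("speaker"))

# Rename a column while selecting it
renamed <- toy %>% select(id = speaker_id, birth_year)
```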
arrange() sorts a data frame by one or more columns
sdac %>%
select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
arrange(birth_year) %>%
head()
#> speaker_id damsl_tag birth_year
#> 1 1381 o 1924
#> 2 1381 sv 1924
#> 3 1381 na^m 1924
#> 4 1381 qy 1924
#> 5 1381 sd 1924
#> 6 1381 sd 1924
#> utterance_text
#> 1 All right. /
#> 2 I think that gets us off to discussing the topic for tonight. /
#> 3 That is correct. /
#> 4 # Are you # a movie buff? /
#> 5 {D Well, } I have a V C R /
#> 6 {C and } I see most all of mine there. /
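By default arrange() sorts in ascending order. As a small sketch of the "one or more columns" part, the example below sorts a made-up table (again only borrowing column names from sdac) by birth year from youngest to oldest with desc(), and breaks ties by speaker ID.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  speaker_id = c(1632, 1519, 1402, 1381),
  birth_year = c(1962, 1971, 1924, 1924)
)

# Most recent birth years first; within a year, lowest speaker_id first
sorted <- toy %>%
  arrange(desc(birth_year), speaker_id)
```

Here the two speakers born in 1924 end up last, with 1381 listed before 1402.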
filter() allows you to select rows whose values match certain conditions
sdac %>%
select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
arrange(birth_year) %>%
filter(birth_year == 1971) %>%
head()
#> speaker_id damsl_tag birth_year
#> 1 1519 qy^d 1971
#> 2 1519 + 1971
#> 3 1519 sd 1971
#> 4 1519 ad 1971
#> 5 1519 h 1971
#> 6 1519 ad 1971
#> utterance_text
#> 1 [ [ I guess, +
#> 2 I think, ] + {F uh, } I wonder ] if that worked. /
#> 3 I think it usually does. /
#> 4 You might try, {F uh, } /
#> 5 I don't know, /
#> 6 hold it down a little longer, /
filter() can be combined with comparison operators such as > and with vector functions such as between().
sdac %>%
select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
arrange(birth_year) %>%
filter(between(birth_year, 1950, 1969)) %>%
head()
#> speaker_id damsl_tag birth_year
#> 1 1103 o 1950
#> 2 1103 sd 1950
#> 3 1103 sv 1950
#> 4 1103 qo 1950
#> 5 1103 x 1950
#> 6 1103 x 1950
#> utterance_text
#> 1 <Lipsmack> Okay, /
#> 2 this topic is, is Russia a threat <breathing> to our security? /
#> 3 <Sniffing>. I think they are more of a threat <breathing> to their own people at this day and time. /
#> 4 <Breathing>, how do you feel about it? /
#> 5 <Breathing>.
#> 6 <Breathing>.
sdac %>%
select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
arrange(birth_year) %>%
filter(birth_year > 1955) %>%
head()
#> speaker_id damsl_tag birth_year
#> 1 1638 sd 1956
#> 2 1638 sv 1956
#> 3 1638 sv 1956
#> 4 1638 + 1956
#> 5 1638 aa 1956
#> 6 1638 qo 1956
#> utterance_text
#> 1 I say they, {F uh, } - /
#> 2 probably the best things to do [ is, + is ] tie them with a rope and throw them in the water # /
#> 3 ((if they sink)) <laughter> # --
#> 4 -- they're innocent {D right. } /
#> 5 <Laughter> Yeah. /
#> 6 {F Uh, } {D so, } {F uh, } [ how do you, + how do you ] feel? /
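Conditions can also be combined. The following sketch, on a made-up table that borrows column names from sdac, shows three common patterns: several conditions separated by commas (all must hold), the | operator (either may hold), and %in% for matching against a set of values.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  speaker_id = c(1632, 1519, 1381, 1103, 1638),
  damsl_tag = c("o", "sd", "sv", "sd", "qo"),
  birth_year = c(1962, 1971, 1924, 1950, 1956)
)

# Comma-separated conditions are combined with AND
both <- toy %>% filter(birth_year >= 1950, damsl_tag == "sd")

# Use | when either condition is enough
either <- toy %>% filter(birth_year < 1930 | birth_year > 1970)

# Use %in% to keep rows whose value is in a set
in_set <- toy %>% filter(damsl_tag %in% c("sd", "sv"))
```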
You often want to explore your data by summarizing. A basic summary is count(). You can also add column names to count() to group your count summary.
sdac %>%
count(birth_year, sort = TRUE)
#> # A tibble: 47 x 2
#> birth_year n
#> <dbl> <int>
#> 1 1959 11821
#> 2 1956 11566
#> 3 1965 11041
#> 4 1960 10870
#> 5 1941 10250
#> 6 1963 10239
#> 7 1962 9995
#> 8 1961 8655
#> 9 1957 7589
#> 10 1964 7300
#> # … with 37 more rows
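count() is not limited to a single column. As a sketch, the made-up table below (column names borrowed from sdac) is counted first with no columns at all, which gives the total number of rows, and then by two columns at once, which gives one row per combination.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  sex = c("FEMALE", "FEMALE", "MALE", "FEMALE", "MALE"),
  birth_year = c(1962, 1962, 1962, 1971, 1971)
)

# No columns: a single row holding the total number of rows
total <- toy %>% count()

# Two columns: one row per birth_year and sex combination, largest first
by_year_sex <- toy %>% count(birth_year, sex, sort = TRUE)
```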
You can also use the group_by() function to explicitly group your data for multiple operations.
sdac %>%
group_by(birth_year) %>%
count()
#> # A tibble: 47 x 2
#> # Groups: birth_year [47]
#> birth_year n
#> <dbl> <int>
#> 1 1924 1457
#> 2 1927 65
#> 3 1929 58
#> 4 1930 1387
#> 5 1931 1082
#> 6 1932 132
#> 7 1933 3541
#> 8 1934 1754
#> 9 1935 4720
#> 10 1936 4630
#> # … with 37 more rows
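The advantage of explicit grouping is that several summaries can be computed for each group in one step. Here is a minimal sketch that pairs group_by() with summarise() on a made-up table (column names borrowed from sdac); the output column names utterances and speakers are my own choice for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  speaker_id = c(1632, 1632, 1500, 1519, 1519, 1381),
  birth_year = c(1962, 1962, 1962, 1971, 1971, 1924)
)

# For each birth year: how many rows, and how many distinct speakers
by_year <- toy %>%
  group_by(birth_year) %>%
  summarise(
    utterances = n(),
    speakers = n_distinct(speaker_id)
  )
```

The result has one row per birth year, so 1962 shows three utterances from two speakers.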
Using group_by() we can sample data as well.
sdac %>%
group_by(birth_year) %>%
sample_n(2) %>%
select(speaker_id, birth_year, utterance_text) %>%
arrange(birth_year) %>%
head()
#> # A tibble: 6 x 3
#> # Groups: birth_year [3]
#> speaker_id birth_year utterance_text
#> <dbl> <dbl> <chr>
#> 1 1402 1924 You can do a lot of things with it. /
#> 2 1402 1924 {C So, } {F uh, } how would you improve upon that?…
#> 3 1450 1927 -- and then, of course, we look into the prices. …
#> 4 1450 1927 we had, uh, we had a, uh, - /
#> 5 1213 1929 but, -/
#> 6 1213 1929 Yeah, /
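Because sampling is random, the rows you get will differ from run to run. If you need the same sample each time, one option is to call set.seed() before sampling, as in this sketch on a made-up table (the seed value 2019 is arbitrary). Note also that each group must have at least as many rows as you ask for, unless you sample with replace = TRUE. Newer dplyr releases offer slice_sample() as the successor to sample_n().

```r
library(dplyr)
library(tibble)

# Toy stand-in for sdac; the values are made up for illustration
toy <- tibble(
  birth_year = c(1924, 1924, 1924, 1971, 1971, 1971),
  utterance_text = c("a", "b", "c", "d", "e", "f")
)

# Fix the random seed, then draw two rows per birth year
set.seed(2019)
first_draw <- toy %>%
  group_by(birth_year) %>%
  sample_n(2)

# Resetting the same seed reproduces the same draw
set.seed(2019)
second_draw <- toy %>%
  group_by(birth_year) %>%
  sample_n(2)

same_rows <- identical(first_draw$utterance_text, second_draw$utterance_text)
```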
mutate
summarize
Vector functions