3 Introduction to {dplyr}
We’ve been looking at datasets that fit the ggplot2 paradigm nicely; however, most data we encounter is really messy (for example, it has missing values) or is in a completely different format.
In this chapter, we’ll look at one of the most powerful tools in the tidyverse: dplyr, which lets you manipulate data frames.
There is a function/action for most of the annoying tasks you have to do in data cleaning, which makes it super useful.
In particular, we’re going to look at six fundamental verbs/actions in dplyr over both of these chapters:
filter()
mutate()
group_by()/summarize()
arrange()
select()
Along the way, we’ll do some data manipulation challenges. You’ll be a dplyr expert in no time!
You will want to keep this dplyr cheat sheet open in a separate window to remind you about the syntax: dplyr cheat sheet
Also, remember: if you need to know the variables in a data.frame called biopics, you can always use
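One option, sketched here on a small made-up data frame (toy_biopics is a hypothetical stand-in, not the real biopics data), is base R's colnames(), which returns the variable names as a character vector:

```r
# toy_biopics is a made-up stand-in for biopics, used only for illustration
toy_biopics <- data.frame(
  title = c("Film A", "Film B"),
  country = c("UK", "US"),
  year_release = c(1999, 2005)
)

# list the variable (column) names of the data.frame
colnames(toy_biopics)
```

The same call should work on biopics itself once the fivethirtyeight package is loaded.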
3.1 Learning about assignment
In order to do the following exercises, we need to learn a little bit about how to assign the output of a function to a variable.
For example, we can assign the output of the operation 1 + 2 to a variable called sumOfTwoNumbers using the <- operator. This is called the assignment operator.
You can also use = to assign a value to a variable, but I find it makes my code a bit confusing, because there is also ==, which tests for equality.
sumOfTwoNumbers <- 1 + 2
Once we have something assigned to a variable, we can use it in other expressions:
sumOfThreeNumbers <- sumOfTwoNumbers + 3
These are the bare basics of assignment. We’ll use assignment in the next exercises to evaluate the output of our dplyr cleaning.
3.1.1 Exercise
- Assign newValue the value of 10.
- Then use newValue to calculate the value of multValue by calculating newValue * 5.
- Show multValue.
## assign newValue
newValue <- 10
## use newValue to calculate multValue
multValue <- newValue * 5
## show multValue
multValue
3.2 Let’s look at some data and ways to manipulate it.
We’re going to use the biopics dataset in the fivethirtyeight package to learn dplyr. This is a dataset of 761 different biopic movies.
3.2.1 Exercise
- Run a summary on the biopics dataset. It’s already loaded up for you.
- How many categories are in the country variable? Use the levels() function to count the categories.
##run summary here
summary(biopics)
##show length of country categories here
levels(biopics$country)
3.3 dplyr::filter()
filter() is a very useful dplyr command. It allows you to subset a data.frame based on variable criteria.
For example, if we wanted to subset biopics to those movies that were made in the UK, we’d use the following statement:
Three things to note here:
- The first argument to filter() is the dataset. We’ll see another variation of this in a second.
- For those who are used to accessing data.frame variables by $, notice we don’t have to use biopics$country. Instead, we can just use the variable name country.
- Our filter statement uses ==. Remember that == is an equality test, and = is to assign something. (Confusing the two will happen to you from time to time.)
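A statement consistent with the three notes above might look like the following sketch. It runs on toy_biopics, a small made-up data frame standing in for biopics, so the rows and values are assumptions for illustration only.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  country = c("UK", "US", "UK"),
  stringsAsFactors = FALSE
)

# first argument: the dataset; second argument: a test on the bare column name
ukMovies <- filter(toy_biopics, country == "UK")
ukMovies
```

Swapping toy_biopics for biopics should give the UK subset of the real data, assuming its country column codes the UK as "UK".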
3.3.1 Exercise
- Filter biopics so that it only shows Criminal movies (you’ll have to use the type_of_subject variable in biopics).
- Show how many rows are left using nrow(crimeMovies).
# add your filter statement here
crimeMovies <- filter(biopics, type_of_subject == "Criminal")
# show number of crime movies
nrow(crimeMovies)
3.4 Comparison operators and chaining comparisons
Let’s look at the following filter() statement:
Three things to note:
- We used the comparison operator >. The basic comparisons you’ll use are > (greater than), < (less than), == (equal to) and != (not equal to).
- We also chained on another expression, type_of_subject == "Criminal", using & (and). The other chaining operator that you’ll use is |, which corresponds to OR.
- Chaining expressions is where filter() becomes super powerful. However, it’s also the source of headaches, so you will need to carefully test your chain of expressions.
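To make the chaining concrete, here is a minimal sketch on a made-up data frame. The name toy_biopics, its rows, and the 1990 cutoff are all assumptions chosen for illustration; they are not taken from the real biopics data.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  year_release = c(1975, 1995, 2005, 2010),
  type_of_subject = c("Criminal", "Criminal", "Musician", "Criminal"),
  stringsAsFactors = FALSE
)

# & keeps rows where BOTH tests are TRUE (the 1990 cutoff is arbitrary)
bothTrue <- filter(toy_biopics,
                   year_release > 1990 & type_of_subject == "Criminal")

# | keeps rows where AT LEAST ONE test is TRUE
eitherTrue <- filter(toy_biopics,
                     year_release > 1990 | type_of_subject == "Criminal")

# compare how many rows each chain keeps
nrow(bothTrue)
nrow(eitherTrue)
```

Checking nrow() after each change to the chain is a cheap way to test that the expression does what you intended.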
3.4.1 Exercise
- Add another comparison to the chain using &. Use person_of_color == FALSE.
- Show how many rows are left from your filter() statement.
3.5 Quick Quiz about Chaining Comparisons
Which statement should be the larger subset? Try them out in the console if you’re not sure.
3.6 The %in% operator
What if you wanted to select for multiple values? You can use the %in% operator. Here we put the values into a vector with the c() function, which concatenates the values together into a form that R can manipulate. Note that these values have to be exact and the case has to be the same (that is, “UK”, not “Uk” or “uk”) for the matching to work.
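As a hedged illustration of %in% and its case sensitivity, the sketch below uses a made-up data frame (toy_biopics and its values are assumptions, not the real data) that deliberately includes a lowercase "uk" row.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  country = c("UK", "US", "Canada", "uk"),
  stringsAsFactors = FALSE
)

# keep rows whose country exactly matches one of the values in the vector
ukOrUs <- filter(toy_biopics, country %in% c("UK", "US"))

# the lowercase "uk" row is not matched, because matching is case sensitive
ukOrUs
```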
3.6.1 Exercise
- Pick out the Musician, Artist and Singer movies from type_of_subject.
- Assign the output to biopicsArt.
biopicsArt <- biopics %>%
  filter(type_of_subject %in% c("Musician", "Artist", "Singer"))
head(biopicsArt)
3.7 Removing Missing Values
One trick you can use filter() for is to remove missing values. Usually missing values are coded as NA in data. You can remove rows that contain NAs by using is.na(). For example:
Note the ! in front of is.na(box_office). This ! is known as the NOT operator. Basically, it switches the values in our is.na statement, making everything that was TRUE into FALSE, and everything FALSE into TRUE. We want to keep everything that is not NA, so that’s why we use the !.
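The effect of ! on is.na() can be seen in this small sketch. The data frame toy_biopics and its box_office values are made up for illustration; only the column name box_office comes from the text.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  box_office = c(1200000, NA, 350000)
)

# TRUE where box_office is missing
is.na(toy_biopics$box_office)

# flipped by !: TRUE where box_office has a value
!is.na(toy_biopics$box_office)

# keep only the rows where box_office is not missing
noMissing <- filter(toy_biopics, !is.na(box_office))
nrow(noMissing)
```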
3.7.1 Exercise
- Filter biopics to remove the NAs, and assign the output to filteredBiopics.
- Compare the number of rows in biopics to filteredBiopics.
- How many missing values did we remove?
3.8 dplyr::mutate()
mutate() is one of the most useful dplyr commands. You can use it to transform data (variables in your data.frame) and add it as a new variable into the data.frame. For example, let’s calculate the total box_office divided by the number_of_subjects to normalize our comparison as normalized_box_office:
What did we do here? First, we used the mutate() function to add a new column into our data.frame called normalized_box_office. This new variable is calculated per row by dividing box_office by number_of_subjects.
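A minimal sketch of the calculation described above, run on a made-up data frame: toy_biopics, its values, and the output name biopicsNormalized are assumptions for illustration, while the column names follow the text.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  title = c("Film A", "Film B"),
  box_office = c(1000000, 600000),
  number_of_subjects = c(1, 2)
)

# add a new column, computed row by row from two existing columns
biopicsNormalized <- mutate(toy_biopics,
                            normalized_box_office = box_office / number_of_subjects)
biopicsNormalized
```

Note that mutate() returns a new data.frame; toy_biopics itself is unchanged unless you assign the result back to it.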
3.8.1 Exercise
- Try defining a new variable race_and_gender by pasting together subject_race and subject_sex into a new data.frame called biopics2.
- Show the first few rows using head() so you can confirm that you added this new variable correctly.
Remember, you can use the paste() function to paste two strings together.
# assign new variable race_and_gender here using mutate()
biopics2 <- mutate(biopics, race_and_gender = paste(subject_race, subject_sex))
# show first rows of biopics2 using head()
head(biopics2)
3.9 You can use mutated variables right away!
The nifty thing about mutate() is that once you define the variables in the statement, you can use them right away, in the same mutate statement. For example, look at this code:
Notice that we first defined box_office_year in the first part of the mutate() statement, and then used it right away to define a new variable, box_office_subject.
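The pattern described above can be sketched as follows. The variable definitions mirror the ones named in the text and in the exercise statement, but toy_biopics, its values, and the output name chained are made up for illustration.

```r
library(dplyr)

# toy_biopics is a hypothetical stand-in for the real biopics data
toy_biopics <- data.frame(
  subject = c("Person A", "Person B"),
  year_release = c(2000, 2010),
  box_office = c(10, 20),
  stringsAsFactors = FALSE
)

chained <- mutate(toy_biopics,
                  # defined first ...
                  box_office_year = year_release * box_office,
                  # ... and used immediately in the next definition
                  box_office_subject = paste0(box_office_year, subject))
chained
```

The order matters: a variable can only be used after the line that defines it within the same mutate() call.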
3.9.1 Exercise
- Define another variable called box_office_y_s_num in the same mutate() statement by taking box_office_year and dividing it by number_of_subjects.
- Assign the output to mutatedBiopics.
- Hint: Add box_office_y_s_num = box_office_year / number_of_subjects to the statement below.
mutatedBiopics <- mutate(biopics,
                         box_office_year = year_release * box_office,
                         box_office_subject = paste0(box_office_year, subject),
                         box_office_y_s_num = box_office_year / number_of_subjects)
mutatedBiopics
3.10 Another Use for mutate()
What is this statement doing? Try it out in the console if you’re not sure.
3.11 The difference between filter() and mutate()
What is the difference between these two statements? Try them out in the console if you’re not sure.
3.12 What you learned in this chapter
dplyr::filter()
dplyr::mutate()