3  Introduction to {dplyr}

We’ve been looking at datasets that fit the ggplot2 paradigm nicely; however, most data we encounter is messy (it has missing values, for example) or comes in a completely different format.

In this chapter, we’ll look at one of the most powerful tools in the tidyverse: dplyr, which lets you manipulate data frames.

There is a function/action for most of the annoying tasks you have to do in data cleaning, which makes it super useful.

In particular, we’re going to look at six fundamental verbs/actions in dplyr over both of these chapters:

Along the way, we’ll do some data manipulation challenges. You’ll be a dplyr expert in no time!

You will want to keep the dplyr cheat sheet open in a separate window to remind you about the syntax.

Also, remember: if you need to know the variables in a data.frame called biopics you can always use
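For instance, base R's colnames() lists the variables in a data.frame. The sketch below runs it on a tiny made-up stand-in (toy_biopics is a hypothetical name with invented rows, not the real dataset); with the real data you would pass biopics instead.

```r
# Made-up stand-in for biopics; the real dataset has more rows and columns
toy_biopics <- data.frame(
  title = c("Film A", "Film B"),
  country = c("UK", "US"),
  year_release = c(1995, 2005)
)

# List the variable (column) names
colnames(toy_biopics)
```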

3.1 Learning about assignment

In order to do the following exercises, we need to learn a little bit about how to assign the output of a function to a variable.

For example, we can assign the output of the operation 1 + 2 to a variable called sumOfTwoNumbers using the <- operator. This is called the assignment operator.

You can also use = to assign a value to a variable, but I find it makes my code a bit confusing, because there is also ==, which tests for equality.

sumOfTwoNumbers <- 1 + 2

Once we have something assigned to a variable, we can use it in other expressions:

sumOfThreeNumbers <- sumOfTwoNumbers + 3

These are the bare basics of assignment. We’ll use them in the next exercises to evaluate the output of our dplyr cleaning.
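To see the difference between assigning and testing for equality side by side, here is a small sketch. The variable anotherSum is a hypothetical name added only for this illustration.

```r
# Assign with <-
sumOfTwoNumbers <- 1 + 2

# = also assigns at the top level, but is easy to confuse with ==
anotherSum = 1 + 2

# == tests for equality and returns TRUE or FALSE; nothing is assigned
sumOfTwoNumbers == anotherSum

# Reuse a variable in a later expression
sumOfThreeNumbers <- sumOfTwoNumbers + 3
sumOfThreeNumbers
```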

3.1.1 Exercise

  • Assign newValue the value of 10.
  • Then use newValue to calculate the value of multValue by calculating newValue * 5.
  • Show multValue.
Tip
##assign newValue
newValue <- 10
## use newValue to calculate multValue
multValue <- newValue * 5
##show multValue
multValue

3.2 Let’s look at some data and ways to manipulate it.

We’re going to use the biopics dataset in the fivethirtyeight package to learn dplyr. This is a dataset of 761 different biopic movies.

3.2.1 Exercise

  • Run a summary on the biopics dataset. It’s already loaded up for you.
  • How many categories are in the country variable? Use the levels() function to count the categories.
Tip
##run summary here
summary(biopics)
##show the country categories here
levels(biopics$country)
##count the categories with length()
length(levels(biopics$country))

3.3 dplyr::filter()

filter() is a very useful dplyr command. It allows you to subset a data.frame based on variable criteria.

For example, if we wanted to subset biopics to those movies that were made in the UK we’d use the following statement:
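A minimal sketch of such a statement, run on a made-up stand-in for biopics (toy_biopics and its rows are invented for illustration; with the real data you would write biopics in its place):

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows, same column name)
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  country = c("UK", "US", "UK")
)

# Keep only the rows where country is exactly "UK"
ukMovies <- filter(toy_biopics, country == "UK")
ukMovies
```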

Three things to note here:

  • The first argument to filter() is the dataset. We’ll see another variation of this in a second.
  • For those who are used to accessing data.frame variables by $, notice we don’t have to use biopics$country. Instead, we can just use the variable name country.
  • Our filter statement uses ==. Remember that == is an equality test, and = assigns something. (Confusing the two will happen to you from time to time.)

3.3.1 Exercise

  • Filter biopics so that it only contains Criminal movies (you’ll have to use the type_of_subject variable in biopics), and assign the result to crimeMovies.
  • Show how many rows are left using nrow(crimeMovies).
Tip
#add your filter statement here
crimeMovies <- filter(biopics, type_of_subject == "Criminal")
#show number of crime movies
nrow(crimeMovies)

3.4 Comparison operators and chaining comparisons

Let’s look at the following filter() statement:
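A statement of this kind might look like the sketch below. It uses a made-up stand-in for biopics, and the 1990 cutoff on year_release is an arbitrary value chosen for illustration.

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows)
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  year_release = c(1975, 1995, 2005, 2010),
  type_of_subject = c("Criminal", "Criminal", "Musician", "Criminal")
)

# Keep rows that satisfy BOTH conditions (1990 is an illustrative cutoff)
recentCrime <- filter(toy_biopics,
                      year_release > 1990 & type_of_subject == "Criminal")
recentCrime
```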

Three things to note:

  • We used the comparison operator >. The basic comparisons you’ll use are > (greater than), < (less than), == (equal to) and != (not equal to).
  • We also chained on another expression, type_of_subject == "Criminal" using & (and). The other chaining operator that you’ll use is |, which corresponds to OR.
  • Chaining expressions is where filter() becomes super powerful. However, it’s also the source of headaches, so you will need to carefully test your chain of expressions.

3.4.1 Exercise

  • Add another comparison to the chain using &. Use person_of_color == FALSE.
  • Show how many rows are left from your filter() statement.
Tip
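One possible shape for the answer, sketched on a made-up stand-in for biopics. The first two conditions and the 1990 cutoff are assumptions for illustration; the point is the extra & person_of_color == FALSE at the end of the chain.

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows)
toy_biopics <- data.frame(
  year_release = c(1975, 1995, 2005, 2010),
  type_of_subject = c("Criminal", "Criminal", "Musician", "Criminal"),
  person_of_color = c(FALSE, TRUE, FALSE, FALSE)
)

# A row is kept only if all three conditions are TRUE
filtered <- filter(toy_biopics,
                   year_release > 1990 &
                     type_of_subject == "Criminal" &
                     person_of_color == FALSE)

# Show how many rows are left
nrow(filtered)
```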

3.5 Quick Quiz about Chaining Comparisons

Which statement should be the larger subset? Try them out in the console if you’re not sure.



3.6 The %in% operator

What if you wanted to match any of several values? You can use the %in% operator. Here we put the values into a vector with the c() function, which concatenates the values together into a form that R can manipulate. Note that these values have to be exact and the case has to be the same (that is, “UK”, not “Uk” or “uk”) for the matching to work.
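As a sketch of how this works, the example below filters a made-up stand-in for biopics down to two countries. The rows are invented, and one of them is deliberately lowercase to show that it does not match.

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows)
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  country = c("UK", "US", "Canada", "uk")
)

# Keep rows whose country matches any value in the vector;
# "uk" is dropped because the matching is case-sensitive
ukOrUs <- toy_biopics %>%
  filter(country %in% c("UK", "US"))
ukOrUs
```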

3.6.1 Exercise

  • Pick out the Musician, Artist and Singer movies from type_of_subject.
  • Assign the output to biopicsArt.
Tip
biopicsArt <- biopics %>% 
  filter(type_of_subject %in% c("Musician", "Artist", "Singer"))

head(biopicsArt)

3.7 Removing Missing Values

One trick you can use filter() for is to remove missing values. Usually missing values are coded as NA in data. You can remove rows that contain NAs by using is.na(). For example:
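A minimal sketch on a made-up stand-in for biopics (the rows and box_office values are invented for illustration):

```r
library(dplyr)

# Made-up stand-in for biopics with one missing box_office value
toy_biopics <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  box_office = c(1200000, NA, 350000)
)

# is.na() flags the missing entries: FALSE TRUE FALSE
is.na(toy_biopics$box_office)

# Keep only the rows where box_office is NOT missing
withBoxOffice <- filter(toy_biopics, !is.na(box_office))
withBoxOffice
```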

Note the ! in front of is.na(box_office). This ! is known as the NOT operator. Basically, it switches the values in our is.na statement, making everything that was TRUE into FALSE, and everything FALSE into TRUE. We want to keep everything that is not NA, so that’s why we use the !.

3.7.1 Exercise

  • Filter biopics to remove the rows where box_office is NA, and assign the output to filteredBiopics.
  • Compare the number of rows in biopics to filteredBiopics.
  • How many missing values did we remove?
Tip
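One way to approach this, sketched on a made-up stand-in and assuming the missing values of interest are in box_office. The names toy_biopics and filteredToy are hypothetical; with the real data you would use biopics and filteredBiopics.

```r
library(dplyr)

# Made-up stand-in for biopics with two missing box_office values
toy_biopics <- data.frame(box_office = c(1200000, NA, 350000, NA, 90000))

# Drop the rows where box_office is missing
filteredToy <- filter(toy_biopics, !is.na(box_office))

# Rows removed = rows before minus rows after
nrow(toy_biopics) - nrow(filteredToy)

# Cross-check: count the NAs directly
sum(is.na(toy_biopics$box_office))
```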

3.8 dplyr::mutate()

mutate() is one of the most useful dplyr commands. You can use it to transform data (variables in your data.frame) and add it as a new variable into the data.frame. For example, let’s calculate the total box_office divided by the number_of_subjects to normalize our comparison as normalized_box_office:
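A sketch of this calculation on a made-up stand-in for biopics (rows and values are invented; toyNormalized is a hypothetical name):

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows)
toy_biopics <- data.frame(
  title = c("Film A", "Film B"),
  box_office = c(1000000, 600000),
  number_of_subjects = c(1, 2)
)

# Add a new column, computed row by row
toyNormalized <- mutate(toy_biopics,
                        normalized_box_office = box_office / number_of_subjects)
toyNormalized
```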

What did we do here? First, we used the mutate() function to add a new column into our data.frame called normalized_box_office. This new variable is calculated per row by dividing box_office by number_of_subjects.

3.8.1 Exercise

  • Try defining a new variable race_and_gender by pasting together subject_race and subject_sex, and assign the result to a new data.frame called biopics2.
  • Show the first few rows using head() so you can confirm that you added this new variable correctly.

Remember, you can use the paste() function to paste two strings together.

Tip
#assign new variable race_and_gender here using mutate()
biopics2 <- mutate(biopics, race_and_gender = paste(subject_race, subject_sex))
#show first rows of biopics2 using head()
head(biopics2)

3.9 You can use mutated variables right away!

The nifty thing about mutate() is that once you define the variables in the statement, you can use them right away, in the same mutate statement. For example, look at this code:
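The sketch below shows the pattern using the same two definitions that appear in this section's exercise, run on a made-up stand-in for biopics (the rows are invented and toyMutated is a hypothetical name):

```r
library(dplyr)

# Made-up stand-in for biopics (hypothetical rows)
toy_biopics <- data.frame(
  subject = c("Person A", "Person B"),
  year_release = c(1995, 2005),
  box_office = c(100, 200)
)

toyMutated <- mutate(toy_biopics,
                     # defined first ...
                     box_office_year = year_release * box_office,
                     # ... and used right away in the same call
                     box_office_subject = paste0(box_office_year, subject))
toyMutated
```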

Notice that we first defined box_office_year in the first part of the mutate() statement, and then used it right away to define a new variable, box_office_subject.

3.9.1 Exercise

  • Define another variable called box_office_y_s_num in the same mutate() statement by taking box_office_year and dividing it by number_of_subjects.
  • Assign the output to mutatedBiopics.
  • Hint: Add box_office_y_s_num=box_office_year/number_of_subjects to the statement below.
Tip
mutatedBiopics <- mutate(biopics, 
                         box_office_year = year_release * box_office, 
                         box_office_subject = paste0(box_office_year, subject), 
                         box_office_y_s_num = box_office_year/number_of_subjects)

mutatedBiopics

3.10 Another Use for mutate()

What is this statement doing? Try it out in the console if you’re not sure.



3.11 The difference between filter() and mutate()

What is the difference between these two statements? Try them out in the console if you’re not sure.



3.12 What you learned in this chapter

  • dplyr::filter()
  • dplyr::mutate()