4 More {dplyr}
We’re going to continue our journey with the dplyr package. In this chapter, we’re going to learn some really important things:
- The Pipe Operator
4.1 The Pipe Operator: %>%
We’re going to introduce another bit of dplyr syntax, the %>% operator. %>% is called a pipe operator.
You can think of it as being similar to the + in a ggplot2 statement.
What %>% does is take the output of one statement and make it the input of the next statement. When I’m describing it, I think of it as a “THEN”. For example, I read an expression that pipes biopics into filter() and then into mutate() as follows:
1. I took the biopics data, THEN
2. I filtered it down with the race_known == "Known" criteria, and THEN
3. I defined a new variable called poc_code with mutate().
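The three steps above can be sketched as a single piped expression. This is only an illustration: biopics_toy is a made-up stand-in for the real biopics data, and the definition of poc_code (converting person_of_color to a number) is an assumption, since the text does not say how it is computed.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows, not the real dataset)
biopics_toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  race_known = c("Known", "Unknown", "Known"),
  person_of_color = c(TRUE, FALSE, FALSE)
)

biopics_coded <- biopics_toy %>%                  # 1: take the data, THEN
  filter(race_known == "Known") %>%               # 2: keep rows where race is known, THEN
  mutate(poc_code = as.numeric(person_of_color))  # 3: add poc_code (assumed definition)

biopics_coded
```

Reading each %>% aloud as "THEN" gives exactly the three numbered steps.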
Note that filter() doesn’t have a data argument, because the data is piped into filter(). Same thing for mutate(). This takes some getting used to, but the thing to remember is:
dplyr commands expect a data.frame as input, and return a data.frame as output.
If our dplyr command outputs a data.frame, then we can chain it to other commands.
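To make the "data is piped in" idea concrete, here is a small sketch comparing a call with an explicit data argument to the piped form. The data frame biopics_toy and its values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  year_release = c(1985, 2003, 2010)
)

# Without the pipe: the data frame is passed explicitly as the first argument
recent_explicit <- filter(biopics_toy, year_release >= 2000)

# With the pipe: biopics_toy is supplied as that first argument for us
recent_piped <- biopics_toy %>% filter(year_release >= 2000)

# Both forms should give the same data frame
all.equal(recent_explicit, recent_piped)
```

Because both versions return a data.frame, either result could be passed on to another dplyr verb.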
%>% allows you to chain multiple verbs in the tidyverse. It’s one of the most powerful things about the tidyverse.
In fact, having a standardized chain of processing actions is called a pipeline. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format.
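One way to reuse a pipeline on new data with the same format is to wrap it in a function. The sketch below assumes a hypothetical helper, clean_box_office(), and two invented batches of data; none of these names come from the biopics dataset itself.

```r
library(dplyr)

# Hypothetical helper that wraps a fixed chain of steps as a reusable pipeline
clean_box_office <- function(df) {
  df %>%
    filter(!is.na(box_office)) %>%                  # drop rows with missing box office
    mutate(box_office_millions = box_office / 1e6)  # add a rescaled column
}

# Two made-up batches of data that share the same columns
batch_1 <- data.frame(title = c("Film A", "Film B"), box_office = c(2e6, NA))
batch_2 <- data.frame(title = c("Film C", "Film D"), box_office = c(5e5, 3e6))

clean_box_office(batch_1)
clean_box_office(batch_2)
```

The same steps run on both batches, so any later plotting code can rely on the output having the same columns.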
|>?
You might have seen mentions of the native pipe, which is written as |> instead of %>%. This is because the pipe became so popular in the {tidyverse} that the main R developers implemented their own version.
Keep in mind that they are interchangeable, for the most part.
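As a quick check of that interchangeability, the sketch below runs the same filter with both pipes on an invented data frame. It assumes R 4.1 or later, where the native pipe is available.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  year_release = c(1985, 2003, 2010)
)

with_magrittr <- biopics_toy %>% filter(year_release >= 2000)
with_native   <- biopics_toy |> filter(year_release >= 2000)  # needs R >= 4.1

# Both pipes should produce the same result for a simple chain like this
all.equal(with_magrittr, with_native)
```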
4.1.1 Exercise
- Use %>% to chain biopics into a filter() to filter (country == "US")
4.2 group_by()/summarize()
group_by() doesn’t do anything visible by itself. But when combined with summarize(), you can calculate metrics (such as mean, max for the maximum, min, and sd for the standard deviation) across groups.
For example, suppose we want to calculate the mean box_office by country. In order to do that, we first need to remove any rows that have NA values in box_office that may confound our calculation.
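The calculation described above can be sketched as follows. The rows in biopics_toy are invented, and the summary column name mean_bo is just an illustrative choice.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  country = c("US", "US", "UK", "UK"),
  box_office = c(10, 30, NA, 8)
)

mean_by_country <- biopics_toy %>%
  filter(!is.na(box_office)) %>%         # remove rows with missing box_office first
  group_by(country) %>%                  # define the groups
  summarize(mean_bo = mean(box_office))  # one summary row per group

mean_by_country
```

Without the filter() step, mean() would return NA for any country that has a missing box_office value.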
Let’s ask a tough question. Is there a difference in mean box_office between the two subject_sex categories?
4.2.1 Exercise
First use filter() to remove the NA values. Then, use group_by() and summarize() to calculate the mean box_office by subject_sex, naming the summary variable mean_bo_by_gender. Assign the output to gender_box_office.
4.3 Counting Stuff
What does the following code do? Try it out below!
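As a sketch of what counting with dplyr can look like, the code below uses count() on an invented data frame. This is an assumption about the kind of counting meant here and may differ from the code this section originally showed.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  country = c("US", "UK", "US")
)

# count() tallies the number of rows for each value of country (in a column named n)
country_counts <- biopics_toy %>%
  count(country)

# Roughly the same result written with group_by() and summarize()
country_counts_long <- biopics_toy %>%
  group_by(country) %>%
  summarize(n = n())

country_counts
```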
4.4 arrange()
arrange() lets you sort by a variable. If you provide multiple variables, the data is sorted by the first variable, and then by each following variable within the groups of tied values. For example, a statement that lists country and then year_release will sort the data by country first, and then within each country category, it will sort by year_release.
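A minimal sketch of that two-level sort, using an invented data frame in place of the real biopics data:

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  country = c("US", "UK", "US", "UK"),
  year_release = c(2005, 1999, 1990, 2010)
)

# Sort by country first, then by year_release within each country
sorted_toy <- biopics_toy %>%
  arrange(country, year_release)

sorted_toy
```

Wrapping a variable in desc(), as in arrange(desc(year_release)), sorts it from largest to smallest instead.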
4.4.1 Exercise
Sort biopics by year_release then by country. Assign the output to biopics_sorted.
4.5 select()
The final verb we’ll learn is select(). select() allows you to:
- extract columns,
- reorder columns or
- remove columns from your data, as well as
- rename columns in your data.
For example, a select() call can extract just two columns (title, box_office) and rename title to movieTitle at the same time.
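Extracting and renaming in one step might look like the sketch below. The rows are invented; only the column names title, box_office and movieTitle come from the text.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  title = c("Film A", "Film B"),
  country = c("US", "UK"),
  box_office = c(2e6, 5e5)
)

# Keep title and box_office, renaming title to movieTitle as we go
two_columns <- biopics_toy %>%
  select(movieTitle = title, box_office)

names(two_columns)
```

Inside select(), the new name goes on the left of the = and the existing column on the right.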
4.5.1 Exercise
Use select() to extract the following variables: title (rename it movieTitle), box_office and subject_sex, and assign them to a new table called threeVarTable.
4.6 Chester Ismay’s Mantra
What is the difference between select() and filter()?
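One way to explore that question is to apply both verbs to the same small table and compare the shapes of the results. The data below is invented for illustration.

```r
library(dplyr)

# Toy stand-in for the biopics data (made-up rows): 3 rows, 3 columns
biopics_toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  country = c("US", "UK", "US"),
  box_office = c(2e6, 5e5, 3e6)
)

# filter() keeps a subset of rows; all columns remain
rows_kept <- biopics_toy %>% filter(box_office > 1e6)

# select() keeps a subset of columns; all rows remain
cols_kept <- biopics_toy %>% select(title, box_office)

dim(rows_kept)  # rows change, columns do not
dim(cols_kept)  # columns change, rows do not
```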
4.7 Putting it all together
Now here comes the fun part: chaining dplyr verbs together to accomplish some data cleaning and transformation.
For a reference while you work, you can use the dplyr cheatsheet here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
4.7.1 Exercise
- For the biopics data, filter() the data so that we only cover movies from 2000 to 2014. (year_release is the variable you want.)
- Filter out the NAs in box_office.
- Then use mutate() to code a new variable, box_office_per_subject. (The two variables you need here are box_office and number_of_subjects.)
- Assign this statement to biopics_new.
- Run summary() on biopics_new to confirm that your statement worked.
4.8 Challenge 2: Show your stuff
Answer the question: Do movies where the race is known (race_known == TRUE) make more money than movies where the race is not known (race_known == FALSE), grouped by country? Which race_known/country combination made the highest amount of money?
4.8.1 Exercise
- You’ll need to do a filter step first to remove NA values from box_office before you do anything.
- Then think of what variables you need to group_by.
- Finally, figure out what you need to summarize (assign the value to mean_box_office) and arrange on (don’t forget to use desc!).
- Assign the output to race_country_box_office.
- Show race_country_box_office.
race_country_box_office <- biopics %>%
  filter(!is.na(box_office)) %>%
  group_by(race_known, country) %>%
  summarize(mean_box_office = mean(box_office)) %>%
  arrange(desc(mean_box_office))
race_country_box_office
4.9 Challenge 3: Putting together what we know about {ggplot2} and {dplyr}
Now we’re cooking with fire. You can directly pipe the output of a dplyr pipeline into a ggplot2 statement.
When you do this, note that we use %>% to pipe our statement into the ggplot() function. The tricky thing to remember is that everything after the ggplot() is connected with +, and not %>%.
Also note: we don’t assign a data variable in the ggplot() statement. We are piping in the data.
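A minimal sketch of piping a dplyr result into ggplot(), assuming the ggplot2 package is installed and using an invented data frame in place of biopics:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the biopics data (made-up rows)
biopics_toy <- data.frame(
  year_release = c(1985, 1992, 2003, 2010),
  box_office = c(1.2e6, NA, 4.5e7, 9.8e6)
)

box_office_plot <- biopics_toy %>%
  filter(!is.na(box_office)) %>%                   # dplyr steps are joined with %>%
  ggplot(aes(x = year_release, y = box_office)) +  # no data argument: it is piped in
  geom_point()                                     # ggplot2 layers are joined with +

box_office_plot
```

Swapping the + for a %>% after ggplot() is a common slip, and it results in an error rather than a plot.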
Are you sick of biopics yet? I promise this is the last time we use this dataset.
4.9.1 Exercise
- First, filter biopics to have year_release < 1990 and remove NA values.
- Then pipe that into a ggplot() statement that plots an x-y plot of box_office (use geom_point()) where x=year_release and y=log(box_office).
- Color the points by person_of_color.
- Assign the output to bPlot and print it to the screen using print(bPlot).
bPlot <- biopics %>%
  filter(year_release < 1990) %>%
  filter(!is.na(box_office)) %>%
  ggplot(aes(x = year_release, y = log(box_office),
             color = person_of_color)) +
  geom_point()
print(bPlot)
4.10 What you learned in this chapter
- How to use %>% (the pipe)
- dplyr::group_by()/dplyr::summarize()
- dplyr::arrange()
- dplyr::select()
- How to put it all together!
Good job making it through this chapter! You’re well on your way to becoming a tidyverse ninja!
More Resources
- The Data Transformation chapter of R for Data Science is another great place to learn about the basics of dplyr.
- The Pipes chapter of R for Data Science has a great discussion on why you should consider using pipes in your workflows.