4  More {dplyr}

We’re going to continue our journey with the dplyr package. In this chapter, we’re going to learn some really important things: the pipe operator (%>%), group_by()/summarize(), count(), arrange(), and select().

4.1 The Pipe Operator: %>%

We’re going to introduce another bit of dplyr syntax, the %>% operator. %>% is called a pipe operator.

You can think of it as being similar to the + in a ggplot2 statement.

%>% takes the output of one statement and makes it the input of the next statement. When I’m describing it, I think of it as a “THEN”. For example, I read the following expression as:

biopics %>%
    filter(race_known == "Known") %>%
    mutate(poc_code = as.numeric(person_of_color))

  1. I took the biopics data, THEN
  2. I filtered it down with the race_known == "Known" criteria, and THEN
  3. I defined a new variable called poc_code with mutate().

Note that filter() doesn’t have a data argument, because the data is piped into filter(). Same thing for mutate(). This takes some getting used to, but the thing to remember is:

dplyr commands expect a data.frame as input and return a data.frame as output.

If our dplyr command outputs a data.frame, then we can chain it to other commands.
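To make the chaining idea concrete, here is a minimal sketch that writes the same two steps with and without the pipe. The toy_biopics table is a made-up stand-in for biopics (its rows are assumptions for illustration, not the real data).

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    title = c("A", "B", "C", "D"),
    race_known = c("Known", "Unknown", "Known", "Known"),
    person_of_color = c(TRUE, FALSE, FALSE, TRUE)
)

# Without the pipe: the calls are nested and read from the inside out.
nested <- mutate(
    filter(toy_biopics, race_known == "Known"),
    poc_code = as.numeric(person_of_color)
)

# With the pipe: the same steps, read from top to bottom.
piped <- toy_biopics %>%
    filter(race_known == "Known") %>%
    mutate(poc_code = as.numeric(person_of_color))

# Both versions should describe the same table.
all.equal(nested, piped)
```

In the piped version, the data.frame on the left of %>% becomes the first argument of the function on the right, which is why filter() and mutate() are called without a data argument.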

%>% allows you to chain multiple verbs in the tidyverse. It’s one of the most powerful things about the tidyverse.

In fact, having a standardized chain of processing actions is called a pipeline. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format.

What about |>?

You might have seen mentions of the native pipe, which is written |> instead of %>%. The pipe became so popular in the {tidyverse} that the main R developers implemented their own version in base R (available since R 4.1.0).

Keep in mind that they are interchangeable, for the most part.
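As a small sketch of that interchangeability, the two pipes below run the same filter on a made-up stand-in for biopics (toy_biopics and its values are assumptions). The native pipe needs R 4.1.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    title = c("A", "B", "C"),
    year_release = c(1985, 2003, 2012)
)

# magrittr pipe, loaded with dplyr
with_magrittr <- toy_biopics %>%
    filter(year_release >= 2000)

# native pipe, built into R (4.1.0 or later)
with_native <- toy_biopics |>
    filter(year_release >= 2000)

# Both should return the same two rows.
all.equal(with_magrittr, with_native)
```

One place where they differ is the placeholder: %>% uses a dot (.), while |> uses an underscore (_), which requires R 4.2.0 or later and must be passed to a named argument.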

4.1.1 Exercise

  • Use %>% to pipe biopics into filter() and keep only the rows where country == "US".

4.2 group_by()/summarize()

group_by() doesn’t change the data by itself; it only marks which rows belong to which group. But when combined with summarize(), you can calculate metrics (such as mean(), max() for the maximum, min(), and sd() for the standard deviation) within each group. For example:

Here we want to calculate the mean box_office by country. However, in order to do that, we first need to remove any rows that have NA values in box_office that may confound our calculation.
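A minimal sketch of this calculation, using a made-up stand-in for biopics. The table toy_biopics, its values, and the names mean_bo_by_country and mean_bo are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    country = c("US", "US", "UK", "UK", "US"),
    box_office = c(10, 30, 5, NA, NA)
)

mean_bo_by_country <- toy_biopics %>%
    filter(!is.na(box_office)) %>%          # drop rows with a missing box_office
    group_by(country) %>%                   # one group per country
    summarize(mean_bo = mean(box_office))   # one summary row per group

mean_bo_by_country
```

The filter() step matters because mean() returns NA as soon as any value in the group is NA. An alternative is to skip the filter and call mean(box_office, na.rm = TRUE) inside summarize().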

Let’s ask a tough question. Is there a difference in mean box_office between the two subject_sex categories?

4.2.1 Exercise

First use filter() to remove the NA values. Then, use group_by() and summarize() to calculate the mean box_office by subject_sex, naming the summary variable as mean_bo_by_gender. Assign the output to gender_box_office.

4.3 Counting Stuff

What does the following code do? Try it out below!



count() is a handy verb

There are a lot of specialized verbs in the tidyverse, but count() comes in handy. You can do the above code with a single command:
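A minimal sketch of the comparison, assuming the longer version is a group_by()/summarize() that counts rows with n(). The table toy_biopics and the choice of country as the grouping variable are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    country = c("US", "US", "UK", "US", "UK")
)

# The long way: group, then count the rows in each group with n().
by_hand <- toy_biopics %>%
    group_by(country) %>%
    summarize(n = n())

# The short way: count() does the grouping and counting in one step.
with_count <- toy_biopics %>%
    count(country)

all.equal(by_hand, with_count)
```

count() names its result column n by default, which is why the hand-written version uses n = n() to match.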

4.4 arrange()

arrange() lets you sort by a variable, in ascending order by default. If you provide multiple variables, the data is sorted by the first variable, and rows that tie on it are then sorted by the next variable. For example:

This statement will sort the data by country first, and then within each country category, it will sort by year_release.
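A minimal sketch of such a statement, run on a made-up stand-in for biopics (toy_biopics and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    title = c("A", "B", "C", "D"),
    country = c("US", "UK", "US", "UK"),
    year_release = c(2010, 1995, 1980, 2005)
)

# Sort by country first, then by year_release within each country.
sorted <- toy_biopics %>%
    arrange(country, year_release)

sorted
```

Wrapping a variable in desc(), as in arrange(country, desc(year_release)), reverses the order for that variable only.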

4.4.1 Exercise

Sort biopics by year_release then by country. Assign the output to biopics_sorted.

4.5 select()

The final verb we’ll learn is select(). select() allows you to:

  1. extract columns,
  2. reorder columns or
  3. remove columns from your data, as well as
  4. rename columns.

For example, look at the following code:

Here, we’re just extracting two columns (title and box_office). Notice we also renamed title to movieTitle.
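A minimal sketch of that kind of select() call, using a made-up stand-in for biopics (toy_biopics, its values, and the result names two_cols and without_country are assumptions for illustration):

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    title = c("A", "B", "C"),
    country = c("US", "UK", "US"),
    box_office = c(10, 5, 30)
)

# Extract two columns, renaming title to movieTitle on the way.
two_cols <- toy_biopics %>%
    select(movieTitle = title, box_office)

# Remove a column by putting a minus sign in front of its name.
without_country <- toy_biopics %>%
    select(-country)

names(two_cols)
names(without_country)
```

The order of the names inside select() is also the order of the columns in the output, which is how select() reorders columns.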

4.5.1 Exercise

Use select() to extract the following variables: title (rename it movieTitle), box_office, and subject_sex. Assign the result to a new table called threeVarTable.

4.6 Chester Ismay’s Mantra

What is the difference between select() and filter()?
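One way to see the difference: select() picks columns, while filter() picks rows. A small sketch on a made-up table (toy_biopics and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    title = c("A", "B", "C"),
    country = c("US", "UK", "US"),
    box_office = c(10, 5, 30)
)

# select() keeps every row but only the named columns.
cols_only <- toy_biopics %>%
    select(title, box_office)

# filter() keeps every column but only the rows that meet the condition.
rows_only <- toy_biopics %>%
    filter(country == "US")

dim(cols_only)   # rows, then columns
dim(rows_only)
```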




4.7 Putting it all together

Now here comes the fun part: chaining dplyr verbs together to accomplish some data cleaning and transformation.

For a reference while you work, you can use the dplyr cheatsheet here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

4.7.1 Exercise

  1. For the biopics data, filter() the data so that we only cover movies from 2000 to 2014. (year_release is the variable you want.)
  2. Filter out the NAs in box_office.
  3. Then use mutate() to code a new variable, box_office_per_subject. (The two variables you need here are box_office and number_of_subjects.)
  4. Assign this statement to biopics_new.
  5. Run summary() on biopics_new to confirm that your statement worked.

4.8 Challenge 2: Show your stuff

Answer the question: Within each country, do movies where the subject’s race is known (race_known == TRUE) make more money than movies where it is not known (race_known == FALSE)? Which race_known/country combination made the highest amount of money?

4.8.1 Exercise

  • You’ll need to do a filter step first to remove NA values from box_office before you do anything.
  • Then think of what variables you need to group_by.
  • Finally, figure out what you need to summarize (assign the value to mean_box_office) and what to arrange on (don’t forget to use desc()!).
  • Assign the output to race_country_box_office.
  • Show race_country_box_office.
Tip
race_country_box_office <- biopics %>%
    filter(!is.na(box_office)) %>%
    group_by(race_known, country) %>%
    summarize(mean_box_office=mean(box_office)) %>%
    arrange(desc(mean_box_office))

race_country_box_office

4.9 Challenge 3: Putting together what we know about {ggplot2} and {dplyr}

Now we’re cooking with fire. You can directly pipe the output of a dplyr pipeline into a ggplot2 statement. For example:

Note that we use %>% to pipe our statement into the ggplot() function. The tricky thing to remember is that everything after the ggplot() is connected with +, and not %>%.

Also note: we don’t assign a data variable in the ggplot() statement. We are piping in the data.
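A minimal sketch of a dplyr pipeline feeding a plot, using a made-up stand-in for biopics (toy_biopics, its values, and the name toy_plot are assumptions for illustration):

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for biopics; these rows are invented for illustration.
toy_biopics <- tibble(
    year_release = c(1985, 1992, 2001, 2010, 2013),
    box_office = c(1e6, NA, 5e6, 2e7, 8e6),
    subject_sex = c("Male", "Female", "Male", "Female", "Male")
)

toy_plot <- toy_biopics %>%
    filter(!is.na(box_office)) %>%    # dplyr steps are joined with %>%
    ggplot(aes(x = year_release, y = box_office, color = subject_sex)) +
    geom_point()                      # ggplot2 layers are joined with +

print(toy_plot)
```

The filtered table arrives as the first argument of ggplot(), so the call starts directly with aes().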

Are you sick of biopics yet? I promise this is the last time we use this dataset.

4.9.1 Exercise

  1. First, filter biopics to have year_release < 1990 and remove the NA values in box_office.
  2. Then pipe that into a ggplot() statement that plots an x-y plot of box_office (use geom_point()) where x=year_release and y=log(box_office).
  3. Color the points by person_of_color.
  4. Assign the output to bPlot and print it to the screen using print(bPlot).
Tip
bPlot <- biopics %>%
    filter(year_release < 1990) %>%
    filter(!is.na(box_office)) %>%
    ggplot(aes(x = year_release, y = log(box_office),
               color = person_of_color)) +
    geom_point()

print(bPlot)

4.10 What you learned in this chapter

  • How to use %>% (the pipe)
  • dplyr::group_by()/dplyr::summarize()
  • dplyr::arrange()
  • dplyr::select()
  • How to put it all together!

Good job making it through this chapter! You’re well on your way to becoming a tidyverse ninja!

More Resources

  • The Data Transformation chapter of R for Data Science is another great place to learn about the basics of dplyr.
  • The Pipes chapter of R for Data Science has a great discussion on why you should consider using pipes in your workflows.