Join Data with dplyr

nycflights13

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Visualizing a raw tibble is not that pretty. Let’s wrap the next output in a DT::datatable() function to make it more readable:

DT::datatable(airlines)

mutating joins

band |> left_join(instrument, by = "name")
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
band |> right_join(instrument, by = "name")
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar
band |> full_join(instrument, by = "name")
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar
band |> inner_join(instrument, by = "name")
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Your Turn 1

Which airlines had the largest arrival delays? Complete the code below.

  1. Join airlines to flights
  2. Compute and order the average arrival delays by airline. Display full names, no codes.

(Hint: Be sure to remove each _ before turning eval to true)

flights |>
  filter(!is.na(arr_delay)) |>
  __________________  |>
  group_by(_________) |>
  __________________  |>
  arrange(__________) 

Your Turn 2

Join flights and airports by dest and faa.

Then for each name, compute the distance from NYC and the average arr_delay. Hint: use first() to get the first value of distance.

Order by average delay, worst to best.

(Hint: Be sure to remove each _ before turning eval to true)

flights |> 
  filter(!is.na(arr_delay)) |>
  ______(airports, __________________) |>
  group_by(name) |>
  _________(distance = _____________, 
               delay = _____________) |>
  ______(_____(delay))

filtering joins

band |> semi_join(instrument, by = "name")
# A tibble: 2 × 2
  name  band   
  <chr> <chr>  
1 John  Beatles
2 Paul  Beatles
band |> anti_join(instrument, by = "name")
# A tibble: 1 × 2
  name  band  
  <chr> <chr> 
1 Mick  Stones

Your Turn 3

How many airports in airports are serviced by flights in flights? (i.e. how many places can you fly to direct from New York?)

Notice that the column to join on is named faa in the airports data set and dest in the flights data set.

(Hint: Be sure to remove each _ before turning eval to true)

__________ |>
 _____________________________ |>
  select(faa)

Take aways

  • left_join() retains all cases in left data set

  • right_join() retains all cases in right data set

  • full_join() retains all cases in either data set

  • inner_join() retains only cases in both data sets

  • semi_join() extracts cases that have a match

  • anti_join() extracts cases that do not have a match