nycflights13
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Visualizing a raw tibble is not that pretty. Let’s wrap the next output in a DT::datatable() function to make it more readable:
mutating joins
band |> left_join(instrument, by = "name")
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
band |> right_join(instrument, by = "name")
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
band |> full_join(instrument, by = "name")
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
band |> inner_join(instrument, by = "name")
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
Your Turn 1
Which airlines had the largest arrival delays? Complete the code below.
- Join
airlines to flights
- Compute and order the average arrival delays by airline. Display full names, no codes.
(Hint: Be sure to remove each _ before turning eval to true)
flights |>
filter(!is.na(arr_delay)) |>
__________________ |>
group_by(_________) |>
__________________ |>
arrange(__________)
Your Turn 2
Join flights and airports by dest and faa.
Then for each name, compute the distance from NYC and the average arr_delay. Hint: use first() to get the first value of distance.
Order by average delay, worst to best.
(Hint: Be sure to remove each _ before turning eval to true)
flights |>
filter(!is.na(arr_delay)) |>
______(airports, __________________) |>
group_by(name) |>
_________(distance = _____________,
delay = _____________) |>
______(_____(delay))
filtering joins
band |> semi_join(instrument, by = "name")
# A tibble: 2 × 2
name band
<chr> <chr>
1 John Beatles
2 Paul Beatles
band |> anti_join(instrument, by = "name")
# A tibble: 1 × 2
name band
<chr> <chr>
1 Mick Stones
Your Turn 3
How many airports in airports are serviced by flights in flights? (i.e. how many places can you fly to direct from New York?)
Notice that the column to join on is named faa in the airports data set and dest in the flights data set.
(Hint: Be sure to remove each _ before turning eval to true)
__________ |>
_____________________________ |>
select(faa)
Take aways
left_join() retains all cases in left data set
right_join() retains all cases in right data set
full_join() retains all cases in either data set
inner_join() retains only cases in both data sets
semi_join() extracts cases that have a match
anti_join() extracts cases that do not have a match