nycflights13
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Visualizing a raw tibble is not that pretty. Let’s wrap the next output in a DT::datatable()
function to make it more readable:
mutating joins
band |> left_join(instrument, by = "name")
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
band |> right_join(instrument, by = "name")
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
band |> full_join(instrument, by = "name")
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
band |> inner_join(instrument, by = "name")
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
Your Turn 1
Which airlines had the largest arrival delays? Complete the code below.
- Join
airlines
to flights
- Compute and order the average arrival delays by airline. Display full names, no codes.
(Hint: Be sure to remove each _
before turning eval to true)
flights |>
filter(!is.na(arr_delay)) |>
__________________ |>
group_by(_________) |>
__________________ |>
arrange(__________)
Your Turn 2
Join flights
and airports
by dest
and faa
.
Then for each name
, compute the distance
from NYC and the average arr_delay
. Hint: use first()
to get the first value of distance.
Order by average delay, worst to best.
(Hint: Be sure to remove each _
before turning eval to true)
flights |>
filter(!is.na(arr_delay)) |>
______(airports, __________________) |>
group_by(name) |>
_________(distance = _____________,
delay = _____________) |>
______(_____(delay))
filtering joins
band |> semi_join(instrument, by = "name")
# A tibble: 2 × 2
name band
<chr> <chr>
1 John Beatles
2 Paul Beatles
band |> anti_join(instrument, by = "name")
# A tibble: 1 × 2
name band
<chr> <chr>
1 Mick Stones
Your Turn 3
How many airports in airports
are serviced by flights in flights
? (i.e. how many places can you fly to direct from New York?)
Notice that the column to join on is named faa
in the airports data set and dest
in the flights data set.
(Hint: Be sure to remove each _
before turning eval to true)
__________ |>
_____________________________ |>
select(faa)
Take aways
left_join()
retains all cases in left data set
right_join()
retains all cases in right data set
full_join()
retains all cases in either data set
inner_join()
retains only cases in both data sets
semi_join()
extracts cases that have a match
anti_join()
extracts cases that do not have a match