Transform Data

Import Data

Your Turn 1

Make sure that the .csv file is in the same directory as the .qmd file you’re currently working on. Then import the babynames.csv data set. Give it the name babynames. Then copy the import code into the code chunk below. Does it run?

babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

dplyr

Your Turn 2

Alter the code to select just the n column:

select(babynames, name, prop)
# A tibble: 1,924,665 × 2
   name        prop
   <chr>      <dbl>
 1 Mary      0.0724
 2 Anna      0.0267
 3 Emma      0.0205
 4 Elizabeth 0.0199
 5 Minnie    0.0179
 6 Margaret  0.0162
 7 Ida       0.0151
 8 Alice     0.0145
 9 Bertha    0.0135
10 Sarah     0.0132
# ℹ 1,924,655 more rows

Quiz

Which of these is NOT a way to select the name and n columns together?

select(babynames, -c(year, sex, prop))
# A tibble: 1,924,665 × 2
   name          n
   <chr>     <int>
 1 Mary       7065
 2 Anna       2604
 3 Emma       2003
 4 Elizabeth  1939
 5 Minnie     1746
 6 Margaret   1578
 7 Ida        1472
 8 Alice      1414
 9 Bertha     1320
10 Sarah      1288
# ℹ 1,924,655 more rows
select(babynames, name:n)
# A tibble: 1,924,665 × 2
   name          n
   <chr>     <int>
 1 Mary       7065
 2 Anna       2604
 3 Emma       2003
 4 Elizabeth  1939
 5 Minnie     1746
 6 Margaret   1578
 7 Ida        1472
 8 Alice      1414
 9 Bertha     1320
10 Sarah      1288
# ℹ 1,924,655 more rows
select(babynames, starts_with("n"))
# A tibble: 1,924,665 × 2
   name          n
   <chr>     <int>
 1 Mary       7065
 2 Anna       2604
 3 Emma       2003
 4 Elizabeth  1939
 5 Minnie     1746
 6 Margaret   1578
 7 Ida        1472
 8 Alice      1414
 9 Bertha     1320
10 Sarah      1288
# ℹ 1,924,655 more rows
select(babynames, ends_with("n"))
# A tibble: 1,924,665 × 1
       n
   <int>
 1  7065
 2  2604
 3  2003
 4  1939
 5  1746
 6  1578
 7  1472
 8  1414
 9  1320
10  1288
# ℹ 1,924,655 more rows

Your Turn 3

Use filter, babynames, and the logical operators to find:

  • All of the names where prop is greater than or equal to 0.08
  • All of the children named “Sea”

Your Turn 4

Use Boolean operators to return only the rows that contain:

  • Boys named Sue
  • Names that were used by exactly 5 or 6 children in 1880
  • Names that are one of Acura, Lexus, or Yugo

Help Me

What is the smallest value of n? What is the largest?

Your Turn 5

Use |> to write a sequence of functions that:

  1. Filters babynames to just the girls that were born in 2017, then…
  2. Selects the name and n columns, then…
  3. Arranges the results so that the most popular names are near the top.

Your Turn 6

  1. Trim babynames to just the rows that contain your name and your sex
  2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)
  3. Plot the results as a line graph with year on the x axis and prop on the y axis

Your Turn 7

Cpmplete the code below to extract the rows where name == "Khaleesi". Then use summarise() and sum() and min() to find:

  1. The total number of children named Khaleesi
  2. The first year Khaleesi appeared in the data

Make sure to turn the eval option to true before running the code.

(Hint: Be sure to remove each _ before running the code)

babynames ___ 
  filter(_______________________) ___
  ___________(total = ________, first = _______)

Your Turn 8

Use group_by(), summarise(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name.

Make sure to turn the eval option to true before running the code.

(Hint: Be sure to remove each _ before running the code)

babynames |> 
  _______(name, sex) |> 
  _______(total = _____(n)) |> 
  _______(desc(_____))

Your Turn 9

Use group_by() to calculate and then plot the total number of children born each year over time.

Your Turn 10

Use mutate() and min_rank()to rank each row in babynames from largest n to lowest n.

Make sure to turn the eval option to true before running the code.

(Hint: Be sure to remove each _ before running the code)

babynames |> 
  ______(rank = _______(____(prop)))

Your Turn 11

Group babynames by year and then re-rank the data. Filter the results to just rows where rank == 1.


Take aways

  • Extract variables with select()

  • Extract cases with filter()

  • Arrange cases, with arrange()

  • Make tables of summaries with summarise()

  • Make new variables, with mutate()

  • Do groupwise operations with group_by()

  • Connect operations with |>