Tidy Data

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
table5
# A tibble: 6 × 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

tidyr

Your Turn 1

On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: country, year, and n.

Your Turn 2

Use pivot_longer() to reorganize table4a into three columns: country, year, and cases.
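As a parallel illustration (not the exercise answer), here is a minimal sketch of the same kind of reshaping applied to table4b, the population table that ships with tidyr. The names long4b and "population" are choices made for this example only.

```r
library(tidyr)

# table4b has one row per country and one column per year,
# so the column names `1999` and `2000` are really values of a "year" variable.
long4b <- table4b |>
  pivot_longer(
    cols = c(`1999`, `2000`),  # columns to stack
    names_to = "year",         # old column names land here (as character)
    values_to = "population"   # old cell values land here
  )

long4b
```

Note that year comes back as a character column, because it was built from column names. If you need numbers, one option is to convert it afterwards, for example with as.numeric().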

Your Turn 3

On a sheet of paper, draw how this data set would look if it had the same values grouped into three columns: city, large, and small.

Your Turn 4

Use pivot_wider() to reorganize table2 into four columns: country, year, cases, and population.
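For comparison (again, not the exercise answer), the sketch below runs pivot_wider() in the opposite direction of the previous exercise: it starts from the tidy table1 and spreads the years back out into columns, which should give a layout like table4a. The name wide_cases is introduced only for this example.

```r
library(dplyr)
library(tidyr)

# Keep the identifying column, the column that supplies the new names,
# and the column that supplies the new values.
wide_cases <- table1 |>
  select(country, year, cases) |>
  pivot_wider(
    names_from = year,    # each distinct year becomes a column
    values_from = cases   # cells are filled from cases
  )

wide_cases
```

Any column not named in names_from or values_from is treated as an identifier for the row, which is why population is dropped first.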

who

Your Turn 5

Use pivot_longer() to gather the 5th through 60th columns of who into a pair of name/value columns named codes and n.

Then select just the country, year, codes, and n variables.

who
# A tibble: 7,240 × 60
   country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
   <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
 1 Afghani… AF    AFG    1980          NA           NA           NA           NA
 2 Afghani… AF    AFG    1981          NA           NA           NA           NA
 3 Afghani… AF    AFG    1982          NA           NA           NA           NA
 4 Afghani… AF    AFG    1983          NA           NA           NA           NA
 5 Afghani… AF    AFG    1984          NA           NA           NA           NA
 6 Afghani… AF    AFG    1985          NA           NA           NA           NA
 7 Afghani… AF    AFG    1986          NA           NA           NA           NA
 8 Afghani… AF    AFG    1987          NA           NA           NA           NA
 9 Afghani… AF    AFG    1988          NA           NA           NA           NA
10 Afghani… AF    AFG    1989          NA           NA           NA           NA
# ℹ 7,230 more rows
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …

Your Turn 6

Separate the sexage column into sex and age columns.

(Hint: Replace the line of underscores with your own code before running the chunk, and switch the chunk's eval option to true.)

who |> 
  pivot_longer(cols = 5:60, names_to = "codes", values_to = "n") |> 
  select(-iso2, -iso3) |> 
  separate(codes, c("new", "type", "sexage"), sep = "_") |> 
  select(-new) |> 
  _______________________________
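As a hint on how separate() behaves, here is a small sketch on table3 rather than who. It assumes the two common forms of the sep argument: a string to split on, or an integer giving a character position counted from the left. The names rates and split_year are used only for this example.

```r
library(tidyr)

# Split on a delimiter: "745/19987071" becomes two columns.
# convert = TRUE asks separate() to turn the pieces into numbers.
rates <- table3 |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# Split by position: sep = 2 cuts after the second character,
# so 1999 becomes "19" and "99".
split_year <- table3 |>
  separate(year, into = c("century", "year"), sep = 2)

rates
split_year
```

In recent tidyr releases separate() is marked as superseded by separate_wider_delim() and separate_wider_position(), but it still works as shown here.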

Reshaping Final Exam

Your Turn 7

Extend this code to reshape the data into a data set with three columns:

  1. year
  2. M
  3. F

Calculate the percent of male (or female) children by year. Then plot the percent over time.

babynames |> 
  group_by(year, sex) |> 
  summarise(n = sum(n))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 276 × 3
# Groups:   year [138]
    year sex        n
   <dbl> <chr>  <int>
 1  1880 F      90993
 2  1880 M     110491
 3  1881 F      91953
 4  1881 M     100743
 5  1882 F     107847
 6  1882 M     113686
 7  1883 F     112319
 8  1883 M     104627
 9  1884 F     129020
10  1884 M     114442
# ℹ 266 more rows

Take Aways

Data comes in many formats, but R prefers just one: tidy data.

A data set is tidy if and only if:

  1. Every variable is in its own column
  2. Every observation is in its own row
  3. Every value is in its own cell (which follows from the above)
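One way to see why these three rules matter is to compute a new variable. A minimal sketch using table1, which already satisfies them; the column name rate and the per-10,000 scaling are choices made for this example.

```r
library(dplyr)
library(tidyr)  # provides table1

# Because cases and population are each in their own column,
# a vectorised calculation works row by row with no reshaping.
with_rate <- table1 |>
  mutate(rate = cases / population * 10000)

with_rate
```

The same calculation on table2 or on table4a plus table4b would first need a pivot or a join to line the values up.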

What counts as a variable or an observation may depend on your immediate goal.