ch_26_solutions

Prerequisites:

    library(tidyverse)
    library(nycflights13)

26.2.8 Exercises:

  1. Parts 1-3 below:

    palmerpenguins::penguins |> 
      summarise(across(everything(), n_distinct))
    ## # A tibble: 1 × 8
    ##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    ##     <int>  <int>          <int>         <int>             <int>       <int>
    ## 1       3      3            165            81                56          95
    ## # ℹ 2 more variables: sex <int>, year <int>
    mtcars |> 
      summarise(across(everything(), mean))
    ##        mpg    cyl     disp       hp     drat      wt     qsec     vs      am
    ## 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    ##     gear   carb
    ## 1 3.6875 2.8125
    diamonds |> 
      group_by(cut, clarity, color) |> 
      summarise(
        n = n(),
        across(where(is.numeric), mean),
        .groups = 'drop'
      )
    ## # A tibble: 276 × 11
    ##    cut   clarity color     n carat depth table price     x     y     z
    ##    <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    ##  1 Fair  I1      D         4 1.88   65.6  56.8 7383   7.52  7.42  4.90
    ##  2 Fair  I1      E         9 0.969  65.6  58.1 2095.  6.17  6.06  4.01
    ##  3 Fair  I1      F        35 1.02   65.7  58.4 2544.  6.14  6.04  4.00
    ##  4 Fair  I1      G        53 1.23   65.3  57.7 3187.  6.52  6.43  4.23
    ##  5 Fair  I1      H        52 1.50   65.8  58.4 4213.  6.96  6.86  4.55
    ##  6 Fair  I1      I        34 1.32   65.7  58.4 3501   6.76  6.65  4.41
    ##  7 Fair  I1      J        23 1.99   66.5  57.9 5795.  7.55  7.46  4.99
    ##  8 Fair  SI2     D        56 1.02   64.7  58.6 4355.  6.24  6.17  4.01
    ##  9 Fair  SI2     E        78 1.02   63.4  59.5 4172.  6.28  6.22  3.96
    ## 10 Fair  SI2     F        89 1.08   63.8  59.5 4520.  6.36  6.30  4.04
    ## # ℹ 266 more rows
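
One detail in the third output is worth a quick check: `n` is printed as `<dbl>` rather than `<int>`. This happens because `across()` also sees columns created earlier in the same `summarise()` call, so `where(is.numeric)` picks up `n` and takes its mean. The sketch below compares the two orderings; it uses `mtcars` grouped by `cyl` purely for illustration, and the names `n_first` and `n_last` are hypothetical.

```r
library(dplyr)

# n is created before across(): where(is.numeric) also matches n,
# so n is replaced by mean(n), which is a double
n_first <- mtcars |>
  group_by(cyl) |>
  summarise(n = n(), across(where(is.numeric), mean), .groups = "drop")

# n is created after across(): it stays an integer count
n_last <- mtcars |>
  group_by(cyl) |>
  summarise(across(where(is.numeric), mean), n = n(), .groups = "drop")

class(n_first$n)
class(n_last$n)
```

The counts are the same either way, since the mean of a single value is that value. Only the column type and position differ, but with a summary such as `sd` the first ordering would overwrite the count.
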
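  2. When the functions are passed as an unnamed list, across() has no function names to work with, so it appends each function's position in the list (_1, _2, _3, etc.) to the end of the column name. The result is hard to read, especially if you don’t have access to the code that produced it.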

    diamonds |> 
      summarise(
        across(where(is.numeric), list(mean, median))
      )
    ## # A tibble: 1 × 14
    ##   carat_1 carat_2 depth_1 depth_2 table_1 table_2 price_1 price_2   x_1   x_2
    ##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>
    ## 1   0.798     0.7    61.7    61.8    57.5      57   3933.    2401  5.73   5.7
    ## # ℹ 4 more variables: y_1 <dbl>, y_2 <dbl>, z_1 <dbl>, z_2 <dbl>
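
For contrast, here is a minimal sketch of two ways to get readable names: naming the list entries, and setting the `.names` template. The object names `named` and `prefixed` are hypothetical, and the sketch assumes ggplot2 is installed so that `diamonds` is available.

```r
library(dplyr)

# Naming the list entries gives each output column a readable suffix
named <- ggplot2::diamonds |>
  summarise(across(where(is.numeric), list(mean = mean, median = median)))

# .names sets the naming template; here the function name comes first
prefixed <- ggplot2::diamonds |>
  summarise(
    across(
      where(is.numeric),
      list(mean = mean, median = median),
      .names = "{.fn}_{.col}"
    )
  )

names(named)[1:2]
names(prefixed)[1:2]
```

Either form makes it clear from the column name alone which summary was applied.
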
  3. The .keep argument to mutate() controls which columns are retained in the output. With .keep = 'unused', mutate() drops the columns that were used to create new ones, so each date column is removed once it has been expanded. The way I wrote this function doesn’t require embracing, since it doesn’t take columns as an argument. A more flexible function, such as one that lets the user specify the columns to expand, keep, or remove, would likely require embracing.

    expand_dates <- function(df) {
      df |> 
        mutate(
          across(where(is.Date), list(year = year, month = month, day = mday)),
          .keep = 'unused'
        )
    }
    
    df_date <- tibble(
      name = c("Amy", "Bob"),
      date = ymd(c("2009-08-03", "2010-01-16"))
    )
    
    expand_dates(df_date)
    ## # A tibble: 2 × 4
    ##   name  date_year date_month date_day
    ##   <chr>     <dbl>      <dbl>    <int>
    ## 1 Amy        2009          8        3
    ## 2 Bob        2010          1       16
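
As a rough sketch of the more flexible version mentioned above, the variant below takes the columns to expand as an argument. The function name `expand_dates_cols`, its `cols` argument, and the extra `joined` column are all hypothetical and only for illustration. Because `cols` is a tidy-select specification supplied by the caller, it has to be embraced.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: `cols` is a tidy-select specification passed by the
# caller, so it must be embraced with {{ }} inside across()
expand_dates_cols <- function(df, cols = where(is.Date)) {
  df |>
    mutate(
      across({{ cols }}, list(year = year, month = month, day = mday)),
      .keep = "unused"
    )
}

# Made-up data with two date columns
df_dates <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16")),
  joined = ymd(c("2015-02-01", "2016-07-20"))
)

# Expand only `date`; `joined` is not used, so it is kept as-is
expanded <- expand_dates_cols(df_dates, date)
names(expanded)
```

Only `date` is replaced by its year, month, and day parts here, while `joined` passes through untouched.
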
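  4. To keep the summary short, the function reports the number of NAs within each combination of the user-supplied grouping variables, and only for columns that have at least one NA. group_by(pick({{ group_vars }})) groups by the chosen columns, summarize() with across() counts the NAs in each summary column per group, and select(where(...)) keeps only the columns whose values are above zero in at least one group. The special feature of where() (as described in the documentation) is that it accepts a predicate function, including purrr-like formulas, so columns can be selected by their contents rather than by name. The where() step in this function reminds me of a HAVING clause in a SQL query, which is used to filter data post-aggregation.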

    show_missing <- function(df, group_vars, summary_vars = everything()) {
      df |> 
        group_by(pick({{ group_vars }})) |> 
        summarize(
          across({{ summary_vars }}, \(x) sum(is.na(x))),
          .groups = "drop"
        ) |>
        select(where(\(x) any(x > 0)))
    }
    nycflights13::flights |> show_missing(c(year, month, day))
    ## # A tibble: 365 × 9
    ##     year month   day dep_time dep_delay arr_time arr_delay tailnum air_time
    ##    <int> <int> <int>    <int>     <int>    <int>     <int>   <int>    <int>
    ##  1  2013     1     1        4         4        5        11       0       11
    ##  2  2013     1     2        8         8       10        15       2       15
    ##  3  2013     1     3       10        10       10        14       2       14
    ##  4  2013     1     4        6         6        6         7       2        7
    ##  5  2013     1     5        3         3        3         3       1        3
    ##  6  2013     1     6        1         1        1         3       0        3
    ##  7  2013     1     7        3         3        3         3       1        3
    ##  8  2013     1     8        4         4        4         7       1        7
    ##  9  2013     1     9        5         5        7         9       2        9
    ## 10  2013     1    10        3         3        3         3       2        3
    ## # ℹ 355 more rows
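
To make the last step easier to follow, here is a small sketch that runs the same function on made-up data. The tibble `toy` and its columns are hypothetical; the function body is copied from above so the block runs on its own.

```r
library(dplyr)

show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}

# Toy data (made up for illustration): `a` has NAs, `b` has none
toy <- tibble(
  g = c(1, 1, 2, 2),
  a = c(NA, 1, NA, NA),
  b = c(1, 2, 3, 4)
)

# After summarize(): g = 1, 2; a = 1, 2; b = 0, 0
# select(where(...)) then drops `b`, whose counts are all zero
toy_missing <- show_missing(toy, g)
toy_missing
```

Note that the predicate is applied to every column of the summary, grouping columns included. `g` survives here only because its values are positive, just as year, month, and day do in the flights example; a grouping column containing only zeros or negative numbers would be dropped by the same test.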