Ch.12 Solutions

Prerequisites


library(tidyverse)
library(nycflights13)

12.2.4 Exercises:

  1. Looking at the documentation for near(), you will see that its tol argument defaults to .Machine$double.eps^0.5; .Machine has its own help page. .Machine$double.eps is the smallest positive x such that 1 + x != 1 on your machine. The function works by taking the absolute difference of the two supplied vectors and checking whether it is less than that tolerance, i.e. the square root of machine epsilon (but you can supply other tolerance levels if you want).

    near(sqrt(2)^2, 2)
    ## [1] TRUE
  2. Code below:

    flights |> 
      mutate(
        dep_time_na = is.na(dep_time),
        sched_dep_time_na = is.na(sched_dep_time),
        dep_delay_na = is.na(dep_delay)
      ) |> 
      count(dep_time_na, sched_dep_time_na, dep_delay_na)
    ## # A tibble: 2 × 4
    ##   dep_time_na sched_dep_time_na dep_delay_na      n
    ##   <lgl>       <lgl>             <lgl>         <int>
    ## 1 FALSE       FALSE             FALSE        328521
    ## 2 TRUE        FALSE             TRUE           8255
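
For exercise 1, the comparison that near() performs can be sketched in base R. This is only an illustration of the idea, not dplyr's source code: near_sketch is a hypothetical helper name, and its default tolerance is assumed to mirror the documented default of .Machine$double.eps^0.5.

```r
# Hypothetical helper (not part of dplyr): compare two numeric vectors
# up to a tolerance, mirroring the documented default of near().
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

exact  <- sqrt(2)^2 == 2             # exact comparison fails due to rounding
approx <- near_sketch(sqrt(2)^2, 2)  # tolerance-based comparison succeeds
strict <- near_sketch(1, 1.001)      # a genuine difference is still detected
```

Passing a smaller or larger tol changes how strict the comparison is.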

12.3.4 Exercises:

  1. Made 2 separate tibbles.

    # arr_delay missing but dep_delay populated
    flights |>
      filter(is.na(arr_delay) & !is.na(dep_delay))

    # neither arr_time nor sched_arr_time missing, but arr_delay is
    flights |>
      filter(!is.na(arr_time) & !is.na(sched_arr_time) & is.na(arr_delay))
  2. Counted the rows where dep_time is missing, then used across() to count the missing values in every column of those rows simultaneously.

    flights |> 
      count(is.na(dep_time))
    ## # A tibble: 2 × 2
    ##   `is.na(dep_time)`      n
    ##   <lgl>              <int>
    ## 1 FALSE             328521
    ## 2 TRUE                8255
    flights |> 
      filter(is.na(dep_time)) |> 
      summarise(across(everything(), ~sum(is.na(.x))))
    ## # A tibble: 1 × 19
    ##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    ##   <int> <int> <int>    <int>          <int>     <int>    <int>          <int>
    ## 1     0     0     0     8255              0      8255     8255              0
    ## # ℹ 11 more variables: arr_delay <int>, carrier <int>, flight <int>,
    ## #   tailnum <int>, origin <int>, dest <int>, air_time <int>, distance <int>,
    ## #   hour <int>, minute <int>, time_hour <int>
    • We see that when dep_time is missing (NA), dep_delay, arr_time, arr_delay and air_time are always missing as well. My suspicion is that these fields are left unpopulated when the flight is cancelled.
  3. I only made a plot, but here are some ideas for further analysis:

    • Make the size of the points depend on the number of flights that day.

    • Look into the fit of a regression line.

    • Check whether arr_delay shows the same trend.

    flights |> 
      group_by(year, month, day) |> 
      summarise(prop_cancelled = sum(is.na(dep_time))/n(),
                avg_delay = mean(dep_delay, na.rm = TRUE),
                .groups = 'drop') |> 
      ggplot(aes(prop_cancelled, avg_delay))+
      geom_point()
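
The per-day summary can be tried on a small made-up table before running it on flights. The sketch below assumes dplyr is installed; toy_flights and its values are invented for illustration, and n_flights is a hypothetical extra column that could drive the point size suggested above.

```r
library(dplyr)

# Toy stand-in for flights (made-up values, not real data).
toy_flights <- tibble(
  day       = c(1, 1, 1, 2, 2, 2, 2),
  dep_time  = c(900, NA, 1230, 700, 815, NA, NA),
  dep_delay = c(5, NA, 20, -3, 0, NA, NA)
)

daily <- toy_flights |>
  group_by(day) |>
  summarise(
    n_flights      = n(),                    # flights per day
    prop_cancelled = mean(is.na(dep_time)),  # same as sum(is.na(dep_time)) / n()
    avg_delay      = mean(dep_delay, na.rm = TRUE),
    .groups = "drop"
  )
```

In the plot, aes(prop_cancelled, avg_delay, size = n_flights) would then scale each point by that day's flight count.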

12.4.4 Exercises:

  1. sum(is.na(x)) gives the count of missing values in vector x.

    • mean(is.na(x)) gives the proportion of missing values in vector x.
  2. Both prod() and min() are analogous to all().

    • TRUE and FALSE are coerced to 1 and 0, so if any value in vector x is FALSE, prod(x) and min(x) return 0; they return 1 only when every value is TRUE.
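
These claims can be checked on small logical vectors in base R. The vectors below are made up for illustration.

```r
# A logical vector with two missing values out of five.
x <- c(TRUE, NA, FALSE, TRUE, NA)
n_missing    <- sum(is.na(x))   # count of NAs: 2
prop_missing <- mean(is.na(x))  # proportion of NAs: 0.4

all_true  <- c(TRUE, TRUE, TRUE)
one_false <- c(TRUE, FALSE, TRUE)

# TRUE/FALSE are coerced to 1/0, so prod() and min() give 1 only
# when every element is TRUE, which matches all().
prod_all <- prod(all_true)   # 1
prod_one <- prod(one_false)  # 0
min_all  <- min(all_true)    # 1
min_one  <- min(one_false)   # 0
```

Note that prod() and min() return numbers rather than logicals, so wrap them in as.logical() if a TRUE/FALSE result is needed.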

12.5.4 Exercises:

  1. Code below:

    vector_0_20 <- 0:20
    if_else(vector_0_20 %% 2 == 0, 'EVEN', 'ODD')
    ##  [1] "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"
    ## [11] "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"  "EVEN" "ODD"
    ## [21] "EVEN"
  2. Code below:

    days_of_week <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
    if_else(days_of_week %in% c('Saturday', 'Sunday'), 'Weekend', 'Weekday')
    ## [1] "Weekday" "Weekday" "Weekday" "Weekday" "Weekday" "Weekend" "Weekend"
  3. Code below:

    sample_vector_x <- c(1, -4, 3, 2, 0, -17.5)
    print(if_else(sample_vector_x < 0, -sample_vector_x, sample_vector_x))
    ## [1]  1.0  4.0  3.0  2.0  0.0 17.5
  4. Only three holidays are labelled here; more can be added in the same way:

    flights |> 
      mutate(
        # logical flag: TRUE on a holiday, FALSE otherwise
        holiday_boolean = case_when(
          month == 12 & day == 25 ~ TRUE,
          month == 1 & day == 1 ~ TRUE,
          month == 12 & day == 31 ~ TRUE,
          .default = FALSE
        ),
        # holiday name, or a real NA (not the string 'NA') on other days
        holiday = case_when(
          month == 12 & day == 25 ~ 'Christmas',
          month == 1 & day == 1 ~ 'New Years',
          month == 12 & day == 31 ~ 'New Years Eve',
          .default = NA_character_
        )
      )
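
A quick way to check the holiday labels is to run the same conditions on a handful of hand-picked dates. The sketch below assumes dplyr 1.1.0 or later (for .default); toy_dates is made up for illustration, and is_holiday is a hypothetical column name. It stores the flag as a logical and leaves non-holidays as NA, as the exercise asks.

```r
library(dplyr)

# Hand-picked dates (illustrative, not taken from flights):
# three holidays and one ordinary day.
toy_dates <- tibble(month = c(12, 1, 12, 7), day = c(25, 1, 31, 15))

labelled <- toy_dates |>
  mutate(
    holiday = case_when(
      month == 12 & day == 25 ~ "Christmas",
      month == 1 & day == 1 ~ "New Years",
      month == 12 & day == 31 ~ "New Years Eve",
      .default = NA_character_
    ),
    # Derive the logical flag from the label so the two columns cannot disagree.
    is_holiday = !is.na(holiday)
  )
```

Deriving the flag from the label is one possible design choice; it avoids repeating the date conditions in two case_when() calls.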