Ch.13 Solutions

Prerequisites

library(tidyverse)
library(nycflights13)

13.3.1 Exercises:

Code below:

flights |> 
  count(is.na(dep_time))
## # A tibble: 2 × 2
##   `is.na(dep_time)`      n
##   <lgl>              <int>
## 1 FALSE             328521
## 2 TRUE                8255

Documentation for count outlines how how it is analogous to using group_by and summarize(n = n()).

#1.
flights |> 
  group_by(dest) |> 
  summarise(n = n()) |> 
  arrange(desc(n))
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ℹ 95 more rows

#2.
flights |> 
  group_by(tailnum) |> 
  summarise(wt = sum(distance))
## # A tibble: 4,044 × 2
##    tailnum     wt
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  250866
##  3 N10156  115966
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   25157
##  7 N10575  150194
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ℹ 4,034 more rows

13.4.8 Exercises:

Each line of the below code block represent one of the 7 lines used to create the plot.

#1. Pipes the dataset of interest to group_by

#2. Using integer division to get the hour from dep_time. This is necessary because sched_dep_time is coded like HHMM.

#3. Summarize function returns both the proportion of missing values and the count of flights that hour.

#4. Do this because there is only one value before hour 5 and it's at hour 1. Makes the plot much cleaner to remove it.

#5. Setting aesthetic globally since it will apply to both layers.

#6. Makes a line plot using the axes defined above. Also using a grey line is surprisingly much cleaner looking. (Run without that argument to see the difference, its much more than I expected.)

#7. Adds scatterpoints at the hour marks that are the size of the count at that hour.

Can run ?Trig to get the documentation on the different trig functions. By default they are in radians.

Issue is rooted in the fact that dep_time is coded as a numeric variable in flights while time in acutality is base-60 (e.g. 459 to 500 is a gap of one minute in flights, not 41). I decided to do minutes since midnight:

flights |> 
  mutate(
    dep_min_since_midnight = (dep_time %/% 100) * 60 + dep_time%%100,
    sched_dep_min_since_midnight = (sched_dep_time %/% 100) * 60 + sched_dep_time%%100
  ) |> 
  select(
    dep_time, 
    dep_min_since_midnight, 
    sched_dep_time, 
    sched_dep_min_since_midnight
  )
## # A tibble: 336,776 × 4
##    dep_time dep_min_since_midnight sched_dep_time sched_dep_min_since_midnight
##       <int>                  <dbl>          <int>                        <dbl>
##  1      517                    317            515                          315
##  2      533                    333            529                          329
##  3      542                    342            540                          340
##  4      544                    344            545                          345
##  5      554                    354            600                          360
##  6      554                    354            558                          358
##  7      555                    355            600                          360
##  8      557                    357            600                          360
##  9      557                    357            600                          360
## 10      558                    358            600                          360
## # ℹ 336,766 more rows

A bit trickier, since if it rounds to a 60 minute increment you need to move to the next hour (e.g. you want 500 not 460).

flights |> 
  mutate(
    rounded_dep_time = if_else(
    (round((dep_time / 5)) * 5) %% 100==60,
    ceiling(dep_time / 100) * 100,
    round((dep_time / 5)) * 5
  )) |> 
  select(dep_time, rounded_dep_time)
## # A tibble: 336,776 × 2
##    dep_time rounded_dep_time
##       <int>            <dbl>
##  1      517              515
##  2      533              535
##  3      542              540
##  4      544              545
##  5      554              555
##  6      554              555
##  7      555              555
##  8      557              555
##  9      557              555
## 10      558              600
## # ℹ 336,766 more rows

13.5.4 Exercises:

Different rank functions handle ties differently. For this questions I chose min_rank since it gives me a maximum of 10 rows (dense_rank may give more than 10 rows) and I personally find it more intuitive than row_number because I prefer ties to have the same ranking.

flights |> 
 mutate(rank = min_rank(desc(dep_delay))) |> 
  filter(rank <= 10)
## # A tibble: 10 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     1    10     1121           1635      1126     1239           1810
##  3  2013    12     5      756           1700       896     1058           2020
##  4  2013     3    17     2321            810       911      135           1020
##  5  2013     4    10     1100           1900       960     1342           2211
##  6  2013     6    15     1432           1935      1137     1607           2120
##  7  2013     6    27      959           1900       899     1236           2226
##  8  2013     7    22      845           1600      1005     1044           1815
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013     9    20     1139           1845      1014     1457           2210
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, rank <int>

As an extension to my code below, could look into using other summary statistics (e.g. median) or ways to confirm the validity of my results (e.g. include the count of flights per tail or the distribution of delays to see if outliers could be misrepresenting the results).

flights |> 
  group_by(tailnum) |> 
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE)
  ) |> 
  mutate(most_delayed = min_rank(desc(avg_delay))) |> 
  filter(most_delayed == 1)
## # A tibble: 1 × 3
##   tailnum avg_delay most_delayed
##   <chr>       <dbl>        <int>
## 1 N844MH        297            1

The fact that hour 5 and 23 have considerably less flights than other hours might make you question the significance of their result. As an extension, could include the standard deviation of delay to improve your results.

flights |> 
  group_by(hour = sched_dep_time %/% 100) |> 
  summarise(avg_delay = mean(is.na(dep_delay)), n=n()) |> 
  mutate(min_rank = min_rank(desc(avg_delay))) |> 
  arrange(avg_delay)
## # A tibble: 20 × 4
##     hour avg_delay     n min_rank
##    <dbl>     <dbl> <int>    <int>
##  1     5   0.00461  1953       20
##  2    23   0.0123   1061       19
##  3     7   0.0127  22821       18
##  4     9   0.0161  20312       17
##  5     8   0.0162  27242       16
##  6     6   0.0164  25951       15
##  7    10   0.0174  16708       14
##  8    11   0.0185  16033       13
##  9    12   0.0213  18181       12
## 10    13   0.0215  19956       11
## 11    14   0.0261  21706       10
## 12    17   0.0270  24426        9
## 13    15   0.0280  23888        8
## 14    18   0.0287  21783        7
## 15    22   0.0296   2639        6
## 16    16   0.0365  23002        5
## 17    21   0.0374  10933        4
## 18    20   0.0380  16739        3
## 19    19   0.0402  21441        2
## 20     1   1           1        1

Since you don’t supply a vector row_number, the first expression will return the first 3 rows per destination as they appear in the data set. In the second expression you rank by dep_delay, so it will give the 3 flights per destination with the lowest delay.

flights |> group_by(dest) |> filter(row_number() < 4)
## # A tibble: 311 × 19
## # Groups:   dest [105]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 301 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> 
  group_by(dest, flight) |> 
  summarise(flight_delay = sum(dep_delay)) |> 
  mutate(dest_delay = sum(flight_delay, na.rm = TRUE)) 
## `summarise()` has grouped output by 'dest'. You can override using the
## `.groups` argument.
## # A tibble: 11,467 × 4
## # Groups:   dest [105]
##    dest  flight flight_delay dest_delay
##    <chr>  <int>        <dbl>      <dbl>
##  1 ABQ       65         1239       3490
##  2 ABQ     1505         2251       3490
##  3 ACK     1191          614       1711
##  4 ACK     1195           33       1711
##  5 ACK     1291          250       1711
##  6 ACK     1491          814       1711
##  7 ALB     3256            6       2529
##  8 ALB     3260          119       2529
##  9 ALB     3264            1       2529
## 10 ALB     3811           NA       2529
## # ℹ 11,457 more rows

This works because group_by peels a “layer” off when I use a second mutate/summarise function. So my mutate call is working as if the data is group by dest only, not dest and flight.
An older blog about this phenomena here.

# only the last 5 lines are added by me
library(ggpmisc)
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## The following object is masked from 'package:ggplot2':
## 
##     annotate

flights |> 
  mutate(hour = dep_time %/% 100) |> 
  group_by(year, month, day, hour) |> 
  summarize(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |> 
  filter(n > 5) |> 
  mutate(previous_delay_lag = lag(dep_delay)) |>
  ggplot(aes(dep_delay, previous_delay_lag))+
  geom_point() +
  stat_poly_line() +
  stat_poly_eq()
## Warning: Removed 426 rows containing non-finite outside the scale range
## (`stat_poly_line()`).
## Warning: Removed 426 rows containing non-finite outside the scale range
## (`stat_poly_eq()`).
## Warning: Removed 426 rows containing missing values or values outside the scale range
## (`geom_point()`).

Used the ggpmisc package to quickly add a regression line with the R^2. geom_smooth() can also make a regression line, but ggpmisc makes adding the R^2 especially easy.

flights |> 
  group_by(dest) |> 
  mutate(mean_air = mean(air_time, na.rm = TRUE),
         prop_air = air_time / mean_air * 100) |> 
  select(dest, mean_air, prop_air) |> 
  arrange(prop_air)
## # A tibble: 336,776 × 3
## # Groups:   dest [105]
##    dest  mean_air prop_air
##    <chr>    <dbl>    <dbl>
##  1 BOS       39.0     53.9
##  2 ATL      113.      57.6
##  3 GSP       93.4     58.9
##  4 BOS       39.0     59.0
##  5 BNA      114.      61.2
##  6 MSP      151.      61.8
##  7 PHL       33.2     63.3
##  8 PHL       33.2     63.3
##  9 PHL       33.2     63.3
## 10 CVG       96.0     64.6
## # ℹ 336,766 more rows

The above table tells me that the fastest flight has an airtime that is 54% of the mean flight for that destination. While definitely quick, I don’t consider this a data entry error given how many other flights also take up ~60% of the mean time.

To find the flights that were most delayed in the air, I am going to compare dep_delay to arr_delay.

flights |> 
  mutate(air_delay = arr_delay - dep_delay) |> 
  arrange(desc(air_delay)) |> 
  select(flight, tailnum, air_delay, dep_delay, arr_delay)
## # A tibble: 336,776 × 5
##    flight tailnum air_delay dep_delay arr_delay
##     <int> <chr>       <dbl>     <dbl>     <dbl>
##  1    399 N629VA        196        -2       194
##  2    707 N3EXAA        181        -2       179
##  3    996 N593UA        165       180       345
##  4   1465 N711ZX        161        16       177
##  5   3199 N5PBMQ        157        43       200
##  6   1619 N970DL        154        -9       145
##  7   2395 N931DL        153        11       164
##  8   4580 N12122        150       291       441
##  9   2793 N657MQ        150        41       191
## 10   2083 N565AA        148        -5       143
## # ℹ 336,766 more rows

Again, nothing screams data entry error since I have personally been on flights spent taxiing due to gate issues.

flights |> 
  group_by(dest, carrier) |> 
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE), .groups = 'drop') |> 
  mutate(
    dest_rank = row_number(avg_delay),
         count_carrier = n_distinct(carrier)
  ) |> 
  filter(count_carrier >= 2) |> 
  select(dest, carrier, avg_delay, dest_rank) |> 
  arrange(dest, dest_rank)
## # A tibble: 314 × 4
##    dest  carrier avg_delay dest_rank
##    <chr> <chr>       <dbl>     <int>
##  1 ABQ   B6         13.7         174
##  2 ACK   B6          6.46         66
##  3 ALB   EV         23.6         289
##  4 ANC   UA         12.9         161
##  5 ATL   9E          0.965        32
##  6 ATL   WN          2.34         39
##  7 ATL   MQ          9.35        107
##  8 ATL   DL         10.4         119
##  9 ATL   UA         15.8         202
## 10 ATL   FL         18.4         240
## # ℹ 304 more rows

Like exercise #5, I am using the “peel” of group_by so the summarize call is grouping by both dest and carrier while the mutate call is grouping by dest only.

13.6.7 Exercises:

Lots of potential answers so just giving some aspects to consider:
1. Can take arr_delay minus dep_delay to get the amount of air delay.
2. mean vs. median depends on context. If you have a lot of outliers you should probably prefer median.
3. can use ?planes to read about the table. Would be useful if you want to see if plane age/speed has an effect on delay.
4. the package also has a weather table. Could use to see if delay correlates with visibility or precipitation at a nearby weather station.

flights |> 
  mutate(air_speed = distance / air_time) |> 
  group_by(dest) |> 
  summarize(sd_speed = sd(air_speed, na.rm = TRUE)) |> 
  arrange(desc(sd_speed))
## # A tibble: 105 × 2
##    dest  sd_speed
##    <chr>    <dbl>
##  1 OKC      0.639
##  2 TUL      0.624
##  3 ILM      0.615
##  4 BNA      0.615
##  5 CLT      0.611
##  6 MSY      0.603
##  7 HOU      0.603
##  8 MEM      0.599
##  9 AUS      0.596
## 10 XNA      0.596
## # ℹ 95 more rows

Curiously, both #1 and #2 in my data are in Oklahoma.

To see if airports move locations, I graphed the distance of the flights to EGE by month. Flights from both EWR and JFK seemed to have decreased a mile which makes me think the airport may have re-positioned the runways. (FYI reading the wikipedia page for Veil Airport, I am honestly not sure what caused the change in 2013).
```
flights |> 
  filter(dest == 'EGE') |> 
  group_by(month, origin) |> 
  summarise(distance = mean(distance), .groups = 'drop') |> 
  ggplot(aes(month, distance, color = origin)) +
  geom_point()
```