ch_23_solutions

Prerequisites:

library(tidyverse)
library(repurrrsive)
library(jsonlite)

23.3.5 Exercises:

  1. Need to supply names_sep argument to unnest_wider.

    df2 <- tribble(
      ~x, ~y,
      1, list(11, 12, 13),
      2, list(21),
      3, list(31, 32),
    )
    df2 |> 
      unnest_wider(y, names_sep = '_')
    ## # A tibble: 3 × 4
    ##       x   y_1   y_2   y_3
    ##   <dbl> <dbl> <dbl> <dbl>
    ## 1     1    11    12    13
    ## 2     2    21    NA    NA
    ## 3     3    31    32    NA
  2. By default it will include a column of the element names. this can be suppressed with the indices_include argument.

    df1 <- tribble(
      ~x, ~y,
      1, list(a = 11, b = 12),
      2, list(a = 21, b = 22),
      3, list(a = 31, b = 32),
    )
    df1 |> 
      unnest_longer(y, indices_include = FALSE)
    ## # A tibble: 6 × 2
    ##       x     y
    ##   <dbl> <dbl>
    ## 1     1    11
    ## 2     1    12
    ## 3     2    21
    ## 4     2    22
    ## 5     3    31
    ## 6     3    32
  3. The issue with apply 2 unnest_longer calls is that it returns all combinations of row values (i.e. the cartesian product).

    df4 <- tribble(
      ~x, ~y, ~z,
      "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
      "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
    )
    df4 |> 
      unnest_longer(y) |> 
      unnest_longer(z)
    ## # A tibble: 13 × 3
    ##    x     y     z    
    ##    <chr> <chr> <chr>
    ##  1 a     y-a-1 z-a-1
    ##  2 a     y-a-1 z-a-2
    ##  3 a     y-a-2 z-a-1
    ##  4 a     y-a-2 z-a-2
    ##  5 b     y-b-1 z-b-1
    ##  6 b     y-b-1 z-b-2
    ##  7 b     y-b-1 z-b-3
    ##  8 b     y-b-2 z-b-1
    ##  9 b     y-b-2 z-b-2
    ## 10 b     y-b-2 z-b-3
    ## 11 b     y-b-3 z-b-1
    ## 12 b     y-b-3 z-b-2
    ## 13 b     y-b-3 z-b-3
    • To get the ordered pairing of data, (e.g. the first value of y to the first value of z), you can supply a vector of columns to a single unnest_longer call. This property is seen in the “Examples” section of the documentation.
    df4 |> 
      unnest_longer(c(y, z))
    ## # A tibble: 5 × 3
    ##   x     y     z    
    ##   <chr> <chr> <chr>
    ## 1 a     y-a-1 z-a-1
    ## 2 a     y-a-2 z-a-2
    ## 3 b     y-b-1 z-b-1
    ## 4 b     y-b-2 z-b-2
    ## 5 b     y-b-3 z-b-3

23.4.4 Exercises:

  1. I assume gh_repos was created the evening of October 25th 2016 or early October 26th. I looked to see when the last update was recorded in the data set, and since updates are roughly a daily occurrence, I assume the data set wasn’t created on October 27th or later. Someone better than me at stats could probably apply a exponential distribution to find the expected next occurrence and consequently the likelihood the data set was created before that point.

    repos <- tibble(gh_repos)
    
    repos |> 
      unnest_longer(gh_repos) |> 
      unnest_wider(gh_repos) |>   
      mutate(update = as_date(updated_at)) |> 
      count(update) |> 
      arrange(desc(update))
    ## # A tibble: 142 × 2
    ##    update         n
    ##    <date>     <int>
    ##  1 2016-10-25     2
    ##  2 2016-10-24     2
    ##  3 2016-10-23     3
    ##  4 2016-10-22     1
    ##  5 2016-10-20     1
    ##  6 2016-10-19     4
    ##  7 2016-10-18     2
    ##  8 2016-10-16     1
    ##  9 2016-10-15     1
    ## 10 2016-10-11     2
    ## # ℹ 132 more rows
  2. Distinct does seem to work with list columns. Granted, I would usually do the unnest_wider call first because then I have the flexibility to determine uniqueness using a subset of columns from the unnested owner in my distinct call.

    repos |> 
      unnest_longer(gh_repos) |> 
      unnest_wider(gh_repos) |> 
      select(owner) |> 
      distinct() |> 
      unnest_wider(owner)
    ## # A tibble: 6 × 17
    ##   login     id avatar_url gravatar_id url   html_url followers_url following_url
    ##   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
    ## 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
    ## 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
    ## 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
    ## 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
    ## 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
    ## 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
    ## # ℹ 9 more variables: gists_url <chr>, starred_url <chr>,
    ## #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
    ## #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>
  3. Definitely want to maintain the ID column in case you plan to join the data together.

    chars <- tibble(json = got_chars)
    
    aliases <- chars |> 
      unnest_wider(json) |> 
      select(id, aliases) |> 
      unnest_longer(aliases) |> 
      filter(aliases != '')
    
    allegiances <- chars |> 
      unnest_wider(json) |> 
      select(id, allegiances) |> 
      unnest_longer(allegiances)
    
    books <- chars |> 
      unnest_wider(json) |> 
      select(id, books) |> 
      unnest_longer(books)
    
    tv_series <- chars |> 
      unnest_wider(json) |> 
      select(id, tvSeries) |> 
      unnest_longer(tvSeries) |> 
      filter(tvSeries != '')
    
    aliases
    ## # A tibble: 107 × 2
    ##       id aliases           
    ##    <int> <chr>             
    ##  1  1022 Prince of Fools   
    ##  2  1022 Theon Turncloak   
    ##  3  1022 Reek              
    ##  4  1022 Theon Kinslayer   
    ##  5  1052 The Imp           
    ##  6  1052 Halfman           
    ##  7  1052 The boyman        
    ##  8  1052 Giant of Lannister
    ##  9  1052 Lord Tywin's Doom 
    ## 10  1052 Lord Tywin's Bane 
    ## # ℹ 97 more rows
    allegiances
    ## # A tibble: 33 × 2
    ##       id allegiances                      
    ##    <int> <chr>                            
    ##  1  1022 House Greyjoy of Pyke            
    ##  2  1052 House Lannister of Casterly Rock 
    ##  3  1074 House Greyjoy of Pyke            
    ##  4  1166 House Nymeros Martell of Sunspear
    ##  5   130 House Nymeros Martell of Sunspear
    ##  6  1303 House Targaryen of King's Landing
    ##  7  1319 House Baratheon of Dragonstone   
    ##  8  1319 House Seaworth of Cape Wrath     
    ##  9   148 House Stark of Winterfell        
    ## 10   149 House Oakheart of Old Oak        
    ## # ℹ 23 more rows
    books
    ## # A tibble: 77 × 2
    ##       id books                    
    ##    <int> <chr>                    
    ##  1  1022 A Game of Thrones        
    ##  2  1022 A Storm of Swords        
    ##  3  1022 A Feast for Crows        
    ##  4  1052 A Feast for Crows        
    ##  5  1052 The World of Ice and Fire
    ##  6  1074 A Game of Thrones        
    ##  7  1074 A Clash of Kings         
    ##  8  1074 A Storm of Swords        
    ##  9  1109 A Clash of Kings         
    ## 10  1166 A Game of Thrones        
    ## # ℹ 67 more rows
    tv_series
    ## # A tibble: 93 × 2
    ##       id tvSeries
    ##    <int> <chr>   
    ##  1  1022 Season 1
    ##  2  1022 Season 2
    ##  3  1022 Season 3
    ##  4  1022 Season 4
    ##  5  1022 Season 5
    ##  6  1022 Season 6
    ##  7  1052 Season 1
    ##  8  1052 Season 2
    ##  9  1052 Season 3
    ## 10  1052 Season 4
    ## # ℹ 83 more rows
  4. Using pivot_longer to split the data is usable in this case because all the list columns are unnamed and will require a unnest_longer call. If some of the lists are named, or to be more encompassing the lists will unnest to different width column-wise, then un-nesting simultaneously will not get you a fully populated matrix since some cells will not be populated.

    tibble(json = got_chars) |> 
      unnest_wider(json) |> 
      select(id, where(is.list)) |> 
      pivot_longer(
        where(is.list), 
        names_to = "name", 
        values_to = "value"
      )
    ## # A tibble: 210 × 3
    ##       id name        value     
    ##    <int> <chr>       <list>    
    ##  1  1022 titles      <chr [2]> 
    ##  2  1022 aliases     <chr [4]> 
    ##  3  1022 allegiances <chr [1]> 
    ##  4  1022 books       <chr [3]> 
    ##  5  1022 povBooks    <chr [2]> 
    ##  6  1022 tvSeries    <chr [6]> 
    ##  7  1022 playedBy    <chr [1]> 
    ##  8  1052 titles      <chr [2]> 
    ##  9  1052 aliases     <chr [11]>
    ## 10  1052 allegiances <chr [1]> 
    ## # ℹ 200 more rows
  5. It is of mixed length since the addresses have different amounts of detail (e.g. some places have a county, some do not). I personally found it better to use unnest_wider with types since the two values seem to contain slightly different information and therefore better considered as it’s own variable.

    gmaps_cities |> 
      unnest_wider(json) |> 
      select(results) |> 
      unnest_longer(results) |> 
      unnest_wider(results) |> 
      select(place_id, address_components) |> 
      unnest_longer(address_components) |> 
      unnest_wider(address_components) |> 
      unnest_wider(types, names_sep = '_')
    ## # A tibble: 25 × 5
    ##    place_id                    long_name            short_name   types_1 types_2
    ##    <chr>                       <chr>                <chr>        <chr>   <chr>  
    ##  1 ChIJAYWNSLS4QIYROwVl894CDco Houston              Houston      locali… politi…
    ##  2 ChIJAYWNSLS4QIYROwVl894CDco Harris County        Harris Coun… admini… politi…
    ##  3 ChIJAYWNSLS4QIYROwVl894CDco Texas                TX           admini… politi…
    ##  4 ChIJAYWNSLS4QIYROwVl894CDco United States        US           country politi…
    ##  5 ChIJ-bDD5__lhVQRuvNfbGh4QpQ Washington           WA           admini… politi…
    ##  6 ChIJ-bDD5__lhVQRuvNfbGh4QpQ United States        US           country politi…
    ##  7 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI Washington           Washington   locali… politi…
    ##  8 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia District of… admini… politi…
    ##  9 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia DC           admini… politi…
    ## 10 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI United States        US           country politi…
    ## # ℹ 15 more rows

23.5.4 Exercises:

  1. df_col |> 
      unnest_wider(col_json) |> 
      unnest_longer(c(x, y))
    ## # A tibble: 3 × 2
    ##   x         y
    ##   <chr> <int>
    ## 1 a        10
    ## 2 x        NA
    ## 3 z         3
    
    df_row |> 
      unnest_wider(row_json)
    ## # A tibble: 3 × 2
    ##   x         y
    ##   <chr> <int>
    ## 1 a        10
    ## 2 x        NA
    ## 3 z         3