ch_23_solutions

Prerequisites:

```r
library(tidyverse)
library(repurrrsive)
library(jsonlite)
```
23.3.5 Exercises:
Need to supply the `names_sep` argument to `unnest_wider`.

```r
df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

df2 |>
  unnest_wider(y, names_sep = '_')
## # A tibble: 3 × 4
##       x   y_1   y_2   y_3
##   <dbl> <dbl> <dbl> <dbl>
## 1     1    11    12    13
## 2     2    21    NA    NA
## 3     3    31    32    NA
```
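For contrast, a quick sketch of what happens without `names_sep` (assuming a reasonably recent tidyr, ≥ 1.2): the inner elements of `y` are unnamed, so `unnest_wider` has no column names to use and refuses to guess.

```r
library(tidyr)
library(tibble)

df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

# The elements of y are unnamed, so without names_sep unnest_wider()
# has nothing to name the new columns with and errors instead:
res <- try(unnest_wider(df2, y), silent = TRUE)
inherits(res, "try-error")
```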
By default it will include a column of the element names. This can be suppressed with the `indices_include` argument.

```r
df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

df1 |>
  unnest_longer(y, indices_include = FALSE)
## # A tibble: 6 × 2
##       x     y
##   <dbl> <dbl>
## 1     1    11
## 2     1    12
## 3     2    21
## 4     2    22
## 5     3    31
## 6     3    32
```
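And the default behavior, for comparison: since the elements of `y` are named, `unnest_longer` keeps those names in an extra `y_id` column (a small sketch reusing the same `df1`).

```r
library(tidyr)
library(tibble)

df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

# Default: the element names land in a y_id column alongside the values.
df1 |>
  unnest_longer(y)
```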
The issue with applying two `unnest_longer` calls is that it returns all combinations of row values (i.e. the Cartesian product).

```r
df4 <- tribble(
  ~x, ~y, ~z,
  "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
  "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
)

df4 |>
  unnest_longer(y) |>
  unnest_longer(z)
## # A tibble: 13 × 3
##    x     y     z
##    <chr> <chr> <chr>
##  1 a     y-a-1 z-a-1
##  2 a     y-a-1 z-a-2
##  3 a     y-a-2 z-a-1
##  4 a     y-a-2 z-a-2
##  5 b     y-b-1 z-b-1
##  6 b     y-b-1 z-b-2
##  7 b     y-b-1 z-b-3
##  8 b     y-b-2 z-b-1
##  9 b     y-b-2 z-b-2
## 10 b     y-b-2 z-b-3
## 11 b     y-b-3 z-b-1
## 12 b     y-b-3 z-b-2
## 13 b     y-b-3 z-b-3
```
- To get the ordered pairing of data (e.g. the first value of `y` with the first value of `z`), you can supply a vector of columns to a single `unnest_longer` call. This property is seen in the “Examples” section of the documentation.

```r
df4 |>
  unnest_longer(c(y, z))
## # A tibble: 5 × 3
##   x     y     z
##   <chr> <chr> <chr>
## 1 a     y-a-1 z-a-1
## 2 a     y-a-2 z-a-2
## 3 b     y-b-1 z-b-1
## 4 b     y-b-2 z-b-2
## 5 b     y-b-3 z-b-3
```
23.4.4 Exercises:
I assume `gh_repos` was created the evening of October 25th, 2016, or early on October 26th. I looked to see when the last update was recorded in the data set, and since updates are roughly a daily occurrence, I assume the data set wasn't created on October 27th or later. Someone better than me at stats could probably apply an exponential distribution to find the expected next occurrence and consequently the likelihood the data set was created before that point.

```r
repos <- tibble(gh_repos)

repos |>
  unnest_longer(gh_repos) |>
  unnest_wider(gh_repos) |>
  mutate(update = as_date(updated_at)) |>
  count(update) |>
  arrange(desc(update))
## # A tibble: 142 × 2
##    update         n
##    <date>     <int>
##  1 2016-10-25     2
##  2 2016-10-24     2
##  3 2016-10-23     3
##  4 2016-10-22     1
##  5 2016-10-20     1
##  6 2016-10-19     4
##  7 2016-10-18     2
##  8 2016-10-16     1
##  9 2016-10-15     1
## 10 2016-10-11     2
## # ℹ 132 more rows
```
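To sketch that exponential idea (base R only; the dates are hard-coded from the output above so the example is self-contained, and this is a back-of-the-envelope model, not a rigorous fit): treat the gaps between updates as exponentially distributed, estimate the rate from the mean gap, and ask how likely another update would have landed within a couple of days.

```r
# Update dates copied from the output above (the most recent ten).
dates <- as.Date(c(
  "2016-10-11", "2016-10-15", "2016-10-16", "2016-10-18", "2016-10-19",
  "2016-10-20", "2016-10-22", "2016-10-23", "2016-10-24", "2016-10-25"
))

gaps <- as.numeric(diff(dates))  # days between successive updates
rate <- 1 / mean(gaps)           # exponential rate estimate (updates per day)

# P(at least one update within d days of the last observed one):
p_within <- function(d) 1 - exp(-rate * d)

# If the snapshot were taken on Oct 27th or later, we'd expect to have
# seen another update with this probability -- but we didn't:
round(p_within(2), 2)
```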
`distinct` does seem to work with list-columns. Granted, I would usually do the `unnest_wider` call first because then I have the flexibility to determine uniqueness using a subset of columns from the unnested `owner` in my `distinct` call.

```r
repos |>
  unnest_longer(gh_repos) |>
  unnest_wider(gh_repos) |>
  select(owner) |>
  distinct() |>
  unnest_wider(owner)
## # A tibble: 6 × 17
##   login id     avatar_url gravatar_id url   html_url followers_url following_url
##   <chr> <int>  <chr>      <chr>       <chr> <chr>    <chr>         <chr>
## 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
## 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
## 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
## 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
## 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
## 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
## # ℹ 9 more variables: gists_url <chr>, starred_url <chr>,
## #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
## #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>
```
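A minimal check that `distinct` really does handle list-columns (a sketch with made-up owner records, not the real `gh_repos` data; requires dplyr ≥ 1.0):

```r
library(dplyr)
library(tibble)

df <- tibble(
  owner = list(
    list(login = "gabo"),
    list(login = "gabo"),   # exact duplicate of the first element
    list(login = "jenn")
  )
)

# distinct() compares whole list elements, so the duplicate collapses:
df |>
  distinct(owner) |>
  nrow()
```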
Definitely want to maintain the ID column in case you plan to join the data together.
```r
chars <- tibble(json = got_chars)

aliases <- chars |>
  unnest_wider(json) |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  filter(aliases != '')

allegiances <- chars |>
  unnest_wider(json) |>
  select(id, allegiances) |>
  unnest_longer(allegiances)

books <- chars |>
  unnest_wider(json) |>
  select(id, books) |>
  unnest_longer(books)

tv_series <- chars |>
  unnest_wider(json) |>
  select(id, tvSeries) |>
  unnest_longer(tvSeries) |>
  filter(tvSeries != '')

aliases
## # A tibble: 107 × 2
##       id aliases
##    <int> <chr>
##  1  1022 Prince of Fools
##  2  1022 Theon Turncloak
##  3  1022 Reek
##  4  1022 Theon Kinslayer
##  5  1052 The Imp
##  6  1052 Halfman
##  7  1052 The boyman
##  8  1052 Giant of Lannister
##  9  1052 Lord Tywin's Doom
## 10  1052 Lord Tywin's Bane
## # ℹ 97 more rows

allegiances
## # A tibble: 33 × 2
##       id allegiances
##    <int> <chr>
##  1  1022 House Greyjoy of Pyke
##  2  1052 House Lannister of Casterly Rock
##  3  1074 House Greyjoy of Pyke
##  4  1166 House Nymeros Martell of Sunspear
##  5   130 House Nymeros Martell of Sunspear
##  6  1303 House Targaryen of King's Landing
##  7  1319 House Baratheon of Dragonstone
##  8  1319 House Seaworth of Cape Wrath
##  9   148 House Stark of Winterfell
## 10   149 House Oakheart of Old Oak
## # ℹ 23 more rows

books
## # A tibble: 77 × 2
##       id books
##    <int> <chr>
##  1  1022 A Game of Thrones
##  2  1022 A Storm of Swords
##  3  1022 A Feast for Crows
##  4  1052 A Feast for Crows
##  5  1052 The World of Ice and Fire
##  6  1074 A Game of Thrones
##  7  1074 A Clash of Kings
##  8  1074 A Storm of Swords
##  9  1109 A Clash of Kings
## 10  1166 A Game of Thrones
## # ℹ 67 more rows

tv_series
## # A tibble: 93 × 2
##       id tvSeries
##    <int> <chr>
##  1  1022 Season 1
##  2  1022 Season 2
##  3  1022 Season 3
##  4  1022 Season 4
##  5  1022 Season 5
##  6  1022 Season 6
##  7  1052 Season 1
##  8  1052 Season 2
##  9  1052 Season 3
## 10  1052 Season 4
## # ℹ 83 more rows
```
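To see why keeping `id` pays off, a sketch of joining two of the pieces back together (tiny hand-made stand-ins for the `aliases` and `books` tables, not the full data above):

```r
library(dplyr)
library(tibble)

aliases_mini <- tribble(
  ~id, ~aliases,
  1022, "Reek",
  1052, "The Imp"
)
books_mini <- tribble(
  ~id, ~books,
  1022, "A Game of Thrones",
  1052, "A Feast for Crows"
)

# The shared id column lets us reassemble the unnested pieces:
left_join(aliases_mini, books_mini, by = "id")
```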
Using `pivot_longer` to split the data works in this case because all the list columns are unnamed and each would require an `unnest_longer` call. If some of the lists were named, or, to be more encompassing, if the lists would unnest to different widths column-wise, then un-nesting them simultaneously would not give you a fully populated rectangle, since some cells would not be populated.

```r
tibble(json = got_chars) |>
  unnest_wider(json) |>
  select(id, where(is.list)) |>
  pivot_longer(
    where(is.list),
    names_to = "name",
    values_to = "value"
  )
## # A tibble: 210 × 3
##       id name        value
##    <int> <chr>       <list>
##  1  1022 titles      <chr [2]>
##  2  1022 aliases     <chr [4]>
##  3  1022 allegiances <chr [1]>
##  4  1022 books       <chr [3]>
##  5  1022 povBooks    <chr [2]>
##  6  1022 tvSeries    <chr [6]>
##  7  1022 playedBy    <chr [1]>
##  8  1052 titles      <chr [2]>
##  9  1052 aliases     <chr [11]>
## 10  1052 allegiances <chr [1]>
## # ℹ 200 more rows
```
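To illustrate that caveat, a small sketch (made-up data) where two list-columns unnest to different lengths within the same row, so a simultaneous `unnest_longer` fails:

```r
library(tidyr)
library(tibble)

df <- tibble(
  x = 1:2,
  y = list(c("y1", "y2"), c("y3", "y4")),
  z = list(c("z1", "z2", "z3"), "z4")  # row 1: length 3 vs y's length 2
)

# Lengths 2 and 3 can't be recycled against each other, so this errors:
res <- try(unnest_longer(df, c(y, z)), silent = TRUE)
inherits(res, "try-error")
```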
It is of mixed length since the addresses have different amounts of detail (e.g. some places have a county, some do not). I personally found it better to use `unnest_wider` with `types`, since the two values seem to contain slightly different information and are therefore better considered as their own variables.

```r
gmaps_cities |>
  unnest_wider(json) |>
  select(results) |>
  unnest_longer(results) |>
  unnest_wider(results) |>
  select(place_id, address_components) |>
  unnest_longer(address_components) |>
  unnest_wider(address_components) |>
  unnest_wider(types, names_sep = '_')
## # A tibble: 25 × 5
##    place_id                    long_name            short_name   types_1 types_2
##    <chr>                       <chr>                <chr>        <chr>   <chr>
##  1 ChIJAYWNSLS4QIYROwVl894CDco Houston              Houston      locali… politi…
##  2 ChIJAYWNSLS4QIYROwVl894CDco Harris County        Harris Coun… admini… politi…
##  3 ChIJAYWNSLS4QIYROwVl894CDco Texas                TX           admini… politi…
##  4 ChIJAYWNSLS4QIYROwVl894CDco United States        US           country politi…
##  5 ChIJ-bDD5__lhVQRuvNfbGh4QpQ Washington           WA           admini… politi…
##  6 ChIJ-bDD5__lhVQRuvNfbGh4QpQ United States        US           country politi…
##  7 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI Washington           Washington   locali… politi…
##  8 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia District of… admini… politi…
##  9 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia DC           admini… politi…
## 10 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI United States        US           country politi…
## # ℹ 15 more rows
```
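For comparison, the `unnest_longer` alternative on a tiny stand-in for the address components (made-up rows, not the real `gmaps_cities` data): it duplicates each component once per type, whereas `unnest_wider` keeps one row per component.

```r
library(tidyr)
library(tibble)

comps <- tibble(
  long_name = c("Houston", "Texas"),
  types = list(
    list("locality", "political"),
    list("administrative_area_level_1", "political")
  )
)

# One row per (component, type) pair -- four rows here:
comps |> unnest_longer(types)

# One row per component, with types spread into types_1 / types_2:
comps |> unnest_wider(types, names_sep = "_")
```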
23.5.4 Exercises:
```r
df_col |>
  unnest_wider(col_json) |>
  unnest_longer(c(x, y))
## # A tibble: 3 × 2
##   x         y
##   <chr> <int>
## 1 a        10
## 2 x        NA
## 3 z         3

df_row |>
  unnest_wider(row_json)
## # A tibble: 3 × 2
##   x         y
##   <chr> <int>
## 1 a        10
## 2 x        NA
## 3 z         3
```
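For reference, a sketch of how `df_col` and `df_row` might be constructed (the column names `col_json`/`row_json` are assumed to match the code above, and the JSON strings are illustrative): the same records in column-major JSON, one object of parallel arrays, versus row-major JSON, an array of objects.

```r
library(jsonlite)
library(tibble)

# Column-major: one object whose fields are parallel arrays.
col_json <- parse_json('{"x": ["a", "x", "z"], "y": [10, null, 3]}')

# Row-major: an array of objects, one per record (y absent where missing).
row_json <- parse_json('[{"x": "a", "y": 10}, {"x": "x"}, {"x": "z", "y": 3}]')

df_col <- tibble(col_json = list(col_json))  # a single row holding everything
df_row <- tibble(row_json = row_json)        # one row per record
```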