ch_23_solutions

Prerequisites:

```r
library(tidyverse)
library(repurrrsive)
library(jsonlite)
```
23.3.5 Exercises:
Need to supply the `names_sep` argument to `unnest_wider`.

```r
df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

df2 |>
  unnest_wider(y, names_sep = '_')
## # A tibble: 3 × 4
##       x   y_1   y_2   y_3
##   <dbl> <dbl> <dbl> <dbl>
## 1     1    11    12    13
## 2     2    21    NA    NA
## 3     3    31    32    NA
```
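For contrast, a quick sketch of what happens without `names_sep` (assuming a reasonably recent tidyr, ≥ 1.2): the inner elements of `y` are unnamed, so `unnest_wider` has no column names to use and refuses to guess.

```r
library(tidyr)
library(tibble)

df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

# The elements of y are unnamed, so without names_sep unnest_wider()
# has nothing to name the new columns with and errors instead:
res <- try(unnest_wider(df2, y), silent = TRUE)
inherits(res, "try-error")
```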
By default it will include a column of the element names. This can be suppressed with the `indices_include` argument.

```r
df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

df1 |>
  unnest_longer(y, indices_include = FALSE)
## # A tibble: 6 × 2
##       x     y
##   <dbl> <dbl>
## 1     1    11
## 2     1    12
## 3     2    21
## 4     2    22
## 5     3    31
## 6     3    32
```
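And the default behavior, for comparison: since the elements of `y` are named, `unnest_longer` keeps those names in an extra `y_id` column (a small sketch reusing the same `df1`).

```r
library(tidyr)
library(tibble)

df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

# Default: the element names land in a y_id column alongside the values.
df1 |>
  unnest_longer(y)
```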
The issue with applying two `unnest_longer` calls is that it returns all combinations of row values (i.e. the Cartesian product).

```r
df4 <- tribble(
  ~x, ~y, ~z,
  "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
  "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
)

df4 |>
  unnest_longer(y) |>
  unnest_longer(z)
## # A tibble: 13 × 3
##    x     y     z
##    <chr> <chr> <chr>
##  1 a     y-a-1 z-a-1
##  2 a     y-a-1 z-a-2
##  3 a     y-a-2 z-a-1
##  4 a     y-a-2 z-a-2
##  5 b     y-b-1 z-b-1
##  6 b     y-b-1 z-b-2
##  7 b     y-b-1 z-b-3
##  8 b     y-b-2 z-b-1
##  9 b     y-b-2 z-b-2
## 10 b     y-b-2 z-b-3
## 11 b     y-b-3 z-b-1
## 12 b     y-b-3 z-b-2
## 13 b     y-b-3 z-b-3
```
- To get the ordered pairing of data (e.g. the first value of `y` with the first value of `z`), you can supply a vector of columns to a single `unnest_longer` call. This property is seen in the “Examples” section of the documentation.

```r
df4 |>
  unnest_longer(c(y, z))
## # A tibble: 5 × 3
##   x     y     z
##   <chr> <chr> <chr>
## 1 a     y-a-1 z-a-1
## 2 a     y-a-2 z-a-2
## 3 b     y-b-1 z-b-1
## 4 b     y-b-2 z-b-2
## 5 b     y-b-3 z-b-3
```
23.4.4 Exercises:
I assume `gh_repos` was created the evening of October 25th, 2016, or early on October 26th. I looked to see when the last update was recorded in the data set, and since updates are roughly a daily occurrence, I assume the data set wasn't created on October 27th or later. Someone better than me at stats could probably apply an exponential distribution to find the expected next occurrence and consequently the likelihood the data set was created before that point.

```r
repos <- tibble(gh_repos)

repos |>
  unnest_longer(gh_repos) |>
  unnest_wider(gh_repos) |>
  mutate(update = as_date(updated_at)) |>
  count(update) |>
  arrange(desc(update))
## # A tibble: 142 × 2
##    update         n
##    <date>     <int>
##  1 2016-10-25     2
##  2 2016-10-24     2
##  3 2016-10-23     3
##  4 2016-10-22     1
##  5 2016-10-20     1
##  6 2016-10-19     4
##  7 2016-10-18     2
##  8 2016-10-16     1
##  9 2016-10-15     1
## 10 2016-10-11     2
## # ℹ 132 more rows
```
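To sketch that exponential idea (base R only; the dates are hard-coded from the output above so the example is self-contained, and this is a back-of-the-envelope model, not a rigorous fit): treat the gaps between updates as exponentially distributed, estimate the rate from the mean gap, and ask how likely another update would have landed within a couple of days.

```r
# Update dates copied from the output above (the most recent ten).
dates <- as.Date(c(
  "2016-10-11", "2016-10-15", "2016-10-16", "2016-10-18", "2016-10-19",
  "2016-10-20", "2016-10-22", "2016-10-23", "2016-10-24", "2016-10-25"
))

gaps <- as.numeric(diff(dates))  # days between successive updates
rate <- 1 / mean(gaps)           # exponential rate estimate (updates per day)

# P(at least one update within d days of the last observed one):
p_within <- function(d) 1 - exp(-rate * d)

# If the snapshot were taken on Oct 27th or later, we'd expect to have
# seen another update with this probability -- but we didn't:
round(p_within(2), 2)
```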
`distinct` does seem to work with list-columns. Granted, I would usually do the `unnest_wider` call first because then I have the flexibility to determine uniqueness using a subset of columns from the unnested `owner` in my `distinct` call.

```r
repos |>
  unnest_longer(gh_repos) |>
  unnest_wider(gh_repos) |>
  select(owner) |>
  distinct() |>
  unnest_wider(owner)
## # A tibble: 6 × 17
##   login id     avatar_url gravatar_id url   html_url followers_url following_url
##   <chr> <int>  <chr>      <chr>       <chr> <chr>    <chr>         <chr>
## 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
## 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
## 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
## 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
## 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
## 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
## # ℹ 9 more variables: gists_url <chr>, starred_url <chr>,
## #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
## #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>
```
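A minimal check that `distinct` really does handle list-columns (a sketch with made-up owner records, not the real `gh_repos` data; requires dplyr ≥ 1.0):

```r
library(dplyr)
library(tibble)

df <- tibble(
  owner = list(
    list(login = "gabo"),
    list(login = "gabo"),   # exact duplicate of the first element
    list(login = "jenn")
  )
)

# distinct() compares whole list elements, so the duplicate collapses:
df |>
  distinct(owner) |>
  nrow()
```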
Definitely want to maintain the ID column in case you plan to join the data together.
```r
chars <- tibble(json = got_chars)

aliases <- chars |>
  unnest_wider(json) |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  filter(aliases != '')

allegiances <- chars |>
  unnest_wider(json) |>
  select(id, allegiances) |>
  unnest_longer(allegiances)

books <- chars |>
  unnest_wider(json) |>
  select(id, books) |>
  unnest_longer(books)

tv_series <- chars |>
  unnest_wider(json) |>
  select(id, tvSeries) |>
  unnest_longer(tvSeries) |>
  filter(tvSeries != '')

aliases
## # A tibble: 107 × 2
##       id aliases
##    <int> <chr>
##  1  1022 Prince of Fools
##  2  1022 Theon Turncloak
##  3  1022 Reek
##  4  1022 Theon Kinslayer
##  5  1052 The Imp
##  6  1052 Halfman
##  7  1052 The boyman
##  8  1052 Giant of Lannister
##  9  1052 Lord Tywin's Doom
## 10  1052 Lord Tywin's Bane
## # ℹ 97 more rows

allegiances
## # A tibble: 33 × 2
##       id allegiances
##    <int> <chr>
##  1  1022 House Greyjoy of Pyke
##  2  1052 House Lannister of Casterly Rock
##  3  1074 House Greyjoy of Pyke
##  4  1166 House Nymeros Martell of Sunspear
##  5   130 House Nymeros Martell of Sunspear
##  6  1303 House Targaryen of King's Landing
##  7  1319 House Baratheon of Dragonstone
##  8  1319 House Seaworth of Cape Wrath
##  9   148 House Stark of Winterfell
## 10   149 House Oakheart of Old Oak
## # ℹ 23 more rows

books
## # A tibble: 77 × 2
##       id books
##    <int> <chr>
##  1  1022 A Game of Thrones
##  2  1022 A Storm of Swords
##  3  1022 A Feast for Crows
##  4  1052 A Feast for Crows
##  5  1052 The World of Ice and Fire
##  6  1074 A Game of Thrones
##  7  1074 A Clash of Kings
##  8  1074 A Storm of Swords
##  9  1109 A Clash of Kings
## 10  1166 A Game of Thrones
## # ℹ 67 more rows

tv_series
## # A tibble: 93 × 2
##       id tvSeries
##    <int> <chr>
##  1  1022 Season 1
##  2  1022 Season 2
##  3  1022 Season 3
##  4  1022 Season 4
##  5  1022 Season 5
##  6  1022 Season 6
##  7  1052 Season 1
##  8  1052 Season 2
##  9  1052 Season 3
## 10  1052 Season 4
## # ℹ 83 more rows
```
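To see why keeping `id` pays off, a sketch of joining two of the pieces back together (tiny hand-made stand-ins for the `aliases` and `books` tables, not the full data above):

```r
library(dplyr)
library(tibble)

aliases_mini <- tribble(
  ~id, ~aliases,
  1022, "Reek",
  1052, "The Imp"
)
books_mini <- tribble(
  ~id, ~books,
  1022, "A Game of Thrones",
  1052, "A Feast for Crows"
)

# The shared id column lets us reassemble the unnested pieces:
left_join(aliases_mini, books_mini, by = "id")
```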
Using `pivot_longer` to split the data works in this case because all the list columns are unnamed and each would require an `unnest_longer` call. If some of the lists were named, or, to be more encompassing, if the lists would unnest to different widths column-wise, then un-nesting them simultaneously would not give you a fully populated rectangle, since some cells would not be populated.

```r
tibble(json = got_chars) |>
  unnest_wider(json) |>
  select(id, where(is.list)) |>
  pivot_longer(
    where(is.list),
    names_to = "name",
    values_to = "value"
  )
## # A tibble: 210 × 3
##       id name        value
##    <int> <chr>       <list>
##  1  1022 titles      <chr [2]>
##  2  1022 aliases     <chr [4]>
##  3  1022 allegiances <chr [1]>
##  4  1022 books       <chr [3]>
##  5  1022 povBooks    <chr [2]>
##  6  1022 tvSeries    <chr [6]>
##  7  1022 playedBy    <chr [1]>
##  8  1052 titles      <chr [2]>
##  9  1052 aliases     <chr [11]>
## 10  1052 allegiances <chr [1]>
## # ℹ 200 more rows
```
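To illustrate that caveat, a small sketch (made-up data) where two list-columns unnest to different lengths within the same row, so a simultaneous `unnest_longer` fails:

```r
library(tidyr)
library(tibble)

df <- tibble(
  x = 1:2,
  y = list(c("y1", "y2"), c("y3", "y4")),
  z = list(c("z1", "z2", "z3"), "z4")  # row 1: length 3 vs y's length 2
)

# Lengths 2 and 3 can't be recycled against each other, so this errors:
res <- try(unnest_longer(df, c(y, z)), silent = TRUE)
inherits(res, "try-error")
```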
It is of mixed length since the addresses have different amounts of detail (e.g. some places have a county, some do not). I personally found it better to use `unnest_wider` with `types`, since the two values seem to contain slightly different information and are therefore better considered as their own variables.

```r
gmaps_cities |>
  unnest_wider(json) |>
  select(results) |>
  unnest_longer(results) |>
  unnest_wider(results) |>
  select(place_id, address_components) |>
  unnest_longer(address_components) |>
  unnest_wider(address_components) |>
  unnest_wider(types, names_sep = '_')
## # A tibble: 25 × 5
##    place_id                    long_name            short_name   types_1 types_2
##    <chr>                       <chr>                <chr>        <chr>   <chr>
##  1 ChIJAYWNSLS4QIYROwVl894CDco Houston              Houston      locali… politi…
##  2 ChIJAYWNSLS4QIYROwVl894CDco Harris County        Harris Coun… admini… politi…
##  3 ChIJAYWNSLS4QIYROwVl894CDco Texas                TX           admini… politi…
##  4 ChIJAYWNSLS4QIYROwVl894CDco United States        US           country politi…
##  5 ChIJ-bDD5__lhVQRuvNfbGh4QpQ Washington           WA           admini… politi…
##  6 ChIJ-bDD5__lhVQRuvNfbGh4QpQ United States        US           country politi…
##  7 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI Washington           Washington   locali… politi…
##  8 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia District of… admini… politi…
##  9 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI District of Columbia DC           admini… politi…
## 10 ChIJW-T2Wt7Gt4kRKl2I1CJFUsI United States        US           country politi…
## # ℹ 15 more rows
```
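For comparison, the `unnest_longer` alternative on a tiny stand-in for the address components (made-up rows, not the real `gmaps_cities` data): it duplicates each component once per type, whereas `unnest_wider` keeps one row per component.

```r
library(tidyr)
library(tibble)

comps <- tibble(
  long_name = c("Houston", "Texas"),
  types = list(
    list("locality", "political"),
    list("administrative_area_level_1", "political")
  )
)

# One row per (component, type) pair -- four rows here:
comps |> unnest_longer(types)

# One row per component, with types spread into types_1 / types_2:
comps |> unnest_wider(types, names_sep = "_")
```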
23.5.4 Exercises:
```r
df_col |>
  unnest_wider(col_json) |>
  unnest_longer(c(x, y))
## # A tibble: 3 × 2
##   x         y
##   <chr> <int>
## 1 a        10
## 2 x        NA
## 3 z         3

df_row |>
  unnest_wider(row_json)
## # A tibble: 3 × 2
##   x         y
##   <chr> <int>
## 1 a        10
## 2 x        NA
## 3 z         3
```
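For reference, a sketch of how `df_col` and `df_row` might be constructed (the column names `col_json`/`row_json` are assumed to match the code above, and the JSON strings are illustrative): the same records in column-major JSON, one object of parallel arrays, versus row-major JSON, an array of objects.

```r
library(jsonlite)
library(tibble)

# Column-major: one object whose fields are parallel arrays.
col_json <- parse_json('{"x": ["a", "x", "z"], "y": [10, null, 3]}')

# Row-major: an array of objects, one per record (y absent where missing).
row_json <- parse_json('[{"x": "a", "y": 10}, {"x": "x"}, {"x": "z", "y": 3}]')

df_col <- tibble(col_json = list(col_json))  # a single row holding everything
df_row <- tibble(row_json = row_json)        # one row per record
```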