Ch.15 Solutions

Prerequisites


library(tidyverse)
library(babynames)

15.3.5 Exercises:

  1. I only considered a, e, i, o, u as vowels (i.e. not y).

    babynames |> 
      distinct(name) |> 
      mutate(
        vowel_count = str_count(name, '[aeiou]')
      ) |> 
      filter(vowel_count == max(vowel_count))
    ## # A tibble: 2 × 2
    ##   name            vowel_count
    ##   <chr>                 <int>
    ## 1 Mariaguadalupe            8
    ## 2 Mariadelrosario           8
    babynames |> 
      distinct(name) |> 
      mutate(
        vowel_prop = round(str_count(name, '[aeiou]') / str_length(name) * 100, 2)
      ) |> 
      filter(vowel_prop == max(vowel_prop))
    ## # A tibble: 10 × 2
    ##    name  vowel_prop
    ##    <chr>      <dbl>
    ##  1 Louie         80
    ##  2 Louia         80
    ##  3 Gioia         80
    ##  4 Zoeie         80
    ##  5 Zoiee         80
    ##  6 Kauai         80
    ##  7 Kaiea         80
    ##  8 Douaa         80
    ##  9 Zoeii         80
    ## 10 Zoiie         80
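One caveat: `[aeiou]` is lowercase-only, so a name's capitalized first letter is never counted even when it is a vowel (e.g. the A in Aaliyah). A sketch of a fix — lowercasing before counting — which may change which names come out on top:

```r
library(tidyverse)
library(babynames)

babynames |>
  distinct(name) |>
  mutate(
    # lowercase first so an initial capital vowel is counted too
    vowel_count = str_count(str_to_lower(name), '[aeiou]')
  ) |>
  filter(vowel_count == max(vowel_count))
```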
  2. The need for 4 consecutive backslashes is explained in the “Escaping” section of this chapter. Notice that str_view() shows the raw string, without the visible escape characters that print() adds.

    example_string <- "a/b/c/d/e"
    backslash_string <- str_replace_all(example_string, '/', '\\\\')
    print(backslash_string)
    ## [1] "a\\b\\c\\d\\e"
    
    str_view(backslash_string)
    ## [1] │ a\b\c\d\e
    
    str_view(str_replace_all(backslash_string, '\\\\', '/'))
    ## [1] │ a/b/c/d/e
  3. letters and LETTERS are two of R’s built-in constants. There are additional constants for month names (month.name), month abbreviations (month.abb), and the value of pi (pi).

    test_string <- 'Abraham lowercase UPPERCASE'
    letter_vector <- setNames(letters, LETTERS)
    
    str_replace_all(test_string, letter_vector)
    ## [1] "abraham lowercase uppercase"
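For reference, the other built-in constants mentioned above look like this:

```r
month.name[1:3]  # "January"  "February" "March"
month.abb[1:3]   # "Jan" "Feb" "Mar"
pi               # 3.141593
```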
  4. As an American who has cleaned my fair share of user-entered phone numbers, I know people will use many formats if not restricted (e.g. they may or may not include a country code, or may wrap the area code in parentheses). To keep it simple, I will only look at the format ###-###-#### using techniques discussed up to this point in the book.

    example_numbers <- c('123-456-7890', '1234567890', '(123) 456-3456', '123---434-4454')
    str_detect(example_numbers, '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')
    ## [1]  TRUE FALSE FALSE FALSE
    • In the next section you will learn about escaping and shorthand character classes, which allow you to match many other phone number formats.
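As a preview of that section, the same pattern can be written much more compactly with the \d shorthand class and {n} quantifiers:

```r
library(stringr)

example_numbers <- c('123-456-7890', '1234567890', '(123) 456-3456', '123---434-4454')
# \d is shorthand for [0-9]; {3} repeats the previous token three times
str_detect(example_numbers, '\\d{3}-\\d{3}-\\d{4}')
## [1]  TRUE FALSE FALSE FALSE
```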

15.4.7 Exercises:

  1. I used the raw string format we learned in Chapter 14.

    str_detect(r"["'\]",'\"\'\\\\')
    ## [1] TRUE
    str_detect(r"["$^$"]", '\"\\$\\^\\$\"')
    ## [1] TRUE
  2. A regular expression passes through two layers: first the R parser, then the regex engine, and each layer consumes one level of escaping. Therefore, for the regex engine to receive \\ (an escaped literal backslash), we need to send \\\\ to the parser. Thankfully, if my explanation isn’t sufficient, this has been asked online many times.
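A small demonstration of the two layers:

```r
library(stringr)

four <- '\\\\'            # the parser collapses these four characters to two: \\
nchar(four)               # the string itself holds only 2 characters
## [1] 2
str_detect('a\\b', four)  # the string "a\b" contains one literal backslash
## [1] TRUE
```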

  3. For some of these I printed the number of matches to avoid printing large vectors.

    • Starts with y:
    words[str_detect(words, '^y')]
    ## [1] "year"      "yes"       "yesterday" "yet"       "you"       "young"
    • Doesn’t start with y:
    length(words[str_detect(words, '^[^y]')])
    ## [1] 974
    • Ends with x:
    words[str_detect(words, 'x$')]
    ## [1] "box" "sex" "six" "tax"
    • Are exactly 3 letters long:
    words[str_detect(words, '\\b\\w{3}\\b')] |> 
      length()
    ## [1] 110
    • Has 7 or more letters:
    words[str_detect(words, '\\w{7}')] |> 
      length()
    ## [1] 219
    • Contains a vowel-consonant pair:
    words[str_detect(words, '[aeiou][^aeiou]')] |> 
      length()
    ## [1] 944
    • Contains at least 2 vowel-consonant pairs in a row:
    words[str_detect(words, '[aeiou][^aeiou][aeiou][^aeiou]')] |> 
      length()
    ## [1] 169
    • Only consists of repeated vowel-consonant pairs:
      • I assume this means the word alternates vowel-consonant for its entire length, not that the identical pair repeats each time.
    words[str_detect(words, '\\b([aeiou][^aeiou]){1,}\\b')]
    ##  [1] "as"       "at"       "away"     "eleven"   "even"     "ever"    
    ##  [7] "if"       "in"       "it"       "item"     "of"       "okay"    
    ## [13] "on"       "open"     "or"       "original" "over"     "unit"    
    ## [19] "up"       "upon"
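Since each element of words is a single word, the \b boundaries above act like anchors; an equivalent and arguably clearer form uses ^ and $ explicitly. A quick check that the two styles agree:

```r
library(stringr)

# anchored equivalents: for single words, ^ and $ do the job of \b
sum(str_detect(words, '^\\w{3}$'))
## [1] 110
identical(
  str_detect(words, '^([aeiou][^aeiou])+$'),
  str_detect(words, '\\b([aeiou][^aeiou]){1,}\\b')
)
## [1] TRUE
```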
  4. I did most of these using alternation (the | operator).

    brit_american_words <- c('airplane', 'aeroplane',
                             'aluminum', 'aluminium',
                             'analog', 'analogue',
                             'ass', 'arse',
                             'center', 'centre',
                             'defense', 'defence',
                             'donut', 'doughnut',
                             'gray', 'grey',
                             'modeling', 'modelling',
                             'skeptic', 'sceptic',
                             'summarize', 'summarise')
    
    brit_american_words[str_detect(brit_american_words, '(air|aero)plane')]
    ## [1] "airplane"  "aeroplane"
    
    brit_american_words[str_detect(brit_american_words, 'alumini{0,1}um')]
    ## [1] "aluminum"  "aluminium"
    
    brit_american_words[str_detect(brit_american_words, 'analo(gue|g)')]
    ## [1] "analog"   "analogue"
    
    brit_american_words[str_detect(brit_american_words, '(ass|arse)')]
    ## [1] "ass"  "arse"
    
    brit_american_words[str_detect(brit_american_words, 'cent(er|re)')]
    ## [1] "center" "centre"
    
    brit_american_words[str_detect(brit_american_words, 'defen(c|s)e')]
    ## [1] "defense" "defence"
    
    brit_american_words[str_detect(brit_american_words, 'gr(a|e)y')]
    ## [1] "gray" "grey"
    
    brit_american_words[str_detect(brit_american_words, '(do|dough)nut')]
    ## [1] "donut"    "doughnut"
    
    brit_american_words[str_detect(brit_american_words, 'model{1,2}ing')]
    ## [1] "modeling"  "modelling"
    
    brit_american_words[str_detect(brit_american_words, 's(c|k)eptic')]
    ## [1] "skeptic" "sceptic"
    
    brit_american_words[str_detect(brit_american_words, 'summari(s|z)e')]
    ## [1] "summarize" "summarise"
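For comparison, the eleven checks can also be collapsed into a single alternation; each branch of this sketch mirrors one of the patterns above:

```r
library(stringr)

brit_american_words <- c(
  'airplane', 'aeroplane', 'aluminum', 'aluminium', 'analog', 'analogue',
  'ass', 'arse', 'center', 'centre', 'defense', 'defence', 'donut',
  'doughnut', 'gray', 'grey', 'modeling', 'modelling', 'skeptic',
  'sceptic', 'summarize', 'summarise'
)

spelling_pattern <- paste(
  '(air|aero)plane', 'alumini?um', 'analog(ue)?', 'arse|ass',
  'cent(er|re)', 'defen[cs]e', '(dough|do)nut', 'gr[ae]y',
  'modell?ing', 's[ck]eptic', 'summari[sz]e',
  sep = '|'
)
all(str_detect(brit_american_words, spelling_pattern))
## [1] TRUE
```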
  5. str_replace() allows you to use backreferences to groups in the replacement. Make sure your approach still returns one-letter words such as “a”; some of my earlier attempts at this problem did not work for words of length 1, because the pattern requires at least two characters to match.

    words[str_replace(words, '(^.)(.*)(.$)','\\3\\2\\1') %in% words]
    ##  [1] "a"          "america"    "area"       "dad"        "dead"      
    ##  [6] "deal"       "dear"       "depend"     "dog"        "educate"   
    ## [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
    ## [16] "example"    "excuse"     "exercise"   "expense"    "experience"
    ## [21] "eye"        "god"        "health"     "high"       "knock"     
    ## [26] "lead"       "level"      "local"      "nation"     "no"        
    ## [31] "non"        "on"         "rather"     "read"       "refer"     
    ## [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
    ## [41] "transport"  "treat"      "trust"      "window"     "yesterday"
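A quick check of the length-1 edge case: the pattern needs at least two characters, so str_replace() returns “a” unchanged, which is why it survives the %in% words filter:

```r
library(stringr)

str_replace('a', '(^.)(.*)(.$)', '\\3\\2\\1')  # no match, so the input comes back as-is
## [1] "a"
```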
  6. Keep in mind that a string only needs to contain the pattern once to match, and it may contain more than what gets matched; e.g. in part f the string fffg is a valid match.

    a. Matches any string of characters.

    b. Matches strings with curly brackets around 1 or more characters.

    c. Matches 4 digits, then 2 pairs of 2 digits, separated by hyphens, e.g. 1234-56-78.

    d. Matches strings with at least 4 backslashes in a row.

    e. Matches a period, any character, period, any character, period, any character.

    f. Matches the same character three times in a row.

    g. Matches a pair of characters that repeats once, e.g. coco.
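Parts f and g hinge on backreferences. A reconstruction of the kind of patterns involved (my guess at their shape, not necessarily the book’s exact strings):

```r
library(stringr)

str_detect('fffg', '(.)\\1\\1')  # one character, repeated twice more
## [1] TRUE
str_detect('coco', '(..)\\1')    # a two-character group, repeated
## [1] TRUE
```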

15.6.4 Exercises:

  1. I personally find multiple str_detect() calls more readable in all these cases.

    words[str_detect(words, '(^x.*|.*x$)')]
    ## [1] "box" "sex" "six" "tax"
    words[str_detect(words, '^x') | str_detect(words, 'x$')]
    ## [1] "box" "sex" "six" "tax"
    length(words[str_detect(words, '^[aeiou].*[^aeiou]$')])
    ## [1] 122
    
    length(words[str_detect(words, '^[aeiou]') & str_detect(words, '[^aeiou]$')])
    ## [1] 122
    words <- append(words, 'facetious')
    
    words[str_detect(words, '\\b(?=[a-z]*?a)(?=[a-z]*?e)(?=[a-z]*?i)(?=[a-z]*?o)(?=[a-z]*?u)')]
    ## [1] "facetious"
    
    words[
      str_detect(words, "a") &
      str_detect(words, "e") &
      str_detect(words, "i") &
      str_detect(words, "o") &
      str_detect(words, "u")
    ]
    ## [1] "facetious"
  2. For such a well-known phrase, it’s pretty astounding how many counterexamples it has.

    words[str_detect(words, '[^c]ie') | str_detect(words, 'cei')]
    ##  [1] "achieve"    "believe"    "brief"      "client"     "die"       
    ##  [6] "experience" "field"      "friend"     "lie"        "piece"     
    ## [11] "quiet"      "receive"    "tie"        "view"
    
    words[str_detect(words, 'cie') | str_detect(words, '.*[^c]ei')]
    ## [1] "science" "society" "weigh"
  3. I did this by creating a regex group that extracts the leading characters of a color name if it contains another color from the list.

    colors <- colors()
    
    unique(str_match(colors, str_c('(.+)(', str_flatten(colors, '|'), ')'))[,2])
    ##  [1] NA               "alice"          "antique"        "blue"          
    ##  [5] "cadet"          "cornflower"     "dark"           "darkolive"     
    ##  [9] "darksea"        "darkslate"      "deep"           "deepsky"       
    ## [13] "dim"            "dodger"         "floral"         "forest"        
    ## [17] "ghost"          "green"          "hot"            "indian"        
    ## [21] "lawn"           "light"          "lightgoldenrod" "lightsea"      
    ## [25] "lightsky"       "lightslate"     "lightsteel"     "lime"          
    ## [29] "medium"         "mediumsea"      "mediumslate"    "mediumspring"  
    ## [33] "mediumviolet"   "midnight"       "navajo"         "navy"          
    ## [37] "orange"         "pale"           "paleviolet"     "powder"        
    ## [41] "rosy"           "royal"          "saddle"         "sandy"         
    ## [45] "sea"            "sky"            "slate"          "spring"        
    ## [49] "steel"          "violet"         "yellow"
  4. I pulled anything before the first space, since a review of the data showed that the name of the grouping data frame follows a space.

    str_match(data(package = "datasets")$results[, "Item"], '\\S*') |> 
      length()
    ## [1] 104