Expansion of `str_to...` to include abbreviations

Time:10-06

I want to create column headers in a particular case while leaving abbreviations untouched. snakecase::to_any_case() is a fine function, but it removes special characters (e.g. #).

Say that a string (or, more specifically, a table) has these values as its column names:

library(stringr)
x <- c("nyc", "buffalo", "la", "raleigh", "richmond")
str_to_title(x)
#> [1] "Nyc"      "Buffalo"  "La"       "Raleigh"  "Richmond"

But "NYC" and "LA" are abbreviations, and I want them to be all upper case, i.e.

"NYC"      "Buffalo"  "LA"       "Raleigh"  "Richmond"

Of course, this problem goes beyond just changing the case of table headers. There may be other situations where it is useful for the `str_to...` functions to handle abbreviations.

Created on 2021-10-05 by the reprex package (v2.0.1)

CodePudding user response:

stringr is a versatile package for working with strings, and it can easily be extended.

  
#' Convert the case of a string, but keep the listed abbreviations as given.
#'
#' @param string A character vector (for example, the names of a data frame).
#' @param str_to A single case-conversion function from stringr: str_to_title,
#'   str_to_sentence, str_to_upper, or str_to_lower.
#' @param abbreviations Character vector of (uppercase) abbreviations.
#'
#' @return A character vector.
#'
#' @author Malte Grosser, \email{malte.grosser@@gmail.com}
#' @keywords utilities
#'
str_to_with_abbreviations <-
  function(string,
           str_to = stringr::str_to_title,
           abbreviations = NULL
  ){
    if(is.null(abbreviations)){
      warning("`abbreviations` is NULL; calling `str_to()` directly would give the same result.")
      return(str_to(string))
    }

    # To replace, convert both abbreviations and string with `str_to`
    abbreviations_to_x <- str_to(abbreviations)
    string_to_x <- str_to(string)

    # Are abbreviations in the string? If not, just return!
    abbreviations_in_string <- any(string_to_x %in% abbreviations_to_x)

    if(abbreviations_in_string){
      # Anything better than a for loop?
      for(j in seq_along(abbreviations)){
        string_to_x <- stringr::str_replace_all(
          string_to_x,
          abbreviations_to_x[j],
          abbreviations[j]
        )
      }
    }

    return(string_to_x)
  }

This function can be applied to the string specified to get the desired output:

str_to_with_abbreviations(
  c("nyc", "buffalo", "la", "raleigh", "richmond"),
  abbreviations = c("NYC", "LA"),
  str_to = str_to_title
)
#> [1] "NYC"      "Buffalo"  "LA"       "Raleigh"  "Richmond"

The function isn't perfect, but it is a start for most applications.
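One limitation is that `str_replace_all()` matches substrings, so a value such as "la paz" would come out as "LA Paz" whenever the vector also contains a bare "la". As a minimal sketch of one way around that, and of one possible answer to the "anything better than a for loop?" comment, the replacement can be done with `match()` on whole elements. The function name `str_to_keep_abbreviations` is hypothetical and not part of stringr.

```r
# Hypothetical helper (not part of stringr): swap in an abbreviation only
# when an entire element equals the case-converted abbreviation.
str_to_keep_abbreviations <- function(string,
                                      abbreviations,
                                      str_to = stringr::str_to_title) {
  converted <- str_to(string)

  # Position of each converted element among the converted abbreviations
  # (NA where the element is not an abbreviation).
  idx <- match(converted, str_to(abbreviations))
  is_abbrev <- !is.na(idx)

  # Restore the abbreviations exactly as the caller wrote them.
  converted[is_abbrev] <- abbreviations[idx[is_abbrev]]
  converted
}

str_to_keep_abbreviations(
  c("nyc", "buffalo", "la", "la paz"),
  abbreviations = c("NYC", "LA")
)
#> [1] "NYC"     "Buffalo" "LA"      "La Paz"
```

Because the abbreviations are never treated as regular expressions here, this version should also be safe for abbreviations that contain characters such as `#` or `.`.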

CodePudding user response:

I see that in your answer you bundled everything into a function. I didn't get into that, but you could easily enough wrap this in a function. A couple of notes on the logic and regex that should help you scale this:

  • Collapse all your abbreviations into a single regex pattern that will find any of the abbreviations, in the form "(patt1|patt2)"
  • Make the pattern span from the beginning of the string to the end with "^" and "$"; otherwise "la paz" will become "LA Paz"
  • Use the string transformation function (str_to_title is most applicable for town names) on both the abbreviations and the string you're changing. This is where your function could make things more efficient
  • Use gsub with perl = TRUE to let you substitute uppercase versions ("\\U") of your matches. As far as I know the stringr functions don't take perl arguments, though the underlying stringi functions might.
x <- c("nyc", "buffalo", "la", "raleigh", "richmond", "la paz")
abbrev <- c("NYC", "LA")
abbrev_patt <- sprintf("^(%s)$", paste(stringr::str_to_title(abbrev), collapse = "|"))
abbrev_patt
#> [1] "^(Nyc|La)$"
gsub(abbrev_patt, "\\U\\1", stringr::str_to_title(x), perl = TRUE)
#> [1] "NYC"      "Buffalo"  "LA"       "Raleigh"  "Richmond" "La Paz"
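
The steps above could be wrapped in a function along these lines and applied to column names, which was the original motivation. This is only a sketch: the name `abbrev_to_upper` is hypothetical, it assumes the abbreviations contain no regex metacharacters, and it upper-cases each match rather than restoring the abbreviation as typed.

```r
# Hypothetical wrapper around the anchored-pattern approach.
# Assumes `abbrev` contains no regex metacharacters (no escaping is done).
abbrev_to_upper <- function(x, abbrev, str_to = stringr::str_to_title) {
  # Build "^(Abbr1|Abbr2)$" from the case-converted abbreviations.
  patt <- sprintf("^(%s)$", paste(str_to(abbrev), collapse = "|"))

  # "\\U\\1" upper-cases the captured match; this needs perl = TRUE.
  gsub(patt, "\\U\\1", str_to(x), perl = TRUE)
}

df <- data.frame(nyc = 1, buffalo = 2, la = 3)
names(df) <- abbrev_to_upper(names(df), abbrev = c("NYC", "LA"))
names(df)
#> [1] "NYC"     "Buffalo" "LA"
```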