String replacing UTF special characters like ®, °F in a dataframe

Time:10-11

I want to replace special characters such as ® and °F in all the columns of my dataframe with their HTML entities.

special_char <- function(df) {
  df %>%
    mutate_all(.funs = ~ str_replace_all(.x, pattern = "®", replacement = "&reg;"))
}

However, this code does not replace ® with &reg; in the columns as I want. Instead, ® remains, as if the pattern were never detected.

CodePudding user response:

If you only have a few specific symbols to change, it is easiest to use their Unicode code points, written in R as the escapes "\u00ae" and "\u00b0". For example, to change all occurrences of the registered trademark symbol (U+00AE) to the equivalent HTML entity (&reg;), and any degree symbols (U+00B0) to the entity &deg;, we can do:

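library(dplyr)
library(stringr)

special_char <- function(df) {
  # Named vector (pattern = replacement): every pattern is applied to every
  # element. Two unnamed vectors would be recycled element by element instead.
  replacements <- setNames(c("&reg;", "&deg;"), c("\u00ae", "\u00b0"))
  mutate(df, across(everything(), ~ str_replace_all(.x, replacements)))
}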

So, if your data frame looks like this:

data <- data.frame(a = c("Stack Overflow®", "451°F"),
                   b = c("Coca Cola®", "22°F"))
#>                 a          b
#> 1 Stack Overflow® Coca Cola®
#> 2           451°F       22°F

Your function will escape all relevant instances:

data %>% special_char()
#>                     a              b
#> 1 Stack Overflow&reg; Coca Cola&reg;
#> 2           451&deg;F       22&deg;F

If you want all non-ASCII characters encoded as HTML entities, a more general solution is to use numeric character references (for example, &#174; for ®). This is less human-readable, but probably the go-to option if you have a lot of different symbols to escape. A useful starting point would be Mr Flick's solution here, though you would need to vectorize this function to get it working with data frame columns.
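
As a rough illustration of that more general approach, here is a minimal base R sketch. It is not the linked solution: the helper name to_numeric_entities is hypothetical, and it assumes the input is a character vector that can be converted to UTF-8. Each string is split into code points, and every code point above 127 is written as a decimal numeric entity.

```r
# Hypothetical helper (not from the linked answer): replace every non-ASCII
# character in a character vector with a decimal numeric entity such as &#174;
to_numeric_entities <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)       # keep missing values as NA
    cps <- utf8ToInt(enc2utf8(s))             # code points of the string
    chars <- intToUtf8(cps, multiple = TRUE)  # one string per character
    paste0(ifelse(cps > 127, sprintf("&#%d;", cps), chars), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

data <- data.frame(a = c("Stack Overflow\u00ae", "451\u00b0F"),
                   b = c("Coca Cola\u00ae", "22\u00b0F"),
                   stringsAsFactors = FALSE)

# Apply to character columns only; other column types are left untouched
data[] <- lapply(data, function(col) {
  if (is.character(col)) to_numeric_entities(col) else col
})

data$a
#> [1] "Stack Overflow&#174;" "451&#176;F"
```

Because vapply() loops over the elements, the helper is already vectorized over a column, so it can also be used inside mutate(across(where(is.character), to_numeric_entities)) if you prefer to stay with dplyr.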
