Home > Software design >  If string detected in df, create new column with replacement string R dplyr
If string detected in df, create new column with replacement string R dplyr

Time:11-04

I'm using the mutate method to find strings in a column (Name in the example)and replace them with a corrected string in R, which works well both on partial and full strings.

Method:

df <- data.frame(Name = c("Jim","Bob","Sue","Sally","Jimmm","Boob","Suezi","Sallyyyy","Jim","Bob","Sue","Sally"),
Period = c("P1","P1","P1","P1","P2","P2","P2","P2","P3","P3","P3","P3"),
Value = c(150, 200, 325, 120, 760,245,46,244,200, 325, 120, 760))

df <- df %>% 
  mutate(Name = case_when(
  str_detect(Name, "Jim") ~ "Jim",
str_detect(Name, "Sue") ~ "Sue",
  TRUE ~ Name)) %>%
  mutate(across(Name, str_replace, "Sallyyyy", "Sally")) 

In my real application I realized I should probably maintain the original column for reference and and create a new column with the corrections.

I tried simply adding a new column the standard way in r, as below:

df$test <- df %>% 
  mutate(Name = case_when(
  str_detect(Name, "Jim") ~ "Jim",
  TRUE ~ Name)) %>%
  mutate(across(Name, str_replace, "Sallyyyy", "Sally")) 

but instead of just creating a new column called test, in this case it creates a copy of the entire dataframe.

Is there a method within the mutate function that will allow me to create a new column with the correction as opposed to replacing it in the original column?

CodePudding user response:

I realized I'm over complicating this, the simple solution here is to just create a copy of the column and apply the correction to the copied column, like so:

df$Name_Correct <- df$Name

Full solution:

df <- data.frame(Name = c("Jim","Bob","Sue","Sally","Jimmm","Boob","Suezi","Sallyyyy","Jim","Bob","Sue","Sally"),
                 Period = c("P1","P1","P1","P1","P2","P2","P2","P2","P3","P3","P3","P3"),
                 Value = c(150, 200, 325, 120, 760,245,46,244,200, 325, 120, 760))

df$Name_Correct <- df$Name

df <- df %>% 
  mutate(Name_Correct = case_when(
    str_detect(Name_Correct, "Jim") ~ "Jim",
    str_detect(Name_Correct, "Sue") ~ "Sue",
    TRUE ~ Name_Correct)) %>%
  mutate(across(Name_Correct, str_replace, "Sallyyyy", "Sally")) 

CodePudding user response:

If you have a dataframe full of good names, you could try to partially match the closest name. You could use the built in babynames dataset for reference. This way, you would need no case_when and you could recode all in one go:

library(tidyverse)

nms <- babynames::babynames |>
  filter(year > 1990) |>
  filter(!duplicated(name)) |>
  pull(name)

df |>
  mutate(Name_Correct = map_chr(Name, ~nms[which.min(stringdist::stringdist(.x, nms ))]))
#>        Name Period Value Name_Correct
#> 1       Jim     P1   150          Jim
#> 2       Bob     P1   200          Bob
#> 3       Sue     P1   325          Sue
#> 4     Sally     P1   120        Sally
#> 5     Jimmm     P2   760        Jimmy
#> 6      Boob     P2   245          Bob
#> 7     Suezi     P2    46         Suzi
#> 8  Sallyyyy     P2   244        Sally
#> 9       Jim     P3   200          Jim
#> 10      Bob     P3   325          Bob
#> 11      Sue     P3   120          Sue
#> 12    Sally     P3   760        Sally

Its not perfect, but you can refine with the stringdist function to get it tuned up correctly.

  • Related