I'm using the mutate method to find strings in a column (Name
in the example)and replace them with a corrected string in R, which works well both on partial and full strings.
Method:
df <- data.frame(Name = c("Jim","Bob","Sue","Sally","Jimmm","Boob","Suezi","Sallyyyy","Jim","Bob","Sue","Sally"),
Period = c("P1","P1","P1","P1","P2","P2","P2","P2","P3","P3","P3","P3"),
Value = c(150, 200, 325, 120, 760,245,46,244,200, 325, 120, 760))
df <- df %>%
mutate(Name = case_when(
str_detect(Name, "Jim") ~ "Jim",
str_detect(Name, "Sue") ~ "Sue",
TRUE ~ Name)) %>%
mutate(across(Name, str_replace, "Sallyyyy", "Sally"))
In my real application I realized I should probably maintain the original column for reference and and create a new column with the corrections.
I tried simply adding a new column the standard way in r, as below:
df$test <- df %>%
mutate(Name = case_when(
str_detect(Name, "Jim") ~ "Jim",
TRUE ~ Name)) %>%
mutate(across(Name, str_replace, "Sallyyyy", "Sally"))
but instead of just creating a new column called test
, in this case it creates a copy of the entire dataframe.
Is there a method within the mutate function that will allow me to create a new column with the correction as opposed to replacing it in the original column?
CodePudding user response:
I realized I'm over complicating this, the simple solution here is to just create a copy of the column and apply the correction to the copied column, like so:
df$Name_Correct <- df$Name
Full solution:
df <- data.frame(Name = c("Jim","Bob","Sue","Sally","Jimmm","Boob","Suezi","Sallyyyy","Jim","Bob","Sue","Sally"),
Period = c("P1","P1","P1","P1","P2","P2","P2","P2","P3","P3","P3","P3"),
Value = c(150, 200, 325, 120, 760,245,46,244,200, 325, 120, 760))
df$Name_Correct <- df$Name
df <- df %>%
mutate(Name_Correct = case_when(
str_detect(Name_Correct, "Jim") ~ "Jim",
str_detect(Name_Correct, "Sue") ~ "Sue",
TRUE ~ Name_Correct)) %>%
mutate(across(Name_Correct, str_replace, "Sallyyyy", "Sally"))
CodePudding user response:
If you have a dataframe full of good names, you could try to partially match the closest name. You could use the built in babynames
dataset for reference. This way, you would need no case_when
and you could recode all in one go:
library(tidyverse)
nms <- babynames::babynames |>
filter(year > 1990) |>
filter(!duplicated(name)) |>
pull(name)
df |>
mutate(Name_Correct = map_chr(Name, ~nms[which.min(stringdist::stringdist(.x, nms ))]))
#> Name Period Value Name_Correct
#> 1 Jim P1 150 Jim
#> 2 Bob P1 200 Bob
#> 3 Sue P1 325 Sue
#> 4 Sally P1 120 Sally
#> 5 Jimmm P2 760 Jimmy
#> 6 Boob P2 245 Bob
#> 7 Suezi P2 46 Suzi
#> 8 Sallyyyy P2 244 Sally
#> 9 Jim P3 200 Jim
#> 10 Bob P3 325 Bob
#> 11 Sue P3 120 Sue
#> 12 Sally P3 760 Sally
Its not perfect, but you can refine with the stringdist
function to get it tuned up correctly.