Home > Back-end >  R: Find and replace character in one df col with character from another df
R: Find and replace character in one df col with character from another df

Time:05-27

I have a df (dataframe A) with over 200k rows with incorrect ids scattered throughout a single column (id). I have a another df (dataframe B) with all the incorrect ids of dataframe A matched with corrected ids. How can I use dataframe B to fix the errors in the id column of dataframe A?

Old dataframe A column:

id
C-2005-8-11-14
C-2005-8-11-15
C-2005-8-11-16
C-2006-3-7-1
C-2007-1-10-8
C-2007-1-10-9
C-2007-1-10-10
C-2008-7-2-4
C-2009-1-15-41  

dataframe B:

bad_id          correct_id
C-2005-8-11-14  C-2005-8-22-14
C-2006-3-7-1    C-2006-3-30-1
C-2009-1-15-41  C-2009-1-12-41

New dataframe A column:

id
C-2005-8-22-14
C-2005-8-11-15
C-2005-8-11-16
C-2006-3-30-1
C-2007-1-10-8
C-2007-1-10-9
C-2007-1-10-10
C-2008-7-2-4
C-2009-1-12-41  

CodePudding user response:

We could do it with an ifelse statement: Assuming df1 = A, df2 = B:

library(dplyr)

A %>% 
  mutate(id = ifelse(id %in% B$bad_id, B$correct_id, id))
              id
1 C-2005-8-22-14
2 C-2005-8-11-15
3 C-2005-8-11-16
4 C-2005-8-22-14
5  C-2007-1-10-8
6  C-2007-1-10-9
7 C-2007-1-10-10
8   C-2008-7-2-4
9 C-2009-1-12-41

CodePudding user response:

A possible solution:

library(dplyr)

df1 %>% 
  full_join(df2, by = c("id" = "bad_id")) %>% 
  mutate(correct_id = coalesce(id, correct_id), id = NULL)

#>       correct_id
#> 1 C-2005-8-11-14
#> 2 C-2005-8-11-15
#> 3 C-2005-8-11-16
#> 4   C-2006-3-7-1
#> 5  C-2007-1-10-8
#> 6  C-2007-1-10-9
#> 7 C-2007-1-10-10
#> 8   C-2008-7-2-4
#> 9 C-2009-1-15-41
  • Related