How do I match names in two dataframes and output the matched value?

Time:03-31

Say I have two dataframes:

a <- c("smith", "lee", "black", "gonzalez", "rodriguez")
df1 <- as.data.frame(a)

df1
          a
1     smith
2       lee
3     black
4  gonzalez
5 rodriguez

b <- c("harry smith", "john smith", "laura smith", "carol black", "peter h. black", "cora lee", "benjamin d. black", "gonzalez 12323902130", "rodriguez 0931029321")
df2 <- as.data.frame(b)

df2

                     b
1          harry smith
2           john smith
3          laura smith
4          carol black
5       peter h. black
6             cora lee
7    benjamin d. black
8 gonzalez 12323902130
9 rodriguez 0931029321

If an entry in df2$b such as "harry smith" contains a name from df1$a, I want to output that name ("smith"). Ideally, I would end up with something like this:

b <- c("harry smith", "john smith", "laura smith", "carol black", "peter h. black", "cora lee", "benjamin d. black", "gonzalez 12323902130", "rodriguez 0931029321")
match <- c("smith", "smith", "smith", "black", "black", "lee", "black", "gonzalez", "rodriguez")
df <- data.frame(match, b)

df

match                         b
smith              harry smith
smith               john smith
smith              laura smith
black              carol black
black           peter h. black
lee                   cora lee
black        benjamin d. black
gonzalez  gonzalez 12323902130
rodriguez rodriguez 0931029321

I tried something like this and got an error message:

df$match <- ifelse(df1$a %in% df2$b, df1$a, NA)
Error in `$<-.data.frame`(`*tmp*`, match, value = c(NA, NA, NA, NA, NA : 
  replacement has 5 rows, data has 9

CodePudding user response:

An alternative using regex partial matching:

lk <- sapply(a, grepl, x = b)
cbind(b, apply(lk, 1, function(i) names(which(i))))

 [1,] "harry smith"          "smith"    
 [2,] "john smith"           "smith"    
 [3,] "laura smith"          "smith"    
 [4,] "carol black"          "black"    
 [5,] "peter h. black"       "black"    
 [6,] "cora lee"             "lee"      
 [7,] "benjamin d. black"    "black"    
 [8,] "gonzalez 12323902130" "gonzalez" 
 [9,] "rodriguez 0931029321" "rodriguez"
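
One caveat with this approach: grepl matches substrings, so a short name like "lee" would also match inside an unrelated word, and apply returns a list rather than a vector if any row has zero or several matches. Here is a minimal sketch of a stricter variant, assuming names should match as whole words and unmatched rows should become NA. The test string "kathleen jones" and the helper name first_match are made up for illustration and are not part of the question.

```r
a <- c("smith", "lee", "black", "gonzalez", "rodriguez")
b <- c("harry smith", "cora lee", "kathleen jones")  # last entry is a hypothetical non-match

# Wrap each name in word boundaries so "lee" does not match inside "kathleen".
patterns <- paste0("\\b", a, "\\b")

# Logical matrix: one row per element of b, one column per name.
lk <- sapply(patterns, grepl, x = b)
colnames(lk) <- a

# Hypothetical helper: first matching name in a row, or NA when nothing matches.
first_match <- function(i) if (any(i)) a[which(i)[1]] else NA_character_
matched <- apply(lk, 1, first_match)

data.frame(match = matched, b = b)
```

Taking only the first hit is a design choice; if a string can legitimately contain several names, keeping the list that apply returns may be more appropriate.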

CodePudding user response:

df <- data.frame(match = sapply(strsplit(df2$b, " "), function(x) x[x %in% df1$a]),
                 b = df2$b)

df

      match                    b
1     smith          harry smith
2     smith           john smith
3     smith          laura smith
4     black          carol black
5     black       peter h. black
6       lee             cora lee
7     black    benjamin d. black
8  gonzalez gonzalez 12323902130
9 rodriguez rodriguez 0931029321
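
Note that this relies on every row containing exactly one name from df1$a: a row with no match yields character(0), sapply then returns a list, and data.frame() fails. Here is a minimal sketch of a more defensive variant, assuming unmatched rows should become NA and only the first hit is wanted. The row "jane doe" and the names tokens and match_col are made up for illustration.

```r
df1 <- data.frame(a = c("smith", "lee", "black", "gonzalez", "rodriguez"))
df2 <- data.frame(b = c("harry smith", "peter h. black", "jane doe"))  # "jane doe" is a hypothetical non-match

# as.character() guards against factor columns (the default before R 4.0.0).
tokens <- strsplit(as.character(df2$b), " ", fixed = TRUE)

# vapply() enforces exactly one string per row; unmatched rows become NA.
match_col <- vapply(tokens, function(x) {
  hit <- x[x %in% df1$a]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

df <- data.frame(match = match_col, b = df2$b)
df
```

Because the strings are split on spaces and compared with %in%, this variant only matches whole words, so "lee" would not be picked up inside a longer word.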