Merge dataframe with a key value that is contained within a string in a separate dataframe-CodePudding

employee <- c('John','Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah')
salary <- c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020')

emp <- data.frame(employee, salary)


benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')

ben <- data.frame(benefit, benefit_id)

Above we have to dataframes, one contains names and a unique ID, the other contains a category and a list of unique IDs.

What is the most efficient way to merge the ben dataframe with the emp dataframe such that we get the appropriate benefit assigned to each employee?

CodePudding user response：

tidyverse

library(dplyr)
library(tidyr) # tidyr
ben %>%
  mutate(benefit_id = strsplit(benefit_id, "\\s ")) %>%
  unnest(benefit_id) %>%
  left_join(emp, ., by = c(salary = "benefit_id"))
#    employee salary benefit
# 1      John  VT020  Health
# 2     Peter  VT126  Health
# 3     Peter  VT126   Bonus
# 4      Gynn  VT027    <NA>
# 5     Jolie  VT667  Health
# 6     Jolie  VT667    Time
# 7     Jolie  VT667   Bonus
# 8      Hope  VC120    <NA>
# 9       Sue  VT000   Bonus
# 10     Jane  VA120    <NA>
# 11    Sarah  VA020  Health

Depending on your needs, you may also prefer a different join. For instance, use a full_join if you want all pairings, where NA in employee indicates a benefit sans employee.

FYI: if you are running R before 4.0, then you might have factors in your data. To fix that, just convert the factor columns with as.character first. (This can be determined with sapply(ben, inherits, "factor").)

CodePudding user response：

data.table

library(data.table)
setDT(emp)
ben_long <- setDT(ben)[, list(benefit_id = unlist(strsplit(x = benefit_id, split = " "))), by = benefit]
merge(x = emp, y = ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)


    salary employee benefit
 1:  VA020    Sarah  Health
 2:  VA120     Jane    <NA>
 3:  VC120     Hope    <NA>
 4:  VT000      Sue   Bonus
 5:  VT020     John  Health
 6:  VT027     Gynn    <NA>
 7:  VT126    Peter  Health
 8:  VT126    Peter   Bonus
 9:  VT667    Jolie  Health
10:  VT667    Jolie    Time
11:  VT667    Jolie   Bonus