Merging data with partial match-CodePudding

I have two large data frames, and want to merge them based on one of the column. However, some of the cells only have partial match. Please see the example below:

df1 = data.frame(SampleID = c(1:6), Gene = c("ARF5;ARG1","AP3B1","CLDN5","XPO1;STX7","ABCC4","FLOT1"))
df2 = data.frame(Operation = c("Y"), Gene = c("ARG1","CLDN5;STK10","XPO1","PDE5A","ARF5","IPO7","VAPB","ABCC4"))           
#-----------------
SampleID Gene
1        ARF5;ARG1
2        AP3B1
3        CLDN5
4        XPO1;STX7
5        ABCC4
6        FLOT1
#-----------------
Operation Gene
Y         ARG1
Y         CLDN5;STK10
Y         XPO1
Y         PDE5A
Y         ARF5
Y         IPO7
Y         VAPB
Y         ABCC4

Expected Output

#-----------------
    SampleID Gene       Operation
    1        ARF5;ARG1  Y
    2        AP3B1      -
    3        CLDN5      Y
    4        XPO1;STX7  Y
    5        ABCC4      Y
    6        FLOT1      -

You can see that df1$Gene and df2$Gene have partially matched, and I want to add Operation information into df1 whenever there is a match. In the example, the df1 row 1 and row 4 have partially match to the df2 row 1 and row 2. For those has no matches, it can be NA, or whatever. I have thousands of rows for my data frame, so I cannot adjust them one by one.

CodePudding user response：

Using dplyr and fuzzyjoin:

library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
  mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
  fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
  group_by(SampleID) %>%
  summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
#   SampleID Gene      Operation
#      <int> <chr>     <chr>    
# 1        1 ARF5;ARG1 Y        
# 2        2 AP3B1     NA       
# 3        3 CLDN5     Y        
# 4        4 XPO1;STX7 Y        
# 5        5 ABCC4     Y        
# 6        6 FLOT1     NA

The first step converts df2$Gene[2] from CLDN5;STK10 to \\b(CLDN5|STK10)\\b, a pattern that allows a match on any of its ;-delimited values (inferred from your expected output).

Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]). For example, the above might be rewritten as:

df2 %>%
  mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
  fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
  rename(Gene = Gene.x) %>%
  group_by(across(SampleID:Gene)) %>%
  summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
#   SampleID Gene      Operation
#      <int> <chr>     <chr>    
# 1        1 ARF5;ARG1 Y        
# 2        2 AP3B1     NA       
# 3        3 CLDN5     Y        
# 4        4 XPO1;STX7 Y        
# 5        5 ABCC4     Y        
# 6        6 FLOT1     NA

(Renaming from Gene.x to Gene is not necessary but looked nice :-)

This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn use of :-ranges) or not difficult to add individually.