I have two large data frames, and want to merge them based on one of the column. However, some of the cells only have partial match. Please see the example below:
df1 = data.frame(SampleID = c(1:6), Gene = c("ARF5;ARG1","AP3B1","CLDN5","XPO1;STX7","ABCC4","FLOT1"))
df2 = data.frame(Operation = c("Y"), Gene = c("ARG1","CLDN5;STK10","XPO1","PDE5A","ARF5","IPO7","VAPB","ABCC4"))
#-----------------
SampleID Gene
1 ARF5;ARG1
2 AP3B1
3 CLDN5
4 XPO1;STX7
5 ABCC4
6 FLOT1
#-----------------
Operation Gene
Y ARG1
Y CLDN5;STK10
Y XPO1
Y PDE5A
Y ARF5
Y IPO7
Y VAPB
Y ABCC4
Expected Output
#-----------------
SampleID Gene Operation
1 ARF5;ARG1 Y
2 AP3B1 -
3 CLDN5 Y
4 XPO1;STX7 Y
5 ABCC4 Y
6 FLOT1 -
You can see that df1$Gene and df2$Gene have partially matched, and I want to add Operation information into df1 whenever there is a match. In the example, the df1 row 1 and row 4 have partially match to the df2 row 1 and row 2. For those has no matches, it can be NA, or whatever. I have thousands of rows for my data frame, so I cannot adjust them one by one.
CodePudding user response:
Using dplyr
and fuzzyjoin
:
library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
group_by(SampleID) %>%
summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
The first step converts df2$Gene[2]
from CLDN5;STK10
to \\b(CLDN5|STK10)\\b
, a pattern that allows a match on any of its ;
-delimited values (inferred from your expected output).
Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]
). For example, the above might be rewritten as:
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
rename(Gene = Gene.x) %>%
group_by(across(SampleID:Gene)) %>%
summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
(Renaming from Gene.x
to Gene
is not necessary but looked nice :-)
This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn
use of :
-ranges) or not difficult to add individually.