I want merge 2 dataframes based on a shared pattern.
The pattern is the ID name (here in bold): ID=HAND2;ACS=20 as "ID=(. );ACS"
If the ID is a match in both dataframes, then combine the respective rows!
DF1 DF2 MERGED ( DF2 DF1 )
col1 col2 col1 col2 col1 col2 col3 col4
HAND2 H2 OFS ID=GATA5;ACS=45 OFS ID=GATA5;ACS=45
HAND6 H6 FAM ID=HAND2;ACS=20 FAM ID=HAND2;ACS=20 HAND2 H2
In this example (HAND2) ID is matched, then, DF1 and DF2 matched rows are combined/merged.
Script tried
MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(. );ACS"), all = TRUE)[-1]
error
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I am struggling in finding a similar command, where in alternative to column-names, I can instead match dataframes rows by a shared pattern.
Thank you in advance for your help.
CodePudding user response:
You may try fuzzyjoin
. In the match_fun
argument you can define a function for your specific needs.
In your case gsub
is extracting the pattern of the DF2 col2 variable. And with str_detect
the extraction is compared to the col1 column of DF1.
Data
DF1 <- read.table(text = "col1 col2
HAND2 H2
HAND6 H6", header = T)
DF2 <- read.table(text = "col1 col2
OFS ID=GATA5;ACS=45
FAM ID=HAND2;ACS=20", header = T)
Code
library(fuzzyjoin)
library(stringr)
DF2 %>%
fuzzy_left_join(DF1,
by = c("col2"= "col1"),
match_fun = function(x,y) str_detect(y, gsub("ID=(. );(.*)", "\\1", x)) )
Output
col1.x col2.x col1.y col2.y
1 OFS ID=GATA5;ACS=45 <NA> <NA>
2 FAM ID=HAND2;ACS=20 HAND2 H2