Merge dataframes rows by shared pattern-CodePudding

I want merge 2 dataframes based on a shared pattern.
The pattern is the ID name (here in bold): ID=HAND2;ACS=20 as "ID=(. );ACS"
If the ID is a match in both dataframes, then combine the respective rows!

DF1                   DF2                  MERGED ( DF2   DF1 )

col1  col2      col1  col2                 col1  col2             col3  col4
HAND2 H2        OFS   ID=GATA5;ACS=45      OFS   ID=GATA5;ACS=45
HAND6 H6        FAM   ID=HAND2;ACS=20      FAM   ID=HAND2;ACS=20  HAND2   H2

In this example (HAND2) ID is matched, then, DF1 and DF2 matched rows are combined/merged.

Script tried

MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(. );ACS"), all = TRUE)[-1]

error

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I am struggling in finding a similar command, where in alternative to column-names, I can instead match dataframes rows by a shared pattern.

Thank you in advance for your help.

CodePudding user response：

You may try fuzzyjoin. In the match_fun argument you can define a function for your specific needs.

In your case gsub is extracting the pattern of the DF2 col2 variable. And with str_detect the extraction is compared to the col1 column of DF1.

Data

DF1 <- read.table(text = "col1  col2 
HAND2 H2  
HAND6 H6", header = T) 

DF2 <- read.table(text = "col1  col2
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20", header = T)

Code

library(fuzzyjoin)
library(stringr)

DF2 %>% 
  fuzzy_left_join(DF1, 
                  by = c("col2"= "col1"),
                  match_fun = function(x,y) str_detect(y, gsub("ID=(. );(.*)", "\\1", x)) )

Output

  col1.x          col2.x col1.y col2.y
1    OFS ID=GATA5;ACS=45   <NA>   <NA>
2    FAM ID=HAND2;ACS=20  HAND2     H2