Home > OS >  Merge data frames based on custom condition - string comparison
Merge data frames based on custom condition - string comparison

Time:05-31

I'd like to merge rows of two data frames - df1 and df2 using column A:

 #df1   
    A <- c('ab','ab','bc','bc','bc','cd')
    B <- floor(runif(6, min=0, max=10))
    C <- floor(runif(6, min=0, max=10))
    D <- floor(runif(6, min=0, max=10))
    E <- c('a, b, c','a, d, e','a, g, h','d, e, f','a, d, f','f, j')
    df1 <- data.frame(A,B,C,D,E)
    df1
       A B C D        E
    1 ab 5 4 3  a, b, c
    2 ab 9 4 0  a, d, e
    3 bc 4 4 9  a, g, h
    4 bc 5 5 6  d, e, f
    5 bc 1 6 6  a, d, f
    6 cd 1 2 0     f, j

 #df2
   A <- c('ab','bc','cd')
   B <- floor(runif(3, min=0, max=10))
   E <- c('a, d','d, f','n, m')
   df2 <- data.frame(A,B,E)
   df2
   A B    E
   1 ab 4 a, d
   2 bc 7 d, f
   3 cd 1 n, m

I can do simply:

df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)

However there's condition of merging. Namely, I'd like to merge only rows from df2 to df1 when all substrings (column E) from df2 are present in df1, so the output should look like this:

    df3
       A B C D        E  A.y  B.y  E.y
    1 ab 5 4 3  a, b, c  NA   NA   NA
    2 ab 9 4 0 a, d, e,  ab   6    a, d
    3 bc 4 4 9  a, g, h  NA   NA   NA
    4 bc 5 5 6  d, e, f  bc   7    d, f
    5 bc 1 6 6  a, d, f  bc   7    d, f
    6 cd 1 2 0     f, j  NA   NA   NA

I know there's an option using %in% regarding vector comparison. However I have strings, should I first do some strsplit and unlist and then perform the comparison?

CodePudding user response:

This is pretty messy but should do what you're looking for:

First, expand rows for both E values, then group by the key column to check if any values from RHS E are in LHS E. Then filter based on the lookup table.

library(tidyverse)

df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)
      
check_rows <- df3 %>% 
  separate_rows(E.y, sep = ',') %>% 
  separate_rows(E.x, sep = ',') %>% 
  mutate(E.x = trimws(E.x),
         E.y = trimws(E.y)) %>% 
  group_by(A) %>% 
  mutate(check = E.y %in% E.x,
         check = ifelse(any(check == TRUE), TRUE, FALSE)) %>% 
  select(A, check) %>% 
  unique() %>% 
  filter(check == TRUE)


df3 <- df3 %>% 
  filter(A %in% check_rows$A)
  • Related