I'd like to merge rows of two data frames - df1 and df2 using column A
:
#df1
A <- c('ab','ab','bc','bc','bc','cd')
B <- floor(runif(6, min=0, max=10))
C <- floor(runif(6, min=0, max=10))
D <- floor(runif(6, min=0, max=10))
E <- c('a, b, c','a, d, e','a, g, h','d, e, f','a, d, f','f, j')
df1 <- data.frame(A,B,C,D,E)
df1
A B C D E
1 ab 5 4 3 a, b, c
2 ab 9 4 0 a, d, e
3 bc 4 4 9 a, g, h
4 bc 5 5 6 d, e, f
5 bc 1 6 6 a, d, f
6 cd 1 2 0 f, j
#df2
A <- c('ab','bc','cd')
B <- floor(runif(3, min=0, max=10))
E <- c('a, d','d, f','n, m')
df2 <- data.frame(A,B,E)
df2
A B E
1 ab 4 a, d
2 bc 7 d, f
3 cd 1 n, m
I can do simply:
df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)
However there's condition of merging. Namely, I'd like to merge only rows from df2
to df1
when all substrings (column E
) from df2
are present in df1
, so the output should look like this:
df3
A B C D E A.y B.y E.y
1 ab 5 4 3 a, b, c NA NA NA
2 ab 9 4 0 a, d, e, ab 6 a, d
3 bc 4 4 9 a, g, h NA NA NA
4 bc 5 5 6 d, e, f bc 7 d, f
5 bc 1 6 6 a, d, f bc 7 d, f
6 cd 1 2 0 f, j NA NA NA
I know there's an option using %in%
regarding vector comparison. However I have strings, should I first do some strsplit
and unlist
and then perform the comparison?
CodePudding user response:
This is pretty messy but should do what you're looking for:
First, expand rows for both E
values, then group by the key column to check if any values from RHS E
are in LHS E
. Then filter based on the lookup table.
library(tidyverse)
df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)
check_rows <- df3 %>%
separate_rows(E.y, sep = ',') %>%
separate_rows(E.x, sep = ',') %>%
mutate(E.x = trimws(E.x),
E.y = trimws(E.y)) %>%
group_by(A) %>%
mutate(check = E.y %in% E.x,
check = ifelse(any(check == TRUE), TRUE, FALSE)) %>%
select(A, check) %>%
unique() %>%
filter(check == TRUE)
df3 <- df3 %>%
filter(A %in% check_rows$A)