Home > Blockchain >  r - Efficient conditional join on multiple columns
r - Efficient conditional join on multiple columns

Time:03-31

I have two tables that i would like to join using multiple columns, and this is perfectly feasible using the dplyr join functions. The complication comes from the fact that i need to join on multiple columns and the join should be succesful if at least one column join is succesful. To demonstrate my case here is a reproducible example:

df1 <- data.frame(
  A1 = c(1,2,3,4),
  B1 = c(4,5,6,7),
  C1 = c("a", "b", "c", "d")
)

df2 <- data.frame(
  A2 = c(8,"",3,4),
  B2 = c(9,5,"",7),
  C2 = c("aa", "bb", "cc", "dd")
)

I would like to join df1 and df2 on columns A or B, meaning to keep all rows where at least df1$A = df2$A or df1$B = df2$B (please note my real dataset has 6 columns that i would like to use for the joining). The end result for the simplified example should be:

data.frame(
  A1 = c(2,3,4),
  A2 = c("",3,7),
  B1 = c(5,6,7),
  B2 = c(5,"", 7),
  C1 = c("b", "c", "d"),
  C2 = c("bb", "cc", "dd")
)

Many thanks in advance for any recommendations on how this can be done efficiently or if fast is not possible then slow solution can be accepted as well

CodePudding user response:

Not quite sure how to do this using dplyr, but sqldf could help you out:

library(sqldf)
sqldf("SELECT * 
       FROM df1
       JOIN df2 
       ON df1.A1 = df2.A2 
       OR df1.B1 = df2.B2")

You can add additional OR statements after this for more columns.

CodePudding user response:

It seems like this isn't possible with a single call to a dplyr join function.

If you would like to use a dplyr join, here is a hacky workaround I created using a purrr map function to do a separate inner join for each of the conditions in the conditional join. Then bind them together and remove duplicate rows. It can be generalized to more columns by appending to the key1 and key2 vectors.

note: first we need to modify the example data so columns to be joined have the same type. dplyr throws an error if you try to join incompatible column types, in this case integer and character.

df1 <- df1 %>%
  mutate(A1 = as.character(A1), B1 = as.character(B1))

key1 <- c('A1', 'B1')
key2 <- c('A2', 'B2')

map2_dfr(key1, key2, ~ inner_join(df1, df2, by = setNames(.y, .x), keep = TRUE)) %>%
  distinct()

Result:

  A1 B1 C1 A2 B2 C2
1  3  6  c  3    cc
2  4  7  d  4  7 dd
3  2  5  b     5 bb
  • Related