Home > Back-end >  Joining tables using variable columns - dplyr, r, join
Joining tables using variable columns - dplyr, r, join

Time:05-12

Problem: I have 2 tables I'd like to join. However, the column upon which I wish to join the second to the first will vary dependent upon a successful parsing of the 2nd data frame to identify which column and row to join.

Request: I have found a solution to the problem (see below) but it does not seem to me to very computationally efficient. Not a problem for the reproducible example below but potentially less ideal when stepped out to a larger scale problem i.e. ~200,000 rows / observations.

Wondering if anyone might be able to help identify something better - ideally utilising functionality from dplyr.

Reproducible Example:

# Equipment alias table
alias1    <- c('a1a1', 'a2a2', 'a3a3', 'a4a4', 'a5a5', 'a6a6')
alias2   <- c('bc001', 'bc002', 'bc003', 'bc004', 'bc005', 'bc006')
alias3   <- c('e1o1', 'e202', 'e303', 'e404', 'e505', 'e606')

df_alias <- data.frame(alias1, alias2, alias3)

# Attribute table
equip    <- c('a1a1','bc006', 'e404')
att1     <- c('a', 'b', 'c')
att2     <- c('1', '2', '3')

df_att   <- data.frame(equip, att1, att2)

Desired Outcome:

I'm looking to achieve the following....

# DESRIED OUTPUT - combining equipment alias table into attribute table based on string match between attibute_equip and any one of columns in equipment alias
equip    <- c('a1a1','bc006', 'e404')
att1     <- c('a', 'b', 'c')
att2     <- c('1', '2', '3')
alias1   <- c('a1a1','a6a6', 'a4a4')
alias2   <- c('bc001','bc006', 'bc004')
alias3   <- c('e1o1','e606', 'e404')

df_att   <- data.frame(equip, att1, att2, alias1, alias2, alias3)

Current Solution:

library(dplyr)  
left_join(df_att, df_alias, by = character()) %>% 
filter(equip == alias1 | equip == alias2 | equip == alias3)

Effective but not exactly elegant as there's a great deal of duplication for ultimately a filter to then be applied to undo that duplication.

CodePudding user response:

An option is to filter with if_any and then bind the subset rows with the df_att

library(dplyr)
df_att2 <- df_alias %>%
    filter(if_any(everything(), ~ .x %in% df_att$equip)) %>% 
    arrange(na.omit(unlist(across(everything(), ~ match(df_att$equip, .x))))) %>% 
    bind_cols(df_att, .)

-checking with OP's expected (changed the object name 'df_att' to 'out' to avoid any confusion)

> all.equal(df_att2, out)
[1] TRUE

CodePudding user response:

I don't know how it compares efficiency-wise, but one idea is to pivot a copy of each alias so that you can left_join against a single column instead of multiple ones.

library(tidyr)
library(dplyr)

df_alias %>% 
  mutate(across(everything(), ~., .names = "_{.col}")) %>% 
  pivot_longer(starts_with('_'), names_to = NULL, values_to = 'equip') %>% 
  left_join(df_att, .)

#> Joining, by = "equip"
#>   equip att1 att2 alias1 alias2 alias3
#> 1  a1a1    a    1   a1a1  bc001   e1o1
#> 2 bc006    b    2   a6a6  bc006   e606
#> 3  e404    c    3   a4a4  bc004   e404
  • Related