Problem: I have 2 tables I'd like to join. However, the column upon which I wish to join the second to the first will vary dependent upon a successful parsing of the 2nd data frame to identify which column and row to join.
Request: I have found a solution to the problem (see below) but it does not seem to me to very computationally efficient. Not a problem for the reproducible example below but potentially less ideal when stepped out to a larger scale problem i.e. ~200,000 rows / observations.
Wondering if anyone might be able to help identify something better - ideally utilising functionality from dplyr.
Reproducible Example:
# Equipment alias table
alias1 <- c('a1a1', 'a2a2', 'a3a3', 'a4a4', 'a5a5', 'a6a6')
alias2 <- c('bc001', 'bc002', 'bc003', 'bc004', 'bc005', 'bc006')
alias3 <- c('e1o1', 'e202', 'e303', 'e404', 'e505', 'e606')
df_alias <- data.frame(alias1, alias2, alias3)
# Attribute table
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
df_att <- data.frame(equip, att1, att2)
Desired Outcome:
I'm looking to achieve the following....
# DESRIED OUTPUT - combining equipment alias table into attribute table based on string match between attibute_equip and any one of columns in equipment alias
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
alias1 <- c('a1a1','a6a6', 'a4a4')
alias2 <- c('bc001','bc006', 'bc004')
alias3 <- c('e1o1','e606', 'e404')
df_att <- data.frame(equip, att1, att2, alias1, alias2, alias3)
Current Solution:
library(dplyr)
left_join(df_att, df_alias, by = character()) %>%
filter(equip == alias1 | equip == alias2 | equip == alias3)
Effective but not exactly elegant as there's a great deal of duplication for ultimately a filter to then be applied to undo that duplication.
CodePudding user response:
An option is to filter
with if_any
and then bind the subset rows with the df_att
library(dplyr)
df_att2 <- df_alias %>%
filter(if_any(everything(), ~ .x %in% df_att$equip)) %>%
arrange(na.omit(unlist(across(everything(), ~ match(df_att$equip, .x))))) %>%
bind_cols(df_att, .)
-checking with OP's expected (changed the object name 'df_att' to 'out' to avoid any confusion)
> all.equal(df_att2, out)
[1] TRUE
CodePudding user response:
I don't know how it compares efficiency-wise, but one idea is to pivot a copy of each alias so that you can left_join
against a single column instead of multiple ones.
library(tidyr)
library(dplyr)
df_alias %>%
mutate(across(everything(), ~., .names = "_{.col}")) %>%
pivot_longer(starts_with('_'), names_to = NULL, values_to = 'equip') %>%
left_join(df_att, .)
#> Joining, by = "equip"
#> equip att1 att2 alias1 alias2 alias3
#> 1 a1a1 a 1 a1a1 bc001 e1o1
#> 2 bc006 b 2 a6a6 bc006 e606
#> 3 e404 c 3 a4a4 bc004 e404