Conditionally join from multiple tables based on a column?


What I'm trying to accomplish is basically the equivalent of a VLOOKUP in Excel wrapped in an IF statement that determines which table to look up from, based on the value of a given column.

Main dataset looks like this:

#STATE  CODE  AMOUNT
#  NJ    1     88
#  DE    2     75
#  VA    1     24
#  PA    1     32

Then there are a handful of lookup tables I need to use to add the FACTOR column, depending on the state - a few states have their own table, and all the rest share a common one. For example (the actual tables are much longer than this):

NJ:

#CODE FACTOR
# 1    0.75
# 2    0.90

PA:

#CODE FACTOR
# 1    0.80
# 2    0.95

All Other:

#CODE FACTOR
# 1    0.82
# 2    0.93

So the final output would be:

#STATE  CODE  AMOUNT  FACTOR
#  NJ    1     88      0.75
#  DE    2     75      0.93   
#  VA    1     24      0.82
#  PA    1     32      0.80

Is there a way to conditionally join/lookup from the various factor tables depending on the value of State, in this example? Or would I need to combine the factor tables into a single table and explicitly list every state/factor combination and then join based on both State and Code? Thanks for your help.

CodePudding user response:

Rather than keeping a separate dataframe for each state, it is easier to combine all of the lookup data into a single dataframe.

Here is one approach:

library(dplyr)

# Stack the state-specific tables and the fallback table into one lookup
# table; lst() keeps the object names, which .id turns into a STATE column.
combined_states <- bind_rows(lst(NJ, PA, other), .id = "STATE")

main %>%
  # States that don't have their own table are pointed at the 'other' lookup
  mutate(STATE_temp = replace(STATE, 
                      !STATE %in% unique(combined_states$STATE), 'other')) %>%
  left_join(combined_states, by = c('STATE_temp' = 'STATE', 'CODE')) %>%
  select(-STATE_temp)

#  STATE  CODE AMOUNT FACTOR
#  <chr> <dbl>  <dbl>  <dbl>
#1 NJ        1     88   0.75
#2 DE        2     75   0.93
#3 VA        1     24   0.82
#4 PA        1     32   0.8 

Note that the name of the fallback dataframe (other) must match the value that STATE_temp is replaced with; the combined lookup table printed after the data below shows why.

data

main <- tibble(STATE = c('NJ', 'DE', 'VA', 'PA'), 
               CODE = c(1, 2, 1, 1), 
               AMOUNT = c(88, 75, 24, 32))
NJ <- tibble(CODE = c(1, 2), FACTOR = c(0.75, 0.9))
PA <- tibble(CODE = c(1, 2), FACTOR = c(0.8, 0.95))
other <- tibble(CODE = c(1, 2), FACTOR = c(0.82, 0.93))
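
For reference, the combined lookup table that bind_rows() builds from these objects should look roughly like this when printed (the values come straight from the tables above); it shows why the fallback rows need to carry the name 'other' in the STATE column:

combined_states
#  A tibble: 6 × 3
#  STATE  CODE FACTOR
#  <chr> <dbl>  <dbl>
#1 NJ        1   0.75
#2 NJ        2   0.9
#3 PA        1   0.8
#4 PA        2   0.95
#5 other     1   0.82
#6 other     2   0.93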

CodePudding user response:

Make a list of all the data frames in your global environment whose names are state abbreviations, bind them into a single data frame with a STATE column, and join that to the main dataset. Then, for anything left over, we can use rows_patch() to fill in the NA values from the common table.

library(dplyr)
# Grab every object whose name matches a state abbreviation and stack
# them, recording each object's name in a STATE column.
state_tables = ls(pattern = paste(state.abb, collapse = "|")) |>
  mget() |>
  bind_rows(.id = "STATE")

main = main |>
  left_join(state_tables, by = c("STATE", "CODE")) |>
  # States with no dedicated table are still NA; patch them from 'other'
  rows_patch(other, by = "CODE")

main
#  A tibble: 4 × 4
#   STATE  CODE AMOUNT FACTOR
#   <chr> <dbl>  <dbl>  <dbl>
# 1 NJ        1     88   0.75
# 2 DE        2     75   0.93
# 3 VA        1     24   0.82
# 4 PA        1     32   0.8 

(Using the data kindly shared by Ronak in the first answer.)
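
If rows_patch() isn't available (it needs a reasonably recent dplyr), a sketch of the same fallback using a second join plus coalesce() could look like the following. This assumes the original main (before it was overwritten above), the state_tables object built earlier, and the other table from the shared data; result is just an illustrative name:

library(dplyr)

# Alternative fallback without rows_patch(): also join the common table,
# then prefer the state-specific factor wherever one exists.
result <- main |>
  left_join(state_tables, by = c("STATE", "CODE")) |>
  left_join(other, by = "CODE", suffix = c("", "_default")) |>
  mutate(FACTOR = coalesce(FACTOR, FACTOR_default)) |>
  select(-FACTOR_default)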
