Conditionally join from multiple tables based on a column?-CodePudding

What I'm trying to accomplish is basically the equivalent of a vlookup in Excel with an if statement to determine which table to use based on the value of a given column.

Main dataset looks like this:

#STATE  CODE  AMOUNT
#  NJ    1     88
#  DE    2     75
#  VA    1     24
#  PA    1     32

Then there are a handful of other tables that I need to use for the lookups to add the factor column, depending on the state - some states are unique, and the rest all use a common table. For example (the actual tables are much longer than this):

NJ:

#CODE FACTOR
# 1    0.75
# 2    0.90

PA:

#CODE FACTOR
  1    0.80
  2    0.95

All Other:

#CODE FACTOR
  1    0.82
  2    0.93

So the final output would be:

#STATE  CODE  AMOUNT  FACTOR
#  NJ    1     88      0.75
#  DE    2     75      0.93   
#  VA    1     24      0.82
#  PA    1     32      0.80

Is there a way to conditionally join/lookup from the various factor tables depending on the value of State, in this example? Or would I need to combine the factor tables into a single table and explicitly list every state/factor combination and then join based on both State and Code? Thanks for your help.

CodePudding user response：

Instead of having different dataframes for each state it would be easier if the data is present in one dataframe.

Here is one approach -

library(dplyr)

combined_states <- bind_rows(lst(NJ, PA, other), .id = "STATE") 

main %>%
  mutate(STATE_temp = replace(STATE, 
                      !STATE %in% unique(combined_states$STATE), 'other')) %>%
  left_join(combined_states, by = c('STATE_temp' = 'STATE', 'CODE')) %>%
  select(-STATE_temp)

#  STATE  CODE AMOUNT FACTOR
#  <chr> <dbl>  <dbl>  <dbl>
#1 NJ        1     88   0.75
#2 DE        2     75   0.93
#3 VA        1     24   0.82
#4 PA        1     32   0.8

Note that the name of other dataframe should match with the replaced value for STATE_temp.

data

main <- tibble(STATE = c('NJ', 'DE', 'VA', 'PA'), 
               CODE = c(1, 2, 1, 1), 
               AMOUNT = c(88, 75, 24, 32))
NJ <- tibble(CODE = c(1, 2), FACTOR = c(0.75, 0.9))
PA <- tibble(CODE = c(1, 2), FACTOR = c(0.8, 0.95))
other <- tibble(CODE = c(1, 2), FACTOR = c(0.82, 0.93))

CodePudding user response：

Make a list of all data frames from your global environment with state abbreviation names, bind them into a single data frame with a STATE column, and join to the main dataset. Then for anything left over we can use rows_patch to fill in NA values.

library(dplyr)
state_tables = ls(pattern = paste(state.abb, collapse = "|")) |>
  mget() |>
  bind_rows(.id = "STATE")

main = main |>
  left_join(state_tables, by = c("STATE", "CODE")) |>
  rows_patch(other, by = "CODE")

main
#  A tibble: 4 × 4
#   STATE  CODE AMOUNT FACTOR
#   <chr> <dbl>  <dbl>  <dbl>
# 1 NJ        1     88   0.75
# 2 DE        2     75   0.93
# 3 VA        1     24   0.82
# 4 PA        1     32   0.8

(Using Ronak's kindly shared data)