Home > other >  Join multiple columns with multiple lookup tables
Join multiple columns with multiple lookup tables

Time:12-03

I have the task to reproduce a process from SAS in R. I have 1 table with 1.4 million rows and 156 columns for each of the last 71 months. In the columns there are only IDs and these are to be replaced by a text.

For this there are 60 lookup tables. Some of them are used multiple times and some only once.

I can't show the real data but here is a small example of what the table looks like.:

df <-tibble(contract_id = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
            feature_a = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1),
            feature_b = c(3, 2, 1, 3, 2, 1, 3, 2, 1, 3),
            feature_c = c(2, 3, 1, 2, 3, 1, 2, 3, 1, 2),
            feature_d = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
            feature_e = c(2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
            feature_f = c(2, 2, 1, 1, 2, 2, 1, 1, 2, 2))

   contract_id feature_a feature_b feature_c feature_d feature_e feature_f
         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
         1001         1         3         2         1         2         2
         1002         2         2         3         2         1         2
         1003         3         1         1         1         2         1
         1004         1         3         2         2         1         1
         1005         2         2         3         1         2         2
         1006         3         1         1         2         1         2
         1007         1         3         2         1         2         1
         1008         2         2         3         2         1         1
         1009         3         1         1         1         2         2
         1010         1         3         2         2         1         2

These are 2 of the 60 lookup tables, which are used multiple times, for example lookup_a is used 8 times and lookup_b 15 times:

lookup_a = tibble(id = c(1, 2, 3),
                 value = c("yes", "no", "yes, mandatory"))
                 
lookup_b = tibble(id = c(1, 2),
                  value = c("yes", "no"))

This is how the desired result should look (feature_a - c use lookup_a and feature_d - f use lookup b):

df_expected <-tibble(contract_id = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
                     feature_a = c("yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes"),
            feature_b = c("yes, mandatory", "no", "yes", "yes, mandatory", "no", "yes", "yes, mandatory", "no", "yes", "yes, mandatory"),
            feature_c = c("no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no"),
            feature_d = c("yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no"),
            feature_e = c("no", "yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes"),
            feature_f = c("no", "no", "yes", "yes", "no", "no", "yes", "yes", "no", "no"))

   contract_id feature_a      feature_b      feature_c      feature_d feature_e feature_f
         <dbl> <chr>          <chr>          <chr>          <chr>     <chr>     <chr>    
         1001 yes            yes, mandatory no             yes       no        no       
         1002 no             no             yes, mandatory no        yes       no       
         1003 yes, mandatory yes            yes            yes       no        yes      
         1004 yes            yes, mandatory no             no        yes       yes      
         1005 no             no             yes, mandatory yes       no        no       
         1006 yes, mandatory yes            yes            no        yes       no       
         1007 yes            yes, mandatory no             yes       no        yes      
         1008 no             no             yes, mandatory no        yes       yes      
         1009 yes, mandatory yes            yes            yes       no        no       
         1010 yes            yes, mandatory no             no        yes       no 

I can of course create a join for each column but this is not really satisfying. I would like to keep the number of joins as small as possible:

df %>% 
      left_join(lookup_a, by = c("feature_a" = "id")) %>% 
      select(-feature_a) %>% 
      rename(feature_a = value)

I have also tried different ways with data.table or match but I have not found a way to join multiple columns at once. I have the problem that all columns are changed and not the selected ones.

Here are my questions:

  • is there a way to do a join/match with a lookup table for multiple columns at once (for example the left_join) and using the names of the columns for the rename?
  • Or is it possible to replace the values for multiple columns at once?

Maybe I'm thinking too complicated right now and the solution is relatively simple.

Thank you in advance!

CodePudding user response:

Welcome on SO! You can replace the values of multiple columns using across in mutate verb using the column index you want to change (2 to 4 for columns a to c, and 5 to 7 for columns d to f):

library(dplyr)
df %>% 
  mutate(across(2:4,
         ~case_when(. == 1 ~ "Yes",
                    . == 2 ~ "No",
                    . == 3 ~ "Yes, mandatory",
                    TRUE ~ "Error"))) %>%
  mutate(across(5:7,
                ~case_when(. == 1 ~ "Yes",
                           . == 2 ~ "No",
                           TRUE ~ "Error")))

Output:

# A tibble: 10 x 7
   contract_id feature_a      feature_b      feature_c      feature_d feature_e feature_f
         <dbl> <chr>          <chr>          <chr>          <chr>     <chr>     <chr>    
 1        1001 Yes            Yes, mandatory No             Yes       No        No       
 2        1002 No             No             Yes, mandatory No        Yes       No       
 3        1003 Yes, mandatory Yes            Yes            Yes       No        Yes      
 4        1004 Yes            Yes, mandatory No             No        Yes       Yes      
 5        1005 No             No             Yes, mandatory Yes       No        No       
 6        1006 Yes, mandatory Yes            Yes            No        Yes       No       
 7        1007 Yes            Yes, mandatory No             Yes       No        Yes      
 8        1008 No             No             Yes, mandatory No        Yes       Yes      
 9        1009 Yes, mandatory Yes            Yes            Yes       No        No       
10        1010 Yes            Yes, mandatory No             No        Yes       No   
  • Related