I have the task to reproduce a process from SAS in R. I have 1 table with 1.4 million rows and 156 columns for each of the last 71 months. In the columns there are only IDs and these are to be replaced by a text.
For this there are 60 lookup tables. Some of them are used multiple times and some only once.
I can't show the real data but here is a small example of what the table looks like.:
df <-tibble(contract_id = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
feature_a = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1),
feature_b = c(3, 2, 1, 3, 2, 1, 3, 2, 1, 3),
feature_c = c(2, 3, 1, 2, 3, 1, 2, 3, 1, 2),
feature_d = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
feature_e = c(2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
feature_f = c(2, 2, 1, 1, 2, 2, 1, 1, 2, 2))
contract_id feature_a feature_b feature_c feature_d feature_e feature_f
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1001 1 3 2 1 2 2
1002 2 2 3 2 1 2
1003 3 1 1 1 2 1
1004 1 3 2 2 1 1
1005 2 2 3 1 2 2
1006 3 1 1 2 1 2
1007 1 3 2 1 2 1
1008 2 2 3 2 1 1
1009 3 1 1 1 2 2
1010 1 3 2 2 1 2
These are 2 of the 60 lookup tables, which are used multiple times, for example lookup_a is used 8 times and lookup_b 15 times:
lookup_a = tibble(id = c(1, 2, 3),
value = c("yes", "no", "yes, mandatory"))
lookup_b = tibble(id = c(1, 2),
value = c("yes", "no"))
This is how the desired result should look (feature_a - c use lookup_a and feature_d - f use lookup b):
df_expected <-tibble(contract_id = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
feature_a = c("yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes"),
feature_b = c("yes, mandatory", "no", "yes", "yes, mandatory", "no", "yes", "yes, mandatory", "no", "yes", "yes, mandatory"),
feature_c = c("no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no", "yes, mandatory", "yes", "no"),
feature_d = c("yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no"),
feature_e = c("no", "yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes"),
feature_f = c("no", "no", "yes", "yes", "no", "no", "yes", "yes", "no", "no"))
contract_id feature_a feature_b feature_c feature_d feature_e feature_f
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1001 yes yes, mandatory no yes no no
1002 no no yes, mandatory no yes no
1003 yes, mandatory yes yes yes no yes
1004 yes yes, mandatory no no yes yes
1005 no no yes, mandatory yes no no
1006 yes, mandatory yes yes no yes no
1007 yes yes, mandatory no yes no yes
1008 no no yes, mandatory no yes yes
1009 yes, mandatory yes yes yes no no
1010 yes yes, mandatory no no yes no
I can of course create a join for each column but this is not really satisfying. I would like to keep the number of joins as small as possible:
df %>%
left_join(lookup_a, by = c("feature_a" = "id")) %>%
select(-feature_a) %>%
rename(feature_a = value)
I have also tried different ways with data.table or match but I have not found a way to join multiple columns at once. I have the problem that all columns are changed and not the selected ones.
Here are my questions:
- is there a way to do a join/match with a lookup table for multiple columns at once (for example the left_join) and using the names of the columns for the rename?
- Or is it possible to replace the values for multiple columns at once?
Maybe I'm thinking too complicated right now and the solution is relatively simple.
Thank you in advance!
CodePudding user response:
Welcome on SO! You can replace the values of multiple columns using across
in mutate
verb using the column index you want to change (2 to 4 for columns a to c, and 5 to 7 for columns d to f):
library(dplyr)
df %>%
mutate(across(2:4,
~case_when(. == 1 ~ "Yes",
. == 2 ~ "No",
. == 3 ~ "Yes, mandatory",
TRUE ~ "Error"))) %>%
mutate(across(5:7,
~case_when(. == 1 ~ "Yes",
. == 2 ~ "No",
TRUE ~ "Error")))
Output:
# A tibble: 10 x 7
contract_id feature_a feature_b feature_c feature_d feature_e feature_f
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1001 Yes Yes, mandatory No Yes No No
2 1002 No No Yes, mandatory No Yes No
3 1003 Yes, mandatory Yes Yes Yes No Yes
4 1004 Yes Yes, mandatory No No Yes Yes
5 1005 No No Yes, mandatory Yes No No
6 1006 Yes, mandatory Yes Yes No Yes No
7 1007 Yes Yes, mandatory No Yes No Yes
8 1008 No No Yes, mandatory No Yes Yes
9 1009 Yes, mandatory Yes Yes Yes No No
10 1010 Yes Yes, mandatory No No Yes No