Home > database >  How to replace multiple dummy-coded columns with labels efficiently?
How to replace multiple dummy-coded columns with labels efficiently?

Time:09-23

I'm trying to replace some dummy coded variables with the actual categorical value. I know there's ways to replace this using base R, dplyr, and the *apply or within functions, but my brain just can't figure out which is most efficient without looping it.

Currently, I have 2 data frames. Examples of what they look like below. Note: Instead of 5 variables, there are closer to 10 and this operation needs to be done multiple times so looping or a simple df <- df[df$var == 1] <- "a" won't work.

df1
#   id  v1  v2  v3  v4  v5  
# 1  1   0  1  0  1  0
# 2  3   0  0  0  0  1
# 3  5   0  1  1  0  0
# 4  12  1  0  1  0  0
# 5  26  0  0  0  1  0
...

codes
#   name  label
# 1  v1  "a"
# 2  v2  "b"
# 3  v3  "c"
# 4  v4  "d"
# 5  v5  "e"

The goal is to get the values in codes$label to replace the 1's according to the relevant column name in df1 so the end result is:

df1
#   id  v1  v2  v3  v4  v5  
# 1  1   0  b  0  d  0
# 2  3   0  0  0  0  e
# 3  5   0  b  c  0  0
# 4  12  a  0  c  0  0
# 5  26  0  0  0  d  0

I know there's a way to do it, but I can't find any code snippets online that do it over multiple columns and when referencing another dataframe.

Thank you!

P.S. These are both technically tibbles() formatted with tidyverse

CodePudding user response:

Tidyverse solution. You can:

  • reshape df1 to long form

  • join with codes by name and value (for value = 1)

  • replace missing labels with 0

  • remove unwanted column value

  • reshape back to wide form using the new labels as values

    library(dplyr)
    library(tidyr)
    
    df1 %>% 
      pivot_longer(-id) %>% 
      left_join(mutate(codes, value = 1)) %>% 
      mutate(label = ifelse(is.na(label), 0, label)) %>% 
      select(-value) %>% 
      pivot_wider(names_from = "name", 
                  values_from = "label")
    

Result:

# A tibble: 5 × 6
     id v1    v2    v3    v4    v5   
  <int> <chr> <chr> <chr> <chr> <chr>
1     1 0     b     0     d     0    
2     3 0     0     0     0     e    
3     5 0     b     c     0     0    
4    12 a     0     c     0     0    
5    26 0     0     0     d     0

Data:

df1 <- structure(list(id = c(1L, 3L, 5L, 12L, 26L), 
                      v1 = c(0L, 0L, 0L, 1L, 0L), 
                      v2 = c(1L, 0L, 1L, 0L, 0L), 
                      v3 = c(0L, 0L, 1L, 1L, 0L), 
                      v4 = c(1L, 0L, 0L, 0L, 1L), 
                      v5 = c(0L, 1L, 0L, 0L, 0L)), 
                 class = "data.frame", 
                 row.names = c("1", "2", "3", "4", "5"))

codes <- structure(list(name = c("v1", "v2", "v3", "v4", "v5"), 
                        label = c("a", "b", "c", "d", "e")), 
                        class = "data.frame", 
                        row.names = c("1", "2", "3", "4", "5"))

CodePudding user response:

Wuth dplyr, you could use across cur_column():

library(dplyr)

df1 %>%
  mutate(across(v1:v5, ~ replace(.x, .x == 1, codes$label[codes$name == cur_column()])))

#   id v1 v2 v3 v4 v5
# 1  1  0  b  0  d  0
# 2  3  0  0  0  0  e
# 3  5  0  b  c  0  0
# 4 12  a  0  c  0  0
# 5 26  0  0  0  d  0
  • Related