Home > Enterprise >  Transforming a one-hot encoded variable to one column
Transforming a one-hot encoded variable to one column

Time:12-22

I have age columns like so that are dummy encoded. How can I transform these columns to one column using dplyr?

Input:

  age_0-10 age_11-20 age_21-30 age_31-40 age_41-50 age_51-60 gender
1 0        1         0         0         0         0         0
2 0        0         1         0         0         0         1
3 0        0         0         1         0         0         0
4 0        1         0         0         0         0         1
5 0        0         0         0         0         1         1

Expected output:

age         gender
1 11-20     0
2 21-30     1
3 31-40     0
4 11-20     1
5 51-60     1

CodePudding user response:

A possible solution, now, thanks to @Adam's comment, with names_prefix:

library(tidyverse)

df <- data.frame(
  check.names = FALSE,
  `age_0-10` = c(0L, 0L, 0L, 0L, 0L),
  `age_11-20` = c(1L, 0L, 0L, 1L, 0L),
  `age_21-30` = c(0L, 1L, 0L, 0L, 0L),
  `age_31-40` = c(0L, 0L, 1L, 0L, 0L),
  `age_41-50` = c(0L, 0L, 0L, 0L, 0L),
  `age_51-60` = c(0L, 0L, 0L, 0L, 1L),
  gender = c(0L, 1L, 0L, 1L, 1L)
)

df %>% 
  pivot_longer(col=starts_with("age"), names_to="age", names_prefix="age_") %>% 
  filter(value==1) %>%
  select(age, gender, -value)

#> # A tibble: 5 × 2
#>   age   gender
#>   <chr>  <int>
#> 1 11-20      0
#> 2 21-30      1
#> 3 31-40      0
#> 4 11-20      1
#> 5 51-60      1

CodePudding user response:

Here is a way in dplyr using c_across().

library(dplyr)
library(stringr)

df %>% 
  rowwise() %>% 
  mutate(age = str_remove(names(.)[which(c_across(starts_with("age")) == 1)], "^age_")) %>% 
  ungroup() %>% 
  select(age, gender)

# # A tibble: 5 x 2
#   age   gender
#   <chr>  <int>
# 1 11-20      0
# 2 21-30      1
# 3 31-40      0
# 4 11-20      1
# 5 51-60      1

CodePudding user response:

Try the base R code below using max.col

cbind(
  age = gsub("^age_", "", head(names(df), -1)[max.col(df[-ncol(df)])]),
  df[ncol(df)]
)

which gives

    age gender
1 11-20      0
2 21-30      1
3 31-40      0
4 11-20      1
5 51-60      1

CodePudding user response:

Here is another tidyverse solution:

library(dplyr)
library(purrr)

df %>%
  mutate(age = pmap_chr(select(cur_data(), !gender), 
                        ~ names(df)[-ncol(df)][as.logical(c(...))])) %>%
  select(age, gender)

        age gender
1 age_11-20      0
2 age_21-30      1
3 age_31-40      0
4 age_11-20      1
5 age_51-60      1
  • Related