multiple hot encoding in R dplyr?

Time:03-14

In a situation where each observation may have up to 4 categories entered in a single cell (poor data entry, I guess), how can I one-hot encode this in R? (Maybe the terminology should be multiple hot encoding, but I'm not sure.) Note that most of the observations have just one category, but some have 2, some 3 and some 4. Overall, the variable of interest has 7 distinct categories. This is what I did: first I split the column into four columns, with one category in each column. Then I used mutate (dplyr) and spread (tidyr), but I could only spread the first category ('firstcrop') for one-hot encoding purposes.

I created this data frame to demonstrate my problem:

library(dplyr)
library(tidyr)  # separate() and spread() are tidyr functions, not dplyr
bf <- data.frame(crops = c(1,3,345,9562))
bf <- separate(bf, crops, into = c('empty','firstcrop','secondcrop','thirdcrop','fourthcrop'), sep = "", extra = "merge")
bf <- select(bf, -(empty))

The code above successfully separates the categories into four separate columns. Next I do this to encode them as 1's and 0's. This worked only for the first crop.

cf <- bf %>% mutate(value = 1) %>% spread(firstcrop, value, fill = 0)

But I need a result like this, where each observation can have multiple 1's:

'1' '2' '3' '4' '5' '6' '9'
1 0 0 0 0 0 0
0 0 1 0 0 0 0
0 0 1 1 1 0 0
0 1 0 0 1 1 1

How do I go about this?

CodePudding user response:

Here’s a data.table solution in case anyone is interested:

library(data.table)
bf <- data.table(crops = c(1, 3, 345, 9562))

bf[, crops := strsplit(as.character(crops), "")]
cols <- sort(unique(unlist(bf$crops)))
bf[, (cols) := lapply(cols, \(col) sapply(crops, \(row) col %in% row))]
bf
##      crops     1     2     3     4     5     6     9
## 1:       1  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2:       3 FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## 3:   3,4,5 FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
## 4: 9,5,6,2 FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

And a Base R solution:

bf <- data.frame(crops = c(1, 3, 345, 9562))
bf$crops <- strsplit(as.character(bf$crops), "")
cols <- sort(unique(unlist(bf$crops)))
for (col in cols) {
    bf[[col]] <- sapply(bf$crops, \(row) col %in% row)
}
bf
##        crops     1     2     3     4     5     6     9
## 1          1  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2          3 FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## 3    3, 4, 5 FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
## 4 9, 5, 6, 2 FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
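
If 1's and 0's are wanted instead of TRUE/FALSE, as in the question, the logical indicators can be converted to integers. The following is only a minimal base R sketch on the same example data; the names digits, ind and encoded are illustrative and not part of the answer above.

```r
bf <- data.frame(crops = c(1, 3, 345, 9562))

# Split each entry into its single-digit category codes
digits <- strsplit(as.character(bf$crops), "")
cols <- sort(unique(unlist(digits)))

# Logical matrix: one row per category, one column per observation
ind <- vapply(digits, function(d) cols %in% d, logical(length(cols)))

# Transpose to one row per observation; multiplying by 1L turns TRUE/FALSE into 1/0
encoded <- as.data.frame(t(ind) * 1L)
names(encoded) <- cols
encoded
```

Here encoded has one integer column per category, and the fourth row (9562) carries 1's in columns 2, 5, 6 and 9. This assumes every category is a single digit, as in the example.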

CodePudding user response:

Here is a potential tidyverse solution:

library(dplyr)
library(tidyr)
bf <- data.frame(crops = c(1,3,345,9562))
bf %>%
  separate(crops, into = c('empty',
                           'firstcrop',
                           'secondcrop',
                           'thirdcrop',
                           'fourthcrop'),
           sep = "", extra = "merge") %>%
  select(-(empty))  %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  na.omit() %>%
  pivot_wider(names_from = value,
              values_from = name) %>%
  select(-row) %>%
  mutate(across(everything(), ~ as.integer(!is.na(.x)))) %>%
  select(order(colnames(.)))
#> Warning: Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
#> # A tibble: 4 × 7
#>     `1`   `2`   `3`   `4`   `5`   `6`   `9`
#>   <int> <int> <int> <int> <int> <int> <int>
#> 1     1     0     0     0     0     0     0
#> 2     0     0     1     0     0     0     0
#> 3     0     0     1     1     1     0     0
#> 4     0     1     0     0     1     1     1

Created on 2022-03-14 by the reprex package (v2.0.1)
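
The separate() call above hard-codes four crop columns, which is why it warns about missing pieces. A possible variation, sketched here under the assumption that every category is a single digit, splits each entry into a list column and unnests it, so the number of categories per row does not need to be known in advance. The names row, crop, value and encoded are illustrative.

```r
library(dplyr)
library(tidyr)

bf <- data.frame(crops = c(1, 3, 345, 9562))

encoded <- bf %>%
  mutate(row = row_number(),
         crop = strsplit(as.character(crops), "")) %>%  # list column of digits
  unnest(cols = crop) %>%                               # one row per (observation, category)
  mutate(value = 1L) %>%
  select(row, crop, value) %>%
  pivot_wider(names_from = crop, values_from = value, values_fill = 0L) %>%
  select(-row)

# Put the category columns in sorted order
encoded <- encoded[, sort(names(encoded))]
encoded
```

Because absent categories are filled with 0L during pivot_wider(), no NA handling step is needed afterwards.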
