Each observation in my data may have up to 4 categories entered in a single cell (poor data entry, I guess). How can I one-hot encode this in R? (Maybe the right term is multi-hot encoding, but I'm not sure.) Most of the observations have just one category, but some have 2, some 3 and some 4. Overall, the variable of interest has 7 distinct categories. This is what I did: first I split the column into four columns, with one category in each. Then I used `mutate()` from dplyr and `spread()` from tidyr, but I could only spread the first category ('firstcrop') for one-hot encoding purposes.
I created this data frame to demonstrate my problem:
library(dplyr)
library(tidyr)  # separate() and spread() come from tidyr, not dplyr
bf <- data.frame(crops = c(1,3,345,9562))
bf <- separate(bf, crops, into = c('empty','firstcrop','secondcrop','thirdcrop','fourthcrop'), sep = "", extra = "merge")
bf <- select(bf, -(empty))
The code above successfully separates the categories into four columns. Next I do this to encode them as 1's and 0's. It worked only for the first crop.
cf <- bf %>% mutate(value = 1) %>% spread(firstcrop, value, fill = 0)
But I need a result like this, where each observation can have multiple 1's.
'1' | '2' | '3' | '4' | '5' | '6' | '9' |
---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 1 | 1 | 1 |
How do I go about this?
CodePudding user response:
Here’s a data.table solution in case anyone is interested:
library(data.table)
bf <- data.table(crops = c(1, 3, 345, 9562))
bf[, crops := strsplit(as.character(crops), "")]
cols <- sort(unique(unlist(bf$crops)))
bf[, (cols) := lapply(cols, \(col) sapply(crops, \(row) col %in% row))]
bf
## crops 1 2 3 4 5 6 9
## 1: 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2: 3 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 3: 3,4,5 FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## 4: 9,5,6,2 FALSE TRUE FALSE FALSE TRUE TRUE TRUE
And a base R solution:
bf <- data.frame(crops = c(1, 3, 345, 9562))
bf$crops <- strsplit(as.character(bf$crops), "")
cols <- sort(unique(unlist(bf$crops)))
for (col in cols) {
bf[[col]] <- sapply(bf$crops, \(row) col %in% row)
}
bf
## crops 1 2 3 4 5 6 9
## 1 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 3 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 3 3, 4, 5 FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## 4 9, 5, 6, 2 FALSE TRUE FALSE FALSE TRUE TRUE TRUE
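If 0/1 columns are wanted rather than TRUE/FALSE, as in the table in the question, the same idea can be written so that it builds an integer matrix directly. This is only a minimal sketch under the question's assumptions (every digit is one category); the names `digits`, `onehot` and `result` are made up for illustration.

```r
bf <- data.frame(crops = c(1, 3, 345, 9562))

# Split each number into its single-digit category codes
digits <- strsplit(as.character(bf$crops), "")

# All distinct categories, sorted, become the column names
cols <- sort(unique(unlist(digits)))

# One row per observation: 1 if the category is present, 0 otherwise
onehot <- do.call(rbind, lapply(digits, function(d) as.integer(cols %in% d)))
colnames(onehot) <- cols

# Attach the indicator columns to the original data
result <- cbind(bf, onehot)
result
```

Using `do.call(rbind, ...)` keeps the result a matrix with one row per observation even when only a single category exists, which `sapply()` would simplify to a plain vector.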
CodePudding user response:
Here is a potential tidyverse solution:
library(dplyr)
library(tidyr)
bf <- data.frame(crops = c(1,3,345,9562))
bf %>%
  separate(crops, into = c('empty',
                           'firstcrop',
                           'secondcrop',
                           'thirdcrop',
                           'fourthcrop'),
           sep = "", extra = "merge") %>%
  select(-empty) %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  na.omit() %>%
  pivot_wider(names_from = value,
              values_from = name) %>%
  select(-row) %>%
  # as.integer() turns TRUE/FALSE into the 1/0 values shown in the output
  mutate(across(everything(), ~ as.integer(!is.na(.x)))) %>%
  select(order(colnames(.)))
#> Warning: Expected 5 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
#> # A tibble: 4 × 7
#> `1` `2` `3` `4` `5` `6` `9`
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 0 0 0 0 0 0
#> 2 0 0 1 0 0 0 0
#> 3 0 0 1 1 1 0 0
#> 4 0 1 0 0 1 1 1
Created on 2022-03-14 by the reprex package (v2.0.1)
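A variation that avoids hard-coding the four column names (and the `separate()` warning) is to split the digits into a list column and unnest it before widening. This is a sketch rather than a tested drop-in replacement; the helper columns `row`, `crop` and `value` and the result name `wide` are chosen here for illustration, and `names_sort` assumes a reasonably recent tidyr.

```r
library(dplyr)
library(tidyr)

bf <- data.frame(crops = c(1, 3, 345, 9562))

wide <- bf %>%
  # remember which observation each digit came from
  mutate(row = row_number(),
         crop = strsplit(as.character(crops), "")) %>%
  # one row per (observation, category) pair
  unnest(crop) %>%
  # guard against the same category being entered twice in one cell
  distinct(row, crop) %>%
  mutate(value = 1L) %>%
  # one column per category, 0 where the category is absent
  pivot_wider(names_from = crop, values_from = value,
              values_fill = 0L, names_sort = TRUE) %>%
  select(-row)

wide
```

Because the split happens in a list column, the same code handles cells with any number of categories, not just up to four.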