Home > OS >  one hot encoding dirty column in R dplyr
one hot encoding dirty column in R dplyr

Time:07-22

I have a column like so. The column begins and ends with a ',' and each value is separated by ',,'.

col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,

How can i transform this column into the following:

col1_9 col1_101 col1_102 col1_200 col1_201
1      1        0        1        1
0      1        1        0        1
1      1        1        1        1
1      1        2        1        0

CodePudding user response:

An option could be:

  • First remove the "," at begin and end using str_sub from stringr
  • One Hot encode the column using mtabulate and strsplit with sep of ",,"
  • Order the column names based on number
  • Finally, give the columns the "col1_" names using paste0

Which gives this as result:

df <- read.table(text = "col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,", header = TRUE)

library(stringr)
library(qdapTools)
df$col1 <- str_sub(df$col1, 2, -2)
df <- mtabulate(strsplit(df$col1, ",,"))
df <- df[, order(as.numeric(names(df)))]
names(df) <- paste0("col1_", names(df))
df
#>   col1_9 col1_101 col1_102 col1_200 col1_201
#> 1      1        1        0        1        1
#> 2      0        1        1        0        1
#> 3      1        1        1        1        1
#> 4      1        1        2        1        0

Created on 2022-07-21 by the reprex package (v2.0.1)

CodePudding user response:

df%>%
  mutate(rowid = row_number(), value = 1)%>%
  separate_rows(col1)%>%
  filter(nzchar(col1)) %>%
  pivot_wider(rowid, names_from = col1, 
              values_fn = sum, names_prefix = 'col1_', 
              values_fill = 0)

  # A tibble: 4 x 6
  rowid col1_101 col1_9 col1_201 col1_200 col1_102
  <int>    <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
1     1        1      1        1        1        0
2     2        1      0        1        0        1
3     3        1      1        1        1        1
4     4        1      1        0        1        2

in Base R:

a <- setNames(strsplit(trimws(df$col1,white=','), ', '), seq(nrow(df)))
as.data.frame.matrix(t(table(stack(a))))

  101 102 200 201 9
1   1   0   1   1 1
2   1   1   0   1 0
3   1   1   1   1 1
4   1   2   1   0 1
  • Related