Home > front end >  Turn colum containing list into dummies
Turn colum containing list into dummies

Time:04-20

I have a dataframe with a list of (space-separated) years that I would like to turn into dummies for each year.

Consider the following toy data:

raw <- data.frame(textcol = c("case1", "case2", "case3"), years=c('1996 1997 1998','1997 1999 2000', '1996 1998 2000'))


  textcol          years
1   case1 1996 1997 1998
2   case2 1997 1999 2000
3   case3 1996 1998 2000

I would now like to transform the data frame into this

  textcol `1996` `1997` `1998` `1999` `2000` 
1   case1      1      1      1      0      0
2   case2      0      1      0      1      1
3   case3      1      0      1      0      1

I tried using separate() and str_split() to no avail. Can someone point me to the right approach?

CodePudding user response:

Use separate_rows to get each year in a separate row and then use table. (Append %>% as.data.frame.matrix to the pipeline if you want it as a data frame.)

library(tidyr)

tab <- raw %>% separate_rows(years) %>% table

giving:

tab
##        years
## textcol 1996 1997 1998 1999 2000
##   case1    1    1    1    0    0
##   case2    0    1    0    1    1
##   case3    1    0    1    0    1

We can display this as a graph:

library(igraph)

set.seed(23)
raw %>% separate_rows(years) %>% graph.data.frame %>% plot

screenshot

CodePudding user response:

Use separate_rows with pivot_wider:

library(tidyverse)
raw %>% 
  separate_rows(years) %>% 
  mutate(value = 1) %>% 
  pivot_wider(textcol, names_from = years, values_from = value, values_fill = 0)

# A tibble: 3 x 6
  textcol `1996` `1997` `1998` `1999` `2000`
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 case1        1      1      1      0      0
2 case2        0      1      0      1      1
3 case3        1      0      1      0      1

CodePudding user response:

You can use strsplit to split into separate years, rep textcol and use table.

. <- strsplit(raw$years, " ", TRUE)
table(rep(raw$textcol, lengths(.)), unlist(.))
#        1996 1997 1998 1999 2000
#  case1    1    1    1    0    0
#  case2    0    1    0    1    1
#  case3    1    0    1    0    1

Or a piped version using stack.

strsplit(raw$years, " ", TRUE) |>
setNames(raw$textcol) |>
stack() |>
rev() |>
table()
#       values
#ind     1996 1997 1998 1999 2000
#  case1    1    1    1    0    0
#  case2    0    1    0    1    1
#  case3    1    0    1    0    1

CodePudding user response:

Another solution with fastDummies::dummyCols and the useful split argument.

fastDummies::dummy_cols(raw, "years", split = " ", remove_selected_columns = T)

  textcol years_1996 years_1997 years_1998 years_2000 years_1999
1   case1          1          1          1          0          0
2   case2          0          1          0          1          1
3   case3          1          0          1          1          0

CodePudding user response:

A data.table option

> library(data.table)

> dcast(setDT(raw)[, strsplit(years, " "), textcol], textcol ~ V1, fun = length)
   textcol 1996 1997 1998 1999 2000
1:   case1    1    1    1    0    0
2:   case2    0    1    0    1    1
3:   case3    1    0    1    0    1
  • Related