I have a dataframe with a list of (space-separated) years that I would like to turn into dummies for each year.
Consider the following toy data:
raw <- data.frame(textcol = c("case1", "case2", "case3"), years=c('1996 1997 1998','1997 1999 2000', '1996 1998 2000'))
textcol years
1 case1 1996 1997 1998
2 case2 1997 1999 2000
3 case3 1996 1998 2000
I would now like to transform the data frame into this
textcol `1996` `1997` `1998` `1999` `2000`
1 case1 1 1 1 0 0
2 case2 0 1 0 1 1
3 case3 1 0 1 0 1
I tried using separate()
and str_split()
to no avail. Can someone point me to the right approach?
CodePudding user response:
Use separate_rows
to get each year in a separate row and then use table
. (Append %>% as.data.frame.matrix
to the pipeline if you want it as a data frame.)
library(tidyr)
tab <- raw %>% separate_rows(years) %>% table
giving:
tab
## years
## textcol 1996 1997 1998 1999 2000
## case1 1 1 1 0 0
## case2 0 1 0 1 1
## case3 1 0 1 0 1
We can display this as a graph:
library(igraph)
set.seed(23)
raw %>% separate_rows(years) %>% graph.data.frame %>% plot
CodePudding user response:
Use separate_rows
with pivot_wider
:
library(tidyverse)
raw %>%
separate_rows(years) %>%
mutate(value = 1) %>%
pivot_wider(textcol, names_from = years, values_from = value, values_fill = 0)
# A tibble: 3 x 6
textcol `1996` `1997` `1998` `1999` `2000`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 case1 1 1 1 0 0
2 case2 0 1 0 1 1
3 case3 1 0 1 0 1
CodePudding user response:
You can use strsplit
to split into separate years, rep
textcol and use table
.
. <- strsplit(raw$years, " ", TRUE)
table(rep(raw$textcol, lengths(.)), unlist(.))
# 1996 1997 1998 1999 2000
# case1 1 1 1 0 0
# case2 0 1 0 1 1
# case3 1 0 1 0 1
Or a piped version using stack
.
strsplit(raw$years, " ", TRUE) |>
setNames(raw$textcol) |>
stack() |>
rev() |>
table()
# values
#ind 1996 1997 1998 1999 2000
# case1 1 1 1 0 0
# case2 0 1 0 1 1
# case3 1 0 1 0 1
CodePudding user response:
Another solution with fastDummies::dummyCols
and the useful split
argument.
fastDummies::dummy_cols(raw, "years", split = " ", remove_selected_columns = T)
textcol years_1996 years_1997 years_1998 years_2000 years_1999
1 case1 1 1 1 0 0
2 case2 0 1 0 1 1
3 case3 1 0 1 1 0
CodePudding user response:
A data.table
option
> library(data.table)
> dcast(setDT(raw)[, strsplit(years, " "), textcol], textcol ~ V1, fun = length)
textcol 1996 1997 1998 1999 2000
1: case1 1 1 1 0 0
2: case2 0 1 0 1 1
3: case3 1 0 1 0 1