Breaking a string into multiple dummy variables in R-CodePudding

I have the following table in R

id<-c(1,2,3,4)
medal<-c("2021-2020-2018","NA","2019","2015-2014-2012") 
df<-data.frame(id,medal)

 id          medal
 1      2021-2020-2018
 2             NA
 3            2019
 4      2015-2014-2012

and would like to break the medal column to multiple dummy variables for each id as follows:

 id          medal         2021   2020    2019   2018  2015  2014   2012
 1      2021-2020-2018      1       1       0      1    0     0      0 
 2             NA           0       0       0      0    0     0      0 
 3            2019          0       0       1      0    0     0      0 
 4      2015-2014-2012      0       0       0      0    1     1      1

I would appreciate your help with this.

CodePudding user response：

qdapTools has a function that does exactly this, you just need to split the medal column first:

library(qdapTools)

df <- cbind(df, mtabulate(strsplit(df$medal, "-")))
df[, names(df) != "NA"]

CodePudding user response：

Here is a tidyverse solution: Note that I have changed "NA" in the input data to NA_character_.

The idea is to use tidyr: separate and then to pivot the data using gather and spread. Also I convert id to factor such that ids without medals stay in the output.

library(tidyverse)
id <- c(1,2,3,4)
medal <- c("2021-2020-2018",NA_character_,"2019","2015-2014-2012")
df <- data.frame(id,medal)

df %>%
  separate(medal, sep = "-", into = as.character(1:3), fill = "right") %>%
  gather(dummy, year, -id) %>%
  select(-dummy) %>%
  mutate(val = 1, id = factor(id)) %>% 
  filter(!is.na(year)) %>%
  spread(year, val, fill = 0, drop = FALSE)


  id 2012 2014 2015 2018 2019 2020 2021
1  1    0    0    0    1    0    1    1
2  2    0    0    0    0    0    0    0
3  3    0    0    0    0    1    0    0
4  4    1    1    1    0    0    0    0

CodePudding user response：

Or:

library(tidyverse)

id<-c(1,2,3,4)
medal<-c("2021-2020-2018","NA","2019","2015-2014-2012")
df<-data.frame(id,medal)

tmp <- unlist(str_split(df$medal, "-"))
tmp <- sort(tmp[tmp != "NA"])
tmp <- set_names(tmp, tmp)

df %>%
  bind_cols(map_dfc(tmp, ~as.integer(str_detect(medal, .x))))

#  id          medal 2012 2014 2015 2018 2019 2020 2021
#1  1 2021-2020-2018    0    0    0    1    0    1    1
#2  2             NA    0    0    0    0    0    0    0
#3  3           2019    0    0    0    0    1    0    0
#4  4 2015-2014-2012    1    1    1    0    0    0    0

CodePudding user response：

Or using unnest()

library(tidyverse)
id<-c(1,2,3,4)
medal<-c("2021-2020-2018","NA","2019","2015-2014-2012")
df<-data.frame(id,medal)

df %>% mutate(
  medal_2=str_split(medal, "-")) %>% 
  unnest(medal_2) %>% 
  mutate(value=1) %>% 
  pivot_wider(
    c("id", "medal"), names_from=medal_2, values_from = value
    ) %>% 
  replace(is.na(.), 0)
#> # A tibble: 4 x 10
#>      id medal          `2021` `2020` `2018`  `NA` `2019` `2015` `2014` `2012`
#>   <dbl> <chr>           <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1 2021-2020-2018      1      1      1     0      0      0      0      0
#> 2     2 NA                  0      0      0     1      0      0      0      0
#> 3     3 2019                0      0      0     0      1      0      0      0
#> 4     4 2015-2014-2012      0      0      0     0      0      1      1      1

^{Created on 2021-10-21 by the reprex package (v2.0.0)}