How to recode a categorical variable with multiple response combinations in R?

I am new to Stack Overflow and couldn't find an answer to my question. I would really appreciate any input!

I have a dataset that I am tidying. I want to recode a variable with 202 columns into a table with binary values. It is from a "check all that apply" survey. The output from the survey looks like this:

Participant          Language
 1                   'English|French'
 2                   'English'
 3                   'Spanish|French'
 4                   'English|Spanish'
 5                   'French'

Each response lists the selected languages separated by '|', so I can't tabulate the column directly. I would like the result to look like this:

Participant	English	French	Spanish
1	1	1	0
2	1	0	0
3	0	1	1
4	1	0	1
5	0	1	0

I'm not sure how to do this without using ifelse and writing "or" conditions for each possible combination of languages. I would really appreciate any tips!

Note: the actual dataset is not about languages, but the format is the same. There are far more than 3 choices, so I am hoping to find an efficient way to do this.

CodePudding user response：

With the tidyverse you could try the following. separate_rows splits each response into one row per language. Then add a temporary column set to 1 to mark that the language is present for the participant. Finally, pivot_wider spreads the languages into columns and fills the missing combinations with 0, which gives the desired format.

library(tidyverse)

df %>%
  # split on the literal "|" only, so labels containing spaces stay intact
  separate_rows(Language, sep = "\\|") %>%
  mutate(Present = 1) %>%
  pivot_wider(id_cols = Participant, 
              names_from = Language, 
              values_from = Present,
              values_fill = 0)

Output

  Participant English French Spanish
        <int>   <dbl>  <dbl>   <dbl>
1           1       1      1       0
2           2       1      0       0
3           3       0      1       1
4           4       1      0       1
5           5       0      1       0
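
To make the three steps concrete, here is a minimal sketch of the intermediate long format that separate_rows produces. It rebuilds the question's example as a data frame called df (the name is an assumption) and passes sep explicitly; as far as I know, the default separator splits on any run of non-alphanumeric characters, so a label containing a space would otherwise be broken apart.

```r
library(tidyr)

# Example data rebuilt from the question; `df` is an assumed name
df <- data.frame(
  Participant = 1:5,
  Language = c("English|French", "English", "Spanish|French",
               "English|Spanish", "French"),
  stringsAsFactors = FALSE
)

# One row per participant/language pair; split only on the literal "|"
long <- separate_rows(df, Language, sep = "\\|")

nrow(long)
#> [1] 8
```

Each participant now appears once per selected language, which is the shape pivot_wider needs in order to spread the languages into columns.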

CodePudding user response：

You will need to define the columns you want first. You can either do this manually:

cols <- c("English", "French", "Spanish")

Or derive them from the data:

cols <- unique(unlist(strsplit(df$Language, "\\|")))

cols
#> [1] "English" "French"  "Spanish"

In either case, your result can be obtained like this:

cbind(df[1], setNames(as.data.frame(lapply(cols, function(x) {
  as.numeric(grepl(x, df$Language))
})), cols))
#>   Participant English French Spanish
#> 1           1       1      1       0
#> 2           2       1      0       0
#> 3           3       0      1       1
#> 4           4       1      0       1
#> 5           5       0      1       0

Created on 2022-03-13 by the reprex package (v2.0.1)
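
One caveat with pattern matching: grepl treats each option name as a regular expression and also matches substrings, so an option whose name is contained in another option would be flagged for both. If the real survey has overlapping labels, a sketch along these lines, which splits each answer and tests exact membership, may be safer. The data frame survey and the labels "Art" and "Art History" are made up for illustration.

```r
# Hypothetical data: "Art" is a substring of "Art History"
survey <- data.frame(
  Participant = 1:3,
  Choice = c("Art History|Music", "Art", "Music"),
  stringsAsFactors = FALSE
)

# Split each answer on the literal "|" and collect the distinct options
parts <- strsplit(survey$Choice, "|", fixed = TRUE)
cols <- unique(unlist(parts))

# One 0/1 column per option, using exact membership instead of grepl()
flags <- lapply(cols, function(x) {
  as.numeric(vapply(parts, function(p) x %in% p, logical(1)))
})
result <- cbind(survey[1], setNames(as.data.frame(flags), cols))

result[["Art"]]
#> [1] 0 1 0
```

With grepl("Art", survey$Choice) the first participant would also be counted under "Art", which is not what they selected.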

Data

df <- structure(list(Participant = 1:5, Language = c("English|French", 
"English", "Spanish|French", "English|Spanish", "French")), 
class = "data.frame", row.names = c(NA, -5L))

df
#>   Participant        Language
#> 1           1  English|French
#> 2           2         English
#> 3           3  Spanish|French
#> 4           4 English|Spanish
#> 5           5          French