Home > Enterprise >  R: Splitting multiple-choice-column and creating a matrix
R: Splitting multiple-choice-column and creating a matrix

Time:06-28

I have a multiple-choice-question with seven possible answers, my data looks like this:

Q12
1 Inhalt, Ermöglichen Koalition
2 Inhalt, Ermöglichen Koalition, Verhindern Kanzlerschaft
3 Inhalt
4 Spitzenpolitiker

My goal is to -> seperate the observations and create a binary matrix with seven variables ("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker") akin to this:

Inhalt Ermöglichen Koalition Verhindern Kanzlerschaft Spitzenpolitiker
1 1 1 0 0
2 1 1 1 0
3 1 0 0 0
4 0 0 0 1

I have tried einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%

This gives me the matrix I want but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven

also tried seperate() like this separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ",") %>%

This does separate the observations, but not in the right order and without the matrix.

I also tried to use splitstackshape and the charMat-function, but I cant get that to work at all (judging from the description "Create a Binary Matrix from a List of Character Values") it should do exactly what I want.

Any help would be greatly appreciated as I have been stuck for days now ;)

CodePudding user response:

Another possible solution:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>%
  separate_rows(Q12, sep = ", ") %>%
  count(id, Q12) %>%
  pivot_wider(id, names_from = Q12, values_from = n, values_fill = 0,
     values_fn = length) %>% 
  bind_cols(Q12 = df$Q12, .) %>% 
  relocate(Q12, .after = "id")

#> # A tibble: 4 × 6
#>      id Q12            `Ermöglichen K…` Inhalt `Verhindern Ka…` Spitzenpolitiker
#>   <int> <chr>                     <int>  <int>            <int>            <int>
#> 1     1 Inhalt, Ermög…                1      1                0                0
#> 2     2 Inhalt, Ermög…                1      1                1                0
#> 3     3 Inhalt                        0      1                0                0
#> 4     4 Spitzenpoliti…                0      0                0                1

Or shorter:

library(tidyverse)

bind_rows(apply(df, 1, \(x) strsplit(x, ", ") %>% table), .id = "id") %>% 
  mutate(across(-id, ~ as.integer(.x) %>% pmin(1) %>% replace_na(0)))

#> # A tibble: 4 × 5
#>   id    `Ermöglichen Koalition` Inhalt `Verhindern Kanzlersch…` Spitzenpolitiker
#>   <chr>                   <dbl>  <dbl>                    <dbl>            <dbl>
#> 1 1                           1      1                        0                0
#> 2 2                           1      1                        1                0
#> 3 3                           0      1                        0                0
#> 4 4                           0      0                        0                1

CodePudding user response:

Try this:

## Creating the dataframe
df = data.frame(Options = c("1","2","3","4","5","6","7"),
                Ques12 =c("Alpha, Beta, Gamma", "Lima, Peru", "Beta", "YOLO, Gamma", "Victor", "Beta, Gamma", "YOLO, Peru"))

## Separating by delimiter and inserting new rows
df = df %>%
mutate(Ques12 = strsplit(as.character(Ques12), ",")) %>% 
  unnest(Ques12) 

## Removing quotation marks
df$Ques12 = noquote(df$Ques12)

## Removing all white spaces
df$Ques12 = gsub(" ", "", df$Ques12, fixed = TRUE)

## Longer to wider format
df1 = df %>%
  group_by(Options,Ques12) %>%
  summarise(freq = n()) %>%
  pivot_wider(names_from = Ques12, values_from = freq) %>%
  replace(is.na(.), 0)
  

Output

> df
  Options             Ques12
1       1 Alpha, Beta, Gamma
2       2         Lima, Peru
3       3               Beta
4       4        YOLO, Gamma
5       5             Victor
6       6        Beta, Gamma
7       7         YOLO, Peru
> df1
# A tibble: 7 x 8
# Groups:   Options [7]
  Options Alpha  Beta Gamma  Lima  Peru  YOLO Victor
  <chr>   <int> <int> <int> <int> <int> <int>  <int>
1 1           1     1     1     0     0     0      0
2 2           0     0     0     1     1     0      0
3 3           0     1     0     0     0     0      0
4 4           0     0     1     0     0     1      0
5 5           0     0     0     0     0     0      1
6 6           0     1     1     0     0     0      0
7 7           0     0     0     0     1     1      0

CodePudding user response:

We may use dummy_cols from fastDummies

library(fastDummies)
library(stringr)
library(dplyr)
dummy_cols(df1, "Q12", split=",\\s*") %>%
   setNames(., str_remove(names(.), "Q12_"))

-output

                                                      Q12 Inhalt Ermöglichen Koalition Verhindern Kanzlerschaft Spitzenpolitiker
1                           Inhalt, Ermöglichen Koalition      1                     1                        0                0
2 Inhalt, Ermöglichen Koalition, Verhindern Kanzlerschaft      1                     1                        1                0
3                                                  Inhalt      1                     0                        0                0
4                                        Spitzenpolitiker      0                     0                        0                1

It can also work with multiple columns

df1$Q13 <- df1$Q12
dummy_cols(df1, c("Q12", "Q13"), split=",\\s*", 
  remove_selected_columns = TRUE) %>%
    setNames(., str_remove(names(.), "Q\\d _"))

-output

Inhalt Ermöglichen Koalition Verhindern Kanzlerschaft Spitzenpolitiker Inhalt Ermöglichen Koalition Verhindern Kanzlerschaft
1      1                     1                        0                0      1                     1                        0
2      1                     1                        1                0      1                     1                        1
3      1                     0                        0                0      1                     0                        0
4      0                     0                        0                1      0                     0                        0
  Spitzenpolitiker
1                0
2                0
3                0
4                1

data

df1 <- structure(list(Q12 = c("Inhalt, Ermöglichen Koalition", 
"Inhalt, Ermöglichen Koalition, Verhindern Kanzlerschaft", 
"Inhalt", "Spitzenpolitiker")), class = "data.frame", row.names = c("1", 
"2", "3", "4"))
  •  Tags:  
  • r
  • Related