Conditionally Separate String into Columns-CodePudding

It is common in surveys to ask a question and then tell participants to "select all that apply". For example, "Which foods do you enjoy eating (Please select all that apply)?" a) Sushi, b) Pasta, c) Hamburger.

Assuming four (N=4) participants answered this question, the data could look like this.

food.df <- data.frame(id = c(1,2,3,4), food.choice = c("1,2", "", "1,2,3", "3"))

What I am trying to do is conditionally separate these into unique columns using a method that is flexible on the number of individuals and the number of food choice attributes (i.e. Sushi, Pasta, Hamburger, ....). The final data would look something like this.

food.final <- data.frame(id= c(1,2,3,4), sushi = c(1,0,1,0), pasta = c(1,0,1,0), hamburger = c(0,0,1,1))

The more advanced version of this would allow for conditional groupings. You can think of this as grouping by food groups, location, etc. Assuming we were grouping by "selected foods that have protein" this could be coded to reflect total choices. This could look something like this.

food.group <- data.frame(id = c(1,2,3,4), protein = c(1,0,2,1), non.protein = c(1,0,1,0))

I have tried to use tidyr::separate, strsplit, and other column splitting functions but cannot seem to get the desired outcome. Appreciate the help on this and hopefully, the answer helps other users of R who do survey work.

CodePudding user response：

We may use fastDummies

library(fastDummies)
library(dplyr)
dummy_cols(food.df, 'food.choice', split = ",", 
    remove_selected_columns = TRUE) %>%
    setNames(c("id", "sushi", "pasta", "hamburger"))

-output

   id sushi pasta hamburger
1  1     1     1         0
2  2     0     0         0
3  3     1     1         1
4  4     0     0         1

If the renaming should be automatic, create a named vector and use str_replace

library(stringr)
nm1 <- setNames(c("sushi", "pasta", "hamburger"), 1:3)
 dummy_cols(food.df, 'food.choice', split = ",", 
    remove_selected_columns = TRUE) %>% 
   rename_with(~ str_replace_all(str_remove(.x, 'food.choice_'), nm1), -id)
  id sushi pasta hamburger
1  1     1     1         0
2  2     0     0         0
3  3     1     1         1
4  4     0     0         1

For the second case, we may use str_count

food.df %>%
   mutate(protein = str_count(food.choice, '[13]'), 
    non.protein = str_count(food.choice, '2'), .keep = 'unused')
  id protein non.protein
1  1       1           1
2  2       0           0
3  3       2           1
4  4       1           0

CodePudding user response：

You could create or probably have a matrix that allocates the needed information like this foody.

(foody <- matrix(c('sushi', 'pasta', 'hamburger', 
                  'protein', 'non_protein', 'protein',
                  '1', '2', '3'), nrow=3, ncol=3, 
                dimnames=list(NULL, c('food', 'protein', 'id'))))
#       food        protein       id 
# [1,] "sushi"     "protein"     "1"
# [2,] "pasta"     "non_protein" "2"
# [3,] "hamburger" "protein"     "3"

Then you could easily strsplit on the commas and match the IDs with foody. tabulate creates a binary matching vector of length nrow(foody) and in an sapply we get a matrix mt.

(mt <- t(sapply(strsplit(food.df$food.choice, ','), \(x) {
  tabulate(match(x, foody[, 'id']), nrow(foody))
})))
#      [,1] [,2] [,3]
# [1,]    1    1    0
# [2,]    0    0    0
# [3,]    1    1    1
# [4,]    0    0    1
# [5,]    1    0    1

Finally all we need is to create a table of a factor of each row with the feature we desire as levels. For convenience we wrap it into a function f.

f <- \(v) {
  r <- apply(mt, 1, \(i) foody[as.logical(i), v])
  cbind(food.df[1], t(sapply(r, \(x) 
                             table(factor(x, levels=unique(foody[, v]))))))
}

f('food')
#   id sushi pasta hamburger
# 1  1     1     1         0
# 2  2     0     0         0
# 3  3     1     1         1
# 4  4     0     0         1
# 5  5     1     0         1

f('protein')
#   id protein non_protein
# 1  1       1           1
# 2  2       0           0
# 3  3       2           1
# 4  4       1           0
# 5  5       2           0

Note that the number strings should be sorted in ascending order, which they probably are anyway.

Data:

food.df <- structure(list(id = 1:5, food.choice = c("1,2", "", "1,2,3", 
           "3", "1,3")), class = "data.frame", row.names = c("1", "2", "3", 
           "4", "5"))