It is common in surveys to ask a question and then tell participants to "select all that apply". For example, "Which foods do you enjoy eating (Please select all that apply)?" a) Sushi, b) Pasta, c) Hamburger.
Assuming four (N=4) participants answered this question, the data could look like this.
food.df <- data.frame(id = c(1,2,3,4), food.choice = c("1,2", "", "1,2,3", "3"))
What I am trying to do is conditionally separate these into unique columns using a method that is flexible on the number of individuals and the number of food choice attributes (i.e. Sushi, Pasta, Hamburger, ....). The final data would look something like this.
food.final <- data.frame(id= c(1,2,3,4), sushi = c(1,0,1,0), pasta = c(1,0,1,0), hamburger = c(0,0,1,1))
The more advanced version of this would allow for conditional groupings. You can think of this as grouping by food groups, location, etc. Assuming we were grouping by "selected foods that have protein" this could be coded to reflect total choices. This could look something like this.
food.group <- data.frame(id = c(1,2,3,4), protein = c(1,0,2,1), non.protein = c(1,0,1,0))
I have tried to use tidyr::separate, strsplit, and other column splitting functions but cannot seem to get the desired outcome. Appreciate the help on this and hopefully, the answer helps other users of R who do survey work.
CodePudding user response:
We may use fastDummies
library(fastDummies)
library(dplyr)
dummy_cols(food.df, 'food.choice', split = ",",
remove_selected_columns = TRUE) %>%
setNames(c("id", "sushi", "pasta", "hamburger"))
-output
id sushi pasta hamburger
1 1 1 1 0
2 2 0 0 0
3 3 1 1 1
4 4 0 0 1
If the renaming should be automatic, create a named vector and use str_replace
library(stringr)
nm1 <- setNames(c("sushi", "pasta", "hamburger"), 1:3)
dummy_cols(food.df, 'food.choice', split = ",",
remove_selected_columns = TRUE) %>%
rename_with(~ str_replace_all(str_remove(.x, 'food.choice_'), nm1), -id)
id sushi pasta hamburger
1 1 1 1 0
2 2 0 0 0
3 3 1 1 1
4 4 0 0 1
For the second case, we may use str_count
food.df %>%
mutate(protein = str_count(food.choice, '[13]'),
non.protein = str_count(food.choice, '2'), .keep = 'unused')
id protein non.protein
1 1 1 1
2 2 0 0
3 3 2 1
4 4 1 0
CodePudding user response:
You could create or probably have a matrix that allocates the needed information like this foody
.
(foody <- matrix(c('sushi', 'pasta', 'hamburger',
'protein', 'non_protein', 'protein',
'1', '2', '3'), nrow=3, ncol=3,
dimnames=list(NULL, c('food', 'protein', 'id'))))
# food protein id
# [1,] "sushi" "protein" "1"
# [2,] "pasta" "non_protein" "2"
# [3,] "hamburger" "protein" "3"
Then you could easily strsplit
on the commas and match
the IDs with foody
. tabulate
creates a binary matching vector of length nrow(foody)
and in an sapply
we get a matrix mt
.
(mt <- t(sapply(strsplit(food.df$food.choice, ','), \(x) {
tabulate(match(x, foody[, 'id']), nrow(foody))
})))
# [,1] [,2] [,3]
# [1,] 1 1 0
# [2,] 0 0 0
# [3,] 1 1 1
# [4,] 0 0 1
# [5,] 1 0 1
Finally all we need is to create a table
of a factor
of each row with the feature we desire as levels. For convenience we wrap it into a function f
.
f <- \(v) {
r <- apply(mt, 1, \(i) foody[as.logical(i), v])
cbind(food.df[1], t(sapply(r, \(x)
table(factor(x, levels=unique(foody[, v]))))))
}
f('food')
# id sushi pasta hamburger
# 1 1 1 1 0
# 2 2 0 0 0
# 3 3 1 1 1
# 4 4 0 0 1
# 5 5 1 0 1
f('protein')
# id protein non_protein
# 1 1 1 1
# 2 2 0 0
# 3 3 2 1
# 4 4 1 0
# 5 5 2 0
Note that the number strings should be sorted in ascending order, which they probably are anyway.
Data:
food.df <- structure(list(id = 1:5, food.choice = c("1,2", "", "1,2,3",
"3", "1,3")), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))