I have a multiple-choice question in a survey that came with several dichotomous variables in my data that I want to transform to a single select one variable (because most of the respondent in the survey didn't select more than one option).
I have variables in a dataframe (df) with the following names var1, var2, …., var195. I wrote the following code that works perfectly:
df<-df %>% mutate(
newvar = case_when(
var1 == "yes" ~ "option1",
var2 == "yes" ~ "option2",
var3 == "yes" ~ " option3",
var4 == "yes" ~ "option4",
and so on)
)
However, this is tedious because I should have 195 lines. I tried to write the following function (trying to simplify and speed up my code) but it doesn’t work.
multichoice_to_one<-function(df,pat,charvec){
# df is the dataframe with the data to handle
# pat is the pattern to look into in the names of variables in this case
# "var"
# charvec is the (character) vector of the options
for (i in 1:length(charvec)){df<-df %>% cat(str_c(pat,i))=="yes" ~ charvec[i]}}
Can anyone helps?
CodePudding user response:
It's usually easiest to work with 'long' data for these types of questions, so pivot_longer
is your friend! Here's one solution (that doesn't use a function approach like you had started with).
Basically for each survey response, it looks for the first 'yes' response in your relevant columns.
library(tidyverse)
# Create some dummy data with 4 columns to look in, plus some extra columns
n_responses <- 100
df <- tibble(
id = 1:n_responses,
var1 = sample(c("yes", "no"), n_responses, TRUE),
var2 = sample(c("yes", "no"), n_responses, TRUE),
var3 = sample(c("yes", "no"), n_responses, TRUE),
var4 = sample(c("yes", "no"), n_responses, TRUE),
othervar1 = sample(LETTERS, n_responses, TRUE),
othervar2 = sample(LETTERS, n_responses, TRUE)
)
df <- df %>%
# Turn the data from 'wide' to 'long'. Only pivots the 'varX' columns
pivot_longer(cols = starts_with("var"), names_to = "question", values_to = "response") %>%
# Group by the response number (or any other unique ID for each response in your survey)
group_by(id) %>%
# Sets `first_yes` to the question name if:
# the person responded yes, and
# either it's the first row or we haven't seen any 'yes' responses previously
mutate(first_yes = ifelse(response == "yes" & (row_number() == 1 | !lag(cumany(response == "yes"))), question, NA)) %>%
# Fill in the response name for all rows
mutate(first_yes = max(first_yes, na.rm = TRUE)) %>%
ungroup() %>%
# Put the data back into the old format
pivot_wider(names_from = question, values_from = response)
You will get an error message printed for any row that doesn't have any 'yes' in the relevant columns. Those rows will have NA
in the indicator column.
CodePudding user response:
Assuming a data frame like this:
# A tibble: 15 × 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
2 FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
3 FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
4 FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
5 FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
6 FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
7 TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
8 TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
9 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
10 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
11 TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
12 TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
13 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
14 TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
15 FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
You can simply use:
df %>%
mutate(group = apply(., 1, function(row, ...){ first(which(row == TRUE)) }) %>% factor())
# A tibble: 15 × 11
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 group
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <fct>
1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE 6
2 FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE 2
3 FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE 2
4 FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE 2
5 FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE 2
6 FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE 2
7 TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE 1
8 TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE 1
9 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 4
10 TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE 1
11 TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE 1
12 TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE 1
13 TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE 1
14 TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE 1
15 FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE 3
I constructed the above example frame using:
df = replicate(n=p, sample(c(TRUE, FALSE), size=n, replace=TRUE)) %>%
as_tibble()