R code to transform several dichotomous variables (coming from a multiple choice question) to a sing-CodePudding

I have a multiple-choice question in a survey that came with several dichotomous variables in my data that I want to transform to a single select one variable (because most of the respondent in the survey didn't select more than one option).

I have variables in a dataframe (df) with the following names var1, var2, …., var195. I wrote the following code that works perfectly:

df<-df %>% mutate(
newvar = case_when(
  var1 == "yes" ~ "option1",
  var2 == "yes" ~ "option2",
  var3 == "yes" ~ " option3",
  var4 == "yes" ~ "option4",
  and so on)
)

However, this is tedious because I should have 195 lines. I tried to write the following function (trying to simplify and speed up my code) but it doesn’t work.

multichoice_to_one<-function(df,pat,charvec){ 
# df is the dataframe with the data to handle 
# pat is the pattern to look into in the names of variables in this case
# "var"
# charvec is the (character) vector of the options
      for (i in 1:length(charvec)){df<-df %>% cat(str_c(pat,i))=="yes" ~ charvec[i]}}

Can anyone helps?

CodePudding user response：

It's usually easiest to work with 'long' data for these types of questions, so pivot_longer is your friend! Here's one solution (that doesn't use a function approach like you had started with).

Basically for each survey response, it looks for the first 'yes' response in your relevant columns.

library(tidyverse)

# Create some dummy data with 4 columns to look in, plus some extra columns
n_responses <- 100
df <- tibble(
  id = 1:n_responses,
  var1 = sample(c("yes", "no"), n_responses, TRUE),
  var2 = sample(c("yes", "no"), n_responses, TRUE),
  var3 = sample(c("yes", "no"), n_responses, TRUE),
  var4 = sample(c("yes", "no"), n_responses, TRUE),
  othervar1 = sample(LETTERS, n_responses, TRUE),
  othervar2 = sample(LETTERS, n_responses, TRUE)
)

df <- df %>%
  # Turn the data from 'wide' to 'long'. Only pivots the 'varX' columns
  pivot_longer(cols = starts_with("var"), names_to = "question", values_to = "response") %>% 
  # Group by the response number (or any other unique ID for each response in your survey)
  group_by(id) %>% 
  # Sets `first_yes` to the question name if:
  #                         the person responded yes, and
  #                                             either it's the first row or we haven't seen any 'yes' responses previously
  mutate(first_yes = ifelse(response == "yes" & (row_number() == 1 | !lag(cumany(response == "yes"))), question, NA)) %>% 
  # Fill in the response name for all rows
  mutate(first_yes = max(first_yes, na.rm = TRUE)) %>% 
  ungroup() %>% 
  # Put the data back into the old format
  pivot_wider(names_from = question, values_from = response)

You will get an error message printed for any row that doesn't have any 'yes' in the relevant columns. Those rows will have NA in the indicator column.

CodePudding user response：

Assuming a data frame like this:

# A tibble: 15 × 10
   V1    V2    V3    V4    V5    V6    V7    V8    V9    V10  
   <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE TRUE 
 2 FALSE TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE 
 3 FALSE TRUE  TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE
 4 FALSE TRUE  TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  TRUE 
 5 FALSE TRUE  TRUE  TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE 
 6 FALSE TRUE  FALSE TRUE  TRUE  TRUE  TRUE  FALSE TRUE  TRUE 
 7 TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  TRUE  FALSE TRUE 
 8 TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  TRUE 
 9 FALSE FALSE FALSE TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE
10 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE  TRUE  FALSE
11 TRUE  FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE
12 TRUE  FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE  TRUE 
13 TRUE  FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE  TRUE 
14 TRUE  TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE FALSE TRUE 
15 FALSE FALSE TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE TRUE

You can simply use:

df %>%
    mutate(group = apply(., 1, function(row, ...){ first(which(row == TRUE)) }) %>% factor())

# A tibble: 15 × 11
   V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   group
   <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <fct>
 1 FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  6    
 2 FALSE TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  2    
 3 FALSE TRUE  TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE 2    
 4 FALSE TRUE  TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  TRUE  2    
 5 FALSE TRUE  TRUE  TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  2    
 6 FALSE TRUE  FALSE TRUE  TRUE  TRUE  TRUE  FALSE TRUE  TRUE  2    
 7 TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  TRUE  FALSE TRUE  1    
 8 TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE TRUE  TRUE  TRUE  1    
 9 FALSE FALSE FALSE TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE 4    
10 TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE TRUE  TRUE  FALSE 1    
11 TRUE  FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE 1    
12 TRUE  FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE  TRUE  1    
13 TRUE  FALSE FALSE TRUE  TRUE  FALSE FALSE TRUE  TRUE  TRUE  1    
14 TRUE  TRUE  FALSE TRUE  TRUE  FALSE TRUE  FALSE FALSE TRUE  1    
15 FALSE FALSE TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  3

I constructed the above example frame using:

df = replicate(n=p, sample(c(TRUE, FALSE), size=n, replace=TRUE)) %>%
    as_tibble()