Home > Blockchain >  summarize multiple binary variables in a single column
summarize multiple binary variables in a single column

Time:10-28

in a survey I conducted, I asked about the education level of the participants. The results are spread over several columns as binary variables. I would appreciate efficient ways to combine the results into a single variable. The tables below show the current and desired data format.

ID high school college PhD
1 high school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high school -1 -1
ID Educational background
1 high school
2 college
3 PhD
4 high school

CodePudding user response:

To answer your specific question using the tidyverse, creating a test dataset with the code at the end of this post:

library(tidyverse)

df %>% 
  mutate(
    across(-ID, function(x) ifelse(x == "-1", NA, x)),
    EducationalBackground=coalesce(high_school, college, PhD)
  )
  ID high_school college  PhD EducationalBackground
1  1 high_school    <NA> <NA>           high_school
2  2        <NA> college <NA>               college
3  3        <NA>    <NA>  PhD                   PhD
4  4 high_school    <NA> <NA>           high_school

The code works by converting the text values of "-1" in your columns, which I take to be missing value flags, to true missing values. Then I use coalesce to find the first non-missing value in the three columns that contain survey data and place it in the new summary column. This assumes that there will be one and only one non-missing value in each row of the data frame.

That said, my preference would be to avoid the problem by adapting your workflow earlier in the piece to avoid the problem. But you haven't given any details of that, so I can't make any suggestions about how to do that.

Test data

df <- read.table(textConnection("ID     high_school     college     PhD
1   high_school     -1  -1
2   -1  college     -1
3   -1  -1  PhD
4   high_school     -1  -1"), header=TRUE)
  • Related