in a survey I conducted, I asked about the education level of the participants. The results are spread over several columns as binary variables. I would appreciate efficient ways to combine the results into a single variable. The tables below show the current and desired data format.
ID | high school | college | PhD |
---|---|---|---|
1 | high school | -1 | -1 |
2 | -1 | college | -1 |
3 | -1 | -1 | PhD |
4 | high school | -1 | -1 |
ID | Educational background |
---|---|
1 | high school |
2 | college |
3 | PhD |
4 | high school |
CodePudding user response:
To answer your specific question using the tidyverse, creating a test dataset with the code at the end of this post:
library(tidyverse)
df %>%
mutate(
across(-ID, function(x) ifelse(x == "-1", NA, x)),
EducationalBackground=coalesce(high_school, college, PhD)
)
ID high_school college PhD EducationalBackground
1 1 high_school <NA> <NA> high_school
2 2 <NA> college <NA> college
3 3 <NA> <NA> PhD PhD
4 4 high_school <NA> <NA> high_school
The code works by converting the text values of "-1"
in your columns, which I take to be missing value flags, to true missing values. Then I use coalesce
to find the first non-missing value in the three columns that contain survey data and place it in the new summary column. This assumes that there will be one and only one non-missing value in each row of the data frame.
That said, my preference would be to avoid the problem by adapting your workflow earlier in the piece to avoid the problem. But you haven't given any details of that, so I can't make any suggestions about how to do that.
Test data
df <- read.table(textConnection("ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1"), header=TRUE)