Home > Blockchain >  How to create a variable with other dataset variables as its levels
How to create a variable with other dataset variables as its levels

Time:04-05

I have a dataset, where a couple of variables are dichotomised as yes/no.

> df[1:20,]
# A tibble: 20 × 2
   black white
   <fct> <fct>
 1 No    Yes  
 2 No    Yes  
 3 No    Yes  
 4 No    Yes  
 5 No    Yes  
 6 No    Yes  
 7 No    Yes  
 8 No    Yes  
 9 No    Yes  
10 No    Yes  
11 No    Yes  
12 No    Yes  
13 No    Yes  
14 No    Yes  
15 No    Yes  
16 Yes   No   
17 No    Yes  
18 No    Yes  
19 No    Yes  
20 Yes   No 

This creates a lot of variables (my real data has more than one option for race) and doesn’t look very neat as it means a lot of variables unnecessarily. I want to create a new variable (e.g 'race'), where the now individual variables 'black', 'white' etc are levels to that variable. Like in this example

> df2[1:20,]
# A tibble: 20 × 1
   race 
   <fct>
 1 White
 2 White
 3 White
 4 White
 5 White
 6 White
 7 White
 8 White
 9 White
10 White
11 White
12 White
13 White
14 White
15 White
16 Black
17 White
18 White
19 White
20 Black

How would I go about this?

CodePudding user response:

To account for multiple races, use apply on the rows (MARGIN = 1), and paste toString the column names with "Yes":

df <- structure(list(asian = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes"), black = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "Yes"), white = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))

data.frame(race = apply(df == "Yes", 1, \(x) toString(colnames(df)[which(x)])))

           race
1         white
2         white
3         white
4         white
5         white
6         white
7         white
8         white
9         white
10        white
11        white
12        white
13        white
14        white
15        white
16        black
17        white
18        white
19        white
20 asian, black

Using max.col (works only for one value per individual):

data.frame(race = colnames(df)[max.col(df == "Yes")])

CodePudding user response:

Here is a solution that works with multiracial cases.

library(tidyverse)

# Sample data with multiracial case
df <- structure(list(respondent = 1:20, asian = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes"), black = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "Yes"), white = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))

df %>%
  select(asian:white) %>%
  `==`("Yes") %>%
  apply(1, 
        \(.row) colnames(.)[.row] %>%
          str_c(collapse = "-")) 
#>  [1] "white"       "white"       "white"       "white"       "white"      
#>  [6] "white"       "white"       "white"       "white"       "white"      
#> [11] "white"       "white"       "white"       "white"       "white"      
#> [16] "black"       "white"       "white"       "white"       "asian-black"

Created on 2022-04-04 by the reprex package (v2.0.1)

CodePudding user response:

Using dplyr (this assumes that in your dataset a person can only be of 1 race):

library(dplyr)

dat <- data.frame(id = 1:2,
                  black = c("No", "Yes"),
                  white = c("Yes", "No"))

dat |> mutate(
        race = case_when(black == "Yes" ~ "black",
                         white == "Yes" ~ "white")
)

Output:

#>   id black white  race
#> 1  1    No   Yes white
#> 2  2   Yes    No black
  • Related