Creating a new conditional factor variable from two binary variables with missing values in R

I have two factor variables (T2ENNAT, P2ANYLNG), each with two levels: 0 = NO Multilingual and 1 = Multilingual. Both have several missing values.

Now I want to create a new factor variable that combines the two under the following conditions:

If one of the two variables is 1 and the other is either 0 or missing -> new variable should be 1.
If both variables are 0 -> new variable should be 0.
If one of the two variables is 0 and the other is missing -> new variable should be 0.
If both are missing -> new variable should be NA (missing).

I started step by step and tried the following code:

data$T2Multi = with(data,  
ifelse(T2ENNAT == "Multilingual" & P2ANYLNG == "Multilingual", 1,
ifelse(T2ENNAT == "NO Multilingual" & P2ANYLNG == "NO Multilingual", 0,
ifelse(T2ENNAT == "Multilingual" & P2ANYLNG == "NO Multilingual", 1,
ifelse(T2ENNAT == "NO Multilingual" & P2ANYLNG == "Multilingual", 1,
ifelse(is.na(T2ENNAT) & P2ANYLNG =="Multilingual",1,
       NA))))))

The first four conditions work. However, the last one does not: R assigns NA to the new variable if T2ENNAT is missing and P2ANYLNG = 1 (Multilingual).

I do not understand the problem with this line. It seems that the is.na(variable) function somehow does not work. Do you know how to address this problem?

CodePudding user response：

Here is a vectorized way.

data == "Multilingual" returns a logical matrix that is TRUE where the data entries are "Multilingual", FALSE where they are "No Multilingual", and NA where they are missing;
the row values of the matrix are summed with na.rm = TRUE, and if a sum is greater than or equal to 1, there is at least one "Multilingual" and the new column is 1;
if the row sums of the logical matrix is.na(data[1:2]) equal 2, both values in that row are missing and the new column entry is NA.

Two base R code lines will solve the problem.

# unary + converts the logical result to 0/1, as shown in the output below
data$T2Multi <- +(rowSums(data == "Multilingual", na.rm = TRUE) >= 1L)
is.na(data$T2Multi) <- rowSums(is.na(data[1:2])) == 2L
data
#>            T2ENNAT        P2ANYLNG T2Multi
#> 1     Multilingual    Multilingual       1
#> 2     Multilingual            <NA>       1
#> 3             <NA> No Multilingual       0
#> 4             <NA> No Multilingual       0
#> 5  No Multilingual No Multilingual       0
#> 6     Multilingual    Multilingual       1
#> 7  No Multilingual    Multilingual       1
#> 8             <NA>            <NA>      NA
#> 9  No Multilingual No Multilingual       0
#> 10    Multilingual    Multilingual       1
#> 11            <NA> No Multilingual       0
#> 12 No Multilingual    Multilingual       1
#> 13            <NA>            <NA>      NA
#> 14    Multilingual No Multilingual       1
#> 15    Multilingual No Multilingual       1
#> 16 No Multilingual            <NA>       0
#> 17    Multilingual    Multilingual       1
#> 18    Multilingual    Multilingual       1
#> 19    Multilingual    Multilingual       1
#> 20    Multilingual            <NA>       1

Created on 2022-03-21 by the reprex package (v2.0.1)
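
As for why the chained ifelse() in the question returns NA: comparing a missing value with a string gives NA, NA & TRUE is still NA, and ifelse() returns NA wherever its test is NA, so the later is.na() branch is never consulted for those rows. The following is only a minimal sketch of that behaviour; the vectors a and b are hypothetical stand-ins for T2ENNAT and P2ANYLNG, and the guarded version using %in% is one possible workaround, not the code from the answer above.

```r
# Hypothetical toy vectors: `a` stands in for T2ENNAT, `b` for P2ANYLNG
lv <- c("No Multilingual", "Multilingual")
a <- factor(c(NA, "Multilingual"), levels = lv)
b <- factor(c("Multilingual", "Multilingual"), levels = lv)

# NA == "Multilingual" is NA, and NA & TRUE is NA as well
first_test <- a == "Multilingual" & b == "Multilingual"

# ifelse() yields NA where the test is NA, so the is.na() branch
# is never reached for the first element
naive <- ifelse(first_test, 1,
         ifelse(is.na(a) & b == "Multilingual", 1, 0))

# %in% never returns NA, so every test is either TRUE or FALSE
guarded <- ifelse(a %in% "Multilingual" | b %in% "Multilingual", 1,
           ifelse(is.na(a) & is.na(b), NA, 0))
```

Here naive is NA for the first element, while guarded is 1 for both elements.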

Test data set

set.seed(2022)
n <- 20
data <- data.frame(
  T2ENNAT = factor(rbinom(n, 1, 0.5), labels = c("No Multilingual", "Multilingual")),
  P2ANYLNG = factor(rbinom(n, 1, 0.5), labels = c("No Multilingual", "Multilingual"))
)
data[] <- lapply(data, \(x){
  is.na(x) <- sample(n, n/4)
  x
})

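
To check the two base R lines against the four rules in the question, one option is to wrap them in a small function and run it on a hand-built data frame that covers all nine combinations. This is only a sketch: the function name combine_multi, its cols argument, and the data frame cases are hypothetical and not part of the original answer.

```r
# Hypothetical helper wrapping the two base R lines; `cols` names the
# two factor columns to combine
combine_multi <- function(df, cols = c("T2ENNAT", "P2ANYLNG")) {
  # 1 if at least one column is "Multilingual", 0 otherwise (NAs are dropped)
  out <- as.integer(rowSums(df[cols] == "Multilingual", na.rm = TRUE) >= 1L)
  # set the result to NA where both columns are missing
  is.na(out) <- rowSums(is.na(df[cols])) == 2L
  out
}

lv <- c("No Multilingual", "Multilingual")

# Every combination of the two levels and NA (3 x 3 = 9 rows)
cases <- data.frame(
  T2ENNAT  = factor(rep(c(lv, NA), each = 3), levels = lv),
  P2ANYLNG = factor(rep(c(lv, NA), times = 3), levels = lv)
)

cases$T2Multi <- combine_multi(cases)
cases
```

Only the row where both columns are NA should end up as NA; a row with one "No Multilingual" and one missing value should be 0.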

CodePudding user response：

A tidyverse solution to the problem:

library(tidyverse)

# Data set of possible cases
d <- crossing(x = c(0:1, NA), 
              y = x) 

d |> 
  mutate(z = case_when(
    x | y ~ 1L,      # at least one of the two values is 1
    !(x & y) ~ 0L    # otherwise 0; if both are NA, no case matches and z is NA
  ))
#> # A tibble: 9 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     0     0     0
#> 2     0     1     1
#> 3     0    NA     0
#> 4     1     0     1
#> 5     1     1     1
#> 6     1    NA     1
#> 7    NA     0     0
#> 8    NA     1     1
#> 9    NA    NA    NA

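
The example above works on 0/1 integers, while the columns in the question are factors with text labels. A possible adaptation, sketched under the assumption that the labels are exactly "No Multilingual" and "Multilingual", is shown below; the data frame toy is hypothetical and only meant to cover a few of the cases.

```r
library(dplyr)

lv <- c("No Multilingual", "Multilingual")

# Hypothetical toy data covering: 1/NA, NA/0, 0/0, NA/NA
toy <- data.frame(
  T2ENNAT  = factor(c("Multilingual", NA, "No Multilingual", NA), levels = lv),
  P2ANYLNG = factor(c(NA, "No Multilingual", "No Multilingual", NA), levels = lv)
)

toy <- toy |>
  mutate(T2Multi = case_when(
    # TRUE | NA is TRUE, so one "Multilingual" is enough
    T2ENNAT == "Multilingual" | P2ANYLNG == "Multilingual" ~ 1L,
    # reached only if neither is "Multilingual"; one observed "No" is enough
    T2ENNAT == "No Multilingual" | P2ANYLNG == "No Multilingual" ~ 0L
    # if both are NA, neither condition is TRUE and the result stays NA
  ))

toy
```

case_when() treats an NA condition as not matched, which is what makes the rows with a single missing value fall through to the next case instead of becoming NA.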