I have two factor variables (T2ENNAT, P2ANYLNG), each with the two levels 0 = NO Multilingual and 1 = Multilingual. Both have several missing values.
Now I want to create a new factor variable that combines the two with the following conditions:
- If one of the two variables is 1 and the other is either 0 or missing -> new variable should be 1.
- If both variables are 0 -> new variable should be 0
- If one of the two variables is 0 and the other is missing -> new variable should be 0
- If both are missing -> new variable should be NA (missing)
I started step by step and tried the following code:
data$T2Multi = with(data,
  ifelse(T2ENNAT == "Multilingual" & P2ANYLNG == "Multilingual", 1,
  ifelse(T2ENNAT == "NO Multilingual" & P2ANYLNG == "NO Multilingual", 0,
  ifelse(T2ENNAT == "Multilingual" & P2ANYLNG == "NO Multilingual", 1,
  ifelse(T2ENNAT == "NO Multilingual" & P2ANYLNG == "Multilingual", 1,
  ifelse(is.na(T2ENNAT) & P2ANYLNG == "Multilingual", 1,
         NA))))))
The first 4 conditions work. However, the last one does not: R assigns NA to the new variable if T2ENNAT is missing and P2ANYLNG = 1 (Multilingual).
I do not understand the problem with this line. I think the is.na(variable) function somehow does not work. Do you know how to address this problem?
CodePudding user response:
Here is a vectorized way.
- data == "Multilingual" returns a logical matrix that is TRUE where the data entries are "Multilingual", FALSE where they are "No Multilingual", and NA where they are missing;
- the matrix row values are added (ignoring the NA entries via na.rm = TRUE) and if the sums are equal to or greater than 1, there's at least one "Multilingual" and the new column is a 1;
- if the row sums of the logical matrix is.na(data[1:2]) are equal to 2, then all values in that row are missing and the new column entry is NA.
Two base R code lines will solve the problem.
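# 1 if at least one column is "Multilingual", 0 otherwise (as.integer() matches the 0/1 output below)
data$T2Multi <- as.integer(rowSums(data == "Multilingual", na.rm = TRUE) >= 1L)
# set the result to NA where both source columns are missing
is.na(data$T2Multi) <- rowSums(is.na(data[1:2])) == 2L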
data
#> T2ENNAT P2ANYLNG T2Multi
#> 1 Multilingual Multilingual 1
#> 2 Multilingual <NA> 1
#> 3 <NA> No Multilingual 0
#> 4 <NA> No Multilingual 0
#> 5 No Multilingual No Multilingual 0
#> 6 Multilingual Multilingual 1
#> 7 No Multilingual Multilingual 1
#> 8 <NA> <NA> NA
#> 9 No Multilingual No Multilingual 0
#> 10 Multilingual Multilingual 1
#> 11 <NA> No Multilingual 0
#> 12 No Multilingual Multilingual 1
#> 13 <NA> <NA> NA
#> 14 Multilingual No Multilingual 1
#> 15 Multilingual No Multilingual 1
#> 16 No Multilingual <NA> 0
#> 17 Multilingual Multilingual 1
#> 18 Multilingual Multilingual 1
#> 19 Multilingual Multilingual 1
#> 20 Multilingual <NA> 1
Created on 2022-03-21 by the reprex package (v2.0.1)
Test data set
set.seed(2022)
n <- 20
data <- data.frame(
  T2ENNAT = factor(rbinom(n, 1, 0.5), labels = c("No Multilingual", "Multilingual")),
  P2ANYLNG = factor(rbinom(n, 1, 0.5), labels = c("No Multilingual", "Multilingual"))
)
# set a random quarter of each column to NA
data[] <- lapply(data, \(x){
  is.na(x) <- sample(n, n/4)
  x
})
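As for why the nested ifelse() in the question never reaches its is.na() branch: in R, NA & TRUE evaluates to NA, and ifelse() returns NA wherever its test is NA, so the outermost call already yields NA before the later branches are consulted. The following is a minimal sketch of that behaviour plus one possible NA-safe rewrite using %in% (which never returns NA). The toy vectors are hypothetical and use the labels from the test data above, which may differ from the labels in the real data.

```r
# NA propagates through `&` unless the other operand is FALSE
NA & TRUE          # NA
NA & FALSE         # FALSE
ifelse(NA, 1, 0)   # NA: a missing test gives a missing result

# Hypothetical toy vectors (labels assumed from the test data set)
lv <- c("No Multilingual", "Multilingual")
T2ENNAT  <- factor(c(NA, NA, "No Multilingual", "Multilingual"), levels = lv)
P2ANYLNG <- factor(c("Multilingual", NA, NA, "No Multilingual"), levels = lv)

# First test of the nested ifelse(): NA in row 1, so that row becomes NA at once
first_test <- T2ENNAT == "Multilingual" & P2ANYLNG == "Multilingual"

# NA-safe alternative: %in% treats a missing value as "no match" (FALSE)
multi   <- T2ENNAT %in% "Multilingual" | P2ANYLNG %in% "Multilingual"
both_na <- is.na(T2ENNAT) & is.na(P2ANYLNG)
T2Multi <- ifelse(both_na, NA, as.integer(multi))
```

Here row 1 (missing, "Multilingual") gets 1, row 2 (both missing) stays NA, and row 3 ("No Multilingual", missing) gets 0, as requested in the question.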
CodePudding user response:
A tidyverse solution to the problem:
library(tidyverse)
# Data set of possible cases
d <- crossing(x = c(0:1, NA),
              y = x)
d |>
  mutate(z = case_when(
    x | y    ~ 1L,  # at least one value is 1
    !(x & y) ~ 0L   # no 1 and at least one 0; NA/NA matches neither branch and stays NA
  ))
#> # A tibble: 9 × 3
#> x y z
#> <int> <int> <int>
#> 1 0 0 0
#> 2 0 1 1
#> 3 0 NA 0
#> 4 1 0 1
#> 5 1 1 1
#> 6 1 NA 1
#> 7 NA 0 0
#> 8 NA 1 1
#> 9 NA NA NA
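The table above works on 0/1 integers. To apply the same idea to the factor columns from the question, one option is to convert each factor to a logical first. This is only a sketch: the toy data frame is hypothetical, the labels are assumed to match the first answer's test data, and x and y are helper columns introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data (labels assumed: "No Multilingual" / "Multilingual")
lv <- c("No Multilingual", "Multilingual")
data <- data.frame(
  T2ENNAT  = factor(c("Multilingual", NA, "No Multilingual", NA), levels = lv),
  P2ANYLNG = factor(c(NA, "No Multilingual", "Multilingual", NA), levels = lv)
)

data <- data |>
  mutate(
    # logical helpers: TRUE / FALSE / NA
    x = T2ENNAT == "Multilingual",
    y = P2ANYLNG == "Multilingual",
    T2Multi = case_when(
      x | y    ~ 1L,  # at least one "Multilingual"
      !(x & y) ~ 0L   # no "Multilingual" and at least one "No Multilingual"
    )                 # both missing: no branch matches, result is NA
  ) |>
  select(-x, -y)
```

If a factor is needed rather than an integer, the result could be wrapped in factor() afterwards.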