I have a data frame with 5 binary variables (TRUE or FALSE, but represented as 0 or 1 for convenience) which can have missing values:
df <- data.frame(a = c(1,0,1,0,0,...),
                 b = c(1,0,NA,0,1,...),
                 c = c(1,0,1,0,NA,...),
                 d = c(0,1,1,NA,NA,...),
                 e = c(0,0,0,1,1,...))
  a  b  c  d e
1 1  1  1  0 0
2 0  0  0  1 0
3 1 NA  1  1 0
4 0  0  0 NA 1
5 0  1 NA NA 1
...
Now I want to make a variable that indicates whether the observation satisfies more than two conditions out of the five, that is, whether the sum of a, b, c, d, and e is greater than 2.
For the first and second rows, the values are obviously TRUE and FALSE, respectively. For the third row, the value should be TRUE, since the sum is greater than 2 regardless of whether b is TRUE or FALSE. For the fourth row, the value should be FALSE, since the sum is less than or equal to 2 regardless of whether d is TRUE or FALSE. For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d. So the desired vector is c(TRUE, FALSE, TRUE, FALSE, NA, ...).
Here is my attempt:
library(dplyr)

df %>%
  # Replace each NA by 0 (lower bound) and by 1 (upper bound)
  mutate(a0 = ifelse(is.na(a), 0, a),
         b0 = ifelse(is.na(b), 0, b),
         c0 = ifelse(is.na(c), 0, c),
         d0 = ifelse(is.na(d), 0, d),
         e0 = ifelse(is.na(e), 0, e),
         a1 = ifelse(is.na(a), 1, a),
         b1 = ifelse(is.na(b), 1, b),
         c1 = ifelse(is.na(c), 1, c),
         d1 = ifelse(is.na(d), 1, d),
         e1 = ifelse(is.na(e), 1, e)
  ) %>%
  # Smallest and largest possible row sums
  mutate(summin = a0 + b0 + c0 + d0 + e0,
         summax = a1 + b1 + c1 + d1 + e1) %>%
  mutate(f = ifelse(summax <= 2,
                    FALSE,
                    ifelse(summin >= 3, TRUE, NA)))
This did work, but I had to create too many redundant variables, and the code would become too long if there were more variables. Is there a better solution?
CodePudding user response:
I would use
library(tidyverse)
want <- if_else(rowSums(df, na.rm = TRUE) > 2, TRUE, FALSE)
If you want to stick to base-R you can use the function ifelse() instead.
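To also return NA for rows whose result depends on the missing values, the lower and upper bounds from the asker's attempt can be taken straight from row sums, with no helper columns. A minimal base-R sketch, assuming only the five rows shown in the question (the elided values are left out); summin and summax reuse the asker's names:

```r
# The five example rows from the question.
df <- data.frame(a = c(1, 0, 1, 0, 0),
                 b = c(1, 0, NA, 0, 1),
                 c = c(1, 0, 1, 0, NA),
                 d = c(0, 1, 1, NA, NA),
                 e = c(0, 0, 0, 1, 1))

summin <- rowSums(df, na.rm = TRUE)    # smallest possible sum: each NA counted as 0
summax <- summin + rowSums(is.na(df))  # largest possible sum: each NA counted as 1

# TRUE or FALSE only when the answer is the same for every way of filling the NAs;
# unname() drops the row names that rowSums() carries along.
f <- unname(ifelse(summin > 2, TRUE,
                   ifelse(summax <= 2, FALSE, NA)))
f
# [1]  TRUE FALSE  TRUE FALSE    NA
```

Because both bounds come from rowSums(), the same lines should work unchanged for any number of 0/1 columns; to use only some columns, pass a subset such as df[, c("a", "b", "c")].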
CodePudding user response:
I am not sure what you mean by "For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d."
But the following results in the vector you wish for:
test <- ifelse(is.na(df$c), NA, ifelse(rowSums(df[, 1:5], na.rm = TRUE) > 2, TRUE, FALSE))
If there is an NA value in column c, an NA value will be inserted in the new vector test. Otherwise, the code tests whether the sum of the first 5 columns is greater than 2: if so, TRUE will be inserted, and FALSE when the sum is less than or equal to 2.
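The test on column c alone reproduces the desired vector for the five example rows, but it may not generalize. A small check, using two hypothetical rows that are not in the original data, compares it with the bound-based rule from the question (smallest and largest possible sum):

```r
# Two hypothetical rows (not from the question) chosen to probe the shortcut.
extra <- data.frame(a = c(1, 1),
                    b = c(1, 1),
                    c = c(0, NA),
                    d = c(NA, 1),
                    e = c(0, 0))

# Shortcut: NA whenever c is missing, otherwise compare the NA-removed sum with 2.
shortcut <- ifelse(is.na(extra$c), NA,
                   rowSums(extra[, 1:5], na.rm = TRUE) > 2)

# Bound-based rule: decide only when the smallest and largest possible sums agree.
summin <- rowSums(extra, na.rm = TRUE)
summax <- summin + rowSums(is.na(extra))
bounds <- ifelse(summin > 2, TRUE, ifelse(summax <= 2, FALSE, NA))

unname(shortcut)  # FALSE NA
unname(bounds)    # NA TRUE
```

In this check the shortcut reports FALSE for a row whose sum could still reach 3, and NA for a row whose sum is already 3, so it seems to fit the example only because of where the NA values happen to fall.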