checking if sum of logical variables is greater than n, with na, in r

Time:01-26

I have a dataframe with 5 binary variables (TRUE or FALSE, but represented as 0 or 1 for convenience) which can have missing values:

df <- data.frame(a = c(1,0,1,0,0,...),
                 b = c(1,0,NA,0,1,...),
                 c = c(1,0,1,0,NA,...),
                 d = c(0,1,1,NA,NA,...),
                 e = c(0,0,0,1,1,...))
     a  b  c  d  e
 1   1  1  1  0  0
 2   0  0  0  1  0
 3   1 NA  1  1  0
 4   0  0  0 NA  1
 5   0  1 NA NA  1
...

Now I want to make a variable that indicates whether the observation satisfies more than two conditions out of the five, that is, whether the sum of a, b, c, d, and e is greater than 2.

For the first and second rows, the values are obviously TRUE and FALSE respectively. For the third row, the value should be TRUE, since the sum is greater than 2 regardless of whether b is TRUE or FALSE. For the fourth row, the value should be FALSE, since the sum is less than or equal to 2 regardless of whether d is TRUE or FALSE. For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d. So the desired vector is c(TRUE, FALSE, TRUE, FALSE, NA, ...).

Here is my attempt:

library(dplyr)

df %>%
  mutate(a0 = ifelse(is.na(a), 0, a),
         b0 = ifelse(is.na(b), 0, b),
         c0 = ifelse(is.na(c), 0, c),
         d0 = ifelse(is.na(d), 0, d),
         e0 = ifelse(is.na(e), 0, e),
         a1 = ifelse(is.na(a), 1, a),
         b1 = ifelse(is.na(b), 1, b),
         c1 = ifelse(is.na(c), 1, c),
         d1 = ifelse(is.na(d), 1, d),
         e1 = ifelse(is.na(e), 1, e)
         ) %>%
  mutate(summin = a0 + b0 + c0 + d0 + e0,
         summax = a1 + b1 + c1 + d1 + e1) %>%
  mutate(f = ifelse(summax <= 2,
                    FALSE,
                    ifelse(summin >= 3, TRUE, NA)))

This works, but I had to create too many redundant variables, and the code would get far too long with more variables. Is there a better solution?

CodePudding user response:

I would use

library(tidyverse)
want <- if_else(rowSums(df, na.rm = TRUE) > 2, TRUE, FALSE)

If you want to stick to base-R you can use the function ifelse() instead.
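Note that the one-liner above treats every NA as 0, so it never returns NA. A minimal base-R sketch of the bound-based logic from the question follows; it assumes df holds only the five rows shown (the elided "..." rows are left out), and the names n, summin and summax are chosen here for illustration.

```r
# Assumption: only the five rows printed in the question.
df <- data.frame(a = c(1, 0, 1, 0, 0),
                 b = c(1, 0, NA, 0, 1),
                 c = c(1, 0, 1, 0, NA),
                 d = c(0, 1, 1, NA, NA),
                 e = c(0, 0, 0, 1, 1))

n <- 2  # threshold: "more than n conditions satisfied"

# Lowest possible row sum: every NA counted as 0.
summin <- rowSums(df, na.rm = TRUE)
# Highest possible row sum: every NA counted as 1.
summax <- summin + rowSums(is.na(df))

# TRUE if even the lowest sum exceeds n, FALSE if even the highest
# sum cannot exceed n, NA when the answer depends on the missing values.
f <- ifelse(summin > n, TRUE, ifelse(summax <= n, FALSE, NA))

print(unname(f))  # TRUE FALSE TRUE FALSE NA
```

Because the bounds come from rowSums() over the whole data frame, the same lines should work unchanged with more columns, as long as every column is one of the 0/1 indicators.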

CodePudding user response:

I am not sure what you mean by "For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d."

But the following results in the vector you wish for:

test <- ifelse(is.na(df$c), NA, ifelse(rowSums(df[, 1:5], na.rm = TRUE) > 2, TRUE, FALSE))

If column c is NA, the new vector test gets NA for that row. Otherwise, the code checks whether the sum of the first 5 columns is greater than 2: the result is TRUE if it is, and FALSE if the sum is 2 or less.
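Keying on column c reproduces the five example rows, but it may not carry over to other data. The sketch below compares that rule with the lower/upper bound rule described in the question on two rows that are made up here purely for illustration (df2 and both result names are hypothetical).

```r
# Hypothetical rows, not from the question:
# row 1: c is known but d and e are missing, so the sum can be 2, 3 or 4
# row 2: c is missing, but a + b + d is already 3
df2 <- data.frame(a = c(0, 1),
                  b = c(1, 1),
                  c = c(1, NA),
                  d = c(NA, 1),
                  e = c(NA, 0))

# Rule keyed on column c only.
keyed_on_c <- ifelse(is.na(df2$c), NA,
                     rowSums(df2[, 1:5], na.rm = TRUE) > 2)

# Rule based on the lowest and highest possible row sums.
lo <- rowSums(df2, na.rm = TRUE)      # NAs counted as 0
hi <- lo + rowSums(is.na(df2))        # NAs counted as 1
by_bounds <- ifelse(lo > 2, TRUE, ifelse(hi <= 2, FALSE, NA))

print(unname(keyed_on_c))  # FALSE NA
print(unname(by_bounds))   # NA TRUE
```

The two rules disagree on both rows: the first row is undetermined even though c is present, and the second is already decided even though c is missing.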
