Apply "or" function across any number of data.frame columns and preserve missingness

Time:09-23

I create datasets in R regularly and often find I need to take two or more binary variables and "or" them into one new variable that indicates if any were 1, none were 1, or all were missing. Simply using | does not handle NA's the way I would like.

So given a data.frame, df, with three columns:

x = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,NA,NA,NA,NA,NA,NA,NA,NA,NA)
y = c( 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA)
z = c( 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA)

df = data.frame(x,y,z)

The output I am looking for is:

myFunction(df)
 [1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA

But simply using | does not handle 0's the way I want, because an NA takes priority over a 0 (0 | NA is NA):

as.numeric(df$x | df$y | df$z)
 [1]  0  1 NA  1  1  1 NA  1 NA  1  1  1  1  1  1  1  1  1 NA  1 NA  1  1  1 NA  1 NA
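
The NAs above come from R's three-valued logic: | returns TRUE as soon as one operand is TRUE, but a 0 (FALSE) combined with NA stays NA, because the missing value could have been TRUE. A minimal illustration (the variable names are only for this sketch):

```r
# The | operator treats 0 as FALSE and 1 as TRUE.
zero_or_na <- 0 | NA   # NA: the missing value might have been TRUE
one_or_na  <- 1 | NA   # TRUE: already decided, whatever NA stands for
na_or_na   <- NA | NA  # NA

print(c(zero_or_na, one_or_na, na_or_na))
```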

This is the best solution I came up with:

myFunction <- function(...) {
  # For each row: NA if every value is missing, otherwise 1 if any value is 1
  as.numeric(apply(data.frame(...), 1, function(x) { if (all(is.na(x))) NA else sum(x, na.rm = TRUE) }) > 0)
}
df$xyz = myFunction(df)
df$xyz
 [1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA

Is there a package with this functionality or a better way to write this so I don't have to copy paste this mess across all my scripts? Am I over thinking this?

CodePudding user response:

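Another way that I thought of is to take the row-wise maximum. For a row that is all NA, max(..., na.rm = TRUE) returns -Inf with a warning, hence the final replace: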

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(out = max(c_across(everything()), na.rm = TRUE)) %>% 
  pull(out) %>% 
  replace(is.infinite(.), NA)

 [1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA

CodePudding user response:

We can use rowSums and convert to binary. The NA^(...) factor evaluates to NA for rows where every value is missing and to 1 otherwise:

df$new_col <- +(rowSums(df, na.rm = TRUE) > 0) * NA^(!rowSums(!is.na(df)))

-output

df$new_col
[1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA
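
Since the question asks for something reusable, the rowSums idea can be wrapped in a small function. This is only a sketch: row_or is a hypothetical name, not a function from any package, and it assumes every column is numeric 0/1 or NA. It also shows why the NA^ multiplier works: NA^0 is 1 in R, while NA^1 stays NA.

```r
# Same data as in the question, built with rep() for brevity.
df <- data.frame(
  x = rep(c(0, 1, NA), each = 9),
  y = rep(rep(c(0, 1, NA), each = 3), times = 3),
  z = rep(c(0, 1, NA), times = 9)
)

# FALSE is coerced to 0 and TRUE to 1, so this is c(NA^0, NA^1).
na_pow <- c(NA^FALSE, NA^TRUE)

# row_or() is a hypothetical helper name chosen for this sketch.
row_or <- function(data) {
  any_one     <- rowSums(data, na.rm = TRUE) > 0  # at least one 1 in the row
  all_missing <- rowSums(!is.na(data)) == 0       # no observed value at all
  out <- as.numeric(any_one)
  out[all_missing] <- NA
  out
}

result <- row_or(df)
print(result)
```

Sourcing such a helper from one shared file (or a small personal package) would avoid copying it into every script.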

It is also possible in a compact way with sum_ from hablar, which returns NA when every value is missing; the leading + converts the logical result to numeric:

library(hablar)
+(apply(df, 1, sum_) > 0)
 [1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA

CodePudding user response:

If you want your output as a new column in the dataframe:

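dplyr::if_any is most helpful here. We can use if_any() to create a logical vector that is TRUE for every row in which any column is TRUE (here, equal to 1). Rows with no 1 and at least one NA come out as NA, so we then replace those NAs with zeroes using coalesce. Note that this also turns the all-NA row into 0; see the end of this answer for keeping it as NA.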

library(dplyr)

df %>% mutate(new_col=coalesce(if_any(everything()), 0))

    x  y  z new_col
1   0  0  0       0
2   0  0  1       1
3   0  0 NA       0
4   0  1  0       1
5   0  1  1       1
6   0  1 NA       1
7   0 NA  0       0
8   0 NA  1       1
9   0 NA NA       0
10  1  0  0       1
11  1  0  1       1
12  1  0 NA       1
13  1  1  0       1
14  1  1  1       1
15  1  1 NA       1
16  1 NA  0       1
17  1 NA  1       1
18  1 NA NA       1
19 NA  0  0       0
20 NA  0  1       1
21 NA  0 NA       0
22 NA  1  0       1
23 NA  1  1       1
24 NA  1 NA       1
25 NA NA  0       0
26 NA NA  1       1
27 NA NA NA       0

We use coalesce to replace NAs with 0s only inside the mutate call, so the NAs in the original columns are preserved. We can also use reduce(`|`) to create the new column, then coerce to numeric with +.

library(dplyr)
library(purrr)

df %>% mutate(new_col = +(map_dfc(df, coalesce, 0) %>% reduce(`|`)))

Or just use the reduce(`|`) method first, then replace NAs with 0 using coalesce at the end:

library(dplyr)
library(purrr)

df %>% mutate(new_col = coalesce(reduce(., `|`), 0))

If you want just the vector, use:

coalesce(Reduce(`|`, df), 0)

 [1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0

Observation

For row-wise logical operations, if_any/if_all, reduce(`|`) and reduce(`&`), and rowSums(condition) are more robust than rowwise %>% max, because max can't handle rows that are all NA: with na.rm = TRUE it returns -Inf and issues a warning.
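
The all-NA case is easy to check in isolation. The sketch below shows the -Inf result and one possible guard; row_max_or_na is a hypothetical helper name used only for illustration.

```r
all_na <- c(NA_real_, NA_real_, NA_real_)

# With every value removed, max() is effectively called on an empty vector.
# It warns and returns -Inf; suppressWarnings() only hides the warning here.
res <- suppressWarnings(max(all_na, na.rm = TRUE))
print(res)

# One possible guard (hypothetical helper): return NA when nothing is observed.
row_max_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
}
guarded <- c(row_max_or_na(all_na), row_max_or_na(c(0, NA, 1)))
print(guarded)
```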

If you want NA as the output when all values in a row are NA, pipe the intermediate result into replace() and use if_all(everything(), is.na) to pick out those rows, as in the following code:

output<-df %>% mutate(new_col=coalesce(if_any(everything()), 0) %>%
                      replace(., if_all(everything(), is.na), NA))

output$new_col
 [1]  0  1  0  1  1  1  0  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1  1  0  1 NA
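
If a version without package dependencies is preferred, the same two steps (fill the undecided rows with 0, then restore NA for the all-NA rows) can be sketched in base R. The variable names are illustrative, and the sketch assumes all columns are numeric 0/1 or NA.

```r
# Same data as in the question, built with rep() for brevity.
df <- data.frame(
  x = rep(c(0, 1, NA), each = 9),
  y = rep(rep(c(0, 1, NA), each = 3), times = 3),
  z = rep(c(0, 1, NA), times = 9)
)

# Column-wise "or": TRUE if any 1, NA if no 1 but at least one NA, else FALSE.
out <- as.numeric(Reduce(`|`, df))

# Treat "no 1 seen" as 0 ...
out[is.na(out)] <- 0

# ... except where the whole row is missing, which stays NA.
out[rowSums(!is.na(df)) == 0] <- NA

print(out)
```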