I create datasets in R regularly and often find I need to take two or more binary variables and "or" them into one new variable that indicates if any were 1, none were 1, or all were missing.
Simply using | does not handle NA's the way I would like.
So given a data.frame, df, of three columns:
x = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,NA,NA,NA,NA,NA,NA,NA,NA,NA)
y = c( 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA, 0, 0, 0, 1, 1, 1,NA,NA,NA)
z = c( 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA, 0, 1,NA)
df = data.frame(x,y,z)
The output I am looking for is:
myFunction(df)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
But simply using | does not handle 0's the way I am looking for, as it prioritizes NA's over 0's:
as.numeric(df$x | df$y | df$z)
[1] 0 1 NA 1 1 1 NA 1 NA 1 1 1 1 1 1 1 1 1 NA 1 NA 1 1 1 NA 1 NA
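The NAs above come from R's three-valued logic: | returns NA whenever the missing value could still change the answer. A minimal base R sketch of that behaviour (the vectors a and b are made up for illustration):

```r
# R's `|` treats NA as "unknown".
a <- c(0, 0, 1, NA)
b <- c(0, NA, NA, NA)

# 0 | 0   -> FALSE
# 0 | NA  -> NA   (the unknown value might be TRUE)
# 1 | NA  -> TRUE (already TRUE, the unknown value cannot change it)
# NA | NA -> NA
res <- as.numeric(a | b)
print(res)
```

So a 0 combined with an NA stays NA, which is exactly the case the question wants mapped to 0.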
This is the best solution I came up with:
myFunction <- function(...) {
  as.numeric(apply(data.frame(...), 1, function(x) {
    ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE))
  }) > 0)
}
df$xyz = myFunction(df)
df$xyz
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
Is there a package with this functionality, or a better way to write this so I don't have to copy-paste this mess across all my scripts? Am I overthinking this?
CodePudding user response:
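Another way that I thought of:
library(dplyr)
df %>%
  rowwise() %>%
  # name the columns explicitly; newer dplyr versions expect c_across() to receive a selection
  mutate(out = max(c_across(everything()), na.rm = TRUE)) %>%
  pull(out) %>%
  # max() returns -Inf (with a warning) for all-NA rows, so map that back to NA
  replace(is.infinite(.), NA)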
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
CodePudding user response:
We can use rowSums and convert to binary:
df$new_col <- (rowSums(df, na.rm = TRUE) > 0) * NA^(!rowSums(!is.na(df)))
Output:
df$new_col
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
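The NA^(...) factor in the one-liner above can look cryptic, so here is a small base R sketch of how it works. The data frame toy and the intermediate names n_obs, mask and any_one are made up for illustration:

```r
# Toy data: the last row is entirely missing.
toy <- data.frame(x = c(0, 1, NA, NA), y = c(0, NA, 0, NA))

# Number of non-missing values per row; 0 means the whole row is NA.
n_obs <- rowSums(!is.na(toy))

# !n_obs is TRUE only for all-NA rows, so the mask is NA^1 = NA there
# and NA^0 = 1 everywhere else.
mask <- NA^(!n_obs)

# Multiplying by the mask keeps the 0/1 result except for all-NA rows.
any_one <- unname((rowSums(toy, na.rm = TRUE) > 0) * mask)
print(any_one)
```

The trick relies on NA^0 evaluating to 1 in R, which is why rows with at least one observed value pass through unchanged.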
It is also possible in a compact way if we use sum_ from hablar:
library(hablar)
# unary + converts the logical result to 0/1
+(apply(df, 1, sum_) > 0)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
CodePudding user response:
If you want your output as a new column in the dataframe:
dplyr::if_any is most helpful here. We can use if_any() to create a logical vector that is TRUE if any of the elements in a row is TRUE. Then replace NAs with zeroes with coalesce.
library(dplyr)
df %>% mutate(new_col=coalesce(if_any(everything()), 0))
    x  y  z new_col
1   0  0  0       0
2   0  0  1       1
3   0  0 NA       0
4   0  1  0       1
5   0  1  1       1
6   0  1 NA       1
7   0 NA  0       0
8   0 NA  1       1
9   0 NA NA       0
10  1  0  0       1
11  1  0  1       1
12  1  0 NA       1
13  1  1  0       1
14  1  1  1       1
15  1  1 NA       1
16  1 NA  0       1
17  1 NA  1       1
18  1 NA NA       1
19 NA  0  0       0
20 NA  0  1       1
21 NA  0 NA       0
22 NA  1  0       1
23 NA  1  1       1
24 NA  1 NA       1
25 NA NA  0       0
26 NA NA  1       1
27 NA NA NA       0
We use coalesce to replace NAs with 0s inside the mutate call, so the NAs from the original columns are preserved.
We can also use reduce(`|`) to create the new column, after first replacing the NAs in every column with 0. The result is logical, so wrap it in as.numeric() if you need 0/1:
library(dplyr)
library(purrr)
df %>% mutate(new_col = (map_dfc(df, coalesce, 0) %>% reduce(`|`)))
Or just use the reduce(`|`) method first, then replace NAs with 0 with coalesce at the end:
library(dplyr)
library(purrr)
df %>% mutate(new_col = coalesce(reduce(., `|`), 0))
If you want just the vector, use:
coalesce(Reduce(`|`, df), 0)
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0
Observation:
For row-wise logical operations, if_any/if_all, reduce(`|`) and reduce(`&`), and rowSums(condition) are more robust than rowwise %>% max, because max can't handle rows with all NAs (with na.rm = TRUE it warns and outputs -Inf).
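A short base R sketch of that difference, using a made-up all-NA row (all_missing is only an illustrative name):

```r
# A row in which every value is missing.
all_missing <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to compare, so max() warns and
# returns -Inf (suppressWarnings() only silences that warning here).
m <- suppressWarnings(max(all_missing, na.rm = TRUE))
print(m)

# A sum with na.rm = TRUE, which is what rowSums() computes per row,
# quietly yields 0 for the same row.
s <- sum(all_missing, na.rm = TRUE)
print(s)
```

This is why the max-based answer needs the extra replace(is.infinite(.), NA) step, while the rowSums-based one needs a separate all-NA check instead.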
If you want NA as the output when all values in a given row are NA, just pipe the intermediate object into replace...if_all...is.na..., as in the following code:
output <- df %>%
  mutate(new_col = coalesce(if_any(everything()), 0) %>%
           replace(., if_all(everything(), is.na), NA))
output$new_col
[1] 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 NA
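To avoid copy-pasting the logic across scripts, it could be wrapped in a small helper that is sourced once. The following is only a base R sketch that assumes all columns hold 0, 1 or NA; the name or_na is hypothetical and does not come from any package:

```r
# Hypothetical helper (not from any package): row-wise "or" over 0/1 columns
# that returns NA only when every value in the row is missing.
or_na <- function(data) {
  m <- as.matrix(data)
  any_one  <- rowSums(m, na.rm = TRUE) > 0   # TRUE if the row has at least one 1
  all_miss <- rowSums(!is.na(m)) == 0        # TRUE if the whole row is NA
  out <- as.numeric(any_one)
  out[all_miss] <- NA
  out
}

# Small illustrative input (not the data from the question).
x <- c(0, 0, 1, NA, NA)
y <- c(0, NA, NA, 1, NA)
res <- or_na(data.frame(x, y))
print(res)
```

Keeping the all-NA check as an explicit logical vector makes the intent easier to read than the NA^ trick, at the cost of a few more lines.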