I am trying to check the values of various binary columns and then mutate a new column populated with a value conditional to those columns.
The new column should indicate which column included 1 when all others were 0 and record the number of that column. If multiple 1s are present, a zero should be recorded. Y1 is included as an indicator that some columns need to be ignored.
For example,
if X1=0, X2=0, X3=1; I'd like Xfoo to be 3,
if X1=1, X2=0, X3=1; I'd like Xfoo to be 0, etc
So far I have been achieving this using dplyr mutate
and case_when
as follows:
# create toy data:
set.seed(123)
N<-10
mydata <- data.frame(Y1 = sample(0 : 1, N, replace = TRUE),
X1 = sample(0 : 1, N, replace = TRUE),
X2 = sample(0 : 1, N, replace = TRUE),
X3 = sample(0 : 1, N, replace = TRUE))
# perform mutate:
mydata <- mydata %>%
mutate(
Xfoo = case_when(
X1 == 1 & X2 == 0 & X3 == 0 ~ 1,
X1 == 0 & X2 == 1 & X3 == 0 ~ 2,
X1 == 0 & X2 == 0 & X3 == 1 ~ 3,
TRUE ~ 0)
)
I am happy with this approach given small numbers of variables, however, I realise that this quickly becomes less effective with large numbers of variables.
To combat this, I'd have thought that grouping conditions in some way would keep things concise. Something like:
case_when(
X1 == 1 & c(X2, X3) ==0 ~ 1
...
I realise that my logic is wrong here, I've been experimenting with all()
and any()
but can't figure it out and would appreciate some pointers.
CodePudding user response:
There's not really a built in function function for multiple comparisons but you could create one
all_same_as <- function(val, ...) {
Reduce(function(a,b) a & (b==val), list(...), init = TRUE)
}
So this function allows you to pass in any number of columns, and it checks that all the values in those columns are equal to the first value you pass in. So the first parameter should be the number you want the rest to match And then you could use
mydata %>%
mutate(
Xfoo = case_when(
X1 == 1 & all_same_as(0, X2, X3) ~ 1,
X2 == 1 & all_same_as(0, X1, X3) ~ 2,
X3 == 1 & all_same_as(0, X1, X2) ~ 3,
TRUE ~ 0)
)
CodePudding user response:
You could try this, but it will be less computationally efficient. I'd use {data.table} if you are working with a large dataset.
mydata %>%
rowwise() %>%
mutate(
Xfoo = case_when(
all(c(X2,X3) == 0) & X1 == 1 ~ 1,
all(c(X1,X3) == 0) & X2 == 1 ~ 2,
all(c(X1,X2) == 0) & X3 == 1 ~ 3,
TRUE ~ 0)
) %>%
ungroup()
CodePudding user response:
You could check what column has the maximum value and extract the column names and the number from that (as indicated by @Henrik in the comments):
library(tidyverse)
mydata |>
mutate(Xfoo = pmap_dbl(across(starts_with("X")),
~ if_else(sum(c(...)) == 1,
names(c(...))[which.max(c(...))] |> parse_number(),
0)))
Output:
# A tibble: 10 × 5
Y1 X1 X2 X3 Xfoo
<int> <int> <int> <int> <dbl>
1 0 1 0 0 1
2 0 1 1 1 0
3 0 1 0 0 1
4 1 0 0 1 3
5 0 1 0 1 0
6 1 0 0 0 0
7 1 1 1 0 0
8 1 0 1 0 2
9 0 0 0 0 0
10 0 0 1 1 0
See also: For each row return the column name of the largest value