How can I use different variables to calculate a new variable, depending on which variable has missi-CodePudding

I would like to create a variable x that is the product of variable e and one of the variables a, b, c, or d. The resulting variable should take the value a * e, but if a is NA, then it should take the value b * e and if a & b are missing, then it should take the value c * e and so on.

For example: If my data frame looks like this:

df <- data.frame(a = c(1, 2, NA, NA, 5), b = c(NA, 1, NA, 4, 6), c = c(NA, 3, 3, 3, 7), d = c(1, 1, 1, 1, 1), e = c(1, 2, 3, 4, NA))

I would like to get to having the following result:

df$x <- c(1, 4, 9, 16, NA)

which I am trying to achieve the following way:

df <- df %>% mutate(x = case_when(!is.na(a) ~ a * e, is.na(a) ~ b * e, is.na(a) & is.na(b) ~ c * e, is.na(a) & is.na(b) & is.na(c) ~ d * e))

This, unfortunately is not working yet, since R is somehow not understanding is.na(a) & is.na(b) as both values missing at the same time.

CodePudding user response：

You can use coalesce() from dplyr to find the first non-missing element.

library(dplyr)

df %>%
  mutate(x = e * coalesce(a, b, c, d))

#    a  b  c d  e  x
# 1  1 NA NA 1  1  1
# 2  2  1  3 1  2  4
# 3 NA NA  3 1  3  9
# 4 NA  4  3 1  4 16
# 5  5  6  7 1 NA NA

If you have lots of columns to be multiplied, you can automate it by using tidy-selection in across(). (Edit: Thank @akrun for the improvement)

df %>%
  mutate(x = e * do.call(coalesce, across(a:d)))

CodePudding user response：

df <-
  data.frame(
    a = c(1, 2, NA, NA, 5),
    b = c(NA, 1, NA, 4, 6),
    c = c(NA, 3, 3, 3, 7),
    d = c(1, 1, 1, 1, 1),
    e = c(1, 2, 3, 4, NA)
  )

df$res <- df[, 5] * apply(df[-5], 1, function(x) x[which(x = !is.na(x))[1]])
df
#>    a  b  c d  e res
#> 1  1 NA NA 1  1   1
#> 2  2  1  3 1  2   4
#> 3 NA NA  3 1  3   9
#> 4 NA  4  3 1  4  16
#> 5  5  6  7 1 NA  NA

^{Created on 2022-05-19 by the reprex package (v2.0.1)}

CodePudding user response：

Here is one option with row/column indexing

df$res <- df$e * df[-5][cbind(seq_len(nrow(df)), 
     max.col(!is.na(df[-5]), 'first'))]
df$res
[1]  1  4  9 16 NA