Using case_when with dplyr mutate: why is sometimes the value per row used for calculation and somet-CodePudding

I want to create a new variable when two variables match a condition. In that case I want to have the smaller of two other variables. In my real data these are dates and I want to add other conditions in the case_when function, but I keep it simple her.

When I use first(sort(c(var1,var2))) the variables are interpreted as the whole column and it results in the smallest value for var1, var2 for the entire data.frame. But I want the smallest of the two variables per row in the data. In contrast, when I use var1 * var2 the variables per row in the data are used.

I tried the following:

library(dplyr)
    starwars %>% 
      mutate(
        new_var = 
          case_when(
           hair_color == "brown" & eye_color == "brown" ~ first(sort(c(birth_year, mass)))
          ),
        new_var2 = 
          case_when(
            hair_color == "brown" & eye_color == "brown" ~ birth_year * mass 
          )
       ) %>% 
      select ( name, hair_color, eye_color, mass, birth_year, new_var, new_var2) %>% 
      filter(hair_color == "brown" & eye_color == "brown")

shortend output:

 # A tibble: 9 × 7
  name                  hair_color eye_color  mass birth_year new_var new_var2
  <chr>                 <chr>      <chr>     <dbl>      <dbl>   <dbl>    <dbl>
1 Leia Organa           brown      brown        49         19       8      931
2 Han Solo              brown      brown        80         29       8     2320
3 Arvel Crynyd          brown      brown        NA         NA       8       NA
4 Wicket Systri Warrick brown      brown        20          8       8      160

Thanks a lot in advance

CodePudding user response：

For functions that summarise information, like first, using the function without grouping will result in a calculation over the entire dataframe. The most direct remedy is to use rowwise with c_across.

But doing this results in slower code, so I'd suggest using pmin:

library(dplyr)
library(purrr)
starwars %>% 
  mutate(new_var = 
           case_when(hair_color == "brown" & eye_color == "brown" ~ 
                       pmin(birth_year, mass))) %>% 
  select (name, hair_color, eye_color, mass, birth_year, new_var)

output

# A tibble: 87 × 6
   name               hair_color    eye_color  mass birth_year new_var
   <chr>              <chr>         <chr>     <dbl>      <dbl>   <dbl>
 1 Luke Skywalker     blond         blue         77       19        NA
 2 C-3PO              NA            yellow       75      112        NA
 3 R2-D2              NA            red          32       33        NA
 4 Darth Vader        none          yellow      136       41.9      NA
 5 Leia Organa        brown         brown        49       19        19
 6 Owen Lars          brown, grey   blue        120       52        NA
 7 Beru Whitesun lars brown         blue         75       47        NA
 8 R5-D4              NA            red          32       NA        NA
 9 Biggs Darklighter  black         brown        84       24        NA
10 Obi-Wan Kenobi     auburn, white blue-gray    77       57        NA

With c_across rowwise. This is not an optimal solution. You could use pmin to get the minimum value per row instead (see above).

starwars %>% 
  rowwise() %>% 
  mutate(new_var = 
           case_when(hair_color == "brown" & eye_color == "brown" ~ 
                       first(sort(c_across(c(birth_year, mass)))))) %>% 
  select (name, hair_color, eye_color, mass, birth_year, new_var)

If you need the calculations only on those with the conditions, you can be even simpler and avoid case_when, using pmin:

starwars %>% 
  filter(hair_color == "brown" & eye_color == "brown") %>% 
  mutate(new_var = pmin(birth_year, mass)))