I want to create a new variable when two variables match a condition. In that case I want to have the smaller of two other variables. In my real data these are dates and I want to add other conditions in the case_when function, but I keep it simple her.
When I use first(sort(c(var1,var2)))
the variables are interpreted as the whole column and it results in the smallest value for var1, var2 for the entire data.frame. But I want the smallest of the two variables per row in the data. In contrast, when I use var1 * var2
the variables per row in the data are used.
I tried the following:
library(dplyr)
starwars %>%
mutate(
new_var =
case_when(
hair_color == "brown" & eye_color == "brown" ~ first(sort(c(birth_year, mass)))
),
new_var2 =
case_when(
hair_color == "brown" & eye_color == "brown" ~ birth_year * mass
)
) %>%
select ( name, hair_color, eye_color, mass, birth_year, new_var, new_var2) %>%
filter(hair_color == "brown" & eye_color == "brown")
shortend output:
# A tibble: 9 × 7
name hair_color eye_color mass birth_year new_var new_var2
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Leia Organa brown brown 49 19 8 931
2 Han Solo brown brown 80 29 8 2320
3 Arvel Crynyd brown brown NA NA 8 NA
4 Wicket Systri Warrick brown brown 20 8 8 160
Thanks a lot in advance
CodePudding user response:
For functions that summarise information, like first
, using the function without grouping will result in a calculation over the entire dataframe. The most direct remedy is to use rowwise
with c_across
.
But doing this results in slower code, so I'd suggest using pmin
:
library(dplyr)
library(purrr)
starwars %>%
mutate(new_var =
case_when(hair_color == "brown" & eye_color == "brown" ~
pmin(birth_year, mass))) %>%
select (name, hair_color, eye_color, mass, birth_year, new_var)
output
# A tibble: 87 × 6
name hair_color eye_color mass birth_year new_var
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Luke Skywalker blond blue 77 19 NA
2 C-3PO NA yellow 75 112 NA
3 R2-D2 NA red 32 33 NA
4 Darth Vader none yellow 136 41.9 NA
5 Leia Organa brown brown 49 19 19
6 Owen Lars brown, grey blue 120 52 NA
7 Beru Whitesun lars brown blue 75 47 NA
8 R5-D4 NA red 32 NA NA
9 Biggs Darklighter black brown 84 24 NA
10 Obi-Wan Kenobi auburn, white blue-gray 77 57 NA
With c_across
rowwise
. This is not an optimal solution. You could use pmin
to get the minimum value per row instead (see above).
starwars %>%
rowwise() %>%
mutate(new_var =
case_when(hair_color == "brown" & eye_color == "brown" ~
first(sort(c_across(c(birth_year, mass)))))) %>%
select (name, hair_color, eye_color, mass, birth_year, new_var)
If you need the calculations only on those with the conditions, you can be even simpler and avoid case_when
, using pmin
:
starwars %>%
filter(hair_color == "brown" & eye_color == "brown") %>%
mutate(new_var = pmin(birth_year, mass)))