I am grouping by an identifier (the identifier may have multiple rows) and then attempting to make a unique row per identifier by using a group_by
summarize
with a case_when
. The idea is if there are 2 or more rows then return a single preset value otherwise return the original value. The problem is that case_when returns duplicate rows with the preset value. How do I make sure the output is unique? I could use a distinct
but my expectation would be that the value is already unique. dplyr v. 1.0.7
MWE
library(tidyverse)
dat <- tibble(
id = c(1, 1, 2, 3, 4, 4, 5),
fav_color = c('red', 'blue', 'red', 'green', 'white', 'black', 'pink')
)
dat %>%
group_by(id) %>%
summarize(
fav_color = case_when(
n() > 1 ~ 'Multiple Colors',
n() == 1 ~ fav_color
)
)
Returns
# A tibble: 7 x 2
# Groups: id [5]
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 1 Multiple Colors
3 2 red
4 3 green
5 4 Multiple Colors
6 4 Multiple Colors
7 5 pink
What I want ifelse
gives:
dat %>%
group_by(id) %>%
summarize(
fav_color = ifelse(n() > 1, 'Multiple Colors', fav_color)
)
# A tibble: 5 x 2
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 2 red
3 3 green
4 4 Multiple Colors
5 5 pink
Note Using a character output for the final return results in the correct number of rows:
dat %>%
group_by(id) %>%
summarize(
fav_color = case_when(
n() > 1 ~ 'Multiple Colors',
n() == 1 ~ 'Single Color'
)
)
# A tibble: 5 x 2
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 2 Single Color
3 3 Single Color
4 4 Multiple Colors
5 5 Single Color
CodePudding user response:
case_when
is not designed to deal with this, according to a Tidyverse developer (Github issue #5730).
CodePudding user response:
You need to reduce your grouped dataset using reduce functions. Try below using a max
on the desired output column:
dat %>%
group_by(id) %>%
summarize(
color_count = n(),
fav_color = case_when(
color_count > 1 ~ 'Multiple Colors',
color_count == 1 ~ max(fav_color)
)
)