Home > database >  grouped case_when returns duplicate rows
grouped case_when returns duplicate rows

Time:09-30

I am grouping by an identifier (the identifier may have multiple rows) and then attempting to make a unique row per identifier by using a group_by summarize with a case_when. The idea is if there are 2 or more rows then return a single preset value otherwise return the original value. The problem is that case_when returns duplicate rows with the preset value. How do I make sure the output is unique? I could use a distinct but my expectation would be that the value is already unique. dplyr v. 1.0.7

MWE

library(tidyverse)

dat <- tibble(
    id = c(1, 1, 2, 3, 4, 4, 5),
    fav_color = c('red', 'blue', 'red', 'green', 'white', 'black', 'pink')
) 


dat %>%
    group_by(id) %>%
    summarize(
        fav_color = case_when(
            n() > 1 ~ 'Multiple Colors',
            n() == 1 ~ fav_color
        )
    )

Returns

# A tibble: 7 x 2
# Groups:   id [5]
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     1 Multiple Colors
3     2 red            
4     3 green          
5     4 Multiple Colors
6     4 Multiple Colors
7     5 pink    

What I want ifelse gives:

dat %>%
    group_by(id) %>%
    summarize(
        fav_color = ifelse(n() > 1, 'Multiple Colors', fav_color)
    )
# A tibble: 5 x 2
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     2 red            
3     3 green          
4     4 Multiple Colors
5     5 pink 

Note Using a character output for the final return results in the correct number of rows:

dat %>%
    group_by(id) %>%
    summarize(
        fav_color = case_when(
            n() > 1 ~ 'Multiple Colors',
            n() == 1 ~ 'Single Color'
        )
    )
# A tibble: 5 x 2
     id fav_color      
  <dbl> <chr>          
1     1 Multiple Colors
2     2 Single Color   
3     3 Single Color   
4     4 Multiple Colors
5     5 Single Color 

CodePudding user response:

case_when is not designed to deal with this, according to a Tidyverse developer (Github issue #5730).

CodePudding user response:

You need to reduce your grouped dataset using reduce functions. Try below using a max on the desired output column:

dat %>%
group_by(id) %>%
summarize(
 color_count = n(),
 fav_color = case_when(
  color_count > 1 ~ 'Multiple Colors',
  color_count == 1 ~ max(fav_color)
 )
)
  • Related