Collapse levels of a factor when the number of observations within a level is below a limit

Time:09-16

I would like a way to collapse levels of a factor based on the number of observations for each level.

For example, if I have the tibble below with a factor column of animals (four levels: cat, dog, hamster, goldfish), can I collapse levels with fewer than 2 observations into a level called "other"?

# A tibble: 7 × 1
  animal  
  <fct>   
1 cat     
2 cat     
3 cat     
4 dog     
5 dog     
6 hamster 
7 goldfish

This should result in the following...

# A tibble: 7 × 2
  animal   animal2
  <fct>    <fct>  
1 cat      cat    
2 cat      cat    
3 cat      cat    
4 dog      dog    
5 dog      dog    
6 hamster  other  
7 goldfish other  

I would like to be able to adjust the cut-off (e.g. groups with fewer than 5 observations), and ideally this would be done using the tidyverse.

CodePudding user response:

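You're looking for forcats::fct_lump_min, which collapses levels that appear fewer than min times into a single level. That level is named "Other" by default; pass other_level = "other" if you want the lowercase label from the question: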

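library(forcats)
library(dplyr)

# Example data from the question
df <- tibble(animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish")))

df %>% 
  mutate(animal2 = fct_lump_min(animal, min = 2),  # lump levels with fewer than 2 rows
         animal3 = fct_lump_min(animal, min = 3))  # lump levels with fewer than 3 rows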

Output:

# A tibble: 7 × 3
  animal   animal2 animal3
  <fct>    <fct>   <fct>  
1 cat      cat     cat    
2 cat      cat     cat    
3 cat      cat     cat    
4 dog      dog     Other  
5 dog      dog     Other  
6 hamster  Other   Other  
7 goldfish Other   Other
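
To match the lowercase "other" label and the adjustable cut-off asked for in the question, something along these lines may work. This is a minimal sketch: the variable name min_obs is made up for illustration, and the data frame is rebuilt from the question's example.

```r
library(forcats)
library(dplyr)

# Example data rebuilt from the question.
df <- tibble(
  animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish"))
)

# Hypothetical cut-off variable: levels with fewer rows than this are lumped.
min_obs <- 2

result <- df %>%
  mutate(
    # other_level sets the name of the lumped level (default is "Other").
    animal2 = fct_lump_min(animal, min = min_obs, other_level = "other")
  )

print(result)
```

Changing min_obs (for instance to 5) moves the threshold without touching the rest of the pipeline.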