I would like a way to collapse levels of a factor based on the number of observations for each level.
For example, if I have the tibble below with a factor column of animals (four levels: cat, dog, hamster, goldfish), can I collapse levels with fewer than 2 observations into a level called "other"?
# A tibble: 7 × 1
  animal
  <fct>
1 cat
2 cat
3 cat
4 dog
5 dog
6 hamster
7 goldfish
This should result in the following:
# A tibble: 7 × 2
  animal   animal2
  <fct>    <fct>
1 cat      cat
2 cat      cat
3 cat      cat
4 dog      dog
5 dog      dog
6 hamster  other
7 goldfish other
I would like to be able to adjust the cut-off (e.g. groups with fewer than 5 observations), and ideally this would be done using the tidyverse.
CodePudding user response:
You're looking for forcats::fct_lump_min, which collapses levels that appear fewer than min times into 'Other':
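library(forcats)
library(dplyr)
# Example data from the question
df <- tibble(animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish")))
# Lump levels with fewer than `min` observations into "Other"
df %>%
  mutate(animal2 = fct_lump_min(animal, min = 2),
         animal3 = fct_lump_min(animal, min = 3))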
Output:
# A tibble: 7 × 3
  animal   animal2 animal3
  <fct>    <fct>   <fct>
1 cat      cat     cat
2 cat      cat     cat
3 cat      cat     cat
4 dog      dog     Other
5 dog      dog     Other
6 hamster  Other   Other
7 goldfish Other   Other
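The question asked for a lowercase "other" level and an adjustable cut-off. A minimal sketch of one way to get that, assuming a forcats version that provides fct_lump_min (0.5.0 or later): pass the label through the other_level argument and keep the threshold in a variable. The name min_n is just an illustrative choice, not something forcats defines.

```r
library(forcats)
library(dplyr)

# Example data from the question
df <- tibble(animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish")))

# Cut-off: levels with fewer than min_n observations are lumped together.
# min_n is a hypothetical variable name used only for this example.
min_n <- 2

result <- df %>%
  mutate(animal2 = fct_lump_min(animal, min = min_n, other_level = "other"))

result
```

With min_n <- 2, hamster and goldfish end up as "other" while cat and dog are kept, which matches the animal2 column requested in the question. Changing min_n to 5 would apply the stricter cut-off mentioned there.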