I have been struggling with the {forcats} fct_lump_prop() function, more specifically with its w = and prop = arguments. What exactly are they supposed to represent in the example below?
library(dplyr)
library(forcats)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
             var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# A tibble: 6 x 2
var1 var2
<chr> <dbl>
1 a 0.2
2 a 0.3
3 a 0.2
4 b 0.1
5 b 0.1
6 b 0.1
What prop value should I set so that a is kept intact but b is lumped as Others?
# A tibble: 6 x 3
var1 var2 fct
<chr> <dbl> <fct>
1 a 0.2 a
2 a 0.3 a
3 a 0.2 a
4 b 0.1 Others
5 b 0.1 Others
6 b 0.1 Others
I have tried empirically but can't work out how this behaves. Intuitively, I would say that setting a prop value above 0.3 (the sum of the b values) and below 0.7 (the sum of the a values) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, and then everything becomes Other:
df %>%
  mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))
# A tibble: 6 x 3
var1 var2 fct
<fct> <dbl> <fct>
1 a 0.2 Other
2 a 0.3 Other
3 a 0.2 Other
4 b 0.1 Other
5 b 0.1 Other
6 b 0.1 Other
So what am I not understanding? The small examples I found elsewhere did not help me grasp how prop and w behave.
Thanks a lot.
Answer:
The relevant explanation is in these lines of the source of fct_lump_prop():
if (prop > 0 && sum(prop_n <= prop) <= 1) {
  return(f) # No lumping needed
}
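That is, if at most one level would end up in "Other", the function returns the factor unchanged. In your example the weighted shares are 0.7 for a and 0.3 for b, so any prop between 0.3 and 0.7 would lump only b, and the factor comes back untouched; at prop = 0.7 both levels qualify and everything becomes "Other".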
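The check can be reproduced in base R for the data in the question. This is only a minimal sketch: it assumes that prop_n is each level's share of the total weight (sum of w per level divided by sum(w)), and the helper n_at_or_below() is a hypothetical name, not part of forcats.

```r
# Data from the question
var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

# Assumed definition of prop_n: weighted share of each level
prop_n <- tapply(var2, var1, sum) / sum(var2)
print(round(prop_n, 3))

# Hypothetical helper: how many levels sit at or below the threshold,
# i.e. how many would be sent to "Other"
n_at_or_below <- function(prop) sum(prop_n <= prop)

n_at_or_below(0.2)   # no level qualifies, nothing to lump
n_at_or_below(0.5)   # only b qualifies, so the guard returns f unchanged
n_at_or_below(0.75)  # a and b both qualify, so everything becomes "Other"
```

Under that assumption, a two-level factor has no positive prop that lumps only b: either a single level qualifies and the guard returns early, or both do.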
library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3), rep("c", 3)),
             var2 = (1:9) / 45)
df %>% group_by(var1) %>% summarise(sum(var2))
#> # A tibble: 3 × 2
#> var1 `sum(var2)`
#> <chr> <dbl>
#> 1 a 0.133
#> 2 b 0.333
#> 3 c 0.533
df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.45))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 Other
#> 2 a 0.0444 Other
#> 3 a 0.0667 Other
#> 4 b 0.0889 Other
#> 5 b 0.111 Other
#> 6 b 0.133 Other
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.25))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 a
#> 2 a 0.0444 a
#> 3 a 0.0667 a
#> 4 b 0.0889 b
#> 5 b 0.111 b
#> 6 b 0.133 b
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
Created on 2022-11-14 with reprex v2.0.2
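With prop = 0.45, two levels (a at 0.133 and b at 0.333) fall at or below the threshold, so both are lumped; with prop = 0.25 only a does, so nothing is lumped. This behavior doesn't appear to be documented except in the source code itself.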
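If you really need a single level lumped by its weighted share, one option is to do the relabelling yourself. The following is a base-R sketch under the same assumption about how the shares are computed; lump_by_weight() is a hypothetical helper written for this answer, not a forcats function, and it ignores NA handling.

```r
# Hypothetical helper (not part of forcats): lump every level whose weighted
# share is at or below `prop`, even when only one level qualifies.
lump_by_weight <- function(x, w, prop, other_level = "Other") {
  x <- as.character(x)
  share <- tapply(w, x, sum) / sum(w)    # weighted share per level
  small <- names(share)[share <= prop]   # levels to lump
  kept  <- setdiff(names(share), small)  # levels left untouched
  x[x %in% small] <- other_level
  # Put the lumped level last, and only if something was actually lumped
  factor(x, levels = c(kept, if (length(small) > 0) other_level))
}

# Data from the question
var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

fct <- lump_by_weight(var1, var2, prop = 0.5, other_level = "Others")
print(data.frame(var1, var2, fct))
```

Here a keeps its label and the three b rows become Others, which is the result asked for in the question.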
Edit
There is also an issue on GitHub: the man page says that levels are lumped if they appear "fewer than" prop times, but they are also lumped if they appear "exactly" prop times.
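The difference between the two readings can be seen in a small base-R sketch. It assumes that the inclusive comparison prop_n <= prop from the source excerpt above is what decides which levels are lumped; the vector f and the variable names are made up for illustration.

```r
# Unweighted toy factor: a appears half the time, b and c a quarter each
f <- c("a", "a", "b", "c")
prop_n <- table(f) / length(f)
prop <- 0.25

# Inclusive comparison, as in the source excerpt: b and c are selected
lumped_inclusive <- names(prop_n)[prop_n <= prop]

# Strict "fewer than" reading of the man page: nothing is selected
lumped_strict <- names(prop_n)[prop_n < prop]

print(lumped_inclusive)
print(lumped_strict)
```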