forcats::fct_lump_prop() behavior

Time:11-15

I have been struggling with the {forcats} fct_lump_prop() function, specifically its w = and prop = arguments. What exactly do they represent in the example below?

library(dplyr)   # provides tibble(), mutate() and %>%
library(forcats) # provides fct_lump_prop()
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
             var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# A tibble: 6 x 2
  var1   var2
  <chr> <dbl>
1 a       0.2
2 a       0.3
3 a       0.2
4 b       0.1
5 b       0.1
6 b       0.1

What prop value should I set so that a is kept intact but b is lumped as Others?

# A tibble: 6 x 3
  var1   var2  fct
  <chr> <dbl> <fct>
1 a       0.2   a
2 a       0.3   a
3 a       0.2   a
4 b       0.1   Others
5 b       0.1   Others
6 b       0.1   Others

I have tried empirically but can't work out how this behaves. Intuitively, I would expect that setting prop above 0.3 (the sum of the b weights) and below 0.7 (the sum of the a weights) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, at which point everything becomes the Other level.

df %>%
  mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))

# A tibble: 6 x 3
  var1   var2 fct  
  <fct> <dbl> <fct>
1 a       0.2 Other
2 a       0.3 Other
3 a       0.2 Other
4 b       0.1 Other
5 b       0.1 Other
6 b       0.1 Other

So what am I misunderstanding? The small examples I found elsewhere did not help me grasp how prop and w behave.

Thanks a lot.

CodePudding user response:

The relevant explanation is in this line of the source of fct_lump_prop():

if (prop > 0 && sum(prop_n <= prop) <= 1) {
   return(f) # No lumping needed
}

That is, if at most one level falls at or below prop (so "Other" would contain a single level), the function returns the factor unchanged.
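To make that rule concrete for the data in the question, here is a minimal base R sketch of what the quoted condition implies. It is not the forcats implementation: lump_check() is a hypothetical helper, and it assumes a positive prop and that each level's share is its summed weight divided by the total weight.

```r
# Hypothetical helper (not part of forcats): reports what the quoted
# condition implies for a factor `f`, weights `w`, and a positive `prop`.
lump_check <- function(f, w, prop) {
  f <- factor(f)
  # Weighted share of each level, as a proportion of the total weight
  share <- tapply(w, f, sum) / sum(w)
  candidates <- names(share)[share <= prop]
  list(
    share = share,
    candidates = candidates,
    # Mirrors the quoted early return: lumping needs more than one candidate
    lumps = length(candidates) > 1
  )
}

var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

res_mid <- lump_check(var1, var2, prop = 0.5)
res_mid$candidates   # only "b" falls at or below the threshold
res_mid$lumps        # FALSE: a single candidate is left alone

res_high <- lump_check(var1, var2, prop = 0.8)
res_high$candidates  # both "a" and "b"
res_high$lumps       # TRUE: every level would go to the other level
```

With only two levels there is no prop that lumps b alone: either b is the single candidate and nothing happens, or a qualifies too and everything is lumped.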

library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3),rep("c", 3)),
             var2 = (1:9)/(45))

df %>% group_by(var1) %>% summarise(sum(var2))
#> # A tibble: 3 × 2
#>   var1  `sum(var2)`
#>   <chr>       <dbl>
#> 1 a           0.133
#> 2 b           0.333
#> 3 c           0.533

df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.45))
#> # A tibble: 9 × 3
#>   var1    var2 fct  
#>   <chr>  <dbl> <fct>
#> 1 a     0.0222 Other
#> 2 a     0.0444 Other
#> 3 a     0.0667 Other
#> 4 b     0.0889 Other
#> 5 b     0.111  Other
#> 6 b     0.133  Other
#> 7 c     0.156  c    
#> 8 c     0.178  c    
#> 9 c     0.2    c

df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.25))
#> # A tibble: 9 × 3
#>   var1    var2 fct  
#>   <chr>  <dbl> <fct>
#> 1 a     0.0222 a    
#> 2 a     0.0444 a    
#> 3 a     0.0667 a    
#> 4 b     0.0889 b    
#> 5 b     0.111  b    
#> 6 b     0.133  b    
#> 7 c     0.156  c    
#> 8 c     0.178  c    
#> 9 c     0.2    c

Created on 2022-11-14 with reprex v2.0.2

This doesn't appear to be documented anywhere except in the source code itself.
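If you do want a single level lumped, one option is to skip the "more than one level" guard yourself. The following is only a sketch in base R, assuming the same weighted-share definition as above. lump_by_share() is a hypothetical helper, not a forcats function.

```r
# Hypothetical helper (not part of forcats): lumps every level whose
# weighted share is at or below `prop`, even if that is a single level.
lump_by_share <- function(f, w, prop, other_level = "Other") {
  f <- factor(f)
  share <- tapply(w, f, sum) / sum(w)
  small <- names(share)[share <= prop]
  out <- as.character(f)
  out[out %in% small] <- other_level
  # Keep the surviving levels in their original order, other level last
  kept <- setdiff(levels(f), small)
  factor(out, levels = c(kept, if (length(small)) other_level))
}

var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

fct <- lump_by_share(var1, var2, prop = 0.5, other_level = "Others")
as.character(fct)  # "a" for the first three rows, "Others" for the rest
```

If the levels to keep are known in advance, forcats::fct_other(var1, keep = "a", other_level = "Others") should give the same grouping without any threshold.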

Edit

There is also an issue on GitHub about this: the man page says that levels are lumped if they appear "fewer than" prop times, but they are also lumped if they appear "exactly" prop times.
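A small base R illustration of that boundary, assuming the prop_n <= prop comparison from the source line quoted earlier (this only compares the two readings, it does not call forcats):

```r
f <- factor(c("a", "a", "b", "c"))
prop <- 0.25

# Unweighted share of each level: a = 0.5, b = 0.25, c = 0.25
share <- table(f) / length(f)

# Comparison used in the quoted source line
at_or_below <- names(share)[share <= prop]
# What the "fewer than" wording would suggest
strictly_below <- names(share)[share < prop]

at_or_below     # "b" and "c" sit exactly at the threshold and qualify
strictly_below  # empty: no level is strictly below 0.25
```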
