I have a data.frame containing a factor column. I want to (a) drop from the data.frame any rows where the value in that column does not appear in at least 8 rows and (b) drop those levels from the factor.
In the case below, that would be the levels C, D, and G.
> table(x.train$oilType)
A B C D E F G
30 21 3 6 9 8 2
From what I can tell, droplevels() only removes levels that are not used at all. I gave this a shot with no success:
> droplevels(x.train$oilType[-c(C,D,G)])
Error in NextMethod("[") : object 'G' not found
Any guidance?
CodePudding user response:
You can use add_count() to get the counts for each value of the factor, then filter() to keep rows where the count is >= 8. You can then drop the unused levels with droplevels() inside mutate().
library(dplyr)
# Example factor
df <- data.frame(fac = as.factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))
df$fac %>% table()
#> .
#> a b c
#> 3 8 9
# Keep only rows where the value of `fac` for that row is observed in at least
# 8 rows and drop unused levels
result <- df %>%
  add_count(fac) %>%
  filter(n >= 8) %>%
  mutate(fac = droplevels(fac))
print(result)
#> fac n
#> 1 b 8
#> 2 b 8
#> 3 b 8
#> 4 b 8
#> 5 b 8
#> 6 b 8
#> 7 b 8
#> 8 b 8
#> 9 c 9
#> 10 c 9
#> 11 c 9
#> 12 c 9
#> 13 c 9
#> 14 c 9
#> 15 c 9
#> 16 c 9
#> 17 c 9
levels(result$fac)
#> [1] "b" "c"
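
If you would rather not load dplyr, the same idea can be sketched in base R. This is only a minimal sketch: x.train and oilType are the names from the question, but the rows are synthetic (rebuilt from the counts shown in the table() output), and counts, level_counts, keep and x.small are hypothetical helper names. It also avoids the problem in the original attempt, where the unquoted C, D and G are looked up as objects and negative indexing does not accept level names anyway.

```r
# Rebuild a data frame with the per-level counts from the question.
# Assumption: only the counts are known, so the rows are synthetic.
counts <- c(A = 30, B = 21, C = 3, D = 6, E = 9, F = 8, G = 2)
x.train <- data.frame(oilType = factor(rep(names(counts), times = counts)))

# Count rows per level, then keep the names of levels seen in at least 8 rows
level_counts <- table(x.train$oilType)
keep <- names(level_counts)[level_counts >= 8]

# Subset the rows first, then drop the levels that are now unused
x.small <- x.train[x.train$oilType %in% keep, , drop = FALSE]
x.small$oilType <- droplevels(x.small$oilType)

table(x.small$oilType)
levels(x.small$oilType)
```

The order matters: droplevels() only removes levels with no remaining rows, so the rows have to be filtered out before it is called. Calling droplevels(x.small) on the whole data frame should work as well if several factor columns need cleaning.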