R: drop factors with certain values

I have a data.frame containing a factor column. I want to (a) drop from the data.frame any rows where the value in that column does not appear in at least 8 rows and (b) drop those levels from the factor.

In the example below, that would be the levels C, D, and G.

> table(x.train$oilType)

 A  B  C  D  E  F  G 
30 21  3  6  9  8  2

From what I can tell, droplevels() only removes levels that are not used at all. I gave this a shot with no success.

> droplevels(x.train$oilType[-c(C,D,G)])
Error in NextMethod("[") : object 'G' not found

Any guidance?

CodePudding user response:

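You can use add_count() to attach the count of each factor value to every row, then filter() to keep only the rows where that count is >= 8. Finally, drop the now-unused levels with droplevels() inside mutate(). The helper column n stays in the result; remove it with select(-n) if you do not need it.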

library(dplyr)

# Example factor
df <- data.frame(fac = as.factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))
df$fac %>% table()
#> .
#> a b c 
#> 3 8 9

# Keep only rows where the value of `fac` for that row is observed in at least
# 8 rows and drop unused levels
result <- df %>%
  add_count(fac) %>%
  filter(n >= 8) %>%
  mutate(fac = droplevels(fac))

print(result)
#>    fac n
#> 1    b 8
#> 2    b 8
#> 3    b 8
#> 4    b 8
#> 5    b 8
#> 6    b 8
#> 7    b 8
#> 8    b 8
#> 9    c 9
#> 10   c 9
#> 11   c 9
#> 12   c 9
#> 13   c 9
#> 14   c 9
#> 15   c 9
#> 16   c 9
#> 17   c 9

levels(result$fac)
#> [1] "b" "c"
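
For comparison, here is a minimal base R sketch of the same idea, in case you would rather not load dplyr. The x.train data frame is a made-up stand-in built from the counts in the question, and counts, keep, and x.small are illustrative names, not anything from the original data. Note also that the attempt in the question fails because C, D, and G are unquoted, so R looks for objects with those names rather than factor values.

```r
# Hypothetical stand-in for x.train, rebuilt from the counts in the question
x.train <- data.frame(
  oilType = factor(rep(c("A", "B", "C", "D", "E", "F", "G"),
                       times = c(30, 21, 3, 6, 9, 8, 2)))
)

# Count rows per level, then collect the levels seen in at least 8 rows
counts <- table(x.train$oilType)
keep <- names(counts)[counts >= 8]

# Keep only rows whose value is in `keep`; drop = FALSE keeps a data.frame
x.small <- x.train[x.train$oilType %in% keep, , drop = FALSE]

# Remove the levels that no longer occur (C, D, and G)
x.small$oilType <- droplevels(x.small$oilType)

levels(x.small$oilType)
#> [1] "A" "B" "E" "F"
```

The subsetting step alone leaves C, D, and G in place as empty levels, which is why the droplevels() call afterwards is still needed.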