Conditionally drop rows if each unique value in a column appears less than n times

How would I go about dropping all rows whose value in number occurs in fewer than 5 rows? For example, the tibble above would become:

If I wanted to drop all rows whose value in number occurs in fewer than 4 rows, the tibble would become:

I've heard I could add a count variable holding the number of rows for each value in number and then filter on it, but I'm not sure how to code this.

CodePudding user response：

Perhaps using functions from the dplyr package:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- read.table(text = "    number
1        1
2        2
3        2
4        3
5        3
6        3
7        4
8        4
9        4
10       4
11       5
12       5
13       5
14       5
15       5", header = TRUE)

df %>%
  group_by(number) %>%
  filter(n() >= 5)
#> # A tibble: 5 × 1
#> # Groups:   number [1]
#>   number
#>    <int>
#> 1      5
#> 2      5
#> 3      5
#> 4      5
#> 5      5

If you want to drop all rows whose value in number occurs in fewer than 4 rows:

df %>%
  group_by(number) %>%
  filter(n() >= 4)
#> # A tibble: 9 × 1
#> # Groups:   number [2]
#>   number
#>    <int>
#> 1      4
#> 2      4
#> 3      4
#> 4      4
#> 5      5
#> 6      5
#> 7      5
#> 8      5
#> 9      5

Created on 2022-10-17 by the reprex package (v2.0.1)
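
Since the two snippets above differ only in the threshold, the pattern can be wrapped in a small function. This is a minimal sketch and not part of dplyr: keep_frequent() and its min_rows argument are names made up for illustration, and df is rebuilt with rep() so the block runs on its own.

```r
library(dplyr)

# Same data as above: the value k appears k times (k = 1, ..., 5)
df <- data.frame(number = rep(1:5, times = 1:5))

# Hypothetical helper: keep rows whose value in `col` occurs
# in at least `min_rows` rows
keep_frequent <- function(data, col, min_rows) {
  data %>%
    group_by({{ col }}) %>%
    filter(n() >= min_rows) %>%
    ungroup()  # drop the grouping so later steps are not affected by it
}

res5 <- keep_frequent(df, number, 5)  # keeps only the rows with number == 5
res4 <- keep_frequent(df, number, 4)  # keeps the rows with number == 4 or 5
```

One caveat with this sketch: if the data had a column that is itself called min_rows, filter() would pick up that column instead of the argument, so a different argument name would be needed in that case.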

CodePudding user response：

Group by the column of interest, add a column holding the number of rows per group, and then filter on that count to keep only the rows you want:

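library(dplyr)

df2 <- df %>%
  group_by(number) %>%
  mutate(groupCount = n()) %>%  # number of rows sharing this value of number
  filter(groupCount > 4) %>%    # keep values that occur at least 5 times
  ungroup()                     # remove the grouping from the result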
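
A variation that may be handy if you want neither a grouped result nor the extra column: dplyr's add_count() adds the per-value count without grouping the data. This is only a sketch, assuming the same df as in the first answer (rebuilt here with rep() so the block is self-contained); groupCount is reused as the name of the helper column.

```r
library(dplyr)

# Same data as in the first answer: the value k appears k times
df <- data.frame(number = rep(1:5, times = 1:5))

df2 <- df %>%
  add_count(number, name = "groupCount") %>%  # rows per value of number
  filter(groupCount >= 5) %>%                 # keep values with at least 5 rows
  select(-groupCount)                         # drop the helper column again
```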

CodePudding user response：

x <- rep(1:5, 1:5)

# for each element, count how often its value occurs in x
fltr <- ave(x, x, FUN = length)

x[fltr >= 5]
#> [1] 5 5 5 5 5

Created on 2022-10-17 with reprex v2.0.2
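
If you would rather stay in base R and work on the data frame itself, the per-value counts can be taken from table() and used to subset the rows. A rough sketch, assuming a data frame df with a single column number as in the question (rebuilt here with rep()); counts, keep and res are just illustrative names.

```r
# Same shape of data as in the question: the value k appears k times
df <- data.frame(number = rep(1:5, times = 1:5))

# table() counts how often each value of number occurs
counts <- table(df$number)

# names of the values that occur in at least 5 rows (as character strings)
keep <- names(counts)[counts >= 5]

# compare as character so the values match the names of the table;
# drop = FALSE keeps the result a data frame even with a single column
res <- df[as.character(df$number) %in% keep, , drop = FALSE]
```

Changing the 5 to a 4 keeps the rows for both 4 and 5, matching the second example in the question.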