Conditionally drop rows if each unique value in a column appears less than n times

Time:10-17

    number
1        1
2        2
3        2
4        3
5        3
6        3
7        4
8        4
9        4
10       4
11       5
12       5
13       5
14       5
15       5

How would I go about dropping all rows whose value in number appears in fewer than 5 rows? For example, the tibble above would become:

    number
1        5
2        5
3        5
4        5
5        5

If I wanted to drop all rows whose value in number appears in fewer than 4 rows, the tibble would become:

    number
1        4
2        4
3        4
4        4
5        5
6        5
7        5
8        5
9        5

I've heard I could add a count variable holding the number of rows for each value in number and then filter on it, but I'm not sure how to code this.

CodePudding user response:

Perhaps using functions from the dplyr package:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- read.table(text = "    number
1        1
2        2
3        2
4        3
5        3
6        3
7        4
8        4
9        4
10       4
11       5
12       5
13       5
14       5
15       5", header = TRUE)

df %>%
  group_by(number) %>%
  filter(n() >= 5)
#> # A tibble: 5 × 1
#> # Groups:   number [1]
#>   number
#>    <int>
#> 1      5
#> 2      5
#> 3      5
#> 4      5
#> 5      5

If you want to drop all rows whose value in number appears in fewer than 4 rows:

df %>%
  group_by(number) %>%
  filter(n() >= 4)
#> # A tibble: 9 × 1
#> # Groups:   number [2]
#>   number
#>    <int>
#> 1      4
#> 2      4
#> 3      4
#> 4      4
#> 5      5
#> 6      5
#> 7      5
#> 8      5
#> 9      5

Created on 2022-10-17 by the reprex package (v2.0.1)
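
If the threshold changes often, the same group_by() and filter() pattern can be wrapped in a small function. This is only a sketch: keep_frequent, min_rows and demo are hypothetical names introduced here for illustration, and demo rebuilds the example data so the snippet runs on its own.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): keep only the rows whose value
# in `col` occurs at least `min_rows` times in `data`.
keep_frequent <- function(data, col, min_rows) {
  data %>%
    group_by({{ col }}) %>%        # one group per unique value
    filter(n() >= min_rows) %>%    # n() is the number of rows in the group
    ungroup()                      # drop the grouping before returning
}

# Same values as the example data: 1 once, 2 twice, ..., 5 five times
demo <- data.frame(number = rep(1:5, times = 1:5))

at_least_5 <- keep_frequent(demo, number, 5)
at_least_4 <- keep_frequent(demo, number, 4)
```

The argument is called min_rows rather than n so that it does not get confused with dplyr's n() inside filter().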

CodePudding user response:

Group by the column, add a column holding the number of rows in each group, and then keep only the rows whose group count meets the threshold (here more than 4, i.e. at least 5):

library(dplyr)
df2 <- df %>%  
    group_by(number) %>% 
    mutate(groupCount = n()) %>%
    filter(groupCount > 4)
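
The same count-then-filter idea can probably be written without dplyr as well. The following is a minimal base R sketch, assuming the data has a single column called number as in the question; demo, min_rows, counts and row_counts are names made up for this example.

```r
# Same values as the example data: 1 once, 2 twice, ..., 5 five times
demo <- data.frame(number = rep(1:5, times = 1:5))
min_rows <- 4

# Count how many rows each unique value has
counts <- table(demo$number)

# Look up, for every row, the count of that row's value
row_counts <- as.vector(counts[as.character(demo$number)])

# Keep only the rows whose value occurs at least min_rows times
result <- demo[row_counts >= min_rows, , drop = FALSE]
```

Here drop = FALSE keeps the result a data frame even though there is only one column.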

CodePudding user response:

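x <- rep(1:5, 1:5)

# rleid() returns a run ID per element, not a run length, so count the
# elements in each run before comparing against the threshold
# (runs are consecutive, so this assumes x is sorted)
run_len <- ave(x, data.table::rleid(x), FUN = length)

x[run_len >= 5]
#> [1] 5 5 5 5 5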

Created on 2022-10-17 with reprex v2.0.2
