Home > OS >  Filter R dataframe to n most frequent cases and order by frequency
Filter R dataframe to n most frequent cases and order by frequency

Time:02-14

In a dataframe with repeat values, I want the rows for the most frequent n cases, say the two most frequently occurring. The code below does this, selecting rows where x==3 or x==4 and return the rows in that order.

I don't want to have to use the value 5; however, I want some way of programmatically stating the top 2 most frequent x values, without knowing a threshold (5 in this example) apriori. In addition, I would like to order the resulting dataframe by frequency of occurrence, so x==4 rows come before x==3 rows.

I am presuming it is related to count, top_n or slice_max and arrange but maybe not!

Any hints on how to do this with dplyr would be greatly appreciated.

require(tidyverse)
adf <- data.frame(x=c(rep(2,5),rep(3,7), rep(4,9), 1), y=1:22)
adf %>% group_by(x) %>% filter(n() > 5)

CodePudding user response:

We can do this by calculating the size of each group using n() and then filtering on that. If you want them in order, can also use dplyr::arrange().

library(dplyr)

adf %>%
  group_by(x) %>%
  mutate(size = n()) %>%
  ungroup() %>%
  filter(size >= sort(unique(size), TRUE)[2]) %>%
  arrange(desc(size))
#> # A tibble: 16 × 3
#>        x     y  size
#>    <dbl> <int> <int>
#>  1     4    13     9
#>  2     4    14     9
#>  3     4    15     9
#>  4     4    16     9
#>  5     4    17     9
#>  6     4    18     9
#>  7     4    19     9
#>  8     4    20     9
#>  9     4    21     9
#> 10     3     6     7
#> 11     3     7     7
#> 12     3     8     7
#> 13     3     9     7
#> 14     3    10     7
#> 15     3    11     7
#> 16     3    12     7

CodePudding user response:

Here is another tidyverse option, where I use add_count to get the number of observations per group, then I filter the 2 groups with the highest number of observations. Then, I arrange by the frequency of occurence.

library(tidyverse)

adf %>%
  add_count(x) %>% 
  filter(n %in% tail(sort(unique(n)),2)) %>% 
  arrange(desc(n))

Output

   x  y n
1  4 13 9
2  4 14 9
3  4 15 9
4  4 16 9
5  4 17 9
6  4 18 9
7  4 19 9
8  4 20 9
9  4 21 9
10 3  6 7
11 3  7 7
12 3  8 7
13 3  9 7
14 3 10 7
15 3 11 7
16 3 12 7

Data

adf <- structure(list(x = c(2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 1), y = 1:22), class = "data.frame", row.names = c(NA, 
-22L))
  • Related