Filter R dataframe to n most frequent cases and order by frequency-CodePudding

In a dataframe with repeat values, I want the rows for the most frequent n cases, say the two most frequently occurring. The code below does this, selecting rows where x==3 or x==4 and return the rows in that order.

I don't want to have to use the value 5; however, I want some way of programmatically stating the top 2 most frequent x values, without knowing a threshold (5 in this example) apriori. In addition, I would like to order the resulting dataframe by frequency of occurrence, so x==4 rows come before x==3 rows.

I am presuming it is related to count, top_n or slice_max and arrange but maybe not!

Any hints on how to do this with dplyr would be greatly appreciated.

require(tidyverse)
adf <- data.frame(x=c(rep(2,5),rep(3,7), rep(4,9), 1), y=1:22)
adf %>% group_by(x) %>% filter(n() > 5)

CodePudding user response：

We can do this by calculating the size of each group using n() and then filtering on that. If you want them in order, can also use dplyr::arrange().

library(dplyr)

adf %>%
  group_by(x) %>%
  mutate(size = n()) %>%
  ungroup() %>%
  filter(size >= sort(unique(size), TRUE)[2]) %>%
  arrange(desc(size))
#> # A tibble: 16 × 3
#>        x     y  size
#>    <dbl> <int> <int>
#>  1     4    13     9
#>  2     4    14     9
#>  3     4    15     9
#>  4     4    16     9
#>  5     4    17     9
#>  6     4    18     9
#>  7     4    19     9
#>  8     4    20     9
#>  9     4    21     9
#> 10     3     6     7
#> 11     3     7     7
#> 12     3     8     7
#> 13     3     9     7
#> 14     3    10     7
#> 15     3    11     7
#> 16     3    12     7

CodePudding user response：

Here is another tidyverse option, where I use add_count to get the number of observations per group, then I filter the 2 groups with the highest number of observations. Then, I arrange by the frequency of occurence.

library(tidyverse)

adf %>%
  add_count(x) %>% 
  filter(n %in% tail(sort(unique(n)),2)) %>% 
  arrange(desc(n))

Output

Data

adf <- structure(list(x = c(2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 1), y = 1:22), class = "data.frame", row.names = c(NA, 
-22L))