In a dataframe with repeat values, I want the rows for the most frequent n
cases, say the two most frequently occurring. The code below does this, selecting rows where x==3
or x==4
and return the rows in that order.
I don't want to have to use the value 5; however, I want some way of programmatically stating the top 2 most frequent x values, without knowing a threshold (5 in this example) apriori. In addition, I would like to order the resulting dataframe by frequency of occurrence, so x==4
rows come before x==3
rows.
I am presuming it is related to count
, top_n
or slice_max
and arrange
but maybe not!
Any hints on how to do this with dplyr
would be greatly appreciated.
require(tidyverse)
adf <- data.frame(x=c(rep(2,5),rep(3,7), rep(4,9), 1), y=1:22)
adf %>% group_by(x) %>% filter(n() > 5)
CodePudding user response:
We can do this by calculating the size of each group using n()
and then filtering on that. If you want them in order, can also use dplyr::arrange()
.
library(dplyr)
adf %>%
group_by(x) %>%
mutate(size = n()) %>%
ungroup() %>%
filter(size >= sort(unique(size), TRUE)[2]) %>%
arrange(desc(size))
#> # A tibble: 16 × 3
#> x y size
#> <dbl> <int> <int>
#> 1 4 13 9
#> 2 4 14 9
#> 3 4 15 9
#> 4 4 16 9
#> 5 4 17 9
#> 6 4 18 9
#> 7 4 19 9
#> 8 4 20 9
#> 9 4 21 9
#> 10 3 6 7
#> 11 3 7 7
#> 12 3 8 7
#> 13 3 9 7
#> 14 3 10 7
#> 15 3 11 7
#> 16 3 12 7
CodePudding user response:
Here is another tidyverse
option, where I use add_count
to get the number of observations per group, then I filter
the 2 groups with the highest number of observations. Then, I arrange
by the frequency of occurence.
library(tidyverse)
adf %>%
add_count(x) %>%
filter(n %in% tail(sort(unique(n)),2)) %>%
arrange(desc(n))
Output
x y n
1 4 13 9
2 4 14 9
3 4 15 9
4 4 16 9
5 4 17 9
6 4 18 9
7 4 19 9
8 4 20 9
9 4 21 9
10 3 6 7
11 3 7 7
12 3 8 7
13 3 9 7
14 3 10 7
15 3 11 7
16 3 12 7
Data
adf <- structure(list(x = c(2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4,
4, 4, 4, 4, 4, 4, 4, 1), y = 1:22), class = "data.frame", row.names = c(NA,
-22L))