Home > Software engineering >  Filter groups based on difference two highest values
Filter groups based on difference two highest values

Time:01-16

I have the following dataframe called df (dput below):

> df
   group value
1      A     5
2      A     1
3      A     1
4      A     5
5      B     8
6      B     2
7      B     2
8      B     3
9      C    10
10     C     1
11     C     1
12     C     8

I would like to filter groups based on the difference between their highest value (max) and second highest value. The difference should be smaller equal than 2 (<=2), this means that group B should be removed because the highest value is 8 and the second highest value is 3 which is a difference of 5. The desired output should look like this:

  group value
1     A     5
2     A     1
3     A     1
4     A     5
5     C    10
6     C     1
7     C     1
8     C     8

So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?


dput of df:

df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B", 
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1, 
1, 8)), class = "data.frame", row.names = c(NA, -12L))

CodePudding user response:

Using dplyr

library(dplyr)

df %>% 
  group_by(group) %>% 
  filter(abs(diff(sort(value, decreasing=T)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
  group value
  <chr> <int>
1 A         5
2 A         1
3 A         1
4 A         5
5 C        10
6 C         1
7 C         1
8 C         8

CodePudding user response:

I possibility would be to first create a vector with the groups that achieve your condition and then filter in the original data.frame. Here how I thought:

library(dplyr)

group_to_keep <-
  df %>% 
  group_by(group) %>% 
  slice_max(n = 2,value) %>% 
  filter(abs(diff(value)) <= 2) %>% 
  pull(group) %>% 
  unique()

df %>% 
  filter(group %in% group_to_keep)
  • Related