Home > front end >  R dplyr filter data based on values in other rows
R dplyr filter data based on values in other rows

Time:09-23

I am trying to filter a data frame using dplyr and I can't really think of a way to achieve what I want. I have a data frame of the following form:

A    B    C
-----------
1    2    5
1    4    6
2    2    7
2    4    6

Each value in column A appears exactly 2 times. Column B has exactly 2 different values, each appearing exactly once for each value of A. Column C can have any positive values. I want to keep all rows where for one value of A, the row with the bigger B value has a smaller C value than the row with the smaller B value. In the example above, this would result in:

A    B    C
-----------
2    2    7
2    4    6

Is there a way to achieve this using dplyr?

CodePudding user response:

1) Sort by A and B to ensure that the larger B is always the second within A and then grouping by A use a filter based on diff(C) < 0.

library(dplyr)

DF %>%
  arrange(A, B) %>%
  group_by(A) %>%
  filter((diff(C) < 0)) %>%
  ungroup
## # A tibble: 2 × 3
##       A     B     C
##   <int> <int> <int>
## 1     2     2     7
## 2     2     4     6

2) Another possibility is to ensure that the maximum of B is on the same row as the minimum of C. This would also work with non-numeric data.

See comments below this answer for another idea along these lines.

DF %>%
  group_by(A) %>%
  filter(which.max(B) == which.min(C)) %>%
  ungroup

3) If the slope of B with respect to C is negative then keep the group.

DF %>%
  group_by(A) %>%
  filter(coef(lm(B ~ C))[[2]] < 0) %>%
  ungroup

or we can calculate the slope ourselves:

DF %>%
  group_by(A) %>%
  filter(diff(C) / diff(B) < 0) %>%
  ungroup

Note

Lines <- "A    B    C
1    2    5
1    4    6
2    2    7
2    4    6"

DF <- read.table(text = Lines, header = TRUE)
  • Related