Home > Software design >  Filtering by Conditional maximum in R/dplyr
Filtering by Conditional maximum in R/dplyr

Time:09-29

I have data grouped by test and by a subgroup within each test. Then I collect a series of proportions, each with a p-value to determine its significance. So the basic structure of my data is:

testid <- paste0("T", c(rep(1, 5), rep(2, 5)))
groupid <- rep(1:5, 2)
prop <- c(0.98, 0.76, 0.54, 0.32, 0.1, 0.79, 0.32, 0.09, 0.04, 0.01)
pval <- c(0.01, 0.02, 0.03, 0.04, 0.1, 0.03, 0.06, 0.25, 0.35, 0.45)

library(dplyr)
df <- tibble(testid, groupid, prop, pval)

df

# A tibble: 10 x 4
   testid groupid  prop  pval
   <chr>    <int> <dbl> <dbl>
 1 T1           1  0.98  0.01
 2 T1           2  0.76  0.02
 3 T1           3  0.54  0.03
 4 T1           4  0.32  0.04
 5 T1           5  0.1   0.1 
 6 T2           1  0.79  0.03
 7 T2           2  0.32  0.06
 8 T2           3  0.09  0.25
 9 T2           4  0.04  0.35
10 T2           5  0.01  0.45

What I need is to look at each test, and collect the results from the highest group number which has p below 0.05. So the desired output is:

desired <- df %>% 
  filter((testid == "T1" & groupid == 4) | (testid == "T2" & groupid == 1))

desired

# A tibble: 2 x 4
  testid groupid  prop  pval
  <chr>    <int> <dbl> <dbl>
1 T1           4  0.32  0.04
2 T2           1  0.79  0.03

Looking at similar questions on Stack Overflow, I tried this code...

x <- df %>%
  group_by(testid) %>%
  filter(row_number() <= max(pval < 0.5)[1])

...but this is not quite right, it returned group 1 for both tests.

Help? (Note that, due to prior filtering I have done, there should be at least one group number with p < 0.05 for each test).

(Also, dplyr is my preferred method, but not required.)

CodePudding user response:

We may subset the 'groupid' using a logical vector with 'pval', and use %in% to construct a logical vector from the max value of 'groupid'

library(dplyr)
df %>% 
   group_by(testid) %>% 
   filter(groupid %in% max(groupid[pval < 0.05])) %>%
   ungroup

-output

# A tibble: 2 × 4
  testid groupid  prop  pval
  <chr>    <int> <dbl> <dbl>
1 T1           4  0.32  0.04
2 T2           1  0.79  0.03

Or another option is to arrange the columns, then do a group_by with slice_head/slice_tail

library(dplyr)
df %>% 
    filter(pval < 0.05) %>% 
    arrange(testid, groupid, desc(pval)) %>%
    group_by(testid) %>% 
    slice_tail(n = 1) %>%
    ungroup

-output

# A tibble: 2 × 4
  testid groupid  prop  pval
  <chr>    <int> <dbl> <dbl>
1 T1           4  0.32  0.04
2 T2           1  0.79  0.03

CodePudding user response:

Another approach using which.max

library(dplyr)
df %>% 
  group_by(testid) %>% 
  filter(pval<0.05) %>% 
  slice(which.max(groupid))
  testid groupid  prop  pval
  <chr>    <int> <dbl> <dbl>
1 T1           4  0.32  0.04
2 T2           1  0.79  0.03

CodePudding user response:

This should do the trick:

x <- df %>%
  group_by(testid) %>%
  filter(pval <0.05) %>% 
  slice(n())

This is assuming that the data is nominally ordered; if not, you might need to arrange!

Output for completeness:

  testid groupid  prop  pval
  <chr>    <int> <dbl> <dbl>
1 T1           4  0.32  0.04
2 T2           1  0.79  0.03
  • Related