I have data grouped by test and by a subgroup within each test. Then I collect a series of proportions, each with a p-value to determine its significance. So the basic structure of my data is:
testid <- paste0("T", c(rep(1, 5), rep(2, 5)))
groupid <- rep(1:5, 2)
prop <- c(0.98, 0.76, 0.54, 0.32, 0.1, 0.79, 0.32, 0.09, 0.04, 0.01)
pval <- c(0.01, 0.02, 0.03, 0.04, 0.1, 0.03, 0.06, 0.25, 0.35, 0.45)
library(dplyr)
df <- tibble(testid, groupid, prop, pval)
df
# A tibble: 10 x 4
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 1 0.98 0.01
2 T1 2 0.76 0.02
3 T1 3 0.54 0.03
4 T1 4 0.32 0.04
5 T1 5 0.1 0.1
6 T2 1 0.79 0.03
7 T2 2 0.32 0.06
8 T2 3 0.09 0.25
9 T2 4 0.04 0.35
10 T2 5 0.01 0.45
What I need is to look at each test, and collect the results from the highest group number which has p below 0.05. So the desired output is:
desired <- df %>%
filter((testid == "T1" & groupid == 4) | (testid == "T2" & groupid == 1))
desired
# A tibble: 2 x 4
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 4 0.32 0.04
2 T2 1 0.79 0.03
Looking at similar questions on Stack Overflow, I tried this code...
x <- df %>%
group_by(testid) %>%
filter(row_number() <= max(pval < 0.5)[1])
...but this is not quite right, it returned group 1 for both tests.
Help? (Note that, due to prior filtering I have done, there should be at least one group number with p < 0.05 for each test).
(Also, dplyr is my preferred method, but not required.)
CodePudding user response:
We may subset the 'groupid' using a logical vector with 'pval', and use %in%
to construct a logical vector from the max
value of 'groupid'
library(dplyr)
df %>%
group_by(testid) %>%
filter(groupid %in% max(groupid[pval < 0.05])) %>%
ungroup
-output
# A tibble: 2 × 4
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 4 0.32 0.04
2 T2 1 0.79 0.03
Or another option is to arrange
the columns, then do a group_by
with slice_head/slice_tail
library(dplyr)
df %>%
filter(pval < 0.05) %>%
arrange(testid, groupid, desc(pval)) %>%
group_by(testid) %>%
slice_tail(n = 1) %>%
ungroup
-output
# A tibble: 2 × 4
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 4 0.32 0.04
2 T2 1 0.79 0.03
CodePudding user response:
Another approach using which.max
library(dplyr)
df %>%
group_by(testid) %>%
filter(pval<0.05) %>%
slice(which.max(groupid))
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 4 0.32 0.04
2 T2 1 0.79 0.03
CodePudding user response:
This should do the trick:
x <- df %>%
group_by(testid) %>%
filter(pval <0.05) %>%
slice(n())
This is assuming that the data is nominally ordered; if not, you might need to arrange!
Output for completeness:
testid groupid prop pval
<chr> <int> <dbl> <dbl>
1 T1 4 0.32 0.04
2 T2 1 0.79 0.03