How to filter in R only if a certain condition is valid?

I have a dataset with groups ("A", "B", "C", and "A & B") at two time points ("Before" and "After"). I only want to keep "A & B" if any of the sample sizes for A or B, at either time point, falls below 10 people. Otherwise, I want to drop the "A & B" group. How do I tell R to drop this group only when no A or B sample size is below 10?

Here are two sample datasets: one where the "A & B" group should be retained and one where it should be filtered out:

library(dplyr)

# This dataset should not have anything filtered out

should_not_drop_group <- tibble(group = rep(c("A", "B", "C", "A & B"), 2),
                                time = c(rep(c("Before"), 4), rep(c("After"), 4)),
                                sample_size = c(5, 100, 132, 105, 250, 50, 224, 300))


# This dataset should drop group "A & B"

should_drop_group <- tibble(group = rep(c("A", "B", "C", "A & B"), 2),
                            time = c(rep(c("Before"), 4), rep(c("After"), 4)),
                            sample_size = c(500, 100, 132, 600, 250, 50, 224, 300))

And here's what I tried, to no avail:

library(dplyr)

should_drop_group %>%
  filter_if(~any(sample_size[group  %in% c("A", "B")] < 10), group != "A & B" )

CodePudding user response：

One option is to build the condition inside filter() in steps. Subset group to the rows where sample_size is less than 10, check whether 'A' or 'B' appears in that subset, and negate (!) the result. Combine that with a second expression, group == "A & B", using &, so the combined expression is TRUE only for the rows that should be dropped. Finally, negate (!) the whole expression so that filter() removes exactly those rows.

library(dplyr)
should_not_drop_group %>% 
   filter(!(!any(c("A", "B") %in% group[sample_size < 10]) & group == "A & B"))
   # or can be written as
   #filter(!(!any(group %in% c("A", "B") & sample_size < 10) & group == "A & B"))

-output

# A tibble: 8 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before           5
2 B     Before         100
3 C     Before         132
4 A & B Before         105
5 A     After          250
6 B     After           50
7 C     After          224
8 A & B After          300

and for the second dataset

should_drop_group %>% 
    filter(!(!any(c("A", "B") %in% group[sample_size < 10]) & group == "A & B"))
# A tibble: 6 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before         500
2 B     Before         100
3 C     Before         132
4 A     After          250
5 B     After           50
6 C     After          224
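
The double negation can be easier to read when unpacked. By De Morgan's law, !(!p & q) is the same as p | !q, so the filter keeps every row when some A or B sample size is below 10, and otherwise keeps only the rows that are not "A & B". A minimal sketch of that equivalent form, with the sample data from the question repeated so the block runs on its own:

```r
library(dplyr)

# Sample data repeated from the question
should_drop_group <- tibble(
  group = rep(c("A", "B", "C", "A & B"), 2),
  time = rep(c("Before", "After"), each = 4),
  sample_size = c(500, 100, 132, 600, 250, 50, 224, 300)
)

# any() returns a single TRUE/FALSE that is recycled across all rows:
# TRUE keeps every row, FALSE keeps only the rows that are not "A & B"
should_drop_group %>%
  filter(any(sample_size[group %in% c("A", "B")] < 10) | group != "A & B")
```

On should_drop_group this returns the same six rows as the negated version above, since no A or B sample size is below 10.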

To reuse this on several datasets, wrap the condition in a function:

> f1 <- function(x, sample_size) 
   !(!any(c("A", "B") %in% x[sample_size < 10]) & x == "A & B")
> should_not_drop_group %>% 
   filter(if_any(group, f1, sample_size = sample_size))
# A tibble: 8 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before           5
2 B     Before         100
3 C     Before         132
4 A & B Before         105
5 A     After          250
6 B     After           50
7 C     After          224
8 A & B After          300
> should_drop_group %>% 
   filter(if_any(group, f1, sample_size = sample_size))
# A tibble: 6 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before         500
2 B     Before         100
3 C     Before         132
4 A     After          250
5 B     After           50
6 C     After          224
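
Another way to package the same logic is a plain function that takes the whole data frame, which avoids passing extra arguments through if_any(). This is only a sketch: it assumes the columns are named group and sample_size as in the question, and the function name keep_combined_if_small and its threshold argument are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper: keeps "A & B" only when some A or B
# sample size is below the threshold
keep_combined_if_small <- function(data, threshold = 10) {
  parts <- data$sample_size[data$group %in% c("A", "B")]
  if (any(parts < threshold)) {
    data                             # a component group is small: keep all rows
  } else {
    filter(data, group != "A & B")   # otherwise drop the combined group
  }
}

# Sample data repeated from the question
should_not_drop_group <- tibble(
  group = rep(c("A", "B", "C", "A & B"), 2),
  time = rep(c("Before", "After"), each = 4),
  sample_size = c(5, 100, 132, 105, 250, 50, 224, 300)
)
should_drop_group <- tibble(
  group = rep(c("A", "B", "C", "A & B"), 2),
  time = rep(c("Before", "After"), each = 4),
  sample_size = c(500, 100, 132, 600, 250, 50, 224, 300)
)

keep_combined_if_small(should_not_drop_group)  # keeps all 8 rows
keep_combined_if_small(should_drop_group)      # drops the two "A & B" rows
```

Because the decision is made once per dataset, an ordinary if/else on a single TRUE/FALSE value is enough here, and no row-wise condition is needed.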

CodePudding user response：

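Here is a solution with an ifelse statement and a helper column x. The helper flags the "A & B" rows only when no sample size for group A or B is below 10, and the flagged rows are then dropped:

library(dplyr)

should_drop_group %>%
#should_not_drop_group %>%
  # x is 1 for "A & B" rows when neither A nor B has a sample size below 10
  mutate(x = ifelse(!any(sample_size[group %in% c("A", "B")] < 10) & group == "A & B", 1, 0)) %>% 
  filter(x != 1) %>% 
  select(-x)

for should_drop_group:

# A tibble: 6 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before         500
2 B     Before         100
3 C     Before         132
4 A     After          250
5 B     After           50
6 C     After          224

for should_not_drop_group:

# A tibble: 8 × 3
  group time   sample_size
  <chr> <chr>        <dbl>
1 A     Before           5
2 B     Before         100
3 C     Before         132
4 A & B Before         105
5 A     After          250
6 B     After           50
7 C     After          224
8 A & B After          300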