In R, how do I filter whole groups conditional on a value in the first row of the group?

Time:10-24

I have a data frame from which I want to remove whole groups if the top row of the group does not meet a particular condition in one column.

An example using the following dataset:

df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'), gameplayed=c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'))

I want to group these by 'team' first. Then, I want to remove the entire group if the first row contains a 'No' in the 'gameplayed' column.

This would be the desired output:

df2 <- data.frame(team=c('A', 'A', 'A', 'A', 'C', 'D', 'D', 'D'), gameplayed=c('Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes'))

I've played around with various options, such as the following, but can't get it to work for me:

> df %>% group_by(team) %>%   filter("Yes" == first(gameplayed))

CodePudding user response:

You can use ave like this:

df[ave(df$gameplayed != "No", df$team, FUN=\(x) x[1]),]
#df[ave(df$gameplayed, df$team, FUN=\(x) x[1]) != "No",] #Alternative
#   team gameplayed
#1     A        Yes
#2     A         No
#3     A        Yes
#4     A        Yes
#8     C        Yes
#9     D        Yes
#10    D         No
#11    D        Yes
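
To make the one-liner easier to follow, here is a minimal sketch that splits it into two steps. The intermediate names first_val, keep and result are only illustrative and not part of the original answer; function(x) is used instead of \(x) so that it also runs on R versions before 4.1.

```r
# Same data as in the question
df <- data.frame(
  team = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'),
  gameplayed = c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes')
)

# Step 1: for every row, look up the first 'gameplayed' value of its team.
# ave() applies FUN within each group and returns a vector as long as the
# input; the single value x[1] is recycled across the rows of the group.
first_val <- ave(df$gameplayed, df$team, FUN = function(x) x[1])

# Step 2: keep the rows whose team does not start with "No"
keep <- first_val != "No"
result <- df[keep, ]
print(result)
```

Because the filter is an ordinary logical row index, the original row names (1 to 4 and 8 to 11) are preserved in the result.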

Benchmark

library(tidyverse)
bench::mark(check=FALSE,
  GKi = df[ave(df$gameplayed != "No", df$team, FUN=\(x) x[1]),],
  Limney = df %>% group_by(team) %>% filter(first(gameplayed) == "Yes") %>% ungroup(),
  Elias = df %>% group_by(team) %>% mutate(id = row_number()) %>% filter(!any(gameplayed == "No" && id == 1)) # with warnings
)
#  expression      min  median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
#1 GKi         75.74µs 82.05µs  11936.  4.52KB    16.8  5685     8   476ms <NULL>
#2 Limney       4.52ms   4.6ms    214.  7.92KB    15.6    96     7   448ms <NULL>
#3 Elias        3.36ms  3.42ms    291.  9.53KB    13.0   134     6   461ms <NULL>

In this case, GKi is about 40 times faster than Elias and about 55 times faster than Limney (by median time), and it allocates less memory.

CodePudding user response:

You can first create a row id within each group and then filter on gameplayed and id. You have to use any() so that the condition applies to the whole group rather than to single rows.

library(tidyverse)
df <- df %>% group_by(team) %>% mutate(id = row_number())
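# Use the elementwise & here: && expects single values and is an error on longer vectors in R >= 4.3
df_final <- df %>% filter(!any(gameplayed == "No" & id == 1))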
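
To see what the per-group condition evaluates to, here is a small base R sketch for a single group, using team B from the question. The vectors are written out by hand and the names first_is_no and keep_group are only illustrative; inside filter(), dplyr evaluates the same kind of expression once per group.

```r
# Team B from the question: its first row is "No", so the group should be dropped
gameplayed <- c("No", "Yes", "No")
id <- seq_along(gameplayed)  # 1, 2, 3, as row_number() gives within a group

# Elementwise comparison: TRUE only where the row is the first one AND it is "No"
first_is_no <- gameplayed == "No" & id == 1
print(first_is_no)  # TRUE FALSE FALSE

# any() collapses this to a single value for the group; negating it gives
# the flag that decides whether the whole group is kept
keep_group <- !any(first_is_no)
print(keep_group)   # FALSE
```

Note that the third row is also "No" but does not count, because id == 1 is FALSE there.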

Or, as pointed out in the comments, you can skip creating an id and just use first() (df is still grouped by team here):

df_final <- df %>% filter(!any(first(gameplayed) == "No"))
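
For reference, here is a self-contained sketch of the first() variant that starts from the raw data and compares the result with the desired output from the question. It assumes dplyr is installed; the any() wrapper is left out because first() already returns one value per group.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  team = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'),
  gameplayed = c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes')
)

# Desired output from the question
df2 <- data.frame(
  team = c('A', 'A', 'A', 'A', 'C', 'D', 'D', 'D'),
  gameplayed = c('Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
)

df_final <- df %>%
  group_by(team) %>%                      # one group per team
  filter(first(gameplayed) != "No") %>%   # single TRUE/FALSE per group, recycled to its rows
  ungroup()

# Compare column by column with the desired output
stopifnot(identical(df_final$team, df2$team))
stopifnot(identical(df_final$gameplayed, df2$gameplayed))
```

Calling ungroup() at the end returns a plain tibble, so later steps are not silently evaluated per team.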