I have a data frame in which I want to filter out whole groups if the first row of the group does not meet a particular condition in one column.
An example using the following dataset:
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'), gameplayed=c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'))
I want to group these by 'team' first. Then, I want to remove the entire group if the first row contains a 'No' in the 'gameplayed' column.
This would be the desired output:
df2 <- data.frame(team=c('A', 'A', 'A', 'A', 'C', 'D', 'D', 'D'), gameplayed=c('Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes'))
I've played around with various options, such as the following, but can't get it to work for me:
> df %>% group_by(team) %>% filter("Yes" == first(gameplayed))
CodePudding user response:
You can use ave like this:
df[ave(df$gameplayed != "No", df$team, FUN=\(x) x[1]),]
#df[ave(df$gameplayed, df$team, FUN=\(x) x[1]) != "No",] #Alternative
#   team gameplayed
#1     A        Yes
#2     A         No
#3     A        Yes
#4     A        Yes
#8     C        Yes
#9     D        Yes
#10    D         No
#11    D        Yes
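To make the one-liner easier to follow, here is a sketch that splits it into steps. It uses the example data from the question; the intermediate names not_no, keep and result are only illustrative. It assumes nothing beyond base R.

```r
# Example data from the question
df <- data.frame(
  team = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'),
  gameplayed = c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
                 'Yes', 'No', 'Yes', 'No', 'Yes')
)

# Step 1: logical vector, TRUE where the row is not "No"
not_no <- df$gameplayed != "No"

# Step 2: within each team, replace every value with the value of the
# group's first row (the single value returned by FUN is recycled
# across the group)
keep <- ave(not_no, df$team, FUN = function(x) x[1])

# Step 3: keep only rows whose group started with something other than "No"
result <- df[keep, ]
```

Because keep is a logical vector with one entry per row, it can be used directly as a row index, and the original row names are preserved.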
Benchmark
library(tidyverse)
bench::mark(check=FALSE,
GKi = df[ave(df$gameplayed != "No", df$team, FUN=\(x) x[1]),],
Limney = df %>% group_by(team) %>% filter(first(gameplayed) == "Yes") %>% ungroup(),
Elias = df %>% group_by(team) %>% mutate(id = row_number()) %>% filter(!any(gameplayed == "No" && id == 1)) # with warnings
)
# expression min median itr/s…¹ mem_a…² gc/se…³ n_itr n_gc total…⁴ result
# <bch:expr> <bch:tm> <bch:t> <dbl> <bch:b> <dbl> <int> <dbl> <bch:t> <list>
#1 GKi 75.74µs 82.05µs 11936. 4.52KB 16.8 5685 8 476ms <NULL>
#2 Limney 4.52ms 4.6ms 214. 7.92KB 15.6 96 7 448ms <NULL>
#3 Elias 3.36ms 3.42ms 291. 9.53KB 13.0 134 6 461ms <NULL>
In this case, GKi is about 40 times faster than Elias and more than 50 times faster than Limney (by median time), and it allocates less memory.
CodePudding user response:
You can first create a row id within each group and then filter on gameplayed and id. You have to use any() to filter out the whole group.
library(tidyverse)
df <- df %>% group_by(team) %>% mutate(id = row_number())
# Use the vectorized & here: && expects length-one operands
df_final <- df %>% filter(!any(gameplayed == "No" & id == 1))
Or, as pointed out in the comments, you do not have to create an id and can just use first():
df_final <- df %>% filter(!any(first(gameplayed) == "No"))
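Note that the line above relies on df already being grouped by the earlier group_by() call. As a self-contained sketch, assuming the dplyr package is installed and starting again from the question's ungrouped data, the whole pipeline could look like this:

```r
library(dplyr)

# Example data from the question (ungrouped)
df <- data.frame(
  team = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'D', 'E', 'E'),
  gameplayed = c('Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
                 'Yes', 'No', 'Yes', 'No', 'Yes')
)

df_final <- df %>%
  group_by(team) %>%                    # evaluate the condition per team
  filter(first(gameplayed) != "No") %>% # one TRUE/FALSE per group, recycled to all its rows
  ungroup()                             # drop the grouping from the result
```

Inside a grouped filter(), a length-one condition applies to every row of the group, so any() is not strictly needed here.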