Does anyone have experience filtering AFTER grouped pandas data frames by all rows? Let me explain.
This is the reproducible data:
name users_rated average bayes_average maxplayers
<chr> <dbl> <dbl> <dbl> <dbl>
1 Pandemic 108975 7.59 7.49 4
2 Carcassonne 108738 7.42 7.31 5
3 Catan 108024 7.14 6.97 4
4 7 Wonders 89982 7.74 7.63 7
5 Dominion 81561 7.61 7.50 4
6 Ticket to Ride 76171 7.41 7.30 5
7 Codenames 74419 7.6 7.51 8
8 Terraforming M… 74216 8.42 8.27 5
9 7 Wonders Duel 69472 8.11 7.98 2
10 Agricola 66093 7.93 7.81 5
For the question I'm trying to solve, in R, I can do it in this way:
df |>
select(
name,
users_rated,
average,
bayes_average,
maxplayers
) |>
group_by(maxplayers) |>
filter(average == min(average) | average == max(average)) |> # What I'm struggling right now in Python
filter(maxplayers <= 4 & maxplayers != 0) |>
arrange(maxplayers, average) |>
ungroup()
What this code is doing is grouping by maxplayers
, and then filtering BY GROUPS (For example, if there are three games with maxplayers
= 99, then those three games will be applied to the filter)
This is an example output (NOTE: is not the same as the reproducible data, but it helps what I'm trying to do)
name users_rated average bayes_average maxplayers
<chr> <dbl> <dbl> <dbl> <dbl>
1 Solitaire 1014 4.4 5.07 1
2 Five Parsecs Fr… 65 8.89 5.59 1
3 W.W.B 40 1.44 5.41 2
4 System Gateway … 47 9.4 5.6 2
5 Exploration: Wa… 47 3.55 5.44 3
6 Old School Tact… 151 8.54 5.69 3
7 Oneupmanship: M… 75 1.04 5.32 4
8 TerroriXico 70 9.43 5.50 4
(Check last COLUMN where every max player has it's own max and min average grouped)
In fact, I did it in SQL by using PARTITION BY, but in Python, I don't know what I'm doing wrong.
This is in SQL:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals()) # Quicker to write query
q = """
WITH cte_data AS
(
SELECT
name
, users_rated
, average
, bayes_average
, maxplayers
, MIN(average) OVER(PARTITION BY maxplayers) AS avg_min
, MAX(average) OVER(PARTITION BY maxplayers) AS avg_max
FROM df
)
SELECT
*
FROM cte_data
WHERE 1=1
AND (average = avg_min OR average = avg_max)
AND (maxplayers <= 4 AND maxplayers != 0)
ORDER BY maxplayers, average
;
"""
pysqldf(q)
And this is what I have currently in Python Pandas after stumbling upon this similar question in StackOverflow: filtering grouped pandas dataframe by all records being the same
(
df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
.groupby(["maxplayers"])
.filter(lambda x: ((x["average"] == x["average"].min()).all()) or ((x["average"] == x["average"].max()).all()))
)
Another approach but no luck:
dta = (df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
.assign(avg_min = lambda x: x.groupby(["maxplayers"])["average"].min())
.assign(avg_max = lambda x: x.groupby(["maxplayers"])["average"].max())
)
dta.query("average == avg_min or average == avg_max")
Any ideas? I really appreciate any help you can provide.
CodePudding user response:
Build the conditions on the groupby with transform
, which in a way is the equivalent of SQL's partition by:
grouped = df.groupby('maxplayers').average
cond1 = df.average.eq(grouped.transform('min')) | df.average.eq(grouped.transform('max'))
cond2 = df.maxplayers.between(0,4) # a simpler interpretation
df.loc[cond1 & cond2].sort_values(['maxplayers', 'average'])
name users_rated average bayes_average maxplayers
8 7 Wonders Duel 69472 8.11 7.98 2
2 Catan 108024 7.14 6.97 4
4 Dominion 81561 7.61 7.50 4
CodePudding user response:
You can you do this:
import pandas as pd
(
df[(df.maxplayers<=4) & (df.maxplayers!=0)]
.groupby("maxplayers")
.apply(lambda gp: gp.assign(avg_min = min(gp["average"]), avg_max=max(gp["average"])))
.reset_index(drop=True)
.query("average == avg_min or average==avg_max")
.sort_values(["maxplayers","average"])
)
Output:
name users_rated average bayes_average maxplayers
2 7 Wonders Duel 69472 8.11 7.98 2
0 Catan 108024 7.14 6.97 4
1 Dominion 81561 7.61 7.50 4
R Output:
# A tibble: 3 x 5
name users_rated average bayes_average maxplayers
<chr> <int> <dbl> <dbl> <int>
1 7 Wonders Duel 69472 8.11 7.98 2
2 Catan 108024 7.14 6.97 4
3 Dominion 81561 7.61 7.5 4