Home > Blockchain >  Filtering AFTER grouped pandas dataframe by two or more conditions in Python
Filtering AFTER grouped pandas dataframe by two or more conditions in Python

Time:05-06

Does anyone have experience filtering AFTER grouped pandas data frames by all rows? Let me explain.

This is the reproducible data:

     name            users_rated average bayes_average maxplayers
   <chr>                 <dbl>   <dbl>         <dbl>      <dbl>
 1 Pandemic             108975    7.59          7.49          4
 2 Carcassonne          108738    7.42          7.31          5
 3 Catan                108024    7.14          6.97          4
 4 7 Wonders             89982    7.74          7.63          7
 5 Dominion              81561    7.61          7.50          4
 6 Ticket to Ride        76171    7.41          7.30          5
 7 Codenames             74419    7.6           7.51          8
 8 Terraforming M…       74216    8.42          8.27          5
 9 7 Wonders Duel        69472    8.11          7.98          2
10 Agricola              66093    7.93          7.81          5

For the question I'm trying to solve, in R, I can do it in this way:

df |> 
select(
    name,
    users_rated,
    average,
    bayes_average,
    maxplayers
) |> 
group_by(maxplayers) |> 
filter(average == min(average) | average == max(average)) |> # What I'm struggling right now in Python
filter(maxplayers <= 4 & maxplayers != 0) |> 
arrange(maxplayers, average) |> 
ungroup()

What this code is doing is grouping by maxplayers, and then filtering BY GROUPS (For example, if there are three games with maxplayers = 99, then those three games will be applied to the filter)

This is an example output (NOTE: is not the same as the reproducible data, but it helps what I'm trying to do)

  name             users_rated average bayes_average maxplayers
  <chr>                  <dbl>   <dbl>         <dbl>      <dbl>
1 Solitaire               1014    4.4           5.07          1
2 Five Parsecs Fr…          65    8.89          5.59          1
3 W.W.B                     40    1.44          5.41          2
4 System Gateway …          47    9.4           5.6           2
5 Exploration: Wa…          47    3.55          5.44          3
6 Old School Tact…         151    8.54          5.69          3
7 Oneupmanship: M…          75    1.04          5.32          4
8 TerroriXico               70    9.43          5.50          4

(Check last COLUMN where every max player has it's own max and min average grouped)

In fact, I did it in SQL by using PARTITION BY, but in Python, I don't know what I'm doing wrong.

This is in SQL:

    from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals()) # Quicker to write query

q = """
    WITH cte_data AS
    (
        SELECT
            name
            , users_rated
            , average
            , bayes_average
            , maxplayers
            , MIN(average) OVER(PARTITION BY maxplayers) AS avg_min
            , MAX(average) OVER(PARTITION BY maxplayers) AS avg_max
    
        FROM df
    )
    
    SELECT
        *
    
    FROM cte_data
        WHERE 1=1
            AND (average = avg_min OR average = avg_max)
            AND (maxplayers <= 4 AND maxplayers != 0)
        ORDER BY maxplayers, average
    ;
    """

pysqldf(q)

And this is what I have currently in Python Pandas after stumbling upon this similar question in StackOverflow: filtering grouped pandas dataframe by all records being the same

    (
    df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
        .groupby(["maxplayers"])
        .filter(lambda x: ((x["average"] == x["average"].min()).all()) or ((x["average"] == x["average"].max()).all()))
)

Another approach but no luck:

    dta = (df[["name", "users_rated", "average", "bayes_average", "maxplayers"]]
        .assign(avg_min =  lambda x: x.groupby(["maxplayers"])["average"].min())
        .assign(avg_max = lambda x: x.groupby(["maxplayers"])["average"].max())

        )

dta.query("average == avg_min or average == avg_max")

Any ideas? I really appreciate any help you can provide.

CodePudding user response:

Build the conditions on the groupby with transform, which in a way is the equivalent of SQL's partition by:

grouped = df.groupby('maxplayers').average
cond1 = df.average.eq(grouped.transform('min')) | df.average.eq(grouped.transform('max'))
cond2 = df.maxplayers.between(0,4) # a simpler interpretation

df.loc[cond1 & cond2].sort_values(['maxplayers', 'average'])

             name  users_rated  average  bayes_average  maxplayers
8  7 Wonders Duel        69472     8.11           7.98           2
2           Catan       108024     7.14           6.97           4
4        Dominion        81561     7.61           7.50           4

CodePudding user response:

You can you do this:

import pandas as pd

(
    df[(df.maxplayers<=4) & (df.maxplayers!=0)]
    .groupby("maxplayers")
    .apply(lambda gp: gp.assign(avg_min = min(gp["average"]), avg_max=max(gp["average"])))
    .reset_index(drop=True)
    .query("average == avg_min or average==avg_max")
    .sort_values(["maxplayers","average"])
)

Output:

             name  users_rated  average  bayes_average  maxplayers
2  7 Wonders Duel        69472     8.11           7.98           2
0           Catan       108024     7.14           6.97           4
1        Dominion        81561     7.61           7.50           4

R Output:

# A tibble: 3 x 5
  name           users_rated average bayes_average maxplayers
  <chr>                <int>   <dbl>         <dbl>      <int>
1 7 Wonders Duel       69472    8.11          7.98          2
2 Catan               108024    7.14          6.97          4
3 Dominion             81561    7.61          7.5           4
  • Related