I have df for my work with 3 main columns: cid1, cid2, cid3
, and more 7 columns cid4, cid5, etc
.
cid1
and cid2
is int
, another columns is float
.
Each combitations of cid1
and cid2
is a workset with some rows where is values of all other columns is different. I want to filter df and receive my df with only max values in column cid3
for each combination of cid1
and cid2
. cid4
and next columns must be leaved without changes.
This code helps me with one part of my task:
df = (df
.groupby(["cid1", "cid2"])
.agg([pl.max("cid3").alias("max_cid3")])
)
It's receives only 3 columns: cid1
, cid2
, max_cid3
and filter all rows when cid3
is not maximal.
But I can't find how to receive all another columns (cid4, etc
) for that rows without changes.
df = (df
.groupby(["cid1", "cid2"])
.agg([pl.max("cid3").alias("max_cid3"), pl.col("cid4")])
)
I tried to add pl.col("cid4")
to list of aggs but in column I see as values different lists of some cid4
values.
How I can make it properly? Maybe Polars haves another way to make it then groupby?
In Pandas I can make it:
import pandas as pd
import numpy as np
df["max_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.max)
And then filter df wherever cid3==max_cid3
But I can't find a way to make it in Polars.
Thank you!
CodePudding user response:
In polars you can use a Window function
df.with_column(
pl.col("cid3").max().over(["cid1", "cid2"])
.alias("max_cid3")
)
shape: (5, 6)
┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ max_cid3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1 ┆ 1 ┆ 1 ┆ 4 ┆ 4 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 5 ┆ 5 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 9 ┆ 6 ┆ 4 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 ┆ 7 ┆ 9 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 1 ┆ 8 ┆ 3 ┆ 1 │
└──────┴──────┴──────┴──────┴──────┴──────────┘
You could also put it directly inside .filter()
df.filter(
pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])
)
Data used:
df = pl.DataFrame({
"cid1": [1, 2, 2, 1, 3],
"cid2": [1, 2, 2, 1, 3],
"cid3": [1, 2, 9, 1, 1],
"cid4": [4, 5, 6, 7, 8],
"cid5": [4, 5, 4, 9, 3],
})
>>> df.to_pandas().groupby(["cid1", "cid2"])["cid3"].transform("max")
0 1
1 9
2 9
3 1
4 1
Name: cid3, dtype: int64