in pandas , i can run the code like:
day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']
code = ["a","a","a","a","b","b","b","c","c"]
price = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame({"date":day,"code":code,"price":price})
df['codeindex'] = df.groupby('code')['date'].transform(lambda x: range(0, len(x), 1))
in pandas i can generate such a column 'codeindex' using transform. how can i do that in py-polars?
thx a lot!
CodePudding user response:
If you need that exact functionality - you're doing the equivalent of .cumcount()
>>> df.groupby('code').cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 0
8 1
dtype: int64
You can achieve the same result in polars with:
df.with_column(
df.groupby('code', maintain_order=True)
.agg(pl.col('date').cumcount())['date'].alias('codeindex').explode()
)
result:
shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a ┆ 2 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a ┆ 3 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a ┆ 4 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b ┆ 6 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b ┆ 7 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c ┆ 8 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c ┆ 9 ┆ 1 │
└──────┴──────┴───────┴───────────┘
CodePudding user response:
Your dataframe is ordered by the code
column. This means you can use a window
expression.
A windows expression takes an aggregation, like col("date").cumcount()
and applies that over a group defined by .over("code")
.
The code looks like this:
day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']
code = ["a","a","a","a","b","b","b","c","c"]
price = [1,2,3,4,5,6,7,8,9]
df = pl.DataFrame({"date":day,"code":code,"price":price})
(df.select([
pl.all(),
pl.col("date").cumcount().over("code").flatten().alias("codeindex"),
]))
outputs
shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a ┆ 2 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a ┆ 3 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a ┆ 4 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b ┆ 6 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b ┆ 7 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c ┆ 8 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c ┆ 9 ┆ 1 │
└──────┴──────┴───────┴───────────┘
Disect
Note that if we don't flatten
we would get lists with cumcounts for every group. Like this
df.select([
pl.col("date").cumcount().over("code").alias("codeindex"),
]))
shape: (9, 1)
┌───────────────┐
│ codeindex │
│ --- │
│ list [u32] │
╞═══════════════╡
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1] │
└───────────────┘
Flatten in window functions.
In window functions a flatten
takes the first value per group and explodes/flattens
that. That result is stacked back to the original DataFrame
.
For this output to make sense, you often want the DataFrame
to be sorted by the columns you group by (in the over
clause).
This is a gotcha to remember, but doing so allows us to make the window operations very fast.