Home > other >  is there any similar function in python poloars like 'transform' in pandas?
is there any similar function in python poloars like 'transform' in pandas?

Time:01-22

in pandas , i can run the code like:

day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']

code = ["a","a","a","a","b","b","b","c","c"]

price = [1,2,3,4,5,6,7,8,9]

df = pd.DataFrame({"date":day,"code":code,"price":price})


df['codeindex'] = df.groupby('code')['date'].transform(lambda x: range(0, len(x), 1))

in pandas i can generate such a column 'codeindex' using transform. how can i do that in py-polars?

thx a lot!

CodePudding user response:

If you need that exact functionality - you're doing the equivalent of .cumcount()

>>> df.groupby('code').cumcount()
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    0
8    1
dtype: int64

You can achieve the same result in polars with:

df.with_column(
    df.groupby('code', maintain_order=True)
      .agg(pl.col('date').cumcount())['date'].alias('codeindex').explode()
)

result:

shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ ---  ┆ ---  ┆ ---   ┆ ---       │
│ str  ┆ str  ┆ i64   ┆ u32       │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a    ┆ 1     ┆ 0         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a    ┆ 2     ┆ 1         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a    ┆ 3     ┆ 2         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a    ┆ 4     ┆ 3         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...  ┆ ...  ┆ ...   ┆ ...       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b    ┆ 6     ┆ 1         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b    ┆ 7     ┆ 2         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c    ┆ 8     ┆ 0         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c    ┆ 9     ┆ 1         │
└──────┴──────┴───────┴───────────┘

CodePudding user response:

Your dataframe is ordered by the code column. This means you can use a window expression.

A windows expression takes an aggregation, like col("date").cumcount() and applies that over a group defined by .over("code").

The code looks like this:

day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']

code = ["a","a","a","a","b","b","b","c","c"]

price = [1,2,3,4,5,6,7,8,9]

df = pl.DataFrame({"date":day,"code":code,"price":price})

(df.select([
    pl.all(),
    pl.col("date").cumcount().over("code").flatten().alias("codeindex"),
]))

outputs

shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ ---  ┆ ---  ┆ ---   ┆ ---       │
│ str  ┆ str  ┆ i64   ┆ u32       │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a    ┆ 1     ┆ 0         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a    ┆ 2     ┆ 1         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a    ┆ 3     ┆ 2         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a    ┆ 4     ┆ 3         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...  ┆ ...  ┆ ...   ┆ ...       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b    ┆ 6     ┆ 1         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b    ┆ 7     ┆ 2         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c    ┆ 8     ┆ 0         │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c    ┆ 9     ┆ 1         │
└──────┴──────┴───────┴───────────┘

Disect

Note that if we don't flatten we would get lists with cumcounts for every group. Like this

df.select([
    pl.col("date").cumcount().over("code").alias("codeindex"),
]))
shape: (9, 1)
┌───────────────┐
│ codeindex     │
│ ---           │
│ list [u32]    │
╞═══════════════╡
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2]     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2]     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1]        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1]        │
└───────────────┘

Flatten in window functions.

In window functions a flatten takes the first value per group and explodes/flattens that. That result is stacked back to the original DataFrame.

For this output to make sense, you often want the DataFrame to be sorted by the columns you group by (in the over clause).

This is a gotcha to remember, but doing so allows us to make the window operations very fast.

  •  Tags:  
  • Related