Home > OS >  py-polars: groupby_dynamic but via expressions. (timestamp based window functions)
py-polars: groupby_dynamic but via expressions. (timestamp based window functions)

Time:08-25

df = pl.DataFrame({
  'txn_id': ['0x5...60', '0x1...6d', '0x9...84', '0xc...25', '0x5...50', '0xe...14', '0x2...f3', '0xe...75', '0x3...95', '0x4...4e'], 
  'txn_grouping': ['0x4...dd', '0x4...dd', '0xf...e2', '0x4...17', '0xe...8b', '0x6...4e', '0xe...da', '0xf...f2', '0x1...21', '0xc...cf'], 
  'ts': [1438918233, 1438918613, 1438918630, 1438918983, 1438919175, 1438918630, 1438919451, 1438919461, 1438919491, 1438919571]
})
df = df.select([
  pl.exclude('ts'),
  (pl.col("ts") * 1000).cast(pl.Datetime('ms'))
]).groupby_dynamic("ts", every = "5m").agg([
  pl.n_unique("txn_id").alias("num_txs_per_5m"),
])

Ideally, i'd like to retain the original dataframe, and add a new column called 'num_txs_per_5m'. I can do this kind of window operation with non timestamp based fields,

df.select([pl.col('txn_id').count().over('txn_grouping'), 'txn_grouping'])

is it possible in polars to do this style of operation via the .over() operation?

CodePudding user response:

Is this the result that you're looking for - associating each observation with the number of unique transactions that occurred in each of the 5-minute windows?

shape: (10, 4)
┌──────────┬──────────────┬─────────────────────┬────────────────┐
│ txn_id   ┆ txn_grouping ┆ ts                  ┆ num_txs_per_5m │
│ ---      ┆ ---          ┆ ---                 ┆ ---            │
│ str      ┆ str          ┆ datetime[ms]        ┆ u32            │
╞══════════╪══════════════╪═════════════════════╪════════════════╡
│ 0x5...60 ┆ 0x4...dd     ┆ 2015-08-07 03:30:33 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x1...6d ┆ 0x4...dd     ┆ 2015-08-07 03:36:53 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x9...84 ┆ 0xf...e2     ┆ 2015-08-07 03:37:10 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xe...14 ┆ 0x6...4e     ┆ 2015-08-07 03:37:10 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xc...25 ┆ 0x4...17     ┆ 2015-08-07 03:43:03 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x5...50 ┆ 0xe...8b     ┆ 2015-08-07 03:46:15 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x2...f3 ┆ 0xe...da     ┆ 2015-08-07 03:50:51 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xe...75 ┆ 0xf...f2     ┆ 2015-08-07 03:51:01 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x3...95 ┆ 0x1...21     ┆ 2015-08-07 03:51:31 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x4...4e ┆ 0xc...cf     ┆ 2015-08-07 03:52:51 ┆ 4              │
└──────────┴──────────────┴─────────────────────┴────────────────┘

Edit: using the truncate expression with over

If you'd like to use the over windowing expression, one way we can accomplish this is to assign each observation to a 5-minute interval, using the truncate expression.

(I've added a trunc_time to show the results of the truncate function on the ts column.)

(
    df
    .with_columns([
        pl.col('ts').dt.truncate(every='5m').alias('trunc_time'),

        pl.col('txn_id')
        .n_unique()
        .over(pl.col('ts').dt.truncate(every='5m'))
        .alias('num_txs_per_5m'),
    ])
)

shape: (10, 5)
┌──────────┬──────────────┬─────────────────────┬─────────────────────┬────────────────┐
│ txn_id   ┆ txn_grouping ┆ ts                  ┆ trunc_time          ┆ num_txs_per_5m │
│ ---      ┆ ---          ┆ ---                 ┆ ---                 ┆ ---            │
│ str      ┆ str          ┆ datetime[ms]        ┆ datetime[ms]        ┆ u32            │
╞══════════╪══════════════╪═════════════════════╪═════════════════════╪════════════════╡
│ 0x5...60 ┆ 0x4...dd     ┆ 2015-08-07 03:30:33 ┆ 2015-08-07 03:30:00 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x1...6d ┆ 0x4...dd     ┆ 2015-08-07 03:36:53 ┆ 2015-08-07 03:35:00 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x9...84 ┆ 0xf...e2     ┆ 2015-08-07 03:37:10 ┆ 2015-08-07 03:35:00 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xe...14 ┆ 0x6...4e     ┆ 2015-08-07 03:37:10 ┆ 2015-08-07 03:35:00 ┆ 3              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xc...25 ┆ 0x4...17     ┆ 2015-08-07 03:43:03 ┆ 2015-08-07 03:40:00 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x5...50 ┆ 0xe...8b     ┆ 2015-08-07 03:46:15 ┆ 2015-08-07 03:45:00 ┆ 1              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x2...f3 ┆ 0xe...da     ┆ 2015-08-07 03:50:51 ┆ 2015-08-07 03:50:00 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0xe...75 ┆ 0xf...f2     ┆ 2015-08-07 03:51:01 ┆ 2015-08-07 03:50:00 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x3...95 ┆ 0x1...21     ┆ 2015-08-07 03:51:31 ┆ 2015-08-07 03:50:00 ┆ 4              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0x4...4e ┆ 0xc...cf     ┆ 2015-08-07 03:52:51 ┆ 2015-08-07 03:50:00 ┆ 4              │
└──────────┴──────────────┴─────────────────────┴─────────────────────┴────────────────┘

Using join_asof after groupby_dynamic

We can also generate the values by performing a join_asof after the groupby_dynamic so that each observation is associated with the number of unique transactions in the 5-minute intervals.

df = df.sort('ts')

(
    df
    .join_asof(
        df
        .groupby_dynamic("ts", every="5m")
        .agg(
            [
                pl.n_unique("txn_id").alias("num_txs_per_5m"),
            ]
        ),
        on='ts',
        strategy='backward',
    )
)
  • Related