While figuring out how to implement binning with Python polars, I found that I can easily calculate aggregates for individual columns:
import polars as pl
import numpy as np

# 50 samples, 2 "seconds" apart, duplicated into two value columns
t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
# turn the integer offsets into a proper datetime column
df = df.with_column(
    (pl.datetime(2022, 10, 30) + pl.duration(seconds=pl.col("t"))).alias("datetime")
).drop("t")
df.groupby_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
┌─────────────────────┬──────┐
│ datetime ┆ v0 │
│ --- ┆ --- │
│ datetime[μs] ┆ f64 │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... │
or calculate multiple aggregations like
df.groupby_dynamic("datetime", every="10s").agg([
    pl.col("v0").mean().alias("v0_binmean"),
    pl.col("v0").count().alias("v0_bincount")
])
┌─────────────────────┬────────────┬─────────────┐
│ datetime ┆ v0_binmean ┆ v0_bincount │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ u32 │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
or calculate one aggregation for multiple columns like
cols = [c for c in df.columns if "datetime" not in c]
df.groupby_dynamic("datetime", every="10s").agg([
    pl.col(c).mean().alias(f"{c}_binmean")
    for c in cols
])
┌─────────────────────┬────────────┬────────────┐
│ datetime ┆ v0_binmean ┆ v1_binmean │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
However, combining both approaches fails!
df.groupby_dynamic("datetime", every="10s").agg([
    [
        pl.col(c).mean().alias(f"{c}_binmean"),
        pl.col(c).count().alias(f"{c}_bincount")
    ]
    for c in cols
])
Traceback (most recent call last):
File "/tmp/ipykernel_2666/421808935.py", line 2, in <cell line: 2>
df.groupby_dynamic("datetime", every="10s").agg([
File ".../3.10.9/lib/python3.10/site-packages/polars/internals/dataframe/groupby.py", line 924, in agg
.agg(aggs)
File ".../3.10.9/lib/python3.10/site-packages/polars/internals/lazyframe/groupby.py", line 55, in agg
raise TypeError(msg)
TypeError: expected 'Expr | Sequence[Expr]', got '<class 'list'>'
Is there a "polarustic" approach to calculating multiple statistical parameters for multiple (or all) columns of the DataFrame in one go?
Related, pandas-specific: Python pandas groupby aggregate on multiple columns
CodePudding user response:
Instead of using a comprehension, you can do:
df.groupby_dynamic("datetime", every="10s").agg(
    pl.exclude("datetime").mean().suffix("_binmean")
)
shape: (10, 3)
┌─────────────────────┬────────────┬────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ f64        ┆ f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       │
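Note: on newer polars releases (0.19+), groupby_dynamic has been renamed to group_by_dynamic and suffix has moved into the name namespace, so the same aggregation would look roughly like this (a sketch, assuming a recent polars version):

df.group_by_dynamic("datetime", every="10s").agg(
    pl.exclude("datetime").mean().name.suffix("_binmean")
)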
For multiple aggregations:
df.groupby_dynamic("datetime", every="10s").agg([
    pl.exclude("datetime").mean().suffix("_binmean"),
    pl.exclude("datetime").count().suffix("_bincount")
])
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean ┆ v0_bincount ┆ v1_bincount │
│ ---                 ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ f64        ┆ u32         ┆ u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       ┆ 5           ┆ 5           │
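The same expressions also work through the lazy engine, which lets polars optimize the whole query before executing it; a sketch for larger frames, reusing the expressions above:

df.lazy().groupby_dynamic("datetime", every="10s").agg([
    pl.exclude("datetime").mean().suffix("_binmean"),
    pl.exclude("datetime").count().suffix("_bincount")
]).collect()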
With comprehensions, you'd need to combine the two into a single list:
df.groupby_dynamic("datetime", every="10s").agg(
    [pl.col(c).mean().alias(f"{c}_binmean") for c in cols]
    + [pl.col(c).count().alias(f"{c}_bincount") for c in cols]
)
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean ┆ v0_bincount ┆ v1_bincount │
│ ---                 ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ f64        ┆ u32         ┆ u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       ┆ 5           ┆ 5           │
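Alternatively, the nested comprehension from the question works once it is flattened into a single list of expressions with a double for clause; this variant keeps each column's statistics adjacent in the output (v0_binmean, v0_bincount, v1_binmean, v1_bincount):

df.groupby_dynamic("datetime", every="10s").agg([
    expr
    for c in cols
    for expr in (
        pl.col(c).mean().alias(f"{c}_binmean"),
        pl.col(c).count().alias(f"{c}_bincount"),
    )
])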