polars groupby cannot get mean of datetime column


I have a dataframe with a column of datetimes, a column of floats, and a column of integers like this:

┌─────────────────────────┬───────────┬─────────────┐
│ time                    ┆ NAV_DEPTH ┆ coarse_ints │
│ ---                     ┆ ---       ┆ ---         │
│ datetime[ms]            ┆ f64       ┆ i64         │
╞═════════════════════════╪═══════════╪═════════════╡
│ 2019-07-21 23:25:02.737 ┆ 3.424     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:25:32.745 ┆ 2.514     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:02.753 ┆ 2.514     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:32.668 ┆ 2.323     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                     ┆ ...       ┆ ...         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:16.383 ┆ 3.17      ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:46.390 ┆ 3.213     ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:16.396 ┆ 3.361     ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:46.402 ┆ 3.403     ┆ 690         │
└─────────────────────────┴───────────┴─────────────┘

The integer column splits the dataset into sequential groups of 8 samples for averaging. I would like to perform a groupby on the integer column and get the mean depth and mean datetime for each group. It works with median:

df.groupby('coarse_ints').median()
┌─────────────┬─────────────────────────┬───────────┐
│ coarse_ints ┆ time                    ┆ NAV_DEPTH │
│ ---         ┆ ---                     ┆ ---       │
│ i64         ┆ datetime[ms]            ┆ f64       │
╞═════════════╪═════════════════════════╪═══════════╡
│ 128         ┆ 2019-07-22 07:58:55.498 ┆ 207.8305  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 672         ┆ 2019-07-23 20:15:29.461 ┆ 3.086     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 328         ┆ 2019-07-22 21:19:08.667 ┆ 694.677   │
└─────────────┴─────────────────────────┴───────────┘

But with mean, the datetimes all come back null:

df.groupby('coarse_ints').mean()
┌─────────────┬──────────────┬────────────┐
│ coarse_ints ┆ time         ┆ NAV_DEPTH  │
│ ---         ┆ ---          ┆ ---        │
│ i64         ┆ datetime[ms] ┆ f64        │
╞═════════════╪══════════════╪════════════╡
│ 232         ┆ null         ┆ 96.967125  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 216         ┆ null         ┆ 156.889    │
└─────────────┴──────────────┴────────────┘

groupby_dynamic looked promising, but it requires a regular time interval. I need to average every 8 samples, regardless of the irregular spacing in time.

CodePudding user response:

If you operate on the underlying integer representation of the datetime and cast back when done, you can get the mean via a regular groupby (I admit this is slightly non-intuitive ;)

df.with_column(
    pl.col('time').to_physical()  # datetime[ms] -> i64 (ms since epoch)
).groupby(
    by=pl.col('coarse_ints'),
    maintain_order=True,  # or not :)
).mean().with_column(
    pl.col('time').cast(pl.Datetime('ms'))  # mean of i64 -> back to datetime[ms]
)

Note that when casting back from the physical/integer representation you should specify the original time unit (e.g. 'ms', 'us', 'ns') so as to avoid incorrect scaling. Also, recent polars releases have renamed groupby and with_column to group_by and with_columns.
