polars groupby cannot get mean of datetime column


I have a dataframe with a column of datetimes, a column of floats, and a column of integers like this:

┌─────────────────────────┬───────────┬─────────────┐
│ time                    ┆ NAV_DEPTH ┆ coarse_ints │
│ ---                     ┆ ---       ┆ ---         │
│ datetime[ms]            ┆ f64       ┆ i64         │
╞═════════════════════════╪═══════════╪═════════════╡
│ 2019-07-21 23:25:02.737 ┆ 3.424     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:25:32.745 ┆ 2.514     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:02.753 ┆ 2.514     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:32.668 ┆ 2.323     ┆ 0           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                     ┆ ...       ┆ ...         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:16.383 ┆ 3.17      ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:46.390 ┆ 3.213     ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:16.396 ┆ 3.361     ┆ 689         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:46.402 ┆ 3.403     ┆ 690         │
└─────────────────────────┴───────────┴─────────────┘

The integer column splits the dataset into sequential groups of 8 samples for averaging. I would like to perform a groupby on the integer column and get the mean depth and mean datetime for each group. It works with median:

df.groupby('coarse_ints').median()
┌─────────────┬─────────────────────────┬───────────┐
│ coarse_ints ┆ time                    ┆ NAV_DEPTH │
│ ---         ┆ ---                     ┆ ---       │
│ i64         ┆ datetime[ms]            ┆ f64       │
╞═════════════╪═════════════════════════╪═══════════╡
│ 128         ┆ 2019-07-22 07:58:55.498 ┆ 207.8305  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 672         ┆ 2019-07-23 20:15:29.461 ┆ 3.086     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 328         ┆ 2019-07-22 21:19:08.667 ┆ 694.677   │
└─────────────┴─────────────────────────┴───────────┘

But with mean, the datetimes all come back null:

df.groupby('coarse_ints').mean()
┌─────────────┬──────────────┬────────────┐
│ coarse_ints ┆ time         ┆ NAV_DEPTH  │
│ ---         ┆ ---          ┆ ---        │
│ i64         ┆ datetime[ms] ┆ f64        │
╞═════════════╪══════════════╪════════════╡
│ 232         ┆ null         ┆ 96.967125  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 216         ┆ null         ┆ 156.889    │
└─────────────┴──────────────┴────────────┘

groupby_dynamic looked promising, but it requires a regular time interval. I need to average every 8 samples, regardless of the irregular spacing in time.

CodePudding user response:

If you operate on the underlying integer representation of the datetime and cast back when done, you can get the mean via a regular groupby (I admit this is slightly non-intuitive ;)

df.with_column(
    pl.col('time').to_physical()  # datetime[ms] -> i64 (ms since epoch)
).groupby(
    by=pl.col('coarse_ints'),
    maintain_order=True,  # or not :)
).mean().with_column(
    pl.col('time').cast(pl.Datetime('ms'))  # mean of i64 -> back to datetime[ms]
)

Note that when casting back from the physical/integer representation you should specify the original time unit (e.g. 'ms', 'us', 'ns') so as to avoid incorrect scaling. Also, recent polars releases have renamed groupby and with_column to group_by and with_columns.
