Interpolate time series data from one df to time axis of another df in Python polars

Time:12-16

I have time series data on different time axes in different dataframes. I need to interpolate data from one df onto the time axis of another df. Example:

import polars as pl

df0 = pl.DataFrame({"dt": ["2022-12-14T14:00:01.000", "2022-12-14T14:00:02.000",
                           "2022-12-14T14:00:03.000", "2022-12-14T14:00:04.000",
                           "2022-12-14T14:00:05.000", "2022-12-14T14:00:06.000"],
                    "v0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
df0 = df0.with_column(pl.col("dt").str.strptime(pl.Datetime).cast(pl.Datetime))

df1 = pl.DataFrame({
        "dt": ["2022-12-14T14:00:01.001", "2022-12-14T14:00:03.001", "2022-12-14T14:00:05.002"],
        "v1": [1.0, 3.0, 5.0]})
df1 = df1.with_column(pl.col("dt").str.strptime(pl.Datetime).cast(pl.Datetime))

I cannot join the dfs since keys don't match:

print(df0.join(df1, on="dt", how="left").interpolate())
shape: (6, 3)
┌─────────────────────┬─────┬──────┐
│ dt                  ┆ v0  ┆ v1   │
│ ---                 ┆ --- ┆ ---  │
│ datetime[μs]        ┆ f64 ┆ f64  │
╞═════════════════════╪═════╪══════╡
│ 2022-12-14 14:00:01 ┆ 1.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-12-14 14:00:02 ┆ 2.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-12-14 14:00:03 ┆ 3.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-12-14 14:00:04 ┆ 4.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-12-14 14:00:05 ┆ 5.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-12-14 14:00:06 ┆ 6.0 ┆ null │
└─────────────────────┴─────┴──────┘

So my 'iterative' approach would be to interpolate each column individually, for instance like

from scipy.interpolate import interp1d

f = interp1d(
    df1["dt"].dt.timestamp(),
    df1["v1"],
    kind="linear",
    bounds_error=False,
    fill_value="extrapolate",
)

out = f(df0["dt"].dt.timestamp())

df0 = df0.with_column(pl.Series(out).alias("v1_interp"))

print(df0.head(6))
shape: (6, 3)
┌─────────────────────┬─────┬───────────┐
│ dt                  ┆ v0  ┆ v1_interp │
│ ---                 ┆ --- ┆ ---       │
│ datetime[μs]        ┆ f64 ┆ f64       │
╞═════════════════════╪═════╪═══════════╡
│ 2022-12-14 14:00:01 ┆ 1.0 ┆ 0.999     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 14:00:02 ┆ 2.0 ┆ 1.999     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 14:00:03 ┆ 3.0 ┆ 2.999     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 14:00:04 ┆ 4.0 ┆ 3.998501  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 14:00:05 ┆ 5.0 ┆ 4.998001  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 14:00:06 ┆ 6.0 ┆ 5.997501  │
└─────────────────────┴─────┴───────────┘

Although this gives the result I need, I wonder if there is a more idiomatic (efficient) approach?

CodePudding user response:

First approach

I think this is one of those problems where you first create far more rows than you want to end up with and then filter back down to the ones you need.

Because polars's interpolate function only computes missing values in between known values rather than extrapolating forward and backwards, let's make our first step to manually extrapolate df1 to add an extra row before and after.

df1=df1.lazy()
df1=pl.concat([df1,
        df1.sort('dt').with_row_count('n') \
            .select(
                    [pl.col('n')] + \
                    [pl.when(pl.col('n')<=1) \
                       .then(pl.col(x)-(pl.col(x).shift(-1)-pl.col(x))) \
                    .when(pl.col('n')>=pl.col('n').max()-1) \
                       .then(pl.col(x)+(pl.col(x)-pl.col(x).shift(1)))
                               for x in df1.columns]
                ) \
        .filter((pl.col('n')==0) | (pl.col('n')==pl.col('n').max())) \
        .select(pl.exclude('n'))]).sort('dt')

I'm using a list comprehension in the select so this should be extensible to any number of columns.
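
The arithmetic behind that padding step can be checked without polars. The following is a minimal plain-Python sketch, assuming rows are `(timestamp, value)` tuples sorted by time; the helper name `pad_with_extrapolated_ends` is hypothetical and not part of any library. It repeats the first and last step outward, which is what the `shift(-1)` and `shift(1)` expressions above do for the edge rows.

```python
from datetime import datetime

# (timestamp, value) rows mirroring df1 from the question, sorted by time
rows = [
    (datetime(2022, 12, 14, 14, 0, 1, 1000), 1.0),
    (datetime(2022, 12, 14, 14, 0, 3, 1000), 3.0),
    (datetime(2022, 12, 14, 14, 0, 5, 2000), 5.0),
]

def pad_with_extrapolated_ends(rows):
    """Hypothetical helper: add one row before the first and one after the last
    by repeating the edge step in both the time and the value column."""
    (t0, v0), (t1, v1) = rows[0], rows[1]
    (tp, vp), (tn, vn) = rows[-2], rows[-1]
    head = (t0 - (t1 - t0), v0 - (v1 - v0))   # first row minus the first step
    tail = (tn + (tn - tp), vn + (vn - vp))   # last row plus the last step
    return [head, *rows, tail]

padded = pad_with_extrapolated_ends(rows)
```

With the example data, the added rows land at 13:59:59.001 (value -1.0) and 14:00:07.003 (value 7.0), so every timestamp in df0 falls between two known points.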

The next thing to do is make a df with a dt column that starts at the earliest dt and ends at the latest dt across df0 and df1, stepping by the smallest time difference between consecutive rows. polars's interpolate treats rows as equally spaced, so fixing the step in your key column is what makes it work as you expect.

specs = pl.concat([df0.select('dt'),df1.select('dt')]) \
          .sort('dt').select([
                       pl.col('dt').min().alias('mindt'),
                       pl.col('dt').max().alias('maxdt'), 
                       (pl.col('dt')-pl.col('dt').shift()).min().alias('mindiff')
                      ]).collect()


newdf = pl.DataFrame({'dt':pl.date_range(specs[0,0], specs[0,1], specs[0,2])}).lazy()

Alternatively, in case dt isn't a datetime, you can make newdf with a list comprehension:

newdf = pl.DataFrame({'dt': [specs[0,0] + specs[0,2]*x for x in range(int(1 + (specs[0,1]-specs[0,0])/specs[0,2]))]}).lazy()
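
To get a feel for how large that regular grid becomes, here is a small plain-Python sketch of the same min/max/min-diff computation. It is an illustration only: it uses the original (unpadded) timestamps from the question, and the variable names are made up for this example.

```python
from datetime import datetime, timedelta

base = datetime(2022, 12, 14, 14, 0, 0)
# dt columns of df0 and df1 from the question, as millisecond offsets from 14:00:00
df0_dt = [base + timedelta(milliseconds=ms) for ms in (1000, 2000, 3000, 4000, 5000, 6000)]
df1_dt = [base + timedelta(milliseconds=ms) for ms in (1001, 3001, 5002)]

all_dt = sorted(df0_dt + df1_dt)
start, end = all_dt[0], all_dt[-1]
# Smallest gap between consecutive timestamps (assumes no duplicates, so step > 0)
step = min(b - a for a, b in zip(all_dt, all_dt[1:]))

# Regular grid from start to end inclusive, spaced by the smallest gap
n_points = (end - start) // step + 1
grid = [start + i * step for i in range(n_points)]
```

Here the smallest gap is 1 ms, so nine input timestamps expand to 5001 grid rows. The grid only contains every original timestamp if all of them sit at a whole number of steps from the start, which holds for this data but is worth checking on real data.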

With that, you do an outer join between newdf and your two dfs, then use the built-in interpolate to get all the values you're looking for. You can chain a filter and select at the end to match your output.

newdf = newdf.join(df0, on='dt', how='outer') \
    .join(df1, on='dt', how='outer') \
    .with_columns([pl.col(x).interpolate().suffix('_interp') for x in df1.columns if x != 'dt']) \
    .filter(~pl.col('v0').is_null()).select(pl.exclude('v1')) \
    .collect()

Second approach

Another way to tackle the problem is to essentially recreate the scipy interpolate function with a series of fill and when/then expressions.

First, do a diagonal concat, then add helper columns holding the dt and v1 values of the nearest known v1 rows: one pair filled forward and another pair filled backward. Then calculate the slope (the change in v1 divided by the time difference), which is itself carried forward and backward. The second-to-last step is when/then logic for the beginning, ending, and middle rows. The last step is filtering and selecting away the helper columns.

    pl.concat([df0, df1], how='diagonal').sort('dt') \
    .with_column(pl.when(~pl.col('v1').is_null()).then(pl.col('dt')).alias('v1dt')) \
    .with_columns([
        pl.col('v1').fill_null(strategy='forward').alias('v1_for'),
        pl.col('v1dt').fill_null(strategy='forward').alias('v1dt_for'),
        pl.col('v1').fill_null(strategy='backward').alias('v1_back'),
        pl.col('v1dt').fill_null(strategy='backward').alias('v1dt_back')
        ]) \
    .with_column(((pl.col('v1_back')-pl.col('v1_for'))/(pl.col('v1dt_back')-pl.col('v1dt_for'))).alias('diff')) \
    .with_column((pl.when(pl.col('diff').is_nan()).then(None).otherwise(pl.col('diff'))).alias('diff')) \
    .with_column(pl.col('diff').fill_null(strategy='forward').fill_null(strategy='backward')) \
    .with_column((pl.when(~pl.col('v1').is_null()).then(pl.col('v1')) \
                .when((~pl.col('v1_for').is_null()) & (~pl.col('v1_back').is_null())) \
                        .then((pl.col('dt')-pl.col('v1dt_for'))*pl.col('diff')+pl.col('v1_for')) \
                .when(~pl.col('v1_back').is_null()) \
                        .then(pl.col('v1_back')-(pl.col('v1dt_back')-pl.col('dt'))*pl.col('diff')) \
                .otherwise(pl.col('v1_for')+(pl.col('dt')-pl.col('v1dt_for'))*pl.col('diff'))).alias('v1_interp')) \
    .filter(~pl.col('v0').is_null()).select(['dt','v0','v1_interp'])
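
The formula those helper columns implement is easier to follow on plain lists. The sketch below is not the polars code above; it is a stdlib-only illustration of the same idea (find the bracketing known points, compute the slope, and reuse the edge segment's slope outside the known range). The function name `interp_linear` is hypothetical, and times are written as seconds after 14:00:00 for brevity.

```python
from bisect import bisect_right

def interp_linear(x, xs, ys):
    """Linearly interpolate at x; beyond the ends, extend the nearest segment."""
    # Index of the segment's left knot, clamped so the edge segments
    # also handle points before the first or after the last knot
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + (x - xs[i]) * slope

# Example data from the question, as seconds after 14:00:00
v1_t = [1.001, 3.001, 5.002]               # df1["dt"]
v1 = [1.0, 3.0, 5.0]                       # df1["v1"]
v0_t = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # df0["dt"]

v1_interp = [interp_linear(t, v1_t, v1) for t in v0_t]
```

The resulting values agree with the v1_interp column printed in the question to the displayed precision, for example about 3.9985 at 14:00:04, where the slope of the second segment is 2 / 2.001.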

CodePudding user response:

Similar method to @DeanMacGregor's second approach but using .join_asof() to combine the rows.

(
   df0
   .join_asof(
      df1.with_column(pl.col("dt").alias("v1_dt")),
      on="dt",
      strategy="forward")
   .with_columns([
      pl.col(["v1", "v1_dt"]).shift(-1).suffix("_next"),
      pl.col(["v1", "v1_dt"]).shift( 1).suffix("_prev"),
   ])
   .with_column((
      (pl.col("v1_prev") - pl.col("v1_next"))
         / (pl.col("v1_dt_prev") - pl.col("v1_dt_next")))
      .forward_fill()
      .backward_fill()
      .alias("diff"))
   .with_column(
      pl.col(["v1_next", "v1_dt_next"]).forward_fill())
   .select([
      pl.col(["dt", "v0"]),
      pl.col("v1_next").alias("v1_interp") +
         (pl.col("dt") - pl.col("v1_dt_next"))
         * pl.col("diff")
   ])
)
shape: (6, 3)
┌─────────────────────┬─────┬───────────┐
│ dt                  | v0  | v1_interp │
│ ---                 | --- | ---       │
│ datetime[μs]        | f64 | f64       │
╞═════════════════════╪═════╪═══════════╡
│ 2022-12-14 14:00:01 | 1.0 | 0.999     │
├─────────────────────┼─────┼───────────┤
│ 2022-12-14 14:00:02 | 2.0 | 1.999     │
├─────────────────────┼─────┼───────────┤
│ 2022-12-14 14:00:03 | 3.0 | 2.999     │
├─────────────────────┼─────┼───────────┤
│ 2022-12-14 14:00:04 | 4.0 | 3.998501  │
├─────────────────────┼─────┼───────────┤
│ 2022-12-14 14:00:05 | 5.0 | 4.998001  │
├─────────────────────┼─────┼───────────┤
│ 2022-12-14 14:00:06 | 6.0 | 5.997501  │
└─────────────────────┴─────┴───────────┘
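
To see what the forward as-of join contributes before the shift and fill steps, here is a plain-Python sketch of the matching rule, assuming `strategy="forward"` pairs each left row with the first right row whose key is greater than or equal to the left key. It uses `bisect` on sorted lists rather than polars, the helper name `asof_forward` is made up for this example, and times are again seconds after 14:00:00.

```python
from bisect import bisect_left

# Example data from the question, as seconds after 14:00:00
left_t = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # df0["dt"]
right_t = [1.001, 3.001, 5.002]            # df1["dt"], must be sorted
right_v = [1.0, 3.0, 5.0]                  # df1["v1"]

def asof_forward(t):
    """Return (right_t, right_v) of the first right row at or after t, else (None, None)."""
    i = bisect_left(right_t, t)
    return (right_t[i], right_v[i]) if i < len(right_t) else (None, None)

matches = [asof_forward(t) for t in left_t]
```

The last left row (14:00:06) has no right row at or after it, so it gets no match. That gap is why the query above forward-fills `v1_next` and `v1_dt_next` and carries `diff` in both directions before computing `v1_interp`.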