Polars DataFrame filter data in a period of time (start and end time)

Time:10-05

Is there a way to filter data in a period of time (i.e., start time and end time) using polars?

import pandas as pd
import polars as pl

dr = pd.date_range(start='2020-01-01', end='2021-01-01', freq="30min") 
df = pd.DataFrame({"timestamp": dr})
pf = pl.from_pandas(df)

My best attempt so far was:

pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30))

This only gave me rows after 9:30. If I then chain another filter after it:

pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30)).filter((pl.col("timestamp").dt.hour()<16))

This, however, does not include the rows that fall exactly on 16:00.

The Polars API does not seem to deal specifically with the time part of a time series (only the date part). Is there a better workaround using Polars?
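For reference, here is a minimal sketch in plain Python (no Polars, and with illustrative variable names) of why testing the hour and the minute separately does not select a time window. Requiring minute >= 30 drops every on-the-hour row such as 10:00, whereas comparing the time of day as a single value keeps them:

```python
from datetime import datetime, time, timedelta

# Half-hourly timestamps for a single day, mirroring the 30-minute frequency above.
day = [datetime(2020, 1, 1) + timedelta(minutes=30 * i) for i in range(48)]

# Component-wise test, as in the chained filters above:
# hour >= 9 AND minute >= 30, then hour < 16.
componentwise = [t for t in day if t.hour >= 9 and t.minute >= 30 and t.hour < 16]

# Whole-time test: compare the time of day as one value, bounds included.
whole_time = [t for t in day if time(9, 30) <= t.time() <= time(16, 0)]

print(len(componentwise))  # 7  (only 9:30, 10:30, ..., 15:30)
print(len(whole_time))     # 14 (9:30, 10:00, ..., 16:00)
```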

CodePudding user response:

Good question!

Firstly, we can create this kind of DataFrame in Polars:

from datetime import datetime, time
import polars as pl

start = datetime(2020,1,1)
stop = datetime(2021,1,1)
df = pl.DataFrame({'timestamp':pl.date_range(low=start, high=stop, interval="30m")}) 

To work on the time components of a datetime we cast the timestamp column to the pl.Time dtype.

To filter on a range of times we then pass the lower and upper time boundaries to is_between.

In this example I've printed the original timestamp column, the timestamp column cast to pl.Time and the filter condition.

(
df
.select(
    [
        pl.col("timestamp"),
        pl.col("timestamp").cast(pl.Time).alias('time_component'), 
        (pl.col("timestamp").cast(pl.Time).is_between(
            time(9,30),time(16),include_bounds=True
        )
        )
    ]
)
)

What you are after is:

df.filter(
    pl.col("timestamp").cast(pl.Time).is_between(
        time(9, 30), time(16), include_bounds=True
    )
)

See the API docs for the syntax on controlling behaviour at the boundaries: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.is_between.html#polars.Expr.is_between

CodePudding user response:

It is described in the polars book here: https://pola-rs.github.io/polars-book/user-guide/howcani/timeseries/selecting_dates.html#filtering-by-a-date-range

It would look something like this:

import polars as pl

start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"
df = pl.DataFrame(
    {
        "dates": [
                  "2022-03-22 00:00:00",
                  "2022-03-23 00:00:00",
                  "2022-03-24 00:00:00",
                  "2022-03-25 00:00:00",
                  "2022-03-26 00:00:00",
                  "2022-03-27 00:00:00",
                  "2022-03-28 00:00:00",
                 ]
    }
)
df.with_column(pl.col("dates").is_between(start_date,end_date)).filter(pl.col("is_between") == True)


shape: (4, 2)
┌─────────────────────┬────────────┐
│ dates               ┆ is_between │
│ ---                 ┆ ---        │
│ str                 ┆ bool       │
╞═════════════════════╪════════════╡
│ 2022-03-23 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-24 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-25 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-26 00:00:00 ┆ true       │
└─────────────────────┴────────────┘