Is there a way to filter data in a period of time (i.e., start time
and end time
) using polars
?
import pandas as pd
import polars as pl
dr = pd.date_range(start='2020-01-01', end='2021-01-01', freq="30min")
df = pd.DataFrame({"timestamp": dr})
pf = pl.from_pandas(df)
The best try I've got was:
pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30))
It only gave me everything after 9:30
; and if I append another filter
after that:
pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30)).filter((pl.col("timestamp").dt.hour()<16))
This however does not give me the slice that falls right on 16:00
.
polars
API do not seem to specifically deal with the time
part of time series (only date
part); Is there a better workaround here using polars
?
CodePudding user response:
Good question!
Firstly, we can create this kind of DataFrame in Polars:
from datetime import datetime, time
import polars as pl
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
df = pl.DataFrame({'timestamp':pl.date_range(low=start, high=stop, interval="30m")})
To work on the time components of a datetime we cast the timestamp
column to the pl.Time
dtype.
To filter on a range of times we then pass the upper and lower boundaries of time to in_between
.
In this example I've printed the original timestamp
column, the timestamp
column cast to pl.Time
and the filter condition.
(
df
.select(
[
pl.col("timestamp"),
pl.col("timestamp").cast(pl.Time).alias('time_component'),
(pl.col("timestamp").cast(pl.Time).is_between(
time(9,30),time(16),include_bounds=True
)
)
]
)
)
What you are after is:
(
df
.filter(
pl.col("timestamp").cast(pl.Time).is_between(
time(9,30),time(16),include_bounds=True
)
)
)
See the API docs for the syntax on controlling behaviour at the boundaries: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.is_between.html#polars.Expr.is_between
CodePudding user response:
It is described in the polars book here: https://pola-rs.github.io/polars-book/user-guide/howcani/timeseries/selecting_dates.html#filtering-by-a-date-range
It would look something like this:
start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"
df = pl.DataFrame(
{
"dates": [
"2022-03-22 00:00:00",
"2022-03-23 00:00:00",
"2022-03-24 00:00:00",
"2022-03-25 00:00:00",
"2022-03-26 00:00:00",
"2022-03-27 00:00:00",
"2022-03-28 00:00:00",
]
}
)
df.with_column(pl.col("dates").is_between(start_date,end_date)).filter(pl.col("is_between") == True)
shape: (4, 2)
┌─────────────────────┬────────────┐
│ dates ┆ is_between │
│ --- ┆ --- │
│ str ┆ bool │
╞═════════════════════╪════════════╡
│ 2022-03-23 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-24 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-25 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-26 00:00:00 ┆ true │
└─────────────────────┴────────────┘