I am trying to move from pandas to polars but I am running into the following issue.
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00 00:00",
"2010-02-01T00:00:00 00:00",
"2010-02-01T01:00:00 00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
Yields the following dataframe:
>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET │
│ 2 ┆ 2010-02-01 01:00:00 CET │
│ 3 ┆ 2010-02-01 02:00:00 CET │
└─────────┴────────────────────────────────┘
As you can see, I transformed the datetime string from UTC to CET succesfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:
df = df.with_columns(
[
pl.col("date").cast(pl.Date).alias("valueDay")
]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date ┆ valueDay │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ date │
╞═════════╪════════════════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 2010-01-31 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘
The valueDay
should be 2010-02-01 for all 3 values.
Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df
or is there a way to chain all of this?
Edit:
I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date
like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00 00:00",
"2010-02-01T00:00:00 00:00",
"2010-02-01T01:00:00 00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
df = df.with_columns(
pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
[
pl.col("datetime").cast(pl.Date).alias("valueDay")
]
)
Yields the following:
>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date ┆ day ┆ month ┆ year ┆ datetime ┆ valueDay │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 ┆ datetime[μs] ┆ date │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘
CodePudding user response:
Would this temporary workaround help? Starting with this data:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"date": pl.date_range(
datetime(2010, 1, 30, 22, 0, 0),
datetime(2010, 2, 1, 2, 0, 0),
"1h",
).dt.with_time_zone("Europe/Amsterdam"),
}
)
df
shape: (29, 1)
┌────────────────────────────────┐
│ date │
│ --- │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2010-01-30 23:00:00 CET │
│ 2010-01-31 00:00:00 CET │
│ 2010-01-31 01:00:00 CET │
│ 2010-01-31 02:00:00 CET │
│ 2010-01-31 03:00:00 CET │
│ 2010-01-31 04:00:00 CET │
│ 2010-01-31 05:00:00 CET │
│ 2010-01-31 06:00:00 CET │
│ 2010-01-31 07:00:00 CET │
│ 2010-01-31 08:00:00 CET │
│ 2010-01-31 09:00:00 CET │
│ 2010-01-31 10:00:00 CET │
│ 2010-01-31 11:00:00 CET │
│ 2010-01-31 12:00:00 CET │
│ 2010-01-31 13:00:00 CET │
│ 2010-01-31 14:00:00 CET │
│ 2010-01-31 15:00:00 CET │
│ 2010-01-31 16:00:00 CET │
│ 2010-01-31 17:00:00 CET │
│ 2010-01-31 18:00:00 CET │
│ 2010-01-31 19:00:00 CET │
│ 2010-01-31 20:00:00 CET │
│ 2010-01-31 21:00:00 CET │
│ 2010-01-31 22:00:00 CET │
│ 2010-01-31 23:00:00 CET │
│ 2010-02-01 00:00:00 CET │
│ 2010-02-01 01:00:00 CET │
│ 2010-02-01 02:00:00 CET │
│ 2010-02-01 03:00:00 CET │
└────────────────────────────────┘
You can extract the date using
(
df.with_columns(
pl.col("date")
.dt.cast_time_zone("UTC")
.cast(pl.Date)
.alias("trunc_date")
)
)
shape: (29, 2)
┌────────────────────────────────┬────────────┐
│ date ┆ trunc_date │
│ --- ┆ --- │
│ datetime[μs, Europe/Amsterdam] ┆ date │
╞════════════════════════════════╪════════════╡
│ 2010-01-30 23:00:00 CET ┆ 2010-01-30 │
│ 2010-01-31 00:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 01:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 02:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 03:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 04:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 05:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 06:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 07:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 08:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 09:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 10:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 11:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 12:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 13:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 14:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 15:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 16:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 17:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 18:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 19:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 20:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 21:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 22:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 23:00:00 CET ┆ 2010-01-31 │
│ 2010-02-01 00:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 03:00:00 CET ┆ 2010-02-01 │
└────────────────────────────────┴────────────┘
CodePudding user response:
There's nothing wrong with the cast
function per se. It's just not intended to be used for truncating the time out of a datetime.
As it happens the function you're looking for is truncate so this is what you want to do (in the last with_columns
chunk)
pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00 00:00",
"2010-02-01T00:00:00 00:00",
"2010-02-01T01:00:00 00:00"
]
}
).with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
).with_columns(
[
pl.col("date").dt.truncate('1d').alias("valueDay")
]
)