Home > Software engineering >  polars extracts wrong date from timezone aware datetime column
polars extracts wrong date from timezone aware datetime column

Time:02-02

I am trying to move from pandas to polars but I am running into the following issue.

import polars as pl

df = pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00 00:00",
            "2010-02-01T00:00:00 00:00",
            "2010-02-01T01:00:00 00:00"
        ]
    }
)
df = df.with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
)

Yields the following dataframe:

>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date                           │
│ ---     ┆ ---                            │
│ i64     ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        │
│ 2       ┆ 2010-02-01 01:00:00 CET        │
│ 3       ┆ 2010-02-01 02:00:00 CET        │
└─────────┴────────────────────────────────┘

As you can see, I transformed the datetime string from UTC to CET succesfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:

df = df.with_columns(
    [
        pl.col("date").cast(pl.Date).alias("valueDay")
    ]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date                           ┆ valueDay   │
│ ---     ┆ ---                            ┆ ---        │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ date       │
╞═════════╪════════════════════════════════╪════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 2010-01-31 │
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 2010-02-01 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘

The valueDay should be 2010-02-01 for all 3 values.

Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df or is there a way to chain all of this?

Edit:

I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html

df = pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00 00:00",
            "2010-02-01T00:00:00 00:00",
            "2010-02-01T01:00:00 00:00"
        ]
    }
)
df = df.with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
)
df = df.with_columns(
    [
        pl.col("date").dt.day().alias("day"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.year().alias("year"),
    ]
)
df = df.with_columns(
    pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
    [
        pl.col("datetime").cast(pl.Date).alias("valueDay")
    ]
)

Yields the following:

>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date                           ┆ day ┆ month ┆ year ┆ datetime            ┆ valueDay   │
│ ---     ┆ ---                            ┆ --- ┆ ---   ┆ ---  ┆ ---                 ┆ ---        │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32   ┆ i32  ┆ datetime[μs]        ┆ date       │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘

CodePudding user response:

Would this temporary workaround help? Starting with this data:

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "date": pl.date_range(
            datetime(2010, 1, 30, 22, 0, 0),
            datetime(2010, 2, 1, 2, 0, 0),
            "1h",
        ).dt.with_time_zone("Europe/Amsterdam"),
    }
)
df
shape: (29, 1)
┌────────────────────────────────┐
│ date                           │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2010-01-30 23:00:00 CET        │
│ 2010-01-31 00:00:00 CET        │
│ 2010-01-31 01:00:00 CET        │
│ 2010-01-31 02:00:00 CET        │
│ 2010-01-31 03:00:00 CET        │
│ 2010-01-31 04:00:00 CET        │
│ 2010-01-31 05:00:00 CET        │
│ 2010-01-31 06:00:00 CET        │
│ 2010-01-31 07:00:00 CET        │
│ 2010-01-31 08:00:00 CET        │
│ 2010-01-31 09:00:00 CET        │
│ 2010-01-31 10:00:00 CET        │
│ 2010-01-31 11:00:00 CET        │
│ 2010-01-31 12:00:00 CET        │
│ 2010-01-31 13:00:00 CET        │
│ 2010-01-31 14:00:00 CET        │
│ 2010-01-31 15:00:00 CET        │
│ 2010-01-31 16:00:00 CET        │
│ 2010-01-31 17:00:00 CET        │
│ 2010-01-31 18:00:00 CET        │
│ 2010-01-31 19:00:00 CET        │
│ 2010-01-31 20:00:00 CET        │
│ 2010-01-31 21:00:00 CET        │
│ 2010-01-31 22:00:00 CET        │
│ 2010-01-31 23:00:00 CET        │
│ 2010-02-01 00:00:00 CET        │
│ 2010-02-01 01:00:00 CET        │
│ 2010-02-01 02:00:00 CET        │
│ 2010-02-01 03:00:00 CET        │
└────────────────────────────────┘

You can extract the date using

(
    df.with_columns(
        pl.col("date")
        .dt.cast_time_zone("UTC")
        .cast(pl.Date)
        .alias("trunc_date")
    )
)
shape: (29, 2)
┌────────────────────────────────┬────────────┐
│ date                           ┆ trunc_date │
│ ---                            ┆ ---        │
│ datetime[μs, Europe/Amsterdam] ┆ date       │
╞════════════════════════════════╪════════════╡
│ 2010-01-30 23:00:00 CET        ┆ 2010-01-30 │
│ 2010-01-31 00:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 01:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 02:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 03:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 04:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 05:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 06:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 07:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 08:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 09:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 10:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 11:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 12:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 13:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 14:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 15:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 16:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 17:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 18:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 19:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 20:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 21:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 22:00:00 CET        ┆ 2010-01-31 │
│ 2010-01-31 23:00:00 CET        ┆ 2010-01-31 │
│ 2010-02-01 00:00:00 CET        ┆ 2010-02-01 │
│ 2010-02-01 01:00:00 CET        ┆ 2010-02-01 │
│ 2010-02-01 02:00:00 CET        ┆ 2010-02-01 │
│ 2010-02-01 03:00:00 CET        ┆ 2010-02-01 │
└────────────────────────────────┴────────────┘

CodePudding user response:

There's nothing wrong with the cast function per se. It's just not intended to be used for truncating the time out of a datetime.

As it happens the function you're looking for is truncate so this is what you want to do (in the last with_columns chunk)

pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00 00:00",
            "2010-02-01T00:00:00 00:00",
            "2010-02-01T01:00:00 00:00"
        ]
    }
).with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
).with_columns(
    [
        pl.col("date").dt.truncate('1d').alias("valueDay")
    ]
)
  • Related