I wanted to add/subtract the UTC
offset (usually in hours) to/from the datetime object in polars but I don't seem to see a way to do this. the UTC offset can be dynamic given there's Day Light Saving period comes into play in a calendar year. (e.g., EST/EDT maps to 5/4 hours of UTC
offset respectively).
from datetime import datetime
import pytz
import polars as pl
from datetime import date
# Make a datetime-only dataframe that covers DST period of year, in UTC time first.
df = pl.DataFrame(
pl.date_range(low=date(2022,1,3),
high=date(2022,9,30),
interval="5m",
time_unit="ns",
time_zone="UTC")
.alias("timestamp")
)
# Convert timezone to "America/New_York", which covers both EST and EDT.
us_df = df.with_column(
pl.col("timestamp")
.dt
.cast_time_zone(tz="America/New_York")
.alias("datetime")
)
# Check us_df output
us_df
# output, here `polars` is showing US time without the UTC offset
# Before 0.14.22 `polars` is showing time with UTC offset
# i.e., `23:45:00 UTC` should be `19:45:00 EDT`
# Now `polars` is showing `15:45:00 EDT`, without 4 hours of offset
┌─────────────────────────┬────────────────────────────────┐
│ timestamp ┆ datetime │
│ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 14:00:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 14:05:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 14:10:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 14:15:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 15:45:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 15:50:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 15:55:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 16:00:00 EDT │
└─────────────────────────┴────────────────────────────────┘
Converting to_pandas
, we should observe that the underlying datetime
object does not include that 4 hours of offset in the actual time as well (remember EST is also in this dataframe, and it has a 5-hour offset).
# Convert to pandas
us_pd = us_df.to_pandas()
us_pd
# output
timestamp datetime
0 2022-01-03 00:00:00 00:00 2022-01-02 14:00:00-05:00
1 2022-01-03 00:05:00 00:00 2022-01-02 14:05:00-05:00
2 2022-01-03 00:10:00 00:00 2022-01-02 14:10:00-05:00
3 2022-01-03 00:15:00 00:00 2022-01-02 14:15:00-05:00
4 2022-01-03 00:20:00 00:00 2022-01-02 14:20:00-05:00
... ... ...
77756 2022-09-29 23:40:00 00:00 2022-09-29 15:40:00-04:00
77757 2022-09-29 23:45:00 00:00 2022-09-29 15:45:00-04:00
77758 2022-09-29 23:50:00 00:00 2022-09-29 15:50:00-04:00
77759 2022-09-29 23:55:00 00:00 2022-09-29 15:55:00-04:00
77760 2022-09-30 00:00:00 00:00 2022-09-29 16:00:00-04:00
What I wanted was to include the UTC
offset into the actual time, such that I can do filtering on the time (in a natural way). For instance, if I am seeing 2300UTC is 1900EDT, I can use filter using 1900 directly (please note I can't just add/substract the UTC
offset on the fly during filtering, as the number of hours is a dynamic variable given DST).
The underlying python datetime
does have utcoffset
function, which can be applied on each datetime object, but I'd need to convert polars
to pandas
first (I don't see how to do this within polars
).
I've also observed this peculiar difference:
us_pd.datetime[us_pd.shape[0]-1].to_pydatetime()
# We can see it is identical to what's already in `polars` and `pandas` dataframe.
datetime.datetime(2022, 9, 29, 16, 0, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)
# Now we create a single datetime object with arbitrary UTC time and convert it to New York time
datetime(2022, 9, 30, 22, 45, 0,0, pytz.utc).astimezone(pytz.timezone("America/New_York"))
# The representation here is actually the correct New York time (as in, the offset has been included)
datetime.datetime(2022, 9, 30, 18, 45, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)
CodePudding user response:
It seems you're looking for with_time_zone
;
from datetime import date
import polars as pl
df = pl.DataFrame(
pl.date_range(
low=date(2022, 1, 3),
high=date(2022, 9, 30),
interval="5m",
time_unit="ns",
time_zone="UTC",
).alias("timestamp")
)
us_df = df.with_column(
pl.col("timestamp").dt.with_time_zone(tz="America/New_York").alias("datetime")
)
us_df
shape: (77761, 2)
┌─────────────────────────┬────────────────────────────────┐
│ timestamp ┆ datetime │
│ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 19:00:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 19:05:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 19:10:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 19:15:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 19:45:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 19:50:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 19:55:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 20:00:00 EDT │
└─────────────────────────┴────────────────────────────────┘
[opinionated] As of polars v0.14.22, the docs seem to be a bit sparse on the topic. with_time_zone
is supposed to not modify the timestamp, while cast_time_zone
is. Given the results shown here, that is confusing to me. What do polar developers understand by "(not) modify a timestamp"? It would probably be more intuitive to use something like pandas' tz_localize and tz_convert (set vs. convert the tz).
CodePudding user response:
Good question.
I think it will be helpful to go through some fundamentals to understand this fully.
I've made the date range monthly to make it easier to follow.
Firstly we create a datetime range and give it a UTC timezone:
from datetime import datetime
import pytz
import polars as pl
from datetime import date
df = (
pl.DataFrame(
pl.date_range(low=date(2022,2,3),
high=date(2022,5,30),
interval="1mo",
)
.alias("utc")
)
.with_column(
pl.col("utc").cast(pl.Datetime)
)
.with_column(
pl.col("utc").dt.with_time_zone("UTC")
)
)
shape: (4, 1)
┌─────────────────────────┐
│ utc │
│ --- │
│ datetime[μs, UTC] │
╞═════════════════════════╡
│ 2022-02-03 00:00:00 UTC │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-03 00:00:00 UTC │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-04-03 00:00:00 UTC │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-05-03 00:00:00 UTC │
└─────────────────────────┘
What is a datetime column? In a simple way we can think of it as a mapping from an underlying integer representation to a datetime string.
When we set a timezone using with_time_zone
we tell Polars to adjust the mapping so that the same integers map to different datetime strings. In the case of UTC the only difference in the string is that we add UTC to the end.
If we want to get the time difference between two time zones we can use cast_time_zone
. This function tells Polars to
- get the time offset between the UTC and New York time zones (say 5 hours in microseconds in winter) and
- change the integer representation for New York by 5 hours in microseconds
However, as we have changed the integer representation the string representation must also update. So in your UTC timestamp column above you have a string for midnight on the first row. In the datetime column you have a string for 10 hours earlier at 14:00 EST - because the integer representation has reduced by 5 hours plus EST is 5 hours behind UTC!
Nonetheless you are on the right track with cast_time_zone
- the integer representation has changed by the offset between the time zones. However, it doesn't make immediate sense when you look at the output in its string representation!
So to get the offset in hours between UTC and New York we:
- cast_time_zone from UTC to New York
- subtract their physical integer representations
- divide by (microseconds per second * seconds per hour) to get the offset in hours
I've done it below with the intermediate steps for clarity, you could do everything in a single expression.
# Cast timezone to "America/New_York", which covers both EST and EDT.
us_df = (
# Cast UTC to NYC to get the offset in the integer representation
df.with_column(
pl.col("utc")
.dt
.cast_time_zone(tz="America/New_York")
.alias("nyc")
)
# Add columns that display the integer representations
.with_column(
pl.all().to_physical().suffix("_physical")
)
# Subtract the integer representations to get the offset in hours
.with_column(
((pl.col("utc_physical") - pl.col("nyc_physical"))/(3600*1e6)).alias("diff_in_hours")
)
shape: (4, 5)
┌─────────────────────────┬───────────────────────────┬──────────────────┬──────────────────┬──────┐
│ utc ┆ nyc ┆ utc_phys ┆ nyc_phys ┆ diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs, UTC] ┆ datetime[μs, ┆ i64 ┆ i64 ┆ f64 │
│ ┆ America/New_York] ┆ ┆ ┆ │
╞═════════════════════════╪═══════════════════════════╪══════════════════╪══════════════════╪══════╡
│ 2022-02-03 00:00:00 UTC ┆ 2022-02-02 14:00:00 EST ┆ 1643846400000000 ┆ 1643828400000000 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-03 00:00:00 UTC ┆ 2022-03-02 14:00:00 EST ┆ 1646265600000000 ┆ 1646247600000000 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-04-03 00:00:00 UTC ┆ 2022-04-02 16:00:00 EDT ┆ 1648944000000000 ┆ 1648929600000000 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-05-03 00:00:00 UTC ┆ 2022-05-02 16:00:00 EDT ┆ 1651536000000000 ┆ 1651521600000000 ┆ 4.0 │
└─────────────────────────┴───────────────────────────┴──────────────────┴──────────────────┴──────