Suppose we have a dataset with a UNIX timestamp in milliseconds:
data = [
{
"unix_ts": 1669291200000,
"val": 10
},
{
"unix_ts": 1669291260000,
"val": 25
}
]
We convert this to a pandas DataFrame and add a timezone-aware timestamp (datetime) column set to US/Eastern:
import pandas as pd
df = pd.DataFrame(data)
df['ET'] = pd.to_datetime(df['unix_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
unix_ts val ET
0 1669291200000 10 2022-11-24 07:00:00-05:00
1 1669291260000 25 2022-11-24 07:01:00-05:00
We can see that the ET time is 5 hours behind the UTC time represented by unix_ts.
Suppose we want a new integer column whose value reflects that 5-hour difference. Naively, we could do this:
df['adjusted_ts'] = df['unix_ts'] - (3600000 * 5)
# Include column to allow us to check the result by eye.
df['Check_ET'] = pd.to_datetime(df['adjusted_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Result:
unix_ts val ET adjusted_ts Check_ET
0 1669291200000 10 2022-11-24 07:00:00-05:00 1669273200000 2022-11-24 02:00:00-05:00
1 1669291260000 25 2022-11-24 07:01:00-05:00 1669273260000 2022-11-24 02:01:00-05:00
We can see from the Check_ET column that this "works", but it won't when we get to the part of the year when US/Eastern is only 4 hours behind UTC.
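To illustrate that failure, here is a minimal sketch using a summer timestamp that is not part of the dataset above (1656658800000, i.e. 2022-07-01 07:00 UTC, chosen purely for illustration). It uses America/New_York, which names the same zone as US/Eastern, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical summer row: 2022-07-01 07:00:00 UTC, when Eastern is UTC-4.
summer = pd.DataFrame({"unix_ts": [1656658800000]})
summer["ET"] = (
    pd.to_datetime(summer["unix_ts"], unit="ms", utc=True)
    .dt.tz_convert("America/New_York")
)

# The naive fixed 5-hour adjustment from above.
summer["adjusted_ts"] = summer["unix_ts"] - 3600000 * 5

# Compare the adjusted value (read as a plain datetime) with the real
# Eastern wall-clock time (time zone dropped).
adjusted_as_datetime = pd.to_datetime(summer["adjusted_ts"], unit="ms")
wall_clock = summer["ET"].dt.tz_localize(None)
mismatch = wall_clock - adjusted_as_datetime
print(mismatch.iloc[0])
```

During daylight saving time the fixed adjustment ends up one hour away from the actual Eastern wall-clock time.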
It would be handy to be able to do something like this:
import numpy as np
df['smart_adjusted_ts'] = (df['ET'].astype(np.int64) // 10 ** 9) * 1000
But, sadly, that's not so 'smart': it results in a column that's identical to the original unix_ts. Regardless of the ET column's time zone, the underlying NumPy data is always stored as nanoseconds since the epoch in UTC:
unix_ts ... Check_ET smart_adjusted_ts
0 1669291200000 ... 2022-11-24 02:00:00-05:00 1669291200000
1 1669291260000 ... 2022-11-24 02:01:00-05:00 1669291260000
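A small sketch of why that happens, using a single pandas Timestamp built from the first row's value (the variable names are just for illustration): converting between time zones changes how the instant is displayed, not the instant itself, so the epoch value stays the same.

```python
import pandas as pd

ts_utc = pd.Timestamp(1669291200000, unit="ms", tz="UTC")
ts_et = ts_utc.tz_convert("America/New_York")

print(ts_utc)  # displayed in UTC
print(ts_et)   # same instant, displayed in Eastern time

# .value is nanoseconds since the epoch in UTC, regardless of the time zone.
same_instant = ts_utc == ts_et
same_epoch = ts_utc.value == ts_et.value
print(same_instant, same_epoch)
```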
So, unless there is a special method to do this (Anyone?), my only thought is to go back to the original approach but dynamically extract the UTC offset (ideally as ints: 4 or 5) from the ET column.
The problem is, I can't find how to do that either, but I'm hoping to achieve something like this:
df['adjusted_ts'] = df['unix_ts'] - (3600000 * et_utc_abs_diff)
Please be aware that a dataset could include dates with both (4 & 5 hour) differences, so it's important to get this difference on a row-by-row basis as opposed to having a master variable set to 4 or 5.
Any ideas for an elegant solution, please?
Edit
I came up with the following, and it gets the right result, but I suspect there must be a better way using standard Pandas methods.
df['adjusted_ts'] = df['unix_ts'] - (df['ET'].astype(str).str.slice(start=21, stop=22).astype(int) * 3600000)
Answer:
Here's a way to implement this by localizing to None, as I've described in the comments.
import pandas as pd
df = pd.DataFrame({"unix_ts": [1651363200000, 1669291260000],
"val": [10, 25]})
df["ET"] = pd.to_datetime(df["unix_ts"], unit='ms', utc=True).dt.tz_convert("America/New_York")
# df["ET"]
# 0 2022-04-30 20:00:00-04:00
# 1 2022-11-24 07:01:00-05:00
# Name: ET, dtype: datetime64[ns, America/New_York]
# we can remove the time zone to get naive datetime. pandas will treat this as UTC
df["ET_naive"] = df["ET"].dt.tz_localize(None)
# df
# unix_ts val ET ET_naive
# 0 1651363200000 10 2022-04-30 20:00:00-04:00 2022-04-30 20:00:00
# 1 1669291260000 25 2022-11-24 07:01:00-05:00 2022-11-24 07:01:00
# now we can convert back to units of time since the epoch,
# only that the epoch is now eastern time:
df["ET_epochtime"] = df["ET_naive"].astype("int64") // 10**6 # integer division: ns -> ms (assumes nanosecond resolution)
# df["ET_epochtime"] correctly accounts for DST offset:
(df["unix_ts"]-df["ET_epochtime"])/3600000
# 0 4.0
# 1 5.0
# dtype: float64
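Putting the steps above together, here is one possible sketch that produces both the adjusted integer timestamp and a per-row integer offset. The column name et_utc_abs_diff is borrowed from the question. Subtracting the epoch and dividing by a Timedelta is a substitution on my part for the astype("int64") step, so the result should not depend on the datetime resolution pandas picks.

```python
import pandas as pd

df = pd.DataFrame({"unix_ts": [1651363200000, 1669291260000],
                   "val": [10, 25]})
df["ET"] = (
    pd.to_datetime(df["unix_ts"], unit="ms", utc=True)
    .dt.tz_convert("America/New_York")
)

# Eastern wall-clock time with the time zone dropped.
et_naive = df["ET"].dt.tz_localize(None)

# Milliseconds between 1970-01-01 00:00 and the Eastern wall-clock time.
epoch = pd.Timestamp("1970-01-01")
df["adjusted_ts"] = (et_naive - epoch) // pd.Timedelta(milliseconds=1)

# Whole hours that Eastern is behind UTC on each row.
df["et_utc_abs_diff"] = (df["unix_ts"] - df["adjusted_ts"]) // 3600000

print(df[["unix_ts", "adjusted_ts", "et_utc_abs_diff"]])
```

Because the offset is derived row by row from the converted timestamps, a dataset that spans a DST change gets 4 on some rows and 5 on others without any master variable.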