Create an integer timestamp that corresponds to a Pandas Timestamp's timezone-CodePudding

Suppose we have a dataset with a UNIX timestamp in milliseconds:

data = [
    {
      "unix_ts": 1669291200000,
      "val": 10
    },
    {
      "unix_ts": 1669291260000,
      "val": 25
    }
  ]

Which we convert to a Pandas dataframe with a Pandas timestamp (datetime) set to US/Eastern:

df = pd.DataFrame(data)
df['ET'] = pd.to_datetime(df['unix_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

    unix_ts         val ET
0   1669291200000   10  2022-11-24 07:00:00-05:00
1   1669291260000   25  2022-11-24 07:01:00-05:00

We can see that the ET time is 5 hours behind the UTC unix_ts

Suppose we want a new integer column with a value that corresponds with that -5 hours difference. Naively, we could do this:

df['adjusted_ts'] = df['unix_ts'] - (3600000 * 5)
# Include column to allow us to check the result by eye.
df['Check_ET'] = pd.to_datetime(df['adjusted_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

Result:

    unix_ts         val ET                          adjusted_ts           Check_ET
0   1669291200000   10  2022-11-24 07:00:00-05:00   1669273200000           2022-11-24 02:00:00-05:00
1   1669291260000   25  2022-11-24 07:01:00-05:00   1669273260000           2022-11-24 02:01:00-05:00

We can see from the Check_ET column that this "works", but it won't when we get to the part of the year when US/Eastern is only 4 hours behind UTC.

It would be handy to be able to do something like this:

import numpy as np
df['smart_adjusted_ts'] = (df['ET'].astype(np.int64) // 10 ** 9) * 1000

But, sadly, that's not so 'smart', as it results in a column that's identical to the original unix_ts (Regardless of the ET column's timezone, the underlying data (NumPy) is always stored as nanoseconds since the EPOCH in UTC.):

    unix_ts         ... Check_ET                    smart_adjusted_ts
0   1669291200000   ... 2022-11-24 02:00:00-05:00   1669291200000
1   1669291260000   ... 2022-11-24 02:01:00-05:00   1669291260000

So, unless there is a special method to do this (Anyone?), my only thought is to go back to the original approach but dynamically extract the UTC offset (ideally as ints: 4 or 5) from the ET column.

The problem is, I can't find how to do that either, but I'm hoping to achieve something like this:

df['adjusted_ts'] = df['unix_ts'] - (3600000 * et_utc_abs_diff)

Please be aware that a dataset could include dates with both (4 & 5 hour) differences, so it's important to get this difference on a row-by-row basis as opposed to having a master variable set to 4 or 5.

Any ideas for an elegant solution, please?

Edit

I came up with the following, and it gets the right result, but I suspect there must be a better way using standard Pandas methods.

df['adjusted_ts'] = df['unix_ts'] - (df['ET'].astype(str).str.slice(start=21, stop=22).astype(int) * 3600000)

CodePudding user response：

Here's a way to implement this by localizing to None, as I've described in the comments.

import pandas as pd

df = pd.DataFrame({"unix_ts": [1651363200000, 1669291260000],
                   "val": [10, 25]})

df["ET"] = pd.to_datetime(df["unix_ts"], unit='ms', utc=True).dt.tz_convert("America/New_York")
# df["ET"]
# 0   2022-04-30 20:00:00-04:00
# 1   2022-11-24 07:01:00-05:00
# Name: ET, dtype: datetime64[ns, America/New_York]

# we can remove the time zone to get naive datetime. pandas will treat this as UTC
df["ET_naive"] = df["ET"].dt.tz_localize(None)
# df
#          unix_ts  val                        ET            ET_naive
# 0  1669291200000   10 2022-11-24 07:00:00-05:00 2022-11-24 07:00:00
# 1  1669291260000   25 2022-11-24 07:01:00-05:00 2022-11-24 07:01:00

# now we can convert back to units of time since the epoch, 
# only that the epoch is now eastern time:
df["ET_epochtime"] = df["ET_naive"].astype("int64") / 1e6 # division gives milliseconds

# df["ET_epochtime"] correctly accounts for DST offset:
(df["unix_ts"]-df["ET_epochtime"])/3600000
# 0    4.0
# 1    5.0
# dtype: float64