Converting a datetime column with UTC timezone to another TimeZone based on a column of the Datafram-CodePudding

I have a Pandas dateframe which has two columns, with column names 'DateTimeInUTC' and 'TimeZone'. 'DateTimeInUTC' is the date and time of the instance in UTC, and 'TimeZone' is the time zone of the location of the instance.

An instance in the dataframe could be like this:

DateTimeInUTC: '2019-12-31 07:00:00'
TimeZone: 'US/Eastern'

I want to add another column to the dataframe with dType: datetime64 which converts 'DateTimeInUTC' to the specified time zone in that instance.

I tried using the method Pandas.tz_convert() but it takes the timezone as an argument and not as another column in the dataframe

EDIT: My best solution so far is to split the dataframe by timezone using pandas select statements and then apply the timezone on each dataframe with the same timezone, and then concatenate all the dataframes

CodePudding user response：

Approach with groupby():

import pytz
import random
import time

tic = time.perf_counter()
ltz = len(pytz.all_timezones) - 1
length = 7 * 10 ** 6
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None
# generate the dummy data
df = pd.DataFrame({'DateTimeInUTC': pd.date_range('01.01.2000', periods=length, freq='T', tz='UTC'),
                   'TimeZone': [pytz.all_timezones[random.randint(0, ltz)] for tz in range(length)]})

toc = time.perf_counter()
print(f"Generated the df in {toc - tic:0.4f} seconds\n")

tic = time.perf_counter()

df['Converted'] = df.groupby('TimeZone')['DateTimeInUTC'].apply(lambda x: x.dt.tz_convert(x.name).dt.tz_localize(None))

print(df)

toc = time.perf_counter()
print(f"\nConverted the df in {toc - tic:0.4f} seconds")

Output:

Generated the df in 6.3333 seconds

                    DateTimeInUTC            TimeZone           Converted
0       2000-01-01 00:00:00 00:00      Asia/Qyzylorda 2000-01-01 05:00:00
1       2000-01-01 00:01:00 00:00     America/Moncton 1999-12-31 20:01:00
2       2000-01-01 00:02:00 00:00     America/Cordoba 1999-12-31 21:02:00
3       2000-01-01 00:03:00 00:00        Africa/Dakar 2000-01-01 00:03:00
4       2000-01-01 00:04:00 00:00      Pacific/Wallis 2000-01-01 12:04:00
...                           ...                 ...                 ...
6999995 2013-04-23 02:35:00 00:00      America/Guyana 2013-04-22 22:35:00
6999996 2013-04-23 02:36:00 00:00  America/St_Vincent 2013-04-22 22:36:00
6999997 2013-04-23 02:37:00 00:00             MST7MDT 2013-04-22 20:37:00
6999998 2013-04-23 02:38:00 00:00  Antarctica/McMurdo 2013-04-23 14:38:00
6999999 2013-04-23 02:39:00 00:00    America/Atikokan 2013-04-22 21:39:00

[7000000 rows x 3 columns]

Converted the df in 4.1579 seconds

CodePudding user response：

This solution works, but it is probably not optimal:

let's assume weatherDf is my dataframe, which has these columns: DateTimeInUTC and TimeZone

timezones = weatherDf['TimeZone'].unique()
weatherDfs = []
for timezone in timezones:
    tempDf = weatherDf[weatherDf['TimeZone'] == timezone]
    tempDf['DateTimeInTimeZone'] = tempDf['DateTimeInUTC'].dt.tz_convert(timezone)
    weatherDfs.append(tempDf)
weatherDfConverted = pd.concat(weatherDfs)

This solution converted around 7 million instances on my system in around 40 seconds