Localize time zone based on column in pandas-CodePudding

I am trying to set timezone to a datetime column, based on another column containing the time zone.

Example data:

  DATETIME                VALUE          TIME_ZONE
0   2021-05-01 00:00:00    1.00      Europe/Athens
1   2021-05-01 00:00:00    2.13      Europe/London
2   2021-05-01 00:00:00    5.13      Europe/London
3   2021-05-01 01:00:00    4.25      Europe/Dublin
4   2021-05-01 01:00:00    4.25       Europe/Paris

I am trying to assign a time zone to the DATETIME column, but using the tz_localize method, I cannot avoid using an apply call, which will be very slow on my large dataset. Is there some way to do this without using apply?

What I have now (which is slow):

df['DATETIME_WITH_TZ'] = df.apply(lambda row: row['DATETIME'].tz_localize(row['TIME_ZONE']), axis=1)

CodePudding user response：

I'm not sure but a listcomp seems to be x17 faster than apply in your case :

df["DATETIME_WITH_TZ"] = [dt.tz_localize(tz)
                          for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]

Another variant, with tz_convert :

df["DATETIME_WITH_TZ"] = [dt.tz_localize("UTC").tz_convert(tz)
                          for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]

Timing :

#%%timeit #listcomp1
47.4 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

#%%timeit #listcomp2
25.7 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

#%%timeit #apply
457 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Output :

print(df)
             DATETIME  VALUE      TIME_ZONE           DATETIME_WITH_TZ
0 2021-05-01 00:00:00   1.00  Europe/Athens  2021-05-01 03:00:00 03:00
1 2021-05-01 00:00:00   2.13  Europe/London  2021-05-01 01:00:00 01:00
2 2021-05-01 00:00:00   5.13  Europe/London  2021-05-01 01:00:00 01:00
3 2021-05-01 01:00:00   4.25  Europe/Dublin  2021-05-01 02:00:00 01:00
4 2021-05-01 01:00:00   4.25   Europe/Paris  2021-05-01 03:00:00 02:00