Converting timestamps in large dataset to multiple timezones

I have a large dataset with about 9 million rows and 4 columns, one of which is a UTC timestamp. The data was recorded at 507 sites across Australia, and there is a site ID column. A second dataset gives the timezone for each site ID in the format 'Australia/Brisbane'. I've written a function that adds a column to the main dataset containing the UTC timestamp converted to local time. However, the wrong local time is being matched up with the UTC timestamp, for example 2019-01-05 12:10:00+00:00 and 2019-01-13 18:55:00+11:00 (wrong timezone). I don't believe the sites are mixed up in the data, but I've tried sorting the data in case that was the problem. Below is my code and images of the first row of each dataset. Any help is much appreciated!

import pytz
from dateutil import tz

def update_timezone(df):
    newtimes = []
    df = df.sort_values('site_id')
    sites = df['site_id'].unique().tolist()
    for site in sites:
        timezone = solarbom.loc[solarbom['site_id'] == site].iloc[0, 1]
        dfsub = df[df['site_id'] == site].copy()
        dfsub['utc_timestamp'] = dfsub['utc_timestamp'].dt.tz_convert(timezone)
        newtimes.extend(dfsub['utc_timestamp'].tolist())
    df['newtimes'] = newtimes

[Images: main large dataset; site info dataset]

CodePudding user response：

IIUC, you're looking to group your data by ID, then convert the timestamp specific to each ID. You could achieve this by using groupby, then applying a converter function to each group. Ex:

import pandas as pd

# dummy data:
df = pd.DataFrame({'utc_timestamp': [pd.Timestamp("2022-01-01 00:00 Z"),
                                     pd.Timestamp("2022-01-01 01:00 Z"),
                                     pd.Timestamp("2022-01-05 00:00 Z"),
                                     pd.Timestamp("2022-01-03 00:00 Z"),
                                     pd.Timestamp("2022-01-03 01:00 Z"),
                                     pd.Timestamp("2022-01-03 02:00 Z")],
                   'site_id': [1, 1, 5, 3, 3, 3],
                   'values': [11, 11, 55, 33, 33, 33]})

# time zone info for each ID:
timezdf = pd.DataFrame({'site_id': [1, 3, 5],
                        'timezone_id_x': ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"]})

### what we want:
# for row, data in timezdf.iterrows():
#     print(f"ID: {data['site_id']}, tz: {data['timezone_id_x']}")
#     print(pd.Timestamp("2022-01-01 00:00 Z"), "to", pd.Timestamp("2022-01-01 00:00 Z").tz_convert(data['timezone_id_x']))

# ID: 1, tz: Australia/Adelaide
# 2022-01-01 00:00:00+00:00 to 2022-01-01 10:30:00+10:30
# ID: 3, tz: Australia/Perth
# 2022-01-01 00:00:00+00:00 to 2022-01-01 08:00:00+08:00
# ID: 5, tz: Australia/Darwin
# 2022-01-01 00:00:00+00:00 to 2022-01-01 09:30:00+09:30
###

def converter(group, timezdf):
    # get the time zone by looking for the current group ID in timezdf
    z = timezdf.loc[timezdf["site_id"] == group["site_id"].iloc[0], 'timezone_id_x'].iloc[0]
    group["localtime"] = group["localtime"].dt.tz_convert(z)
    return group

df["localtime"] = df["utc_timestamp"]
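# group_keys=False keeps the original row index; pandas >= 2.0 would otherwise
# prepend site_id as an extra index level
df = df.groupby("site_id", group_keys=False).apply(lambda g: converter(g, timezdf))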

Now df looks like this:

df
Out[71]: 
              utc_timestamp  site_id  values                  localtime
0 2022-01-01 00:00:00+00:00        1      11  2022-01-01 10:30:00+10:30
1 2022-01-01 01:00:00+00:00        1      11  2022-01-01 11:30:00+10:30
2 2022-01-05 00:00:00+00:00        5      55  2022-01-05 09:30:00+09:30
3 2022-01-03 00:00:00+00:00        3      33  2022-01-03 08:00:00+08:00
4 2022-01-03 01:00:00+00:00        3      33  2022-01-03 09:00:00+08:00
5 2022-01-03 02:00:00+00:00        3      33  2022-01-03 10:00:00+08:00
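
With 9 million rows and 507 sites, it may be cheaper to loop over the distinct time zones than over the sites, since many sites share a zone. Below is a minimal sketch of that idea, not a tested solution for the real data. The function name add_local_time and the column name local_naive are made up for illustration, and the column names site_id, utc_timestamp and timezone_id_x are assumed to match the dummy data above. It stores the local wall-clock time without an offset, so the new column keeps a single datetime dtype instead of holding mixed-zone objects.

```python
import pandas as pd

# Dummy data with the same shape as the example above.
df = pd.DataFrame({
    "utc_timestamp": pd.to_datetime(
        ["2022-01-01 00:00", "2022-01-01 01:00", "2022-01-05 00:00",
         "2022-01-03 00:00", "2022-01-03 01:00", "2022-01-03 02:00"],
        utc=True,
    ),
    "site_id": [1, 1, 5, 3, 3, 3],
    "values": [11, 11, 55, 33, 33, 33],
})
timezdf = pd.DataFrame({
    "site_id": [1, 3, 5],
    "timezone_id_x": ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"],
})


def add_local_time(df, timezdf):
    """Return a copy of df with a naive local wall-clock column (hypothetical helper)."""
    out = df.copy()
    # Look up each row's zone name by site_id; rows stay in their original order.
    site_to_tz = timezdf.set_index("site_id")["timezone_id_x"]
    tz = out["site_id"].map(site_to_tz)

    # Start from the UTC wall time so the dtype matches the converted values;
    # rows whose site has no known zone become NaT.
    local_naive = out["utc_timestamp"].dt.tz_localize(None).where(tz.notna())

    # One vectorised conversion per distinct zone, written back through a
    # boolean mask so every value lands on the row it came from.
    for zone in tz.dropna().unique():
        mask = tz == zone
        converted = out.loc[mask, "utc_timestamp"].dt.tz_convert(zone)
        # tz_localize(None) drops the offset but keeps the local wall time.
        local_naive.loc[mask] = converted.dt.tz_localize(None).to_numpy()

    out["local_naive"] = local_naive
    return out


result = add_local_time(df, timezdf)
print(result)
```

Because the values are written back through a mask rather than collected in a list, the result does not depend on how the rows are sorted. If the offset itself is needed, the zone name from the lookup could be kept as an extra column next to the naive local time.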