I have a large dataset with ~9 million rows and 4 columns, one of which is a UTC timestamp. The data was recorded at 507 sites across Australia, and there is a site ID column. I have another dataset that holds the time zone for each site ID in the format 'Australia/Brisbane'. I've written a function to create a new column in the main dataset that holds the UTC timestamp converted to local time. However, the wrong local time is being matched up with the UTC timestamp, for example 2019-01-05 12:10:00+00:00 and 2019-01-13 18:55:00+11:00 (wrong time zone). I don't believe sites are mixed up in the data, but I've tried sorting the data in case that was the problem. Below is my code and images of the first row of each dataset. Any help is much appreciated!
import pytz
from dateutil import tz

def update_timezone(df):
    newtimes = []
    df = df.sort_values('site_id')
    sites = df['site_id'].unique().tolist()
    for site in sites:
        timezone = solarbom.loc[solarbom['site_id'] == site].iloc[0, 1]
        dfsub = df[df['site_id'] == site].copy()
        dfsub['utc_timestamp'] = dfsub['utc_timestamp'].dt.tz_convert(timezone)
        newtimes.extend(dfsub['utc_timestamp'].tolist())
    df['newtimes'] = newtimes
[Images: first row of the main large dataset; first row of the site info dataset]
Answer:
IIUC, you want to group your data by site ID, then convert the timestamps using the time zone that belongs to each ID. You can do this with groupby, applying a converter function to each group. Ex:
import pandas as pd
# dummy data:
df = pd.DataFrame({'utc_timestamp': [pd.Timestamp("2022-01-01 00:00 Z"),
                                     pd.Timestamp("2022-01-01 01:00 Z"),
                                     pd.Timestamp("2022-01-05 00:00 Z"),
                                     pd.Timestamp("2022-01-03 00:00 Z"),
                                     pd.Timestamp("2022-01-03 01:00 Z"),
                                     pd.Timestamp("2022-01-03 02:00 Z")],
                   'site_id': [1, 1, 5, 3, 3, 3],
                   'values': [11, 11, 55, 33, 33, 33]})
# time zone info for each ID:
timezdf = pd.DataFrame({'site_id': [1, 3, 5],
                        'timezone_id_x': ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"]})
### what we want:
# for row, data in timezdf.iterrows():
#     print(f"ID: {data['site_id']}, tz: {data['timezone_id_x']}")
#     print(pd.Timestamp("2022-01-01 00:00 Z"), "to", pd.Timestamp("2022-01-01 00:00 Z").tz_convert(data['timezone_id_x']))
# ID: 1, tz: Australia/Adelaide
# 2022-01-01 00:00:00+00:00 to 2022-01-01 10:30:00+10:30
# ID: 3, tz: Australia/Perth
# 2022-01-01 00:00:00+00:00 to 2022-01-01 08:00:00+08:00
# ID: 5, tz: Australia/Darwin
# 2022-01-01 00:00:00+00:00 to 2022-01-01 09:30:00+09:30
###
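def converter(group, timezdf):
    # get the time zone by looking for the current group ID in timezdf
    z = timezdf.loc[timezdf["site_id"] == group["site_id"].iloc[0], 'timezone_id_x'].iloc[0]
    # convert this group's timestamps from UTC to the site's local time zone
    group["localtime"] = group["localtime"].dt.tz_convert(z)
    return group

# start from a copy of the UTC column, then convert it group by group
df["localtime"] = df["utc_timestamp"]
# group_keys=False keeps the original row index in the result on newer pandas versions
df = df.groupby("site_id", group_keys=False).apply(lambda g: converter(g, timezdf))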
now df looks like
df
Out[71]:
              utc_timestamp  site_id  values                  localtime
0 2022-01-01 00:00:00+00:00        1      11  2022-01-01 10:30:00+10:30
1 2022-01-01 01:00:00+00:00        1      11  2022-01-01 11:30:00+10:30
2 2022-01-05 00:00:00+00:00        5      55  2022-01-05 09:30:00+09:30
3 2022-01-03 00:00:00+00:00        3      33  2022-01-03 08:00:00+08:00
4 2022-01-03 01:00:00+00:00        3      33  2022-01-03 09:00:00+08:00
5 2022-01-03 02:00:00+00:00        3      33  2022-01-03 10:00:00+08:00
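
With ~9 million rows and 507 sites, it might also be worth looping over the distinct time zones rather than over the sites, since there are only a handful of zones. Below is a rough sketch of that idea, not a drop-in solution. It assumes the column names from the dummy data above and a unique row index; `tz_by_site`, `row_tz` and `local_wall_time` are names made up for the illustration. It stores the local wall-clock time without the UTC offset, because a single column holding several different time zones falls back to object dtype.

```python
import pandas as pd

# small dummy data, same layout as above
df = pd.DataFrame({
    "utc_timestamp": pd.to_datetime(
        ["2022-01-01 00:00", "2022-01-01 01:00", "2022-01-05 00:00", "2022-01-03 00:00"],
        utc=True),
    "site_id": [1, 1, 5, 3],
})
timezdf = pd.DataFrame({
    "site_id": [1, 3, 5],
    "timezone_id_x": ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"],
})

# look up the time zone name for every row by site_id (hypothetical helper names)
tz_by_site = timezdf.set_index("site_id")["timezone_id_x"]
row_tz = df["site_id"].map(tz_by_site)

# convert once per distinct time zone, then drop the offset to keep naive local time
parts = [
    df.loc[row_tz == zone, "utc_timestamp"].dt.tz_convert(zone).dt.tz_localize(None)
    for zone in row_tz.dropna().unique()
]

# assignment aligns on the original index; rows without a known zone end up as NaT
df["local_wall_time"] = pd.concat(parts)
```

Because the lookup goes through `site_id` and the final assignment aligns on the index, the row order of `df` does not matter, so no sorting is needed.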