I have millions of rows containing a UTC datetime64 with timezone information and Latitude/Longitude pairs. For each row I need to know the local timezone and create a column containing the local time. To do this I use the tzwhere package.
A simple data set that illustrates the problem:
TimeUTC,Latitude,Longitude
2021-10-11 12:16:00 00:00,42.289723,-71.031715
2021-10-11 12:16:00 00:00,0,0
The function I use to get the timezone and then create the local time value
def tz_from_location(row, tz):
# Hardcoded in an effort to circumvent the problem. The returned value is still
# an object, even though row.TimeUTC is a datetime64
if (row.Latitude == 0) & (row.Longitude == 0):
print ("0,0")
ret_val = row.TimeUTC.tz_convert('UTC')
return (row.TimeUTC)
try:
# forceTZ=True tells it to find the nearest timezone for places without one
tzname = tz.tzNameAt(row.Latitude, row.Longitude, forceTZ=True)
if (tzname == 'uninhabited'):
return(row.TimeUTC)
ret_val = row.TimeUTC.tz_convert(tzname)
# ret_val = ret_val.to_pydatetime()
except Exception as e:
print(f'tz_from_location - Latitude: {row.Latitude} Longitude: {row.Longitude}')
print(f'Error {e}')
exit(-1)
return(ret_val)
Call the function as follows:
from tzwhere import tzwhere
from datetime import datetime
bug = pd.read_csv('./foo.csv')
# Initialize tzwhere
tz = tzwhere.tzwhere(forceTZ=True)
# Create the UTC column
bug['TimeUTC'] = bug['TimeUTC'].astype('datetime64[ns]')
# The original data comes in with a timezone that is of the local computer, not
# the location. Turn that into UTC
bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
# Now call the function
bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
# We are putting this into PostgreSQL. If you try to put a TZ aware datetime
# in it will automatically convert it to UTC. So, we need to make this value
# naive and then upload it
bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')
That last line throws an error on the row with 0,0 but not on any other rows.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/8d/jp8b0rbx5rq0l8p8cnbb5k_r0000gn/T/ipykernel_49416/4161114700.py in <module>
4 bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
5 bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
----> 6 bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
492 return PeriodProperties(data, orig)
493
--> 494 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
This is because the first row contains a datetime64 but the second row is an object.
Here is the TimeUTC value before the call:
bug.TimeUTC
0 2021-10-11 12:16:00-04:00
1 2021-10-11 12:16:00-04:00
Name: TimeUTC, dtype: datetime64[ns, US/Eastern]
Here is the dataframe with TimeLocal added:
bug.TimeLocal
0 2021-10-11 12:16:00-04:00
1 2021-10-11 12:16:00-04:00
Name: TimeLocal, dtype: object
If you look at individual rows the first one is correct but the second one is an object.
All of my efforts to return something that doesn't appear as an object for 0,0 rows have failed. I'm sure I'm missing something simple.
CodePudding user response:
Here's some suggestions; given the example DataFrame
TimeUTC Latitude Longitude
0 2021-10-11 12:16:00 00:00 42.289723 -71.031715
1 2021-10-11 12:16:00 00:00 0.000000 0.000000
make sure to parse datetime column to datetime data type:
df['TimeUTC'] = pd.to_datetime(df['TimeUTC'])
Then, you can refactor the function that derives the tz from lat/long for example like
from timezonefinder import TimezoneFinder
def tz_from_location(row, _tf=TimezoneFinder()):
# if lat/lon aren't specified, we just want the existing name (e.g. UTC)
if (row.Latitude == 0) & (row.Longitude == 0):
return row.TimeUTC.tzname()
# otherwise, try to find tz name
tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
if tzname: # return the name if it is not None
return tzname
return row.TimeUTC.tzname() # else return existing name
I'd suggest using timezonefinder
since I found it to be more efficient an reliable - docs, github.
Now you can easily apply & create a column converted to tz:
df['TimeLocal'] = df.apply(lambda row: row['TimeUTC'].tz_convert(tz_from_location(row)), axis=1)
giving you
TimeUTC Latitude Longitude TimeLocal
0 2021-10-11 12:16:00 00:00 42.289723 -71.031715 2021-10-11 08:16:00-04:00
1 2021-10-11 12:16:00 00:00 0.000000 0.000000 2021-10-11 12:16:00 00:00
df['TimeLocal'].iloc[0]
Out[2]: Timestamp('2021-10-11 08:16:00-0400', tz='America/New_York')
df['TimeLocal'].iloc[1]
Out[3]: Timestamp('2021-10-11 12:16:00 0000', tz='UTC')
(!) But... since you have mixed time zones in your TimeLocal
column, the data type of the whole Series will be object
- no way around that, that's how pandas datetime handles mixed time zones in one Series.
Addendum
If we also want a column with the time zone name in one go, we can make the function return a tuple and use expand in the call to apply:
def convert_to_location_tz(row, _tf=TimezoneFinder()):
# if lat/lon aren't specified, we just want the existing name (e.g. UTC)
if (row.Latitude == 0) & (row.Longitude == 0):
return (row.TimeUTC.tzname(), row.TimeUTC)
# otherwise, try to find tz name
tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
if tzname: # return the name if it is not None
return (tzname, row.TimeUTC.tz_convert(tzname))
return (row.TimeUTC.tzname(), row.TimeUTC) # else return existing name
df[['tzname', 'TimeLocal']] = df.apply(lambda row: convert_to_location_tz(row), axis=1, result_type='expand')
df[['tzname', 'TimeLocal']]
Out[9]:
tzname TimeLocal
0 America/New_York 2021-10-11 08:16:00-04:00
1 UTC 2021-10-11 12:16:00 00:00