Home > front end >  Python timezone determination and manipulation with tzwhere produce objects rather than datetime64s
Python timezone determination and manipulation with tzwhere produce objects rather than datetime64s

Time:01-08

I have millions of rows containing a UTC datetime64 with timezone information and Latitude/Longitude pairs. For each row I need to know the local timezone and create a column containing the local time. To do this I use the tzwhere package.

A simple data set that illustrates the problem:

TimeUTC,Latitude,Longitude
2021-10-11 12:16:00 00:00,42.289723,-71.031715
2021-10-11 12:16:00 00:00,0,0

The function I use to get the timezone and then create the local time value

def tz_from_location(row, tz):

    # Hardcoded in an effort to circumvent the problem. The returned value is still
    # an object, even though row.TimeUTC is a datetime64
    if (row.Latitude == 0) & (row.Longitude == 0):
        print ("0,0")
        ret_val = row.TimeUTC.tz_convert('UTC')
        return (row.TimeUTC)

    try:
        # forceTZ=True tells it to find the nearest timezone for places without one
        tzname = tz.tzNameAt(row.Latitude, row.Longitude, forceTZ=True)
        if (tzname == 'uninhabited'):
            return(row.TimeUTC)
            
        ret_val = row.TimeUTC.tz_convert(tzname)
#        ret_val = ret_val.to_pydatetime()
    except Exception as e:
        print(f'tz_from_location - Latitude: {row.Latitude} Longitude: {row.Longitude}')
        print(f'Error {e}')
        exit(-1)
        
    return(ret_val)

Call the function as follows:

from tzwhere import tzwhere
from datetime import datetime

bug = pd.read_csv('./foo.csv')

# Initialize tzwhere
tz = tzwhere.tzwhere(forceTZ=True)

# Create the UTC column
bug['TimeUTC'] = bug['TimeUTC'].astype('datetime64[ns]')

# The original data comes in with a timezone that is of the local computer, not
# the location. Turn that into UTC
bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')

# Now call the function
bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)

# We are putting this into PostgreSQL. If you try to put a TZ aware datetime
# in it will automatically convert it to UTC. So, we need to make this value
# naive and then upload it
bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')

That last line throws an error on the row with 0,0 but not on any other rows.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8d/jp8b0rbx5rq0l8p8cnbb5k_r0000gn/T/ipykernel_49416/4161114700.py in <module>
      4 bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
      5 bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
----> 6 bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html

~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    492             return PeriodProperties(data, orig)
    493 
--> 494         raise AttributeError("Can only use .dt accessor with datetimelike values")

AttributeError: Can only use .dt accessor with datetimelike values

This is because the first row contains a datetime64 but the second row is an object.

Here is the TimeUTC value before the call:

bug.TimeUTC
0   2021-10-11 12:16:00-04:00
1   2021-10-11 12:16:00-04:00
Name: TimeUTC, dtype: datetime64[ns, US/Eastern]

Here is the dataframe with TimeLocal added:

bug.TimeLocal
0    2021-10-11 12:16:00-04:00
1    2021-10-11 12:16:00-04:00
Name: TimeLocal, dtype: object

If you look at individual rows the first one is correct but the second one is an object.

All of my efforts to return something that doesn't appear as an object for 0,0 rows have failed. I'm sure I'm missing something simple.

CodePudding user response:

Here's some suggestions; given the example DataFrame

                     TimeUTC   Latitude  Longitude
0  2021-10-11 12:16:00 00:00  42.289723 -71.031715
1  2021-10-11 12:16:00 00:00   0.000000   0.000000

make sure to parse datetime column to datetime data type:

df['TimeUTC'] = pd.to_datetime(df['TimeUTC'])

Then, you can refactor the function that derives the tz from lat/long for example like

from timezonefinder import TimezoneFinder

def tz_from_location(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return row.TimeUTC.tzname()
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return tzname
    return row.TimeUTC.tzname() # else return existing name

I'd suggest using timezonefinder since I found it to be more efficient an reliable - docs, github.

Now you can easily apply & create a column converted to tz:

df['TimeLocal'] = df.apply(lambda row: row['TimeUTC'].tz_convert(tz_from_location(row)), axis=1)

giving you

                    TimeUTC   Latitude  Longitude                  TimeLocal
0 2021-10-11 12:16:00 00:00  42.289723 -71.031715  2021-10-11 08:16:00-04:00
1 2021-10-11 12:16:00 00:00   0.000000   0.000000  2021-10-11 12:16:00 00:00
df['TimeLocal'].iloc[0]
Out[2]: Timestamp('2021-10-11 08:16:00-0400', tz='America/New_York')

df['TimeLocal'].iloc[1]
Out[3]: Timestamp('2021-10-11 12:16:00 0000', tz='UTC')

(!) But... since you have mixed time zones in your TimeLocal column, the data type of the whole Series will be object - no way around that, that's how pandas datetime handles mixed time zones in one Series.


Addendum

If we also want a column with the time zone name in one go, we can make the function return a tuple and use expand in the call to apply:

def convert_to_location_tz(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return (row.TimeUTC.tzname(), row.TimeUTC)
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return (tzname, row.TimeUTC.tz_convert(tzname))
    return (row.TimeUTC.tzname(), row.TimeUTC) # else return existing name

df[['tzname', 'TimeLocal']] = df.apply(lambda row: convert_to_location_tz(row), axis=1, result_type='expand')

df[['tzname', 'TimeLocal']]
Out[9]: 
             tzname                  TimeLocal
0  America/New_York  2021-10-11 08:16:00-04:00
1               UTC  2021-10-11 12:16:00 00:00
  •  Tags:  
  • Related