How can I transfer my working for loop over to pandas' much quicker apply functionality?

I have a dataframe with longitude and latitude columns. I need to get the county name for each location from its longitude and latitude values, using the geopy package.

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0

I had success with a for loop:

geolocator = geopy.Nominatim(user_agent='1234')

for index, row in df.iloc[:10, :].iterrows():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)

The dataset has 17,000 rows, so that should be a problem, right?

So I've been trying to figure out how to build a function which I could use in pandas.apply() in order to get quicker results.

def get_zipcodes():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)

counties = get_zipcodes()

I'm stuck and don't know how to use apply (or any other clever method) in here. Help is much appreciated.

CodePudding user response:

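The pandas calculations in your code are unlikely to be the speed bottleneck when using geopy (see this answer to a different geopy question). Each reverse() call is a network request to the Nominatim service, and that round trip dominates the run time whether you drive it from a for loop or from apply().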
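To put a rough number on this for the 17,000 rows in the question, here is a back-of-the-envelope estimate. It assumes the public Nominatim server and a pace of one request per second, which is the ceiling stated in its usage policy. The actual rate you get may differ, so treat the figure as an order of magnitude only.

```python
# Rough run-time estimate for reverse-geocoding every row, one request at a time.
rows = 17_000               # row count from the question
seconds_per_request = 1.0   # assumption: one request per second on the public server

total_seconds = rows * seconds_per_request
total_hours = total_seconds / 3600

print(f"{total_hours:.1f} hours")  # 4.7 hours
```

Switching from iterrows() to apply() does not change this number, because the waiting happens on the network rather than in pandas.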

However, if many rows share the same latitude and longitude coordinates, you can cache the results with the @cache (or @lru_cache(None)) decorator from functools, so that each distinct coordinate pair is looked up only once.

Here is how to use apply() on your dataframe with no special caching:

df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)

Full test code:

import geopy
import pandas as pd

geolocator = geopy.Nominatim(user_agent='1234')

# Small sample frame with the same coordinates as the first rows in the question
df = pd.DataFrame({
    'longitude': [-114.31, -114.47, -114.56, -114.57, -114.57],
    'latitude': [34.19, 34.40, 33.69, 33.64, 33.57],
    'housing_median_age': [15] * 5,
    'total_rooms': [1000] * 5,
    'total_bedrooms': [500] * 5,
    'population': [800] * 5})

print(df)

# One reverse-geocoding request per row
df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)
print(df)

Input:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population
0    -114.31     34.19                  15         1000             500         800
1    -114.47     34.40                  15         1000             500         800
2    -114.56     33.69                  15         1000             500         800
3    -114.57     33.64                  15         1000             500         800
4    -114.57     33.57                  15         1000             500         800

Output:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population                 county
0    -114.31     34.19                  15         1000             500         800  San Bernardino County
1    -114.47     34.40                  15         1000             500         800  San Bernardino County
2    -114.56     33.69                  15         1000             500         800       Riverside County
3    -114.57     33.64                  15         1000             500         800       Riverside County
4    -114.57     33.57                  15         1000             500         800       Riverside County
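One caveat with the one-liner above: it indexes straight into raw["address"]["county"], so it will raise if a result has no county entry, or if reverse() returns None because nothing was found. The sketch below shows one way to guard against both cases. The helper name county_or_none is made up for illustration, and the SimpleNamespace objects are only stand-ins for geopy results so that the example runs without network access.

```python
from types import SimpleNamespace


def county_or_none(location):
    """Return the county from a geopy-style result, or None if it is missing."""
    if location is None:  # nothing was found for these coordinates
        return None
    # .get() avoids a KeyError when "address" or "county" is absent
    return location.raw.get("address", {}).get("county")


# Stand-ins for geopy results (hypothetical data, no network involved)
with_county = SimpleNamespace(raw={"address": {"county": "Riverside County"}})
without_county = SimpleNamespace(raw={"address": {"state": "California"}})

print(county_or_none(with_county))
print(county_or_none(without_county))
print(county_or_none(None))
```

With a real geolocator, this would be used as df.apply(lambda row: county_or_none(geolocator.reverse([row["latitude"], row["longitude"]])), axis=1), leaving None in the rows where no county is available.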

Here is how to use a decorator to cache results (i.e., to avoid going all the way out to the geocoding server multiple times) for identical latitude, longitude coordinates:

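from functools import cache

@cache
def bar(lat, long):
    # Called once per distinct (lat, long) pair; repeats are served from the cache
    return geolocator.reverse([lat, long]).raw["address"]["county"]

def foo(row):
    # Unpack the row into hashable arguments so the cache can key on them
    return bar(row["latitude"], row["longitude"])

df["county"] = df.apply(foo, axis=1)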
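To see what the cache actually saves, here is a small sketch that counts how many lookups reach the server. fake_reverse is a hypothetical stand-in for the geopy call (its county rule is invented purely for the demo), and lru_cache(maxsize=None) is used because it behaves like @cache but also works on Python versions before 3.9.

```python
from functools import lru_cache

calls = []  # records every lookup that actually reaches the (fake) server


def fake_reverse(lat, lon):
    """Hypothetical stand-in for the geopy request; no network involved."""
    calls.append((lat, lon))
    return "Riverside County" if lat < 34 else "San Bernardino County"


@lru_cache(maxsize=None)  # equivalent to @cache on Python 3.9+
def cached_county(lat, lon):
    return fake_reverse(lat, lon)


# Four rows, but only two distinct coordinate pairs
coords = [(33.64, -114.57), (34.19, -114.31), (33.64, -114.57), (33.64, -114.57)]
counties = [cached_county(lat, lon) for lat, lon in coords]

print(counties)
print(len(calls))                  # 2: one real lookup per distinct pair
print(cached_county.cache_info())  # hit and miss counts for the cache
```

The cache only helps when coordinates repeat exactly. If nearly every row has its own latitude and longitude, every call is a miss and the run time stays the same.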