I have a dataframe with longitude and latitude columns. I need to get the county name for each location from its longitude and latitude values, using the geopy package.
longitude latitude housing_median_age total_rooms total_bedrooms \
0 -114.31 34.19 15.0 5612.0 1283.0
1 -114.47 34.40 19.0 7650.0 1901.0
2 -114.56 33.69 17.0 720.0 174.0
3 -114.57 33.64 14.0 1501.0 337.0
4 -114.57 33.57 20.0 1454.0 326.0
population households median_income median_house_value
0 1015.0 472.0 1.4936 66900.0
1 1129.0 463.0 1.8200 80100.0
2 333.0 117.0 1.6509 85700.0
3 515.0 226.0 3.1917 73400.0
4 624.0 262.0 1.9250 65500.0
I had success with a for loop:
geolocator = geopy.Nominatim(user_agent='1234')
for index, row in df.iloc[:10, :].iterrows():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)
The dataset has 17,000 rows, so looping over all of them this way will be slow, right?
So I've been trying to work out how to write a function that I could pass to DataFrame.apply() to get results more quickly.
def get_zipcodes():
    location = geolocator.reverse([row["latitude"], row["longitude"]])
    county = location.raw['address']['county']
    print(county)
counties = get_zipcodes()
I'm stuck and don't know how to use apply (or any other clever method) here. Help is much appreciated.
Answer:
The pandas operations in your code are unlikely to be the speed bottleneck when using geopy, because every call to geolocator.reverse() is a request to a remote geocoding service (see this answer to a different geopy question).
However, if a significant number of rows may share the same latitude, longitude coordinates, you can use the @cache decorator from functools (available from Python 3.9; on older versions use @lru_cache(None)) so that each distinct pair is looked up only once.
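Because the time goes into the requests rather than into pandas, it can help to see what spacing out requests looks like. The public Nominatim service asks for at most one request per second, so 17,000 uncached lookups would take roughly 4.7 hours at that rate. geopy ships a RateLimiter helper in geopy.extra.rate_limiter for this purpose; the block below is only a minimal stdlib sketch of the same idea. The names throttled and fake_reverse are hypothetical, and fake_reverse stands in for geolocator.reverse so the sketch runs offline.

```python
import time

def throttled(func, min_delay_seconds):
    """Wrap func so successive calls start at least min_delay_seconds apart."""
    last_call = [None]  # monotonic timestamp of the previous call, if any

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)  # pause until the minimum spacing has passed
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for geolocator.reverse; it makes no network request.
def fake_reverse(coords):
    return {"address": {"county": "Example County"}}

# A tiny delay keeps the demo fast; a real public-server run would use 1 second.
slow_reverse = throttled(fake_reverse, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_reverse([34.19, -114.31]) for _ in range(3)]
elapsed = time.monotonic() - start
```

With a real geolocator, the wrapped function would simply replace geolocator.reverse inside the function passed to apply().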
Here is how to use apply() on your dataframe with no special caching:
df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)
Full test code:
import geopy
import pandas as pd
geolocator = geopy.Nominatim(user_agent='1234')
df = pd.DataFrame({
    'longitude': [-114.31, -114.47, -114.56, -114.57, -114.57],
    'latitude': [34.19, 34.40, 33.69, 33.64, 33.57],
    'housing_median_age': [15]*5,
    'total_rooms': [1000]*5,
    'total_bedrooms': [500]*5,
    'population': [800]*5})
print(df)
df["county"] = df.apply(lambda row: geolocator.reverse([row["latitude"], row["longitude"]]).raw["address"]["county"], axis=1)
print(df)
Input:
longitude latitude housing_median_age total_rooms total_bedrooms population
0 -114.31 34.19 15 1000 500 800
1 -114.47 34.40 15 1000 500 800
2 -114.56 33.69 15 1000 500 800
3 -114.57 33.64 15 1000 500 800
4 -114.57 33.57 15 1000 500 800
Output:
longitude latitude housing_median_age total_rooms total_bedrooms population county
0 -114.31 34.19 15 1000 500 800 San Bernardino County
1 -114.47 34.40 15 1000 500 800 San Bernardino County
2 -114.56 33.69 15 1000 500 800 Riverside County
3 -114.57 33.64 15 1000 500 800 Riverside County
4 -114.57 33.57 15 1000 500 800 Riverside County
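One caveat with the one-liner above: reverse() can return None when nothing is found for a coordinate, and the returned address does not always contain a "county" key, so over 17,000 rows a single such point would raise an exception partway through. A minimal sketch of a more defensive extraction follows. The helper name county_from_location is hypothetical, and SimpleNamespace objects stand in for geopy Location results so the block runs offline.

```python
from types import SimpleNamespace

def county_from_location(location):
    """Return the county name, or None when it cannot be determined."""
    if location is None:              # nothing was found for these coordinates
        return None
    address = location.raw.get("address", {})
    return address.get("county")      # None if the address has no "county" key

# Stand-ins for geopy Location objects (assumed shape: a .raw dict).
with_county = SimpleNamespace(raw={"address": {"county": "Riverside County"}})
without_county = SimpleNamespace(raw={"address": {"state": "California"}})

found = county_from_location(with_county)
missing_key = county_from_location(without_county)
no_result = county_from_location(None)
```

In the apply() call, this would wrap the lookup, for example county_from_location(geolocator.reverse([row["latitude"], row["longitude"]])), leaving None in the county column for rows that could not be resolved.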
Here is how to use a decorator to cache results (i.e., to avoid sending repeated requests to the geocoding server) for identical latitude, longitude coordinates:
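from functools import cache
@cache  # remembers the result for each distinct (lat, long) pair
def bar(lat, long):
    return geolocator.reverse([lat, long]).raw["address"]["county"]
def foo(row):
    # pass the two coordinates as separate hashable values so they can be cached
    return bar(row["latitude"], row["longitude"])
df["county"] = df.apply(foo, axis=1)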
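To check that the cache really saves requests, the following sketch counts how often the underlying lookup runs. It uses lru_cache(maxsize=None), which behaves like @cache and also works on Python versions before 3.9. The names fake_reverse and cached_county are hypothetical, and fake_reverse uses an arbitrary made-up rule instead of contacting a server.

```python
from functools import lru_cache

calls = []  # records every simulated server request

def fake_reverse(lat, long):
    # Hypothetical stand-in for the real lookup; the rule below is arbitrary.
    calls.append((lat, long))
    return "Riverside County" if lat < 34 else "San Bernardino County"

@lru_cache(maxsize=None)  # unbounded cache keyed on the (lat, long) arguments
def cached_county(lat, long):
    return fake_reverse(lat, long)

# Four lookups, but only two distinct coordinate pairs.
coords = [(34.19, -114.31), (33.69, -114.56), (34.19, -114.31), (33.69, -114.56)]
counties = [cached_county(lat, long) for lat, long in coords]
info = cached_county.cache_info()  # hit/miss statistics of the cache
```

Here the simulated server is contacted twice rather than four times. Note that the cache only helps for exactly equal coordinate values; points that differ in any decimal place are treated as distinct.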