I'm aiming to add a few columns based on the existing 'latitude' and 'longitude' columns, where each entry is used to retrieve relevant data from the Google Maps API. My method works fine, but it takes extremely long to run on more than 450k rows.
Could anyone suggest a more efficient way to iterate through the dataframe, or to apply the user-defined function to all rows, so that the same goal can be achieved?
```
def add_distance(property_csv):
    """Take a property dataframe, compute the driving/walking distance and time
    from each property (using its latitude and longitude) to Melbourne CBD, and
    return the augmented dataframe."""
    ...
    gmaps = googlemaps.Client(key='myAPI')
    driving_distance_list_km = []
    driving_time_list_mins = []
    walking_distance_list_km = []
    walking_time_list_mins = []
    i = 0
    for row in property_dict:
        print(f"Row {i}, location {row['street_address']}")
        origin_latitude, origin_longitude = row['latitude'], row['longitude']
        origin = str(origin_latitude) + " " + str(origin_longitude)
        destination = str(destination_latitude) + " " + str(destination_longitude)
        driving_distance_matrix = gmaps.distance_matrix([origin], [destination], mode='driving')
        walking_distance_matrix = gmaps.distance_matrix([origin], [destination], mode='walking')
        # individual values
        driving_distance_km = pd.to_numeric(pattern.findall(driving_distance_matrix['rows'][0]['elements'][0]['distance']['text'])[0])
        driving_time_mins = pd.to_numeric(pattern.findall(driving_distance_matrix['rows'][0]['elements'][0]['duration']['text'])[0])
        walking_distance_km = pd.to_numeric(pattern.findall(walking_distance_matrix['rows'][0]['elements'][0]['distance']['text'])[0])
        walking_time_mins = pd.to_numeric(pattern.findall(walking_distance_matrix['rows'][0]['elements'][0]['duration']['text'])[0])
        # store values in the corresponding lists
        ...
        i += 1
    ...
    return property_csv
```
Many thanks!
CodePudding user response:
Obviously, by far the limiting factor here is each gmaps.distance_matrix call; all other code here is extremely fast relative to an API call.
I've never worked with this API, but looking at the client code / docs, it appears you can pass a list of origins and destinations in each API call, with a limit of around 25 origins/destinations per call.
I would batch the calls into sets of 25 origins per API call, which cuts the number of API round trips by roughly 25x and should yield a very significant performance improvement (a rough sketch is below). I only skimmed the docs quickly, so you'll need to verify the details.
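For illustration, here is a minimal batching sketch, not a drop-in implementation: the helper name add_distance_batched is invented for this answer, the Melbourne CBD destination is hard-coded as an approximation, batch_size=25 reflects the limit mentioned above, and distances/durations are read from the documented 'value' fields (metres and seconds) instead of regex-parsing the 'text' fields.
```
import googlemaps

gmaps = googlemaps.Client(key='myAPI')          # placeholder key, as in the question
DESTINATION = (-37.8136, 144.9631)              # approximate Melbourne CBD (assumed)

def add_distance_batched(property_csv, batch_size=25, mode='driving'):
    """Return (distances_km, times_min) lists, one entry per row of property_csv,
    querying up to batch_size origins per Distance Matrix call."""
    origins = list(zip(property_csv['latitude'], property_csv['longitude']))
    distances_km, times_min = [], []
    for start in range(0, len(origins), batch_size):
        chunk = origins[start:start + batch_size]
        response = gmaps.distance_matrix(chunk, [DESTINATION], mode=mode)
        for row in response['rows']:             # one row per origin in the chunk
            element = row['elements'][0]         # single destination -> single element
            if element['status'] == 'OK':
                distances_km.append(element['distance']['value'] / 1000)  # metres -> km
                times_min.append(element['duration']['value'] / 60)       # seconds -> minutes
            else:
                distances_km.append(None)
                times_min.append(None)
    return distances_km, times_min

# usage sketch (repeat with mode='walking' for the walking columns):
# property_csv['driving_distance_km'], property_csv['driving_time_mins'] = add_distance_batched(property_csv)
```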
Other than that, you could consider multiprocessing.Pool (as a way to parallelize) to spawn several worker processes that each run through a subset of the data; a sketch of that follows as well. Of course, you'll need to be cognizant of any rate limiting imposed by Google.
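And a rough sketch of the multiprocessing idea, again with illustrative names (_init_worker, row_distance are made up here) and a simple one-call-per-row worker; each process builds its own client, and the worker count and chunksize are arbitrary:
```
from multiprocessing import Pool

import googlemaps

DESTINATION = (-37.8136, 144.9631)     # approximate Melbourne CBD (assumed)
_client = None                         # one client per worker process

def _init_worker():
    # Each worker builds its own client; don't share one client across processes.
    global _client
    _client = googlemaps.Client(key='myAPI')

def row_distance(origin):
    """Driving distance (km) and time (min) for a single (lat, lng) origin."""
    response = _client.distance_matrix([origin], [DESTINATION], mode='driving')
    element = response['rows'][0]['elements'][0]
    if element['status'] != 'OK':
        return None, None
    return element['distance']['value'] / 1000, element['duration']['value'] / 60

if __name__ == '__main__':
    origins = list(zip(property_csv['latitude'], property_csv['longitude']))  # property_csv from the question
    with Pool(processes=4, initializer=_init_worker) as pool:
        results = pool.map(row_distance, origins, chunksize=100)
    # results keeps the original row order: a list of (distance_km, time_mins) tuples
```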
CodePudding user response:
You can use the pandas Series.map function (Series.map(arg, na_action=None)): build a series where each entry is a tuple of latitude and longitude, and make arg the function above without the for loop, returning another tuple with the information. In general, these map functions are very fast.
```
def add_distance(geotuple):
    origin_latitude, origin_longitude = geotuple[0], geotuple[1]
    origin = str(origin_latitude) + " " + str(origin_longitude)
    destination = str(destination_latitude) + " " + str(destination_longitude)
    driving_distance_matrix = gmaps.distance_matrix([origin], [destination], mode='driving')
    walking_distance_matrix = gmaps.distance_matrix([origin], [destination], mode='walking')
    # individual values
    driving_distance_km = pd.to_numeric(pattern.findall(driving_distance_matrix['rows'][0]['elements'][0]['distance']['text'])[0])
    driving_time_mins = pd.to_numeric(pattern.findall(driving_distance_matrix['rows'][0]['elements'][0]['duration']['text'])[0])
    walking_distance_km = pd.to_numeric(pattern.findall(walking_distance_matrix['rows'][0]['elements'][0]['distance']['text'])[0])
    walking_time_mins = pd.to_numeric(pattern.findall(walking_distance_matrix['rows'][0]['elements'][0]['duration']['text'])[0])
    return (driving_distance_km, driving_time_mins, walking_distance_km, walking_time_mins)
```
The lat and long values should come from your dataframe as a list of tuples, for example:
```
df = pd.DataFrame()  # your property data with 'latitude' and 'longitude' columns
gmaps = googlemaps.Client(key='myAPI')
df['geotuple'] = list(zip(df['latitude'], df['longitude']))
df['add_distance'] = df.geotuple.map(add_distance)
```
https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html