I have a Python performance issue. I need to merge two dataframes on an openstreetmaps function, and it is too slow. My first dataframe df1 is 35K rows long, and my second one 2K rows long. Full merge is then 70M. I suppose I won't be able to improve openstreetmaps response time but maybe my code could be faster. Can you see a way of improvments?
Thanks for answers.
import requests, json
from sqlalchemy import create_engine
url = 'http://router.project-osrm.org/route/v1/driving/'
conn = 'postgresql psycopg2://postgres:{0}@localhost/postgres'.format('')
my_conn = create_engine(conn)
def closest(s,df2):
distance_min = 1000000
duration_min = 1000000
place_min = ''
for i in range(len(df2)):
x = str(s['LONG']) ',' str(s['LAT']) ";" str(df2.loc[i, "LONG"]) "," str(df2.loc[i, "LAT"])
response = requests.get(url x)
data = json.loads(response.content)
duration = int((data['routes'][0]['duration']*0.95)/60)
if duration_min > duration:
distance_min = int(data['routes'][0]['distance']*0.000621371*1.609344)
duration_min = duration
place_min = df2.loc[i, "PLACE_NAME"]
s['CLOSEST_PLACE'] = place_min
s['DISTANCE'] = distance_min
s['DURATION'] = duration_min
return s
df1 = df1.apply(lambda x: closest(x, df2),axis=1)
Edit: Head of df1 & df2 - it is French open data: basically I want to find the closest station (df2) to each city (df1)
ENT_CODE COM_NAME COM_CODE DEP_NAME DEP_CODE REG_NAME REG_CODE LAT LONG
0 01001 L'Abergement-Clémenciat 1 Ain 01 Auvergne-Rhône-Alpes 84 46.153426 4.926114
1 01002 L'Abergement-de-Varey 2 Ain 01 Auvergne-Rhône-Alpes 84 46.009188 5.428017
2 01004 Ambérieu-en-Bugey 4 Ain 01 Auvergne-Rhône-Alpes 84 45.960848 5.372926
3 01005 Ambérieux-en-Dombes 5 Ain 01 Auvergne-Rhône-Alpes 84 45.996180 4.912273
4 01006 Ambléon 6 Ain 01 Auvergne-Rhône-Alpes 84 45.749499 5.594320
5 01007 Ambronay 7 Ain 01 Auvergne-Rhône-Alpes 84 46.005591 5.357607
6 01008 Ambutrix 8 Ain 01 Auvergne-Rhône-Alpes 84 45.936713 5.332809
7 01009 Andert-et-Condon 9 Ain 01 Auvergne-Rhône-Alpes 84 45.787357 5.657883
8 01010 Anglefort 10 Ain 01 Auvergne-Rhône-Alpes 84 45.909372 5.795160
9 01011 Apremont 11 Ain 01 Auvergne-Rhône-Alpes 84 46.205498 5.657815
PLACE_NAME LAT LONG
0 Aéroport Charles de Gaulle 2 TGV 49.003652 2.570892
1 Agde 43.317280 3.466203
2 Agen 44.208311 0.620932
3 Aime - La Plagne 45.554400 6.648646
4 Aix-en-Provence TGV 43.455237 5.317534
5 Aix-les-Bains le Revard 45.688161 5.909371
6 Albertville 45.672977 6.383167
7 Amiens 49.890746 2.312592
8 Ancenis 47.369334 -1.177763
9 Angers Saint-Laud 47.464647 -0.556820
CodePudding user response:
The following items come to mind:
- See if the page allows to query multiple locations at once (so you do only one query per row of
df1
instead of looping through each row ofdf2
) - Use a
threadPoolExecutor
to query the page for the locations concurrently - Do the calculations as the
threadPoolExecutor
returns the values
CodePudding user response:
Below is a lot of code, but the general idea is that you don't need to measure the distance between every pair of N=35K cities and and M=2K stations which is prohibitively slow and O(N*M).
You only care about the closest station to each city which means that you can do a pre-filtering step and avoid making 70M request
calls.
The steps are:
- Convert each city and station from LAT/LONG to X,Y,Z
- Build a station-rtree of the station which takes O(M) time. An rtree is like a spatial binary search tree
- Iterate through the cities using the station-rtree to find the closest station to each city in euclidean measurements (this is an approximation). This takes O(N*log2(M)) time
- Finding the closest pairs used to be O(N*M) = 70M, but is now O(M) O(N*log(M)) ~= 387K with the rtree (not 100% sure on the time complexity...)
- Now you have a list of approximate closest pairs "as the crow flies" which you can then use the external API to find the actual driving dist/time
- The number of API calls went from 70M to 35K (once for each city). This is still a lot of calls
This is an approximate solution since a city/station might be very close in euclidean space, but very far in driving space (i.e. an indirect route due to a mountain in between). You can hedge against this by using the API to measure the k closest stations to each city. The code below shows k=2
import pandas as pd
import numpy as np
import math
import io #<-- just used to read in the test data
#Package information https://rtree.readthedocs.io/en/latest/
from rtree import index #<-- retree package requires "pip install rtree" (or conda etc)
#input test data
cities = pd.read_csv(io.StringIO("""
ENT_CODE,COM_NAME,COM_CODE,DEP_NAME,DEP_CODE,REG_NAME,REG_CODE,LAT,LONG
01001,L'Abergement-Clémenciat,1,Ain,01,Auvergne-Rhône-Alpes,84,46.153426,4.926114
01002,L'Abergement-de-Varey,2,Ain,01,Auvergne-Rhône-Alpes,84,46.009188,5.428017
01004,Ambérieu-en-Bugey,4,Ain,01,Auvergne-Rhône-Alpes,84,45.960848,5.372926
01005,Ambérieux-en-Dombes,5,Ain,01,Auvergne-Rhône-Alpes,84,45.996180,4.912273
01006,Ambléon,6,Ain,01,Auvergne-Rhône-Alpes,84,45.749499,5.594320
01007,Ambronay,7,Ain,01,Auvergne-Rhône-Alpes,84,46.005591,5.357607
01008,Ambutrix,8,Ain,01,Auvergne-Rhône-Alpes,84,45.936713,5.332809
01009,Andert-et-Condon,9,Ain,01,Auvergne-Rhône-Alpes,84,45.787357,5.657883
01010,Anglefort,10,Ain,01,Auvergne-Rhône-Alpes,84,45.909372,5.795160
01011,Apremont,11,Ain,01,Auvergne-Rhône-Alpes,84,46.205498,5.657815
"""))
stations = pd.read_csv(io.StringIO("""
PLACE_NAME,LAT,LONG
Aéroport Charles de Gaulle 2 TGV, 49.003652, 2.570892
Agde, 43.317280, 3.466203
Agen, 44.208311, 0.620932
Aime - La Plagne, 45.554400, 6.648646
Aix-en-Provence TGV ,43.455237, 5.317534
Aix-les-Bains le Revard, 45.688161, 5.909371
Albertville, 45.672977, 6.383167
Amiens, 49.890746, 2.312592
Ancenis, 47.369334, -1.177763
Angers Saint-Laud, 47.464647, -0.556820
"""))
#estimate x,y,z coordinates from LAT/LONG for each table
#stolen from https://stackoverflow.com/questions/1185408/converting-from-longitude-latitude-to-cartesian-coordinates
R = 6371 #approximate radius of earth in km
to_rad = math.pi/180 #convert from degrees to radians
station_xyz_coords = pd.DataFrame(index=stations.index)
station_xyz_coords['x'] = R*np.cos(stations['LAT']*to_rad)*np.cos(stations['LONG']*to_rad)
station_xyz_coords['y'] = R*np.cos(stations['LAT']*to_rad)*np.sin(stations['LONG']*to_rad)
station_xyz_coords['z'] = R*np.sin(stations['LAT']*to_rad)
city_xyz_coords = pd.DataFrame(index=cities.index)
city_xyz_coords['x'] = R*np.cos(cities['LAT']*to_rad)*np.cos(cities['LONG']*to_rad)
city_xyz_coords['y'] = R*np.cos(cities['LAT']*to_rad)*np.sin(cities['LONG']*to_rad)
city_xyz_coords['z'] = R*np.sin(cities['LAT']*to_rad)
#create an RTree index from df1
#this allows for fast lookups of the closest city for a given station
p = index.Property()
p.dimension = 3
station_rtree = index.Index(properties=p)
for station_ind,s in station_xyz_coords.iterrows():
station_rtree.insert(id=station_ind, coordinates=(s.x, s.y, s.z, s.x, s.y, s.z))
#find, "as the crow flies" the nearest town to each station
#this is an assumption that the closest direct distance will also have the closest driving distance
#can adjust the num_nearest below to get back the k nearest cities to each station
num_nearest = 2
nearby_station_cities = {
'city_ind':[],
'station_ind':[],
'euclidean_dist':[],
}
for city_ind,c in city_xyz_coords.iterrows():
nearby_stations = station_rtree.nearest((c.x, c.y, c.z, c.x, c.y, c.z), num_nearest)
for station_ind in nearby_stations:
nearby_station_cities['city_ind'].append(city_ind)
nearby_station_cities['station_ind'].append(station_ind)
#getting euclidean dist "for fun", wouldn't be needed for anything later
city_coords = city_xyz_coords.loc[city_ind][['x','y','z']].values
station_coords = station_xyz_coords.loc[station_ind][['x','y','z']].values
euclidean_dist = np.linalg.norm(station_coords-city_coords)
nearby_station_cities['euclidean_dist'].append(euclidean_dist)
euc_nearby_df = pd.DataFrame(nearby_station_cities)
#merge the station and city information into the table of closest pairs
euc_nearby_df = euc_nearby_df.merge(
cities,left_on='city_ind',right_index=True,
).merge(
stations,left_on='station_ind',right_index=True,
suffixes = ('_city','_station'),
)
#then iterate through the closest pairs calling the external API
for i,r in euc_nearby_df.iterrows():
x = '{},{};{},{}'.format(
r.LAT_station, r.LONG_station,
r.LAT_city, r.LONG_city,
)
#response = requests.get(url x)
#etc...
#your prior code