I have 2 dataframes:
- df_exposure (>300k rows)
ID Limit Lat Lon
0 1 49 21.066107 121.930200
1 2 49 20.932773 121.913533
2 3 49 20.932773 121.921867
3 4 49 20.924440 121.930200
4 5 49 20.899440 121.905200
From df_exposure I extract:
lat_loc = df_exposure.loc[:, 'Lat']
lon_loc = df_exposure.loc[:, 'Lon']
- df (3k rows):
Lat Lon Limit
0 4.125 116.125 0.0
1 4.375 116.125 0.0
2 4.625 116.125 0.0
3 4.875 116.125 0.0
4 5.125 116.125 0.0
And this is the haversine function:
def haversine(lat2, lon2, lat1, lon1):
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180
    a = (np.sin((lat2_ - lat1_) / 2)**2) + (np.sin((lon2_ - lon1_) / 2)**2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))
    return dist
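As a quick sanity check, one degree of longitude at the equator should come out to about 111.19 km (6371 × π / 180):

```python
import numpy as np

def haversine(lat2, lon2, lat1, lon1):
    # great-circle distance in km, Earth radius 6371 km
    # note the + between the two squared-sine terms
    lat1_, lat2_ = np.radians(lat1), np.radians(lat2)
    lon1_, lon2_ = np.radians(lon1), np.radians(lon2)
    a = np.sin((lat2_ - lat1_) / 2) ** 2 + np.cos(lat1_) * np.cos(lat2_) * np.sin((lon2_ - lon1_) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

print(round(haversine(0.0, 1.0, 0.0, 0.0), 2))  # one degree of longitude at the equator
```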
Essentially, df is a subset of df_exposure with a bigger grid size. I would like to compute the distance from each location (row) in df_exposure to all locations in df, find the minimum, and add the Limit of that df_exposure row to the df row with the smallest distance, repeating this for every location in df_exposure until all locations are processed.
This is how it is currently done, but it takes a very long time because of the size of df_exposure (>300k rows):
for i in range(len(lat_loc)):
    r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
    dist = r.min()  # find minimum distance
    df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] + df_exposure.loc[i, 'Limit']
I would appreciate any suggestions to improve the current code. Thank you.
CodePudding user response:
You can use sklearn.neighbors.DistanceMetric for the haversine distance:
from sklearn.neighbors import DistanceMetric
distance = DistanceMetric.get_metric('haversine')
lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']
lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
(6371 * distance.pairwise((np.array([lat1, lon1]) * np.pi / 180).T,
                          (np.array([lat2, lon2]) * np.pi / 180).T).min(1))
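The snippet above returns only the minimum distances; the allocation asked for in the question also needs the index of the nearest df row plus a scatter-add. A sketch of that step with plain NumPy broadcasting, on toy data with the question's column names:

```python
import numpy as np
import pandas as pd

# toy stand-ins for df_exposure and df; column names as in the question
df_exposure = pd.DataFrame({'Limit': [10.0, 20.0, 30.0],
                            'Lat':   [20.9, 21.1, 20.5],
                            'Lon':   [121.9, 121.8, 121.7]})
df = pd.DataFrame({'Lat': [20.875, 21.125],
                   'Lon': [121.875, 121.625],
                   'Limit': [0.0, 0.0]})

lat1 = np.radians(df_exposure['Lat'].to_numpy())[:, None]   # shape (N, 1)
lon1 = np.radians(df_exposure['Lon'].to_numpy())[:, None]
lat2 = np.radians(df['Lat'].to_numpy())[None, :]            # shape (1, M)
lon2 = np.radians(df['Lon'].to_numpy())[None, :]

# full (N, M) haversine matrix via broadcasting
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
dist = 2 * 6371 * np.arcsin(np.sqrt(a))

nearest = dist.argmin(axis=1)   # index of the closest df row per exposure row
alloc = np.zeros(len(df))
np.add.at(alloc, nearest, df_exposure['Limit'].to_numpy())  # scatter-add the Limits
df['Limit'] += alloc
```

np.add.at is needed (rather than fancy-indexed +=) because several exposure rows can map to the same df row.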
CodePudding user response:
Let's go in order. I have created dataframes with the specified dimensions. Here is the runtime of your implementation:
import time
import numpy as np
import pandas as pd
EXPOSURE_SIZE = 300_000
DF_SIZE = 3000
df_exposure = pd.DataFrame({'Limit': np.random.randint(0, 1000, size=(EXPOSURE_SIZE,)),
                            'Lat': np.random.uniform(-10, 10, size=EXPOSURE_SIZE),
                            'Lon': np.random.uniform(-10, 10, size=EXPOSURE_SIZE)})
df = pd.DataFrame({'Limit': np.random.randint(0, 1000, size=(DF_SIZE,)),
                   'Lat': np.random.uniform(-10, 10, size=DF_SIZE),
                   'Lon': np.random.uniform(-10, 10, size=DF_SIZE)})
def haversine(lat2, lon2, lat1, lon1):
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180
    a = (np.sin((lat2_ - lat1_) / 2) ** 2) + (np.sin((lon2_ - lon1_) / 2) ** 2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))
    return dist
if __name__ == '__main__':
    lat_loc = df_exposure.loc[:, 'Lat']
    lon_loc = df_exposure.loc[:, 'Lon']
    start = time.monotonic()
    for i in range(len(lat_loc)):
        r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
        dist = r.min()  # find minimum distance
        df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] + df_exposure.loc[i, 'Limit']
    print(f'with for loop and series time took: {time.monotonic() - start:.1f} s.')
Out:
with for loop and series time took: 456.3 s.
You should understand that in this case you are passing lat and lon to the haversine function as pd.Series. The function is vectorized either way, but plain numpy arrays avoid the Series indexing overhead. Let's pass the coordinates as numpy arrays:
lon = df['Lon'].values
lat = df['Lat'].values
start = time.monotonic()
for i in range(len(lat_loc)):
    r = haversine(lat, lon, lat_loc[i], lon_loc[i])
    j = np.argmin(r)
    df.iloc[j, 0] = df.iloc[j, 0] + df_exposure.iloc[i, 0]
print(f'with for loop and ndarray time took: {time.monotonic() - start:.1f} s.')
Out:
with for loop and ndarray time took: 62.4 s.
Wow! The speed-up is ~7×.
Let's try V.M's answer and use the DistanceMetric class from the sklearn.metrics module:
from sklearn.metrics import DistanceMetric
distance = DistanceMetric.get_metric('haversine')
lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']
lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
start = time.monotonic()
res = (6371 * distance.pairwise((np.array([lat1, lon1]) * np.pi / 180).T,
                                (np.array([lat2, lon2]) * np.pi / 180).T)).argmin(axis=1)
print(f'with sklearn pairwise distance time took: {time.monotonic() - start:.1f} s.')
Out:
with sklearn pairwise distance time took: 45.6 s.
Even better! The speed-up is ~10×.
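As an aside, when only the nearest neighbour is needed, sklearn's BallTree with the haversine metric avoids materialising the full pairwise matrix at all. A sketch on random toy data (BallTree expects coordinates in radians and returns distances in radians):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
exposure = rng.uniform(-10, 10, size=(1000, 2))   # (lat, lon) pairs in degrees
grid = rng.uniform(-10, 10, size=(50, 2))

# build the tree once on the small grid, query with the big exposure set
tree = BallTree(np.radians(grid), metric='haversine')
dist, idx = tree.query(np.radians(exposure), k=1)  # nearest grid row per exposure row
dist_km = dist[:, 0] * 6371                        # convert radians to km
nearest = idx[:, 0]
```

Query time grows roughly as N log M instead of N × M, which matters at 300k × 3k.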
But what if the logic inside the loop is moved to a new function and the apply method is used?
def foo(row, lat, lon):
    """
    row: row of DataFrame
    lat: ndarray with latitude
    lon: ndarray with longitude
    """
    r = haversine(lat, lon, row[1], row[2])
    return r.argmin()
start = time.monotonic()
res = df_exposure.apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'synchronous apply time took: {time.monotonic() - start:.1f} s.')
Out:
synchronous apply time took: 32.4 s.
Wow! It's even faster.
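Note that res holds, for each row of df_exposure, only the index of its nearest df row; the Limit allocation the question asks for can then be finished in one vectorized step, e.g. with a groupby. A sketch with toy values:

```python
import pandas as pd

# toy stand-ins: 'res' plays the role of the nearest-index result computed above
res = pd.Series([0, 1, 0])                                 # nearest df row per exposure row
df_exposure = pd.DataFrame({'Limit': [10.0, 20.0, 30.0]})
df = pd.DataFrame({'Limit': [0.0, 0.0, 0.0]})

# sum the exposure Limits that map to each df row, then add them in one shot
alloc = df_exposure['Limit'].groupby(res.to_numpy()).sum()
df['Limit'] = df['Limit'].add(alloc, fill_value=0)
```

fill_value=0 keeps df rows that received no exposure at their original Limit.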
Can we speed up the calculations even more? Yes, if we remember that pandas always runs on a single CPU core. We need to parallelize the fastest of the approaches above. This can easily be done with parallel-pandas:
# pip install parallel-pandas
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(disable_pr_bar=True)

# p_apply is a parallel analog of the apply method
start = time.monotonic()
res = df_exposure.p_apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'parallel apply time took: {time.monotonic() - start:.1f} s.')
Out:
parallel apply time took: 3.7 s.
This is amazing!
The total speed-up is 456 / 3.7 ≈ 120×.