Home > database >  Save the minimum value of a nested for loop to Pandas Dataframe
Save the minimum value of a nested for loop to Pandas Dataframe

Time:12-11

I am trying to calculate the distance from each point (coordinate) of a column in a Dataframe to each point (coordinate) of another column in another Dataframe and save the minimum distance to a Dataframe, such that the resulting Dataframe has the same length as the first column. data: intern:

ID coordinates
1 (50.939266, 6.934996)
2 (50.935998, 6.931481)

360 entries

stops:

Name coordinates
A (50.93576, 6.96046)
B (50.9357, 6.95957)

2082 entries

desired result:

ID coordinates min_distance
1 (50.939266, 6.934996) 1.8263939732112
2 (50.935998, 6.931481) 0.3

intern: 3 columns with 360 entries each, where the last column is the minimum distance of the coordinate in the second column to each coordinate of the stops Dataframe

I have tried:

intern_index = 0
stops_index = 0
min_distance = 99999


hs.haversine((50.939266, 6.934996), (50.93576, 6.96046), unit=Unit.KILOMETERS)
Out: 1.8263939732112031

for i in intern['coordinates']:
     
    
    for j in stops['coordinates']:
         
        new_distance = hs.haversine(i, j, unit=Unit.KILOMETERS)
        
        if new_distance < new_distance:
            min_distance = new_distance
             
            
    intern['min_distance'] = min_distance 

this yields:

ID coordinates min_distance
1 (50.939266, 6.934996) 22.941973
2 (50.935998, 6.931481) 22.941973

why doesn't it save the correct value? even the first try was smaller than this value and it cannot all be the same distance either

CodePudding user response:

I would suggest implementing what you want in the vectorised form as it's much faster. Numpy is super-efficient in the kind of calculation you need.

Initialising some test data:

df_1 = pd.DataFrame({"ID": [1, 2], 
                     "coordinates": [(1.0, 0.0), (-0.5, 0.0)]})
df_1.set_index("ID", inplace=True)
df_2 = pd.DataFrame({"Name": ["A", "B", "C"], 
                     "coordinates": [(1.0, 0.0), (0.0, 0.0), (-1.0, 0.0)]})
df_2.set_index("Name", inplace=True)

With the help of Vectorizing Haversine distance calculation in Python writing a vectorised version of haversine distance calculation between each pair of the two collections of inputs:

def haversine(df_1, df_2):
    np_1 = np.array(df_1['coordinates'].tolist())
    np_2 = np.array(df_2['coordinates'].tolist())
    np_1 = np.deg2rad(np_1)
    np_2 = np.deg2rad(np_2)        
    diff_lat = np_1[:, 0, None] - np_2[:, 0]
    diff_lng = np_1[:, 1, None] - np_2[:, 1]
    d = np.sin(diff_lat/2)**2   np.cos(np_1[:, 0, None])*np.cos(np_2[:, 0]) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Applying:

df_1["min_haversine_dist"] = np.min(haversine(df_1, df_2), axis=1)
df_1.head()

Returns:

    ID, coordinates,    min_haversine_dist  
    1   (1.0, 0.0)  0.000000
    2   (-0.5, 0.0) 55.597463

Note. My personal rule of thumb: if I'm looping over a dataFrame I'm doing something wrong, if I have a nested loop over records in a dataFrame I'm doing something completely wrong.

  • Related