Finding the distance (Haversine) between all elements in a single dataframe-CodePudding

I currently have a dataframe which includes five columns as seen below. I group the elements of the original dataframe such that they are within a 100km x 100km grid. For each grid element, I need to determine whether there is at least one set of points which are 100m away from each other. In order to do this, I am using the Haversine formula and calculating the distance between all points within a grid element using a for loop. This is rather slow as my parent data structure can have billions of points, and each grid element millions. Is there a quicker way to do this?

Here is a view into a group in the dataframe. "approx_LatSp" & "approx_LonSp" are what I use for groupBy in a previous function.

print(group.head())
            Time        Lat         Lon  approx_LatSp  approx_LonSp
197825  1.144823 -69.552576 -177.213646         -70.0   -177.234835
197826  1.144829 -69.579416 -177.213370         -70.0   -177.234835
197827  1.144834 -69.606256 -177.213102         -70.0   -177.234835
197828  1.144840 -69.633091 -177.212856         -70.0   -177.234835
197829  1.144846 -69.659925 -177.212619         -70.0   -177.234835

This group is equivalent to one grid element. This group gets passed to the following function which seems to be the crux of my issue (from a performance perspective):

def get_pass_in_grid(group):
    '''
        Checks if there are two points within 100m
    '''
    check_100m = 0
    check_1km  = 0

    row_mins = []

    for index, row in group.iterrows():
        # Get distance
        distance_from_row = get_distance_lla(row['Lat'], row['Lon'], group['Lat'].drop(index), group['Lon'].drop(index))
        minimum = np.amin(distance_from_row)

        row_mins = row_mins   [minimum]
    array = np.array(row_mins)

    m_100 = array[array < 0.1]
    km_1  = array[array < 1.0]

    if m_100.size > 0:
        check_100m = 1   
    if km_1.size > 0:
        check_1km = 1

    return check_100m, check_1km

And the Haversine formula is calculated as follows

def get_distance_lla(row_lat, row_long, group_lat, group_long):
    def radians(degrees):
        return degrees * np.pi / 180.0
    
    global EARTH_RADIUS
    lon1 = radians(group_long)
    lon2 = radians(row_long)
    lat1 = radians(group_lat)
    lat2 = radians(row_lat)
      
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2   np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
 
    c = 2 * np.arcsin(np.sqrt(a))
      
    # calculate the result
    return(c * EARTH_RADIUS)

One way in which I know I can improve this code is to stop the for loop if the 100m is met for any two points. If this is the only way to improve the speed then I will apply this. But I am hoping there is a better way to resolve my problem. Any thoughts are greatly appreciated! Let me know if I can help to clear something up.

CodePudding user response：

Convert all points to carthesian coordinates to have much easier task (distance of 100m is small enough to disregard that Earth is not flat)
Divide each grid into NxN subgrids (20x20, 100x100? check what is faster), for each point determine in which subgrid it is located. Determine distances within smaller subgrids (and their neighbours) instead of searching whole grid.
Use numpy to vectorize calculations (doing point no1 will definitely help you)

CodePudding user response：

Thanks @Corralien for his advice. I was able to use the BallTree in order to quickly find the closest elements. Improvement is something like 100x over my original code from a performance standpoint. Here is the new get_pass_in_grid:

def get_pass_in_grid(group):
    '''
        Checks if there is a pass within 100m to meet SWE L-Band requirement
    '''
    check_100m = 0
    check_1km  = 0

    if len(group) < 2:
        return check_100m, check_1km 

    row_mins = []
    group['Lat'] = np.deg2rad(group['Lat'])
    group['Lon'] = np.deg2rad(group['Lon'])
    temp = np.array([group['Lat'],group['Lon']]).T
    tree = BallTree(temp, leaf_size=2, metric='haversine')

    for _, row in group.iterrows():
        # Get distance
        row_arr = np.array([row['Lat'], row['Lon']]).reshape((-1,2))
        closest_elem_lst, _ = tree.query(row_arr, k=2)

        # First element is always just this one (since d=0)
        closest_elem = closest_elem_lst[0,1] * EARTH_RADIUS

        row_mins = row_mins   [closest_elem]

        if closest_elem < 0.1:
            break
    array = np.array(row_mins)

    m_100 = array[array < 0.1]
    km_1  = array[array < 1.0]

    if m_100.size > 0:
        check_100m = 1 
    if km_1.size > 0:
        check_1km = 1

    return check_100m, check_1km