Running an external function within Pandas Dataframe to speed up processing loops


Good Day Peeps,

I currently have 2 DataFrames, "Locations" and "Pokestops", both containing a list of coordinates. The goal is to cluster the points from "Pokestops" that are within 70m of the points in "Locations".

I have created a "Brute Force" clustering script.

The process is as follows:

  1. Calculate which "Pokestops" are within 70m of each point in "Locations".
  2. Add all nearby Pokestops to Locations["Pokestops"], as a list/array of their index values, e.g. [0, 4, 22]
  3. If no Pokestops are near a point in "Locations", remove that row from the Locations df
import geopy.distance

# Work backwards so dropping a row does not shift the rows still to be visited
for i in range(len(locations) - 1, -1, -1):
    array = []
    for f in range(len(pokestops)):
        # Column 2 is the "Coordinates" column on both DataFrames
        if geopy.distance.geodesic(locations.iloc[i, 2], pokestops.iloc[f, 2]).m <= 70:
            array.append(f)
    if len(array) == 0:
        locations.drop([i], inplace=True)
    else:
        locations.iat[i, 3] = array

# Count the Pokestops per location once, after the loop
locations["Length"] = locations["Pokestops"].map(len)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
2   -33.916432  18.426188                   -33.916432,18.4261883           [1]       1
3   -33.916432  18.426287                  -33.916432,18.42628745           [1]       1
4   -33.916432  18.426387                   -33.916432,18.4263866           [1]       1
5   -33.916432  18.426486                  -33.916432,18.42648575        [0, 1]       2
6   -33.916432  18.426585                   -33.916432,18.4265849        [0, 1]       2
7   -33.916432  18.426684           -33.916432,18.426684050000002        [0, 1]       2
  4. Sort by most to least Pokestops within 70m.
locations.sort_values("Length", ascending=False, inplace=True)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849  [1, 2, 3, 4]       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849  [1, 2, 3, 4]       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002  [1, 2, 3, 4]       4
  5. Remove all index values listed in Locations[0, "Pokestops"] from all other rows, Locations[1:, "Pokestops"]
stops = list(locations['Pokestops'])
seen = list(locations.iloc[0, 3])
stops_filtered = [seen]
for xx in stops[1:]:
    xx = [x for x in xx if x not in seen]
    stops_filtered.append(xx)
locations['Pokestops'] = stops_filtered

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849            []       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849            []       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002            []       4
  6. Remove all rows with an empty list in Locations["Pokestops"]
locations = locations[locations['Pokestops'].map(len)>0]

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002           [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004           [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004           [5]       3
  7. Add Locations[0, "Coordinates"] to an array that can be saved to .txt later, which will form our final list of "Clustered" coordinates.
clusters = np.append(clusters, locations.iloc[0 , 0:2])

This results in:

           Lat       Long                             Coordinates Pokestops  Length
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002       [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004       [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004       [5]       3
64  -33.916035  18.427180   -33.91603540000001,18.427179800000005       [0]       3
  8. Repeat steps 4-7 until the Locations df is empty.

This all results in an array containing the coordinates of every point from the Locations dataframe that has Pokestops within 70m, sorted from largest to smallest cluster.
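
For reference, the repeat in step 8 is just an outer loop around steps 4-7. A condensed sketch of that outer loop (not my exact script; it recounts "Length" on each pass) looks like this:

import numpy as np

clusters = np.array([])

# Repeat steps 4-7 until every location has been clustered or discarded
while len(locations) > 0:
    # Step 4: recount and sort so the densest remaining cluster comes first
    locations["Length"] = locations["Pokestops"].map(len)
    locations.sort_values("Length", ascending=False, inplace=True)

    # Step 7: record the Lat/Long of the densest cluster
    clusters = np.append(clusters, locations.iloc[0, 0:2])

    # Step 5: remove the Pokestops just claimed from all remaining rows
    seen = list(locations.iloc[0, 3])
    remaining = [[x for x in xx if x not in seen] for xx in list(locations["Pokestops"])[1:]]
    locations = locations.iloc[1:].copy()
    locations["Pokestops"] = remaining

    # Step 6: drop rows that no longer have any Pokestops nearby
    locations = locations[locations["Pokestops"].map(len) > 0]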

Now for the actual question.

The method I am using in steps 1-3 results in needing to loop a few million times for a small-to-medium dataset.

I believe I can achieve faster times by migrating away from using the "for" loops and allowing Pandas to calculate the distances between the two tables "Directly" using the geopy.distance.geodesic function.

I am just unsure how to even approach this...

  • How do I get it to iterate through rows without using a for loop?
  • How do I maintain using my "lists/arrays" in my locations["Pokestops"] column?
  • Will it even be faster?

I know there is a library called GeoPandas, but this requires conda, and will mean I need to step away from being able to use my arrays/lists in the column Locations["Pokestops"]. (I also have 0 knowledge of how to use GeoPandas, to be fair.)

I know very broad questions like this are generally shunned, but I am fully self-taught in Python and trying to achieve what is most likely too complicated a script for my level.

I've made it this far, I just need this last step to make it more efficient. The script is fully working, and provides the required results, it simply takes too long to run due to the nested for loops.

Any advice/ideas are greatly appreciated, and keep in mind my knowledge of Python/Pandas is somewhat limited and I do not know all the functions/terminology.

CodePudding user response:

The apply function might be helpful. DataFrame.apply applies the specified function along an axis of the DataFrame, i.e. to every row or every column (and you can control the parameters it is called with). Check this documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) for further understanding.
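
For example, a row-wise version of the distance check could look roughly like this (a small sketch with made-up toy data, not the asker's exact column layout; geopy.distance.geodesic accepts the "lat,long" strings already used in the question):

import pandas as pd
import geopy.distance

# Toy data just to illustrate the apply pattern
locations = pd.DataFrame({"Coordinates": ["-33.916432,18.4261883", "-33.915143,18.4266841"]})
pokestop_coords = ["-33.916432,18.4262875", "-33.915143,18.4270807"]

def nearby_stops(row):
    # Return the indices of all Pokestops within 70 m of this location
    return [i for i, stop in enumerate(pokestop_coords)
            if geopy.distance.geodesic(row["Coordinates"], stop).m <= 70]

# axis=1 passes each row to the function instead of each column
locations["Pokestops"] = locations.apply(nearby_stops, axis=1)

Keep in mind that apply still iterates row by row internally, so the gain here is mainly cleaner code rather than raw speed.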

I do believe explicit loops become very chaotic to maintain once the solution is buried in multiple layers. When working with datasets it is generally better to avoid raw loops and lean on apply-style functions, as they lead to quicker, cleaner solutions.

CodePudding user response:

I don't fully understand your code, but from your text there are a few things you can do to speed this up. I think the most beneficial thing to do is to vectorize your distance calculations, because looping over every combination takes forever. So I borrowed the calculation from this answer and adapted it to create a distance matrix:

import numpy as np
import pandas as pd

pokestops = pd.read_csv('Pokestops.txt', header=None)
pokestops.columns = ["Lat", "Long"]
locations = pd.read_csv('Locations.txt')
locations.columns = ["Lat", "Long"]

def haversine_np_matrix(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    lon1 = np.expand_dims(lon1, axis=0)
    lat1 = np.expand_dims(lat1, axis=0)
    lon2 = np.expand_dims(lon2, axis=1)
    lat2 = np.expand_dims(lat2, axis=1)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

distances = haversine_np_matrix(pokestops["Long"], pokestops["Lat"], locations["Long"], locations["Lat"])

This gives you the distance from each Location to each Pokestop. Now you can use something like distances < 0.07 to find all pairs that are closer than 70 meters. For this to work I stripped Locations.txt of everything but Long and Lat. I am not sure what the meters = 10 and degrees = 0.000009915 in your text do, so you may have to adapt the calculation, and you may want to compare the 6367 km earth radius to the calculation from geopy.distance.geodesic as described here to get the same results.
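
Continuing from the snippet above, the boolean mask can be folded back into the list-per-row "Pokestops" column from the question, something along these lines (a sketch that assumes the distances matrix has locations as rows and pokestops as columns, as produced by the call above):

import numpy as np

# Boolean matrix: True where a Pokestop is within 70 m (0.07 km) of a location
within_70m = distances < 0.07

# For each location (row), collect the column indices of the nearby Pokestops
locations["Pokestops"] = [list(np.flatnonzero(row)) for row in within_70m]

# Drop locations with no nearby Pokestops and count the rest, as in steps 2-3 of the question
locations = locations[locations["Pokestops"].map(len) > 0].copy()
locations["Length"] = locations["Pokestops"].map(len)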
