I'm building a program in Python that loads in the geographic location of all addresses in my country and then calculates a "remoteness" value for each address by considering its distance to all the other addresses. Since there are 3-5 million addresses in my country, I ultimately need to run 3-5 million iterations of the algorithm that calculates the index.
I have already taken some measures to bring down running time, including:
- trying to process the data efficiently using NumPy
- instead of having each address look at its distance to every other address, I divide the country into zones, each of which has already been assigned a population; each address then only needs to know how many zone centers fall within each of the distances enumerated in "RADII"
It's still taking 23 seconds to get through a list of 5000 addresses, which means I expect 24h running time for the full dataset of 3-5 million.
So my question is: which parts of my algorithm should be improved in order to save time? What stands out to you as slow, redundant or inefficient code? Does my use of JSON and NumPy make sense, for example? Could simplifying the "distance" function be of much help? Or would I gain a significant advantage by throwing it all out and using another language than Python?
Any advice would be much appreciated. I'm a newbie programmer so there could easily be issues I'm not aware of.
Here's the code. It's currently working on a limited dataset (one minor island), which makes up about 0.1% of the full dataset. The algorithm starts at the main loop (marked #Main loop below):
import json
import math

import numpy as np
from cs50 import SQL  # assumption: SQL(...) looks like CS50's wrapper; adjust to your actual import

RADII = [888, 1480, 2368, 3848, 6216, 10064, 16280]
R = 6371000
remoteness = []

db = SQL("sqlite:///samsozones.db")


def main():
    with open("samso.json", "r") as json_data:
        data = json.load(json_data)

    rows = db.execute("SELECT * FROM zones")

    # Establish amount of zones with addresses in them
    ZONES = len(rows)

    # Initialize matrix with location of the center of each zone and the population of the zone
    zonematrix = np.zeros((ZONES, 3), dtype="float")
    for i, row in enumerate(rows):
        zonematrix[i, :] = row["x"], row["y"], row["population"]

    # Initialize matrix with distance from current address to centers of each zone and the
    # population of the zone (will be filled out for each address in the main loop)
    distances = np.zeros((ZONES, 2), dtype="float")

    # Main loop (calculate remoteness index for each address)
    for address in data:
        # Reset remoteness index for new address
        index = 0

        # Calculate distance from address to each zone center and insert the distances
        # into the distances matrix along with population
        for j in range(ZONES):
            distances[j, 0] = distance(address["x"], address["y"], zonematrix[j, 0], zonematrix[j, 1])
            distances[j, 1] = zonematrix[j, 2]

        # Calculate remoteness index
        for radius in RADII:
            # Calculate amount of zone centers within each radius and add up their population
            allwithincircle = distances[distances[:, 0] < radius]
            count = len(allwithincircle)
            pop = allwithincircle[:, 1].sum()

            # Calculate average within-radius zone population
            try:
                factor = pop / count
            except:
                factor = 0

            # Increment remoteness index
            index += (1 / radius) * factor

        remoteness.append((address["betegnelse"], index))


# Haversine function by Deduplicator, adapted from
# https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
def distance(lat1, lon1, lat2, lon2):
    dLat = deg2rad(lat2 - lat1)
    dLon = deg2rad(lon2 - lon1)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2)) * math.sin(dLon / 2) * math.sin(dLon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = R * c
    return d


def deg2rad(deg):
    return deg * (math.pi / 180)


if __name__ == "__main__":
    main()

# Running time (N = 5070): 23 seconds
# Results quite reasonable
# Scales more or less linearly
CodePudding user response:
I can recommend a number of potential speed-ups:
Don't use Haversine to calculate your distances. Find a good local projection for your country (you don't say where you are, so I can't suggest one) and reproject your data into that CRS. That will let you use simple Euclidean distances, which are much faster to compute (especially if you work with the square of the distance and save a bunch of square roots).
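For illustration only, a minimal sketch of that idea, assuming pyproj is available; EPSG:25832 (ETRS89 / UTM 32N) is just an example of a local metric CRS, and the coordinates are made up:

import numpy as np
from pyproj import Transformer

# Reproject lon/lat (EPSG:4326) to a metric CRS once, up front
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)

# Hypothetical zone centres and one address, in degrees
zone_lon = np.array([10.60, 10.61, 10.62])
zone_lat = np.array([55.86, 55.87, 55.88])
addr_lon, addr_lat = 10.605, 55.865

zone_x, zone_y = transformer.transform(zone_lon, zone_lat)   # metres
addr_x, addr_y = transformer.transform(addr_lon, addr_lat)

# Compare squared distances against squared radii: no square roots needed
sq_dist = (zone_x - addr_x) ** 2 + (zone_y - addr_y) ** 2
for radius in (888, 1480, 2368):
    print(radius, int((sq_dist < radius ** 2).sum()))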
I would avoid calculating the distances from all the possible zones by converting the calculation to a raster surface of population, calculating the remoteness for each cell, and then looking each address up on that raster. If one house on my street is remote, I don't need to look the rest up! For good examples, see Wilderness attribute mapping in the United Kingdom by my former colleagues Carver, Evans and Fritz.
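A very rough sketch of the raster idea, under my own assumptions (the grid extent, cell size and remoteness_at placeholder are all hypothetical; the real per-cell value would come from whatever remoteness calculation you settle on):

import numpy as np

def remoteness_at(x, y):
    # Hypothetical placeholder: stands in for the real per-point remoteness calculation
    return np.hypot(x - 585_000, y - 6_195_000)

cell = 500.0                                                      # raster resolution in metres
xmin, xmax, ymin, ymax = 570_000, 600_000, 6_180_000, 6_210_000   # made-up extent (projected)

xs = np.arange(xmin, xmax, cell) + cell / 2   # cell-centre coordinates
ys = np.arange(ymin, ymax, cell) + cell / 2
gx, gy = np.meshgrid(xs, ys)
raster = remoteness_at(gx, gy)                # one value per cell, not per address

# Look each address up in the raster instead of recomputing its remoteness
addr_x = np.array([585_123.0, 590_456.0])
addr_y = np.array([6_195_000.0, 6_200_500.0])
col = ((addr_x - xmin) // cell).astype(int)
row = ((addr_y - ymin) // cell).astype(int)
addr_remoteness = raster[row, col]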
CodePudding user response:
My answer focuses on the computational tricks you can try and implement in these kinds of situations rather than on domain-specific optimizations (which you should definitely prioritize as they'll usually give you the largest speedup).
For that, I use cython, which adds static typing to python and tries to transpile as much of the code as possible to C.
calc_dist takes an Nx2 numpy array of N latitudes and longitudes as input, and uses it to compute an NxN distance array where every value above the diagonal at coordinates i,j represents the distance between locations i and j:
# To test, I recommend using a notebook with the %%cython cell magic
# %load_ext cython
# %%cython

import numpy as np

cimport numpy as np
from libc.math cimport sin, cos, asin, sqrt, pi
import cython

ctypedef np.float64_t dtype_t


@cython.boundscheck(False)
@cython.wraparound(False)
def calc_dist(dtype_t[:, :] coords):
    result = np.zeros((coords.shape[0], coords.shape[0]), dtype=np.float64)
    cdef dtype_t[:, :] result_view = result
    cdef int d1, d2
    for d1 in range(coords.shape[0]):
        for d2 in range(d1, coords.shape[0]):
            result_view[d1][d2] = haversine(coords[d1][0], coords[d1][1], coords[d2][0], coords[d2][1])
    return result


@cython.cdivision(True)
cdef dtype_t haversine(dtype_t lat1, dtype_t lon1, dtype_t lat2, dtype_t lon2):
    """Pure C haversine. Based on https://stackoverflow.com/a/29546836/13145954"""
    cdef dtype_t deg_to_rad = pi / 180
    lon1 = lon1 * deg_to_rad
    lon2 = lon2 * deg_to_rad
    lat1 = lat1 * deg_to_rad
    lat2 = lat2 * deg_to_rad
    cdef dtype_t dlon = lon2 - lon1
    cdef dtype_t dlat = lat2 - lat1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
Running calc_dist on the following random coordinates takes only 780ms on my laptop, which translates to a running time of about 13 minutes for 5 million rows.
coords = np.random.random((5000,2))
coords[:,0] = (coords[:,0]*2-1)*90
coords[:,1] = (coords[:,1]*2-1)*180
NB: the code will have to be adapted to process the data in chunks in order to avoid OOM errors on the full database.
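For example, here is a rough sketch of what chunked processing could look like (my own assumption, not part of the answer above): compute the distances for one block of rows at a time against the full coordinate set, reduce each block immediately (here, counting neighbours within 100 km with a plain NumPy haversine), and never hold the full NxN matrix in memory.

import numpy as np

def haversine_np(lat1, lon1, lat2, lon2):
    # Vectorised haversine in km (same formula as above, NumPy instead of Cython)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

coords = np.random.random((10_000, 2))
coords[:, 0] = (coords[:, 0] * 2 - 1) * 90
coords[:, 1] = (coords[:, 1] * 2 - 1) * 180

chunk = 500
within_100km = np.empty(len(coords), dtype=int)
for start in range(0, len(coords), chunk):
    block = coords[start:start + chunk]
    # (chunk x N) distances for this block only
    d = haversine_np(block[:, None, 0], block[:, None, 1],
                     coords[None, :, 0], coords[None, :, 1])
    within_100km[start:start + chunk] = (d < 100).sum(axis=1)  # reduce, then discard the block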