Faster way to calculate distance from coordinates on DataFrame-CodePudding

I want calculate the distance from each building in df1 to each city in df2. The issue is that I have ~30 rows in df1 and ~30,000 rows in df2.

The desired output is set up in output_df.

How do I go about this? Is it possible using Geopy or would that take too long?

df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 })

df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
                 'Latitude': [34.024212, 38.9072], 
                 'Longitude': [-118.496475, -77.0369],
                 })

output_df = pd.DataFrame({'City': ['Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.'], 
                 'Building': ['One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt', 'One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971, 40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541, -74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 'Distance': ['dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC']})

output_df.set_index(['City', 'Building'])

CodePudding user response：

Using distance() from GeoPy:

from geopy import distance

pd.merge(df2.rename(columns={'Latitude':'C-Lat','Longitude':'C-Lon'}), df1, how='cross') \
    .assign(Distance=lambda r: \
        r.apply(lambda x: distance.distance((x['C-Lat'],x['C-Lon']),(x['Latitude'],x['Longitude'])).miles, axis=1)
           ).drop(columns=['C-Lat','C-Lon'])

               City                Building   Latitude  Longitude     Distance
0  Santa Barbra, CA  One World Trade Center  40.713005 -74.013190  2464.602573
1  Santa Barbra, CA      Central Park Tower  40.765957 -73.980844  2466.054087
2  Santa Barbra, CA            Willis Tower  41.878872 -87.635908  1759.257288
3  Santa Barbra, CA    111 West 57th Street  40.764760 -73.977581  2466.230348
4  Santa Barbra, CA          One Vanderbilt  40.752971 -73.978541  2466.233832
5  Washington, D.C.  One World Trade Center  40.713005 -74.013190   203.461336
6  Washington, D.C.      Central Park Tower  40.765957 -73.980844   207.017141
7  Washington, D.C.            Willis Tower  41.878872 -87.635908   595.065660
8  Washington, D.C.    111 West 57th Street  40.764760 -73.977581   207.103384
9  Washington, D.C.          One Vanderbilt  40.752971 -73.978541   206.571970

CodePudding user response：

Hope this is what you are looking for:

from numpy import radians, cos, sin, sqrt from numpy import arcsin as asin

df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 })

df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
                 'Latitude': [34.024212, 38.9072], 
                 'Longitude': [-118.496475, -77.0369],
                 })

Full join

df1['key'] = 1
df2['key'] = 1
df = pd.merge(df1, df2, on = 'key')

Distance between 2 cities

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2   cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371* c
    return km

df['distance'] = df.apply(lambda x:haversine(x['Latitude_x'], x['Longitude_x'], x['Latitude_y'], x['Longitude_y']) * 0.90, axis = 1)

print(df)

The distance now is in km. The haversine function I took from this post.

CodePudding user response：

Assume for simplicity that Earth is a perfect sphere with radius r.

Let θ = longitude (east of Greenwich, London) and φ = latitude (north of the equator). You can then transform the coordinates into rectangular coordinates as follows:

x = r cos(θ) cos(φ)
y = r sin(θ) cos(φ)
z = r sin(φ)

If you have two of these longitude/latitude vectors, then the dot product of the two is:

x₁x₂ y₁y₂ z₁z₂

= r²cos(θ₁)cos(φ₁)cos(θ₂)cos(φ₂) r²sin(θ₁)cos(φ₁)sin(θ₂)cos(φ₂) r²sin(φ₁)sin(φ₂)

= r²(cos(θ₁)cos(φ₁)cos(θ₂)cos(φ₂) sin(θ₁)cos(φ₁)sin(θ₂)cos(φ₂) sin(φ₁)sin(φ₂))

= r²(cos(φ₁)cos(φ₂)(cos(θ₁)cos(θ₂) sin(θ₁)sin(θ₂)) sin(φ₁)sin(φ₂))

= r²(cos(φ₁)cos(φ₂)cos(θ₁-θ₂) sin(φ₁)sin(φ₂))

But the dot product is also equal to r²cos(α), where α is the angle between two vectors. Thus,

α = acos(cos(φ₁)cos(φ₂)cos(θ₁-θ₂) sin(φ₁)sin(φ₂))

This gives an angle in radians. To convert to distance (along a great circle), we simply need to multiply by the radius of the earth. But because Earth isn't a perfect sphere, its radius isn't a constant, but varies between 6356.7523 km at the poles to 6378.1370 km at the equator.

For the "average" radius, I'll use the volumetric radius of 6371.0008 km, converted to the appropriate unit of measure.

# Choose one of the following:
EARTH_RADIUS = 6371.0008 # kilometers
EARTH_RADIUS = 3958.7564 # miles
EARTH_RADIUS = 3440.0652 # nautical miles

Putting it all together, here's a function for estimating the distance between two points:

import math

def geo_distance(lat1, lon1, lat2, lon2):
    # Convert all angles to radians
    lat1_r = math.radians(lat1)
    lon1_r = math.radians(lon1)
    lat2_r = math.radians(lat2)
    lon2_r = math.radians(lon2)
    # Calculate the distance
    dp = math.cos(lat1_r) * math.cos(lat2_r) * math.cos(lon1_r - lon2_r)   math.sin(lat1_r) * math.sin(lat2_r)
    angle = math.acos(dp)
    return EARTH_RADIUS * angle

And a test case for the distance between Santa Barbara, CA and Washington, DC (in miles).

>>> geo_distance(34.024212, -118.496475, 38.9072, -77.0369)
2308.3300743996306

If you need the calculation to be faster, you could pre-compute the rectangular (x, y, z) coordinates for each point, and compute the dot products based on that, so you don't need to call the trig functions as often.