I want calculate the distance from each building in df1 to each city in df2. The issue is that I have ~30 rows in df1 and ~30,000 rows in df2.
The desired output is set up in output_df.
How do I go about this? Is it possible using Geopy or would that take too long?
df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971],
'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
})
df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
'Latitude': [34.024212, 38.9072],
'Longitude': [-118.496475, -77.0369],
})
output_df = pd.DataFrame({'City': ['Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.'],
'Building': ['One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt', 'One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971, 40.713005, 40.765957, 41.878872, 40.764760, 40.752971],
'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541, -74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
'Distance': ['dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC']})
output_df.set_index(['City', 'Building'])
CodePudding user response:
Using distance()
from GeoPy:
from geopy import distance
pd.merge(df2.rename(columns={'Latitude':'C-Lat','Longitude':'C-Lon'}), df1, how='cross') \
.assign(Distance=lambda r: \
r.apply(lambda x: distance.distance((x['C-Lat'],x['C-Lon']),(x['Latitude'],x['Longitude'])).miles, axis=1)
).drop(columns=['C-Lat','C-Lon'])
City Building Latitude Longitude Distance
0 Santa Barbra, CA One World Trade Center 40.713005 -74.013190 2464.602573
1 Santa Barbra, CA Central Park Tower 40.765957 -73.980844 2466.054087
2 Santa Barbra, CA Willis Tower 41.878872 -87.635908 1759.257288
3 Santa Barbra, CA 111 West 57th Street 40.764760 -73.977581 2466.230348
4 Santa Barbra, CA One Vanderbilt 40.752971 -73.978541 2466.233832
5 Washington, D.C. One World Trade Center 40.713005 -74.013190 203.461336
6 Washington, D.C. Central Park Tower 40.765957 -73.980844 207.017141
7 Washington, D.C. Willis Tower 41.878872 -87.635908 595.065660
8 Washington, D.C. 111 West 57th Street 40.764760 -73.977581 207.103384
9 Washington, D.C. One Vanderbilt 40.752971 -73.978541 206.571970
CodePudding user response:
Hope this is what you are looking for:
from numpy import radians, cos, sin, sqrt from numpy import arcsin as asin
df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971],
'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
})
df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
'Latitude': [34.024212, 38.9072],
'Longitude': [-118.496475, -77.0369],
})
Full join
df1['key'] = 1
df2['key'] = 1
df = pd.merge(df1, df2, on = 'key')
Distance between 2 cities
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
# Radius of earth in kilometers is 6371
km = 6371* c
return km
df['distance'] = df.apply(lambda x:haversine(x['Latitude_x'], x['Longitude_x'], x['Latitude_y'], x['Longitude_y']) * 0.90, axis = 1)
print(df)
The distance now is in km
. The haversine
function I took from this post.
CodePudding user response:
Assume for simplicity that Earth is a perfect sphere with radius r.
Let θ = longitude (east of Greenwich, London) and φ = latitude (north of the equator). You can then transform the coordinates into rectangular coordinates as follows:
- x = r cos(θ) cos(φ)
- y = r sin(θ) cos(φ)
- z = r sin(φ)
If you have two of these longitude/latitude vectors, then the dot product of the two is:
x1x2 y1y2 z1z2
= r2cos(θ1)cos(φ1)cos(θ2)cos(φ2) r2sin(θ1)cos(φ1)sin(θ2)cos(φ2) r2sin(φ1)sin(φ2)
= r2(cos(θ1)cos(φ1)cos(θ2)cos(φ2) sin(θ1)cos(φ1)sin(θ2)cos(φ2) sin(φ1)sin(φ2))
= r2(cos(φ1)cos(φ2)(cos(θ1)cos(θ2) sin(θ1)sin(θ2)) sin(φ1)sin(φ2))
= r2(cos(φ1)cos(φ2)cos(θ1-θ2) sin(φ1)sin(φ2))
But the dot product is also equal to r2cos(α), where α is the angle between two vectors. Thus,
α = acos(cos(φ1)cos(φ2)cos(θ1-θ2) sin(φ1)sin(φ2))
This gives an angle in radians. To convert to distance (along a great circle), we simply need to multiply by the radius of the earth. But because Earth isn't a perfect sphere, its radius isn't a constant, but varies between 6356.7523 km at the poles to 6378.1370 km at the equator.
For the "average" radius, I'll use the volumetric radius of 6371.0008 km, converted to the appropriate unit of measure.
# Choose one of the following:
EARTH_RADIUS = 6371.0008 # kilometers
EARTH_RADIUS = 3958.7564 # miles
EARTH_RADIUS = 3440.0652 # nautical miles
Putting it all together, here's a function for estimating the distance between two points:
import math
def geo_distance(lat1, lon1, lat2, lon2):
# Convert all angles to radians
lat1_r = math.radians(lat1)
lon1_r = math.radians(lon1)
lat2_r = math.radians(lat2)
lon2_r = math.radians(lon2)
# Calculate the distance
dp = math.cos(lat1_r) * math.cos(lat2_r) * math.cos(lon1_r - lon2_r) math.sin(lat1_r) * math.sin(lat2_r)
angle = math.acos(dp)
return EARTH_RADIUS * angle
And a test case for the distance between Santa Barbara, CA and Washington, DC (in miles).
>>> geo_distance(34.024212, -118.496475, 38.9072, -77.0369)
2308.3300743996306
If you need the calculation to be faster, you could pre-compute the rectangular (x, y, z) coordinates for each point, and compute the dot products based on that, so you don't need to call the trig functions as often.