So I am trying to find a more efficient way of doing a task I already wrote some code for. The code uses four columns (LATITUDE, LONGITUDE, YORK_LATITUDE, YORK_LONGITUDE) of a pandas DataFrame to create a new column holding the distance in kilometers between two coordinates, where the first coordinate is (LATITUDE, LONGITUDE) and the second is (YORK_LATITUDE, YORK_LONGITUDE).
To complete the task right now, I build a list using the following code (geopy and pandas iterrows), convert it into a column, and concatenate that to the DataFrame. This is cumbersome. I know there is an easier way using .apply with the geopy function, but I haven't been able to figure out the syntax.
from geopy.distance import geodesic as GD
distances = []  # named "distances" to avoid shadowing the built-in list
for index, row in result.iterrows():
    coordinate1 = (row['LATITUDE'], row['LONGITUDE'])
    coordinate2 = (row['LATITUDE_YORK_UNIVERSITY'], row['LONGITUDE_YORK_UNIVERSITY'])
    distances.append(GD(coordinate1, coordinate2).km)
CodePudding user response:
TL;DR
df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
Some explanation
Let's say we have a function that requires two tuples as arguments. For example:
from math import dist
def distance(point1: tuple, point2: tuple) -> float:
    # suppose that developer checks the type
    # so we can pass only tuples as arguments
    assert type(point1) is tuple
    assert type(point2) is tuple
    return dist(point1, point2)
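A quick check of how this function behaves, as a minimal sketch. The sample points are made up for illustration, and the function is repeated so the snippet runs on its own:

```python
from math import dist

def distance(point1: tuple, point2: tuple) -> float:
    # same type check as in the answer: only tuples are accepted
    assert type(point1) is tuple
    assert type(point2) is tuple
    return dist(point1, point2)

# illustrative points forming a 3-4-5 right triangle
d = distance((0, 0), (3, 4))

# lists hold the same numbers but fail the type check
try:
    distance([0, 0], [3, 4])
    rejected = False
except AssertionError:
    rejected = True
```

This is why a row slice has to be converted to a tuple before it is handed to such a function.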
Let's apply the function to this data:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    data=np.arange(10*4).reshape(10, 4),
    columns=['long', 'lat', 'Y long', 'Y lat']
)
We pass two parameters to apply: axis=1 to iterate over rows, and a wrapper over distance as a lambda function. To split the row into tuples we can apply tuple(...) or (*...,); note the comma at the end in the latter option:
df.apply(lambda x: distance((*x[:2],), (*x[2:],)), axis=1)
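Both ways of building the tuples can be compared in a self-contained sketch. It repeats the sample function and data, and it uses .iloc for the slices, which is my own choice to make the positional slicing explicit rather than something the answer requires:

```python
from math import dist, isclose, sqrt

import numpy as np
import pandas as pd

def distance(point1: tuple, point2: tuple) -> float:
    # only tuples are accepted, as in the sample function
    assert type(point1) is tuple
    assert type(point2) is tuple
    return dist(point1, point2)

df = pd.DataFrame(
    data=np.arange(10 * 4).reshape(10, 4),
    columns=['long', 'lat', 'Y long', 'Y lat'],
)

# option 1: tuple(...) around each half of the row
with_tuple = df.apply(
    lambda x: distance(tuple(x.iloc[:2]), tuple(x.iloc[2:])), axis=1
)

# option 2: unpacking into a tuple literal; the trailing comma is required
with_unpack = df.apply(
    lambda x: distance((*x.iloc[:2],), (*x.iloc[2:],)), axis=1
)
```

In this sample data every row's two points differ by 2 in each coordinate, so both options give the same distance for every row.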
The thing is that geopy.distance doesn't require its arguments to be exactly tuples; they can be any iterables with 2 to 3 elements (see in the geopy source how an argument is transformed into the inner type Point while defining distance). So, with distance taken from geopy.distance instead of the sample function above, we can simplify this to:
df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
To make it independent of the column order, we could write this (in your terms):
common_point = ['LATITUDE','LONGITUDE']
york_point = ['LATITUDE_YORK_UNIVERSITY','LONGITUDE_YORK_UNIVERSITY']
result.apply(lambda x: GD(x[common_point], x[york_point]).km, axis=1)