Flightradar24 pandas groupby and vectorize. A no looping solution

Time:09-16

I want to run a fast check on Flightradar24 data to see whether the speed implied by the distance travelled matches the reported speed. I have multiple flights and was told not to run nested loops over pandas DataFrames. Here is a sample DataFrame:

import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance

dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes, 
        'Callsign': ["1", "1","2","2","2"],
        'Position':[Point(30.542175,-91.13999200000001), Point(30.546204,-91.14020499999999),Point(30.551443,-91.14417299999999),Point(30.553909,-91.15136699999999),Point(30.554489,-91.155075)]
       }
df = pd.DataFrame(data)

What I want to do is add a new column called "dist". This column should be 0 for the first row of each callsign; for every other row it should be the distance between that row's point and the previous point of the same callsign.

The resulting df should look like this:

df1 = df.copy()  # copy so the expected result does not modify df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist

What I have tried is to first assign a group index:

df['group_index'] = df.groupby('Callsign').cumcount()

Then group by callsign and try to apply the function:

df['dist'] = df.groupby('Callsign').apply(lambda g: 0 if g.group_index == 0 else distance((g.Position.x , g.Position.y),
    (g.Position.shift().x , g.Position.shift().y)).miles)

I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.

The code fails for at least one reason: the .x and .y attributes of the shapely Point are being accessed on the whole Series rather than on each individual Point object.

Any ideas on how to fix this would be much appreciated.

CodePudding user response:

  1. Sort df by callsign then timestamp
  2. Compute distances between adjacent rows using a temporary column of shifted points
  3. For the first row of each new callsign, set distance to 0
  4. Drop temporary column

df = df.sort_values(by=['Callsign', 'UTC'])

df['Position_prev'] = df['Position'].shift().bfill()

def get_dist(row):
    # Distance in miles between this row's point and the previous row's point
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles

df['dist'] = df.apply(get_dist, axis=1)

# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()

# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0

# Drop shifted column
df = df.drop(columns='Position_prev')

print(df)

                  UTC Callsign                              Position      dist
0 2020-12-26 15:13:01        1  POINT (30.542175 -91.13999200000001)  0.000000
1 2020-12-26 15:13:07        1  POINT (30.546204 -91.14020499999999)  0.277833
2 2020-12-26 15:13:19        2  POINT (30.551443 -91.14417299999999)  0.000000
3 2020-12-26 15:13:32        2  POINT (30.553909 -91.15136699999999)  0.461314
4 2020-12-26 15:13:38        2          POINT (30.554489 -91.155075)  0.224645
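
Note that df.apply(..., axis=1) still calls the distance function once per row. If a fully vectorized version is needed, one option is to keep latitude and longitude as plain float columns and compute a great-circle (haversine) distance with NumPy. The following is only a sketch under some assumptions: the 'lat' and 'lon' columns and the haversine_miles helper are hypothetical names that are not in the original code, the Earth is treated as a sphere with a mean radius of about 3958.8 miles, and the results will therefore be close to, but not identical to, the ellipsoidal values geopy returns. With the Points above, the columns could be built with something like df['Position'].apply(lambda p: p.x) for latitude and p.y for longitude.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; inputs are arrays/Series in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Same sample data as the question, with coordinates as float columns
# ('lat' and 'lon' are hypothetical column names).
df = pd.DataFrame({
    'UTC': pd.to_datetime(['2020-12-26 15:13:01', '2020-12-26 15:13:07',
                           '2020-12-26 15:13:19', '2020-12-26 15:13:32',
                           '2020-12-26 15:13:38']),
    'Callsign': ['1', '1', '2', '2', '2'],
    'lat': [30.542175, 30.546204, 30.551443, 30.553909, 30.554489],
    'lon': [-91.139992, -91.140205, -91.144173, -91.151367, -91.155075],
})

df = df.sort_values(['Callsign', 'UTC'])

# Previous coordinates within each callsign; the first row of each group is NaN.
prev = df.groupby('Callsign')[['lat', 'lon']].shift()

# One vectorized call for all rows; NaN (first row per callsign) becomes 0.
df['dist'] = haversine_miles(prev['lat'], prev['lon'],
                             df['lat'], df['lon']).fillna(0.0)

print(df)
```

Because the shift happens inside groupby, no separate step is needed to zero out the first row of each callsign: those rows have no previous point, so they come out as NaN and are filled with 0.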