Optimization of this Python function

def split_trajectories(df):
    trajectories_list = []
    count = 0
    for record in range(len(df)):
        if record == 0:
            continue
        if df['time'].iloc[record] - df['time'].iloc[record - 1] > pd.Timedelta('0 days 00:00:30'):
            temp_df = reset_index(df[count:record])
            if not temp_df.empty:
                if len(temp_df) > 50:
                    trajectories_list.append(temp_df)
            count = record
    return trajectories_list

This is a Python function that receives a pandas DataFrame and splits it into a list of DataFrames wherever the time delta between consecutive rows is greater than 30 seconds, keeping a piece only if it contains more than 50 records. In my case I need to execute this function thousands of times, and I wonder if anyone can help me optimize it. Thanks in advance!

I tried to optimize it as far as I can.

CodePudding user response:

Maybe you can try the version below. You could keep count if you want to track the number of DataFrames; otherwise the length of trajectories_list gives you that information.

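def split_trajectories(df):
  trajectories_list = []
  # time difference between each row and the previous one (first entry is NaT)
  df_difference = df['time'].diff()
  if not df_difference.empty and df_difference.shape[0] > 50:
    # note: this stores the Series of differences, not a slice of df
    trajectories_list.append(df_difference)
  return trajectories_list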
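
The snippet above computes the time differences but does not yet split the frame. Here is a minimal sketch of how the diff() idea could be carried through, assuming 'time' is a sorted datetime column. The function name split_trajectories_vectorized, the parameters max_gap and min_len, and the sample data are made up for illustration. Note that it also returns the rows after the last gap, which the loop in the question never appends; whether that is wanted is not clear from the question.

```python
import pandas as pd


def split_trajectories_vectorized(df, max_gap=pd.Timedelta(seconds=30), min_len=50):
    """Split df wherever consecutive 'time' values are more than max_gap apart."""
    # True on every row that starts a new segment (the first diff is NaT, so False)
    starts_new = df['time'].diff() > max_gap
    # A running sum of the flags gives each row a segment label: 0, 1, 2, ...
    segment_id = starts_new.cumsum().to_numpy()
    return [
        seg.reset_index(drop=True)
        for _, seg in df.groupby(segment_id)
        if len(seg) > min_len  # keep only segments with more than min_len rows
    ]


# Hypothetical sample: runs of 60, 10 and 55 rows separated by long gaps
secs = list(range(60)) + list(range(200, 210)) + list(range(400, 455))
df = pd.DataFrame({'time': pd.Timestamp('2023-01-01') + pd.to_timedelta(secs, unit='s')})
parts = split_trajectories_vectorized(df)
print([len(p) for p in parts])  # [60, 55]
```

This makes one pass over the column inside pandas rather than a Python-level loop, which is usually where the time goes when a function like this is called thousands of times.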

CodePudding user response:

You're doing a few things right here, like iterating using a range instead of .iterrows and using .iloc.

Simple things you can do:

Switch .iloc to .iat, since you only ever read a single scalar from 'time' at a time
Don't rebuild the Timedelta on every iteration; create it once and save it in a variable

The big thing is that you don't need to actually save each temp_df, or even create it. You can save a tuple (count, record) and retrieve the rows from df afterwards, as needed. (Although I have to admit I don't quite understand what you're accomplishing with the count:record logic; should one of your .iloc calls involve count instead?)

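def split_trajectories(df):
  trajectories_list = []
  td = pd.Timedelta('0 days 00:00:30')  # build the threshold once
  times = df['time']  # look the column up once
  count = 0  # start position of the current segment
  for record in range(1, len(df)):
    if times.iat[record] - times.iat[record - 1] > td:
      if record - count > 50:
        # store only the slice bounds; the rows are df.iloc[count:record]
        trajectories_list.append((count, record))
      count = record  # the next segment starts at the gap, as in the question
  return trajectories_list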
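
To get the actual rows back from the stored (count, record) tuples, positional slicing with .iloc should be enough. A minimal sketch, assuming each tuple holds half-open (start, end) row positions as in the loop above. The helper name get_trajectory and the sample data are made up for illustration.

```python
import pandas as pd


def get_trajectory(df, bounds):
    """Return the rows for one (start, end) pair of row positions, end exclusive."""
    start, end = bounds
    # .iloc slices by position, so this works whatever the index labels are
    return df.iloc[start:end].reset_index(drop=True)


# Hypothetical sample: six rows, with bounds in the shape split_trajectories returns
df = pd.DataFrame({
    'time': pd.Timestamp('2023-01-01') + pd.to_timedelta([0, 1, 2, 100, 101, 102], unit='s'),
    'value': range(6),
})
bounds_list = [(0, 3)]
first = get_trajectory(df, bounds_list[0])
```

Because the slice is taken only when a trajectory is actually needed, the splitting loop itself never creates or copies a DataFrame.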