I have a dataframe with ~40k lines and 52 columns. It describes locations of interest, with some overlap.
I am trying to filter out locations that overlap, so that I only keep the last entry.
An example dataset is:
The expected output is:
The first Track A entry is to be excluded.
My code is below:
df_length = len(df.axes[0])
duplicate_flag = False
i = 0
trim_value = 0.005
data = []
for j in range(df_length - 1):
    # compare row i against every row below it
    for k in range(df_length - 1):
        if k >= i:
            if (df.at[i, 'Track'] == df.at[k + 1, 'Track'] and
                    df.at[i, 'Asset Name'] == df.at[k + 1, 'Asset Name'] and
                    df.at[i, 'Type'] == df.at[k + 1, 'Type'] and
                    df.at[i, 'Location From'] >= (df.at[k + 1, 'Location From'] - trim_value) and
                    df.at[i, 'Location From'] <= (df.at[k + 1, 'Location From'] + trim_value) and
                    df.at[i, 'Location To'] >= (df.at[k + 1, 'Location To'] - trim_value) and
                    df.at[i, 'Location To'] <= (df.at[k + 1, 'Location To'] + trim_value)):
                duplicate_flag = True
    # keep row i only if no later row matched it
    if duplicate_flag == False:
        data.append(df.iloc[i])
    i += 1
    duplicate_flag = False
data.append(df.iloc[-1])  # the last row can never have a later duplicate
df2 = pd.DataFrame(data)
For each row of the dataframe, I compare it to each row below it. If the Track, Asset Name and Type are all equal, and the Location From and To values are within + or - the trim value, I set a duplicate flag to true.
After comparing to the rows below, if there was no duplicate found, I append that row of the original dataframe to a list, which is later converted into another dataframe.
I then increment the counter i, and reset the duplicate_flag variable to False.
My issue is that this takes an extremely long time to run, even with only ~40k rows. I learnt that appending to a list rather than to a dataframe is much more efficient, but I'm unsure what else is causing the bottleneck.
My intuition suggests it is the conditional if statements, though I'm unsure what would be an improvement.
Any suggestions appreciated.
Thanks.
CodePudding user response:
import pandas as pd

data = {'Track': ['D', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Asset Name': ['Fred', 'James', 'Albert', 'Sam', 'James', 'Albert', 'Sam'],
        'Type': ['X1', 'C2', 'S3', 'H4', 'C2', 'S3', 'H4'],
        'Location From': [999, 1, 12, 24, 1.001, 35, 42],
        'Location To': [1999, 1.105, 12.5, 25, 1.106, 36, 43]}
df = pd.DataFrame(data=data)

Tolerance = 0.02

# Snap the two location columns onto a grid of width `Tolerance`,
# then mark every row whose rounded locations reappear later on.
rounded = df.iloc[:, 3:5].mul(1 / Tolerance).round()
duplicated = rounded.duplicated(keep='last')

Duplicate_Index = duplicated[duplicated].index
df = df.drop(Duplicate_Index)
print(df)
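Note that this keys only on the two location columns, so two rows with the same (rounded) locations but a different Track, Asset Name or Type would also be treated as duplicates. If those columns have to match as well, as in the question, the same rounding idea can be extended; a minimal, untested sketch ('df_filtered' is just an illustrative name):

# Sketch: also require Track / Asset Name / Type to match before a row
# counts as a duplicate.
keys = df[['Track', 'Asset Name', 'Type']].copy()
keys[['Location From', 'Location To']] = df[['Location From', 'Location To']].div(Tolerance).round()
df_filtered = df[~keys.duplicated(keep='last')]
print(df_filtered)

Keep in mind the rounding turns the ± tolerance into fixed-width bins, so two values that straddle a bin boundary can land in different bins and escape the duplicate check.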
CodePudding user response:
I ended up experimenting with a list of dictionaries instead, and the performance for this task is orders of magnitude quicker.
dict_list = df.to_dict('records')
dict_length = len(dict_list)
duplicate_flag = False
i = 0
trim_value = 0.005
data = []
for j in range(dict_length - 1):
    for k in range(dict_length - 1):
        if k >= i:
            if (dict_list[i]['Running Date'] == dict_list[k + 1]['Running Date'] and
                    dict_list[i]['Category'] == dict_list[k + 1]['Category'] and
                    dict_list[i]['Location From'] >= (dict_list[k + 1]['Location From'] - trim_value) and
                    dict_list[i]['Location From'] <= (dict_list[k + 1]['Location From'] + trim_value) and
                    dict_list[i]['Location To'] >= (dict_list[k + 1]['Location To'] - trim_value) and
                    dict_list[i]['Location To'] <= (dict_list[k + 1]['Location To'] + trim_value)):
                duplicate_flag = True
    # keep record i only if no later record matched it
    if duplicate_flag == False:
        data.append(dict_list[i])
    i += 1
    duplicate_flag = False
data.append(dict_list[-1])
df2 = pd.DataFrame(data)
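For what it's worth, the list-of-dictionaries version mostly wins by avoiding the per-cell df.at lookups; the double loop is still O(n²) in the number of rows. If it ever becomes too slow again, a single hash-keyed pass that keeps the last record per key might be worth trying. This is only a sketch, under the assumption that snapping the locations to a trim_value-sized grid is an acceptable stand-in for the exact ± comparison:

# Sketch: one O(n) pass, keeping the last record for each rounded key.
def dedup_keep_last(records, trim_value=0.005):
    last_by_key = {}
    for row in records:
        key = (row['Running Date'],
               row['Category'],
               round(row['Location From'] / trim_value),
               round(row['Location To'] / trim_value))
        last_by_key.pop(key, None)   # drop any earlier match so the
        last_by_key[key] = row       # surviving row keeps its later position
    return list(last_by_key.values())

df2 = pd.DataFrame(dedup_keep_last(dict_list, trim_value))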