I have a dataframe `df` like below:
import pandas as pd
import numpy as np

data = {'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
        'Init_Time': ['2022-02-16 14:00:31', '2022-02-16 14:03:15', '2022-02-16 14:05:26',
                      '2022-02-16 14:06:23', '2022-02-16 14:10:00', '2022-02-16 14:12:36',
                      '2022-02-16 14:14:11', '2022-02-17 07:07:25', '2022-02-17 15:08:35',
                      '2022-02-17 15:09:46'],
        'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
        '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4]}
df = pd.DataFrame(data)
df['Init_Time'] = pd.to_datetime(df['Init_Time'])
print(df)
Name Init_Time Category_flag 10min_window_group
0 XYZ 2022-02-16 14:00:31 1 1
1 XYZ 2022-02-16 14:03:15 1 1
2 XYZ 2022-02-16 14:05:26 0 1
3 XYZ 2022-02-16 14:06:23 0 1
4 PQR 2022-02-16 14:10:00 1 1
5 XYZ 2022-02-16 14:12:36 0 2
6 XYZ 2022-02-16 14:14:11 1 2
7 ABC 2022-02-17 07:07:25 1 3
8 XYZ 2022-02-17 15:08:35 0 4
9 ABC 2022-02-17 15:09:46 0 4
I'm assigning a duplicate flag (`Duplicates_flag`, 1/0) to each name in column `Name` that falls within a 10-minute window interval for each category flag, by filtering on:
- the column `Name` first (XYZ, PQR, ...),
- the column `Category_flag` second (1/0),
- the column `10min_window_group` third (1/2/3/4).
For instance, to find duplicates of XYZ in the first 10-minute interval of category 1, we first filter `Name` for XYZ among its unique names, then filter for the `Category_flag` we want to find duplicates in (here 1), and finally filter on the 10-minute window group value (here 1).
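As a concrete illustration, that single combination corresponds to this boolean filter on the sample `df` (a minimal sketch; the values are taken from the data above):

# Filter one combination: Name == 'XYZ', Category_flag == 1, window group 1.
mask = (df['Name'] == 'XYZ') & (df['Category_flag'] == 1) & (df['10min_window_group'] == 1)
print(df[mask])  # Rows 0 and 1: the first is the original (flag 1), the second a duplicate (flag 0).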
To achieve this goal, I have used three nested `for` loops, which work well in most cases. The issue, however, is that they consume a lot of computational time when the number of data points is very large (say 2 million rows), since the code has to iterate through all three loops:
for name in df['Name'].unique().tolist():  # Iterate over unique names of column `Name`.
    df1 = df[df['Name'] == name]
    for category in df1['Category_flag'].unique().tolist():  # Iterate over unique category flag values.
        df2 = df1[df1['Category_flag'] == category]
        for group in df['10min_window_group'].unique().tolist():  # Iterate over unique window interval values.
            df3 = df2[df2['10min_window_group'] == group].copy()  # Copy so the assignment below does not warn.
            if len(df3) > 0:  # Skip (name, category, group) combinations with no rows.
                df3['Duplicates_flag'] = np.where(df3['Name'].duplicated(), 0, 1)  # First occurrence -> 1, duplicates -> 0.
                df3_indices = df3['Duplicates_flag'].index  # Fetch the indices of the subset.
                df3_values = df3['Duplicates_flag'].values  # Fetch the computed flag values.
                df.loc[df3_indices, 'Duplicates_flag'] = df3_values  # Write the flags back into the main `df`.
print(df)
Name Init_Time Category_flag 10min_window_group Duplicates_flag
0 XYZ 2022-02-16 14:00:31 1 1 1.0
1 XYZ 2022-02-16 14:03:15 1 1 0.0
2 XYZ 2022-02-16 14:05:26 0 1 1.0
3 XYZ 2022-02-16 14:06:23 0 1 0.0
4 PQR 2022-02-16 14:10:00 1 1 1.0
5 XYZ 2022-02-16 14:12:36 0 2 1.0
6 XYZ 2022-02-16 14:14:11 1 2 1.0
7 ABC 2022-02-17 07:07:25 1 3 1.0
8 XYZ 2022-02-17 15:08:35 0 4 1.0
9 ABC 2022-02-17 15:09:46 0 4 1.0
So, is there a way I can optimize the code by reducing or replacing the three `for` loops? The primary aim is to cut the computation time and make the code more efficient while producing the same output as above.
CodePudding user response:
IIUC you can use `transform` after `groupby`:
df['Duplicates_flag'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])['10min_window_group']
      .transform(lambda x: (~x.duplicated()) * 1)  # First row of each group -> 1, later rows -> 0.
)
Output:
Name Init_Time Category_flag 10min_window_group Duplicates_flag
0 XYZ 2022-02-16 14:00:31 1 1 1
1 XYZ 2022-02-16 14:03:15 1 1 0
2 XYZ 2022-02-16 14:05:26 0 1 1
3 XYZ 2022-02-16 14:06:23 0 1 0
4 PQR 2022-02-16 14:10:00 1 1 1
5 XYZ 2022-02-16 14:12:36 0 2 1
6 XYZ 2022-02-16 14:14:11 1 2 1
7 ABC 2022-02-17 07:07:25 1 3 1
8 XYZ 2022-02-17 15:08:35 0 4 1
9 ABC 2022-02-17 15:09:46 0 4 1
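Note that the flag is simply "is this the first row of its (Name, Category_flag, 10min_window_group) combination", so an equivalent variant (my sketch, not part of the answer above) can skip the `groupby` and the Python-level lambda entirely by calling `duplicated` with a column subset; avoiding the per-group lambda usually helps on millions of rows:

# Equivalent fully vectorized variant: 1 for the first row of each
# (Name, Category_flag, 10min_window_group) combination, 0 for later duplicates.
df['Duplicates_flag'] = (~df.duplicated(subset=['Name', 'Category_flag', '10min_window_group'])).astype(int)
print(df)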