I have a dataframe `df` like below:
import pandas as pd
import numpy as np

data = {'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
        'Init_Time': ['2022-02-16 14:00:31', '2022-02-16 14:03:15', '2022-02-16 14:05:26',
                      '2022-02-16 14:06:23', '2022-02-16 14:10:00', '2022-02-16 14:12:36',
                      '2022-02-16 14:14:11', '2022-02-17 07:07:25', '2022-02-17 15:08:35',
                      '2022-02-17 15:09:46'],
        'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
        '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4]}
df = pd.DataFrame(data)
df['Init_Time'] = pd.to_datetime(df['Init_Time'])
print(df)
Name Init_Time Category_flag 10min_window_group
0 XYZ 2022-02-16 14:00:31 1 1
1 XYZ 2022-02-16 14:03:15 1 1
2 XYZ 2022-02-16 14:05:26 0 1
3 XYZ 2022-02-16 14:06:23 0 1
4 PQR 2022-02-16 14:10:00 1 1
5 XYZ 2022-02-16 14:12:36 0 2
6 XYZ 2022-02-16 14:14:11 1 2
7 ABC 2022-02-17 07:07:25 1 3
8 XYZ 2022-02-17 15:08:35 0 4
9 ABC 2022-02-17 15:09:46 0 4
I'm assigning a duplicate flag (`Duplicates_flag`, 1/0) to each name in column `Name` that falls within a 10-minute window interval for each category flag, by filtering on:
- the column `Name` first (XYZ, PQR, ...),
- the column `Category_flag` second (1/0),
- the column `10min_window_group` third (1/2/3/4).
For instance, to find duplicates of XYZ in the first 10-minute interval of category 1, we first filter `Name` for XYZ among its unique names, then filter for the `Category_flag` we want to find duplicates in (here 1), and finally filter on the 10-minute window group value (here 1).
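As a concrete illustration, that single combination corresponds to this boolean filter on the sample `df` (a minimal sketch; the values are taken from the data above):

# Filter one combination: Name == 'XYZ', Category_flag == 1, window group 1.
mask = (df['Name'] == 'XYZ') & (df['Category_flag'] == 1) & (df['10min_window_group'] == 1)
print(df[mask])  # Rows 0 and 1: the first is the original (flag 1), the second a duplicate (flag 0).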
To achieve this goal, I have used three nested `for` loops, which work well in most cases. The issue, however, is that they consume a lot of computational time when the number of data points is very large (say 2 million rows), since the code has to iterate through all three loops:
for name in df['Name'].unique().tolist():  # Iterate over unique names of column `Name`.
    df1 = df[df['Name'] == name]
    for category in df1['Category_flag'].unique().tolist():  # Iterate over unique category flag values.
        df2 = df1[df1['Category_flag'] == category]
        for group in df['10min_window_group'].unique().tolist():  # Iterate over unique window interval values.
            df3 = df2[df2['10min_window_group'] == group].copy()  # Copy so the assignment below does not warn.
            if len(df3) > 0:  # Skip (name, category, group) combinations with no rows.
                df3['Duplicates_flag'] = np.where(df3['Name'].duplicated(), 0, 1)  # First occurrence -> 1, duplicates -> 0.
                df3_indices = df3['Duplicates_flag'].index  # Fetch the indices of the subset.
                df3_values = df3['Duplicates_flag'].values  # Fetch the computed flag values.
                df.loc[df3_indices, 'Duplicates_flag'] = df3_values  # Write the flags back into the main `df`.
print(df)
Name Init_Time Category_flag 10min_window_group Duplicates_flag
0 XYZ 2022-02-16 14:00:31 1 1 1.0
1 XYZ 2022-02-16 14:03:15 1 1 0.0
2 XYZ 2022-02-16 14:05:26 0 1 1.0
3 XYZ 2022-02-16 14:06:23 0 1 0.0
4 PQR 2022-02-16 14:10:00 1 1 1.0
5 XYZ 2022-02-16 14:12:36 0 2 1.0
6 XYZ 2022-02-16 14:14:11 1 2 1.0
7 ABC 2022-02-17 07:07:25 1 3 1.0
8 XYZ 2022-02-17 15:08:35 0 4 1.0
9 ABC 2022-02-17 15:09:46 0 4 1.0
So, is there a way I can optimize the code by reducing or replacing the three `for` loops? The primary aim is to cut the computation time and make the code more efficient while producing the same output as above.
CodePudding user response:
IIUC you can use `transform` after `groupby`:
df['Duplicates_flag'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])['10min_window_group']
      .transform(lambda x: (~x.duplicated()) * 1)  # First row of each group -> 1, later rows -> 0.
)
Output:
Name Init_Time Category_flag 10min_window_group Duplicates_flag
0 XYZ 2022-02-16 14:00:31 1 1 1
1 XYZ 2022-02-16 14:03:15 1 1 0
2 XYZ 2022-02-16 14:05:26 0 1 1
3 XYZ 2022-02-16 14:06:23 0 1 0
4 PQR 2022-02-16 14:10:00 1 1 1
5 XYZ 2022-02-16 14:12:36 0 2 1
6 XYZ 2022-02-16 14:14:11 1 2 1
7 ABC 2022-02-17 07:07:25 1 3 1
8 XYZ 2022-02-17 15:08:35 0 4 1
9 ABC 2022-02-17 15:09:46 0 4 1
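Note that the flag is simply "is this the first row of its (Name, Category_flag, 10min_window_group) combination", so an equivalent variant (my sketch, not part of the answer above) can skip the `groupby` and the Python-level lambda entirely by calling `duplicated` with a column subset; avoiding the per-group lambda usually helps on millions of rows:

# Equivalent fully vectorized variant: 1 for the first row of each
# (Name, Category_flag, 10min_window_group) combination, 0 for later duplicates.
df['Duplicates_flag'] = (~df.duplicated(subset=['Name', 'Category_flag', '10min_window_group'])).astype(int)
print(df)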