(Python) Iterating over dataframe doesn't allow me to drop rows-CodePudding

My goal is to cut off (not delete entirely) Eventlogs if the count of activities within a single Case exceeds a certain threshold.

This is the initial df:

  CaseID  ActivityID
0   Case1          11
1   Case1           5
2   Case1           2
3   Case1          23
4   Case1          86
5   Case1          27
7   Case2           0
8   Case2         256
9   Case2           5
10  Case2           7
11  Case3          23
12  Case3         556
13  Case3           4
14  Case3           2
15  Case3          33
16  Case3           5
17  Case3          67
18  Case3          32
19  Case3           5
20  Case3          66

And this my desired outcome if the threshold was set to be 5:

  CaseID  ActivityID
0   Case1          11
1   Case1           5
2   Case1           2
3   Case1          23
4   Case1          86
7   Case2           0
8   Case2         256
9   Case2           5
10  Case2           7
11  Case3          23
12  Case3         556
13  Case3           4
14  Case3           2
15  Case3          33

I came up with this code:

threshold = 5
counter = 0

for i in range(1, len(df)):
    if (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter < threshold):
        counter  = 1
    elif (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter >= threshold):
        df.drop(index=i, inplace=True) # <- that's the problematic line
    elif (df.loc[i, 'CaseID'] != df.loc[i-1, 'CaseID']) & (counter >= threshold):
        counter = 0

If I insert print statements in the df.drop line, the code seems to work exactly as it should. But somehow it doesn't work as soon as soon as I drop rows and either keeps the dataframe as it is or throws KeyError:6.

Looking forward to your help, thank you in advance!

CodePudding user response：

I think groupby.cumcount (cumulative count, indexed from zero) does what you need.

Number each row by its cumulative position within its CaseID group, starting from 0
Filter the DataFrame to keep only rows where cumulative count is between 0 and 4 inclusive

res = df[df.groupby('CaseID').cumcount() <= 4].copy()

print(res)

   CaseID  ActivityID
0   Case1  11        
1   Case1  5         
2   Case1  2         
3   Case1  23        
4   Case1  86        
7   Case2  0         
8   Case2  256       
9   Case2  5         
10  Case2  7         
11  Case3  23        
12  Case3  556       
13  Case3  4         
14  Case3  2         
15  Case3  33