My goal is to cut off (not delete entirely) Eventlogs if the count of activities within a single Case exceeds a certain threshold.
This is the initial df:
CaseID ActivityID
0 Case1 11
1 Case1 5
2 Case1 2
3 Case1 23
4 Case1 86
5 Case1 27
7 Case2 0
8 Case2 256
9 Case2 5
10 Case2 7
11 Case3 23
12 Case3 556
13 Case3 4
14 Case3 2
15 Case3 33
16 Case3 5
17 Case3 67
18 Case3 32
19 Case3 5
20 Case3 66
And this my desired outcome if the threshold was set to be 5:
CaseID ActivityID
0 Case1 11
1 Case1 5
2 Case1 2
3 Case1 23
4 Case1 86
7 Case2 0
8 Case2 256
9 Case2 5
10 Case2 7
11 Case3 23
12 Case3 556
13 Case3 4
14 Case3 2
15 Case3 33
I came up with this code:
threshold = 5
counter = 0
for i in range(1, len(df)):
if (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter < threshold):
counter = 1
elif (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter >= threshold):
df.drop(index=i, inplace=True) # <- that's the problematic line
elif (df.loc[i, 'CaseID'] != df.loc[i-1, 'CaseID']) & (counter >= threshold):
counter = 0
If I insert print statements in the df.drop line, the code seems to work exactly as it should. But somehow it doesn't work as soon as soon as I drop rows and either keeps the dataframe as it is or throws KeyError:6.
Looking forward to your help, thank you in advance!
CodePudding user response:
I think groupby.cumcount
(cumulative count, indexed from zero) does what you need.
- Number each row by its cumulative position within its
CaseID
group, starting from 0 - Filter the DataFrame to keep only rows where cumulative count is between 0 and 4 inclusive
res = df[df.groupby('CaseID').cumcount() <= 4].copy()
print(res)
CaseID ActivityID
0 Case1 11
1 Case1 5
2 Case1 2
3 Case1 23
4 Case1 86
7 Case2 0
8 Case2 256
9 Case2 5
10 Case2 7
11 Case3 23
12 Case3 556
13 Case3 4
14 Case3 2
15 Case3 33