Sample Input:
GAME_ID TIME PER EVENT
0 0022000394 12:00 1 12
1 0022000394 12:00 1 10
2 0022000394 11:36 1 1
3 0022000394 11:24 1 1
4 0022000394 11:04 1 1
5 0022000394 10:41 1 1
6 0022000394 10:30 1 2
7 0022000394 10:29 1 4
8 0022000394 10:17 1 1
9 0022000394 10:01 1 1
10 0022000394 9:48 1 2
11 0022000394 9:46 1 4
12 0022000394 9:42 1 6
13 0022000394 9:42 1 3
14 0022000394 9:42 1 3
15 0022000394 9:25 1 1
16 0022000394 9:15 1 1
17 0022000394 9:15 1 6
18 0022000394 9:15 1 3
19 0022000394 8:53 1 1
20 0022000394 8:33 1 1
21 0022000394 8:22 1 1
22 0022000394 8:16 1 2
23 0022000394 8:16 1 4
24 0022000394 8:12 1 2
I have a dataframe where I get a group of rows using groupby
.
If that group contains 3 rows where the column EVENTMSGTYPE
contains all of [1, 6, & 3]
I want to update the row in the original dataframe where EVENTMSGTYPE == 1
Current Working Solution (Slow)
# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):
# If EVENT in group contains 1, 6, and 3 then update original df
if all(x in list(data.EVENT) for x in [1, 6, 3]):
# Update original df row where EVENT equals 1, should only have one value
index = data[data.EVENT == 1].index.values[0]
# Set UPDATED to True
df.at[index, 'UPDATED'] = True
Expected Output:
GAME_ID TIME PER EVENT UPDATED
...
16 0022000394 9:15 1 1 True
...
My dataframe has 1,694,389 rows and this takes ~53 seconds to run on my machine, can the performance of this be improved?
CodePudding user response:
df['UPDATED'] = df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].filter(lambda x: set(x) >= {1,3,6}, dropna=False).eq(1)
Output:
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False
Stole the set logic from sammywemmy~
CodePudding user response:
idx_cols = ['GAME_ID', 'TIME', 'PER']
df = df.set_index(idx_cols)
cond1 = (
df.groupby(level=idx_cols)['EVENT']
.agg(lambda event_group: all(x in event_group for x in [1, 6, 3]))
.reindex_like(df)
)
cond2 = df['EVENT'].eq(1)
df['UPDATED'] = cond1 & cond2
df = df.reset_index()
print(df)
Output:
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False
CodePudding user response:
One option is using set
with transform; speed wise I expect Bert2ME's solution to be faster:
df.assign(UPDATED = df.groupby(grouper)
.EVENT
.transform(lambda x: set(x) >= {1,3,6})
& df.EVENT.eq(1))
GAME_ID TIME PER EVENT UPDATED
0 22000394 12:00 1 12 False
1 22000394 12:00 1 10 False
2 22000394 11:36 1 1 False
3 22000394 11:24 1 1 False
4 22000394 11:04 1 1 False
5 22000394 10:41 1 1 False
6 22000394 10:30 1 2 False
7 22000394 10:29 1 4 False
8 22000394 10:17 1 1 False
9 22000394 10:01 1 1 False
10 22000394 9:48 1 2 False
11 22000394 9:46 1 4 False
12 22000394 9:42 1 6 False
13 22000394 9:42 1 3 False
14 22000394 9:42 1 3 False
15 22000394 9:25 1 1 False
16 22000394 9:15 1 1 True
17 22000394 9:15 1 6 False
18 22000394 9:15 1 3 False
19 22000394 8:53 1 1 False
20 22000394 8:33 1 1 False
21 22000394 8:22 1 1 False
22 22000394 8:16 1 2 False
23 22000394 8:16 1 4 False
24 22000394 8:12 1 2 False