Home > Software engineering >  Faster way to update dataframe rows after group by
Faster way to update dataframe rows after group by

Time:05-30

Sample Input:


    GAME_ID     TIME    PER EVENT
0   0022000394  12:00   1   12
1   0022000394  12:00   1   10
2   0022000394  11:36   1   1
3   0022000394  11:24   1   1
4   0022000394  11:04   1   1
5   0022000394  10:41   1   1
6   0022000394  10:30   1   2
7   0022000394  10:29   1   4
8   0022000394  10:17   1   1
9   0022000394  10:01   1   1
10  0022000394  9:48    1   2
11  0022000394  9:46    1   4
12  0022000394  9:42    1   6
13  0022000394  9:42    1   3
14  0022000394  9:42    1   3
15  0022000394  9:25    1   1
16  0022000394  9:15    1   1
17  0022000394  9:15    1   6
18  0022000394  9:15    1   3
19  0022000394  8:53    1   1
20  0022000394  8:33    1   1
21  0022000394  8:22    1   1
22  0022000394  8:16    1   2
23  0022000394  8:16    1   4
24  0022000394  8:12    1   2

I have a dataframe where I get a group of rows using groupby.

If that group contains 3 rows where the column EVENTMSGTYPE contains all of [1, 6, & 3] I want to update the row in the original dataframe where EVENTMSGTYPE == 1

Current Working Solution (Slow)

# Group by
for _, data in df.groupby(['GAME_ID', 'TIME', 'PER']):

    # If EVENT in group contains 1, 6, and 3 then update original df
    if all(x in list(data.EVENT) for x in [1, 6, 3]):

        # Update original df row where EVENT equals 1, should only have one value
        index = data[data.EVENT == 1].index.values[0]

        # Set UPDATED to True
        df.at[index, 'UPDATED'] = True

Expected Output:

    GAME_ID     TIME    PER EVENT UPDATED
...
16  0022000394  9:15    1   1     True
...

My dataframe has 1,694,389 rows and this takes ~53 seconds to run on my machine, can the performance of this be improved?

CodePudding user response:

df['UPDATED'] = df.groupby(['GAME_ID', 'TIME', 'PER'])['EVENT'].filter(lambda x: set(x) >= {1,3,6}, dropna=False).eq(1)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False

Stole the set logic from sammywemmy~

CodePudding user response:

idx_cols = ['GAME_ID', 'TIME', 'PER']

df = df.set_index(idx_cols)

cond1 = (
    df.groupby(level=idx_cols)['EVENT']
      .agg(lambda event_group: all(x in event_group for x in [1, 6, 3])) 
      .reindex_like(df)
)

cond2 = df['EVENT'].eq(1)

df['UPDATED'] = cond1 & cond2

df = df.reset_index()
print(df)

Output:

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False

CodePudding user response:

One option is using set with transform; speed wise I expect Bert2ME's solution to be faster:

df.assign(UPDATED = df.groupby(grouper)
                      .EVENT
                      .transform(lambda x: set(x) >= {1,3,6}) 
                                           & df.EVENT.eq(1))

     GAME_ID   TIME  PER  EVENT  UPDATED
0   22000394  12:00    1     12    False
1   22000394  12:00    1     10    False
2   22000394  11:36    1      1    False
3   22000394  11:24    1      1    False
4   22000394  11:04    1      1    False
5   22000394  10:41    1      1    False
6   22000394  10:30    1      2    False
7   22000394  10:29    1      4    False
8   22000394  10:17    1      1    False
9   22000394  10:01    1      1    False
10  22000394   9:48    1      2    False
11  22000394   9:46    1      4    False
12  22000394   9:42    1      6    False
13  22000394   9:42    1      3    False
14  22000394   9:42    1      3    False
15  22000394   9:25    1      1    False
16  22000394   9:15    1      1     True
17  22000394   9:15    1      6    False
18  22000394   9:15    1      3    False
19  22000394   8:53    1      1    False
20  22000394   8:33    1      1    False
21  22000394   8:22    1      1    False
22  22000394   8:16    1      2    False
23  22000394   8:16    1      4    False
24  22000394   8:12    1      2    False
  • Related