Home > other >  Pandas Cumulative Sum Groupby
Pandas Cumulative Sum Groupby

Time:04-08

This may be hard to understand for someone not familiar with cricket!

I have a pandas dataframe that gives me the history of a cricket game, ball by ball. Below is a simplified and considerably shorter version. It has columns for the over being bowled, as well as the ball in that over. It also has columns for the batter facing that ball, the score they got on that ball and whether they got out or not.

Over Ball Batter Score Out
0 1 1 Ponting 0 0
1 1 2 Ponting 1 0
2 1 3 McCullum 4 0
3 1 4 McCullum 2 0
4 1 5 McCullum 1 0
5 1 6 Ponting 0 1
6 2 1 McCullum 3 0
7 2 2 Ganguly 2 0
8 2 3 Ganguly 0 0
9 2 4 Ganguly 3 0
10 2 5 McCullum 1 0
11 2 6 Ganguly 0 0

What I want to do is have another couple of columns that show the cumulative score for each batter after that ball. It would look like this:

Over Ball Batter Score Out Ponting McCullum Ganguly
0 1 1 Ponting 0 0 0 0 0
1 1 2 Ponting 1 0 1 0 0
2 1 3 McCullum 4 0 1 4 0
3 1 4 McCullum 2 0 1 6 0
4 1 5 McCullum 1 0 1 7 0
5 1 6 Ponting 0 1 1 7 0
6 2 1 McCullum 3 0 1 10 0
7 2 2 Ganguly 2 0 1 10 2
8 2 3 Ganguly 0 0 1 10 2
9 2 4 Ganguly 3 0 1 10 5
10 2 5 McCullum 1 0 1 11 5
11 2 6 Ganguly 0 0 1 11 5

Obviously this could be achieved by just iterating through the rows and updating the score for each player after every ball, but this would not be the most efficient method. I have tried things like df.groupby(['Ball','Batter']).cumsum() but this is not getting me what I want. My feeling is that it is probably going to some combination of a pivot and groupby.

CodePudding user response:

IIUC:

dummies = pd.get_dummies(df.Batter).eq(1)
sums = df.groupby(['Batter']).Score.cumsum()

dummies.mul(sums, axis=0).where(dummies).pad().fillna(0, downcast='infer')

    Ganguly  McCullum  Ponting
0         0         0        0
1         0         0        1
2         0         4        1
3         0         6        1
4         0         7        1
5         0         7        1
6         0        10        1
7         2        10        1
8         2        10        1
9         5        10        1
10        5        11        1
11        5        11        1

And subsequently

df.join(
    dummies.mul(sums, axis=0).where(dummies).pad().fillna(0, downcast='infer')
)

    Over  Ball    Batter  Score  Out  Ganguly  McCullum  Ponting
0      1     1   Ponting      0    0        0         0        0
1      1     2   Ponting      1    0        0         0        1
2      1     3  McCullum      4    0        0         4        1
3      1     4  McCullum      2    0        0         6        1
4      1     5  McCullum      1    0        0         7        1
5      1     6   Ponting      0    1        0         7        1
6      2     1  McCullum      3    0        0        10        1
7      2     2   Ganguly      2    0        2        10        1
8      2     3   Ganguly      0    0        2        10        1
9      2     4   Ganguly      3    0        5        10        1
10     2     5  McCullum      1    0        5        11        1
11     2     6   Ganguly      0    0        5        11        1
  • Related