Let's say I have a df
pd.DataFrame(
{'name':['pam','pam','bob','bob','pam','bob','pam','bob'],
'game_id':[0,0,1,1,0,2,1,2]
}
)
name game_id
0 pam 0
1 pam 0
2 bob 1
3 bob 1
4 pam 0
5 bob 2
6 pam 1
7 bob 2
I want to calculate how many games bob and amy have appeared in cumulatively. However, when I use .groupby()
and .cumcount() 1
, I get something different. I get a cumulative count within each game_id
:
df['games'] = df.groupby(['name','game_id']).cumcount() 1
name game_id games
0 pam 0 1
1 pam 0 2
2 bob 1 1
3 bob 1 2
4 pam 0 3
5 bob 2 1
6 pam 1 1
7 bob 2 2
When what I really want is a one total cumulative count rather than a cumulative count for each unique game_id
. Here's an example of my desired output:
name game_id games
0 pam 0 1
1 pam 0 1
2 bob 1 1
3 bob 1 1
4 pam 0 1
5 bob 2 2
6 pam 1 2
7 bob 2 2
Note, in my actual dataset game_id
is a random sequence of numbers.
CodePudding user response:
Lets try sort df, check consecutive difference, create new group by cumsum and then resort the df
new_df=df.sort_values(by=['name','game_id'])
new_df=new_df.assign(rank=new_df['game_id']!=new_df['game_id'].shift())
new_df=new_df.assign(rank=new_df.groupby('name')['rank'].cumsum()).sort_index()
print(new_df)
name game_id rank
0 pam 0 1
1 pam 0 1
2 bob 1 1
3 bob 1 1
4 pam 0 1
5 bob 2 2
6 pam 1 2
7 bob 2 2