I am trying to find an efficient way to create a dataframe which lists all distinct game values as the columns and then aggregates the rows by user_id for game play hours accordingly? This is my example df:
user_id | game | game_hours | rank_order
1 | Fortnight | 1.5 | 1
1 | COD | 0.5 | 2
1 | Horizon | 1.7 | 3
1 | ... | ... | n
2 | Fifa2021 | 1.9 | 1
2 | A Way Out | 0.2 | 2
2 | ... | ... | n
...
Step 1: How do I get this to this df format (match rank order correctly due to time sequence)?
user_id | game_1 | game_2 | game_3 | game_n ...| game_hours
1 | Fortnight | COD | Horizon| | 3.7
2 | Fifa21 | A Way Out | | | 2.1
...
CodePudding user response:
Use DataFrame.pivot
with DataFrame.add_prefix
and for new column DataFrame.assign
with aggregation sum
:
df = (df.pivot('user_id','rank_order','game')
.add_prefix('game_')
.assign(game_hours=df.groupby('user_id')['game_hours'].sum())
.reset_index()
.rename_axis(None, axis=1))
print (df)
user_id game_1 game_2 game_3 game_hours
0 1 Fortnight COD Horizon 3.7
1 2 Fifa2021 A Way Out NaN 2.1