So I'm trying to create new data in a time series based on past data. For example I have player data here with each row being stats accumulated at a certain age. I want to create new row in the Dataframe where I increment the max age by one and then take the average of the sa
and ga
column from the two years before that.
Here is the data
import pandas as pd
data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
['Adin Hill', 8478499, 21, 129.0, 14.0],
['Adin Hill', 8478499, 22, 322.0, 32.0],
['Adin Hill', 8478499, 23, 343.0, 28.0],
['Adin Hill', 8478499, 24, 530.0, 46.0],
['Adin Hill', 8478499, 25, 237.0, 26.0],
['Al Montoya', 8471219, 24, 120.0, 9.0],
['Al Montoya', 8471219, 26, 585.0, 46.0],
['Al Montoya', 8471219, 27, 832.0, 89.0],
['Al Montoya', 8471219, 28, 168.0, 17.0]]
model_df = pd.DataFrame(data,
columns=['player', 'player_id', 'season_age', 'sa', 'ga'])
For example what I would want to create is ['Al Montoya', 8471219, 29, 500, 53]
(remember the last two values is the average of the sa
and ga
columns from ages 28 and 27).
I've accomplished this using iterrows
and creating a new Dataframe and appending like this:
max_ages = model_df.groupby(['player', 'player_id'])[['season_age']].max().reset_index()
added_ages = []
for player in max_ages.iterrows():
row = [player[1][0],
player[1][1],
player[1][2] 1,
(model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 1)]['sa'].sum()
model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 2)]['sa'].sum())/2,
(model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 1)]['ga'].sum()
model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 2)]['ga'].sum())/2
]
added_ages.append(row)
added_ages_df = pd.DataFrame(added_ages,
columns=['player', 'player_id', 'season_age', 'sa', 'ga'])
model_df = pd.concat([model_df, added_ages_df])
Obviously this is an adhoc solution that is very brittle, my question is if there is a built in way in pandas
of doing this without using iterrows
The expected Dataframe would look like this easier to represent in list form
data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
['Adin Hill', 8478499, 21, 129.0, 14.0],
['Adin Hill', 8478499, 22, 322.0, 32.0],
['Adin Hill', 8478499, 23, 343.0, 28.0],
['Adin Hill', 8478499, 24, 530.0, 46.0],
['Adin Hill', 8478499, 25, 237.0, 26.0],
['Adin Hill', 8478499, 26, 502, 36],
['Al Montoya', 8471219, 24, 120.0, 9.0],
['Al Montoya', 8471219, 26, 585.0, 46.0],
['Al Montoya', 8471219, 27, 832.0, 89.0],
['Al Montoya', 8471219, 28, 168.0, 17.0],
['Al Montoya', 8471219, 29, 500, 53]]
CodePudding user response:
You may try something like below.
df_new = df.shift()
df_new['season_age'] = df['season_age'].max() 1
df_new[['sa','ga']] = df[['sa','ga']].rolling(2).mean()
CodePudding user response:
You can define a function called add_row
and pass it to a groupby. I'll assume that if there aren't two years of data for a player, you'll want sa and ga to be populated with NaN
:
def add_row(x):
last_row = x.iloc[-1]
last_row['season_age'] = last_row['season_age'] 1
if len(x) < 2:
last_row['sa'], last_row['ga'] = float("nan"), float("nan")
return x.append(last_row)
else:
last_row['sa'], last_row['ga'] = x[['sa','ga']].iloc[-2:].mean()
return x.append(last_row)
new_model_df = model_df.groupby("player").apply(add_row).reset_index(drop=True)
Output:
>>> new_model_df
player player_id season_age sa ga
0 Adam Wilcox 8476330 25 14.0 0.0
1 Adam Wilcox 8476330 26 NaN NaN
2 Adin Hill 8478499 21 129.0 14.0
3 Adin Hill 8478499 22 322.0 32.0
4 Adin Hill 8478499 23 343.0 28.0
5 Adin Hill 8478499 24 530.0 46.0
6 Adin Hill 8478499 25 237.0 26.0
7 Adin Hill 8478499 26 383.5 36.0
8 Al Montoya 8471219 24 120.0 9.0
9 Al Montoya 8471219 26 585.0 46.0
10 Al Montoya 8471219 27 832.0 89.0
11 Al Montoya 8471219 28 168.0 17.0
12 Al Montoya 8471219 29 500.0 53.0