Home > Back-end >  Create new rows based on max rows of a column
Create new rows based on max rows of a column

Time:11-27

So I'm trying to create new data in a time series based on past data. For example I have player data here with each row being stats accumulated at a certain age. I want to create new row in the Dataframe where I increment the max age by one and then take the average of the sa and ga column from the two years before that.

Here is the data

import pandas as pd

data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
        ['Adin Hill', 8478499, 21, 129.0, 14.0],
        ['Adin Hill', 8478499, 22, 322.0, 32.0],
        ['Adin Hill', 8478499, 23, 343.0, 28.0],
        ['Adin Hill', 8478499, 24, 530.0, 46.0],
        ['Adin Hill', 8478499, 25, 237.0, 26.0],
        ['Al Montoya', 8471219, 24, 120.0, 9.0],
        ['Al Montoya', 8471219, 26, 585.0, 46.0],
        ['Al Montoya', 8471219, 27, 832.0, 89.0],
        ['Al Montoya', 8471219, 28, 168.0, 17.0]]

model_df = pd.DataFrame(data, 
                         columns=['player', 'player_id', 'season_age', 'sa', 'ga'])

For example what I would want to create is ['Al Montoya', 8471219, 29, 500, 53] (remember the last two values is the average of the sa and ga columns from ages 28 and 27).

I've accomplished this using iterrows and creating a new Dataframe and appending like this:

max_ages = model_df.groupby(['player', 'player_id'])[['season_age']].max().reset_index()
added_ages = []
for player in max_ages.iterrows():

    row = [player[1][0],
           player[1][1],
           player[1][2]   1, 
           (model_df[(model_df['player_id'] == player[1][1]) &
                    (model_df['season_age'] == player[1][2] - 1)]['sa'].sum()  
           model_df[(model_df['player_id'] == player[1][1]) &
                    (model_df['season_age'] == player[1][2] - 2)]['sa'].sum())/2,
           (model_df[(model_df['player_id'] == player[1][1]) &
                    (model_df['season_age'] == player[1][2] - 1)]['ga'].sum()  
           model_df[(model_df['player_id'] == player[1][1]) &
                    (model_df['season_age'] == player[1][2] - 2)]['ga'].sum())/2
          ]
    added_ages.append(row)

added_ages_df = pd.DataFrame(added_ages, 
                             columns=['player', 'player_id', 'season_age', 'sa', 'ga'])
model_df = pd.concat([model_df, added_ages_df])

Obviously this is an adhoc solution that is very brittle, my question is if there is a built in way in pandas of doing this without using iterrows

The expected Dataframe would look like this easier to represent in list form

data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
        ['Adin Hill', 8478499, 21, 129.0, 14.0],
        ['Adin Hill', 8478499, 22, 322.0, 32.0],
        ['Adin Hill', 8478499, 23, 343.0, 28.0],
        ['Adin Hill', 8478499, 24, 530.0, 46.0],
        ['Adin Hill', 8478499, 25, 237.0, 26.0],
        ['Adin Hill', 8478499, 26, 502, 36],
        ['Al Montoya', 8471219, 24, 120.0, 9.0],
        ['Al Montoya', 8471219, 26, 585.0, 46.0],
        ['Al Montoya', 8471219, 27, 832.0, 89.0],
        ['Al Montoya', 8471219, 28, 168.0, 17.0],
        ['Al Montoya', 8471219, 29, 500, 53]]

CodePudding user response:

You may try something like below.

 df_new = df.shift()
 df_new['season_age'] = df['season_age'].max()  1
 df_new[['sa','ga']] = df[['sa','ga']].rolling(2).mean()

CodePudding user response:

You can define a function called add_row and pass it to a groupby. I'll assume that if there aren't two years of data for a player, you'll want sa and ga to be populated with NaN:

def add_row(x):
    last_row = x.iloc[-1]
    last_row['season_age'] = last_row['season_age'] 1
    if len(x) < 2:
        last_row['sa'], last_row['ga'] = float("nan"), float("nan")
        return x.append(last_row)
    else:
        last_row['sa'], last_row['ga'] = x[['sa','ga']].iloc[-2:].mean()
        return x.append(last_row)

new_model_df = model_df.groupby("player").apply(add_row).reset_index(drop=True)

Output:

>>> new_model_df
         player  player_id  season_age     sa    ga
0   Adam Wilcox    8476330          25   14.0   0.0
1   Adam Wilcox    8476330          26    NaN   NaN
2     Adin Hill    8478499          21  129.0  14.0
3     Adin Hill    8478499          22  322.0  32.0
4     Adin Hill    8478499          23  343.0  28.0
5     Adin Hill    8478499          24  530.0  46.0
6     Adin Hill    8478499          25  237.0  26.0
7     Adin Hill    8478499          26  383.5  36.0
8    Al Montoya    8471219          24  120.0   9.0
9    Al Montoya    8471219          26  585.0  46.0
10   Al Montoya    8471219          27  832.0  89.0
11   Al Montoya    8471219          28  168.0  17.0
12   Al Montoya    8471219          29  500.0  53.0
  • Related