Pandas: Column generation on groupby and max-CodePudding

Task is to generate a column avg where avg = 1st value of 'pts' group by 'id' / max value of 'x' group by 'id'. NB: If max of x=0, then division operation will be ignored and avg remains NaN.

Sample df:

   id  pts  x
0   1   75  0
1   1   75  1
2   1   75  2
3   2   48  0
4   2   48  0
5   3   53  0
6   3   53  2
7   3   53  3
8   3   53  3
9   4   6   1

Expected df:

   id  pts  x   avg
0   1   75  0   37.50
1   1   75  1   37.50
2   1   75  2   37.50
3   2   48  0   NaN
4   2   48  0   NaN
5   3   53  0   17.66
6   3   53  2   17.66
7   3   53  3   17.66
8   3   53  3   17.66
9   4   6   1   6.00

I tried:

df['avg'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] / x['x'].max()))

CodePudding user response：

Just replace 0s with NaN before you divide.

Change

x['x'].max()

x['x'].replace(0, np.nan).max()

Like this:

df['avg'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] / x['x'].replace(0, np.nan).max()))

Output:

>>> df
   id  pts  x        avg
0   1   75  0  37.500000
1   1   75  1  37.500000
2   1   75  2  37.500000
3   2   48  0        NaN
4   2   48  0        NaN
5   3   53  0  17.666667
6   3   53  2  17.666667
7   3   53  3  17.666667
8   3   53  3  17.666667
9   4    6  1   6.000000

CodePudding user response：

We can use a lazy groupby with groupby transform 'first' for pts and 'max' for x then divide and replace inf with nan:

# Save to reuse grouper.
g = df.groupby('id')
# first 'pts' value divided by max 'x' value
df['avg'] = (
        g['pts'].transform('first') / g['x'].transform('max')
).replace(np.inf, np.nan)  # replace inf with nan

df:

   id  pts  x        avg
0   1   75  0  37.500000
1   1   75  1  37.500000
2   1   75  2  37.500000
3   2   48  0        NaN
4   2   48  0        NaN
5   3   53  0  17.666667
6   3   53  2  17.666667
7   3   53  3  17.666667
8   3   53  3  17.666667
9   4    6  1   6.000000

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    'pts': [75, 75, 75, 48, 48, 53, 53, 53, 53, 6],
    'x': [0, 1, 2, 0, 0, 0, 2, 3, 3, 1]
})

CodePudding user response：

To handle the issue explicitly:

import numpy as np

def get_avg(x):
    avg = x['pts'].iloc[0] / x['x'].max()
    if np.isinf(avg):
        return np.nan
    else:
        return avg

df['id'].map(df.groupby("id").apply(get_avg))

CodePudding user response：

You can update your previous solution to below.

df['avg'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] / x['x'].max() if x['x'].max()!=0 else None))