What is an elegant way of groupby & transform, when your function accesses multiple columns of the sub-dataframe?


(Versions: Python 3.10.4, Pandas 1.4.3, NumPy 1.23.1)

I have this dataframe:

df = pd.DataFrame({
  "Group" : ["A", "A", "A", "A", "B", "B", "B", "B"],
  "Mass"  : [100, 200, 300, 400, 100, 200, 300, 400],
  "Speed" : [  5,   3,   1,   7,   2,   2,   4,   9]
})


  Group  Mass  Speed
0     A   100      5
1     A   200      3
2     A   300      1
3     A   400      7
4     B   100      2
5     B   200      2
6     B   300      4
7     B   400      9

And I have a function that takes a (sub-)dataframe and returns a scalar:

def max_speed_of_small_masses(sub_df):
    speed_of_small_masses = sub_df.loc[sub_df["Mass"] < 400, "Speed"]
    return speed_of_small_masses.max()

I want to apply this function to every group and add the results as a new column to the dataframe.

expected_output = pd.DataFrame({
  "Group"    : ["A", "A", "A", "A", "B", "B", "B", "B"],
  "Mass"     : [100, 200, 300, 400, 100, 200, 300, 400],
  "Speed"    : [  5,   3,   1,   7,   2,   2,   4,   9],
  "SmallMax" : [  5,   5,   5,   5,   4,   4,   4,   4]
})

  Group  Mass  Speed  SmallMax
0     A   100      5         5
1     A   200      3         5
2     A   300      1         5
3     A   400      7         5
4     B   100      2         4
5     B   200      2         4
6     B   300      4         4
7     B   400      9         4

So first I group by Group:

grouped = df.groupby(["Group"])[["Mass", "Speed"]]

I cannot use apply in a single step, since it gives one value per group:

applied = grouped.apply(max_speed_of_small_masses)

Group
A    5
B    4

which doesn't have the proper shape, and if I tried to add this as a column, I'd get NaNs:

df["SmallMax"] = applied

  Group  Mass  Speed  SmallMax
0     A   100      5       NaN
1     A   200      3       NaN
2     A   300      1       NaN
3     A   400      7       NaN
4     B   100      2       NaN
5     B   200      2       NaN
6     B   300      4       NaN
7     B   400      9       NaN
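
The NaNs come from index misalignment: applied is indexed by the group labels A/B, while df uses the row labels 0-7. I know I could align them by hand:

df["SmallMax"] = df["Group"].map(applied)  # look up each row's group label in applied

though I'm hoping for a more direct, single-step groupby solution.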

But I cannot use transform either, since transform passes each column to the function separately, so the function never sees the other columns of the sub-dataframe:

transformed = grouped.transform(max_speed_of_small_masses)

KeyError: 'Mass'
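
Indeed, a function of a single column works fine, which shows that transform feeds the function one Series at a time:

# works: each call receives a single column of one group, never the whole sub-dataframe
grouped.transform(lambda s: s.max())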

What is an elegant way to achieve this?

CodePudding user response:

IMO, the best approach is to pre-process the data, replacing the speeds of the non-small masses with NaN before the groupby:

df["SmallMax"] = (df['Speed'] 
                  .where(df['Mass'].lt(400))
                  .groupby(df['Group']).transform('max')
                  )

output:

  Group  Mass  Speed  SmallMax
0     A   100      5       5.0
1     A   200      3       5.0
2     A   300      1       5.0
3     A   400      7       5.0
4     B   100      2       4.0
5     B   200      2       4.0
6     B   300      4       4.0
7     B   400      9       4.0
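
Note that masking with where introduces NaN, which is why the column comes back as float. If every group is guaranteed to contain at least one small mass, you can restore the integer dtype with an explicit cast:

df["SmallMax"] = (df['Speed']
                  .where(df['Mass'].lt(400))
                  .groupby(df['Group']).transform('max')
                  .astype(int)  # safe only when no group is entirely masked out
                  )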

CodePudding user response:

A quite straightforward way would be to use merge, after filtering out the rows with 'Mass' of 400 or more and taking the maximum 'Speed' within each 'Group':

pd.merge(df,
         df.loc[df["Mass"] < 400]
           .groupby('Group', as_index=False)['Speed'].max()
           .rename({'Speed': 'SmallMax'}, axis=1),
         on='Group', how='left')

prints:

  Group  Mass  Speed  SmallMax
0     A   100      5         5
1     A   200      3         5
2     A   300      1         5
3     A   400      7         5
4     B   100      2         4
5     B   200      2         4
6     B   300      4         4
7     B   400      9         4
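
Essentially the same lookup can also be written with join, which aligns df['Group'] against the index of the aggregated Series (a sketch of an equivalent variant):

out = df.join(df.loc[df["Mass"] < 400]
                .groupby('Group')['Speed'].max()
                .rename('SmallMax'),  # per-group max of the small masses
              on='Group')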

CodePudding user response:

You can try

out = (df.groupby(df['Group'])
      .apply(lambda g: g.assign(SmallMax=g.loc[g["Mass"] < 400, 'Speed'].max())))
print(out)

  Group  Mass  Speed  SmallMax
0     A   100      5         5
1     A   200      3         5
2     A   300      1         5
3     A   400      7         5
4     B   100      2         4
5     B   200      2         4
6     B   300      4         4
7     B   400      9         4
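
For completeness, transform itself can be made to see the other columns by closing over the outer dataframe; a sketch, assuming df has a unique row index:

# the lambda receives each group's 'Speed' Series; its index labels are used
# to look up the matching 'Mass' values in the enclosing df
df["SmallMax"] = df.groupby("Group")["Speed"].transform(
    lambda s: s[df.loc[s.index, "Mass"] < 400].max()
)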