(Versions: Python 3.10.4, Pandas 1.4.3, NumPy 1.23.1)
I have this dataframe:
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Mass":  [100, 200, 300, 400, 100, 200, 300, 400],
    "Speed": [  5,   3,   1,   7,   2,   2,   4,   9],
})
Group Mass Speed
0 A 100 5
1 A 200 3
2 A 300 1
3 A 400 7
4 B 100 2
5 B 200 2
6 B 300 4
7 B 400 9
And I have a function that takes a (sub-)dataframe and returns a scalar:
def max_speed_of_small_masses(sub_df):
    speed_of_small_masses = sub_df.loc[sub_df["Mass"] < 400, "Speed"]
    return speed_of_small_masses.max()
I want to apply this function to every group and add the results as a new column to the dataframe.
expected_output = pd.DataFrame({
    "Group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Mass":     [100, 200, 300, 400, 100, 200, 300, 400],
    "Speed":    [  5,   3,   1,   7,   2,   2,   4,   9],
    "SmallMax": [  5,   5,   5,   5,   4,   4,   4,   4],
})
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
So first I group by Group:
grouped = df.groupby(["Group"])[["Mass", "Speed"]]
I cannot use apply in a single step here, since it only returns one value per group:
applied = grouped.apply(max_speed_of_small_masses)
Group
A 5
B 4
which doesn't have the proper shape, and if I tried to add this as a column, I'd get NaNs:
df["SmallMax"] = applied
Group Mass Speed SmallMax
0 A 100 5 NaN
1 A 200 3 NaN
2 A 300 1 NaN
3 A 400 7 NaN
4 B 100 2 NaN
5 B 200 2 NaN
6 B 300 4 NaN
7 B 400 9 NaN
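For reference, the NaNs come from index alignment: applied is indexed by the group labels "A"/"B", while df has a 0-7 range index. Broadcasting it back by mapping the group keys would work, e.g.
df["SmallMax"] = df["Group"].map(applied)
but I'm still hoping for a cleaner, more direct way.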
But I cannot use transform either, since transform works on one column at a time and therefore cannot access the other columns of the sub-dataframe:
transformed = grouped.transform(max_speed_of_small_masses)
KeyError: 'Mass'
What is an elegant way to achieve this?
CodePudding user response:
IMO, the best approach is to pre-process the data: mask the Speed values of the non-small masses (Mass >= 400) with NaN before the groupby:
df["SmallMax"] = (df['Speed']
.where(df['Mass'].lt(400))
.groupby(df['Group']).transform('max')
)
output:
Group Mass Speed SmallMax
0 A 100 5 5.0
1 A 200 3 5.0
2 A 300 1 5.0
3 A 400 7 5.0
4 B 100 2 4.0
5 B 200 2 4.0
6 B 300 4 4.0
7 B 400 9 4.0
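Side note: the NaN placeholders introduced by where make the column float, which is why you see 5.0/4.0. Assuming every group has at least one row with Mass < 400 (so no NaN survives the max), you can cast back to integer afterwards:
df["SmallMax"] = df["SmallMax"].astype(int)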
CodePudding user response:
A quite straightforward way would be to use merge, after filtering out the rows with 'Mass' of 400 or more and taking the max 'Speed' of each 'Group':
pd.merge(df,
         df.loc[df["Mass"] < 400]
           .groupby('Group', as_index=False)['Speed'].max()
           .rename({'Speed': 'SmallMax'}, axis=1),
         on='Group', how='left')
prints:
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
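If you want to inspect the per-group aggregate before merging, it can be built as its own step first (small_max is just an illustrative name):
small_max = (df.loc[df["Mass"] < 400]
             .groupby('Group', as_index=False)['Speed'].max()
             .rename({'Speed': 'SmallMax'}, axis=1))
It is a two-row table with columns Group and SmallMax (5 for A, 4 for B), and pd.merge(df, small_max, on='Group', how='left') gives the same output as above.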
CodePudding user response:
You can try
out = (df.groupby(df['Group'])
.apply(lambda g: g.assign(SmallMax=g.loc[g["Mass"] < 400, 'Speed'].max())))
print(out)
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
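One caveat: this works because apply notices that each returned frame keeps the group's original index, so the group keys are not prepended to the result's index. In later pandas versions the keys are added to the index by default, so it can be safer to pass group_keys=False explicitly; a sketch of the same idea:
out = (df.groupby('Group', group_keys=False)
         .apply(lambda g: g.assign(SmallMax=g.loc[g["Mass"] < 400, 'Speed'].max())))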