I found the following example online, which shows how to achieve the pandas equivalent of SQL's PARTITION BY:
df['percent_of_points'] = df.groupby('team')['points'].transform(lambda x: x/x.sum())
#view updated DataFrame
print(df)
  team  points  percent_of_points
0    A      30           0.352941
1    A      22           0.258824
2    A      19           0.223529
3    A      14           0.164706
4    B      14           0.191781
5    B      11           0.150685
6    B      20           0.273973
7    B      28           0.383562
I struggle to understand what the 'x' refers to in the lambda function lambda x: x/x.sum() because it appears to refer to an individual element when used as the numerator i.e. 'x' but also appears to be a list of values when used as a denominator i.e. x.sum().
I think I am not thinking about this in the right way, or I have a gap in my understanding of Python or pandas.
CodePudding user response:
it appears to refer to an individual element when used as the numerator i.e. 'x' but also appears to be a list of values when used as a denominator i.e. x.sum()
It doesn't. In both places x is the same object: a pd.Series whose length is the size of the group. x.sum() is a single number, so x / x.sum() divides every element of the series by that number. The result is not a single value; it is a pd.Series of the same size.
.transform then assigns the values of this series to the corresponding rows of that column from the group-by operation.
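To make the numerator/denominator point concrete, here is a minimal sketch that uses team A's four values from the question as a standalone Series, with no groupby involved. The variable names total and shares are mine, chosen only for illustration.

```python
import pandas as pd

# Team A's points from the question, as a plain Series.
x = pd.Series([30, 22, 19, 14], name="points")

total = x.sum()      # a single number: 85
shares = x / total   # a Series: every element divided by that number

print(type(x).__name__, type(shares).__name__)
print(total)
print(shares)
```

The same name x is a Series in both positions; only x.sum() collapses it to a scalar, and pandas then applies the division element by element.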
So, consider:
In [15]: df
Out[15]:
  team  points
0    A      30
1    A      22
2    A      19
3    A      14
4    B      14
5    B      11
6    B      20
7    B      28
In [16]: for k, g in df.groupby('team')['points']:
    ...:     print(g)
    ...:     print(g / g.sum())
    ...:
0    30
1    22
2    19
3    14
Name: points, dtype: int64
0    0.352941
1    0.258824
2    0.223529
3    0.164706
Name: points, dtype: float64
4    14
5    11
6    20
7    28
Name: points, dtype: int64
4    0.191781
5    0.150685
6    0.273973
7    0.383562
Name: points, dtype: float64
In [17]:
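If it helps, the same thing can be checked inside transform itself. The sketch below rebuilds the DataFrame from the question, swaps the lambda for a named function that records what it receives, and compares the result with a version that avoids a Python-level function altogether. The names share_of_group, seen, group_totals and vectorized are hypothetical, made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "points": [30, 22, 19, 14, 14, 11, 20, 28],
})

seen = []  # illustration only: records the type and length of each argument


def share_of_group(x):
    # x is whatever transform passes in; note what it is, then compute as before.
    seen.append((type(x).__name__, len(x)))
    return x / x.sum()


via_function = df.groupby("team")["points"].transform(share_of_group)

# Alternative without a custom function: transform("sum") repeats each
# group's total on every row of that group, so a single division over the
# whole column gives the same shares.
group_totals = df.groupby("team")["points"].transform("sum")
vectorized = df["points"] / group_totals

print(seen)
print(vectorized)
```

Every entry recorded in seen is a Series, never an individual element. The second form is usually the one I would reach for on larger data, since the built-in "sum" avoids calling Python code for each group.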