I need confirmation about the behavior of pandas GroupBy transform:
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
grouped.transform(lambda x: (x - x.mean()) / x.std())
          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000
The call does not specify which columns the lambda function should be applied to. How does pandas decide which columns (in this case, C and D) to apply the function to? Why did it not apply the function to column B and raise an error?
Why does the output not include columns A and B?
CodePudding user response:
GroupBy.transform calls the specified function for each column in each group (so B, C, and D, but not A, because that is the column you are grouping by). However, the functions you are calling (mean and std) only work with numeric values, so pandas skips a column if its dtype is not numeric. String columns have dtype object, which is not numeric, so B gets dropped and you are left with C and D.
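To check for yourself which columns the function can handle, you can inspect the dtypes. The following is a minimal sketch that reuses the DataFrame from the question; the name numeric_cols is introduced here only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})

# Every column except the grouping key 'A' whose dtype is numeric.
# 'B' holds strings, so it is excluded.
numeric_cols = [c for c in df.columns
                if c != 'A' and is_numeric_dtype(df[c])]
print(numeric_cols)  # ['C', 'D']
```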
You should have got a warning when you ran your code:
FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the transforming function.
As it indicates, you need to select the columns you want to process before calling transform in order to avoid the warning. You can do that by adding [['C', 'D']] (to select, for example, your C and D columns) before the call to transform:
grouped[['C', 'D']].transform(lambda x: (x - x.mean()) / x.std())
#      ^^^^^^^^^^^^ important
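If you would rather not hard-code the column names, or you want A and B back next to the transformed values, something along these lines should work. It is only a sketch: numeric_cols, result, and out are names made up for this example, and select_dtypes is used here as one possible way to pick the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})

# Pick the numeric columns up front instead of letting transform drop B.
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# Standardize each numeric column within its group.
result = df.groupby('A')[numeric_cols].transform(
    lambda x: (x - x.mean()) / x.std())

# transform keeps the original row index, so the untouched columns
# can be joined back on the index.
out = df[['A', 'B']].join(result)
print(out)
```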