Pandas groupby transform

I need confirmation of how Pandas GroupBy transform behaves:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
grouped.transform(lambda x: (x - x.mean()) / x.std())

          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000

The call does not specify which columns the lambda function should be applied to. How does pandas decide which columns (in this case, C and D) to apply it to? Why did it not apply the function to column B and throw an error?

Why does the output not include columns A and B?

CodePudding user response:

GroupBy.transform calls the specified function on each column within each group (so B, C, and D, but not A, because that is the column you are grouping by). However, the operations in your lambda (mean, std, and the subtraction) only work on numeric values, so the function fails on B, whose strings are stored with the non-numeric object dtype. In the pandas version you are running, a column on which the function fails is dropped from the result, so you are left with C and D.
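To check which columns pandas treats as numeric, you can inspect the frame before grouping. This is only a small sketch built on the DataFrame from the question; the variable name numeric_cols is made up for illustration. Note that if the grouping column were numeric it would show up in this list too, so it would have to be left out by hand.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 5, 5, 2, 5, 5],
    'D': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0],
})

# Per-column dtypes: A and B hold strings, C holds ints, D holds floats.
print(df.dtypes)

# Names of the columns with a numeric dtype (hypothetical helper variable).
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['C', 'D']
```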

You should have gotten a warning when you ran your code:

FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the transforming function.

As the warning indicates, you need to select the columns you want to process before calling transform in order to avoid it. You can do that by adding [['C', 'D']] (to select, for example, your C and D columns) before the call to transform:

grouped[['C', 'D']].transform(lambda x: (x - x.mean()) / x.std())
#      ^^^^^^^^^^^^ important
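Putting it together, here is a self-contained sketch of the same fix using the data from the question. The helper name zscore and the final join are my own additions for illustration, not something pandas requires. Because transform returns a result aligned to the original index, the columns it left out (A and B) can be joined back on if you want them next to the scaled values.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 5, 5, 2, 5, 5],
    'D': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0],
})


def zscore(x):
    # Standardize within a group; std() uses the sample
    # standard deviation (ddof=1) by default.
    return (x - x.mean()) / x.std()


# Select the numeric columns explicitly, then transform per group.
scaled = df.groupby('A')[['C', 'D']].transform(zscore)

# The result keeps the original row index, so the untouched
# columns can be placed alongside it.
result = df[['A', 'B']].join(scaled)
print(result)
```

The C and D values in result should match the output shown in the question, with A and B carried along unchanged.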