python dataframe groupby transform with condition-CodePudding

I have a dataframe like this:

ColA	ColB	ColC	ColD
A	Type1	10	Red
A	Type1	20	Blue
A	Type2	10	Blue
A	Type2	10	Red
A	Type2	30	Blue
B	Type1	70	Blue
B	Type1	10	Red
B	Type2	19	Yellow
B	Type2	50	Blue
B	Type2	12	Yellow
B	Type2	15	Blue

I would like to use groupby (ColA and ColB) and transform with condition based on ColD="Blue" to calculate avg ColC into ColE, and I've already made sure each ColA and Colb combination must contains one "Blue" in ColD.

What I would like to get:

ColA	ColB	ColC	ColD	ColE
A	Type1	10	Red	20
A	Type1	20	Blue	20
A	Type2	10	Blue	20
A	Type2	10	Red	20
A	Type2	30	Blue	20
B	Type1	70	Blue	70
B	Type1	10	Red	70
B	Type2	19	Yellow	37.5
B	Type2	50	Blue	37.5
B	Type2	12	Yellow	37.5
B	Type2	15	Blue	37.5

So far, I am successful on groupby transform, but adding condition on ColD throw it off, any ideas?

I've tried without condition on df.groupby(['ColA', 'ColB'])['ColC'].transform('mean') but don't know how to add condition ColD= "Blue" into it.

CodePudding user response：

Mask the values in ColC where ColD is not Blue then do groupby and transform on masked column

m = df['ColD'] != 'Blue'
df['ColE'] = df['ColC'].mask(m).groupby([df['ColA'], df['ColB']]).transform('mean')

   ColA   ColB  ColC    ColD  ColE
0     A  Type1    10     Red  20.0
1     A  Type1    20    Blue  20.0
2     A  Type2    10    Blue  20.0
3     A  Type2    10     Red  20.0
4     A  Type2    30    Blue  20.0
5     B  Type1    70    Blue  70.0
6     B  Type1    10     Red  70.0
7     B  Type2    19  Yellow  32.5
8     B  Type2    50    Blue  32.5
9     B  Type2    12  Yellow  32.5
10    B  Type2    15    Blue  32.5

CodePudding user response：

Let's filter the dataframe by ColD column then do a groupby.transform to get the mean of Blue then do another group.transform to fill the other color.

df['ColE'] = (df[df['ColD'].eq('Blue')]
              .groupby(['ColA', 'ColB'])['ColC'].transform('mean')
              .reindex(df.index)
              .groupby([df['ColA'], df['ColB']])
              .transform(lambda col: col.ffill().bfill()))

print(df)

   ColA   ColB  ColC    ColD  ColE
0     A  Type1    10     Red  20.0
1     A  Type1    20    Blue  20.0
2     A  Type2    10    Blue  20.0
3     A  Type2    10     Red  20.0
4     A  Type2    30    Blue  20.0
5     B  Type1    70    Blue  70.0
6     B  Type1    10     Red  70.0
7     B  Type2    19  Yellow  32.5
8     B  Type2    50    Blue  32.5
9     B  Type2    12  Yellow  32.5
10    B  Type2    15    Blue  32.5

CodePudding user response：

You were almost there! :)

df = pd.DataFrame(dict(cola=[1,1,3,4], colb=[1,1,3,7], colc=[5,1,6,9], cold=['b','b','y','b']))

df['cole'] = df.groupby(['cola', 'colb']).colc.transform('mean').where(df.cold == 'b')

|    |   cola |   colb |   colc | cold   |   cole |
|---:|-------:|-------:|-------:|:-------|-------:|
|  0 |      1 |      1 |      5 | b      |      3 |
|  1 |      1 |      1 |      1 | b      |      3 |
|  2 |      3 |      3 |      6 | y      |    nan |
|  3 |      4 |      7 |      9 | b      |      9 |