Python Pandas adding a column where the value is across other rows-CodePudding

I would like to use multiple rows in the evaluation of a new column on a pandas dataframe. For a little background lets say that I have rows where data has been aggregated and I need to now how my current row compares to that row. For example

Gender	Name	weight
M	James	125
M	Jack	157
M	*AGG	282
F	Sue	130
F	*AGG	130

Where *AGG is a special value meaning that row is an aggregate row. What I would like to do is find what percent of the *agg row each other row is? So the *AGG row is always 1.0 or 100% of itself. What I do is select off the *AGG rows and then add a new column with the agg value back to the main frame before the operation. The agg frame looks like this. | Gender | Name | weight | weight_summary | | --- | --- | --- | --- | | M | James | 125 | 282 | | M | Jack | 157 | 282 | | M | *AGG | 282 | 282 | | F | Sue | 130 | 130| | F | *AGG | 130 | 130 |

Then I do the operation and drop the extra column so that the final frame looks like this.

Gender	Name	weight	weight %
M	James	125	0.4432624
M	Jack	157	0.5567375
M	*AGG	282	1.0
F	Sue	130	1.0
F	*AGG	130	1.0

Currently I am doing this the long way which is identifying the *AGG rows then renaming weight and merging it back on the original dataframe.

    agg_rows = self.data_frame.query('Name == "*AGG"')['Gender', 'Weight']
    agg_name = 'weight_summary'
    agg_rows.rename({'weight': agg_name }, axis=1, inplace=True)
    on_cols = ['Gender']
    df = df.merge(agg_rows, how='left', on=on_cols)
    df['weight_percent'] = df['weight'].div(df[agg_name])
    df.drop(agg_name, axis=1, inplace=True)

My actual code is a little more robust as its dynamic but I think this is enough to get the point across. This gives me the desired output but doesnt give me a warm and fuzzy feeling. Is there a better way perhaps with masks or without having to add the temporary column on? Any advice is much appreciated even if its just confirming that my approach is okay.

CodePudding user response：

Lets try groupby combined with a unanimous function

print(df)

   Gender   Name  weight
0      M  James     125
1      M   Jack     157
2      F    Sue     130

df['weight%'] = df.groupby('Gender')['weight'].apply(lambda x: x/x.sum())



   Gender   Name  weight   weight%
0      M  James     125  0.443262
1      M   Jack     157  0.556738
2      F    Sue     130  1.000000

CodePudding user response：

The key here is to mask *AGG rows to compute the sum for each group then restore them:

df['weight %'] = (df['weight'] / df.mask(df['Name'] == '*AGG')
                                   .groupby('Gender')['weight']
                                   .transform('sum')
                                   .fillna(df['weight']))

Output:

Gender	Name	weight	weight %
M	James	125	0.443262
M	Jack	157	0.556738
M	*AGG	282	1
F	Sue	130	1
F	*AGG	130	1