Home > front end >  Python Pandas adding a column where the value is across other rows
Python Pandas adding a column where the value is across other rows

Time:03-25

I would like to use multiple rows in the evaluation of a new column on a pandas dataframe. For a little background lets say that I have rows where data has been aggregated and I need to now how my current row compares to that row. For example

Gender Name weight
M James 125
M Jack 157
M *AGG 282
F Sue 130
F *AGG 130

Where *AGG is a special value meaning that row is an aggregate row. What I would like to do is find what percent of the *agg row each other row is? So the *AGG row is always 1.0 or 100% of itself. What I do is select off the *AGG rows and then add a new column with the agg value back to the main frame before the operation. The agg frame looks like this. | Gender | Name | weight | weight_summary | | --- | --- | --- | --- | | M | James | 125 | 282 | | M | Jack | 157 | 282 | | M | *AGG | 282 | 282 | | F | Sue | 130 | 130| | F | *AGG | 130 | 130 |

Then I do the operation and drop the extra column so that the final frame looks like this.

Gender Name weight weight %
M James 125 0.4432624
M Jack 157 0.5567375
M *AGG 282 1.0
F Sue 130 1.0
F *AGG 130 1.0

Currently I am doing this the long way which is identifying the *AGG rows then renaming weight and merging it back on the original dataframe.

    agg_rows = self.data_frame.query('Name == "*AGG"')['Gender', 'Weight']
    agg_name = 'weight_summary'
    agg_rows.rename({'weight': agg_name }, axis=1, inplace=True)
    on_cols = ['Gender']
    df = df.merge(agg_rows, how='left', on=on_cols)
    df['weight_percent'] = df['weight'].div(df[agg_name])
    df.drop(agg_name, axis=1, inplace=True)

My actual code is a little more robust as its dynamic but I think this is enough to get the point across. This gives me the desired output but doesnt give me a warm and fuzzy feeling. Is there a better way perhaps with masks or without having to add the temporary column on? Any advice is much appreciated even if its just confirming that my approach is okay.

CodePudding user response:

Lets try groupby combined with a unanimous function

print(df)

   Gender   Name  weight
0      M  James     125
1      M   Jack     157
2      F    Sue     130

df['weight%'] = df.groupby('Gender')['weight'].apply(lambda x: x/x.sum())



   Gender   Name  weight   weight%
0      M  James     125  0.443262
1      M   Jack     157  0.556738
2      F    Sue     130  1.000000

CodePudding user response:

The key here is to mask *AGG rows to compute the sum for each group then restore them:

df['weight %'] = (df['weight'] / df.mask(df['Name'] == '*AGG')
                                   .groupby('Gender')['weight']
                                   .transform('sum')
                                   .fillna(df['weight']))

Output:

Gender Name weight weight %
M James 125 0.443262
M Jack 157 0.556738
M *AGG 282 1
F Sue 130 1
F *AGG 130 1
  • Related