Home > Back-end >  Pandas group by columns and perform aggregate on specific columns
Pandas group by columns and perform aggregate on specific columns

Time:08-27

I have the below dataframe.

df1 = pd.DataFrame({'col1': ["A", "X", "E", "A", "X", "X", "X"],
                       'col2': ["B", "Y", "E", "B", "Y","Y","Y"],
                       'col3': ["C", "Z", "E", "C", "Z", "Z", "Z"],
                       'col4': ["D", "A", "F", "D","A", "A","A"],
    'Sex':["Male","Male","Male","Female","Female","Null","Male"],
    'Count':[100,50,100,50,50,10,100],
    'Sum_me':[100,200,1,400,300,500,500],
    'Avg_me':[ 100,200,1,400,300,500,500]
    })

After filtering only the duplicate rows by columns col1,col2,col3,col4. Dataframe will look like below.

columns = ['col1', 'col2', 'col3','col4']
df1 = df1[df1[columns].duplicated(keep=False)].sort_values('col1').reset_index(drop=True)

    col1    col2    col3    col4    Sex   Count Sum_me  Avg_me
0   A       B        C      D       Male    100 100     100
1   A       B        C      D       Female  50  400     400
2   X       Y        Z      A       Male    50  200     200
3   X       Y        Z      A       Female  50  300    300
4   X       Y        Z      A       Null    10  500    500
5   X       Y        Z      A       Male    100 500    500

I am trying to perform aggregation on Sum_me and Avg_me columns and i also want to create a new column say total_male, total_female and null by taking records from count column matching sex column. total_male_female is the sum of male, female and null, I tried the below code but it is not giving expected result

result_df = df.groupby(columns).agg({'Sum_me':'sum','Avg_me':'mean'}).reset_index()

Below is my expected output. Is there a way to do this using pandas any help would be highly appreciated.

output:

col1 col2 col3 col4 total_male  total_female null total_male_female Sum_me  Avg_me
A    B    C    D    100         50            0     150              500    250
X    Y    Z    A    150         50            10    210              1500   376

CodePudding user response:

Try:

x = df1.pivot_table(
    index=["col1", "col2", "col3", "col4"],
    columns="Sex",
    values="Count",
    aggfunc="sum",
    fill_value=0,
)

g = df1.groupby(["col1", "col2", "col3", "col4"])

out = pd.concat(
    [x, g["Sum_me"].sum(), g["Avg_me"].mean()], axis=1
).reset_index()

print(out)

Prints:

  col1 col2 col3 col4  Female  Male  Null  Sum_me  Avg_me
0    A    B    C    D      50   100     0     500   250.0
1    X    Y    Z    A      50   150    10    1500   375.0
  • Related