I have the below dataframe.
df1 = pd.DataFrame({'col1': ["A", "X", "E", "A", "X", "X", "X"],
'col2': ["B", "Y", "E", "B", "Y","Y","Y"],
'col3': ["C", "Z", "E", "C", "Z", "Z", "Z"],
'col4': ["D", "A", "F", "D","A", "A","A"],
'Sex':["Male","Male","Male","Female","Female","Null","Male"],
'Count':[100,50,100,50,50,10,100],
'Sum_me':[100,200,1,400,300,500,500],
'Avg_me':[ 100,200,1,400,300,500,500]
})
After filtering only the duplicate rows by columns col1,col2,col3,col4. Dataframe will look like below.
columns = ['col1', 'col2', 'col3','col4']
df1 = df1[df1[columns].duplicated(keep=False)].sort_values('col1').reset_index(drop=True)
col1 col2 col3 col4 Sex Count Sum_me Avg_me
0 A B C D Male 100 100 100
1 A B C D Female 50 400 400
2 X Y Z A Male 50 200 200
3 X Y Z A Female 50 300 300
4 X Y Z A Null 10 500 500
5 X Y Z A Male 100 500 500
I am trying to perform aggregation on Sum_me and Avg_me columns and i also want to create a new column say total_male, total_female and null by taking records from count column matching sex column. total_male_female is the sum of male, female and null, I tried the below code but it is not giving expected result
result_df = df.groupby(columns).agg({'Sum_me':'sum','Avg_me':'mean'}).reset_index()
Below is my expected output. Is there a way to do this using pandas any help would be highly appreciated.
output:
col1 col2 col3 col4 total_male total_female null total_male_female Sum_me Avg_me
A B C D 100 50 0 150 500 250
X Y Z A 150 50 10 210 1500 376
CodePudding user response:
Try:
x = df1.pivot_table(
index=["col1", "col2", "col3", "col4"],
columns="Sex",
values="Count",
aggfunc="sum",
fill_value=0,
)
g = df1.groupby(["col1", "col2", "col3", "col4"])
out = pd.concat(
[x, g["Sum_me"].sum(), g["Avg_me"].mean()], axis=1
).reset_index()
print(out)
Prints:
col1 col2 col3 col4 Female Male Null Sum_me Avg_me
0 A B C D 50 100 0 500 250.0
1 X Y Z A 50 150 10 1500 375.0