I have the below dataframe and trying to filter rows with duplicate records present only in col 1, col2, col3, col 4 using the below code.
df1 = pd.DataFrame({'col1': ["A", "X", "E", "A", "X", "X"],
'col2': ["B", "Y", "E", "B", "Y","Y"],
'col3': ["C", "Z", "E", "C", "Z", "Z"],
'col4': ["D", "A", "F", "D","A", "A"],
'Sex':["Male","Male","Male","Female","Female","Null",],
'Count':[100,50,100,50,50,10]})
**df1**
col1 col2 col3 col4 Sex Count
A B C D Male 100
X Y Z A Male 50
E E E F Male 100
A B C D Female 50
X Y Z A Female 50
X Y Z A Null 10
dup_df = df[df[['col1', 'col2', 'col3',"col4"]].duplicated(keep=False)].sort_values("col 1").reset_index(drop=True)
**dup_df**
col1 col2 col3 col4 Sex Count
0 A B C D Male 100
1 A B C D Female 50
2 X Y Z A Male 50
3 X Y Z A Female 50
4 X Y Z A Null 10
After filtering the duplicates and sort the values, dataframe will look like above. Now I want to perform some arithmetic operation. Row 0,Row 1 and Row 2,3,4 are same from col 1 to col 4, I want to create a new column say total male, total female and Null by taking records from count column matching sex column. Later i will be dropping the sex column, row 1,3,4. After performing the operation below is the expected result.
col1 col2 col3 col4 Sex Count Total Total_Male Total_Female Null_column
0 A B C D Male 100 150 100 50
1 A B C D Female 50
2 X Y Z A Male 50 110 50 50 10
3 X Y Z A Female 50
4 X Y Z A Null 10
Could someone help me with most efficient way of producing the result using Pandas. I can only think of using loops. Anyhelp would be highly appreciated.
CodePudding user response:
You can leave one value per group right away like this:
columns = ['col1', 'col2', 'col3',"col4"]
grouped = dup_df.groupby(columns)
grouped[['Sex', 'Count']].apply(
lambda sub_df: (sub_df.groupby('Sex')
.agg(sum).T
.rename(columns={'Male': 'Total_Male',
'Female': 'Total_Female',
'Null': 'Null_column'}))
).assign(Total=lambda x: x.sum(axis=1))
.reset_index(level=4, drop=True)
.reset_index().rename_axis(columns=None)
)
col1 col2 col3 col4 Total_Female Total_Male Null_column Total
0 A B C D 50 100 NaN 150.0
1 X Y Z A 50 50 10.0 110.0