Home > Back-end >  Filter duplicate records in a dataframe using pandas and perform operations
Filter duplicate records in a dataframe using pandas and perform operations

Time:08-27

I have the below dataframe and trying to filter rows with duplicate records present only in col 1, col2, col3, col 4 using the below code.

df1 = pd.DataFrame({'col1': ["A", "X", "E", "A", "X", "X"],
                   'col2': ["B", "Y", "E", "B", "Y","Y"],
                   'col3': ["C", "Z", "E", "C", "Z", "Z"],
                   'col4': ["D", "A", "F", "D","A", "A"],
'Sex':["Male","Male","Male","Female","Female","Null",],
'Count':[100,50,100,50,50,10]})

**df1**

col1    col2    col3    col4    Sex   Count
A        B     C       D        Male    100
X        Y     Z       A        Male    50
E        E     E       F        Male    100
A        B     C       D        Female  50
X        Y     Z       A        Female  50
X        Y     Z       A        Null    10

dup_df = df[df[['col1', 'col2', 'col3',"col4"]].duplicated(keep=False)].sort_values("col 1").reset_index(drop=True)

**dup_df**

    col1    col2    col3    col4    Sex   Count
0   A      B         C       D      Male    100
1   A      B         C       D      Female  50
2   X      Y         Z       A      Male    50
3   X      Y         Z       A      Female  50
4   X      Y         Z       A      Null    10  

After filtering the duplicates and sort the values, dataframe will look like above. Now I want to perform some arithmetic operation. Row 0,Row 1 and Row 2,3,4 are same from col 1 to col 4, I want to create a new column say total male, total female and Null by taking records from count column matching sex column. Later i will be dropping the sex column, row 1,3,4. After performing the operation below is the expected result.

   col1 col2    col3 col4    Sex       Count    Total   Total_Male  Total_Female    Null_column
   0  A    B      C     D    Male       100      150        100        50   
   1  A    B      C     D    Female     50              
   2  X    Y      Z     A    Male       50       110        50         50            10
   3  X    Y      Z     A    Female     50              
   4  X    Y      Z     A     Null      10

            

Could someone help me with most efficient way of producing the result using Pandas. I can only think of using loops. Anyhelp would be highly appreciated.

CodePudding user response:

You can leave one value per group right away like this:

columns = ['col1', 'col2', 'col3',"col4"]
grouped = dup_df.groupby(columns)

grouped[['Sex', 'Count']].apply(
    lambda sub_df: (sub_df.groupby('Sex')
                          .agg(sum).T
                          .rename(columns={'Male': 'Total_Male', 
                                           'Female': 'Total_Female', 
                                           'Null': 'Null_column'}))
    ).assign(Total=lambda x: x.sum(axis=1))
     .reset_index(level=4, drop=True)
     .reset_index().rename_axis(columns=None)
)
  col1 col2 col3 col4  Total_Female  Total_Male  Null_column  Total
0    A    B    C    D            50         100          NaN  150.0
1    X    Y    Z    A            50          50         10.0  110.0
  • Related