Home > Net >  How to remove duplicates and keep values of all columns
How to remove duplicates and keep values of all columns

Time:10-09

I have a df like below

  Date         ID Colour ColourCode    Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple      NaN
1 2020-01-02  245   Blue        NaN   Apple       Ap
2 2020-01-03  245   Blue         Bl  Orange      NaN
3 2020-01-03  245  Green        NaN   Apple       Ap
4 2020-01-05  472    Red         Re   Grape      NaN
5 2020-01-05  472    Red        NaN   Grape       Gr
6 2020-01-07  472    Red         Re  Banana      NaN

df.to_dict()
{'Date': {0: Timestamp('2020-01-02 00:00:00'), 1: Timestamp('2020-01-02 00:00:00'), 2: Timestamp('2020-01-03 00:00:00'), 3: Timestamp('2020-01-03 00:00:00'), 4: Timestamp('2020-01-05 00:00:00'), 5: Timestamp('2020-01-05 00:00:00'), 6: Timestamp('2020-01-07 00:00:00')}, 'ID': {0: 245, 1: 245, 2: 245, 3: 245, 4: 472, 5: 472, 6: 472}, 'Colour': {0: 'Blue', 1: 'Blue', 2: 'Blue', 3: 'Green', 4: 'Red', 5: 'Red', 6: 'Red'}, 'ColourCode': {0: 'Bl', 1: nan, 2: 'Bl', 3: nan, 4: 'Re', 5: nan, 6: 'Re'}, 'Item': {0: 'Apple', 1: 'Apple', 2: 'Orange', 3: 'Apple', 4: 'Grape', 5: 'Grape', 6: 'Banana'}, 'ItemCode': {0: nan, 1: 'Ap', 2: nan, 3: 'Ap', 4: nan, 5: 'Gr', 6: nan}}

Any rows having same Date, ID, Colour, Item would be considered duplicates. I just want to keep one row but this row would contain values of both columns ColourCode and ItemCode.
My expected result looks like

 Date          ID Colour ColourCode Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple       Ap
1 2020-01-03  245   Blue         Bl  Orange      NaN
2 2020-01-03  245  Green        NaN   Apple       Ap
3 2020-01-05  472    Red         Re   Grape       Gr
4 2020-01-07  472    Red         Re  Banana      NaN

Are there any ways of doing it in Python?

CodePudding user response:

Assuming , you can use groupby.first and reindex:

out = (df
 .groupby(['Date', 'ID', 'Colour', 'Item'], as_index=False)
 .first()[df.columns]
 )

Output:

        Date   ID Colour ColourCode    Item ItemCode
0 2020-01-02  245   Blue         Bl   Apple       Ap
1 2020-01-03  245   Blue         Bl  Orange     None
2 2020-01-03  245  Green       None   Apple       Ap
3 2020-01-05  472    Red         Re   Grape       Gr
4 2020-01-07  472    Red         Re  Banana     None

CodePudding user response:

Assuming pandas and using fillna, groupby, and aggregate:

df = df.fillna("")
aggregation_functions = {'ColourCode': 'sum', 'ItemCode': 'sum'}
df = df.groupby(['Date','ID','Colour', 'Item'], as_index=False).aggregate(aggregation_functions)

Output:

        Date        ID   Colour  Item      ColourCode ItemCode                                              
     0  01/02/2020  245   Blue   Apple         Bl       Ap                     
     1  01/03/2020  245   Blue  Orange         Bl                                                                            
     2  01/03/2020  245  Green   Apple                  Ap                                                                   
     3  01/05/2020  472    Red   Grape         Re       Gr                                                                   
     4  01/07/2020  472    Red  Banana         Re         
  • Related