I have a df like below
Date ID Colour ColourCode Item ItemCode
0 2020-01-02 245 Blue Bl Apple NaN
1 2020-01-02 245 Blue NaN Apple Ap
2 2020-01-03 245 Blue Bl Orange NaN
3 2020-01-03 245 Green NaN Apple Ap
4 2020-01-05 472 Red Re Grape NaN
5 2020-01-05 472 Red NaN Grape Gr
6 2020-01-07 472 Red Re Banana NaN
df.to_dict()
{'Date': {0: Timestamp('2020-01-02 00:00:00'), 1: Timestamp('2020-01-02 00:00:00'), 2: Timestamp('2020-01-03 00:00:00'), 3: Timestamp('2020-01-03 00:00:00'), 4: Timestamp('2020-01-05 00:00:00'), 5: Timestamp('2020-01-05 00:00:00'), 6: Timestamp('2020-01-07 00:00:00')}, 'ID': {0: 245, 1: 245, 2: 245, 3: 245, 4: 472, 5: 472, 6: 472}, 'Colour': {0: 'Blue', 1: 'Blue', 2: 'Blue', 3: 'Green', 4: 'Red', 5: 'Red', 6: 'Red'}, 'ColourCode': {0: 'Bl', 1: nan, 2: 'Bl', 3: nan, 4: 'Re', 5: nan, 6: 'Re'}, 'Item': {0: 'Apple', 1: 'Apple', 2: 'Orange', 3: 'Apple', 4: 'Grape', 5: 'Grape', 6: 'Banana'}, 'ItemCode': {0: nan, 1: 'Ap', 2: nan, 3: 'Ap', 4: nan, 5: 'Gr', 6: nan}}
Any rows having same Date, ID, Colour, Item would be considered duplicates. I just want to keep one row but this row would contain values of both columns ColourCode and ItemCode.
My expected result looks like
Date ID Colour ColourCode Item ItemCode
0 2020-01-02 245 Blue Bl Apple Ap
1 2020-01-03 245 Blue Bl Orange NaN
2 2020-01-03 245 Green NaN Apple Ap
3 2020-01-05 472 Red Re Grape Gr
4 2020-01-07 472 Red Re Banana NaN
Are there any ways of doing it in Python?
CodePudding user response:
Assuming pandas, you can use groupby.first
and reindex:
out = (df
.groupby(['Date', 'ID', 'Colour', 'Item'], as_index=False)
.first()[df.columns]
)
Output:
Date ID Colour ColourCode Item ItemCode
0 2020-01-02 245 Blue Bl Apple Ap
1 2020-01-03 245 Blue Bl Orange None
2 2020-01-03 245 Green None Apple Ap
3 2020-01-05 472 Red Re Grape Gr
4 2020-01-07 472 Red Re Banana None
CodePudding user response:
Assuming pandas and using fillna, groupby, and aggregate:
df = df.fillna("")
aggregation_functions = {'ColourCode': 'sum', 'ItemCode': 'sum'}
df = df.groupby(['Date','ID','Colour', 'Item'], as_index=False).aggregate(aggregation_functions)
Output:
Date ID Colour Item ColourCode ItemCode
0 01/02/2020 245 Blue Apple Bl Ap
1 01/03/2020 245 Blue Orange Bl
2 01/03/2020 245 Green Apple Ap
3 01/05/2020 472 Red Grape Re Gr
4 01/07/2020 472 Red Banana Re