Let's say I hace a datframe like this:
df = pd.DataFrame({
'COUNTRY_CODE': ['CO','CO','CO','BR','BR','BR'],
'VERTICAL_GROUP_ID': [2,2,3,2,3,3],
'SUB_VERTICAL': ['SUPER','SUPER','HOME','LICOR','SPORTS','HOME'],
'PRODUCT_ID': [1111,1111,1111,1111,1111,2222],
'SHOWN': [7,8,12,14,16,1],
})
I want to get in another dataframe, for each COUNTRY_CODE/PRODUCT_ID combination, only ONE row for each VERTICAL_GROUP.
So for the df above, I'd like to get something like this:
COUNTRY_CODE | VERTICAL_GROUP_ID | SUB_VERTICAL | PRODUCT_ID | SHOWN |
---|---|---|---|---|
CO | 2 | SUPER | 1111 | 7 |
CO | 3 | HOME | 1111 | 12 |
BR | 2 | LICOR | 1111 | 14 |
BR | 3 | SPORTS | 1111 | 16 |
BR | 3 | HOME | 2222 | 1 |
It doesn't matter which rows of each COUNTRY_CODE/PRODUCT_ID/VERTICAL_GROUP combination I keep, as long as I get only 1 for each VERTICAL_GROUP.
Whats the best way of doing this? I tried mixing a grouby("PRODUCT_ID") with a drop_duplicates(subset=['VERTICAL_GROUP_ID']), but I keepe doing something wrong,because I keep getting errors like
AttributeError: 'DataFrameGroupBy' object has no attribute 'drop_duplicates'
CodePudding user response:
You could try:
df.groupby(['COUNTRY_CODE', 'VERTICAL_GROUP_ID', 'PRODUCT_ID']).agg('first').reset_index()
COUNTRY_CODE VERTICAL_GROUP_ID PRODUCT_ID SUB_VERTICAL SHOWN
0 BR 2 1111 LICOR 14
1 BR 3 1111 SPORTS 16
2 BR 3 2222 HOME 1
3 CO 2 1111 SUPER 7
4 CO 3 1111 HOME 12