I have this example df:
df3 = pd.DataFrame({'Customer':['Sara','John','Didi','Sara','Didi' ,'Didi'],
'Date': ['15-12-2021', '1-1-2022' , '1-3-2022','15-3-2022', '1-1-2022' , '1-4-2022'],
'Month': ['December-2021', 'January-2022', 'March-2022','March-2022', 'January-2022', 'April-2022'],
'Product': ['grocery','electronics','personal-care','grocery','electronics','personal-care'],
'status': ['purchased', 'refunded', 'refunded','refunded', 'purchased', 'refunded']
})
df3
gives:
Customer Date Month Product status
0 Sara 15-12-2021 December-2021 grocery purchased
1 John 1-1-2022 January-2022 electronics refunded
2 Didi 1-3-2022 March-2022 personal-care refunded
3 Sara 15-3-2022 March-2022 grocery refunded
4 Didi 1-1-2022 January-2022 electronics purchased
5 Didi 1-4-2022 April-2022 personal-care refunded
I am trying to group by customer, product & month and get the first status and then I want the groupby sorted by Month column:
df3.sort_values('Month').groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()
I got:
Customer Product Month Date status
0 Didi electronics January-2022 1-1-2022 purchased
1 Didi personal-care April-2022 1-4-2022 refunded
2 Didi personal-care March-2022 1-3-2022 refunded
3 John electronics January-2022 1-1-2022 refunded
4 Sara grocery December-2021 15-12-2021 purchased
5 Sara grocery March-2022 15-3-2022 refunded
I expected for index 1 & 2
to be reversed in order, March before April so what I tried to do is:
months = {'December-2021':0,'January-2022':1,'February-2022':2,'March-2022':3,'April-2022':4,'May-2022':5,'June-2022':6,'July-2022':7,'August-2022':8,'September-2022':9,'October-2022':10,'November-2022':11}
then map this through sort values:
df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()
But I got the same exact results without the correct order
CodePudding user response:
df3['Month'] = pd.to_datetime(df3['Month'], infer_datetime_format=True)
df3 = df3.sort_values(by=["Month"],ascending=False).groupby(
['Customer','Product','Month','Date']).agg({
'status':'first'}).reset_index()
df3['Month'] = df3['Month'].dt.strftime('%B-%Y')
df3
Your desired output:
Customer Product Month Date status
0 Didi electronics January-2022 1-1-2022 purchased
1 Didi personal-care March-2022 1-3-2022 refunded
2 Didi personal-care April-2022 1-4-2022 refunded
3 John electronics January-2022 1-1-2022 refunded
4 Sara grocery December-2021 15-12-2021 purchased
5 Sara grocery March-2022 15-3-2022 refunded
CodePudding user response:
You are currently sorting by string, so April is before March.
You need to convert to datetime for sorting, here using a custom key in the form YYYY-MM.
Also, groupby
sorts the groups by default so you need to add sort=False
to prevent a new sorting after aggregation.
(df3.assign(key=pd.to_datetime(df3['Date'], dayfirst=True).dt.strftime('%Y%M'))
.sort_values(by='key')
.groupby(['Customer','Product','Month','Date'], sort=False)
.agg({'status':'first'}).reset_index()
)
output:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 John electronics January-2022 1-1-2022 refunded
2 Didi personal-care March-2022 1-3-2022 refunded
3 Sara grocery March-2022 15-3-2022 refunded
4 Didi electronics January-2022 1-1-2022 purchased
5 Didi personal-care April-2022 1-4-2022 refunded
CodePudding user response:
You may need turn sort = False
df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date'],sort=False).agg({'status':'first'}).reset_index()
Out[546]:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 John electronics January-2022 1-1-2022 refunded
2 Didi electronics January-2022 1-1-2022 purchased
3 Didi personal-care March-2022 1-3-2022 refunded
4 Sara grocery March-2022 15-3-2022 refunded
5 Didi personal-care April-2022 1-4-2022 refunded
CodePudding user response:
The problem is that it is sorting a string and April
is before March
. You have to convert the string into a date first and then sort your entries. For example like this:
# Convert column Month to datetime
df3['Month'] = pd.to_datetime(df3['Month'], format='%B-%Y')
# Do your groupby
df_group = df3.sort_values('Month').groupby(['Customer','Product','Month','Date'], sort=False).first().reset_index()
# Convert column Month back to string
df_group['Month'] = df_group['Month'].dt.strftime('%B-%Y')
df_group
Output:
Customer Product Month Date status
0 Sara grocery December-2021 15-12-2021 purchased
1 Didi electronics January-2022 1-1-2022 purchased
2 John electronics January-2022 1-1-2022 refunded
3 Didi personal-care March-2022 1-3-2022 refunded
4 Sara grocery March-2022 15-3-2022 refunded
5 Didi personal-care April-2022 1-4-2022 refunded