Home > Software design >  Sort Categorial values within groupby in pandas
Sort Categorial values within groupby in pandas

Time:03-05

I have this example df:

   df3 = pd.DataFrame({'Customer':['Sara','John','Didi','Sara','Didi' ,'Didi'],

                   'Date': ['15-12-2021', '1-1-2022' , '1-3-2022','15-3-2022', '1-1-2022' , '1-4-2022'],
                   'Month': ['December-2021', 'January-2022', 'March-2022','March-2022', 'January-2022', 'April-2022'],
                   'Product': ['grocery','electronics','personal-care','grocery','electronics','personal-care'],
                   'status': ['purchased', 'refunded', 'refunded','refunded', 'purchased', 'refunded']
                   })

df3

gives:

Customer    Date        Month            Product          status
0   Sara    15-12-2021  December-2021    grocery          purchased
1   John    1-1-2022    January-2022     electronics      refunded
2   Didi    1-3-2022    March-2022       personal-care    refunded
3   Sara    15-3-2022   March-2022       grocery          refunded
4   Didi    1-1-2022    January-2022     electronics      purchased
5   Didi    1-4-2022    April-2022       personal-care    refunded

I am trying to group by customer, product & month and get the first status and then I want the groupby sorted by Month column:

df3.sort_values('Month').groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()

I got:

    Customer    Product         Month             Date         status
0   Didi    electronics         January-2022      1-1-2022     purchased
1   Didi    personal-care       April-2022        1-4-2022     refunded
2   Didi    personal-care       March-2022        1-3-2022     refunded
3   John    electronics         January-2022      1-1-2022     refunded
4   Sara    grocery             December-2021     15-12-2021   purchased
5   Sara    grocery             March-2022    15-3-2022   refunded

I expected for index 1 & 2 to be reversed in order, March before April so what I tried to do is:

months = {'December-2021':0,'January-2022':1,'February-2022':2,'March-2022':3,'April-2022':4,'May-2022':5,'June-2022':6,'July-2022':7,'August-2022':8,'September-2022':9,'October-2022':10,'November-2022':11}

then map this through sort values:

df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date']).agg({'status':'first'}).reset_index()

But I got the same exact results without the correct order

CodePudding user response:

df3['Month'] = pd.to_datetime(df3['Month'], infer_datetime_format=True)
df3 = df3.sort_values(by=["Month"],ascending=False).groupby(
    ['Customer','Product','Month','Date']).agg({
    'status':'first'}).reset_index()
df3['Month'] = df3['Month'].dt.strftime('%B-%Y')

df3

Your desired output:

  Customer        Product          Month        Date     status
0     Didi    electronics   January-2022    1-1-2022  purchased
1     Didi  personal-care     March-2022    1-3-2022   refunded
2     Didi  personal-care     April-2022    1-4-2022   refunded
3     John    electronics   January-2022    1-1-2022   refunded
4     Sara        grocery  December-2021  15-12-2021  purchased
5     Sara        grocery     March-2022   15-3-2022   refunded

CodePudding user response:

You are currently sorting by string, so April is before March.

You need to convert to datetime for sorting, here using a custom key in the form YYYY-MM.

Also, groupby sorts the groups by default so you need to add sort=False to prevent a new sorting after aggregation.

(df3.assign(key=pd.to_datetime(df3['Date'], dayfirst=True).dt.strftime('%Y%M'))
    .sort_values(by='key')
    .groupby(['Customer','Product','Month','Date'], sort=False)
    .agg({'status':'first'}).reset_index()
)

output:

  Customer        Product          Month        Date     status
0     Sara        grocery  December-2021  15-12-2021  purchased
1     John    electronics   January-2022    1-1-2022   refunded
2     Didi  personal-care     March-2022    1-3-2022   refunded
3     Sara        grocery     March-2022   15-3-2022   refunded
4     Didi    electronics   January-2022    1-1-2022  purchased
5     Didi  personal-care     April-2022    1-4-2022   refunded

CodePudding user response:

You may need turn sort = False

df3.sort_values(by=['Month'], key=lambda x: x.map(months)).groupby(['Customer','Product','Month','Date'],sort=False).agg({'status':'first'}).reset_index()
Out[546]: 
  Customer        Product          Month        Date     status
0     Sara        grocery  December-2021  15-12-2021  purchased
1     John    electronics   January-2022    1-1-2022   refunded
2     Didi    electronics   January-2022    1-1-2022  purchased
3     Didi  personal-care     March-2022    1-3-2022   refunded
4     Sara        grocery     March-2022   15-3-2022   refunded
5     Didi  personal-care     April-2022    1-4-2022   refunded

CodePudding user response:

The problem is that it is sorting a string and April is before March. You have to convert the string into a date first and then sort your entries. For example like this:

# Convert column Month to datetime
df3['Month'] = pd.to_datetime(df3['Month'], format='%B-%Y')

# Do your groupby
df_group = df3.sort_values('Month').groupby(['Customer','Product','Month','Date'], sort=False).first().reset_index()

# Convert column Month back to string
df_group['Month'] = df_group['Month'].dt.strftime('%B-%Y')
df_group

Output:

Customer    Product Month   Date    status
0   Sara    grocery December-2021   15-12-2021  purchased
1   Didi    electronics January-2022    1-1-2022    purchased
2   John    electronics January-2022    1-1-2022    refunded
3   Didi    personal-care   March-2022  1-3-2022    refunded
4   Sara    grocery March-2022  15-3-2022   refunded
5   Didi    personal-care   April-2022  1-4-2022    refunded
  • Related