Home > Mobile >  Getting idxmax values in pandas groupby in case of equal values or ties
Getting idxmax values in pandas groupby in case of equal values or ties

Time:09-27

I am trying to identify the most probable event (transportation) of customers. This is an example df:

df = pd.DataFrame({'customer id':[1,1,1,2,2,2,2,2,2,2,3,3],
                   'Trans':['Car','Car','Bus','Bus','Bus','Bus','Car','Car','Car','Plane','Car','Bus']})

gives:

    customer id  Trans
0             1    Car
1             1    Car
2             1    Bus
3             2    Bus
4             2    Bus
5             2    Bus
6             2    Car
7             2    Car
8             2    Car
9             2  Plane
10            3    Car
11            3    Bus

I did the following steps to get the largest probable values for each customer id:

#Get the count & percent of each customer id transportation
df2 = df.groupby(['customer id','Trans'])['Trans'].size().reset_index(name='Trans Counts')
df2['percent'] = df2.groupby("customer id")['Trans Counts'].transform(lambda x: (x / x.sum()).round(2))

#Get records > 1
df3 = df2[df2['Trans Counts'] > 1]

So I got this df:

  customer id Trans  Trans Counts  percent
1            1   Car             2     0.67
2            2   Bus             3     0.43
3            2   Car             3     0.43

There is a tie for customer id 2, so when I use idxmax() :

df3.loc[df3.groupby('customer id')['Trans Counts'].idxmax()]

It only shows the first row:

   customer id Trans  Trans Counts  percent
1            1   Car             2     0.67
2            2   Bus             3     0.43

How can I get the top 2 records within the same group in case of a tie? I also used try nlargest(2) but I got the results for the whole df not within the grouping & if I used it within the aggregate function, it does not show the expected output!

Expected output in case of a tie:

   customer id Trans            Trans Counts  percent
1            1   Car                      2     0.67
2            2   **Bus, Car**             3     0.43

Thanks

CodePudding user response:

You could create a "rank" column:

df3["rank"] = df3.groupby("customer id")["Trans Counts"].rank(method="min")
df3
# Out[83]: 
#    customer id Trans  Trans Counts  percent  rank
# 1            1   Car             2     0.67   1.0
# 2            2   Bus             3     0.43   1.0
# 3            2   Car             3     0.43   1.0

Using method="min", the lowest rank is used when values are the same (in this case 1, rather than 2 (max) or 1.5 (average)). Then you could specify df3[df3["rank"] == 1] for those that are most frequent.

You could most likely combine several all these to not use groupby several times within the code to improve efficiency.

CodePudding user response:

This is how you can assign the new column based on the maximum transportation type

In [144]: df2 = pd.get_dummies(df).groupby("customer id").sum()
#             Trans_Bus  Trans_Car  Trans_Plane
#customer id                                   
#1                    1          2            0
#2                    3          3            1
#3                    1          1            0


In [145]: df2['max_cols'] = [ ",".join([col for col in df2 if df2.lo
     ...: c[i,col]==mx]) for i,mx in df2.max(axis=1).items()]

In [146]: df2
Out[146]: 
             Trans_Bus  Trans_Car  Trans_Plane             max_cols
customer id                                                        
1                    1          2            0            Trans_Car
2                    3          3            1  Trans_Bus,Trans_Car
3                    1          1            0  Trans_Bus,Trans_Car

More or less the same approach, but adding percentage and max count as columns:

In [4]: df2 = pd.get_dummies(df).groupby("customer id").sum()

In [5]: trans_count = df2.max(axis=1)

In [6]: trans_perc = trans_count / df2.sum(axis=1)

In [7]: trans_type = [ ",".join([col for col in df2 if df2.loc[i,col]=
   ...: =mx]) for i,mx in trans_count.items()]

In [8]: pd.DataFrame({"count":trans_count, "perc":trans_perc, "kind":t
   ...: rans_type}, index=df2.index)
Out[8]: 
             count      perc                 kind
customer id                                      
1                2  0.666667            Trans_Car
2                3  0.428571  Trans_Bus,Trans_Car
3                1  0.500000  Trans_Bus,Trans_Car

Note, you can call reset_index() on the resulting dataframe to move customer id from index to column

  • Related