I am trying to identify the most probable event (transportation) of customers. This is an example df:
df = pd.DataFrame({'customer id':[1,1,1,2,2,2,2,2,2,2,3,3],
'Trans':['Car','Car','Bus','Bus','Bus','Bus','Car','Car','Car','Plane','Car','Bus']})
gives:
customer id Trans
0 1 Car
1 1 Car
2 1 Bus
3 2 Bus
4 2 Bus
5 2 Bus
6 2 Car
7 2 Car
8 2 Car
9 2 Plane
10 3 Car
11 3 Bus
I did the following steps to get the largest probable values for each customer id
:
#Get the count & percent of each customer id transportation
df2 = df.groupby(['customer id','Trans'])['Trans'].size().reset_index(name='Trans Counts')
df2['percent'] = df2.groupby("customer id")['Trans Counts'].transform(lambda x: (x / x.sum()).round(2))
#Get records > 1
df3 = df2[df2['Trans Counts'] > 1]
So I got this df:
customer id Trans Trans Counts percent
1 1 Car 2 0.67
2 2 Bus 3 0.43
3 2 Car 3 0.43
There is a tie for customer id
2, so when I use idxmax()
:
df3.loc[df3.groupby('customer id')['Trans Counts'].idxmax()]
It only shows the first row:
customer id Trans Trans Counts percent
1 1 Car 2 0.67
2 2 Bus 3 0.43
How can I get the top 2 records within the same group in case of a tie? I also used try nlargest(2)
but I got the results for the whole df not within the grouping & if I used it within the aggregate function, it does not show the expected output!
Expected output in case of a tie:
customer id Trans Trans Counts percent
1 1 Car 2 0.67
2 2 **Bus, Car** 3 0.43
Thanks
CodePudding user response:
You could create a "rank" column:
df3["rank"] = df3.groupby("customer id")["Trans Counts"].rank(method="min")
df3
# Out[83]:
# customer id Trans Trans Counts percent rank
# 1 1 Car 2 0.67 1.0
# 2 2 Bus 3 0.43 1.0
# 3 2 Car 3 0.43 1.0
Using method="min"
, the lowest rank is used when values are the same (in this case 1, rather than 2 (max) or 1.5 (average)). Then you could specify df3[df3["rank"] == 1]
for those that are most frequent.
You could most likely combine several all these to not use groupby
several times within the code to improve efficiency.
CodePudding user response:
This is how you can assign the new column based on the maximum transportation type
In [144]: df2 = pd.get_dummies(df).groupby("customer id").sum()
# Trans_Bus Trans_Car Trans_Plane
#customer id
#1 1 2 0
#2 3 3 1
#3 1 1 0
In [145]: df2['max_cols'] = [ ",".join([col for col in df2 if df2.lo
...: c[i,col]==mx]) for i,mx in df2.max(axis=1).items()]
In [146]: df2
Out[146]:
Trans_Bus Trans_Car Trans_Plane max_cols
customer id
1 1 2 0 Trans_Car
2 3 3 1 Trans_Bus,Trans_Car
3 1 1 0 Trans_Bus,Trans_Car
More or less the same approach, but adding percentage and max count as columns:
In [4]: df2 = pd.get_dummies(df).groupby("customer id").sum()
In [5]: trans_count = df2.max(axis=1)
In [6]: trans_perc = trans_count / df2.sum(axis=1)
In [7]: trans_type = [ ",".join([col for col in df2 if df2.loc[i,col]=
...: =mx]) for i,mx in trans_count.items()]
In [8]: pd.DataFrame({"count":trans_count, "perc":trans_perc, "kind":t
...: rans_type}, index=df2.index)
Out[8]:
count perc kind
customer id
1 2 0.666667 Trans_Car
2 3 0.428571 Trans_Bus,Trans_Car
3 1 0.500000 Trans_Bus,Trans_Car
Note, you can call reset_index()
on the resulting dataframe to move customer id
from index to column