Still cannot figure out how to achieve that or find a better approach:
I have this example df:
df = pd.DataFrame({'customer':[1,1,1,2,2,2,3,3,3],
'method':['Car','Bus','Train','Bus','Car','Train','Bus','Car','Train'],
'use':[6,6,6,4,3,1,4,4,1]})
gives:
customer method use
0 1 Car 6
1 1 Bus 6
2 1 Train 6
3 2 Bus 4
4 2 Car 3
5 2 Train 1
6 3 Bus 4
7 3 Car 4
8 3 Train 1
I have here 3 customers, the first one has all values equal for each method, the second one has 3 different values and the third one has two equal values and one unique value.
I am trying to know what is the most used method for each customer! so in case of customer 1 all are equal so I want to drop this customer as we cannot decide, for the second one it is obvious it will be Bus and for the third one I will pick up the top 2 values since they have equal values in such case Bus & Car.
Now I tried to drop duplicates based on a subset of customer
and use
columns to get rid of customers with all equal values within the df:
df.drop_duplicates(subset=['customer','use'], keep=False)
But this also dropped customer 3 on both indices 6 & 7 so this solution is not working.
I tried another one by calculating the std which will be zero in case of equal values in grouping:
df.merge(df.groupby('customer')['use'].std(), on=['customer'], how='left')
so I can now drop records with use_y = 0.000000
and this is the output df2:
customer method use_x use_y
3 2 Bus 4 1.527525
4 2 Car 3 1.527525
5 2 Train 1 1.527525
6 3 Bus 4 1.732051
7 3 Car 4 1.732051
8 3 Train 1 1.732051
Now to pick the top 2 values in case of a tie and max value if no tie after grouping:
df2.sort_values('use_x', ascending=False).groupby('customer').head(2).sort_index()
I do not know how to get the max value in grouping of no tie and the nlargest(2) in case of a tie?
Desired output:
customer method use_x use_y
3 2 Bus 4 1.527525
6 3 Bus 4 1.732051
7 3 Car 4 1.732051
CodePudding user response:
This is a solution:
import pandas as pd
df = pd.DataFrame({'customer':[1,1,1,2,2,2,3,3,3],
'method':['Car','Bus','Train','Bus','Car','Train','Bus','Car','Train'],
'use':[6,6,6,4,3,1,4,4,1]})
df = df.merge(df.groupby('customer')['use'].std(), on=['customer'], how='left')
# Drop `use_y` == 0.0
df = df[df.use_y != 0]
# select max or all max rows
# df[df['use_x'] == df['use_x'].max()]
df[df.groupby(['customer','method'])['use_x'].transform('max') == df['use_x'].max()]
Output:
customer method use_x use_y
3 2 Bus 4 1.527525
6 3 Bus 4 1.732051
7 3 Car 4 1.732051
CodePudding user response:
OK, this could be NOT the best solution but it does the job if someone would like to know it. I am also still looking for better and efficient ways to do that!
So, I got this dfs after dropping symmetric records per group:
customer method use_x use_y
3 2 Bus 4 1.527525
4 2 Car 3 1.527525
5 2 Train 1 1.527525
6 3 Bus 4 1.732051
7 3 Car 4 1.732051
8 3 Train 1 1.732051
I have added two columns for ranking using pandas.DataFrame.rank
:
df2['rank-min'] = df2.groupby("customer")['use_x'].rank(method='min')
df2['rank-max'] = df2.groupby("customer")['use_x'].rank(method='max')
Which gives:
customer method use_x use_y rank-min rank-max
3 2 Bus 4 1.527525 3.0 3.0
4 2 Car 3 1.527525 2.0 2.0
5 2 Train 1 1.527525 1.0 1.0
6 3 Bus 4 1.732051 2.0 3.0
7 3 Car 4 1.732051 2.0 3.0
8 3 Train 1 1.732051 1.0 1.0
Here is the trick, if rank-min == rank-max
means it is a unique record within the group if not, it means a repeated record within the group
So, I got the top 2 results per sorted group using head(2):
df3 = df2.sort_values('use_x', ascending=False).groupby('customer').head(2).sort_index()
To get df3:
customer method use_x use_y rank-min rank-max
3 2 Bus 4 1.527525 3.0 3.0
4 2 Car 3 1.527525 2.0 2.0
6 3 Bus 4 1.732051 2.0 3.0
7 3 Car 4 1.732051 2.0 3.0
then split this df3 to ties df and non ties df
ties = df3[df3['rank-min'] != df3['rank-max']]
non_ties = df3[-(df3['rank-min'] != df3['rank-max'])]
Finally, got the idxmax()
from non_ties:
max_unique = non_ties.loc[non_ties.groupby('customer')['use_x'].idxmax()]
and concat:
output = pd.concat([ties, max_unique])
Final output is as expected per group:
customer method use_x use_y rank-min rank-max
6 3 Bus 4 1.732051 2.0 3.0
7 3 Car 4 1.732051 2.0 3.0
3 2 Bus 4 1.527525 3.0 3.0
I know there could be more efficient & elegant ways, but this is my best based on my knowledge!