Home > front end >  pandas group by & drop symmetric records and get top 2 in case of a tie and max of unique records
pandas group by & drop symmetric records and get top 2 in case of a tie and max of unique records

Time:09-26

Still cannot figure out how to achieve that or find a better approach:

I have this example df:

df = pd.DataFrame({'customer':[1,1,1,2,2,2,3,3,3],
                   'method':['Car','Bus','Train','Bus','Car','Train','Bus','Car','Train'],
                   'use':[6,6,6,4,3,1,4,4,1]})

gives:

   customer method  use
0         1    Car    6
1         1    Bus    6
2         1  Train    6
3         2    Bus    4
4         2    Car    3
5         2  Train    1
6         3    Bus    4
7         3    Car    4
8         3  Train    1

I have here 3 customers, the first one has all values equal for each method, the second one has 3 different values and the third one has two equal values and one unique value.

I am trying to know what is the most used method for each customer! so in case of customer 1 all are equal so I want to drop this customer as we cannot decide, for the second one it is obvious it will be Bus and for the third one I will pick up the top 2 values since they have equal values in such case Bus & Car.

Now I tried to drop duplicates based on a subset of customer and use columns to get rid of customers with all equal values within the df:

df.drop_duplicates(subset=['customer','use'], keep=False)

But this also dropped customer 3 on both indices 6 & 7 so this solution is not working.

I tried another one by calculating the std which will be zero in case of equal values in grouping:

df.merge(df.groupby('customer')['use'].std(), on=['customer'], how='left')

so I can now drop records with use_y = 0.000000 and this is the output df2:

  customer method  use_x     use_y
3         2    Bus      4  1.527525
4         2    Car      3  1.527525
5         2  Train      1  1.527525
6         3    Bus      4  1.732051
7         3    Car      4  1.732051
8         3  Train      1  1.732051

Now to pick the top 2 values in case of a tie and max value if no tie after grouping:

df2.sort_values('use_x', ascending=False).groupby('customer').head(2).sort_index()

I do not know how to get the max value in grouping of no tie and the nlargest(2) in case of a tie?

Desired output:

   customer method  use_x     use_y
3         2    Bus      4  1.527525
6         3    Bus      4  1.732051
7         3    Car      4  1.732051

CodePudding user response:

This is a solution:

import pandas as pd

df = pd.DataFrame({'customer':[1,1,1,2,2,2,3,3,3],
                   'method':['Car','Bus','Train','Bus','Car','Train','Bus','Car','Train'],
                   'use':[6,6,6,4,3,1,4,4,1]})
df = df.merge(df.groupby('customer')['use'].std(), on=['customer'], how='left')


# Drop `use_y` == 0.0
df = df[df.use_y != 0]
# select max or all max rows 
# df[df['use_x'] == df['use_x'].max()]

df[df.groupby(['customer','method'])['use_x'].transform('max') == df['use_x'].max()]

Output:

   customer method  use_x     use_y
3         2    Bus      4  1.527525
6         3    Bus      4  1.732051
7         3    Car      4  1.732051

CodePudding user response:

OK, this could be NOT the best solution but it does the job if someone would like to know it. I am also still looking for better and efficient ways to do that!

So, I got this dfs after dropping symmetric records per group:

   customer method  use_x     use_y
3         2    Bus      4  1.527525
4         2    Car      3  1.527525
5         2  Train      1  1.527525
6         3    Bus      4  1.732051
7         3    Car      4  1.732051
8         3  Train      1  1.732051

I have added two columns for ranking using pandas.DataFrame.rank:

df2['rank-min'] = df2.groupby("customer")['use_x'].rank(method='min')
df2['rank-max'] = df2.groupby("customer")['use_x'].rank(method='max')

Which gives:

   customer method  use_x     use_y  rank-min  rank-max
3         2    Bus      4  1.527525       3.0       3.0
4         2    Car      3  1.527525       2.0       2.0
5         2  Train      1  1.527525       1.0       1.0
6         3    Bus      4  1.732051       2.0       3.0
7         3    Car      4  1.732051       2.0       3.0
8         3  Train      1  1.732051       1.0       1.0

Here is the trick, if rank-min == rank-max means it is a unique record within the group if not, it means a repeated record within the group

So, I got the top 2 results per sorted group using head(2):

df3 = df2.sort_values('use_x', ascending=False).groupby('customer').head(2).sort_index()

To get df3:

   customer method  use_x     use_y  rank-min  rank-max
3         2    Bus      4  1.527525       3.0       3.0
4         2    Car      3  1.527525       2.0       2.0
6         3    Bus      4  1.732051       2.0       3.0
7         3    Car      4  1.732051       2.0       3.0

then split this df3 to ties df and non ties df

ties = df3[df3['rank-min'] != df3['rank-max']]
non_ties = df3[-(df3['rank-min'] != df3['rank-max'])]

Finally, got the idxmax() from non_ties:

max_unique = non_ties.loc[non_ties.groupby('customer')['use_x'].idxmax()]

and concat:

output = pd.concat([ties, max_unique])

Final output is as expected per group:

   customer method  use_x     use_y  rank-min  rank-max
6         3    Bus      4  1.732051       2.0       3.0
7         3    Car      4  1.732051       2.0       3.0
3         2    Bus      4  1.527525       3.0       3.0

I know there could be more efficient & elegant ways, but this is my best based on my knowledge!

  • Related