Is there a way to groupby a column and get the top 5 columns by value (ascending or descending)?


I have a dataframe as given below. My objective is that, for each pf_customer_id, I need the top 5 columns by value (from highest to lowest). I tried using groupby and a transpose, but filtering down to the 5 highest columns is the challenge.

{'A': {0: 0.0, 1: 1916.714416, 2: 4022.357547, 3: 6275.28689, 4: 582.863581},
 'B': {0: 1916.714416, 1: 0.0, 2: 2105.715414, 3: 4358.78689, 4: 2499.358343},
 'C': {0: 4022.357547, 1: 2105.715414, 2: 0.0, 3: 2253.29093, 4: 4604.71567},
 'D': {0: 6275.28689, 1: 4358.78689, 2: 2253.29093, 3: 0.0, 4: 6857.928954},
 'E': {0: 582.863581, 1: 2499.358343, 2: 4604.71567, 3: 6857.928954, 4: 0.0},
 'F': {0: 970.716604,
  1: 2887.214892,
  2: 4992.857468,
  3: 7245.929486,
  4: 388.227816},
 'G': {0: 1235.716107,
  1: 3152.215475,
  2: 5257.785928,
  3: 7510.859423,
  4: 653.301411},
 'H': {0: 6508.144182,
  1: 4591.6443,
  2: 2486.290843,
  3: 233.217505,
  4: 7090.928801},
 'I': {0: 3256.328289,
  1: 1340.025712,
  2: 766.557835,
  3: 3019.314118,
  4: 3838.883097},
 'J': {0: 2698.412148,
  1: 782.275937,
  2: 1324.562498,
  3: 3577.356653,
  4: 3281.013733},
 'K': {0: 3147.725445,
  1: 1231.310389,
  2: 874.905741,
  3: 3127.932009,
  4: 3730.362069},
 'L': {0: 3116.819967,
  1: 1200.43964,
  2: 905.919746,
  3: 3158.947766,
  4: 3699.30572},
 'M': {0: 1904.334099,
  1: 18.283597,
  2: 2118.335643,
  3: 4371.368639,
  4: 2486.953355},
 'N': {0: 2353.462836,
  1: 436.955498,
  2: 1669.340686,
  3: 3922.224931,
  4: 2935.946158},
 'O': {0: 2913.365253,
  1: 996.876837,
  2: 1109.455848,
  3: 3362.429984,
  4: 3495.788897},
 'P': {0: 2157.443161,
  1: 242.20937,
  2: 1865.553245,
  3: 4118.418641,
  4: 2739.958665},
 'pf_customer_id': {0: 60000, 1: 60001, 2: 60002, 3: 60003, 4: 60004}}
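For anyone who wants to reproduce this, the dictionary above can be loaded straight into a DataFrame (a minimal sketch; `data` is just my name for the pasted dictionary):

import pandas as pd

# `data` is assumed to hold the dictionary pasted above
df = pd.DataFrame(data)
print(df.shape)   # (5, 17): columns A-P plus pf_customer_id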

The expected output is shown below; I'm only including two customer IDs to keep it easy to follow:

[expected output image]

CodePudding user response:

IIUC, you can use:

(df.set_index('pf_customer_id').stack()   # long format: one value per (customer, column) pair
   .groupby(level=0)                      # group by pf_customer_id
   .nlargest(5)                           # keep the 5 largest values per customer
   .droplevel(0)                          # drop the duplicated group-key level
   .reset_index(name='value')
 )

Output:

    pf_customer_id level_1        value
0            60000       H  6508.144182
1            60000       D  6275.286890
2            60000       C  4022.357547
3            60000       I  3256.328289
4            60000       K  3147.725445
5            60001       H  4591.644300
6            60001       D  4358.786890
7            60001       G  3152.215475
8            60001       F  2887.214892
9            60001       E  2499.358343
...

CodePudding user response:

If you are looking to dynamically get the top 5 rows per group, you can use cumcount():

import pandas as pd

df = pd.DataFrame({
    'Column1' : ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Column2' : [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60]
})

# sort descending so cumcount() ranks rows from largest to smallest within each group;
# flip ascending to True if you want the 5 smallest instead
df.loc[df.sort_values('Column2', ascending=False).groupby(['Column1'])['Column2'].cumcount() + 1 <= 5]
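Applied to the frame in the question, the same cumcount() pattern looks roughly like this (a sketch; df_cust stands for a DataFrame built from the posted dictionary, and long/rank/top5 are names I made up):

df_cust = pd.DataFrame(data)                              # `data` = the dictionary from the question
long = df_cust.melt(id_vars='pf_customer_id')             # one row per (customer, column) pair
rank = (long.sort_values('value', ascending=False)        # largest value first within each customer
            .groupby('pf_customer_id')
            .cumcount())                                   # 0 for the largest, 1 for the next, ...
top5 = long.loc[rank + 1 <= 5]                             # keep the 5 largest values per customer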

CodePudding user response:

You can simply grab the column names and argsort each row:

import numpy as np
import pandas as pd

df2 = df.set_index('pf_customer_id')

col_names = np.array(list(df2))   # array of column names A..P

for i in df2.index.values:
    row = df2.loc[i]
    ordered = col_names[np.argsort(row)[::-1]]   # column names sorted by value, largest first
    top5 = ordered[:5]
    print("cust id %d: top 5=%s" % (i, ", ".join(top5)))


#cust id 60000: top 5=H, D, C, I, K
#cust id 60001: top 5=H, D, G, F, E
#cust id 60002: top 5=G, F, E, A, H
#cust id 60003: top 5=G, F, E, A, M
#cust id 60004: top 5=H, D, C, I, K

For your specific output, you can instead use the melt method:

df_melt = df2.melt(ignore_index=False).reset_index(drop=False)
#    pf_customer_id variable        value
#0            60000        A     0.000000
#1            60001        A  1916.714416
#2            60002        A  4022.357547
#3            60003        A  6275.286890
#4            60004        A   582.863581
#..             ...      ...          ...
#75           60000        P  2157.443161
#76           60001        P   242.209370
#77           60002        P  1865.553245
#78           60003        P  4118.418641
#79           60004        P  2739.958665

and grab the top 5 per customer by sorting each group:

df_out = []
for i in df_melt.pf_customer_id.unique():
    df_id = df_melt.query("pf_customer_id==%d" % i)
    df_top5 = df_id.sort_values(by='value', ascending=False).iloc[:5]   # 5 largest values for this customer
    df_out.append(df_top5)
print(pd.concat(df_out).reset_index(drop=True))

#    pf_customer_id variable        value
#0            60000        H  6508.144182
#1            60000        D  6275.286890
#2            60000        C  4022.357547
#3            60000        I  3256.328289
#4            60000        K  3147.725445
#5            60001        H  4591.644300
#6            60001        D  4358.786890
#7            60001        G  3152.215475
#8            60001        F  2887.214892
#9            60001        E  2499.358343
#10           60002        G  5257.785928
#11           60002        F  4992.857468
#12           60002        E  4604.715670
#13           60002        A  4022.357547
#14           60002        H  2486.290843
#15           60003        G  7510.859423
#16           60003        F  7245.929486
#17           60003        E  6857.928954
#18           60003        A  6275.286890
#19           60003        M  4371.368639
#20           60004        H  7090.928801
#21           60004        D  6857.928954
#22           60004        C  4604.715670
#23           60004        I  3838.883097
#24           60004        K  3730.362069
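The explicit loop can also be collapsed into a single sort_values/groupby().head() chain, which builds the same frame without accumulating a list (a sketch, not part of the original answer; df_melt is the melted frame from above):

df_top5 = (df_melt.sort_values('value', ascending=False)
                  .groupby('pf_customer_id')
                  .head(5)                      # first 5 rows of each sorted group = 5 largest values
                  .sort_values(['pf_customer_id', 'value'], ascending=[True, False])
                  .reset_index(drop=True))
print(df_top5)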
