Select top-N from two pandas DataFrames


Assume there are two pandas DataFrames, df1 and df2. df1 is a square DataFrame such as the following:

import numpy as np 
import pandas as pd
item_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (9,9)),2)
df1 = pd.DataFrame(nums, index = item_names, columns = item_names)

df1 output:

      2     7     9     10    11    13    14    21    24
2   0.70  0.29  0.23  0.55  0.72  0.42  0.98  0.68  0.48
7   0.39  0.34  0.73  0.44  0.06  0.40  0.74  0.18  0.18
9   0.53  0.53  0.63  0.85  0.72  0.61  0.72  0.32  0.36
10  0.23  0.29  0.63  0.09  0.43  0.43  0.49  0.43  0.31
11  0.43  0.89  0.94  0.50  0.62  0.12  0.32  0.41  0.87
13  0.25  0.48  0.99  0.52  0.61  0.12  0.83  0.60  0.55
14  0.34  0.30  0.42  0.68  0.88  0.51  0.67  0.59  0.62
21  0.67  0.84  0.08  0.76  0.24  0.19  0.57  0.10  0.89
24  0.63  0.72  0.02  0.59  0.56  0.16  0.15  0.70  0.32

df2 stores each item and its corresponding group, such as:

df2 = pd.DataFrame({'item':item_names,
                    'group':['a1','a1','a1','a2',
                             'a2','a2','a2','a3','a3']})

df2 output:

   item group
0     2    a1
1     7    a1
2     9    a1
3    10    a2
4    11    a2
5    13    a2
6    14    a2
7    21    a3
8    24    a3

The goal is to select the top N items in a specific row (the query item) based on the corresponding (largest) values, using the information in these two DataFrames. However, the returned top N items and the query item must ALL be from 'different groups'. For example:

The query item (item = 10) is in the 4th row of df1. The top 2 returned items should be [9, 21], not [9, 14], since item 10 is from group a2 and none of the returned (top N) items should be from group a2. I have checked Scott Boston's solution to a similar problem, but it cannot prevent the top N items and the query item from being in the same group. Any suggestions? Many thanks.
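To make the requirement concrete, here is a brute-force check of the expected answer for item 10. This snippet is only for illustration; the variable names (query, query_group, allowed) are mine, not part of the question.

# find the query item's group, keep only columns from other groups, take the 2 largest
query = 10
query_group = df2.loc[df2['item'] == query, 'group'].iloc[0]      # 'a2'
allowed = df2.loc[df2['group'] != query_group, 'item'].tolist()   # [2, 7, 9, 21, 24]
df1.loc[query, allowed].nlargest(2).index.tolist()
# [9, 21]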

CodePudding user response:

IIUC, you want to select the N largest values in a row, excluding the columns that belong to the query item's group.

Here is a function that does this:

def get_top_N(idx, N=2):
    # map each item to its group
    group = df2.set_index('item')['group']
    # keep only the items whose group differs from the query item's group
    incl = group[group.ne(group[idx])].index
    # return the N largest values among the remaining columns of the query row
    return df1.loc[idx, incl].nlargest(N).index.to_list()
    
get_top_N(10)
# [9, 21]

If you additionally want to ensure that the returned items are themselves all from different groups (it was unclear whether this is a requirement, since it happens to be the case in your example), you can do:

def get_top_N_diff(idx, N=2):
    # map each item to its group
    group = df2.set_index('item')['group']
    # keep only the items whose group differs from the query item's group
    incl = group[group.ne(group[idx])].index
    s = df1.loc[idx, incl]
    # keep the largest value per group (head(1) preserves the descending order),
    # then take the first N, i.e. the N largest values across distinct groups
    return s.sort_values(ascending=False).groupby(group).head(1).index.to_list()[:N]

get_top_N(11)      # returned items may share a group (9 and 7 are both in a1)
# [9, 7]

get_top_N_diff(11) # returned items are from different groups
# [9, 24]
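For the original query (item 10) the two functions agree, since the two largest eligible values already fall in different groups (this check is my addition, not part of the original answer):

get_top_N(10)      # [9, 21]
get_top_N_diff(10) # [9, 21]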

CodePudding user response:

A modification of the answer you mentioned:

def get_top(df1, df2, item_name, number_items):
    # one column of values for the query row, joined with the group information
    val = df1.loc[[item_name]].T.merge(df2, left_index=True, right_on='item')
    # drop every item that belongs to the query item's own group
    val = val[val['group'] != val.loc[val['item'] == item_name, 'group'].values[0]]
    return (val.sort_values(item_name, ascending=False)  # largest values first
                .groupby('group')
                .head(1)                                  # best item per group
                .head(number_items)['item']               # keep the top N of those
                .to_numpy())
>>> get_top(df1, df2, 10, 2)
array([ 9, 21])
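Like get_top_N_diff above, the groupby('group').head(1) step keeps at most one item per group before the top number_items are taken. As a quick check (my addition, hand-traced from the table above rather than executed, so please verify), querying item 11 should give:

>>> get_top(df1, df2, 11, 2)
array([ 9, 24])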

CodePudding user response:

Not sure exactly what you wanted... but this might point you in a direction:

import pandas as pd
import numpy as np

# item -> group lookup
s2 = df2.set_index('item').group
# df1 with a (group, item) row index (not used in what follows)
gdf1 = df1.set_index(s2, append=True).swaplevel(0, 1)

# True wherever a row item and a column item belong to the same group
mask = np.equal.outer(df1.index.map(s2.get), df1.columns.map(s2.get))
# hide same-group values, then stack to long form: (query item x, candidate item y, value v)
stacked = df1.mask(mask).stack().rename_axis(['x', 'y']).to_frame(name='v')

# top 2 candidates per query item, by value
stacked.sort_values(['x', 'v'], ascending=[True, False]).groupby('x').head(2)

          v
x  y       
2  14  0.98
   11  0.72
7  14  0.74
   10  0.44
9  10  0.85
   11  0.72
10 9   0.63
   21  0.43
11 9   0.94
   7   0.89
13 9   0.99
   21  0.60
14 24  0.62
   21  0.59
21 7   0.84
   10  0.76
24 7   0.72
   2   0.63
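
The table above answers the question for every item at once. To pull out a single query item, say item 10, one option (my addition, not in the original answer) is to slice the stacked frame:

# top 2 candidates for query item 10, from the long-form result
stacked.xs(10, level='x')['v'].nlargest(2).index.tolist()
# [9, 21]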