Assume, there are two pandas DataFrame: df1 & df2. The df1 is a square data frame such as following
import numpy as np
import pandas as pd
item_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (9,9)),2)
df1 = pd.DataFrame(nums, index = item_names, columns = item_names)
df1 output:
2 7 9 10 11 13 14 21 24
2 0.70 0.29 0.23 0.55 0.72 0.42 0.98 0.68 0.48
7 0.39 0.34 0.73 0.44 0.06 0.40 0.74 0.18 0.18
9 0.53 0.53 0.63 0.85 0.72 0.61 0.72 0.32 0.36
10 0.23 0.29 0.63 0.09 0.43 0.43 0.49 0.43 0.31
11 0.43 0.89 0.94 0.50 0.62 0.12 0.32 0.41 0.87
13 0.25 0.48 0.99 0.52 0.61 0.12 0.83 0.60 0.55
14 0.34 0.30 0.42 0.68 0.88 0.51 0.67 0.59 0.62
21 0.67 0.84 0.08 0.76 0.24 0.19 0.57 0.10 0.89
24 0.63 0.72 0.02 0.59 0.56 0.16 0.15 0.70 0.32
The df2 stores item and its corresponding group information such as
df2 = pd.DataFrame({'item':item_names,
'group':['a1','a1','a1','a2',
'a2','a2','a2','a3','a3']})
df2 output:
item group
0 2 a1
1 7 a1
2 9 a1
3 10 a2
4 11 a2
5 13 a2
6 14 a2
7 21 a3
8 24 a3
The goal is to select top N items in a specific row (item name) based on the corresponding values (largest) using these two DataFrames information. However, the returned top N items and query item ALL MUST from 'different groups'. Such as
A query item (item = 10) is in the 4th row of df1 (item = 10). The top 2 returned items will be [9, 21]
not [9, 14]
. Since, item 10 is from group = a2 and any of returned items (top N) should not from a2 group. I have checked Scott Boston solution for a similar problem but it can't avoid the top N items and query item are from same group. Any suggestion? many thanks
CodePudding user response:
IIUC, you want to select the N largest values excluding the values from the same group.
Here is a function that does this:
def get_top_N(idx, N=2):
group = df2.set_index('item')['group']
incl = group[group.ne(group[idx])].index
return df1.loc[idx, incl].nlargest(2).index.to_list()
get_top_N(10)
# [9, 21]
If you additionally want to ensure that all values are from different groups (this was unclear if a requirement, as this is the case for your example). You can additionally do:
def get_top_N_diff(idx, N=2):
group = df2.set_index('item')['group']
incl = group[group.ne(group[idx])].index
s = df1.loc[idx, incl]
return s.sort_values(ascending=False).groupby(group).idxmax().to_list()[:N]
get_top_N(11) # same group
# [9, 7]
get_top_N_diff(11) # different groups
# [9, 24]
CodePudding user response:
A modification on the answer you mentioned:
def get_top(df1, df2, item_name, number_items):
val = df1.loc[[item_name]].T.merge(df2, left_index=True, right_on = 'item')
val = val[val['group']!=val.loc[val['item']==item_name, 'group'].values[0]]
return (val.sort_values(item_name, ascending=False)
.groupby('group')
.head(1)
.head(number_items)['item']
.to_numpy())
>>> get_top(df1, df2, 10, 2)
array([ 9, 21])
CodePudding user response:
Not sure exactly what you wanted... but this might point you in a direction:
import pandas as pd
import numpy as np
s2 = df2.set_index('item').group
gdf1 = df1.set_index(s2, append=True).swaplevel(0, 1)
mask = np.equal.outer(df1.index.map(s2.get), df1.columns.map(s2.get))
stacked = df1.mask(mask).stack().rename_axis(['x', 'y']).to_frame(name='v')
stacked.sort_values(['x', 'v'], ascending=[True, False]).groupby('x').head(2)
v
x y
2 14 0.98
11 0.72
7 14 0.74
10 0.44
9 10 0.85
11 0.72
10 9 0.63
21 0.43
11 9 0.94
7 0.89
13 9 0.99
21 0.60
14 24 0.62
21 0.59
21 7 0.84
10 0.76
24 7 0.72
2 0.63