How to select the top K items from two pandas DataFrames based on conditions?

Assume there are two DataFrames: visitor and group. visitor stores each visitor's information, namely a likelihood value for every item s/he selected. However, not every item has been purchased by all visitors. group stores which item-family each item belongs to. The following are the toy DataFrames:

import numpy as np 
import pandas as pd
items = [11,12,13,14, 
         21,22,23,24,
         2,7,9,10]
col_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (3,9)),2)

visitor = pd.DataFrame(nums, index = (100,101,102))
visitor.columns = col_names

group = pd.DataFrame({'item':sorted(items),
                      'family':sorted(['a1','a2','a3']*4)})

print(visitor)

       2     7     9     10    11    13    14    21    24
100  0.70  0.29  0.23  0.55  0.72  0.42  0.98  0.68  0.48
101  0.39  0.34  0.73  0.44  0.06  0.40  0.74  0.18  0.18
102  0.53  0.53  0.63  0.85  0.72  0.61  0.72  0.32  0.36

print(group)

    item family
0      2     a1
1      7     a1
2      9     a1
3     10     a1
4     11     a2
5     12     a2
6     13     a2
7     14     a2
8     21     a3
9     22     a3
10    23     a3
11    24     a3

The goal is to select the top 2 items, each from a DIFFERENT item-family, based on the values. This is my code:

def Basket(df, x, num_items = 2):
    keys = list(df)   
    values = df.loc[x]   
    item_dict = dict([(i, j) for i, j in zip(keys, values)])
    output = list(dict(sorted(item_dict.items(), key=lambda kv: kv[1], reverse = True)))[:num_items]
    return output

print(Basket(df = visitor, x = 100))
[14, 11]  # 14 & 11 from the same family: a2

print(Basket(df = visitor, x = 101))
[14, 9] # 14 & 9 from different families: a2 & a1

I am not sure how to incorporate the group DataFrame into my code so that it selects the top 2 items from different families (based on the values and the item-family information), such as:

print(Basket(df1 = visitor, df2 = group, x = 100))
[14, 2]

print(Basket(df1 = visitor, df2 = group, x = 101))
[14, 9]

Note: 100, 101, and 102 are visitor ids (the row index). Any suggestions? Many thanks in advance.

CodePudding user response:

You can merge the two DataFrames first, then keep the two largest values for each visitor and family:

out = visitor.rename_axis('visitor').melt(var_name='item', ignore_index=False) \
             .reset_index().merge(group, on='item')

out = out.loc[out.groupby(['visitor', 'family'])['value'].nlargest(2).index.levels[-1]] \
         .sort_values(['visitor', 'family', 'value'], ascending=[True, True, False], ignore_index=True)

Output:

>>> out
    visitor  item  value family
0       100     2   0.70     a1
1       100    10   0.55     a1
2       100    14   0.98     a2
3       100    11   0.72     a2
4       100    21   0.68     a3
5       100    24   0.48     a3
6       101     9   0.73     a1
7       101    10   0.44     a1
8       101    14   0.74     a2
9       101    13   0.40     a2
10      101    21   0.18     a3
11      101    24   0.18     a3
12      102    10   0.85     a1
13      102     9   0.63     a1
14      102    11   0.72     a2
15      102    14   0.72     a2
16      102    24   0.36     a3
17      102    21   0.32     a3
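
The table above keeps the two best items inside each family, so it still needs one more step to get one item per family. A possible follow-up is sketched below, assuming the goal is the best item per family and then the two best of those per visitor. The toy data is typed in from the printed tables, and the names merged, best and top2 are only illustrative.

```python
import pandas as pd

# Toy data copied from the tables printed in the question.
visitor = pd.DataFrame(
    [[0.70, 0.29, 0.23, 0.55, 0.72, 0.42, 0.98, 0.68, 0.48],
     [0.39, 0.34, 0.73, 0.44, 0.06, 0.40, 0.74, 0.18, 0.18],
     [0.53, 0.53, 0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36]],
    index=[100, 101, 102],
    columns=[2, 7, 9, 10, 11, 13, 14, 21, 24],
)
group = pd.DataFrame({
    'item': [2, 7, 9, 10, 11, 12, 13, 14, 21, 22, 23, 24],
    'family': ['a1'] * 4 + ['a2'] * 4 + ['a3'] * 4,
})

# Long format: one row per (visitor, item) with its family attached.
merged = (visitor.rename_axis('visitor')
                 .melt(var_name='item', ignore_index=False)
                 .reset_index()
                 .merge(group, on='item'))

# Sort by value, keep the best row per (visitor, family),
# then keep the two best of those rows per visitor.
best = (merged.sort_values('value', ascending=False, kind='stable')
              .drop_duplicates(['visitor', 'family'])
              .groupby('visitor')
              .head(2))

# Collapse to {visitor: [items ordered by value]}.
top2 = best.groupby('visitor')['item'].agg(list).to_dict()
print(top2[100])  # [14, 2]
print(top2[101])  # [14, 9]
```

Visitor 102 has a tie between items 11 and 14 (both 0.72, both in a2), so which of the two is reported depends on the sort order; add a tie-breaking column to sort_values if that matters.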

CodePudding user response:

Try:

def basket(visitor, x, number_items=2):
    return visitor.loc[[x]].T.merge(group, left_index=True, right_on='item')\
           .sort_values(x, ascending=False).groupby('family')\
           .head(1).head(number_items)['item'].to_numpy()

Output:

basket(visitor, 100, 2)
# array([14,  2], dtype=int64)

basket(visitor, 101, 2)
# array([14,  9], dtype=int64)
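
The function above reads group from the enclosing scope. If you prefer the call signature from the question, with both DataFrames passed in, something like the following might work. It is only a sketch: it swaps the merge for an item-to-family lookup, returns a plain list, and assumes every column of df1 has a row in df2 (items missing from df2 would map to NaN and be treated as one family).

```python
import pandas as pd

# Toy data copied from the tables printed in the question.
visitor = pd.DataFrame(
    [[0.70, 0.29, 0.23, 0.55, 0.72, 0.42, 0.98, 0.68, 0.48],
     [0.39, 0.34, 0.73, 0.44, 0.06, 0.40, 0.74, 0.18, 0.18],
     [0.53, 0.53, 0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36]],
    index=[100, 101, 102],
    columns=[2, 7, 9, 10, 11, 13, 14, 21, 24],
)
group = pd.DataFrame({
    'item': [2, 7, 9, 10, 11, 12, 13, 14, 21, 22, 23, 24],
    'family': ['a1'] * 4 + ['a2'] * 4 + ['a3'] * 4,
})

def basket(df1, df2, x, num_items=2):
    # Series used as a lookup table: item -> family.
    family = df2.set_index('item')['family']
    # Items of visitor x, highest value first.
    scores = df1.loc[x].sort_values(ascending=False, kind='stable')
    # True for the first (highest-valued) item seen in each family.
    first_in_family = ~scores.index.map(family).duplicated()
    return scores.index[first_in_family][:num_items].tolist()

print(basket(df1=visitor, df2=group, x=100))  # [14, 2]
print(basket(df1=visitor, df2=group, x=101))  # [14, 9]
```

Because each family contributes at most one item, asking for more items than there are families simply returns one item per family.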