Assume, there are two DataFrame: visitor & group. visitor stores each visitor information and which item s/he selected (likelihood values). However, not every item has been purchased by all visitors. group stores the certain items belong to which item-family information. The following are the toy DataFrames
import numpy as np
import pandas as pd
items = [11,12,13,14,
21,22,23,24,
2,7,9,10]
col_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (3,9)),2)
visitor = pd.DataFrame(nums, index = (100,101,102))
visitor.columns = col_names
group = pd.DataFrame({'item':sorted(items),
'family':sorted(['a1','a2','a3']*4)})
print(visitor)
2 7 9 10 11 13 14 21 24
100 0.70 0.29 0.23 0.55 0.72 0.42 0.98 0.68 0.48
101 0.39 0.34 0.73 0.44 0.06 0.40 0.74 0.18 0.18
102 0.53 0.53 0.63 0.85 0.72 0.61 0.72 0.32 0.36
print(group)
item family
0 2 a1
1 7 a1
2 9 a1
3 10 a1
4 11 a2
5 12 a2
6 13 a2
7 14 a2
8 21 a3
9 22 a3
10 23 a3
11 24 a3
The goal is to select top 2 items that are from DIFFERENT item-family based on the values. This is my code
def Basket(df, x, num_items = 2):
keys = list(df)
values = df.loc[x]
item_dict = dict([(i, j) for i, j in zip(keys, values)])
output = list(dict(sorted(item_dict.items(), key=lambda kv: kv[1], reverse = True)))[:num_items]
return output
print(Basket(df = visitor, dx = 100))
[14, 11] # 14 & 11 from the same family: a2
print(Basket(df = visitor, x = 101))
[14, 9] # 14 & 9 from different families: a2 & a1
I am not sure how to incorporate the group df into my code to select top 2 items (based on the values and item-family information) from different family such as
print(Basket(df1 = visitor, df2 = group, x = 100))
[14, 2]
print(Basket(df1 = visitor, df2 = group, x = 101))
[14, 9]
Note: 100, 101, and 102 represent visitor id (row index). any suggestion? many thanks in advance
CodePudding user response:
You can merge your 2 dataframes before:
out = visitor.rename_axis('visitor').melt(var_name='item', ignore_index=False) \
.reset_index().merge(group, on='item')
out = out.loc[out.groupby(['visitor', 'family'])['value'].nlargest(2).index.levels[-1]] \
.sort_values(['visitor', 'family', 'value'], ascending=[True, True, False], ignore_index=True)
Output:
>>> out
visitor item value family
0 100 2 0.70 a1
1 100 10 0.55 a1
2 100 14 0.98 a2
3 100 11 0.72 a2
4 100 21 0.68 a3
5 100 24 0.48 a3
6 101 9 0.73 a1
7 101 10 0.44 a1
8 101 14 0.74 a2
9 101 13 0.40 a2
10 101 21 0.18 a3
11 101 24 0.18 a3
12 102 10 0.85 a1
13 102 9 0.63 a1
14 102 11 0.72 a2
15 102 14 0.72 a2
16 102 24 0.36 a3
17 102 21 0.32 a3
CodePudding user response:
Try:
def basket(visitor, x, number_items=2):
return visitor.loc[[x]].T.merge(group, left_index=True, right_on='item')\
.sort_values(x, ascending=False).groupby('family')\
.head(1).head(number_items)['item'].to_numpy()
Output:
basket(visitor, 100, 2)
# array([14, 2], dtype=int64)
basket(visitor, 101, 2)
# array([14, 9], dtype=int64)