I have two pandas DataFrames, both containing the same categories but different 'id' columns. In order to illustrate, the first table looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']
Output:
id category weight weight_sum p
0 1 a 4 14 0.285714
1 2 a 4 14 0.285714
2 3 a 2 14 0.142857
3 4 a 4 14 0.285714
4 5 b 4 8 0.500000
5 6 b 4 8 0.500000
6 7 c 3 15 0.200000
7 8 c 4 15 0.266667
8 9 c 2 15 0.133333
9 10 c 4 15 0.266667
10 11 c 2 15 0.133333
The second contains only 'id' and 'category'.
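For example, it could look something like this (purely illustrative, only the structure matters):
# illustrative only - the real second table comes from my data,
# it just needs an 'id' and a 'category' column
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'category': ['a', 'b', 'c']
})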
What I'm trying to do is create a third DataFrame that inherits the id of the second DataFrame, plus three new columns for ids taken from the first DataFrame - each of them selected based on the p column, which represents its weight within that category.
I've tried multiple solutions and was thinking of applying np.random.choice and .apply(), but couldn't figure out a way to make that work.
EDIT:
The desired output would be something like:
user_id id_1 id_2 id_3
0 2 3 1 2
1 3 2 2 3
2 4 1 3 1
With each id selected based on its probability and respective category (both DataFrames have this column), and the same id not showing up more than once for the same user_id.
CodePudding user response:
IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper dataframe (dfg) and use apply:
# second DataFrame: only 'id' and 'category'
df2 = pd.DataFrame({
    'id': np.random.randint(1, 12, size=11),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})

# helper: one row per category, every column collapsed into a list
dfg = df.groupby('category').agg(list)

# for each row of df2, draw 3 ids from the matching category,
# weighted by that category's 'p' values
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3)))
               .add_prefix('id_')
               )
Output:
id category id_0 id_1 id_2
0 11 a 2 3 3
1 10 a 2 3 1
2 4 a 1 2 3
3 7 a 2 1 4
4 5 b 6 5 5
5 10 b 6 5 6
6 8 c 9 8 8
7 11 c 7 8 7
8 11 c 10 8 8
9 4 c 9 10 10
10 1 c 11 11 9
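If you also need the three ids to be distinct within one row (as in your desired output), a possible variant is to pass replace=False to np.random.choice. Note this only works when the category has at least three candidate ids (category b here only has two), otherwise numpy raises an error:
# variant: sample without replacement so the 3 ids per row are distinct
# (requires each sampled category to contain at least 3 ids)
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3,
                                                           replace=False)))
               .add_prefix('id_')
               )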