I have two pandas DataFrames, both containing the same categories but different 'id' columns. In order to illustrate, the first table looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']
Output:
id category weight weight_sum p
0 1 a 4 14 0.285714
1 2 a 4 14 0.285714
2 3 a 2 14 0.142857
3 4 a 4 14 0.285714
4 5 b 4 8 0.500000
5 6 b 4 8 0.500000
6 7 c 3 15 0.200000
7 8 c 4 15 0.266667
8 9 c 2 15 0.133333
9 10 c 4 15 0.266667
10 11 c 2 15 0.133333
The second contains only 'id' and 'category'.
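For example, it could look something like this (purely illustrative, only the structure matters):
# illustrative only - the real second table comes from my data,
# it just needs an 'id' and a 'category' column
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'category': ['a', 'b', 'c']
})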
What I'm trying to do is create a third DataFrame that inherits the id of the second DataFrame, plus three new columns for ids taken from the first DataFrame - each of them selected based on the p column, which represents its weight within that category.
I've tried multiple solutions and was thinking of applying np.random.choice and .apply(), but couldn't figure out a way to make that work.
EDIT:
The desired output would be something like:
user_id id_1 id_2 id_3
0 2 3 1 2
1 3 2 2 3
2 4 1 3 1
With each id selected based on its probability and respective category (both DataFrames have this column), and the same id not showing up more than once for the same user_id.
CodePudding user response:
IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper dataframe (dfg) and use apply:
# second DataFrame: only 'id' and 'category'
df2 = pd.DataFrame({
    'id': np.random.randint(1, 12, size=11),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})

# helper: one row per category, every column collapsed into a list
dfg = df.groupby('category').agg(list)

# for each row of df2, draw 3 ids from the matching category,
# weighted by that category's 'p' values
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3)))
               .add_prefix('id_')
               )
Output:
id category id_0 id_1 id_2
0 11 a 2 3 3
1 10 a 2 3 1
2 4 a 1 2 3
3 7 a 2 1 4
4 5 b 6 5 5
5 10 b 6 5 6
6 8 c 9 8 8
7 11 c 7 8 7
8 11 c 10 8 8
9 4 c 9 10 10
10 1 c 11 11 9
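If you also need the three ids to be distinct within one row (as in your desired output), a possible variant is to pass replace=False to np.random.choice. Note this only works when the category has at least three candidate ids (category b here only has two), otherwise numpy raises an error:
# variant: sample without replacement so the 3 ids per row are distinct
# (requires each sampled category to contain at least 3 ids)
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3,
                                                           replace=False)))
               .add_prefix('id_')
               )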