Return columns for top k values based on two separate dataframes-CodePudding

The data is as follows: df1 consist of users and their corresponding probabilities for A to D (these values are already given/computed; df2 consist of the relationship between the same users, for instance, user 1001 is closer to user 1004 and 1003 as shown in the first row of df2. Here are the dataframes:

df1 = pd.DataFrame({
        "user1":[1001,1002,1003,1004],
        "A" :[0.003, 0.1, 0.13, 0.23],
        "B" :[0.03, 0.3, 0.22, 0.43],
        "C" :[0.5, 0.1, 0.08, 0.04],
        "D" :[0.453,0.2,0.2,0.14]
})
df1.set_index("user1", inplace=True)


df2 = pd.DataFrame({
    "user2":[1001, 1002],
    "top1":["1004","1003"],
    "top2":["1003", "1002"]
})

If we print the two dataframes, we get the following:

print(df1)
print(df2)


        A     B     C      D
user1                          
1001   0.003  0.03  0.50  0.453
1002   0.100  0.30  0.10  0.200
1003   0.130  0.22  0.08  0.200
1004   0.230  0.43  0.04  0.140


   user2     top1     top2
0   1001     1004     1003
1   1002     1003     1002

What I am trying to do is to produce the following output:

user     1     2      3        4
1001     B     A      B        D
1002     B     D      B        D

The output simply says if I look at first row in df2, I take top1 and top2 users for user 1001 in that row and find the corresponding top two probability values in df2 and return their column. Thus, for user1001 is close to 1004 and 1003. So I look in df1, top 2 probabilities values for 1004 are 0.43 and 0.23 in B and A respectively, 1003 has top values in B and D, and that's how I get the first row in the output.

I have tried the following code but to no avail.

  df1 = df1.T


 df3 = pd.DataFrame({row["user2"]:np.append(df1[row["top1"]].nlargest(2).index.values,
                                              df1[row["top2"]].nlargest(2).index.values)
                      for (i, row) in df2.iterrows()}).T

CodePudding user response：

You can use an intermediate dataframe df1_prefs to simplify your task.

df1_prefs = pd.DataFrame(
    {
        user: df1.loc[user].sort_values(ascending=False).index 
        for user in df1.index
    }
).T[[0, 1]]

You get this dataframe :

Then, you can merge the rows of your targets index :

df3 = pd.DataFrame({
    df2.loc[i]['user2']: 
    df1_prefs.loc[int(df2.loc[i]['top1'])].append(df1_prefs.loc[int(df2.loc[i]['top2'])], ignore_index=True)
    for i in df2.index
}).T

Output:

      0  1  2  3
1001  B  A  B  D
1002  B  D  B  D

There are probably ways to make it simpler, but the idea works.

Also, be careful that your indexes do not always have the same type. In one case it is an int, in the other a string.

CodePudding user response：

You can craft a dataframe with the top2 per user, then merge:

df3 = (df1.stack()
          .groupby(level=0).nlargest(2)
          .reset_index(level=[1,2])
          .assign(col=lambda d: d.groupby(level='user1').cumcount() 1)
          .pivot(index='user1', columns='col', values='level_2')
      )

(df2.merge(df3, left_on='top1', right_index=True)
    .merge(df3.set_axis([3,4], axis=1), left_on='top2', right_index=True)
    .filter(regex='^(?!top)')
)

output:

   user2  1  2  3  4
0   1001  B  A  B  D
1   1002  B  D  B  D

CodePudding user response：

I cannot think of a very concise way - but here is something that works -

#We define a function that we can apply to each row of `df2`
def get_top_2(row):
    _from_top_1_column = sorted(df1.loc[int(row['top1']), :].to_dict().items(), key=lambda x: x[1], reverse=True)
    top_2_col_names_from_top_1_column = [colname for colname, value in _from_top_1_column][:2]
    _from_top_2_column = sorted(df1.loc[int(row['top2']), :].to_dict().items(), key=lambda x: x[1], reverse=True)
    top_2_col_names_from_top_2_column = [colname for colname, value in _from_top_2_column][:2]
    top_4 = top_2_col_names_from_top_1_column   top_2_col_names_from_top_2_column
    return '|'.join(top_4)

df2.apply(get_top_2, axis=1).str.split('|', expand=True)
   0  1  2  3
0  B  A  B  D
1  B  D  B  D

The convoluted part within that function is probably -

 _from_top_1_column = sorted(df1.loc[int(row['top1']), :].to_dict().items(), key=lambda x: x[1], reverse=True)

What it is doing is -

Pick the row corresponding to the 'top1' column from df1
Convert it to a dict (and then get the dict().items()) so that you get the row entries as (columnname, value)
Sort this list of tuples by the value - descending