Home > database >  How can I select top k rows based on another dataframe in python?
How can I select top k rows based on another dataframe in python?

Time:11-17

I have data as follows. Users are 1001 to 1004 (but actual data has one million users). Each user has corresponding probabilities for the variables AT1 to AT6.

user   AT1     AT2     AT3     AT4    AT5    AT6 
 1001  0.004   0.003   0.03    0.01   0.5    0.453
 1002  0.2     0.1     0.3     0.1    0.1    0.2    
 1003  0.07    0.13    0.22    0.3    0.08   0.2 
 1004  0.01    0.23    0.43    0.15   0.04   0.14

I would like to select the top 3 users for each choice based on the following data.

client   choice_1 choice_2
997       AT2    AT3
223       AT6    AT5
444       AT1    AT4
121       AT1    AT5

In the output, top1 to top3 are the top 3 users based on probability for choice_1 while top4 to top6 are for choice_2. client id is not computed but given. The topN are also not computed but given as top 3 for each choice. The output should look like this:

client top1   top2    top3  top4    top5    top6   
997    1004   1003    1002   1004   1002     1003     
223    1001   1002    1003   1001   1002     1003
444    1002   1003    1004   1004   1003     1002 
121    1002   1003    1004   1001   1002     1003

How can I construct the last dataframe in python?

CodePudding user response:

I have no idea how this will scale to a million rows, but have a go with this dictionary comprehension:

# Set up test df's and re-index.
df_user = pd.DataFrame({
    "user":[1001,1002,1003,1004],
    "AT2" :[0.003, 0.1, 0.13, 0.23],
    "AT3" :[0.03, 0.3, 0.22, 0.43],
    "AT5" :[0.5, 0.1, 0.08, 0.04],
    "AT6" :[0.453,0.2,0.2,0.14]
})
df_user.set_index("user", inplace=True)
df_client = pd.DataFrame({
    "client":[997, 223],
    "choice_1":["AT2","AT6"],
    "choice_2":["AT3", "AT5"]
})

# dictionary comprehension
pd.DataFrame({row["client"]:np.append(df_user[row["choice_1"]].nlargest(3).index.values,
                                      df_user[row["choice_2"]].nlargest(3).index.values)
              for (i, row) in df_client.iterrows()}).T

Output (you still have to rename the columns, obviously):

enter image description here

Short explanation: run the following code to see that the iterables in df.iterrows() are tuples of (a) the index of the dataframe (b) the columns.

for it in df_client.iterrows():
    print(it)

Once you've run that last snippet, it will contain the last row of df_client, so set row = it[1] to experiment with the various bits of information that you can extract from this. In particular, row["choice_1"] gives you something like "AT1", from which you can extract the corresponding column from df_user, upon which you can use the pandas nlargest function. The dictionary comprehension follows trivially once you've put together all the bits.

  • Related