I have two dataframes that look as follows:
import pandas as pd
import io
train_data="""input_example,user_id
example0.npy, jane
example1.npy, bob
example4.npy, alice
example5.npy, jane
example3.npy, bob
example2.npy, bob
"""
user_data="""user_data,user_id
data_jane0.npy, jane
data_jane1.npy, jane
data_bob0.npy, bob
data_bob1.npy, bob
data_alice0.npy, alice
data_alice1.npy, alice
data_alice2.npy, alice
"""
# skipinitialspace strips the space after each comma
train_df = pd.read_csv(io.StringIO(train_data), sep=",", skipinitialspace=True)
user_df = pd.read_csv(io.StringIO(user_data), sep=",", skipinitialspace=True)
Suppose that the train_df table is many thousands of entries long, i.e., there are 1000s of unique "exampleN.npy" files. I was wondering if there is a straightforward way to merge the train_df and user_df tables where each row of the joined table matches on the key user_id but the user_data value is subsampled from user_df.
Here is one example of a resulting dataframe (I'm sampling uniformly at random, so many different result dataframes are possible):
>>> result_df
input_example user_data user_id
0 example0.npy data_jane0.npy jane
1 example1.npy data_bob1.npy bob
2 example4.npy data_alice0.npy alice
3 example5.npy data_jane1.npy jane
4 example3.npy data_bob0.npy bob
5 example2.npy data_bob0.npy bob
That is, the user_data column is filled with a filename chosen at random from the rows of user_df with the corresponding user_id.
I know one could write this using some multi-line, for-loop, query-based approach, but is there a faster way using built-in Pandas functions, e.g., "sample", "merge", "join", or "combine"?
CodePudding user response:
I don't know of a way to sample during the merge itself, without first merging the two frames. This approach does avoid a multi-line for loop, though:
merged = (
    train_df.merge(user_df, on="user_id", how="left")
    .groupby("input_example", as_index=False)
    .apply(lambda x: x.sample(1))
    .reset_index(drop=True)
)
- merge the two together on "user_id", keeping only the keys that appear in the left frame
- group by "input_example", assuming these are all unique (otherwise one could group on both columns of train_df)
- take a sample of size 1 from each group
- reset the index
Sampling second, after the merge, means that rows with the same user_id will not necessarily receive the same user_data file (whereas sampling user_df first, one row per user, would give every output row with a given user_id the same file).
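For contrast, here is a minimal sketch of that sample-first variant, using the train_df and user_df from the question (one_per_user and merged_first are illustrative names; groupby(...).sample requires pandas 1.1+):
# sample one user_data file per user *before* merging:
# every train row for a given user then gets the same file
one_per_user = user_df.groupby("user_id").sample(n=1)
merged_first = train_df.merge(one_per_user, on="user_id", how="left")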
CodePudding user response:
You can sample by groups in user_df and then join that with train_df, e.g.,
# this samples by fraction, so each file is equally likely to be drawn
user_df = user_df.groupby("user_id").sample(frac=0.5, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
3 data_bob1.npy bob
0 data_jane0.npy jane
or
# this samples 2 rows per group, with replacement
user_df = user_df.groupby("user_id").sample(n=2, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
2 data_bob0.npy bob
2 data_bob0.npy bob
0 data_jane0.npy jane
1 data_jane1.npy jane
Then join the sampled frame with train_df:
pd.merge(train_df, user_df)
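Note that with two samples per group, the merge duplicates each train row once per sampled file. If the goal is exactly one row per training example, here is a sketch of the same idea with n=1, assuming the original user_df from the question (every row of a given user then shares that user's single draw; sampled and result_df are illustrative names):
sampled = user_df.groupby("user_id").sample(n=1)
result_df = pd.merge(train_df, sampled, on="user_id", how="left")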
CodePudding user response:
I think I figured out a solution myself. It's a one-liner, but conceptually it's the same as what @Rawson suggested. First, I do a left merge, which results in a table with many duplicates. Then I shuffle all the rows to give it randomness. Finally, I drop the duplicates. If I add sort_index(), the resulting table has the same ordering as the original table.
I'm able to use the random_state kwarg to switch up which user_data file is used. See here:
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=0).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
6 example4.npy alice data_alice2.npy
8 example5.npy jane data_jane1.npy
10 example3.npy bob data_bob1.npy
11 example2.npy bob data_bob0.npy
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=1).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
4 example4.npy alice data_alice0.npy
7 example5.npy jane data_jane0.npy
10 example3.npy bob data_bob1.npy
12 example2.npy bob data_bob1.npy
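For repeated draws, e.g. one per training epoch, the one-liner could be wrapped in a small helper. This is only a sketch (sample_user_data is a hypothetical name), and it assumes the train_df and user_df defined in the question:
def sample_user_data(train_df, user_df, seed=None):
    # after the left merge, each train row appears once per candidate
    # file; shuffling and keeping the first occurrence per example
    # picks one candidate uniformly at random for each row
    return (
        train_df.merge(user_df, on="user_id", how="left")
        .sample(frac=1, random_state=seed)   # shuffle all rows
        .drop_duplicates("input_example")    # keep one row per example
        .sort_index()                        # restore train_df's order
    )
Calling sample_user_data(train_df, user_df) with seed=None gives a fresh draw on every call.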