Home > Enterprise >  Populate Pandas dataframe with random sample from another dataframe if condition is met, when column
Populate Pandas dataframe with random sample from another dataframe if condition is met, when column

Time:02-25

I have two DataFrames, df1 and df2. The information in df1 has to be used to populate cells in df2 if a specific condition is met. This is an example:

df1 = pd.DataFrame({"A":[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],"B":[1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],"C":[5, 3, 2,10,11,12, 4, 5, 7, 2, 7], "D":[0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6]})
df2 = pd.DataFrame({"A":[5, 5, 6, 6, 6], "B":[1, 2, 1, 2, 3], "C":np.nan, "D":np.nan})

The np.nan entries in df2 are meant to represent the cells that need to be populated. These are empty at the start of the process.

To populate df2, I need to use the values in the column df2['B']. Specifically, in this example, if the value of df2['B'] is equal to 1, then I need to get a random sample, with replacement, from df1[df1['B']==1], for both df1['C'] and df1['D']. Importantly, these values are not independent. Therefore, I need to draw a random row from the subset of rows of df1 where df1['B'] is equal to one. And then I need to do this for all rows in df2.

Doing df1[df1['B']==1][['C','D']].sample(replace = True) draws a random sample for one case when the value of df1['B'] is one, but

  1. How do I assign the corresponding values to df2?
  2. How do I do this for every row in df2?

I have tried several alternatives with loops, such as

for index, value in df2.iterrows():
    if df2.loc[index,'B'] == 1:
        temp_df = df1[df1['B'] == 1][['C','D']].sample(n = 1, replace = True)

    if df2.loc[index,'B'] == 2:
        temp_df = df1[df1['B'] == 2][['C','D']].sample(n = 1, replace = True)
        
    if df2.loc[index,'B'] == 3:
        temp_df = df1[df1['B'] == 3][['C','D']].sample(n = 1, replace = True)
        
    if df2.loc[index,'B'] == 4:
        temp_df = df1[df1['B'] == 4][['C','D']].sample(n = 1, replace = True)
        
        
    df2.loc[index, 'C'] = temp_df['C']
    df2.loc[index, 'D'] = temp_df['D']

but I get an error message saying

---> 15     df2.loc[index, 'C'] = temp_df['C']
     16     df2.loc[index, 'D'] = temp_df['D']
...
ValueError: Incompatible indexer with Series

where the ... denotes lines from the error message that I skipped.

CodePudding user response:

Here's one approach:

(i) get the sample sizes from df2 with groupby size.

(ii) use groupby apply where we use a lambda function to sample items from df1 with the sample sizes obtained from (i) for each unique "B".

(iii) assign these sampled values to df2 (since "B" is not unique, we sorted df2 by "B" to make the rows align)

cols = ['C','D']
sample_sizes = df2.groupby('B')[cols].size()

df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
             .groupby('B')[cols]
             .apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
             .droplevel(1).reset_index(drop=True))
df2 = df2.sort_index()

One sample:

   A  B   C    D
0  5  1   5  0.6
1  5  2  10  0.7
2  6  1  12  0.6
3  6  2  11  0.5
4  6  3   4  0.1
  • Related