I have two DataFrames, df1
and df2
. The information in df1
has to be used to populate cells in df2
if a specific condition is met. This is an example:
df1 = pd.DataFrame({"A":[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],"B":[1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],"C":[5, 3, 2,10,11,12, 4, 5, 7, 2, 7], "D":[0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6]})
df2 = pd.DataFrame({"A":[5, 5, 6, 6, 6], "B":[1, 2, 1, 2, 3], "C":np.nan, "D":np.nan})
The np.nan
entries in df2
are meant to represent the cells that need to be populated. These are empty at the start of the process.
To populate df2
, I need to use the values in the column df2['B']
. Specifically, in this example, if the value of df2['B']
is equal to 1, then I need to get a random sample, with replacement, from df1[df1['B']==1]
, for both df1['C']
and df1['D']
. Importantly, these values are not independent. Therefore, I need to draw a random row from the subset of rows of df1
where df1['B']
is equal to one. And then I need to do this for all rows in df2
.
Doing df1[df1['B']==1][['C','D']].sample(replace = True)
draws a random sample for one case when the value of df1['B']
is one, but
- How do I assign the corresponding values to
df2
? - How do I do this for every row in
df2
?
I have tried several alternatives with loops, such as
for index, value in df2.iterrows():
if df2.loc[index,'B'] == 1:
temp_df = df1[df1['B'] == 1][['C','D']].sample(n = 1, replace = True)
if df2.loc[index,'B'] == 2:
temp_df = df1[df1['B'] == 2][['C','D']].sample(n = 1, replace = True)
if df2.loc[index,'B'] == 3:
temp_df = df1[df1['B'] == 3][['C','D']].sample(n = 1, replace = True)
if df2.loc[index,'B'] == 4:
temp_df = df1[df1['B'] == 4][['C','D']].sample(n = 1, replace = True)
df2.loc[index, 'C'] = temp_df['C']
df2.loc[index, 'D'] = temp_df['D']
but I get an error message saying
---> 15 df2.loc[index, 'C'] = temp_df['C']
16 df2.loc[index, 'D'] = temp_df['D']
...
ValueError: Incompatible indexer with Series
where the ... denotes lines from the error message that I skipped.
CodePudding user response:
Here's one approach:
(i) get the sample sizes from df2
with groupby
size
.
(ii) use groupby
apply
where we use a lambda function to sample items from df1
with the sample sizes obtained from (i) for each unique "B".
(iii) assign these sampled values to df2
(since "B" is not unique, we sorted df2
by "B" to make the rows align)
cols = ['C','D']
sample_sizes = df2.groupby('B')[cols].size()
df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
.groupby('B')[cols]
.apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
.droplevel(1).reset_index(drop=True))
df2 = df2.sort_index()
One sample:
A B C D
0 5 1 5 0.6
1 5 2 10 0.7
2 6 1 12 0.6
3 6 2 11 0.5
4 6 3 4 0.1