I have two dataframes. df1 has ~31,000 rows, while df2 has ~117,000 rows. I want to add a column to df1 based on the following conditions.
(df1.id == df2.id) and (df2.min_value < df1.value <= df2.max_value)
I know that df2 will return either 0 or 1 rows satisfying the condition for each value of id in df1. For each row in df1, I want to add a column from df2 when the above condition is satisfied.
My current code is as follows. It is a line by line approach.
new_df1 = pd.DataFrame(columns = df1.columns.tolist() [new_col])
for i, row in df1.iterrows():
val = row['value']
id = row['id']
dummy = df2[(df2.id == id) & (df2.max_value >= val) & (df2.min_value < val)]
if dummy.shape[0] == 0:
new_col = np.nan
else:
new_col = dummy.new_column.values[0]
l = len(new_df1)
new_df1.loc[l] = row.tolist() [new_col]
This is a time costly approach. Is there a way to more efficiently do this problem?
CodePudding user response:
You can merge df1 and df2 based on the id column:
merged_df = df1.merge(df2, on='id', how='left')
Now, any row in DF1 for which the id matches an id of a row in DF2 will have all the DF2 columns placed alongside it. Then, you can simply filter the merged dataframe for your given condition:
merged_df.query('max_value > value and min_value < val')