I have two dataframes - df1 and df2 as following:
df_1 = pd.DataFrame( {'num': [1,2,3], 'time': [100,200,300]})
df_2 = pd.DataFrame( {'num': [1,2,3], 'time': [101,104,200]})
Match = is when 'num' in df1 not equals to 'num' in df2 and the time in df1 is in df2 with offset of 10. The results need to be in df_3:
If we find few matches, we take the only the first.
My code is works but not efficient
My code:
df_3 = pd.DataFrame(columns = ['num_df1', 'num_df2', 'time_df1', 'time_df2'])
def find_match(row_df_1, df_2):
time = row_df_1['time']
match_in_df2 = df_2[(df_2['num'] != row_df_1['num']) & (df_2['time'].between(time-10, time 10))]
if len(match_in_df2) > 0:
match_in_df2.reset_index(inplace=True, drop=True)
return pd.Series([row_df_1['num'], match_in_df2.loc[0,'num'],row_df_1['time'], match_in_df2.loc[0,'time']], index=df_3.columns)
for index, row_df_1 in df_1.iterrows():
df_3 = df_3.append(find_match(row_df_1, df_2), ignore_index=True)
results:
num_df1 num_df2 time_df1 time_df2
0 1 2 100 104
1 2 3 200 200
CodePudding user response:
Use merge
with how='cross'
before use boolean masks to select right rows:
out = pd.merge(df_1, df_2, how='cross', suffixes=('_df1', '_df2'))
m1 = out['num_df1'] != out['num_df2']
m2 = abs(out['time_df2'] - out['time_df1']) <= 10
out = out[m1 & m2]
Output:
>>> out
num_df1 time_df1 num_df2 time_df2
1 1 100 2 104
5 2 200 3 200