Iterate through two dataframes and filter based on conditions in columns-CodePudding

I have two DataFrames like that below:

df_1 = pd.DataFrame({
    'wag': [111, 111, 222], 
    'date': ['2019-08-30', '2019-09-20', '2019-08-10']
})

and

df_2 = pd.DataFrame({
    'wag': [111, 111, 111, 222, 222, 333, 333, 333], 
    'msr_date': ['2019-08-30', '2019-08-30', '2019-08-30', 
                 '2019-08-30', '2019-08-30', '2019-08-30', 
                 '2019-08-30', '2019-08-30'], 
    'measurements': [1, 2, 3, 4, 5, 6, 7, 8]
})

df_1 and df_2 are below:

wag	date
111	2019-08-30
111	2019-09-20
222	2019-08-10

wag	msr_date	measurements
111	2019-03-29	1
111	2019-03-19	2
111	2019-03-10	3
222	2019-09-30	4
222	2019-09-20	5
333	2019-08-30	6
333	2019-08-30	7
333	2019-09-20	8

So, I have two DataFrames df_1 and df_2 and I want to iterate through them at the same time and filter df_2. I need to iterate based on wag in both tables.

If wag in df_1 is in df_2 then I have to compare date and mrs_date and

if mrs_date <= date then I append all rows of that wag
else I check if I have one more condition for that row in df_1 for that wag and again repeat the comparison of dates.
When conditions / rows are finished for that wags we go to the next value of wag.

For example, here I first check the first row of df_1 and if mrs_date <= date in df_2 then I append all rows to df_new. If not then I go to the next date for that wag. If we have one if not then check another wag. So finally, for this example I will get df_new as below:

wag	msr_date	measurements
111	2019-03-29	1
111	2019-03-19	2
111	2019-03-10	3

In reality, there are so many values in df_1 and in df_2 so for_loop should be quite complicated and I couldn't find the way how to do it. If it is not clear how I described the task please ask.

CodePudding user response：

Merge your 2 dataframes on wag column then keep only rows that match your condition msr_date <= date. Finally, drop the column date from df_1 and remove duplicates rows.

>>> df_2.merge(df_1, on='wag', how='left') \
        .query('msr_date <= date') \
        .drop(columns='date').drop_duplicates()

   wag    msr_date  measurements
0  111  2019-03-29             1
2  111  2019-03-19             2
4  111  2019-03-10             3

CodePudding user response：

Using a for loop:

new_df = pd.DataFrame(data=None, columns=df_2.columns)

for idx, row in df_1.iterrows():
  for idx2, row2 in df_2.iterrows():
    if row['wag'] == row2['wag'] and row2['msr_date'] <= row['date']:
      new_df = new_df.append(row2).drop_duplicates()