Home > Software engineering >  Iterate through two dataframes and filter based on conditions in columns
Iterate through two dataframes and filter based on conditions in columns

Time:11-25

I have two DataFrames like that below:

df_1 = pd.DataFrame({
    'wag': [111, 111, 222], 
    'date': ['2019-08-30', '2019-09-20', '2019-08-10']
})

and

df_2 = pd.DataFrame({
    'wag': [111, 111, 111, 222, 222, 333, 333, 333], 
    'msr_date': ['2019-08-30', '2019-08-30', '2019-08-30', 
                 '2019-08-30', '2019-08-30', '2019-08-30', 
                 '2019-08-30', '2019-08-30'], 
    'measurements': [1, 2, 3, 4, 5, 6, 7, 8]
})

df_1 and df_2 are below:

wag date
111 2019-08-30
111 2019-09-20
222 2019-08-10
wag msr_date measurements
111 2019-03-29 1
111 2019-03-19 2
111 2019-03-10 3
222 2019-09-30 4
222 2019-09-20 5
333 2019-08-30 6
333 2019-08-30 7
333 2019-09-20 8

So, I have two DataFrames df_1 and df_2 and I want to iterate through them at the same time and filter df_2. I need to iterate based on wag in both tables.

If wag in df_1 is in df_2 then I have to compare date and mrs_date and

  • if mrs_date <= date then I append all rows of that wag
  • else I check if I have one more condition for that row in df_1 for that wag and again repeat the comparison of dates.
  • When conditions / rows are finished for that wags we go to the next value of wag.

For example, here I first check the first row of df_1 and if mrs_date <= date in df_2 then I append all rows to df_new. If not then I go to the next date for that wag. If we have one if not then check another wag. So finally, for this example I will get df_new as below:

wag msr_date measurements
111 2019-03-29 1
111 2019-03-19 2
111 2019-03-10 3

In reality, there are so many values in df_1 and in df_2 so for_loop should be quite complicated and I couldn't find the way how to do it. If it is not clear how I described the task please ask.

CodePudding user response:

Merge your 2 dataframes on wag column then keep only rows that match your condition msr_date <= date. Finally, drop the column date from df_1 and remove duplicates rows.

>>> df_2.merge(df_1, on='wag', how='left') \
        .query('msr_date <= date') \
        .drop(columns='date').drop_duplicates()

   wag    msr_date  measurements
0  111  2019-03-29             1
2  111  2019-03-19             2
4  111  2019-03-10             3

CodePudding user response:

Using a for loop:

new_df = pd.DataFrame(data=None, columns=df_2.columns)

for idx, row in df_1.iterrows():
  for idx2, row2 in df_2.iterrows():
    if row['wag'] == row2['wag'] and row2['msr_date'] <= row['date']:
      new_df = new_df.append(row2).drop_duplicates()
  • Related