Home > Back-end >  Fastest way to filter results from one dataframe into another dataframe based on multiple conditions
Fastest way to filter results from one dataframe into another dataframe based on multiple conditions

Time:05-23

Purpose of this post: high performance filtering

I've searched regarding this issue quite a bit, but the posts I've come to find are either not performant on larger dataframes or don't solve my exact issue.

Problem:

I have the below dataframes where each customer uploads a required document (recorded in dataframe 1) and the customer purchases a product (recorded in dataframe 2).

In laymen terms, at the point the customer purchased the product, we are trying to retrieve the latest status of a specific document he was supposed to upload. If the customer has NOT uploaded a document, the result should be None.

The following three filter conditions are what should be applied per row of dataframe_2:

df_1.user == df_2.user
df_1.type == x
df_1.date_1 <= df_2.date_2

#e.g date_1 from dataframe_1 is the MAXIMUM date possible <= date_2 in dataframe_2).

Once I filter on these above conditions, we want to retrieve the status of the document (or None if it doesn't exist) and create that column in dataframe_2.

DataFrame 1:

document_type user date_1 status
x 123 2021-01-01 approved
y 123 2021-01-01 approved
x 123 2022-02-03 declined

DataFrame 2:

id user date_2
1 123 2021-01-01
2 123 2021-01-01
3 123 2021-05-04
4 123 2022-02-05
5 456 2021-07-30

Result:

id user date_2 document_x_status
1 123 2021-01-01 Approved
2 123 2021-01-01 Approved
3 123 2021-05-04 Approved
4 123 2022-02-05 Declined
5 456 2021-07-30 None

I have tried many methods from Multi-Index filtering to converting fields to arrays using to_numpy() and trying to filter that way.

All of the methods took quite some time and this only started becoming an issue due to the size of the data.

Thanks one and all of your help.

CodePudding user response:

You could try with pd.merge_asof with the dates as index:

import pandas as pd

# Use sort_index if the dates are not already sorted (required for merge_asof).
df1 = df1[df1['document_type'].eq('x')].set_index('date_1').sort_index()
df2 = df2.set_index('date_2').sort_index()

res = (pd.merge_asof(df2, df1, by='user', left_index=True, right_index=True, direction='backward').
       drop(columns=['document_type']).fillna('None').reset_index()
      )

print(res)
      date_2  index  id  user    status
0 2021-01-01      0   1   123  approved
1 2021-01-01      1   2   123  approved
2 2021-05-04      2   3   123  approved
3 2021-07-30      4   5   456      None
4 2022-02-05      3   4   123  declined
  • Related