How to apply a multiple argument function on a large dataframe with pandas?-CodePudding

I've faced a challange to join two dataframes in one.

For example, I've 2 dataframes:

The first df (df1) is the base with more than 300K lines, with multiple dates for every ID.

e.g:

ID | date |
001| 01-01-2021|
001| 02-01-2021|
001| 03-01-2021|
001| 04-01-2021|
001| 05-01-2021|

...

002| 01-01-2021|
002| 02-01-2021|
002| 03-01-2021|

The second df (df2) is where I have some extra info that I should join on the first data frame (not so big as the first one, 40k lines)

e.g:

ID | start_date | end_date | status |
001| 01-01-2021 | 02-01-2021| working |
001| 02-02-2021 | 01-03-2021| not working|

The challenge is to identify all the status on df1 based on the id and the start and end date of df2..

Basically, if ID_df1 == ID_df2 and date_df1 >= start_date & date_df1 <= end_date, so I've to capture the status of df2, which one in this case is "Working"

To solve this issue, I've created a function called status:

def status (df2, id, date):
    pos = -1
    for i in range(len(df2)):
        if ((df2.iat[i,0] == id) & (pd.to_datetime(df2.iat[i,1] <= date) & (pd.to_datetime(df2.iat[i,2] >= date):
           pos = i
        break
    if pos > -1:
       return (df2.iat[pos,3])
    else:
       return "Not Found"

And my issue is to apply for the function "status" on every df1 data.

I've tried to create a column in blank and apply the function on a loop:

df1['Status'] = ""

for i in range(len(df1)):
    df1.[i,2] = status(df2, df1[i,0], df1[i,1])

It worked for me, but it took a loooot of time.. for the 300K, I've covered the first 4k in one hour.

Is there an easier way to do that?

Thank you very much!

CodePudding user response：

I also 've tried

df1['Status'] = df1.apply(lambda x: status(df2, df1['ID'], df1['Date']),axis=1)

But not worked, I've got the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response：

Not an elegant, but probably much faster solution:

# big dataframe with all ID combinations
df = pd.merge(df1, df2, how='left', on='ID')

# conditions to filter the correct combinations
c1 = df['date'] >= df['start_date']
c2 = df['date'] <= df['end_date']

# final dataframe
df = df.loc[c1 & c2, ['ID', 'date', 'status']].copy()