I've faced a challange to join two dataframes in one.
For example, I've 2 dataframes:
- The first df (df1) is the base with more than 300K lines, with multiple dates for every ID.
e.g:
ID | date |
001| 01-01-2021|
001| 02-01-2021|
001| 03-01-2021|
001| 04-01-2021|
001| 05-01-2021|
...
002| 01-01-2021|
002| 02-01-2021|
002| 03-01-2021|
- The second df (df2) is where I have some extra info that I should join on the first data frame (not so big as the first one, 40k lines)
e.g:
ID | start_date | end_date | status |
001| 01-01-2021 | 02-01-2021| working |
001| 02-02-2021 | 01-03-2021| not working|
The challenge is to identify all the status on df1 based on the id and the start and end date of df2..
Basically, if ID_df1 == ID_df2 and date_df1 >= start_date & date_df1 <= end_date, so I've to capture the status of df2, which one in this case is "Working"
To solve this issue, I've created a function called status:
def status (df2, id, date):
pos = -1
for i in range(len(df2)):
if ((df2.iat[i,0] == id) & (pd.to_datetime(df2.iat[i,1] <= date) & (pd.to_datetime(df2.iat[i,2] >= date):
pos = i
break
if pos > -1:
return (df2.iat[pos,3])
else:
return "Not Found"
And my issue is to apply for the function "status" on every df1 data.
I've tried to create a column in blank and apply the function on a loop:
df1['Status'] = ""
for i in range(len(df1)):
df1.[i,2] = status(df2, df1[i,0], df1[i,1])
It worked for me, but it took a loooot of time.. for the 300K, I've covered the first 4k in one hour.
Is there an easier way to do that?
Thank you very much!
CodePudding user response:
I also 've tried
df1['Status'] = df1.apply(lambda x: status(df2, df1['ID'], df1['Date']),axis=1)
But not worked, I've got the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
CodePudding user response:
Not an elegant, but probably much faster solution:
# big dataframe with all ID combinations
df = pd.merge(df1, df2, how='left', on='ID')
# conditions to filter the correct combinations
c1 = df['date'] >= df['start_date']
c2 = df['date'] <= df['end_date']
# final dataframe
df = df.loc[c1 & c2, ['ID', 'date', 'status']].copy()