I have 2 dataframes, df1 and df2. I would like to compare the dates row by row (matching on name) between the two. For example, lets look at Jim in row 1 and row 2 of df1. I would like to see if any of the rows in df2 are within 15 days before the date in date1(in df1) for each Jim row. I have the tables below as well as the expected output. Also, the instance column in df2 is just an identifier for each row, so instance 2 of Jim in df2 ends up being the matching row that is within 15 days for row 1 of Jim in df1.
So the steps for row 1 of Jim in df1 would be:
- See if row 1 in df1 has one or more rows in df2 with the name Jim, if it doesn't, then put 0 in the within15days column for that row (see expected output)
- Look at date1 of Jim in row 1 of df1 and see if there is a Jim row in df2 with a date2 that is before date1
- If there is on or more that are before date1, check if there is one that is 15 days before the date1 for row 1 of Jim in df1
- If there is, then make a separate column in a new dataframe with a 1 or 0 if there isn't.
- Repeat for each row in df1
Here are the dataframes:
DF1
| Name | Test | Date1 |
| -----| ---- |------------------------|
| Jim | 1 | 2020-05-10 |
| Jim | 2 | 2021-06-11 |
| Bob | 3 | 2019-05-15 |
| Bob | 4 | 2022-03-10 |
DF2
| Name | Date2 | Instance|
| -----|------------| ------------|
| Jim | 2020-06-10 | 1 |
| Jim | 2021-05-05 | 2 |
| Bob | 2019-04-15 | 1 |
| Bob | 2022-03-20 | 2 |
Expected Output
| Name | Test | Date1 | Date2 | Within15days| Instance|
| -----| ---- |----------| ---------------|-------------|---------|
| Jim | 1 |2020-05-10|2020-05-05 |1 | 2|
| Jim | 2 |2021-06-11|2021-06-10 |1 | 1|
| Bob | 3 |2019-05-15|2019-04-15 |0 | 1|
| Bob | 4 |2022-03-10|2022-03-20 |0 | 2|
Edit: Assume that there can only be one row that is within15days of a row in DF1
CodePudding user response:
I think your output doesn't match what you have in the input as stated in my comment. But I m going to provide an answer making some corrections in your data. You could use merge_asof
to merge on Date
columns and by='Name'
The by=Name
is the key here, because you want to match on these before merging and choose direction=nearest
instead of backward
.
and then construct Within15days
column by computing the difference between the Date
columns
df1 = pd.DataFrame({
'Name':['Jim', 'Jim','Bob', 'Bob'],
'Test':[1,2,3,4],
'Date1':['2020-05-10', '2021-06-11', '2019-05-15', '2022-03-10']
})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1.sort_values('Date1', inplace=True, ignore_index=True)
df2 = pd.DataFrame({
'Name':['Jim', 'Jim','Bob', 'Bob'],
'Date2':['2021-06-10', '2020-05-05', '2019-04-15', '2022-03-20'],
'Instance':[1,2,1,2]
})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2.sort_values('Date2', inplace=True, ignore_index=True)
df = pd.merge_asof(df1, df2, left_on='Date1', right_on='Date2', by='Name', direction='nearest')
df['Within15days'] = df.apply(lambda x: x['Date1'].to_pydatetime()-x['Date2'].to_pydatetime(), axis=1)
df['Within15days'] = df['Within15days'].apply(lambda x: x.days>=0 and x.days<=15 ).astype(int)
print(df):
Name Test Date1 Date2 Instance Within15days
0 Bob 3 2019-05-15 2019-04-15 1 0
1 Jim 1 2020-05-10 2020-05-05 2 1
2 Jim 2 2021-06-11 2021-06-10 1 1
3 Bob 4 2022-03-10 2022-03-20 2 0