Home > Enterprise >  Compare 2 dataframe date columns to each other
Compare 2 dataframe date columns to each other

Time:08-06

I have 2 dataframes, df1 and df2. I would like to compare the dates row by row (matching on name) between the two. For example, lets look at Jim in row 1 and row 2 of df1. I would like to see if any of the rows in df2 are within 15 days before the date in date1(in df1) for each Jim row. I have the tables below as well as the expected output. Also, the instance column in df2 is just an identifier for each row, so instance 2 of Jim in df2 ends up being the matching row that is within 15 days for row 1 of Jim in df1.

So the steps for row 1 of Jim in df1 would be:

  1. See if row 1 in df1 has one or more rows in df2 with the name Jim, if it doesn't, then put 0 in the within15days column for that row (see expected output)
  2. Look at date1 of Jim in row 1 of df1 and see if there is a Jim row in df2 with a date2 that is before date1
  3. If there is on or more that are before date1, check if there is one that is 15 days before the date1 for row 1 of Jim in df1
  4. If there is, then make a separate column in a new dataframe with a 1 or 0 if there isn't.
  5. Repeat for each row in df1

Here are the dataframes:

DF1
| Name | Test | Date1                  |
| -----| ---- |------------------------|
| Jim  | 1    |   2020-05-10           |
| Jim  | 2    |   2021-06-11           |
| Bob  | 3    |   2019-05-15           |
| Bob  | 4    |   2022-03-10           |

DF2
| Name | Date2      | Instance|
| -----|------------| ------------|
| Jim  | 2020-06-10 | 1           |
| Jim  | 2021-05-05 | 2           |
| Bob  | 2019-04-15 | 1           |
| Bob  | 2022-03-20 | 2           |

Expected Output
| Name | Test | Date1    | Date2          | Within15days| Instance|
| -----| ---- |----------| ---------------|-------------|---------|
| Jim  | 1    |2020-05-10|2020-05-05      |1            |        2|
| Jim  | 2    |2021-06-11|2021-06-10      |1            |        1|
| Bob  | 3    |2019-05-15|2019-04-15      |0            |        1|
| Bob  | 4    |2022-03-10|2022-03-20      |0            |        2|

Edit: Assume that there can only be one row that is within15days of a row in DF1

CodePudding user response:

I think your output doesn't match what you have in the input as stated in my comment. But I m going to provide an answer making some corrections in your data. You could use merge_asof to merge on Date columns and by='Name' The by=Name is the key here, because you want to match on these before merging and choose direction=nearest instead of backward.

and then construct Within15days column by computing the difference between the Date columns

df1 = pd.DataFrame({
    'Name':['Jim', 'Jim','Bob', 'Bob'],
    'Test':[1,2,3,4],
    'Date1':['2020-05-10', '2021-06-11', '2019-05-15', '2022-03-10']
})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1.sort_values('Date1', inplace=True, ignore_index=True)

df2 = pd.DataFrame({
    'Name':['Jim', 'Jim','Bob', 'Bob'],
    'Date2':['2021-06-10', '2020-05-05', '2019-04-15', '2022-03-20'],
    'Instance':[1,2,1,2]
    
})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2.sort_values('Date2', inplace=True, ignore_index=True)

df = pd.merge_asof(df1, df2, left_on='Date1', right_on='Date2', by='Name', direction='nearest')
df['Within15days'] = df.apply(lambda x: x['Date1'].to_pydatetime()-x['Date2'].to_pydatetime(), axis=1)
df['Within15days'] = df['Within15days'].apply(lambda x: x.days>=0 and x.days<=15 ).astype(int)

print(df):

  Name  Test      Date1      Date2  Instance  Within15days
0  Bob     3 2019-05-15 2019-04-15         1             0
1  Jim     1 2020-05-10 2020-05-05         2             1
2  Jim     2 2021-06-11 2021-06-10         1             1
3  Bob     4 2022-03-10 2022-03-20         2             0
  • Related