I've got a list of over 500k people, and the data looks like Table 1. For each person, if their admission date in Table 2 is within 30 days of their admission date in Table 1, I'd like to store that overlapping record in Table 3. An example of what I'd like is below. Is there a faster way to do this than using iterrows over the person_ids and dates from Table 1 and checking every row in Table 2?
Table 1
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2017-01-31 | 6 |
| 5678 | 2018-03-20 | 12 |
| 9101 | 2017-02-22 | 11 |
| 1234 | 2020-10-31 | 19 |
| 5678 | 2019-06-16 | 21 |
| 9101 | 2021-12-14 | 8 |
Table 2
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2015-01-31 | 10 |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2017-01-31 | 10 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 1234 | 2012-12-31 | 10 |
| 5678 | 2019-07-10 | 11 |
| 9101 | 2017-01-31 | 10 |
Table 3
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 5678 | 2019-07-10 | 11 |
CodePudding user response:
You need to use merge_asof:
df1['admission_date'] = pd.to_datetime(df1['admission_date'])
df2['admission_date'] = pd.to_datetime(df2['admission_date'])

# both inputs must be sorted on the date keys for merge_asof
out = (pd.merge_asof(df1.sort_values(by='admission_date')
                        .rename(columns={'admission_date': 'date'})
                        .drop(columns='value'),
                     df2.sort_values(by='admission_date'),
                     by='person_id',
                     left_on='date',
                     right_on='admission_date',
                     direction='forward',          # only match later dates
                     tolerance=pd.Timedelta('30D'))
         .drop(columns='date')
         .dropna(subset='value')                   # drop df1 rows with no match
       )
output:
person_id admission_date value
0 1234 2017-02-12 152.0
1 9101 2017-02-25 99.0
2 5678 2018-04-10 10.0
3 5678 2019-07-10 11.0
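Note that merge_asof keeps at most one Table 2 row per Table 1 row (the nearest one in the allowed direction), so it can miss cases like person 9101, where two Table 2 admissions fall in the same 30-day window. If you need every match, as in the example Table 3, one option is a plain merge on person_id followed by a date filter. A sketch on the sample data, assuming "within 30 days" means on or after the Table 1 date (which is what the example output implies):

```python
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [1234, 5678, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2017-01-31', '2018-03-20', '2017-02-22',
                                      '2020-10-31', '2019-06-16', '2021-12-14']),
    'value': [6, 12, 11, 19, 21, 8],
})
df2 = pd.DataFrame({
    'person_id': [1234, 1234, 5678, 5678, 9101, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2015-01-31', '2017-02-12', '2017-01-31',
                                      '2018-04-10', '2017-02-25', '2017-03-01',
                                      '2012-12-31', '2019-07-10', '2017-01-31']),
    'value': [10, 152, 10, 10, 99, 10, 10, 11, 10],
})

# pair every Table 1 admission with every Table 2 row for the same person
pairs = (df1[['person_id', 'admission_date']]
         .rename(columns={'admission_date': 'date1'})
         .merge(df2, on='person_id'))

# keep Table 2 rows falling within [date1, date1 + 30 days]
delta = pairs['admission_date'] - pairs['date1']
df3 = (pairs[(delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta('30D'))]
       .drop(columns='date1')
       .drop_duplicates()
       .reset_index(drop=True))
```

The inner merge materializes every same-person pair, so with 500k people it may be worth running it in chunks of person_ids if memory is tight; it still avoids the Python-level loop of iterrows.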
CodePudding user response:
Let table 1 be df1, table 2 be df2 and table 3 be df3.
I'm not sure if table 1 has duplicate person_ids the way table 2 does, so I'm assuming it does and taking the most recent admission date from both table 1 and table 2.
df1 = df1.sort_values(by=['person_id', 'admission_date'], ascending=False)
df1 = df1[~df1['person_id'].duplicated()]  # keep only the latest admission per person_id
df2 = df2.sort_values(by=['person_id', 'admission_date'], ascending=False)
df2 = df2[~df2['person_id'].duplicated()]  # keep only the latest admission per person_id
df3 = pd.concat([df1.set_index('person_id')['admission_date'].to_frame('adm_date_1'),
                 df2.set_index('person_id')],
                axis=1, join='inner')
Now that we have the data aligned, we can check for the 30 day condition:
mask = (df3['adm_date_1'] - df3['admission_date']).dt.days.abs() <= 30
df3 = df3.loc[mask, ['admission_date', 'value']]
For this to work, the date columns need to be of datetime type; if they aren't, convert them first with pd.to_datetime.
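Putting the steps above together, a self-contained sketch on the sample data (with the datetime conversion handled up front and the 30-day comparison made explicit):

```python
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [1234, 5678, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2017-01-31', '2018-03-20', '2017-02-22',
                                      '2020-10-31', '2019-06-16', '2021-12-14']),
    'value': [6, 12, 11, 19, 21, 8],
})
df2 = pd.DataFrame({
    'person_id': [1234, 1234, 5678, 5678, 9101, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2015-01-31', '2017-02-12', '2017-01-31',
                                      '2018-04-10', '2017-02-25', '2017-03-01',
                                      '2012-12-31', '2019-07-10', '2017-01-31']),
    'value': [10, 152, 10, 10, 99, 10, 10, 11, 10],
})

# keep only each person's most recent admission in both tables
latest1 = (df1.sort_values(['person_id', 'admission_date'], ascending=False)
              .drop_duplicates('person_id')
              .set_index('person_id')['admission_date']
              .to_frame('adm_date_1'))
latest2 = (df2.sort_values(['person_id', 'admission_date'], ascending=False)
              .drop_duplicates('person_id')
              .set_index('person_id'))

# align on person_id, then keep pairs whose dates are within 30 days
df3 = pd.concat([latest1, latest2], axis=1, join='inner')
mask = (df3['adm_date_1'] - df3['admission_date']).dt.days.abs() <= 30
df3 = df3.loc[mask, ['admission_date', 'value']]
```

Because only the most recent admission per person survives, this finds at most one overlap per person (here only person 5678's 2019 pair), which differs from the example Table 3 that keeps all overlapping admissions.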