I've got a list of over 500k people, and the data looks like Table 1. For each person, if their admission date in Table 2 is within 30 days of their admission date in Table 1, I'd like to store that overlapping record in Table 3. An example of what I'd like is below. Is there a faster way to do this than using iterrows over the person_ids and dates from Table 1 and checking every row in Table 2?
Table 1
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2017-01-31 | 6 |
| 5678 | 2018-03-20 | 12 |
| 9101 | 2017-02-22 | 11 |
| 1234 | 2020-10-31 | 19 |
| 5678 | 2019-06-16 | 21 |
| 9101 | 2021-12-14 | 8 |
Table 2
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2015-01-31 | 10 |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2017-01-31 | 10 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 1234 | 2012-12-31 | 10 |
| 5678 | 2019-07-10 | 11 |
| 9101 | 2017-01-31 | 10 |
Table 3
| person_id | admission_date | value |
| --- | --- | --- |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 5678 | 2019-07-10 | 11 |
CodePudding user response:
You need to use merge_asof:
df1['admission_date'] = pd.to_datetime(df1['admission_date'])
df2['admission_date'] = pd.to_datetime(df2['admission_date'])

# both inputs must be sorted on the date keys for merge_asof
out = (pd.merge_asof(df1.sort_values(by='admission_date')
                        .rename(columns={'admission_date': 'date'})
                        .drop(columns='value'),
                     df2.sort_values(by='admission_date'),
                     by='person_id',
                     left_on='date',
                     right_on='admission_date',
                     direction='forward',          # only match later dates
                     tolerance=pd.Timedelta('30D'))
         .drop(columns='date')
         .dropna(subset='value')                   # drop df1 rows with no match
       )
output:
person_id admission_date value
0 1234 2017-02-12 152.0
1 9101 2017-02-25 99.0
2 5678 2018-04-10 10.0
3 5678 2019-07-10 11.0
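Note that merge_asof keeps at most one Table 2 row per Table 1 row (the nearest one in the allowed direction), so it can miss cases like person 9101, where two Table 2 admissions fall in the same 30-day window. If you need every match, as in the example Table 3, one option is a plain merge on person_id followed by a date filter. A sketch on the sample data, assuming "within 30 days" means on or after the Table 1 date (which is what the example output implies):

```python
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [1234, 5678, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2017-01-31', '2018-03-20', '2017-02-22',
                                      '2020-10-31', '2019-06-16', '2021-12-14']),
    'value': [6, 12, 11, 19, 21, 8],
})
df2 = pd.DataFrame({
    'person_id': [1234, 1234, 5678, 5678, 9101, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2015-01-31', '2017-02-12', '2017-01-31',
                                      '2018-04-10', '2017-02-25', '2017-03-01',
                                      '2012-12-31', '2019-07-10', '2017-01-31']),
    'value': [10, 152, 10, 10, 99, 10, 10, 11, 10],
})

# pair every Table 1 admission with every Table 2 row for the same person
pairs = (df1[['person_id', 'admission_date']]
         .rename(columns={'admission_date': 'date1'})
         .merge(df2, on='person_id'))

# keep Table 2 rows falling within [date1, date1 + 30 days]
delta = pairs['admission_date'] - pairs['date1']
df3 = (pairs[(delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta('30D'))]
       .drop(columns='date1')
       .drop_duplicates()
       .reset_index(drop=True))
```

The inner merge materializes every same-person pair, so with 500k people it may be worth running it in chunks of person_ids if memory is tight; it still avoids the Python-level loop of iterrows.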
CodePudding user response:
Let table 1 be df1, table 2 be df2 and table 3 be df3.
I'm not sure if table 1 has duplicate person_ids the way table 2 does, so I'm assuming it does and taking the most recent admission date from both table 1 and table 2.
df1 = df1.sort_values(by=['person_id', 'admission_date'], ascending=False)
df1 = df1[~df1['person_id'].duplicated()]  # keep only the latest admission per person_id
df2 = df2.sort_values(by=['person_id', 'admission_date'], ascending=False)
df2 = df2[~df2['person_id'].duplicated()]  # keep only the latest admission per person_id
df3 = pd.concat([df1.set_index('person_id')['admission_date'].to_frame('adm_date_1'),
                 df2.set_index('person_id')],
                axis=1, join='inner')
Now that we have the data aligned, we can check for the 30 day condition:
mask = (df3['adm_date_1'] - df3['admission_date']).dt.days.abs() <= 30
df3 = df3.loc[mask, ['admission_date', 'value']]
For this to work, the date columns need to be of datetime type; if they aren't, convert them first with pd.to_datetime.
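Putting the steps above together, a self-contained sketch on the sample data (with the datetime conversion handled up front and the 30-day comparison made explicit):

```python
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [1234, 5678, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2017-01-31', '2018-03-20', '2017-02-22',
                                      '2020-10-31', '2019-06-16', '2021-12-14']),
    'value': [6, 12, 11, 19, 21, 8],
})
df2 = pd.DataFrame({
    'person_id': [1234, 1234, 5678, 5678, 9101, 9101, 1234, 5678, 9101],
    'admission_date': pd.to_datetime(['2015-01-31', '2017-02-12', '2017-01-31',
                                      '2018-04-10', '2017-02-25', '2017-03-01',
                                      '2012-12-31', '2019-07-10', '2017-01-31']),
    'value': [10, 152, 10, 10, 99, 10, 10, 11, 10],
})

# keep only each person's most recent admission in both tables
latest1 = (df1.sort_values(['person_id', 'admission_date'], ascending=False)
              .drop_duplicates('person_id')
              .set_index('person_id')['admission_date']
              .to_frame('adm_date_1'))
latest2 = (df2.sort_values(['person_id', 'admission_date'], ascending=False)
              .drop_duplicates('person_id')
              .set_index('person_id'))

# align on person_id, then keep pairs whose dates are within 30 days
df3 = pd.concat([latest1, latest2], axis=1, join='inner')
mask = (df3['adm_date_1'] - df3['admission_date']).dt.days.abs() <= 30
df3 = df3.loc[mask, ['admission_date', 'value']]
```

Because only the most recent admission per person survives, this finds at most one overlap per person (here only person 5678's 2019 pair), which differs from the example Table 3 that keeps all overlapping admissions.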