I have two large dataframes with the same columns. Example data of relevance is presented below.
import pandas as pd
df1 = [[52, '1', 'Mast', '1/1/2022'], [54, '1', 'Mast', '1/1/2023'],
[55, '2', 'Mast', '25/5/2022'], [52, '1', 'Mast', '14/7/2022']]
df = pd.DataFrame(df1, columns =['Cow','Lact', 'Event', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df2 = [[60, '1', 'Other', '10/1/2022'], [54, '1', 'Other', '3/1/2023'],
[55, '2', 'Other', '30/5/2022'], [52, '1', 'Other', '14/8/2022']]
df2 = pd.DataFrame(df2, columns =['Cow', 'Lact', 'Event', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%Y')
The structure of df is
Cow Lact Event Date
0 52 1 Mast 2022-01-01
1 54 1 Mast 2023-01-01
2 55 2 Mast 2022-05-25
3 52 1 Mast 2022-07-14
The structure of df2 is
Cow Lact Event Date
0 60 1 Other 2022-01-10
1 54 1 Other 2023-01-03
2 55 2 Other 2022-05-30
3 52 1 Other 2022-08-14
I would like to append rows from df2 to df if the "Cow" and "Lact" from df2 matches df and the "Date" of df2 fits within the range of the "Date" in df and five days greater than the "Date" of df.
The desired output with this example data is
Cow Lact Event Date
0 52 1 Mast 2022-01-01
1 54 1 Mast 2023-01-01
2 54 1 Other 2023-01-03
3 55 2 Mast 2022-05-25
4 55 2 Other 2022-05-30
5 52 1 Mast 2022-07-14
Any ideas appreciated.
CodePudding user response:
Not a clean solution, but it works:
# create temp columns cow-Lact
df['cow_lact'] = df['Cow'].astype(str) '_' df['Lact'].astype(str)
df2['cow_lact'] = df2['Cow'].astype(str) '_' df2['Lact'].astype(str)
# select correct colunms from df2
df_append = df2[
df2['cow_lact'].isin(df['cow_lact'])
& df2['Date'].between(df['Date'].min(), df['Date'].max() pd.Timedelta(days=5))]
# concat
result = pd.concat([df,df_append], ignore_index=True)
# drop temp column
result.drop(columns=['cow_lact'])
CodePudding user response:
Using merge_asof
:
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
keep = pd.merge_asof(
df[['Cow', 'Lact', 'Date']].sort_values(by='Date'),
df2.reset_index().sort_values(by='Date'),
on='Date', by=['Cow', 'Lact'],
direction='forward',
tolerance=pd.Timedelta('5D')
)['index'].dropna()
out = pd.concat([df, df2.loc[keep]])
Output:
Cow Lact Event Date
0 52 1 Mast 2022-01-01
1 54 1 Mast 2023-01-01
2 55 2 Mast 2022-05-25
3 52 1 Mast 2022-07-14
2 55 2 Other 2022-05-30
1 54 1 Other 2023-01-03