Home > other >  Conditionally append rows from one dataframe to another based on key-id and date range
Conditionally append rows from one dataframe to another based on key-id and date range

Time:01-22

I have two large dataframes with the same columns. Example data of relevance is presented below.

import pandas as pd 
    
df1 = [[52, '1', 'Mast', '1/1/2022'], [54, '1', 'Mast', '1/1/2023'],
       [55, '2', 'Mast', '25/5/2022'], [52, '1', 'Mast', '14/7/2022']]
    
df = pd.DataFrame(df1, columns =['Cow','Lact', 'Event', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')


df2 = [[60, '1', 'Other', '10/1/2022'], [54, '1', 'Other', '3/1/2023'],
       [55, '2', 'Other', '30/5/2022'], [52, '1', 'Other', '14/8/2022']]

    
df2 = pd.DataFrame(df2, columns =['Cow', 'Lact', 'Event', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%Y')

The structure of df is

Cow Lact    Event   Date
0   52  1   Mast    2022-01-01
1   54  1   Mast    2023-01-01
2   55  2   Mast    2022-05-25
3   52  1   Mast    2022-07-14

The structure of df2 is

    Cow Lact    Event   Date
0   60  1   Other   2022-01-10
1   54  1   Other   2023-01-03
2   55  2   Other   2022-05-30
3   52  1   Other   2022-08-14

I would like to append rows from df2 to df if the "Cow" and "Lact" from df2 matches df and the "Date" of df2 fits within the range of the "Date" in df and five days greater than the "Date" of df.

The desired output with this example data is

Cow Lact    Event   Date
0   52  1   Mast    2022-01-01
1   54  1   Mast    2023-01-01
2   54  1   Other   2023-01-03
3   55  2   Mast    2022-05-25
4   55  2   Other   2022-05-30
5   52  1   Mast    2022-07-14

Any ideas appreciated.

CodePudding user response:

Not a clean solution, but it works:

# create temp columns cow-Lact
df['cow_lact'] = df['Cow'].astype(str)   '_'    df['Lact'].astype(str)
df2['cow_lact'] = df2['Cow'].astype(str)   '_'    df2['Lact'].astype(str)

# select correct colunms from df2 
df_append = df2[
    df2['cow_lact'].isin(df['cow_lact'])
    & df2['Date'].between(df['Date'].min(), df['Date'].max()   pd.Timedelta(days=5))]

# concat
result = pd.concat([df,df_append], ignore_index=True)
# drop temp column
result.drop(columns=['cow_lact'])

CodePudding user response:

Using merge_asof:

df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

keep = pd.merge_asof(
              df[['Cow', 'Lact', 'Date']].sort_values(by='Date'),
              df2.reset_index().sort_values(by='Date'),
              on='Date', by=['Cow', 'Lact'],
              direction='forward',
              tolerance=pd.Timedelta('5D')
             )['index'].dropna()

out = pd.concat([df, df2.loc[keep]])

Output:

   Cow Lact  Event       Date
0   52    1   Mast 2022-01-01
1   54    1   Mast 2023-01-01
2   55    2   Mast 2022-05-25
3   52    1   Mast 2022-07-14
2   55    2  Other 2022-05-30
1   54    1  Other 2023-01-03
  • Related