Home > Software engineering >  Merge 2 Dataframes using start_date & end_date
Merge 2 Dataframes using start_date & end_date

Time:12-30

I have 2 dataframes which look something like this.

Train Departure Details (data_df)

pd.DataFrame(columns=['Train', 'Origin', 'Dest', 'j_type', 'bucket', 'dep_date'],
             data = [['AB001', 'NZM', 'JBP', 'OP', 'S1', '2022-12-27'],
                     ['AB001', 'NZM', 'JBP', 'SP', 'S1', '2023-01-02'],
                     ['AB001', 'NZM', 'JBP', 'OP', 'S1', '2023-01-05'],
                     ['AB002', 'NZM', 'JBP', 'SP', 'S1', '2022-12-21'],
                     ['AB002', 'NZM', 'JBP', 'SP', 'S1', '2023-05-21'],
                     ['AB003', 'NZM', 'RKP', 'OP', 'S2', '2023-01-07'],
                     ['AB012', 'NZM', 'JBP', 'OP', 'S2', '2023-02-07'],
                    ]
            )

Train Departure Details

Fares Dataframe (fares_df)

pd.DataFrame(columns=['Origin', 'Dest', 'j_type', 'bucket', 'start_date', 'end_date', 'fare'],
             data = [['NZM', 'JBP', 'OP', 'S1', '2022-01-01', '2022-12-31', 200],
                     ['NZM', 'JBP', 'SP', 'S1', '2023-01-01', '2023-12-31', 400],
                     ['NZM', 'JBP', 'OP', 'S1', '2023-01-01', '2022-01-31', 205],
                     ['NZM', 'JBP', 'OP', 'S1', '2023-01-31', '2023-12-31', 210],
                     ['NZM', 'JBP', 'OP', 'S2', '2023-01-31', '2023-12-31', 215]]
            )

Fare Details

I need to merge these 2 Datframes such that fares are applied based on correct dep_date that falls between start_date & end_date and also correct origin, destination, j_type & bucket columns

Expected Result

enter image description here


What I have tried:


ranges = fares[['start_date', 'end_date']].drop_duplicates().reset_index(drop=True)

# Get array of dep_date, start_dates, and end_dates
dep_dates = data_df['dep_date'].values
low_date = ranges['start_date'].values
high_date = ranges['end_date'].values

# Generate arrays for which set of date ranges the dep_date falls between
i, j = np.where((dep_dates[:, None] >= low_date) & (dep_dates[:, None] <= high_date))

# Add date range columns to data_df:
data_df.reset_index(inplace=True)

data_df.loc[i, 'fare_start_date'] = np.column_stack([ranges['start_date'].values[j]])
data_df.loc[i, 'fare_end_date'] = np.column_stack([ranges['end_date'].values[j]])

data_df.merge(
            fares_df,
            left_on=['Origin', 'Dest',
                     'bucket', 'j_type',
                     'fare_start_date', 'fare_end_date'],
            right_on=['Origin', 'Dest',
                      'bucket', 'j_type',
                      'start_date', 'end_date'],
            how='left')

However this does not give me the correct result after merge.

CodePudding user response:

df = data_df.rename(columns={'dep_date':'start_date'}) # rename is better than too many right_on, left_on
df1 = pd.merge_asof(df.sort_values('start_date'), fares_df.sort_values('start_date'), by=fares_df.columns[:4].tolist(), on='start_date')
df1['fare'] = df1['fare'].where(df1['start_date'] < df1['end_date']) # chk out of end_date. of course example has not this case.

df1

    Train   Origin  Dest    j_type  bucket  start_date  end_date    fare
0   AB002   NZM     JBP     SP      S1      2022-12-21  NaT         NaN
1   AB001   NZM     JBP     OP      S1      2022-12-27  2022-12-31  200.0
2   AB001   NZM     JBP     SP      S1      2023-01-02  2023-12-31  400.0
3   AB001   NZM     JBP     OP      S1      2023-01-05  2023-01-31  205.0
4   AB003   NZM     RKP     OP      S2      2023-01-07  NaT         NaN
5   AB012   NZM     JBP     OP      S2      2023-02-07  2023-12-31  215.0
6   AB002   NZM     JBP     SP      S1      2023-05-21  2023-12-31  400.0

merge df1 to df for order

out = df.merge(df1.drop('end_date', axis=1), how='left').rename(columns={'start_date':'dep_date'})

out

    Train   Origin  Dest    j_type  bucket  dep_date    fare
0   AB001   NZM     JBP     OP      S1      2022-12-27  200.0
1   AB001   NZM     JBP     SP      S1      2023-01-02  400.0
2   AB001   NZM     JBP     OP      S1      2023-01-05  205.0
3   AB002   NZM     JBP     SP      S1      2022-12-21  NaN
4   AB002   NZM     JBP     SP      S1      2023-05-21  400.0
5   AB003   NZM     RKP     OP      S2      2023-01-07  NaN
6   AB012   NZM     JBP     OP      S2      2023-02-07  215.0

CodePudding user response:

at the end after data_df.merge, I guess you should select only rows where the dep_date falls between start_date and end_date like this:

merged_df = data_df.merge(fares_df,
            left_on=['Origin', 'Dest', 'bucket', 'j_type', 'fare_start_date', 'fare_end_date'],
            right_on=['Origin', 'Dest', 'bucket', 'j_type', 'start_date', 'end_date'],
            how='left')
merged_df = merged_df[(merged_df['dep_date'] >= merged_df['start_date']) & (merged_df['dep_date'] <= merged_df['end_date'])]
  • Related