Home > Software engineering >  Find last available date if date does not exist in other DataFrame
Find last available date if date does not exist in other DataFrame

Time:05-15

Suppose that you have two data frames which can be created using code below:

df1 = pd.DataFrame(data={'start_date': ['2021-07-02', '2021-07-09',
                                        '2021-07-16', '2021-07-23',
                                        '2021-07-30'],
                         'end_date': ['2021-07-09', '2021-07-16',
                                      '2021-07-23', '2021-07-30',
                                      '2021-08-06']})

price_df = pd.DataFrame(data={'date': ['2021-07-01', '2021-07-03',
                                       '2021-07-08', '2021-07-10',
                                       '2021-07-16', '2021-07-22',
                                       '2021-07-30', '2021-07-31',
                                       '2021-08-06'],
                              'price': [47, 62, 107, 171, 30, 154, 42,
                                        143, 131]})

adding conversion to datetime using code below

df1['start_date'] = pd.to_datetime(df1['start_date'])
df1['end_date'] = pd.to_datetime(df1['end_date'])
price_df['date'] = pd.to_datetime(price_df['date'])

Using df1['start_date'].isin(price_df['date']) I can check if dates in column start_date of data frame df1 exists in data frame price_df. In case that is not true I want to replace the start_date with last available date from price_df. Similar process needs to be performed on column end_date of data frame df1. Finally prices can be extracted using a join.

Expected Output

 ------------ ------------ ------------- ----------- 
| start_date |  end_date  | start_price | end_price |
 ------------ ------------ ------------- ----------- 
| 01/07/2021 | 08/07/2021 |          47 |       107 |
| 08/07/2021 | 16/07/2021 |         107 |        30 |
| 16/07/2021 | 22/07/2021 |          30 |       154 |
| 22/07/2021 | 30/07/2021 |         154 |        42 |
| 30/07/2021 | 06/08/2021 |          42 |       131 |
 ------------ ------------ ------------- ----------- 

I attempted this code:

df2 = pd.DataFrame()
df2['start_date'] = np.where(df1['start_date']
                             .isin(price_df['date']),
                             df1['start_date'],
                             price_df[
                                 price_df.date < df1['start_date']][
                                 'date'].max())

I am getting the following error:

ValueError: Can only compare identically-labeled Series objects

CodePudding user response:

You can try pandas.merge_asof

df1[['start_date', 'end_date']] = df1[['start_date', 'end_date']].apply(pd.to_datetime)
price_df['date'] = pd.to_datetime(price_df['date'])

price_df = price_df.sort_values('date')

df1[['start_date', 'start_price']] = pd.merge_asof(df1, price_df, left_on='start_date', right_on='date')[['date', 'price']]
df1[['end_date', 'end_price']] = pd.merge_asof(df1, price_df, left_on='end_date', right_on='date')[['date', 'price']]

df1[['start_date', 'end_date']] = df1[['start_date', 'end_date']].apply(lambda col: col.dt.strftime('%d/%m/%Y'))
print(df1)

   start_date    end_date  start_price  end_price
0  01/07/2021  08/07/2021           47        107
1  08/07/2021  16/07/2021          107         30
2  16/07/2021  22/07/2021           30        154
3  22/07/2021  30/07/2021          154         42
4  30/07/2021  06/08/2021           42        131
  • Related