Get records for the nearest date if record does not exist for a particular date-CodePudding

I have a pandas dataframe of stock records, my goal is to pass in a particular 'day' e.g 8 and get the filtered data frame for the 8th of each month and year in the dataset. I have gone through some SO questions and managed to get one part of my requirement that was getting the records for a particular day, however if the data for say '8th' does not exist for the particular month and year, I need to get the records for the closest day where record exists for this particular month and year.

As an example, if I pass in 8th and there is no record for 8th Jan' 2022, I need to see if records exists for 7th and 9th Jan'22, and so on..and get the record for the nearest date.

If record is present in both 7th and 9th, I will get the record for 9th (higher date).

However, it is possible if the record for 7th exists and 9th does not exist, then I will get the record for 7th (closest). Code I have written so far

filtered_df = data.loc[(data['Date'].dt.day == 8)]

If the dataset is required, please let me know. I tried to make it clear but if there is any doubt, please let me know. Any help in the correct direction is appreciated.

CodePudding user response：

Alternative 1

Resample to a daily resolution, selecting the nearest day to fill in missing values:

df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]

Alternative 2

A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex() and method 'nearest'. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).

dates = pd.date_range(
    start=df.first_valid_index().normalize(), end=df.last_valid_index(),
    freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')

Example

Let's start with a reproducible example:

import yfinance as yf

df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', freq='D')
>>> df.iloc[:10, :5]
             Adj Close                   Close                    High
                  AAPL        AMZN        AAPL        AMZN        AAPL
Date                                                                  
2022-01-03  180.959747  170.404495  182.009995  170.404495  182.880005
2022-01-04  178.663086  167.522003  179.699997  167.522003  182.940002
2022-01-05  173.910645  164.356995  174.919998  164.356995  180.169998
2022-01-06  171.007523  163.253998  172.000000  163.253998  175.300003
2022-01-07  171.176529  162.554001  172.169998  162.554001  174.139999
2022-01-10  171.196426  161.485992  172.190002  161.485992  172.500000
2022-01-11  174.069748  165.362000  175.080002  165.362000  175.179993
2022-01-12  174.517136  165.207001  175.529999  165.207001  177.179993
2022-01-13  171.196426  161.214005  172.190002  161.214005  176.619995
2022-01-14  172.071335  162.138000  173.070007  162.138000  173.779999

Now:

df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]

>>> df2.iloc[:5, :5]
             Adj Close                   Close                    High
                  AAPL        AMZN        AAPL        AMZN        AAPL
2022-01-08  171.176529  162.554001  172.169998  162.554001  174.139999
2022-02-08  174.042633  161.413498  174.830002  161.413498  175.350006
2022-03-08  156.730942  136.014496  157.440002  136.014496  162.880005
2022-04-08  169.323975  154.460495  170.089996  154.460495  171.779999
2022-05-08  151.597595  108.789001  152.059998  108.789001  155.830002

Warning

Replacing a missing day with data from the future (which is what happens when the nearest day is after the missing one) is called peak-ahead and can cause peak-ahead bias in quant research that would use that data. It is usually considered dangerous. You'd be safer using method='ffill'.