I have a pandas dataframe of stock records, my goal is to pass in a particular 'day' e.g 8 and get the filtered data frame for the 8th of each month and year in the dataset. I have gone through some SO questions and managed to get one part of my requirement that was getting the records for a particular day, however if the data for say '8th' does not exist for the particular month and year, I need to get the records for the closest day where record exists for this particular month and year.
As an example, if I pass in 8th and there is no record for 8th Jan' 2022, I need to see if records exists for 7th and 9th Jan'22, and so on..and get the record for the nearest date.
If record is present in both 7th and 9th, I will get the record for 9th (higher date).
However, it is possible if the record for 7th exists and 9th does not exist, then I will get the record for 7th (closest). Code I have written so far
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I tried to make it clear but if there is any doubt, please let me know. Any help in the correct direction is appreciated.
CodePudding user response:
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex()
and method 'nearest'
. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
start=df.first_valid_index().normalize(), end=df.last_valid_index(),
freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', freq='D')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day is after the missing one) is called peak-ahead and can cause peak-ahead bias in quant research that would use that data. It is usually considered dangerous. You'd be safer using method='ffill'
.