Home > Software design >  Trouble resampling pandas timeseries from 1min to 5min data
Trouble resampling pandas timeseries from 1min to 5min data

Time:12-04

I have a 1 minute interval intraday stock data which looks like this:

import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')

I am trying to resample it to '5m' data like this:

n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))

But it tries to resample the datetime information which is not in my data. The market data is only available till 03:30 PM, but when I look at the resampled dataframe I find its tried to resample for entire 24 hrs.
How do I stop the resampling till 03:30PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values due to this. Any suggestions will be welcome.

CodePudding user response:

I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile and you want to calculate some statistics per column, I suggest you do the following:

Get your data:

import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')

Resample your data:

Note: your result is the same as when you resample with n.resample('5T').first() but this means every value in the dataframe equals the first value from the 5 minute interval consisting of 5 values. A more logical resampling method is to use the mean() or sum() function as shown below.

If this is data on stock prices it makes more sense to use mean():

resampled_df = n.resample('5T').mean()

To remove resampled hours that are outside of the working stock hours you have 2 options.

Option 1: drop na values:

filtered_df = resampled_df.dropna()

Note: this will not work if you use sum() since the result won't contain missing values but zeros.

Option 2 filter based on start and end hour

Get minimum and maximum time of day where data is available as datetime.time object:

start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object

Filter dataframe based on start and end times:

filtered_df = resampled_df.between_time(start, end)

Get the statistics:

statistics = filtered_df.describe()
statistics

Note that describe() will not contain the sum, so in order to add it you could do:

statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics

Output: enter image description here

CodePudding user response:

The agg() is to apply individual method of operation for each column, I used this so that I can get to see the 'candlestick' formation as it is called in stock technical analysis.
I was able to fix the issue, by dropping the NaN values.

  • Related