Selecting second last row in pandas after groupby-CodePudding

I have the following code :

import pandas as pd
from dateutil.relativedelta import relativedelta
import yfinance as yf
import datetime 

ticker = 'f'
df = yf.download(tickers = ticker, start=datetime.datetime.now()-relativedelta(days=6), end= datetime.datetime.now(), interval="5m", progress = False)
df['day'] = df.index.day
df['pClose'] = df.groupby('day').nth(-2)['Close']

So with df['pClose'] = df.groupby('day').nth(-2)['Close'] I am trying to select the second last row. But it is producing this:

       Open   High    Low   Close  Adj Close   Volume  pClose
day                                                          
7    13.740  13.77  13.74  13.755     13.755  1451751     NaN
8    13.540  13.56  13.52  13.540     13.540  1199373     NaN
9    13.340  13.35  13.32  13.335     13.335  1399753     NaN
10   12.765  12.78  12.74  12.745     12.745  2398719     NaN

Could you please advise why it is not selecting the second last row?

Edit

To get the previous days's closing price: df = yf.download(tickers = ticker, start=datetime.datetime.now()-relativedelta(days=6), end= datetime.datetime.now(), progress = False)

Where ticker = "f"

             Open   High    Low  Close  Adj Close    Volume
Date                                                       
2022-06-07  13.26  13.77  13.19  13.74      13.74  38940300
2022-06-08  13.63  13.85  13.44  13.53      13.53  39441900
2022-06-09  13.51  13.59  13.28  13.28      13.28  30468000
2022-06-10  13.00  13.21  12.63  12.75      12.75  55644400

So on 9th the closing price was 13.28. However this is on a daily interval, in the code above not specifying interval defaults to daily.

Since the dataframe which I am requesting is on 5 min interval.

                              Open     High  ...  Adj Close   Volume
Datetime                                     ...                    
2022-06-07 09:30:00-04:00  13.2600  13.3800  ...    13.3700  2543171
2022-06-07 09:35:00-04:00  13.3679  13.4200  ...    13.3426   712259
2022-06-07 09:40:00-04:00  13.3500  13.3550  ...    13.2900   509699
2022-06-07 09:45:00-04:00  13.2900  13.3382  ...    13.3300   451413
2022-06-07 09:50:00-04:00  13.3300  13.4100  ...    13.3801   849915
                           ...      ...  ...        ...      ...
2022-06-10 15:40:00-04:00  12.8050  12.8100  ...    12.7801   889435
2022-06-10 15:45:00-04:00  12.7850  12.7900  ...    12.7400  1469980
2022-06-10 15:50:00-04:00  12.7450  12.7800  ...    12.7650  1910267
2022-06-10 15:55:00-04:00  12.7650  12.7800  ...    12.7450  2398719
2022-06-10 16:00:00-04:00  12.7500  12.7500  ...    12.7500        0

What I am trying to do groupby all the data on day number with datatime.index.day and then access the closing price on the second last day. This is where the problem was as when i did .nth(-2)['Close']` It was not selecting just 1 row, but was returning all the days available

CodePudding user response：

When you assign in pandas, there's this thing called index-matching where the values are assigned to the matching index of the assigned df.

You code returns:

>>> df.groupby('day').nth(-2)['Close']
day
7     13.755
8     13.540
9     13.335
10    12.745

where the indexes are days. However, the original df has datetime indexes:

                             Open     High      Low    Close  Adj Close  
Datetime                                                                   
2022-06-07 09:30:00-04:00  13.2600  13.3800  13.1850  13.3700    13.3700   
2022-06-07 09:35:00-04:00  13.3679  13.4200  13.2900  13.3426    13.3426   
2022-06-07 09:40:00-04:00  13.3500  13.3550  13.2800  13.2900    13.2900   
2022-06-07 09:45:00-04:00  13.2900  13.3382  13.2701  13.3300    13.3300   
2022-06-07 09:50:00-04:00  13.3300  13.4100  13.3100  13.3801    13.3801

That way, none of them match, so all values end up being NaNs.

Here, use transform to keep the same original indexes and broadcast the values:

>>> df.groupby('day').Close.transform('nth', n=-2)

Datetime
2022-06-07 09:30:00-04:00    13.290
2022-06-07 09:35:00-04:00    13.290
2022-06-07 09:40:00-04:00    13.290
2022-06-07 09:45:00-04:00    13.290
2022-06-07 09:50:00-04:00    13.290
                              ...  
2022-06-10 15:40:00-04:00    13.205
2022-06-10 15:45:00-04:00    13.205
2022-06-10 15:50:00-04:00    13.205
2022-06-10 15:55:00-04:00    13.205
2022-06-10 16:00:00-04:00    13.205

To get the previous day close, you can just use loc

>>> df.loc[df['day'] == df['day'].max() - 1, 'Close'].tail(1)

CodePudding user response：

Are you okay with resetting the index? Then you can try something as such

ticker = 'f'
df = yf.download(tickers = ticker, start=datetime.datetime.now()-relativedelta(days=6), end= datetime.datetime.now(), interval="5m", progress = False)
df['day'] = df.index.day
df['pClose'] = df.groupby('day')
df = df.reset_index()
df2 = df.iloc[-2]