Home > Mobile >  Efficient and performant way to calculate the change percentage columns
Efficient and performant way to calculate the change percentage columns

Time:12-25

I am using yfinance via pandas datareader to download multiple-symbols' multi-year data and am trying to calculate 'MTDChg', 'YTDChg' and I figure this is one of the slowest parts in the runtime.

Here is the snippet of the code, where I have reservations about picking of the end of the previous period vis-a-viz availability of the data in the index itself. It is a DataFrame with multiple columns.

I am curious and trying to figure out if there is a better way around this. Using asfreq looks appealing but I am afraid that I will not be able to use the actual reference for starting or ending periods, for which the data may or may not exist in the index. I am thinking of using a applymap but not really sure of how to go about it in terms of a functional code for it and if that would be better in terms of performance.

Any ideas or suggestion on how do I go about this?

    import yfinance as yf
    import pandas_datareader as pdr
    import datetime as dt
    from pandas_datareader import data as pdr
    yf.pdr_override()
 
    y_symbols = ['GOOG', 'MSFT', 'TSLA']
    price_feed = pdr.get_data_yahoo(y_symbols, 
                                      start = dt.datetime(2020,1,1),
                                      end = dt.datetime(2022,12,1),
                                      interval = "1d")

    for dt in price_feed.index:
        dt_str = dt.strftime("%Y-%m-%d")
        current_month_str = f"{dt.year}-{dt.month}"
        previous_month_str = f"{dt.year}-{dt.month - 1}"
        current_year_str = f"{dt.year}"
        previous_year_str = f"{dt.year - 1}"
        
        
        if previous_month_str in price_feed.index:
            previous_month_last_day = price_feed.loc[previous_month_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_month_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            
            
        if previous_year_str in price_feed.index:
            previous_year_last_day = price_feed.loc[previous_year_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_year_last_day = price_feed.loc[current_year_str].index[0].strftime("%Y-%m-%d")
            
            
        if dt.month == 1 or dt.month == 2 or dt.month == 3:
            previous_qtr_str = f"{dt.year - 1}-12"
            current_qtr_str  = f"{dt.year}-01"
        elif dt.month == 4 or dt.month == 5 or dt.month == 6:
            previous_qtr_str = f"{dt.year}-03"
            current_qtr_str  = f"{dt.year}-04"
        elif dt.month == 7 or dt.month == 8 or dt.month == 9:
            previous_qtr_str = f"{dt.year}-06"
            current_qtr_str  = f"{dt.year}-07"
        elif dt.month == 10 or dt.month == 11 or dt.month == 12:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
        else:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
                    
        if previous_qtr_str in price_feed.index:
            #print("Previous quarter string is present in price feed for ", dt_str)
            previous_qtr_last_day = price_feed.loc[previous_qtr_str].index[-1].strftime("%Y-%m-%d")
            #print("Last quarter last day is", previous_qtr_last_day)
        elif current_qtr_str in price_feed.index:
            previous_qtr_last_day = price_feed.loc[current_qtr_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter is not present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
        else:
            previous_qtr_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter string is NOT present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
            
        #print(dt.day, current_month_str, previous_month_last_day)
        for symbol in y_symbols:
            #print(symbol, dt.day, previous_month_last_day, "<--->", pivot_calculations.loc[dt, ('Close', symbol)],  pivot_calculations.loc[previous_month_last_day, ('Close', symbol)])
            mtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_month_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_month_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('MTDChg', symbol)] = round(mtd_perf, 2)
            # calculate the qtd performance values
            qtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('QTDChg', symbol)] = round(qtd_perf, 2)
            ytd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_year_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_year_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('YTDChg', symbol)] = round(qtd_perf, 2)```

CodePudding user response:

IIUC, you are looking at period-to-date, for example "month-to-date" percent change calculations. The method below does the calculations for all three period types in 11.6 ms.

First: definition

The formal definition of "month to date" is:

A period starting at the beginning of the current calendar month and ending at the current date. Month-to-date is used in many contexts, mainly for recording results of an activity in the time between a date (exclusive, since this day may not yet be "complete") and the beginning of the current month.

More usual, however, and consistent with the calculation you seem to want, is: "from the Close of the previous period to the Close of current date (inclusive)". For example, the (usual, not formal) MTD change for '2020-01-07' would be the change between close('2019-12-31') and close('2020-01-07').

You introduce a twist that I think is a bit dangerous: if the basis (last of the previous period) isn't present in the data, then use the first day of the current period (and I presume you'd prefer to use Open than Close as basis for that initial period). I think it is safer and more correct to extend your data fetching to a few days earlier, and, after calculation, discard these extra days (see "Addendum" below).

Anyway, here is a way to do what I think you are asking for. We first calculate a basis for the desired period. For example, for "month-to-date", that basis would be:

basis = price_feed['Close'].shift().fillna(
    price_feed['Open']).groupby(pd.Grouper(freq='M')).transform('first')

To verify that basis:

>>> basis.loc['2020-01-30':'2020-02-04']
                 GOOG        MSFT       TSLA
Date                                        
2020-01-30  67.077499  158.779999  28.299999
2020-01-31  67.077499  158.779999  28.299999
2020-02-03  71.711502  170.229996  43.371334
2020-02-04  71.711502  170.229996  43.371334

Observe that each day of the month has the Close of the previous month, when available. For the first month (where the previous month is not available), we used the current month's Open.

Now, the percentage change is simply:

>>> 100 * (price_feed['Close'] - basis) / basis
                GOOG      MSFT       TSLA
Date                                     
2020-01-02  1.924640  1.158834   1.356893
2020-01-03  1.424468 -0.100771   4.360428
2020-01-06  3.925315  0.157451   6.369850
...              ...       ...        ...
2022-11-28  1.679692  4.148533 -19.609737
2022-11-29  0.824000  3.532502 -20.528256
2022-11-30  7.173033  9.912546 -14.432626

Putting it all together, for all periods of interest:

gb = price_feed['Close'].shift().fillna(price_feed['Open']).groupby
out = {
    name: 100 * (price_feed['Close'] - basis) / basis
    for name, freq in [
        ('MTD', 'M'),
        ('QTD', 'Q'),
        ('YTD', 'Y')
    ]
    for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
}

>>> out['YTD']
                 GOOG       MSFT       TSLA
Date                                       
2020-01-02   1.924640   1.158834   1.356893
2020-01-03   1.424468  -0.100771   4.360428
2020-01-06   3.925315   0.157451   6.369850
...               ...        ...        ...
2022-11-28 -33.473645 -28.116083 -48.072448
2022-11-29 -34.033502 -28.541271 -48.665759
2022-11-30 -29.879496 -24.137728 -44.728328

Addendum: safer way (load some extra days before start)

As said above, it is safer (more correct) to load a few days earlier, then truncate:

y_symbols = ['GOOG', 'MSFT', 'TSLA']
s, e = pd.Timestamp('2020-01-01'), pd.Timestamp('2022-12-01')
price_feed = pdr.get_data_yahoo(y_symbols, start=s - pd.Timedelta('7 days'), end=e, interval='1d')

def pctchg(price_feed, s, periods=(('MTD', 'M'), ('QTD', 'Q'), ('YTD', 'Y'))):
    gb = price_feed['Close'].shift().truncate(before=s).groupby
    return {
        name: 100 * (price_feed['Close'] - basis).dropna() / basis
        for name, freq in periods
        for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
    }

>>> pctchg(price_feed, s)['YTD']
                 GOOG       MSFT       TSLA
Date                                       
2020-01-02   2.269976   1.851616   2.851818
2020-01-03   1.768110   0.583385   5.899652
2020-01-06   4.277430   0.843375   7.938711
...               ...        ...        ...
2022-11-28 -33.473645 -28.116083 -48.072448
2022-11-29 -34.033502 -28.541271 -48.665759
2022-11-30 -29.879496 -24.137728 -44.728328

Addendum 2: performance

All the calculations are vectorized, so we expect this to be fast. Let's check this (for the "safer" version above):

%timeit pctchg(price_feed, s)
# 11.6 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Related