Custom function Weighted Moving Average using Pandas.DataFrame, for some reason the value drops to 0

Time:12-28

I am testing my functions that calculate price indicators, and I have a strange bug that I don't know how to resolve.

EDIT: The columns in the CSV I've shared are all lower case, so if you want to test the function with this CSV, load it with this code:

data = pd.read_csv(csv_path)
data = data.drop(['symbol'], axis=1)
data.rename(columns={'open': 'Open', 'high': 'High', 'low': 'Low', 'close': 'Close', 'volume': 'Volume'}, inplace=True)

Link to data .csv file. You can try the function with its default arguments. (At the bottom of the post I am also sharing an auxiliary input_type function. Just make sure not to use an input mode higher than 4, since the HL2, HLC3, OHLC4 and HLCC4 inputs are not calculated for this CSV.)

So I am calculating Weighted Moving Average using this function:

(I am testing this function with default arguments)

def wma(price_df: PandasDataFrame, n: int = 14, input_mode: int = 2, from_price: bool = True, *,
        indicator_name: str = 'None') -> PandasDataFrame:
    if from_price:
        name_var, state = input_type(__input_mode__=input_mode)
    else:
        if indicator_name == 'None':
            raise TypeError('Invalid input argument. indicator_name cannot be set to None if from_price is False.')
        else:
            name_var = indicator_name

    wma_n = pd.DataFrame(index=range(price_df.shape[0]), columns=range(1))
    wma_n.rename(columns={0: f'WMA{n}'}, inplace=True)
    weight = np.arange(1, (n + 1)).astype('float64')
    weight = weight * n
    norm = sum(weight)
    weight_df = pd.DataFrame(weight)
    weight_df.rename(columns={0: 'weight'}, inplace=True)
    product = pd.DataFrame()
    product_sum = 0
    for i in range(price_df.shape[0]):
        if i < (n - 1):
            # create NaN values where the WMA cannot be calculated yet, to drop them later
            wma_n[f'WMA{n}'].iloc[i] = np.nan
        elif i == (n - 1):
            product = price_df[f'{name_var}'].iloc[:(i + 1)] * weight_df['weight']
            product_sum = product.sum()
            wma_n[f'WMA{n}'].iloc[i] = product_sum / norm
            print(f'index: {i}, wma: ', wma_n[f'WMA{n}'].iloc[i])
            print('product_sum: ', product_sum)
            print('norm', norm)
            product = product.iloc[0:0]
            product_sum = 0

        elif i > (n - 1):
            product = price_df[f'{name_var}'].iloc[(i - (n - 1)): (i + 1)] * weight_df['weight']
            product_sum = product.sum()
            wma_n[f'WMA{n}'].iloc[i] = product_sum / norm
            print(f'index: {i}, wma: ', wma_n[f'WMA{n}'].iloc[i])
            print('product_sum: ', product_sum)
            print('norm', norm)
            product = product.iloc[0:0]
            product_sum = 0

    return wma_n

For some reason the value drops to 0.0 after iteration 26, and I have no earthly idea why. Can someone please help me?

My output:

index: 13, wma:  14467.42857142857
product_sum:  21267120.0
norm 1470.0
index: 14, wma:  14329.609523809524
product_sum:  21064526.0
norm 1470.0
index: 15, wma:  14053.980952380953
product_sum:  20659352.0
norm 1470.0
index: 16, wma:  13640.480952380953
product_sum:  20051507.0
norm 1470.0
index: 17, wma:  13089.029523809522
product_sum:  19240873.4
norm 1470.0
index: 18, wma:  12399.72
product_sum:  18227588.4
norm 1470.0
index: 19, wma:  11572.234285714285
product_sum:  17011184.4
norm 1470.0
index: 20, wma:  10607.100952380953
product_sum:  15592438.4
norm 1470.0
index: 21, wma:  9504.32
product_sum:  13971350.4
norm 1470.0
index: 22, wma:  8263.905714285715
product_sum:  12147941.4
norm 1470.0
index: 23, wma:  6885.667619047619
product_sum:  10121931.4
norm 1470.0
index: 24, wma:  5369.710476190477
product_sum:  7893474.4
norm 1470.0
index: 25, wma:  3716.270476190476
product_sum:  5462917.6
norm 1470.0
index: 26, wma:  1926.48
product_sum:  2831925.6
norm 1470.0
index: 27, wma:  0.0
product_sum:  0.0
norm 1470.0
index: 28, wma:  0.0
product_sum:  0.0
norm 1470.0

Auxiliary function needed to run my function:

def input_type(__input_mode__: int) -> (str, bool):
    list_of_inputs = ['Open', 'Close', 'High', 'Low', 'HL2', 'HLC3', 'OHLC4', 'HLCC4']
    if __input_mode__ in range(1, 10, 1):
        input_name = list_of_inputs[__input_mode__ - 1]
        state = True
        return input_name, state
    else:
        raise TypeError('__input_mode__ out of range.')

CodePudding user response:

This problem is caused by a Pandas feature called alignment. Imagine you have two dataframes. One dataframe shows how much you own of each stock. The other DataFrame shows the stock price of each one. However, they're not in the same order, and there's missing data.

df_shares_held = pd.DataFrame({'shares': [1, 5, 10]}, index=['ABC', 'DEF', 'XYZ'])
df_price_per_share = pd.DataFrame({'price': [0.54, 1.1]}, index=['XYZ', 'ABC'])

These dataframes look like this:

     shares
ABC       1
DEF       5
XYZ      10
     price
XYZ   0.54
ABC   1.10

Pandas will let you multiply these two columns together.

print(df_shares_held['shares'] * df_price_per_share['price'])

ABC    1.1
DEF    NaN
XYZ    5.4
dtype: float64

Notice it matched up the price for ABC with the number of shares for ABC, despite them being in different orders in the original dataframes. DEF, which is missing a share price, now becomes NaN, because one side of the multiplication is missing a value.

Pandas is doing something similar here. This is the value of price_df[f'{name_var}'].iloc[(i - (n - 1)): (i + 1)] partway through the loop:

1     14470.5
2     14472.5
3     14475.6
4     14475.5
5     14481.0
6     14477.0
7     14474.0
8     14471.5
9     14471.5
10    14470.5
11    14467.6
12    14456.0
13    14448.6
14    14446.6
Name: Close, dtype: float64

Notice this starts at 1 and ends at 14.

This is the value of weight_df['weight'] in the same loop:

0      14.0
1      28.0
2      42.0
3      56.0
4      70.0
5      84.0
6      98.0
7     112.0
8     126.0
9     140.0
10    154.0
11    168.0
12    182.0
13    196.0
Name: weight, dtype: float64

Notice this starts at 0 and ends at 13.

And this is the product of the two:

0           NaN
1      405174.0
2      607845.0
3      810633.6
4     1013285.0
5     1216404.0
6     1418746.0
7     1621088.0
8     1823409.0
9     2026010.0
10    2228457.0
11    2430556.8
12    2630992.0
13    2831925.6
14          NaN
dtype: float64

You now have NaNs for the first and last value, and only 13 real values. Each time it loops, it will lose one more value.
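
The shrinking overlap can be reproduced on made-up data. The sketch below assumes a small window (n = 4) and a synthetic price series rather than the CSV from the question; it counts how many non-NaN products survive on each iteration of a loop shaped like the one in the question.

```python
import numpy as np
import pandas as pd

n = 4  # small window, chosen only to keep the output short
prices = pd.Series(np.arange(10, 20, dtype="float64"))     # synthetic data, labels 0..9
weights = pd.Series(np.arange(1, n + 1, dtype="float64"))  # labels 0..n-1

overlap = []
for i in range(n - 1, len(prices)):
    window = prices.iloc[i - (n - 1): i + 1]  # keeps its original labels i-n+1..i
    product = window * weights                # aligned on labels, not on positions
    overlap.append(int(product.notna().sum()))

print(overlap)  # [4, 3, 2, 1, 0, 0, 0]
```

Only the first window lines up fully with the weights. After n more iterations no labels overlap at all, which matches the question's output reaching 0.0 at index 27 for n = 14.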

But why does it return zero, and not NaN? Pandas ignores NaN values when doing a sum over a column. If you sum only NaN values, then it returns zero.
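
A short check of that behaviour, as a sketch: the default sum skips NaNs, while the min_count and skipna parameters of Series.sum make a missing result visible instead of silently returning zero.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan, np.nan])

default_sum = all_nan.sum()            # NaNs are skipped, so this is an empty sum
strict_sum = all_nan.sum(min_count=1)  # needs at least one valid value, else NaN
no_skip_sum = all_nan.sum(skipna=False)  # NaNs propagate into the result

print(default_sum, strict_sum, no_skip_sum)  # 0.0 nan nan
```

Using min_count=1 in the original loop would have produced NaN instead of 0.0, which is usually an easier symptom to trace back to misaligned indices.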

So how can you avoid alignment? There are many ways.

  1. Approach #1: You could call reset_index():

    product = price_df[f'{name_var}'].iloc[(i - (n - 1)): (i + 1)].reset_index(drop=True) * weight_df['weight']
    

    This puts the index back to starting at zero.

  2. Approach #2: You could use numpy to do the calculation. Numpy doesn't care about alignment.

    product = price_df[f'{name_var}'].iloc[(i - (n - 1)): (i + 1)].values * weight_df['weight'].values
    
  3. Approach #3: Pandas already has a way to calculate what you're looking for - they're called rolling window calculations.

    import numpy as np
    import pandas as pd

    def wma2(price_df, n: int = 14, input_mode: int = 2, from_price: bool = True, *,
            indicator_name: str = 'None'):
        if from_price:
            name_var, state = input_type(__input_mode__=input_mode)
        else:
            if indicator_name == 'None':
                raise TypeError('Invalid input argument. indicator_name cannot be set to None if from_price is False.')
            else:
                name_var = indicator_name
        # weights 1..n, normalized so that they sum to 1
        weights = np.arange(1, (n + 1)).astype('float64')
        weights_normalized = weights / weights.sum()
        # use the selected column (not a hard-coded 'Close');
        # raw=True passes each window as a NumPy array, so no index alignment happens
        wma_series = price_df[name_var].rolling(n).apply(
            lambda window: np.dot(window, weights_normalized), raw=True
        )
        return pd.DataFrame({f'WMA{n}': wma_series})
    

    This is not only simpler, but faster too.
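
One way to gain confidence in these fixes is to check that a positional loop and the rolling version agree. The sketch below assumes synthetic random-walk prices and a window of n = 5, neither of which comes from the question; it is a sanity check, not a benchmark.

```python
import numpy as np
import pandas as pd

n = 5
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(size=50).cumsum(), name="Close")  # synthetic prices
weights = np.arange(1, n + 1, dtype="float64")
norm = weights.sum()

# Loop in the style of approach #2: multiply NumPy arrays position by position.
loop_wma = np.full(len(close), np.nan)
for i in range(n - 1, len(close)):
    window = close.iloc[i - (n - 1): i + 1].to_numpy()
    loop_wma[i] = (window * weights).sum() / norm

# Rolling window in the style of approach #3.
rolling_wma = close.rolling(n).apply(lambda w: np.dot(w, weights) / norm, raw=True)

agree = np.allclose(loop_wma, rolling_wma.to_numpy(), equal_nan=True)
print(agree)
```

Both versions leave the first n - 1 entries as NaN and never fall to zero, because neither relies on index labels.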

CodePudding user response:

I think this happens because your weight_df has indices 0-13, but as you iterate over price_df, the indices of the window will be 0-13 at first, then 1-14, then 2-15, 3-16, 4-17, etc. This means that when you multiply those together:

product = price_df[f'{name_var}'].iloc[(i - (n - 1)): (i + 1)] * weight_df['weight']

You will be getting a whole bunch of NaN values due to indices not aligning! Here is an illustration of what is getting progressively worse:

import pandas as pd

a = pd.Series([4, 5, 6], index=[1, 2, 3])
b = pd.Series([1, 2, 3], index=[3, 4, 5])

out = a * b

out:

1    NaN
2    NaN
3    6.0
4    NaN
5    NaN

In your case, the indices of weight_df and of the price_df window drift further apart as you iterate, creating more and more NaNs.

I'm sure this can be solved, but I would highly recommend doing this in a more "pandas" manner. Have a look at this SO post: https://stackoverflow.com/a/53833851/9499196

Pandas DataFrames provide the .rolling method, which generates the windows you're trying to create manually for you. You can then apply a function (your weighted average) to each window by calling .apply on the Rolling object returned by price_df[your_col].rolling().
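
As a minimal sketch of that idea on toy numbers that can be checked by hand (the series, the window size n = 3 and the helper name weighted_mean are made up for illustration):

```python
import numpy as np
import pandas as pd

n = 3
prices = pd.Series([1.0, 2.0, 3.0, 4.0])        # toy data, not from the question's CSV
weights = np.arange(1, n + 1, dtype="float64")  # 1, 2, 3: the newest value weighs most

def weighted_mean(window: np.ndarray) -> float:
    # window is a plain NumPy array because raw=True is passed below
    return float(np.dot(window, weights) / weights.sum())

wma = prices.rolling(n).apply(weighted_mean, raw=True)
print(wma)
```

The first two entries are NaN because the window is not full yet. The third is (1*1 + 2*2 + 3*3) / 6 = 14/6 and the fourth is (1*2 + 2*3 + 3*4) / 6 = 20/6.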
