Calculating column value based on previous row and column using lambda function


I have this pandas dataframe that looks like this:

index up_walk    down_walk   up_avg  down_avg
  0   0.000000   17.827148  0.36642   9.06815
  1   1.550781    0.000000      NaN       NaN
  2   0.957031    0.000000      NaN       NaN
  3   0.000000    2.878906      NaN       NaN

I want to calculate the missing values that are currently NaN using this formula:

df['up_avg'][i] = df['up_avg'][i-1] * 12 + df['up_walk'][i]

Explanation: for every row with a missing value, I want to compute it from the previous row's value in the same column plus the current row's value in a different column, and continue this calculation to the end of the dataframe. Each new row's calculation therefore depends on the previously computed up_avg value.
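For concreteness, here is the recurrence written out as a plain (slow) reference loop on the sample data above, just to pin down the intended result; the column values are copied from the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "up_walk": [0.000000, 1.550781, 0.957031, 0.000000],
    "up_avg":  [0.36642, np.nan, np.nan, np.nan],
})

# Reference implementation of the recurrence (slow, but unambiguous):
vals = df["up_avg"].to_numpy()
walk = df["up_walk"].to_numpy()
for i in range(1, len(vals)):
    if np.isnan(vals[i]):
        vals[i] = vals[i - 1] * 12 + walk[i]
df["up_avg"] = vals
# vals ≈ [0.36642, 5.947821, 72.330883, 867.970596]
```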

The problem is that using a loop is very slow because of the large dataframe (~10K rows).

Can anyone please help implement a lambda function for this?

If this is not possible, can anyone share a script for an efficient loop?

I tried a lot of things with no success, like this:

df['up_avg'] = df.apply(lambda x: pd.Series(np.where((x.up_avg != None), x.up_avg.shift() * 12 + x.up_walk, x.up_avg)))

got an error: "AttributeError: 'Series' object has no attribute 'up_avg'"

I also tried using shift to create new columns and then applying a lambda function, with no success.

I expect that my dataframe will look like this at the end:

index up_walk    down_walk   up_avg  down_avg
  0   0.000000   17.827148  0.36642   9.06815
  1   1.550781    0.000000  5.947821  108.8178
  2   0.957031    0.000000  72.330883 1305.8136
  3   0.000000    2.878906  867.970596  15672.642106

Thanks a lot!

CodePudding user response:

You have to use np.roll instead of shift. Also, if you are using apply, you must specify an axis:

# Keep iterating until there are no NaN values left; each pass fills
# every NaN whose previous row is already known.
while df['up_avg'].isnull().sum() > 0:
    df['up_avg'] = np.where(np.isnan(df.up_avg),
                            np.roll(df.up_avg, 1) * 12 + df.up_walk,
                            df.up_avg)

while df['down_avg'].isnull().sum() > 0:
    df['down_avg'] = np.where(np.isnan(df.down_avg),
                              np.roll(df.down_avg, 1) * 12 + df.down_walk,
                              df.down_avg)

print(df)
    up_walk   down_walk     up_avg      down_avg
0   0.0       17.827148     0.36642     9.06815
1   1.550781  0.0           5.947821    108.8178
2   0.957031  0.0           72.330883   1305.8136
3   0.0       2.878906      867.970596  15672.642106

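One caveat with the `while` approach: each pass can only fill a NaN whose predecessor is already known, so a long run of consecutive NaNs costs one full pass each (quadratic in the worst case). A single explicit pass over the underlying NumPy arrays, a sketch of the "efficient loop" the question asked for using the question's sample data, stays linear:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "up_walk":   [0.000000, 1.550781, 0.957031, 0.000000],
    "down_walk": [17.827148, 0.000000, 0.000000, 2.878906],
    "up_avg":    [0.36642, np.nan, np.nan, np.nan],
    "down_avg":  [9.06815, np.nan, np.nan, np.nan],
})

def fill_recurrence(avg, walk, constant=12):
    """Fill NaNs using avg[i] = constant * avg[i-1] + walk[i], in one pass."""
    avg = avg.copy()
    for i in range(1, len(avg)):
        if np.isnan(avg[i]):
            avg[i] = constant * avg[i - 1] + walk[i]
    return avg

df["up_avg"] = fill_recurrence(df["up_avg"].to_numpy(), df["up_walk"].to_numpy())
df["down_avg"] = fill_recurrence(df["down_avg"].to_numpy(), df["down_walk"].to_numpy())
```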

CodePudding user response:

Based on the math you're trying to implement here, each missing NaN value is given by:

up_avg1 = 12*up_avg0 + up_walk1
up_avg2 = 12*up_avg1 + up_walk2
up_avg3 = 12*up_avg2 + up_walk3

...and so on. Expressed in this way, each new value of up_avg depends on the previous value of up_avg, which forces you to loop.

Unpacking this, we recognise that:

up_avg1 = (12**1)*up_avg0 + (12**0)*up_walk1
up_avg2 = (12**2)*up_avg0 + (12**1)*up_walk1 + (12**0)*up_walk2
up_avg3 = (12**3)*up_avg0 + (12**2)*up_walk1 + (12**1)*up_walk2 + (12**0)*up_walk3

...and so on. This allows us to express all your unknown values of up_avg (your nan values) as the product of some calculation relying on three things, all of which you know at the outset:

  1. Your constant (12)
  2. Your first known value of up_avg (up_avg0 = 0.36642)
  3. all your known values of up_walk (up_walk1, up_walk2, up_walk3, etc).

Like this:

[up_avg1]             [12  ]   [1    0    0]   [up_walk1]
[up_avg2] = up_avg0 * [144 ] + [12   1    0] * [up_walk2]
[up_avg3]             [1728]   [144  12   1]   [up_walk3]

Therefore, instead of looping and calculating each NaN value one by one, you can express this math as (basically) a single-step matrix algebra problem and solve it that way, like this:

Fair warning - I'm not an expert in numpy or pandas - so the implementation here might be clumsy - the rationalisation of the math is what I'm trying to get across.

import numpy as np
import pandas as pd
from scipy.linalg import toeplitz
   
num_rows_in_dataframe = len(df)
# sets num_rows_in_dataframe to 4
constant = 12
# this is the value you want to multiply by each "previous" up_avg value
up_avg_1 = df['up_avg'][0]
# this is the first up_avg value you have = 0.36642

toeplitz_c = np.arange(num_rows_in_dataframe-1)
toeplitz_r = np.hstack((np.array([1]), np.zeros((num_rows_in_dataframe-2))))
powers = toeplitz(toeplitz_c, toeplitz_r)
# these three rows basically constuct this matrix:
    # [0., 0., 0.]
    # [1., 0., 0.]
    # [2., 1., 0.]
    
# We then raise your constant to the powers in this matrix:
constant_array = constant**powers
# which gives us:
    # [  1.,   1.,   1.]
    # [ 12.,   1.,   1.]
    # [144.,  12.,   1.]

# We then take the bottom triangle of this matrix:
constant_array = np.tril(constant_array)

# Giving us this matrix:
    # [  1.,   0.,   0.]
    # [ 12.,   1.,   0.]
    # [144.,  12.,   1.]
# We pick up all the "same row" values of up_walk and place them in a vector:
up_walk = np.array(df['up_walk'])[1:][:, np.newaxis]
    # [1.550781]
    # [0.957031]
    # [0.      ]

# And finally, putting it all together:
replace_nans_with = np.matmul(constant_array, up_walk) + (up_avg_1*constant**np.arange(1, num_rows_in_dataframe))[:, np.newaxis]

# We get an array of your missing nans:
    # [  5.947821]
    # [ 72.330883]
    # [867.970596]

Finally, put this vector of values in place of your NaN values and you're home.

This isn't really the "efficient loop" you were after, but (unless I'm missing something) it is a "non-loopy" way to solve the problem and should do the job faster. I'd be keen to know whether running it this way does in fact prove faster, so please try it and let me know.
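As a footnote, a recurrence of the form y[i] = c*y[i-1] + x[i] is exactly a first-order IIR filter, so `scipy.signal.lfilter` can evaluate it without any explicit Python loop. This is a sketch on the question's sample data, not the answer's own method; the initial filter state is set by hand to constant * up_avg0:

```python
import numpy as np
from scipy.signal import lfilter

constant = 12.0
up_avg0 = 0.36642
up_walk = np.array([1.550781, 0.957031, 0.0])  # rows 1..3 from the question

# y[i] = constant * y[i-1] + x[i]  <=>  filter with b = [1], a = [1, -constant].
# For this first-order filter, the initial state is constant * (previous output).
zi = np.array([constant * up_avg0])
filled, _ = lfilter([1.0], [1.0, -constant], up_walk, zi=zi)
# filled ≈ [5.947821, 72.330883, 867.970596]
```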
