Best way to evaluate a function over each element of a dataframe or array using pandas, numpy or oth

I have some time series data that requires multiplying constants by variables at time t. I have come up with 3 methods to get an answer that is correct.

The main thing I am wondering is Q1 below. I appreciate Q2 and Q3 could be subjective, but I am mostly seeing if there is a much better method I am completely missing.

Q1. Is there a much better way to implement this formula across a dataframe/array that I have missed (i.e. not one of these three methods in the code)? If so please let me know.

More subjectively: I could time each method and simply pick the fastest, but I was also wondering:

Q2. Are any of these methods preferred because they are clearer, better written, lighter on resources, or more 'Pythonic'?
Q3. Or is it just the case that any of these 3 are absolutely fine and it is just a preference thing? One large reason I ask is that I often hear people trying to shy away from loops...

The formula to be applied is:

ans_t = x * var1_t + (1 - x) * var2_t - y * max(0, var3_t - z)

note: _t means at time t

Due to the time series nature of it, I could not get something like this to work:

x * df['var1'] + (1 - x) * df['var2'] - y * max(0, df['var3'] - z)

Therefore I went for the 3 methods below:

# %%
import numpy as np
import pandas as pd

# example dataframe
df = pd.DataFrame({'var1': [6, 8, 11, 15, 10], 'var2': [1, 8, 2, 15, 4], 'var3': [21, 82, 22, 115, 64]})

# constants
x = 0.44
y = 1.68
z = 22

# function to evaluate: ans_t = x * var1_t + (1 - x) * var2_t - y * max(0, var3_t - z)
# note: _t means at time t


# %%
# ---- Method 1: use simple for loop ----
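df['ans1'] = 0.0  # float column, since the results are not integers

for i in range(len(df)):
    # .loc avoids chained assignment (df['ans1'][i] = ...), which may write to a copy
    df.loc[i, 'ans1'] = x * df.loc[i, 'var1'] + (1 - x) * df.loc[i, 'var2'] - y * max(0, df.loc[i, 'var3'] - z)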


# %%
# ---- Method 2: apply a lambda function ----
def my_func(var1, var2, var3):
    return x * var1 + (1 - x) * var2 - y * max(0, var3 - z)


# the lambda argument is named row so it does not shadow the constant x
df['ans2'] = df.apply(lambda row: my_func(row['var1'], row['var2'], row['var3']), axis=1)

# %%
# ---- Method 3: numpy vectorize ----
df['ans3'] = np.vectorize(my_func)(df['var1'], df['var2'], df['var3'])

CodePudding user response：

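np.maximum (note: not the same as np.max) gives a vectorized way of handling the max part of the formula. It takes the element-wise maximum of its two arguments, broadcasting the scalar 0 against the Series, whereas np.max reduces a single array to its largest value: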

df['ans_t'] = x * df['var1'] + (1 - x) * df['var2'] - y * np.maximum(0, df['var3'] - z)

after which df['ans_t'] is:

0      3.20
1    -92.80
2      5.96
3   -141.24
4    -63.92
Name: ans_t, dtype: float64
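
As a rough sketch of why the plain column expression in the question fails and how the vectorized form compares, the snippet below rebuilds the example dataframe and checks three things: that the built-in max raises on a Series, that np.maximum matches a row-by-row calculation, and that Series.clip(lower=0) is an equivalent pandas spelling. The variable names (builtin_max_failed, vectorized, clipped, looped) are made up for illustration, and the explanation of the failure is an assumption, since the question does not show the actual error.

```python
import numpy as np
import pandas as pd

# Same example data and constants as in the question.
df = pd.DataFrame({'var1': [6, 8, 11, 15, 10],
                   'var2': [1, 8, 2, 15, 4],
                   'var3': [21, 82, 22, 115, 64]})
x, y, z = 0.44, 1.68, 22

# The built-in max() compares its arguments and then needs a single
# True/False, which a whole Series cannot provide, so it raises ValueError.
try:
    max(0, df['var3'] - z)
    builtin_max_failed = False
except ValueError:
    builtin_max_failed = True

# np.maximum compares element by element and broadcasts the scalar 0.
vectorized = x * df['var1'] + (1 - x) * df['var2'] - y * np.maximum(0, df['var3'] - z)

# Series.clip(lower=0) is another way to write max(0, ...) per element.
clipped = x * df['var1'] + (1 - x) * df['var2'] - y * (df['var3'] - z).clip(lower=0)

# Row-by-row reference using the plain Python formula, for comparison.
looped = [x * a + (1 - x) * b - y * max(0, c - z)
          for a, b, c in zip(df['var1'], df['var2'], df['var3'])]

print(builtin_max_failed)
print(np.allclose(vectorized, looped), np.allclose(clipped, looped))
```

On Q2 and Q3: the loop, apply and np.vectorize all call the Python function once per row, so they tend to behave similarly on large data, while the column-wise expression does the arithmetic on whole columns at once. Timing them on your real data is the only way to be sure the difference matters for you.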