Say I have the following DataFrame:

import pandas as pd
x = pd.DataFrame({'value': [1.0, 1.1, 1.1, 1.1, 1.2, 1.0, 0.9, 1.9, 1.7, 0.8, 0.5, 0.3]})
and I want to calculate the cumulative product without the value ever going below 1.0 or above 3.0.
If I simply take the cumulative product (x.cumprod()), I end up with:
       value
0   1.000000
1   1.100000
2   1.210000
3   1.331000
4   1.597200
5   1.597200
6   1.437480
7   2.731212
8   4.643060
9   3.714448
10  1.857224
11  0.557167
But what I would like to do is something like this:

def mycumprod(series, start, low, high):
    values = []
    last_value = start
    for value in series.values:
        last_value = last_value * value
        # clamp the running product into [low, high]
        if last_value < low:
            last_value = low
        elif last_value > high:
            last_value = high
        values.append(last_value)
    return pd.Series(values)
where, during the cumulative product, I prevent the value from ever going below low or above high.
Calling mycumprod(x['value'], 1.0, 1.0, 3.0) leads to the following series:
0     1.000000
1     1.100000
2     1.210000
3     1.331000
4     1.597200
5     1.597200
6     1.437480
7     2.731212
8     3.000000
9     2.400000
10    1.200000
11    1.000000
dtype: float64
Is there a way to do this efficiently in Pandas?
I used this solution for cumsum in the past, but I don't know how to apply it to cumprod.
Thanks for any help you can provide!
CodePudding user response:
This type of calculation is difficult or impossible to vectorize with pandas/numpy, since each step depends on the clamped result of the previous step, but you could use numba:
import numpy as np
from numba import njit

@njit
def mycumprod_numba(values, start, low, high):
    products = np.empty_like(values)
    last_value = start
    for i in range(len(values)):
        last_value *= values[i]
        # clamp the running product into [low, high]
        if last_value < low:
            last_value = low
        elif last_value > high:
            last_value = high
        products[i] = last_value
    return products
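
For a quick sanity check (a minimal sketch, assuming the x and mycumprod from the question are still in scope), you can compare the compiled version against the original loop; note that the first call to mycumprod_numba pays a one-time JIT compilation cost, and later calls reuse the compiled code:

import numpy as np
import pandas as pd

expected = mycumprod(x['value'], 1.0, 1.0, 3.0)
got = pd.Series(mycumprod_numba(x['value'].to_numpy(), 1.0, 1.0, 3.0))
# Both implementations should produce the same clamped cumulative product.
assert np.allclose(expected, got)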
For 1000 elements (pd.DataFrame({'value': np.random.rand(1_000) * 2})) I get a speed-up of about 15x:
%timeit mycumprod(x['value'], 1.0, 1.0, 3.0)
# 534 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Series(mycumprod_numba(x['value'].to_numpy(), 1.0, 1.0, 3.0))
# 36.7 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
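
If the cumsum solution you mention was the np.frompyfunc(...).accumulate trick, the same idea also carries over to a clamped product without numba. This is only a sketch under that assumption (mycumprod_frompyfunc is a hypothetical name), and because the accumulation runs a Python-level function on object dtype, it is roughly as slow as the plain loop:

import numpy as np
import pandas as pd

def mycumprod_frompyfunc(series, start, low, high):
    # Binary step: multiply the running product by the next value, then clamp.
    step = np.frompyfunc(lambda acc, v: min(max(acc * v, low), high), 2, 1)
    # Seed the accumulation with `start`, then drop that seed from the result.
    arr = np.concatenate(([start], series.to_numpy())).astype(object)
    return pd.Series(step.accumulate(arr, dtype=object)[1:].astype(float))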