Home > Enterprise >  Clip value of cumprod during calculation
Clip value of cumprod during calculation

Time:06-24

Say I have the following dataframe

x = pd.DataFrame({'value': [1.0, 1.1, 1.1, 1.1, 1.2, 1.0, 0.9, 1.9, 1.7, 0.8, 0.5, 0.3]})

and I want to calculate the cumulative product without the value ever going below 1.0 or above 3.0.

If I simply do the cumulative product (x.cumprod()), I end up with

       value
0   1.000000
1   1.100000
2   1.210000
3   1.331000
4   1.597200
5   1.597200
6   1.437480
7   2.731212
8   4.643060
9   3.714448
10  1.857224
11  0.557167

But what I would like to do is something like this

def mycumprod(series, start, low, high):
    values = []
    last_value = start
    for value in series.values:
        last_value = last_value * value
        if last_value < low:
            last_value = low
        elif last_value > high:
            last_value = high
        values.append(last_value)
    return pd.Series(values)

where, during the cumulative product I prevent the value from ever going below low or above high.

Calling mycumprod(x['value'], 1.0, 1.0, 3.0) leads to the following series

0     1.000000
1     1.100000
2     1.210000
3     1.331000
4     1.597200
5     1.597200
6     1.437480
7     2.731212
8     3.000000
9     2.400000
10    1.200000
11    1.000000
dtype: float64

Is there a way to do this efficiently in Pandas?

I used this solution for cumsum in the past, but I don't know how to apply it to cumprod.

Thanks for any help you can provide!

CodePudding user response:

This type of cacluation is very difficult / impossible to vectorize using pandas/numpy, but you could use numba:

@njit
def mycumprod_numba(values, start, low, high):
    products = np.empty_like(values)
    last_value = start
    for i in range(len(values)):
        last_value *= values[i]
        if last_value < low:
            last_value = low
        elif last_value > high:
            last_value = high
        products[i] = last_value
    return products

For 1000 elements (pd.DataFrame({'value': np.random.rand(1_000) * 2})) I get a speed-up of about 15x:

%timeit mycumprod(x['value'], 1.0, 1.0, 3.0)
# 534 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.Series(mycumprod_numba(x['value'].to_numpy(), 1.0, 1.0, 3.0))
# 36.7 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • Related