Moving linear regression with numba


I am trying to create a moving linear regression indicator and I want to use numba to speed it up. However, I am struggling with the numba part, as I lack experience with it.

Here's what I have so far using numpy. It works, but without numba it is quite slow once you throw large arrays at it.

import numpy as np


def ols_1d(y, window):
    y_roll = np.lib.stride_tricks.sliding_window_view(y, window_shape=window)

    m = list()
    c = list()
    for row in np.arange(y_roll.shape[0]):
        A = np.vstack([np.arange(1, window + 1), np.ones(window)]).T
        tmp_m, tmp_c = np.linalg.lstsq(A, y_roll[row], rcond=None)[0]
        m.append(tmp_m)
        c.append(tmp_c)

    m, c = np.array([m, c])

    return np.hstack((np.full((window - 1), np.nan), m * window + c))


def ols_2d(y, window):
    out = list()
    for col in range(y.shape[1]):
        out.append(ols_1d(y=y[:, col], window=window))

    return np.array(out).T


if __name__ == "__main__":
    a = np.random.randn(10000, 10)  # slow once you really increase the number of columns (say, 1 million)

    print(ols_2d(a, 10))

The indicator computes a linear regression using np.linalg.lstsq (https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html) over a given window length. It then outputs the very last point of the regression line, moves on to the next window, and computes the regression again. In the end, ols_1d collects the last point of every individual regression line into an array.
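
As a quick sanity check (a snippet of my own, not part of the original code), each non-NaN entry of ols_1d should match np.polyfit fitted on the corresponding window and evaluated at that window's last x-value:

import numpy as np

y = np.random.randn(50)
window = 10
out = ols_1d(y, window)

i = 25  # any index >= window - 1 works
# Fit the same window with np.polyfit and evaluate the line at x = window,
# i.e. the regression's prediction for the most recent point of the window.
slope, intercept = np.polyfit(np.arange(1, window + 1), y[i - window + 1:i + 1], deg=1)
assert np.isclose(out[i], slope * window + intercept)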

Now I need help applying numba on top of it. I am not familiar with numba, but as far as my own trial and error goes, there seems to be an issue with using np.lib.stride_tricks.sliding_window_view() inside a numba-compiled function.
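
For reference, this is roughly the direction I have in mind (a minimal sketch with my own naming and untested assumptions: sliding_window_view() appears not to be supported in nopython mode, while np.linalg.lstsq should work there as long as SciPy is installed and y is a float64 array), slicing the input manually instead of building a rolling view:

import numba as nb
import numpy as np


@nb.njit
def ols_1d_nb(y, window):
    # The design matrix is identical for every window: x = 1..window plus a constant column.
    A = np.empty((window, 2))
    for i in range(window):
        A[i, 0] = i + 1.0
        A[i, 1] = 1.0

    out = np.full(y.shape[0], np.nan)
    for start in range(y.shape[0] - window + 1):
        coeffs = np.linalg.lstsq(A, y[start:start + window])[0]
        # Keep only the last point of the fitted line, i.e. the prediction at x = window.
        out[start + window - 1] = coeffs[0] * window + coeffs[1]
    return out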

EDIT: Follow-up question

Using @aerobiomat's suggestion, I only needed to slightly modify np.cumsum in order to account for matrices rather than vectors.

def window_sum(x, w):
    c = np.cumsum(x, axis=0)  # inserted axis argument
    s = c[w - 1:]
    s[1:] -= c[:-w]
    return s

def window_lin_reg(x, y, w):
    sx = window_sum(x, w)
    sy = window_sum(y, w)
    sx2 = window_sum(x**2, w)
    sxy = window_sum(x * y, w)
    slope = (w * sxy - sx * sy) / (w * sx2 - sx**2)
    intercept = (sy - slope * sx) / w
    return slope, intercept

Now, let's create a 20x3 matrix, where columns represent individual time series.

timeseries_count = 3
x = np.arange(start=1, stop=21).reshape(-1, 1)
y = np.random.randn(20, timeseries_count)

slope, intercept = window_lin_reg(x, y, 10)
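
As a quick shape check (my own addition): with a window of 10 over 20 rows there are 20 - 10 + 1 = 11 windows, and the (11, 1) window sums of x broadcast against the (11, 3) window sums of y, so each result holds one row per window and one column per series:

print(slope.shape, intercept.shape)  # (11, 3) (11, 3)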

This works fine. However, as soon as I introduce some np.nan values, I run into issues.

y[0, 0] = np.nan
y[5, 0] = np.nan
y[10, 2] = np.nan

In order to compute the rolling regression, I would need to drop the np.nan in each column and compute everything column-wise. That goes against vectorization, doesn't it? Every column in the real dataset may contain a different number of np.nan values. How can I cope with that issue in a clever way? I need speed, since the dataset may be quite large (10,000 x 10,000 or so).

CodePudding user response:

Instead of calculating the linear regression for each window, which involves repeating many intermediate calculations, you can compute the values needed by the formula for all windows and perform a vectorized calculation of all regressions:

def window_sum(x, w):
    # Faster than np.lib.stride_tricks.sliding_window_view(x, w).sum(axis=0)
    c = np.cumsum(x)
    s = c[w - 1:]
    s[1:] -= c[:-w]
    return s

def window_lin_reg(x, y, w):
    sx = window_sum(x, w)
    sy = window_sum(y, w)
    sx2 = window_sum(x**2, w)
    sxy = window_sum(x * y, w)
    slope = (w * sxy - sx * sy) / (w * sx2 - sx**2)
    intercept = (sy - slope * sx) / w
    return slope, intercept
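
For reference, this is the standard closed-form solution for a simple least-squares fit that the code implements, with all sums taken over one window of w points:

slope     = (w * Σxy - Σx * Σy) / (w * Σx² - (Σx)²)
intercept = (Σy - slope * Σx) / w

window_sum() computes Σx, Σy, Σx² and Σxy for every window at once via cumulative sums, so all slopes and intercepts come out of a single vectorized expression.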

For example:

>>> w = 5      # Window
>>> x = np.arange(15)
>>> y = 0.1 * x**2 - x + 10
>>> slope, intercept = window_lin_reg(x, y, w)
>>> print(slope)
>>> print(intercept)
[-0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8  1.   1.2  1.4]
[ 9.8  9.3  8.6  7.7  6.6  5.3  3.8  2.1  0.2 -1.9 -4.2]

Comparing with np.linalg.lstsq() and loops:

def numpy_lin_reg(x, y, w):
    m = len(x) - w + 1
    slope = np.empty(m)
    intercept = np.empty(m)
    for i in range(m):
        A = np.vstack((x[i:i + w], np.ones(w))).T
        m, c = np.linalg.lstsq(A, y[i:i + w], rcond=None)[0]
        slope[i] = m
        intercept[i] = c
    return slope, intercept

Same example, same results:

>>> slope2, intercept2 = numpy_lin_reg(x, y, w)
>>> with np.printoptions(precision=2, suppress=True):
...     print(np.array(slope2))
...     print(np.array(intercept2))
[-0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8  1.   1.2  1.4]
[ 9.8  9.3  8.6  7.7  6.6  5.3  3.8  2.1  0.2 -1.9 -4.2]

Some timings with a larger example:

>>> w = 20
>>> x = np.arange(10000)
>>> y = 0.1 * x**2 - x + 10

>>> %timeit numpy_lin_reg(x, y, w)
324 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit window_lin_reg(x, y, w)
189 µs ± 3.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's three orders of magnitude of performance increase. The functions are already vectorized, so there's little Numba can do. "Only" twice as fast when the functions are decorated with @nb.njit:

>>> %timeit window_lin_reg(x, y, w)
96.4 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
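
For reference, a minimal sketch of what the decorated versions can look like (my own naming; the in-place -= on a slice is rewritten as an explicit assignment, the same workaround used in the NaN-aware Numba version further down):

import numba as nb
import numpy as np


@nb.njit
def window_sum_nb(x, w):
    c = np.cumsum(x)
    s = c[w - 1:]
    s[1:] = s[1:] - c[:-w]  # explicit assignment instead of s[1:] -= c[:-w]
    return s


@nb.njit
def window_lin_reg_nb(x, y, w):
    sx = window_sum_nb(x, w)
    sy = window_sum_nb(y, w)
    sx2 = window_sum_nb(x ** 2, w)
    sxy = window_sum_nb(x * y, w)
    slope = (w * sxy - sx * sy) / (w * sx2 - sx ** 2)
    intercept = (sy - slope * sx) / w
    return slope, intercept

They are called exactly like the plain NumPy versions above.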

CodePudding user response:

I'm adding this as an answer to @Andi's follow-up question, where the input data may contain NaNs.

These are the updated functions:

def window_sum(x, w):
    c = np.nancumsum(x)     # Nans no longer affect the summation
    s = c[w - 1:]
    s[1:] -= c[:-w]
    return s

def window_lin_reg(x, y, w):
    # Invalidate both x and y values when there's a nan in one of them
    valid = np.isfinite(x) & np.isfinite(y)
    x[~valid] = np.nan
    y[~valid] = np.nan

    # Sums for each window
    n = window_sum(valid, w)    # Count only valid points in the window
    sx = window_sum(x, w)
    sy = window_sum(y, w)
    sx2 = window_sum(x ** 2, w)
    sxy = window_sum(x * y, w)

    # Avoid ugly warnings
    with np.errstate(divide='ignore', invalid='ignore'):
        slope = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
        intercept = (sy - slope * sx) / n

    # Replace infinities by nans. Not necessary, but cleaner.
    invalid_results = n < 2
    slope[invalid_results] = np.nan
    intercept[invalid_results] = np.nan

    return slope, intercept

Test with nans:

>>> w = 5      # Window
>>> x = np.arange(15.)
>>> y = 0.1 * x**2 - x + 10
>>> x[3] = np.nan
>>> y[7:12] = np.nan
>>> slope, intercept = window_lin_reg(x, y, w)
>>> with np.printoptions(precision=2, suppress=True):
...     print(slope)
...     print(intercept)
[-0.59 -0.4  -0.21 -0.   -0.    0.1    nan   nan   nan  1.5   1.6 ]
[ 9.8   9.35  8.69  7.57  7.57  7.     nan   nan   nan -5.6  -6.83]

The version using np.linalg.lstsq and loops, only to compare results:

def numpy_lin_reg(x, y, w):
    valid = np.isfinite(x) & np.isfinite(y)
    x[~valid] = np.nan
    y[~valid] = np.nan
    m = len(x) - w + 1
    slope = np.empty(m)
    intercept = np.empty(m)
    for i in range(m):
        window = slice(i, i + w)
        valid_ = valid[window]
        x_ = x[window][valid_]
        y_ = y[window][valid_]
        n = valid_.sum()
        if n < 2:
            slope[i] = np.nan
            intercept[i] = np.nan
        else:
            A = np.vstack((x_, np.ones(n))).T
            m, c = np.linalg.lstsq(A, y_, rcond=None)[0]
            slope[i] = m
            intercept[i] = c
    return slope, intercept

Tested with the same data:

>>> slope2, intercept2 = numpy_lin_reg(x, y, w)
>>> with np.printoptions(precision=2, suppress=True):
...     print(np.array(slope2))
...     print(np.array(intercept2))
[-0.59 -0.4  -0.21  0.    0.    0.1    nan   nan   nan  1.5   1.6 ]
[ 9.8   9.35  8.69  7.57  7.57  7.     nan   nan   nan -5.6  -6.83]

In order to use the new implementation with Numba, a couple of changes are needed:

@nb.njit
def window_sum(x, w):
    c = np.nancumsum(x)         # Numba needs SciPy here
    s = c[w - 1:]
    s[1:] = s[1:] - c[:-w]      # Numba doesn't like -=
    return s

@nb.njit
def window_lin_reg(x, y, w):
    # Invalidate both x and y values when there's a nan in one of them
    valid = np.isfinite(x) & np.isfinite(y)
    x[~valid] = np.nan
    y[~valid] = np.nan

    # Sums for each window
    n = window_sum(valid, w)
    sx = window_sum(x, w)
    sy = window_sum(y, w)
    sx2 = window_sum(x ** 2, w)
    sxy = window_sum(x * y, w)

    # No warnings here from Numba
    slope = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    intercept = (sy - slope * sx) / n

    # Replace infinities by nans. Not necessary, but cleaner.
    invalid_results = n < 2
    slope[invalid_results] = np.nan
    intercept[invalid_results] = np.nan

    return slope, intercept
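
A small usage note (my addition, assuming the functions above compile on your Numba version): window_lin_reg() writes NaNs back into x and y in place, and np.nan can only be assigned into float arrays, so pass float copies if the originals must be preserved:

w = 5
x = np.arange(15.)               # float array, because NaNs are written into it
y = 0.1 * x ** 2 - x + 10
x[3] = np.nan
y[7:12] = np.nan
slope, intercept = window_lin_reg(x.copy(), y.copy(), w)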