get variance of matrix without zero values numpy-CodePudding

How to can I compute variance without zero elements?

For example

np.var([[1, 1], [1, 2]], axis=1) -> [0, 0.25]

I need:

var([[1, 1, 0], [1, 2, 0]], axis=1) -> [0, 0.25]

CodePudding user response：

Is it what your are looking for? You can filter out columns where all values are 0 (or at least one value is not 0).

m = np.array([[1, 1, 0], [1, 2, 0]])
np.var(m[:, np.any(m != 0, axis=0)], axis=1)

# Output
array([0.  , 0.25])

CodePudding user response：

You can use a masked array:

data = np.array([[1, 1, 0], [1, 2, 0]])
np.ma.array(data, mask=(data == 0)).var(axis=1)

The result is

masked_array(data=[0.  , 0.25],
             mask=False,
       fill_value=1e 20)

The raw numpy array is the data attribute of the resulting masked array:

>>> np.ma.array(data, mask=(data == 0)).var(axis=1).data
array([0.  , 0.25])

Without masked arrays, the operation of removing a variable number of elements in each row is a bit tricky. It would be simpler to implement the variance in terms of the formula sum(x**2) / N - (sum(x) / N)**2 and partial reduction of ufuncs.

First we need to find the split indices and segment lengths. In the general case, that looks like

lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]

Now you can operate on the raveled masked data:

mdata = data[data != 0]
mdata2 = mdata**2
var = np.add.reduceat(mdata2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2

This gives you the same result for var (probably more efficiently than the masked version by the way):

array([0.  , 0.25])

The var function appears to use the more traditional formula (x - x.mean()).mean(). You can implement that using the quantities above with just a bit more work:

means = (np.add.reduceat(mdata, inds) / lens).repeat(lens)
var = np.add.reduceat((mdata - means)**2, inds) / lens

Comparison

Here is a quick benchmark for the two approaches:

def nzvar_v1(data):
    return np.ma.array(data, mask=(data == 0)).var(axis=1).data

def nzvar_v2(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat(mdata**2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2

def nzvar_v3(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat((mdata - (np.add.reduceat(mdata, inds) / lens).repeat(lens))**2, inds) / lens

np.random.seed(100)
data = np.random.randint(10, size=(1000, 1000))

%timeit nzvar_v1(data)
18.3 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit nzvar_v2(data)
5.89 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit nzvar_v3(data)
11.8 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So for a large dataset, the second approach, while requiring a bit more code, appears to be ~3x faster than masked arrays and ~2x faster than using the traditional formulation.