How to can I compute variance without zero elements?
For example
np.var([[1, 1], [1, 2]], axis=1) -> [0, 0.25]
I need:
var([[1, 1, 0], [1, 2, 0]], axis=1) -> [0, 0.25]
CodePudding user response:
Is it what your are looking for? You can filter out columns where all values are 0 (or at least one value is not 0).
m = np.array([[1, 1, 0], [1, 2, 0]])
np.var(m[:, np.any(m != 0, axis=0)], axis=1)
# Output
array([0. , 0.25])
CodePudding user response:
V1
You can use a masked array:
data = np.array([[1, 1, 0], [1, 2, 0]])
np.ma.array(data, mask=(data == 0)).var(axis=1)
The result is
masked_array(data=[0. , 0.25],
mask=False,
fill_value=1e 20)
The raw numpy array is the data
attribute of the resulting masked array:
>>> np.ma.array(data, mask=(data == 0)).var(axis=1).data
array([0. , 0.25])
V2
Without masked arrays, the operation of removing a variable number of elements in each row is a bit tricky. It would be simpler to implement the variance in terms of the formula sum(x**2) / N - (sum(x) / N)**2
and partial reduction of ufuncs.
First we need to find the split indices and segment lengths. In the general case, that looks like
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
Now you can operate on the raveled masked data:
mdata = data[data != 0]
mdata2 = mdata**2
var = np.add.reduceat(mdata2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
This gives you the same result for var
(probably more efficiently than the masked version by the way):
array([0. , 0.25])
V3
The var
function appears to use the more traditional formula (x - x.mean()).mean()
. You can implement that using the quantities above with just a bit more work:
means = (np.add.reduceat(mdata, inds) / lens).repeat(lens)
var = np.add.reduceat((mdata - means)**2, inds) / lens
Comparison
Here is a quick benchmark for the two approaches:
def nzvar_v1(data):
return np.ma.array(data, mask=(data == 0)).var(axis=1).data
def nzvar_v2(data):
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
mdata = data[data != 0]
return np.add.reduceat(mdata**2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
def nzvar_v3(data):
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
mdata = data[data != 0]
return np.add.reduceat((mdata - (np.add.reduceat(mdata, inds) / lens).repeat(lens))**2, inds) / lens
np.random.seed(100)
data = np.random.randint(10, size=(1000, 1000))
%timeit nzvar_v1(data)
18.3 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v2(data)
5.89 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v3(data)
11.8 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So for a large dataset, the second approach, while requiring a bit more code, appears to be ~3x faster than masked arrays and ~2x faster than using the traditional formulation.