I have two square matrices: some data and their weights. I would like to extract histogram data for each row individually (without resorting to simply looping, and preferably within numpy). For example:
import numpy as np
data = np.array([[1,3,4,5],[2,1,5,4],[3,3,1,6],[1,2,2,2]]) #data and weights are 4x4 matrices
weights = np.array([[1,1,2,0.4],[1,3,1,1],[1,1,6,5],[1,1,1,1]])
binn = np.array([0,1,3,4,7,10]) #there are 5 bins
#bins should be the same for every row
#would like output which is a 4x5 matrix
output = np.array([[0,1,1,2.4,0],[0,4,0,2,0],[0,6,2,5,0],[0,4,0,0,0]])
This is similar to what was asked in this question: Numpy histogram on multi-dimensional array. However, I have tried to make the solutions suggested there work, and the fact that the array of weights shares the dimensionality of the problem seems to break np.apply_along_axis. I've tried merging data and weights along the innermost dimension with np.concatenate((data[...,None], weights[...,None]), axis=-1) and feeding this as the array argument to a wrapper function passed to apply_along_axis, but this produces an output array of the wrong shape.
def wrapper(data_weights):
    h, _ = np.histogram(data_weights[..., 0], bins=binn, weights=data_weights[..., 1])
    return h
arrs = np.concatenate((data[...,None], weights[...,None]), axis=-1)
#looping gives the expected result:
for i in arrs:
    print(wrapper(i))
#however
h = np.apply_along_axis(wrapper, 1, arrs)
print(h, h.shape)
Not only does using apply_along_axis not give anything obviously meaningful, it also maintains the dimensionality of the bundled data and weights: h.shape = (4, 5, 2). I'm clearly using apply_along_axis wrong, but I can't see how, or whether it is even usable here (I have also looked into apply_over_axes with no more success).
CodePudding user response:
The np function that does what apply_along_axis does, but with a more flexible specification, is probably np.vectorize:
def myhist(data, weights, binn):
    return np.histogram(data, bins=binn, weights=weights)[0]

myhistvec = np.vectorize(myhist, signature='(m),(m),(n)->(p)')
myhistvec(data, weights, binn)
#[[0. 1. 1. 2.4 0. ]
# [0. 4. 0. 2. 0. ]
# [0. 6. 2. 5. 0. ]
# [0. 4. 0. 0. 0. ]]
myhist's role is just to expose only positional arguments (maybe np.vectorize can be used with keyword arguments, but I don't know of it).
vectorize creates a new function that is the vectorized version of myhist.
Generally speaking, g = np.vectorize(f) is a vectorized version of f. Meaning that if g is called with scalar arguments, they are just passed to f. But if g is called with 1d array arguments, then f is called with each element of those 1d arrays, and the results of all the f calls are used to build an array of results, returned by g. The same goes if g is called with 2d, 3d, ... arguments.
So it is a little bit like the numpy versions of math functions (np.sin([[1,2],[3,4]]) returns a 2d array containing [[sin(1), sin(2)], [sin(3), sin(4)]]).
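To make that concrete, here is a minimal sketch of that default, scalar-by-scalar behavior (the toy function f below is just an illustration, not part of the solution):
def f(x, y):
    # a plain Python function that only knows about scalars
    return x + 10 * y
g = np.vectorize(f)          # no signature: iterates down to scalars
print(g(2, 3))               # 32 -- scalars are passed straight to f
print(g([1, 2], [3, 4]))     # [31 42] -- f is called once per element pair
print(g([[1], [2]], [3, 4])) # 2x2 array -- broadcasting applies, like np.sin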
Except that for myhist we don't want to iterate down to each scalar of data. So we specify that with signature, which means that myhist's "atomic" calls are passed 3 arguments: two 1d arrays of the same size m (the value of m is irrelevant; all that matters is that it is the same m for the first two parameters, and that those parameters are 1d arrays), and a third argument that is another 1d array of a different size. And the returned result is yet another 1d array of yet another size.
So the result is the one you wanted.
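As far as I understand the signature mechanism, np.vectorize loops over any leading dimensions in front of the core (m) axis, gufunc-style, so the same myhistvec should accept a whole stack of matrices unchanged. A quick sketch (the 3d stack here is just an illustration, not from the question):
stacked_data = np.stack([data, data])          # shape (2, 4, 4)
stacked_weights = np.stack([weights, weights])
out = myhistvec(stacked_data, stacked_weights, binn)
print(out.shape)  # (2, 4, 5): one 5-bin histogram per row of each matrix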
Now, I must say that this is just cosmetic, as would have been a successful attempt to use apply_along_axis. These functions do not help at all performance-wise.
See the naive version (the one you wanted to avoid):
def naiveLoop(data, weights, binn):
    res = []
    for i in range(len(data)):
        res.append(np.histogram(data[i], weights=weights[i], bins=binn)[0])
    return np.array(res)
And compare them with timeit. The result is disappointing: the naive version needs 270 μs, while the vectorized function needs 460 μs to perform the same task.
Sure, with more rows the difference tilts less and less in favor of the naive version, and at 12500 rows the vectorized version finally overtook it. (Because the iteration itself, that is the for-loop counter, is probably faster, but that's all: for the rest, it is just an apply of a pure Python function.)
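For reference, here is a sketch of how such a comparison can be reproduced with timeit (absolute numbers will of course differ from machine to machine):
import timeit
# average per-call time of each implementation on the same 4x4 inputs
t_naive = timeit.timeit(lambda: naiveLoop(data, weights, binn), number=1000)
t_vec = timeit.timeit(lambda: myhistvec(data, weights, binn), number=1000)
print(f"naive:      {t_naive / 1000 * 1e6:.0f} us/call")
print(f"vectorized: {t_vec / 1000 * 1e6:.0f} us/call")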
For comparison, look at a numba function, reinventing the wheel (which I need to do because numba's version of histogram doesn't allow weights; otherwise I could have just added a @jit before my naiveLoop to get the same result. Maybe I could have done it, but I am a beginner in numba):
from numba import jit

@jit(nopython=True)
def fillHist(data, weights, binn, res):
    nl = len(data)
    nc = data.shape[1]
    nb = len(binn) - 1
    for i in range(nl):
        for j in range(nc):
            for k in range(nb):
                # accumulate the weight into the bin containing data[i, j]
                if data[i, j] >= binn[k] and data[i, j] < binn[k + 1]:
                    res[i, k] += weights[i, j]
                    break

def numbaVersion(data, weights, binn):
    res = np.zeros((len(data), len(binn) - 1))
    fillHist(data, weights, binn, res)
    return res
Same result, obviously. But it took only 2.9 μs! (Compared to the 270 and 460 μs of the naive and vectorized versions.)
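Note that numba compiles fillHist the first time it is called, so any timing must exclude that first call; a minimal sketch, reusing timeit from above:
numbaVersion(data, weights, binn)  # warm-up call triggers JIT compilation
t_numba = timeit.timeit(lambda: numbaVersion(data, weights, binn), number=1000)
print(f"numba: {t_numba / 1000 * 1e6:.1f} us/call")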
So, sometimes the most elegant solution is not the fastest. It may seem presumptuous, but I am pretty sure that if I had just posted my "vectorize" version, without the comments on speed and the comparison with the naive version, you would have replied "that's what I wanted" and found the solution elegant.
Yet the solution you already knew, and did not want because it needs for loops, was almost twice as fast. And even worse, reinventing the wheel (and in a naive way at that: no attempt to use bisection or anything clever to find the bin, just a linear scan over all the bins!) is some 150 times faster!