I have been using NumPy for a while but there still are instances in which broadcasting is impenetrable to me. Please see the code below:
import numpy as np
np.random.seed(123)
# Number of measurements for the x variable
nx = 7
# Number of measurements for the y variable
ny = 11
# Number of items for which we run the simulation
nc = 23
# Fake some data
x = np.random.uniform(0, 1, size=(nx, ))
y = np.random.uniform(0, 1, size=(ny, ))
# histogram_2d represents the 2d frequency of the x, y measurements
histogram_2d = np.random.randint(0, 20, size=(nx, ny))
# c is the actual simulation results, size=(nx*ny, nc)
c = np.random.uniform(0, 9, size=(nx*ny, nc))
# Try broadcasting
c_3d = c.reshape((nc, nx, ny))
numpy_sum = (c_3d * histogram_2d).sum()
# Attempt possible replacement with a simple loop
partial_sum = 0.0
for i in range(nc):
c_2d = np.reshape(c[:, i], (nx, ny))
partial_sum = (c_2d * histogram_2d).sum()
print('Numpy broadcasting: ', numpy_sum)
print('Actual loop : ', partial_sum)
In my naivete, I was expecting the two approaches to give the same results (up to some multiple of machine precision). But on my system I get this:
Numpy broadcasting: 74331.4423599
Actual loop : 73599.8596346
As my ignorance is showing: given that histogram_2d
is a 2D array and c_3d
is a 3D array, I was simply thinking that NumPy would magically expand histogram_2d
with nc
copies of itself in the first axis and do the multiplication. But it appears I am not quite correct.
I would like to know how to replace the condensed, broadcasted multiplication sum with a proper for loop - I am looking at some code in Fortran to do the same and this Fortran code:
hist_3d = spread(histogram_2d, 1, nc)
c_3d = reshape(c, [nc, nx, ny])
partial_sum = sum(c_3d*hist_3d)
Does not do what the NumPy broadcasting does... Which means I am doing something fundamentally wrong somewhere and/or my understanding of broadcasting is still very limited.
CodePudding user response:
In [3]: c.shape
Out[3]: (77, 23)
This isn't a good reshape; it works, but will mess up the layout
In [5]: c_3d = c.reshape((nc, nx, ny)); c_3d.shape
Out[5]: (23, 7, 11)
This is good - splitting the 77 into 7 and 11:
In [6]: c_3d = c.reshape((nx, ny,nc)); c_3d.shape
Out[6]: (7, 11, 23)
To multiply with:
In [7]: histogram_2d.shape
Out[7]: (7, 11)
Use:
In [8]: (histogram_2d[:,:,None]*c_3d).shape
Out[8]: (7, 11, 23)
In [9]: (histogram_2d[:,:,None]*c_3d).sum()
Out[9]: 73599.85963455029
With this broadcasting
(7,11,1) and (7,11,23) => (7,11,23)
The 2 key rules are:
add leading dimensions as need to match total ndim
change all size 1 dimensions to match
(I used change
, because the 1 may actually be changed to 0. That's not a common case, but illustrates the generality of broadcasting. )
New trailing dimensions have to be explicit. This avoids some ambiguities, as when trying to add a (2,) and (3,). One of those can be expanded to (1,2) or (1,3), but which? If one is (3,1), then expanding the other to (1,2) is unambiguous.