I wanted to find out whether it is significantly slower to iterate over the first two dimensions of an array than to do the operations column-wise. To my surprise, I found that it is actually faster to do the operations element-wise. Can someone explain?
Here is the code:
import timeit
import numpy as np

def row_by_row(arr, cop):
    for i in range(arr.shape[0]):
        for ii in range(arr.shape[1]):
            arr[i, ii] = cop[i, ii].copy()
    return arr
def all(arr, cop):
    for i in range(arr.shape[1]):
        arr[:, i] = cop[:, i].copy()
    return arr
print(timeit.timeit("row_by_row(arr, cop)", setup="arr=np.ones((26, 15, 5000)); cop = np.random.random((26, 15,5000))",number=50, globals=globals()))
print(timeit.timeit("all(arr, cop)",setup="arr=np.ones((26, 15, 5000)); cop=np.random.random((26, 15,5000))", number=50, globals=globals()))
These were the times:
0.12496590000000007
0.4989047
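(As a quick sanity check, not part of the timing code above: both functions produce the same result, so the comparison is apples to apples.)
arr1 = np.ones((26, 15, 5000)); arr2 = np.ones((26, 15, 5000))
cop = np.random.random((26, 15, 5000))
assert np.array_equal(row_by_row(arr1, cop), all(arr2, cop))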
CodePudding user response:
Because all is really cache-inefficient: it iterates by columns instead of rows.
Data in numpy arrays is stored by dimensions (row-major, C order by default): rows, then columns, then the 3rd dimension, etc. If we read a row, it is a sequential segment of memory that can be cached efficiently. If we read by column, we read a few bytes here, skip a few KB, then read a few more bytes, and so on, which causes a lot of cache misses. The problem gets more pronounced if we increase the 3rd dimension, e.g. to 50K.
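To illustrate (a small sketch of my own, not from the original post), you can inspect the strides and contiguity flags of the slices involved:
import numpy as np

arr = np.ones((26, 15, 5000))
# Strides in bytes: the last axis is contiguous (8 bytes per step),
# while the first axis jumps 15 * 5000 * 8 = 600000 bytes per step.
print(arr.strides)                       # (600000, 40000, 8)
# A slice along the leading axes is one contiguous block of memory ...
print(arr[0, 0].flags['C_CONTIGUOUS'])   # True
# ... while a "column" slice is scattered across the buffer.
print(arr[:, 0].flags['C_CONTIGUOUS'])   # False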
Reading by rows, as opposed to columns, eliminates the difference:
def all_by_rows(arr, cop):
    for row in range(arr.shape[0]):
        arr[row, :] = cop[row, :].copy()
    return arr
timeit with a 50K third dimension:
1.249532633984927 # row_by_row - which actually copies along the third dimension
2.0826793879969046 # all
1.3391598959860858 # all_by_rows
Without the unnecessary .copy(), as pointed out by Marco:
1.0241080590058118
0.9834478280099574
0.6739323509973474
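For reference, this is what the loops look like without the extra .copy() (my own sketch of what I assume the suggestion was; slice assignment already copies the data, and the _nocopy names are mine):
def row_by_row_nocopy(arr, cop):
    for i in range(arr.shape[0]):
        for ii in range(arr.shape[1]):
            arr[i, ii] = cop[i, ii]    # assignment into the slice already copies
    return arr

def all_by_rows_nocopy(arr, cop):
    for row in range(arr.shape[0]):
        arr[row, :] = cop[row, :]
    return arr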
CodePudding user response:
Short Answer:
Memory Allocation
Long Answer:
As the commenters on the question point out, the measured results seem very unreliable. Increasing the number of repetitions for the measurement to 2000 gives more stable results:
Row: 3.519135099995765
All: 5.321293300003163
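For example, the measurement would look roughly like this (my own sketch; same setup as in the question, only number raised to 2000):
import timeit
import numpy as np

setup = "arr = np.ones((26, 15, 5000)); cop = np.random.random((26, 15, 5000))"
print("Row:", timeit.timeit("row_by_row(arr, cop)", setup=setup, number=2000, globals=globals()))
print("All:", timeit.timeit("all(arr, cop)", setup=setup, number=2000, globals=globals()))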
One thing which certainly impacts the performance is how arrays are stored in memory and how many cache hits / misses we get.
def matrix(arr, cop):
    for i in range(arr.shape[0]):
        arr[i] = cop[i].copy()
    return arr
This performs a bit better than copying "columns":
Matrix: 4.6333566999965115
It is still slower, though, than going through it row by row. Why?
For this, let's take one step back from the loop:
def just_copy(arr, cop):
    return cop.copy()
Copy: 5.482903500000248
Just copying the whole thing is slower again!
I would assume the cause for it being faster to loop through the arrays is mostly memory allocation: cop.copy() has to allocate one large new array on every call, while the slice assignments in the loops write into the already allocated arr and allocate at most small temporaries. There may also be some additional overhead from copying NumPy structures.
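If allocation is indeed the dominant cost, copying into the existing buffer instead of allocating a new array should remove it entirely. A minimal sketch (the name copy_into is mine, and this variant was not measured above):
def copy_into(arr, cop):
    # np.copyto writes into the already allocated arr; unlike cop.copy(),
    # no new ~15.6 MB array (26 * 15 * 5000 * 8 bytes) is created per call.
    np.copyto(arr, cop)
    return arr

# equivalent: arr[...] = cop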