I wanted to find out whether it is significantly slower to iterate over the first two dimensions of an array than to do the operations column-wise. To my surprise, I found that it is actually faster to do the operations element-wise. Can someone explain?
Here is the code:
import timeit
import numpy as np

def row_by_row(arr, cop):
    for i in range(arr.shape[0]):
        for ii in range(arr.shape[1]):
            arr[i, ii] = cop[i, ii].copy()
    return arr
def all(arr, cop):
    for i in range(arr.shape[1]):
        arr[:, i] = cop[:, i].copy()
    return arr
print(timeit.timeit("row_by_row(arr, cop)", setup="arr=np.ones((26, 15, 5000)); cop = np.random.random((26, 15,5000))",number=50, globals=globals()))
print(timeit.timeit("all(arr, cop)",setup="arr=np.ones((26, 15, 5000)); cop=np.random.random((26, 15,5000))", number=50, globals=globals()))
These were the times:
0.12496590000000007
0.4989047
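(As a quick sanity check, not part of the timing code above: both functions produce the same result, so the comparison is apples to apples.)
arr1 = np.ones((26, 15, 5000)); arr2 = np.ones((26, 15, 5000))
cop = np.random.random((26, 15, 5000))
assert np.array_equal(row_by_row(arr1, cop), all(arr2, cop))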
CodePudding user response:
Because all is really cache-inefficient: it iterates by columns instead of rows.
Data in numpy arrays is stored by dimensions (row-major, C order by default): rows, then columns, then the 3rd dimension, etc. If we read a row, it is a sequential segment of memory that can be cached efficiently. If we read by column, we read a few bytes here, skip a few KB, then read a few more bytes, and so on, which causes a lot of cache misses. The problem gets more pronounced if we increase the 3rd dimension, e.g. to 50K.
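To illustrate (a small sketch of my own, not from the original post), you can inspect the strides and contiguity flags of the slices involved:
import numpy as np

arr = np.ones((26, 15, 5000))
# Strides in bytes: the last axis is contiguous (8 bytes per step),
# while the first axis jumps 15 * 5000 * 8 = 600000 bytes per step.
print(arr.strides)                       # (600000, 40000, 8)
# A slice along the leading axes is one contiguous block of memory ...
print(arr[0, 0].flags['C_CONTIGUOUS'])   # True
# ... while a "column" slice is scattered across the buffer.
print(arr[:, 0].flags['C_CONTIGUOUS'])   # False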
Reading by rows, as opposed to columns, eliminates the difference:
def all_by_rows(arr, cop):
    for row in range(arr.shape[0]):
        arr[row, :] = cop[row, :].copy()
    return arr
timeit with a 50K third dimension:
1.249532633984927 # row_by_row - which actually copies along the third dimension
2.0826793879969046 # all
1.3391598959860858 # all_by_rows
Without the unnecessary .copy(), as pointed out by Marco:
1.0241080590058118
0.9834478280099574
0.6739323509973474
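For reference, this is what the loops look like without the extra .copy() (my own sketch of what I assume the suggestion was; slice assignment already copies the data, and the _nocopy names are mine):
def row_by_row_nocopy(arr, cop):
    for i in range(arr.shape[0]):
        for ii in range(arr.shape[1]):
            arr[i, ii] = cop[i, ii]    # assignment into the slice already copies
    return arr

def all_by_rows_nocopy(arr, cop):
    for row in range(arr.shape[0]):
        arr[row, :] = cop[row, :]
    return arr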
CodePudding user response:
Short Answer:
Memory Allocation
Long Answer:
As the commenters on the question point out, the measured results seem very unreliable. Increasing the number of repetitions for the measurement to 2000 gives more stable results:
Row: 3.519135099995765
All: 5.321293300003163
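For example, the measurement would look roughly like this (my own sketch; same setup as in the question, only number raised to 2000):
import timeit
import numpy as np

setup = "arr = np.ones((26, 15, 5000)); cop = np.random.random((26, 15, 5000))"
print("Row:", timeit.timeit("row_by_row(arr, cop)", setup=setup, number=2000, globals=globals()))
print("All:", timeit.timeit("all(arr, cop)", setup=setup, number=2000, globals=globals()))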
One thing which certainly impacts the performance is how arrays are stored in memory and how many cache hits / misses we get.
def matrix(arr, cop):
    for i in range(arr.shape[0]):
        arr[i] = cop[i].copy()
    return arr
This performs a bit better than copying "columns":
Matrix: 4.6333566999965115
It is still slower, though, than going through it row by row. Why?
For this, let's take one step back from the loop:
def just_copy(arr, cop):
    return cop.copy()
Copy: 5.482903500000248
Just copying the whole thing is slower again!
I would assume the cause for it being faster to loop through the arrays is mostly memory allocation: cop.copy() has to allocate one large new array on every call, while the slice assignments in the loops write into the already allocated arr and allocate at most small temporaries. There may also be some additional overhead from copying NumPy structures.
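If allocation is indeed the dominant cost, copying into the existing buffer instead of allocating a new array should remove it entirely. A minimal sketch (the name copy_into is mine, and this variant was not measured above):
def copy_into(arr, cop):
    # np.copyto writes into the already allocated arr; unlike cop.copy(),
    # no new ~15.6 MB array (26 * 15 * 5000 * 8 bytes) is created per call.
    np.copyto(arr, cop)
    return arr

# equivalent: arr[...] = cop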