Numpy array to ctypes with FORTRAN ordering


Is there a performant way to convert a numpy array to a FORTRAN ordered ctypes array, ideally without a copy being required, and without triggering problems related to strides?

import numpy as np

# Sample data
n = 10000
A = np.zeros((n,n), dtype=np.int8)
A[0,1] = 1

def slow_conversion(A):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(A.T))

assert slow_conversion(A)[1][0] == 1

Performance analysis of just the call to as_ctypes:

%%timeit
np.ctypeslib.as_ctypes(A)

3.35 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Performance analysis of the supplied (slow) conversion

%%timeit
slow_conversion(A)

206 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The ideal outcome would be to get performance similar to that of just the as_ctypes call.

CodePudding user response:

As you've pointed out np.ctypeslib.as_ctypes(...) is fast.

The bottleneck of your computation is in np.ascontiguousarray(A.T) - it is equivalent to np.asfortranarray(A), which is just as slow on large arrays.


This leads me to believe that the conversion can't be made faster using only numpy functions: since a dedicated function for exactly this task already exists, we can assume it is close to the best performance numpy offers.
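The claimed equivalence is easy to check on a small array (the names here are illustrative):

```python
import numpy as np

A = np.arange(6, dtype=np.int8).reshape(2, 3)

B = np.ascontiguousarray(A.T)  # C-contiguous copy of the transpose
C = np.asfortranarray(A)       # F-contiguous copy of A: same bytes, transposed strides

# Both hold the same bytes in memory; only the logical view differs.
assert B.flags['C_CONTIGUOUS'] and C.flags['F_CONTIGUOUS']
assert np.array_equal(B, C.T)
assert B.tobytes() == C.tobytes('A')  # identical memory layout
```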

CodePudding user response:

By default, Numpy creates C-ordered (row-major) arrays, because it is written in C. Transposing with A.T creates a view of the array in which the strides are reversed (i.e. no copy is made). That said, np.ascontiguousarray then performs a copy, because the transposed view is no longer contiguous, and copies of large arrays are expensive. This is why slow_conversion is slow. Note that contiguity can be checked with yourarray.flags['F_CONTIGUOUS'] and by inspecting yourarray.strides. Note also that yourarray.flags and yourarray.__array_interface__ tell you whether the array has been copied, and also expose the strides.
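The view/strides behaviour described above can be checked directly (a minimal sketch):

```python
import numpy as np

A = np.zeros((3, 4), dtype=np.int8)
At = A.T  # a view: no data is copied, the strides are simply reversed

assert np.shares_memory(A, At)   # the transpose shares A's memory
assert A.strides == (4, 1)       # row-major strides for a (3, 4) int8 array
assert At.strides == (1, 4)      # reversed strides on the view
assert A.flags['C_CONTIGUOUS'] and not A.flags['F_CONTIGUOUS']
assert At.flags['F_CONTIGUOUS'] and not At.flags['C_CONTIGUOUS']
```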

np.asfortranarray returns an array laid out in Fortran order in memory, according to the documentation, and performs a copy only when needed. In practice, np.asfortranarray(A) makes a copy while np.asfortranarray(A.T) does not (you can check the C code of the function for details on this behaviour). Since both results are FORTRAN-contiguous, it is better to use np.asfortranarray(A.T), which does not copy anything in this case.
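The copy-only-when-needed behaviour can be verified with np.shares_memory (a small sketch):

```python
import numpy as np

A = np.zeros((3, 3), dtype=np.int8)  # C-contiguous

F1 = np.asfortranarray(A)    # A is not F-contiguous -> a copy is made
F2 = np.asfortranarray(A.T)  # A.T already is F-contiguous -> no copy

assert not np.shares_memory(A, F1)  # copied
assert np.shares_memory(A, F2)      # zero-copy view
```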

Regarding ctypes, it deals with C arrays, which are stored in row-major order, as opposed to FORTRAN arrays, which are stored in column-major order. Furthermore, C arrays do not support strides natively, as opposed to FORTRAN ones: a row is simply a view of contiguous data in memory. Since slow_conversion(A)[1][0] == 1 is required to be true, the first item of the second row of the returned object must be 1, which means the columns must be stored contiguously in memory. The problem is that the initial array is C-contiguous, not FORTRAN-contiguous, so a transposition is required, and transpositions are pretty expensive (although the Numpy implementation is suboptimal).
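Because ctypes arrays cannot carry stride information, np.ctypeslib.as_ctypes accepts a C-contiguous array but rejects a strided view such as a transpose (a small sketch):

```python
import numpy as np

A = np.zeros((3, 3), dtype=np.int8)

np.ctypeslib.as_ctypes(A)  # C-contiguous: wraps fine, no copy

try:
    np.ctypeslib.as_ctypes(A.T)  # a strided view: rejected
    raised = False
except TypeError:
    raised = True
assert raised
```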

Assuming you do not want to pay the overhead of a copy/transposition, the problem needs to be relaxed. There are a few possible options:

  • Create a FORTRAN-ordered array directly with Numpy, for example with np.zeros((n,n), dtype=np.int8, order='F'). This creates a C array with transposed strides so that it behaves like a FORTRAN array in which computations operating on columns are fast (remember that Numpy is written in C, so row-major arrays are the reference). With this, the first row in ctypes is actually a column. Note that you should be very careful when mixing C- and FORTRAN-ordered arrays for the sake of performance, as non-contiguous accesses are much slower.
  • Use a strided FORTRAN array. This basically means that basic column-based computations will be much slower, and one needs to write row-based computations, which are quite unusual in FORTRAN. You can extract the pointer to the C-contiguous array with A.ctypes.data_as(POINTER(c_double)), the strides with A.strides and the shape with A.shape. That being said, this solution does not appear to be really portable/standard; the standard way appears to be to use C bindings in FORTRAN. I am not very familiar with this, but you can find a complete example in this answer.
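The first option can be sketched as follows: create the array in F order, then wrap its transpose, which is a zero-copy C-contiguous view (the consumer must treat each ctypes "row" as a column):

```python
import numpy as np

n = 4
A = np.zeros((n, n), dtype=np.int8, order='F')  # Fortran-ordered from the start
A[0, 1] = 1

# A itself is strided from ctypes' point of view, but its transpose is a
# C-contiguous view (no copy), so it can be wrapped directly:
c_arr = np.ctypeslib.as_ctypes(A.T)

assert np.shares_memory(A, A.T)  # no copy was made
assert c_arr[1][0] == 1          # the first "rows" of c_arr are columns of A
```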

A last solution is to transpose the data in place manually using a fast transposition algorithm. This is faster than an out-of-place transposition, but it requires a square array and cannot be done with Numpy directly. Moreover, it mutates the input array, which should not be used afterwards (unless it is fine to operate on a transposed array). One solution is to do it in Numba, or in C, or directly in FORTRAN (using a wrapper function in all cases). This should be significantly faster than what Numpy does, but still far slower than a basic ctypes wrapping.
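As a rough illustration only, an in-place square transpose can be sketched in pure Python/NumPy by swapping the strict upper and lower triangles row by row; a real implementation would use Numba or C with cache blocking, as described above:

```python
import numpy as np

def transpose_inplace(a):
    """In-place transpose of a square array (illustrative sketch only):
    swap the strict upper and lower triangles row by row."""
    n, m = a.shape
    assert n == m, "in-place transposition requires a square array"
    for i in range(n):
        tmp = a[i, i + 1:].copy()   # tail of row i
        a[i, i + 1:] = a[i + 1:, i] # overwrite it with the tail of column i
        a[i + 1:, i] = tmp
    return a

A = np.arange(9, dtype=np.int8).reshape(3, 3)
expected = A.T.copy()
transpose_inplace(A)
assert np.array_equal(A, expected)
assert A.flags['C_CONTIGUOUS']  # layout unchanged; only the data moved
```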

CodePudding user response:

Requirements:

  • numpy array in column-major order (Fortran or 'F' order)
  • fast conversion to ctypes type
  • avoid problems with strides

One possible way would be to create the array already with the internal Fortran memory layout:

A = np.zeros((n, n), dtype=np.int8, order='F')

Then the conversion could look like this:

def fast_conversion(arr):
    return np.ctypeslib.as_ctypes(arr.flatten('F').reshape(arr.shape))

You could omit .reshape(arr.shape) if you only need a one-dimensional array - but in terms of performance, there should be no difference.

How does it work?

arr.flatten('F') returns the array collapsed into one dimension. Note that flatten always returns a copy, but since our array is stored in F order this copy is a sequential read of memory and therefore fast. Afterwards, the reshape applies the shape of the array back to it without changing its data; the result is C-contiguous, which is what as_ctypes needs. BTW: since we are working with an F-order array, we could also use arr.flatten('K'), about which the documentation says:

‘K’ means to flatten a in the order the elements occur in memory.

see https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html

It is important that the array is created with the F order. Otherwise fast_conversion would be as slow as slow_conversion.
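Why the index comes out right can be traced on a small array: flattening in F order reads the memory sequentially, and the C-order reshape reinterprets that memory as the transpose (a small sketch):

```python
import numpy as np

B = np.zeros((3, 3), dtype=np.int8, order='F')
B[0, 1] = 1

flat = B.flatten('F')        # sequential read of the F-ordered memory
out = flat.reshape(B.shape)  # reinterpreted with C-order strides

assert flat[3] == 1              # element (0, 1) sits at flat index 0 + 1*3
assert out[1, 0] == 1            # ... which a C-order reshape maps to (1, 0)
assert np.array_equal(out, B.T)  # the result is the transpose of B
assert out.flags['C_CONTIGUOUS'] # so as_ctypes accepts it without error
```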

Test

import timeit
import numpy as np

# Sample data
n = 10000
A = np.zeros((n, n), dtype=np.int8)
A[0, 1] = 1

B = np.zeros((n, n), dtype=np.int8, order='F')
B[0, 1] = 1


def slow_conversion(arr):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(arr.T))


def fast_conversion(arr):
    return np.ctypeslib.as_ctypes(arr.flatten('F').reshape(arr.shape))


assert slow_conversion(A)[1][0] == 1
assert fast_conversion(B)[1][0] == 1

loops = 10
slow_result = timeit.timeit(lambda: slow_conversion(A), number=loops)
print(f'slow: {slow_result / loops}')

fast_result = timeit.timeit(lambda: fast_conversion(B), number=loops)
print(f'fast: {fast_result / loops}')

The test gives as output:

slow: 0.45553940839999996
fast: 0.02264067879987124

The fast version is therefore about 20 times faster than the slow version.
