A Numpy ndarray must have all elements of the same type, and all sub-arrays at the same level must be of the same length. These are also properties of C multidimensional arrays. Does the Numpy ndarray have these properties purely because it is implemented on top of C arrays? Are these properties really required to create a fast multidimensional array implementation?
CodePudding user response:
Is it the case that numpy ndarray have those properties purely because it is implemented on top of C array?
No. Using C internally is not really what causes Numpy to make this choice. Indeed, a Numpy array is internally just a raw contiguous memory buffer (allocated dynamically). Numpy does not actually use C arrays in its own implementation; it only uses C pointers. Views are Python objects referencing the buffer and holding some meta-information like the strides, shape, type, etc. Python users always operate on views, as raw buffers cannot be directly read/written.
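This separation between the buffer and the view metadata can be observed directly from Python; a minimal sketch (the concrete stride values assume 8-byte int64 items):

```python
import numpy as np

# A 2D array is one contiguous buffer; views only change the metadata.
a = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]], dtype=np.int64)

# Slicing creates a view: same underlying buffer, different shape/strides.
v = a[::2, 1:3]

print(a.strides)    # (32, 8): 4 items of 8 bytes per row, 8 bytes per item
print(v.strides)    # (64, 8): the view skips every other row
print(v.base is a)  # True: v references a's buffer, no copy is made

# Writing through the view modifies the shared buffer.
v[0, 0] = 100
print(a[0, 1])      # 100
```

Note that `v` never allocates new storage: only a small Python object holding the shape, strides, and a pointer into `a`'s buffer is created.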
In fact, it is not very difficult to create jagged arrays in C (ie. arrays containing arrays of variable size), but one needs to do that manually (ie. it is not directly supported by the standard). For a 2D jagged array, this is typically done by allocating an array of T* items and then performing N allocations for the N sub-arrays. The sub-arrays are not guaranteed to be stored contiguously.
The point is that jagged arrays are not efficient because of memory fragmentation/diffusion and non-contiguity. Additionally, many features provided by Numpy would not be (efficiently) possible with jagged arrays. For example, creating a strided sub-view of another view would be tricky. Operations working on multiple axes (eg. np.sum(2D_array, axis=0)) would have to be redefined so they make sense for jagged arrays. It would also make the implementation far more complex.
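To see why axis semantics break down, here is a small sketch contrasting the rectangular reduction with one possible (hand-rolled, hypothetical) redefinition for a jagged list of lists:

```python
import numpy as np

# On a rectangular array, reducing along an axis is well defined:
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(np.sum(a, axis=0))  # [5 7 9]: every column has exactly 2 items

# With a jagged structure, "axis=0" is ambiguous: columns have different
# lengths, so the semantics must be redefined (here: sum what exists).
jagged = [[1, 2, 3], [4, 5]]
width = max(len(row) for row in jagged)
col_sums = [sum(row[j] for row in jagged if j < len(row))
            for j in range(width)]
print(col_sums)           # [5, 7, 3]: the last column only has one item
```

Choosing "sum what exists" is only one of several defensible conventions (one could also raise an error or pad with a fill value), which is exactly the kind of design decision Numpy avoids by supporting only rectangular arrays.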
As a result, they chose not to implement jagged arrays but only ND-arrays. Please note that Numpy was initially created for scientists, especially physicists, who rarely need jagged arrays but care about high performance. Jagged arrays can be implemented relatively efficiently on top of Numpy by allocating 2 arrays: one array concatenating all the lines, and an offset array containing the start/end of each line.
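The two-array representation described above can be sketched as follows (the helper name `get_row` is mine):

```python
import numpy as np

# Represent a jagged array with two flat Numpy arrays:
rows = [[1, 2, 3], [4, 5], [6]]

data = np.concatenate([np.asarray(r) for r in rows])  # all items, contiguous
offsets = np.cumsum([0] + [len(r) for r in rows])     # row boundaries

def get_row(i):
    """Return row i as a view into the contiguous data buffer."""
    return data[offsets[i]:offsets[i + 1]]

print(data)        # [1 2 3 4 5 6]
print(offsets)     # [0 3 5 6]
print(get_row(1))  # [4 5]
```

This keeps all the items in one contiguous buffer (avoiding the fragmentation of per-row allocations) while still allowing variable-length rows; it is essentially the layout used by columnar formats such as Apache Arrow for list types.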
Are those properties really required to create a fast multidimensional array implementation?
Having homogeneous types is critical for performance. Dynamic typing forces a type check for each item, which is expensive. Additionally, dynamic typing often requires an additional expensive indirection (eg. the array only stores pointers/references and not the objects themselves), and accessing the objects causes memory fragmentation/diffusion. Such operations are very expensive compared to basic numerical ones like addition/subtraction/multiplication. Furthermore, the life cycle of the objects must certainly be carefully controlled (eg. garbage collection) like CPython does. In fact, a CPython list of lists behaves like that and is pretty inefficient. You can make a Numpy array of objects that are Numpy arrays, but this is also inefficient.
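The indirection can be made visible by comparing a homogeneous array with an object-dtype array of the same values; a minimal sketch (the itemsize check assumes a 64-bit platform):

```python
import numpy as np

# A homogeneous array stores the values themselves, packed in the buffer.
homogeneous = np.arange(1000, dtype=np.float64)
print(homogeneous.itemsize)  # 8: one float64 per slot

# An object array stores one CPython pointer per slot; every access must
# dereference it and check the pointed-to object's type.
boxed = homogeneous.astype(object)
print(boxed.dtype)           # object

# Both compute the same result, but the object version dispatches
# through Python objects element by element, which is far slower.
print(homogeneous.sum() == boxed.sum())  # True
```

Timing the two `sum` calls (eg. with `timeit`) typically shows the object version to be one to two orders of magnitude slower, though the exact factor depends on the machine and Numpy version.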
As for rectangular arrays, it depends on the use-case, but contiguity is at least critical for matrix multiplication and matrix-vector products (as BLAS operates on contiguous arrays, possibly with a stride between lines), as well as for operations not working on the most contiguous dimension (compilers can make more aggressive optimizations with a constant stride). Not to mention the additional overheads mentioned above (eg. additional checks and memory fragmentation/diffusion).
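Contiguity can be inspected and restored explicitly; a short sketch (stride values again assume 8-byte float64 items):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(a.flags['C_CONTIGUOUS'])  # True: rows are packed back to back

# A transpose is just a stride swap, so it is no longer C-contiguous;
# BLAS-backed routines may have to copy it or use a strided code path.
t = a.T
print(t.flags['C_CONTIGUOUS'])  # False
print(t.strides)                # (8, 32): the strides of `a`, swapped

# An explicit contiguous copy restores the fast layout.
c = np.ascontiguousarray(t)
print(c.flags['C_CONTIGUOUS'])  # True
```

This is why wrappers around BLAS/LAPACK often call `np.ascontiguousarray` (or accept Fortran order) before handing a buffer to the underlying routine.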