How is Pandas Block Manager improving performance?

The Pandas documentation says: "The primary benefit of the BlockManager is improved performance on certain operations (construction from a 2D array, binary operations, reductions across the columns), especially for wide DataFrames."

I thought I understood how the BlockManager improves performance thanks to a great article (https://uwekorn.com/2020/05/24/the-one-pandas-internal.html), but I realized there was a small mistake in the example.

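If I correct the mistake in the example:

import numpy as np

# Two integer arrays of equal length (128 * 1024 * 1024 elements each)
a1 = np.arange(128 * 1024 * 1024)
a2 = np.arange(128 * 1024 * 1024)
# Both arrays grouped into one 2D array (np.empty defaults to float64)
a_both = np.empty((2, a1.shape[0]))
a_both[0, :] = a1
a_both[1, :] = a2
# IPython magics: sum of two separate arrays vs. reduction over the grouped array
%timeit a1 + a2
%timeit np.sum(a_both, axis=0)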

# Result:
895 ms ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.09 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So grouping the data into a single NumPy array does not seem to improve performance here.

Is the Pandas BlockManager still improving performance in 2022? It would be great if someone could illustrate this "improved performance" with an example using NumPy (how grouping data, or using a specific memory layout, could improve performance).

CodePudding user response：

Long story short: you need to work with more than 20 columns to benefit from the BlockManager for column addition/multiplication.
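
Here is a rough NumPy-only sketch of why the number of columns matters. With one array per column, an element-wise addition takes one NumPy call per column, so the Python-level overhead grows with the column count; a consolidated 2D block takes a single call. The sizes (`n_rows`, `n_cols`) and the function names are my own illustrative choices, not pandas internals, and this does not try to reproduce the 20-column threshold. Timings will vary by machine.

```python
import timeit

import numpy as np

n_rows, n_cols = 1_000, 100  # hypothetical "wide" frame; adjust to experiment

rng = np.random.default_rng(0)
# One 2D block per "frame", shape (n_cols, n_rows): each column is one contiguous row of the block.
block_a = rng.random((n_cols, n_rows))
block_b = rng.random((n_cols, n_rows))

# The same data stored as independent 1D arrays, one per column.
cols_a = [block_a[i].copy() for i in range(n_cols)]
cols_b = [block_b[i].copy() for i in range(n_cols)]


def add_per_column():
    # One NumPy call per column: Python overhead scales with n_cols.
    return [a + b for a, b in zip(cols_a, cols_b)]


def add_block():
    # A single NumPy call for the whole block.
    return block_a + block_b


per_column_result = add_per_column()
block_result = add_block()

t_cols = timeit.timeit(add_per_column, number=200)
t_block = timeit.timeit(add_block, number=200)
print(f"per-column: {t_cols:.4f} s, single block: {t_block:.4f} s")
```

Try lowering `n_cols` to 2 (as in the question) and raising `n_rows`: the per-call overhead then becomes negligible compared to the arithmetic, which is consistent with the question's benchmark showing no gain.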

There's actually a great explanation in the pandas design documentation that I had missed:

What is BlockManager and why does it exist?

The reason for this is not really a memory layout issue (NumPy users know about how contiguous memory access produces much better performance) so much as a reliance on NumPy's two-dimensional array operations for carrying out pandas's computations. So, to do anything row oriented on an all-numeric DataFrame, pandas would concatenate all of the columns together (using numpy.vstack or numpy.hstack) then use array broadcasting or methods like ndarray.sum (combined with np.isnan to mind missing data) to carry out certain operations.
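
The row-oriented pattern described in that passage can be sketched with plain NumPy. This is only an illustration of the idea (stack the columns, mask missing values with `np.isnan`, reduce with `sum`); the column names and values are made up, and pandas' actual code paths are more involved.

```python
import numpy as np

# Three hypothetical numeric columns of equal length, one with a missing value.
col_x = np.array([1.0, 2.0, 3.0])
col_y = np.array([10.0, np.nan, 30.0])
col_z = np.array([100.0, 200.0, 300.0])

# Consolidate the columns into one 2D array of shape (n_cols, n_rows).
block = np.vstack([col_x, col_y, col_z])

# Row-oriented sum that skips missing values:
# replace NaNs with 0, then reduce over the column axis in one call.
missing = np.isnan(block)
row_sums = np.where(missing, 0.0, block).sum(axis=0)
print(row_sums)  # [111. 202. 333.]
```

Keeping same-dtype columns already consolidated in one block means the `np.vstack` copy does not have to be repeated for every such operation.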

Another motivation for the BlockManager was to be able to create DataFrame objects with zero copy from two-dimensional NumPy arrays.
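
As a NumPy-only analogy for the zero-copy point (not the pandas implementation itself): when the columns live inside one 2D array, each column can simply be a view into it, whereas building a 2D array out of separately allocated columns has to copy. Whether `pd.DataFrame(arr)` copies in practice depends on the pandas version and its `copy` argument, so the sketch below sticks to NumPy.

```python
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 rows, 3 columns

# Columns taken from the 2D array are views: no data is copied.
column_views = [arr[:, j] for j in range(arr.shape[1])]
print(all(np.shares_memory(arr, col) for col in column_views))  # True

# Writing through a view changes the original array, confirming shared memory.
column_views[0][0] = -1.0
print(arr[0, 0])  # -1.0

# Going the other way, from separate 1D arrays to a 2D array, allocates new memory.
separate = [col.copy() for col in column_views]
rebuilt = np.column_stack(separate)
print(any(np.shares_memory(rebuilt, col) for col in separate))  # False
```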

https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist