I want to normalize a vector. The easiest way to do it is
import numpy as np
v = np.random.rand(3)
v /= np.linalg.norm(v)
But I'm worried about the performance of my package: summing the squares (inevitable), taking the square root, and then dividing every component of the vector does not sound like a good idea.
Then I came across this question, whose solution uses sklearn.preprocessing.normalize
to do it. Unfortunately, that adds another requirement/dependency to my package.
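For reference, using sklearn it would look roughly like this (just a sketch; normalize expects a 2D array, so the 1D vector has to be wrapped and unwrapped):
import numpy as np
from sklearn.preprocessing import normalize

v = np.random.rand(3)
# normalize() works on 2D arrays (one row per sample), so reshape the vector
# into a single row, normalize it, and take the row back out.
v = normalize(v.reshape(1, -1), norm="l2")[0]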
Here are my questions:
- Shouldn't there be a numpy function to do that, one that uses the fast inverse square root algorithm? Or is it outside numpy's scope, and there shouldn't be a function like that?
- Should I implement my own function in cython/numba?
- Or, if I'm so worried about performance, should I give up python and start writing the code in C/C++?
CodePudding user response:
The best solution you can achieve has a time complexity of O(2n), which reduces to O(n), i.e. linear time. The way to proceed is the following:
1- First you calculate the magnitude of the vector and its numerical inverse, which inevitably requires a full pass over the data regardless of the performance of the programming language. Up to this point we have an O(n) algorithm.
2- All the components of the vector are then multiplied by the calculated inverse magnitude of the original vector, resulting in its normalization. As we traverse the entire vector again, this adds another O(n) term to the algorithm's complexity. Adding both terms, we get O(n) + O(n) = O(2n) = O(n).
import numpy as np

v = np.random.rand(3)
# Pass 1: sum of squares, then the power -1/2 gives the inverse magnitude
mod = (sum([i**2 for i in v]))**-0.5
print(v, mod)
# Pass 2: scale every component by the inverse magnitude
v *= mod
print(v)
However, this task does not seem to require a high degree of optimization and efficiency. Nevertheless, I will take a closer look at it in order to find a better way of approaching this problem; maybe other programming resources, like recursion or advanced data structures, can slightly decrease its running time.
CodePudding user response:
Shouldn't there be a numpy function to do that, one that uses the fast inverse square root algorithm? Or is it outside numpy's scope and there shouldn't be a function like that?
I am not aware of any function in Numpy doing exactly that. Multiple function calls are needed in pure Numpy. sklearn.preprocessing.normalize
is indeed a good alternative (and AFAIK not the only package to provide that).
The thing is that Numpy is not designed to compute small arrays efficiently. The overhead of a Numpy call is huge for small arrays (like one with only 3 values), and combining multiple function calls only makes it worse. The overhead basically comes from type/shape/value checking, internal function calls, the CPython interpreter, as well as the allocation of new arrays. Thus, even if Numpy provided exactly the function you want, it would be slow for an array with only 3 items.
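As a rough way to see that overhead yourself (just a sketch; the exact numbers are machine-dependent and the helper names are only for illustration), you can time the Numpy one-liner against a plain-Python version on a 3-item vector:
import math
import timeit
import numpy as np

v = np.random.rand(3)
lst = list(v)

def normalize_numpy(v):
    # Several Numpy calls, each paying a fixed per-call overhead
    return v / np.linalg.norm(v)

def normalize_pure_python(x):
    inv = 1.0 / math.sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2])
    return [x[0]*inv, x[1]*inv, x[2]*inv]

# On such tiny inputs, the pure-Python version is often faster because the
# per-call overhead dominates the actual arithmetic.
print(timeit.timeit(lambda: normalize_numpy(v), number=100_000))
print(timeit.timeit(lambda: normalize_pure_python(lst), number=100_000))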
Should I implement my own function in cython/numba?
This is a good idea since Numba can do that with a much smaller overhead. Note that Numba function calls still have a small overhead, but calling them from a Numba context is very cheap (native calls).
For example, you can use:
import numba as nb
import numpy as np

# Notes:
# - The explicit signature causes an eager compilation
# - ::1 means the axis is contiguous (generates faster code)
@nb.njit('(float64[::1],)')
def normalize(v):
    s = 0.0
    for i in range(v.size):
        s += v[i] * v[i]           # sum of squares
    inv_norm = 1.0 / np.sqrt(s)    # inverse of the magnitude
    for i in range(v.size):
        v[i] *= inv_norm           # in-place scaling
This function does not allocate any new array since it works in-place. Moreover, Numba only performs a minimal amount of checks in the wrapping function. The loops are very fast, but they can be made even faster if you replace v.size
by the actual size (e.g. 3), because the JIT can then unroll the loops and generate nearly optimal code. np.sqrt
will be inlined and should generate a fast square-root FP instruction. If you use the flag fastmath=True
, the JIT may even be able to compute the reciprocal square root using a dedicated, faster instruction on x86-64 platforms (note that fastmath
is unsafe if you use special values like NaN or care about FP associativity). Still, the overhead of calling this function is likely 100-300 ns on mainstream machines for very small vectors: the CPython wrapping function has a significant overhead. The only way to remove it is to call the function from Numba/Cython code. If you need to do that in most of your project, then writing the code directly in C/C++ is certainly better.
Or, if I'm so worried about performance, should I give up python and start writing the code in C/C++?
It depends on your overall project, but if you want to manipulate many small vectors like this, it is much more efficient to use C/C++ directly. The alternative is to use Numba or Cython for the kernels that are currently slow.
The performance of well-optimized Numba or Cython code can be very close to that of natively compiled C/C++ code. For example, I once succeeded in outperforming the heavily-optimized OpenBLAS code with Numba (thanks to specialization). One of the main sources of overhead in Numba is array bound checking (which can often be optimized out, depending on the loops). C/C++ is lower-level, so you do not pay any hidden cost, but the code can be harder to maintain. Additionally, you can apply lower-level optimizations that are not even possible in Numba/Cython (e.g. using SIMD intrinsics or assembly instructions directly, or generating specialized code with templates).
CodePudding user response:
As written by @mkrieger1, write your code first and test whether the performance is acceptable for your work. If not, start trying out the solutions you mentioned and compare the differences on your device. In general, the hardware the code runs on can have an impact on the performance differences, meaning that not every library is necessarily faster than the others on every device.
If you really want to go all in, I suggest executing your code in parallel, possibly even on the GPU. You can use pytorch for this, or write code in C and CUDA (or similar) to get the most out of the system. It is important, however, that summing up the squared elements is then also done in parallel (as a reduction).
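As a rough illustration (assuming PyTorch is available; the batch size and device selection are only examples), normalizing a whole batch of vectors at once on the GPU could look like this:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# One million 3-component vectors, normalized along dim=1 in a single call;
# the per-row norm is computed internally on the chosen device.
x = torch.rand(1_000_000, 3, device=device)
x = torch.nn.functional.normalize(x, p=2.0, dim=1)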
Finally, if you stick with Python, take a look at Python 3.11, as this release improved Python's performance in some areas.