Rewriting MATLAB code with nested loops in Python with fast execution


Here is a nested loop, where the inner loop's indices depend on the outer one's, with the following parameters:

f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;

In MATLAB:

sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;

for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
        count(i)=count(i)+1;
    end
    sf2(i)=sf2(i)/count(i);
end

In Python:

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

With cython:

import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

Execution times:

Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec

I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.

CodePudding user response:

Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least N*(N-1)/2 doubles due to the use of pdist. For your N = 70299 this means about 18.5 GB of doubles in an array. Other index arrays will have similar size. So unless some of your use cases have smaller N, the vectorised approach in this answer is only feasible if you have a lot of memory.
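For reference, here is the back-of-the-envelope arithmetic behind that figure (a quick sanity check of my own, in plain Python):

N = 70299
n_pairs = N * (N - 1) // 2       # unique pairs stored by pdist: 2_470_939_551
n_pairs * 8 / 2**30              # as 8-byte doubles: roughly 18.4 GiB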


As others have noted, merely translating code from one language to another will not lead to optimal code in either language. And using Cython alone is no guarantee of a speed-up, in the same way that using NumPy alone is no guarantee of a speed-up.

norok2's answer nicely shows you how you can use numba or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.

Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute pairwise squared distances among a set of 1d points. This is what scipy.spatial.distance.pdist can do for you (with the 'sqeuclidean' squared Euclidean metric). The upside is that it only computes each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up is a bit more cumbersome.
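To illustrate the condensed output of pdist on 1d data, here is a tiny example of my own (not from the original question): each unique pair (i, j) with i < j appears exactly once, ordered by the first index.

from scipy.spatial.distance import pdist
import numpy as np

f = np.array([0.0, 1.0, 3.0])
# condensed pair order: (0, 1), (0, 2), (1, 2)
pdist(f[:, None], metric='sqeuclidean')   # -> array([1., 9., 4.])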

Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses np.arange(1, N-i) rather than np.arange(1, N-i-1)):

from scipy.spatial.distance import pdist

def pdisty(f, nech):
    offset = f.size - nech
    res = np.zeros_like(f, shape=nech)
    dists = pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(offset, f.size)[::-1]
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= counts
    return res

What happens here is that

  1. we compute the pairwise distance of each unique pair of array values and store it to dists
  2. we compute the number of pairs each point is involved in (this is what we'll have to normalise at the end), store it to counts
  3. figure out which output index each value in dists contributes to (this is the hard part), store it in inds
  4. use np.add.at to accumulate each contribution to the appropriate output index
  5. normalise with the counts.
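To make steps 2-4 concrete, here is what the intermediate arrays look like for a tiny input (a worked toy example of my own, with f.size = 5 and nech = 3):

import numpy as np

f = np.arange(5.0)                           # N = 5
nech = 3
offset = f.size - nech                       # 2
counts = np.arange(offset, f.size)[::-1]     # array([4, 3, 2]): pairs per output index
inds = np.repeat(np.arange(nech), counts)    # array([0, 0, 0, 0, 1, 1, 1, 2, 2])
# pdist's condensed order is (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4),
# so dists[:inds.size] covers exactly the pairs whose first index is 0, 1 or 2,
# and np.add.at sums each of them into the matching entry of res.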

Here are some timings for N = 1000, where func2() is the corresponding function from norok2's answer:

>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If additional dependencies or getting LLVM installed on your system is an issue, this might be a reasonable compromise. The bottom line is that code should be adapted to the language in which you're trying to optimise it.


For completeness' sake here are the implementations I used for the comparison (I slightly changed the signatures because N can be computed from the input array, and I fixed some fencepost errors):

def structFunPython(f, nech):
    """Very slightly modified from the question"""
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        indN = np.arange(1,N-i)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2


def func2(f_arr, nech):
    """Very slightly modified from norok2's answer

    See https://stackoverflow.com/a/71704834/5067311

    """
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i - 1)
    return sf2

With these definitions all three functions give the same result within machine precision:

rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))

CodePudding user response:

The way I would rewrite the MATLAB code as equivalent Python code could be (beware of indices starting at 1 in MATLAB and at 0 in Python... since I cannot know how this should be adapted without context I went with the simplest approach):

import numpy as np


def func(f_arr, nech, n):
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
            count[i] += 1
        sf2[i] /= count[i]
    return sf2

Note that count[i] += 1 is useless as the final value of count[i] is known in advance, and actually the whole count is useless, e.g.:

import numpy as np


def func2(f_arr, nech, n):
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i)
    return sf2

Speed-up

This is a textbook case for a Numba speed-up: it is just as easy as adding/using a Numba @njit decorator:

import numba as nb


func_nb = nb.njit(func)
func2_nb = nb.njit(func2)

Now, func, func_nb, func2 and func2_nb all perform the same computation:

nech = n = 6
f_arr = np.arange(n)
print(func(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

If you really need to stick to Cython, here is an implementation based on func2:

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np
import cython as cy


cpdef func2_cy(f_arr, nech, n):
    sf2 = np.zeros(nech)
    _func2_cy(f_arr.astype(np.float_), sf2, nech, n)
    return sf2


cdef _func2_cy(double[:] f_arr, double[:] sf2, cy.int nech, cy.int n):
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] = sf2[i] + (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] = sf2[i] / (n - i)

This is significantly more complicated to write compared to Numba, but it achieves similar performance. The trick is to have a function _func2_cy which operates with little to no interaction with Python (read: it goes at C speed).

Again the result is the same as func2:

nech = n = 6
f_arr = np.arange(n)
print(func2_cy(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

Timings

With some little toy benchmarking we get a feeling of the speed-ups, including a comparable function written as you did, and the vectorized solution proposed in Andras Deak's very fine answer (but fixing the indices to match the above):

def func_OP(f, nech, n):
    sf2 = np.zeros(n)
    count = np.zeros(n)
    for i in range(nech):
        indN = np.arange(n - i)  # <-- indexing fixed
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i] / count[i]
    return sf2


func_OP_nb = nb.njit(func_OP)


def func_pdisty(f, nech, n):
    res = np.zeros(nech)
    dists = scipy.spatial.distance.pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(n - 1, n - nech - 1, -1)
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= (counts + 1)
    return res


nech = n = 6
f_arr = np.arange(n)
print(func_OP(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_pdisty(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
nech = n = 1000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 1.5 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 567 ms per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 352 ms per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.87 ms per loop
%timeit func_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.7 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 1000 loops, best of 5: 768 µs per loop
%timeit func_pdisty(f_arr, nech, n)
# 10 loops, best of 5: 44.5 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 1000 loops, best of 5: 1 ms per loop
nech = n = 2000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 6.01 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 2.3 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 1.42 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 100 loops, best of 5: 7.31 ms per loop
%timeit func_nb(f_arr, nech, n)
# 100 loops, best of 5: 6.82 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 3.05 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 344 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 3.95 ms per loop
nech = n = 4000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 24.3 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 9.27 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 5.71 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 10 loops, best of 5: 29 ms per loop
%timeit func_nb(f_arr, nech, n)
# 10 loops, best of 5: 27.3 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 12.2 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 706 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 15.9 ms per loop

...and with the input sizes you provided:

nech = 24 * 30 * 24
n = 70299
f_arr = np.random.random(n)
%timeit -n1 -r1 func_OP(f_arr, nech, n)
# 1 loop, best of 1: 1h 4min 50s per loop
%timeit -n1 -r1 func(f_arr, nech, n)
# 1 loop, best of 1: 21min 14s per loop
%timeit -n1 -r1 func2(f_arr, nech, n)  # only one run / loop
# 1 loop, best of 1: 13min 9s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1 loop, best of 5: 4.74 s per loop
%timeit func_nb(f_arr, nech, n)
# 1 loop, best of 5: 4 s per loop
%timeit func2_nb(f_arr, nech, n)
# 1 loop, best of 5: 1.62 s per loop
# %timeit func_pdisty(f_arr, nech, n)
# -- MEMORY ERROR --
%timeit func2_cy(f_arr, nech, n)
# 1 loop, best of 5: 2.2 s per loop

CodePudding user response:

You already got some nice answers that presented fast numpy and numba implementations. I'd like to present a decent Cython implementation for a fair comparison.

First of all, let's time norok2's fastest numba implementation on my machine:

In [3]: %timeit func2_nb(f_arr, n)
3.4 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Before writing our own Cython version, two points are worth noting:

  • In order to obtain fast memory access in Cython, it's crucial to give it all the necessary information about the input arrays. In your case, f_arr is a C-contiguous np.ndarray, so we use a C-contiguous memoryview double[::1] instead of a normal memoryview double[:]. The difference is that indexing a C-contiguous memoryview reduces to plain C code f_arr[i], while indexing the latter involves stride arithmetic along the lines of f_arr[i * f_arr.strides[0]].

  • Next, it's worth mentioning that the Python power operation a**2 gets translated to the C code pow(a, 2), i.e. we call the pow function. Calling a function in tight loops adds unnecessary overhead, even in C. In C we would just write a*a, so let's do the same in Cython.

In Code:

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func2_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef int i, j
    for i in range(nech):
        for j in range(n-i-2):
            sf2[i] += (f_arr[i+j+1]-f_arr[i])*(f_arr[i+j+1]-f_arr[i])
        sf2[i] /= (n - i - 2)
    return np.asarray(sf2)

Compiled on my machine with Apple Clang 13.1.6 on macOS, this is slower than the aforementioned numba implementation:

In [5]: %timeit func2_cy(f_arr, n)
4.2 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, one should be aware that Cython is basically just a Python to C compiler. This means that after we removed all the python interactions, we can try the same performance tricks we'd apply when using C code: passing the optimization flag -O3 and enabling all CPU instructions available on the current platform -march=native. Note that this also implies that the performance of good Cython code (i.e. code with no python interactions in tight loops) heavily depends on your C Compiler. In my experience, numba is often faster than Cython on Windows due to MSVC's bad auto-vectorization. On macOS/Linux and gcc or clang, it's often the other way around.
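Outside a notebook, the same compiler flags can be passed through a regular setuptools build. Here is a minimal sketch of my own (module and file names are hypothetical; the numpy include directory is needed because of the cimport numpy):

# setup.py (hypothetical names; build with: python setup.py build_ext --inplace)
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "structfun",
    sources=["structfun.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-O3", "-march=native"],  # for MSVC use /Ox and /arch:AVX2
)

setup(ext_modules=cythonize(ext, compiler_directives={"language_level": 3}))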

One of these well-known performance tricks is loop-unrolling in order to give the compiler hints to SIMD-vectorize the specific loop. Unrolling the innermost loop, the function looks like this:

%%cython -c=-O3 -c=-march=native
# for MSVC use: -c=/Ox -c=/arch:AVX2
#           or: -c=/Ox -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def func3_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in range(nech):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

On my machine, this is nearly two times faster than numba:

In [7]: %timeit func3_cy(f_arr, n)
1.7 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To take it one step further, we can parallelize the outer loop with the help of prange which is just a thin wrapper around OpenMP:

%%cython -c=-O3 -c=-march=native -c=-fopenmp --link-args=-fopenmp
# for MSVC use: -c=/Ox -c=/openmp -c=/arch:AVX2
#           or: -c=/Ox -c=/openmp -c=/arch:AVX512 if your CPU supports AVX512 

#cython: cdivision=True

import numpy as np
cimport numpy as np
cimport cython
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)
def func4_cy(double[::1] f_arr, int nech):
    cdef int n = f_arr.size
    cdef double[::1] sf2 = np.zeros(nech)
    cdef double fi, fij, fij1, fij2, fij3
    cdef int i, j, ntmp
    for i in prange(nech, nogil=True, schedule="static", chunksize=1):
        # unroll the inner loop for better SIMD vectorization
        j = 1
        ntmp = (n-i-2) - ((n-i-2) % 4)
        # we use a while-loop since Cython doesn't support range(j, ntmp, 4)
        while j < ntmp:
            # unpack the values and hope the CPU keeps them in its register
            fi   = f_arr[i]
            fij  = f_arr[i+j+1]
            fij1 = f_arr[i+j+2]
            fij2 = f_arr[i+j+3]
            fij3 = f_arr[i+j+4]
            sf2[i] += ((fij-fi)*(fij-fi) + (fij1-fi)*(fij1-fi) + (fij2-fi)*(fij2-fi) + (fij3-fi)*(fij3-fi))
            j += 4
        for j in range(ntmp, n - i - 2):
            sf2[i] += (f_arr[i+j+1] - f_arr[i]) * (f_arr[i+j+1] - f_arr[i])
        sf2[i] /= (n-i-2)
    return np.asarray(sf2)

On my machine with 8 CPU Cores, this is by far the fastest implementation:

In [9]: %timeit func4_cy(f_arr, n)
329 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, it's worth mentioning that Numba also supports thread parallelism, so I'd expect a similar performance with Numba.
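For completeness, the Numba counterpart could look roughly like this (a sketch of my own based on func2 from norok2's answer, using numba.prange; not benchmarked here):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def func2_nb_par(f_arr, nech):
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in nb.prange(nech):        # the outer loop is distributed across threads
        acc = 0.0
        for j in range(1, n - i):
            d = f_arr[i + j] - f_arr[i]
            acc += d * d
        sf2[i] = acc / (n - i - 1)
    return sf2

The number of threads can be adjusted with nb.set_num_threads if needed.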

CodePudding user response:

With some basic python/numpy rewrites I can speed up your code quite a bit

def structFunPython4(f,N,nech):
    sf2 = np.zeros(N)
    #count = np.zeros(N)
    for i in range(nech):
        # indN = np.arange(1,N-i-1)
        sf2[i] = np.sum(np.power(f[i+1:N-1]-f[i],2))
        #for j in range(1,N-i-1):
        #    sf2[i] += np.power((f[i+j]-f[i]),2)
        #    #count[i] += 1
        sf2[i] = sf2[i]/(N-i-2)
    return sf2

For a modest sample size:

In [53]: f = np.arange(100); N=f.shape[0]; nech=98

they match:

In [54]: np.allclose(structFunPython(f,N,nech),structFunPython4(f,N,nech))
Out[54]: True

and timings:

In [55]: timeit structFunPython(f,N,nech)
34.4 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [56]: timeit structFunPython4(f,N,nech)
2.22 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I first replaced for j in indN: with for j in range(1,N-1-i):. Iteration on an array is slower than iteration on a list or range.

As others noted we don't need to iterate the count.

But the big change is to replace the j iteration with an array slice, power on the whole slice and array sum.

I haven't looked at the i iteration enough to eliminate that. The f[i+1:N-1] slice varies in length, from nech down to 0.
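If one did want to eliminate the i loop as well, here is a sketch of my own (not part of the original answer; it assumes nech <= N-2): expand (f[i+j]-f[i])**2 and reuse suffix sums of f and f**2. Beware that subtracting large partial sums can cost some floating-point precision.

import numpy as np

def structFunPython5(f, N, nech):
    # suffix sums over indices 1..N-2, so tail_sum[i] == f[i+1:N-1].sum()
    g = f[1:N-1]
    tail_sum = np.cumsum(g[::-1])[::-1]
    tail_sq = np.cumsum((g * g)[::-1])[::-1]
    i = np.arange(nech)
    m = (N - 2) - i                  # number of terms in each slice f[i+1:N-1]
    # sum over j of (f[i+j] - f[i])**2 = sum(f**2) - 2*f[i]*sum(f) + m*f[i]**2 over the slice
    sf2 = tail_sq[i] - 2.0 * f[i] * tail_sum[i] + m * f[i] ** 2
    return sf2 / m

It should agree with structFunPython4 up to floating-point rounding, and the whole computation is O(N) instead of O(N*nech).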

MATLAB does a lot of jit compilation, so you can get by with iterations. Back in the 1990s when I worked in MATLAB, that would have been bad form - those versions required whole-matrix calculations to get any reasonable speed. Python-level iteration is slow (interpreted), and slower on arrays than on lists. Try to use whole-array methods where possible. Or use numba to compile the calculations.
