Here is a nested loop, where the inner one indices depend on the outer one, with the following parameters:

f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;

In MATLAB:

sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;

for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i) (f(j i)-f(j))^2;
        count(i)=count(i) 1; 
    end
    sf2(i)=sf2(i)/count(i);
end

In Python:

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i]  = np.power((f[i j]-f[i]),2)
            count[i]  = 1
        sf2[i] = sf2[i]/count[i]
    return sf2

With cython:

import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i]  = np.power((f[i j]-f[i]),2)
            count[i]  = 1
        sf2[i] = sf2[i]/count[i]
    return sf2

Executions times:

Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec

I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.

CodePudding user response：

Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least N*(N-1)/2 doubles due to the use of pdist. For your N = 70299 this means about 18.5 GB of doubles in an array. Other index arrays will have similar size. So unless some of your use cases have smaller N, the vectorised approach in this answer is only feasible if you have a lot of memory.

As others have noted, just translating code from one language to another will not lead to optimal code in both languages. And using Cython alone is no guarantee for speed-up, in the same way that using NumPy alone is no guarantee for speed-up.

norok2's answer nicely shows you how you can use numba or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.

Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute pairwise squared distances among a set of 1d points. This is what scipy.spatial.distance.pdist can do for you (with the 'sqeuclidean' squared Euclidean norm). The upside is that it only compute each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up a bit more cumbersome.

Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses np.arange(1, N-i) rather than np.arange(1, N-i-1)):

from scipy.spatial.distance import pdist

def pdisty(f, nech):
    offset = f.size - nech
    res = np.zeros_like(f, shape=nech)
    dists = pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(offset, f.size)[::-1]
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= counts
    return res

What happens here is that

we compute the pairwise distance of each unique pair of array values and store it to dists
we compute the number of pairs each point is involved in (this is what we'll have to normalise at the end), store it to counts
figure out the 1d index of each value in dists that it corresponds to (this is the hard part), store it in inds
use np.add.at to accumulate each contribution to the appropriate output index
normalise with the counts.

Here are some timings for N = 1000, where func2() is the corresponding function from norok2's answer:

>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If you have an issue with additional dependencies or getting llvm installed on your system, this might be a reasonable compromise. The bottom line is that code should be adapted to the language you're trying to optimise it in.

For completeness' sake here are the implementations I used for the comparison (I slightly changed the signatures because N can be computed from the input array, and I fixed some fencepost errors):

def structFunPython(f, nech):
    """Very slightly modified from the question"""
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        indN = np.arange(1,N-i)
        for j in indN:
            sf2[i]  = np.power((f[i j]-f[i]),2)
            count[i]  = 1
        sf2[i] = sf2[i]/count[i]
    return sf2


def func2(f_arr, nech):
    """Very slightly modified from norok2's answer

    See https://stackoverflow.com/a/71704834/5067311

    """
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i]  = (f_arr[i   j] - f_arr[i]) ** 2
        sf2[i] /= (n - i - 1)
    return sf2

With these definitions all three functions give the same result within machine precision:

rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))

CodePudding user response：

The way I would rewrite the MATLAB code as equivalent Python code could be (beware of indices starting at 1 in MATLAB and at 0 in Python... since I cannot know how this should be adapted without context I went with the simplest approach):

import numpy as np


def func(f_arr, nech, n):
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i]  = (f_arr[i   j] - f_arr[i]) ** 2
            count[i]  = 1
        sf2[i] /= count[i]
    return sf2

Note that count[i] = 1 is useless as the final value of count[i] is known in advance, and actually the whole count is useless, e.g.:

import numpy as np


def func2(f_arr, nech, n):
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(n - i):
            sf2[i]  = (f_arr[i   j] - f_arr[i]) ** 2
        sf2[i] /= (n - i)
    return sf2

Speed-up

This a manual case of Numba speed up. This is just as easy as adding/using a Numba @njit decorator:

import numba as nb


func_nb = nb.njit(func)
func2_nb = nb.njit(func2)

Now, func, func_nb, func2 and func2_nb all perform the same computation:

nech = n = 6
f_arr = np.arange(n)
print(func(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func2_nb(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

If you really need to stick to Cython, here is an implementation based on func2:

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np
import cython as cy


cpdef func2_cy(f_arr, nech, n):
    sf2 = np.zeros(nech)
    _func2_cy(f_arr.astype(np.float_), sf2, nech, n)
    return sf2


cdef _func2_cy(double[:] f_arr, double[:] sf2, cy.int nech, cy.int n):
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] = sf2[i]   (f_arr[i   j] - f_arr[i]) ** 2
        sf2[i] = sf2[i] / (n - i)

This is significantly more complicated to write compared to Numba, but achieves similar performances. The trick is to have a function _func2_cy which operates with little to no interaction with Python (read: it goes at C speed).

Again the result is the same as func2:

nech = n = 6
f_arr = np.arange(n)
print(func2_cy(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

Timings

With some little toy benchmarking we get a feeling of the speed-ups, including a comparable function written as you did, and the vectorized solution proposed in Andras Deak's very fine answer (but fixing the indices to match the above):

def func_OP(f, nech, n):
    sf2 = np.zeros(n)
    count = np.zeros(n)
    for i in range(nech):
        indN = np.arange(n - i)  # <-- indexing fixed
        for j in indN:
            sf2[i]  = np.power((f[i j]-f[i]),2)
            count[i]  = 1
        sf2[i] = sf2[i] / count[i]
    return sf2


func_OP_nb = nb.njit(func_OP)

def func_pdisty(f, nech, n):
    res = np.zeros(nech)
    dists = scipy.spatial.distance.pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(n - 1, n - nech - 1, -1)
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= (counts   1)
    return res

nech = n = 6
f_arr = np.arange(n)
print(func_OP(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]
print(func_pdisty(f_arr, nech, n))
# [9.16666667 6.         3.5        1.66666667 0.5        0.        ]

nech = n = 1000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 1.5 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 567 ms per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 352 ms per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.87 ms per loop
%timeit func_nb(f_arr, nech, n)
# 1000 loops, best of 5: 1.7 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 1000 loops, best of 5: 768 µs per loop
%timeit func_pdisty(f_arr, nech, n)
# 10 loops, best of 5: 44.5 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 1000 loops, best of 5: 1 ms per loop

nech = n = 2000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 6.01 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 2.3 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 1.42 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 100 loops, best of 5: 7.31 ms per loop
%timeit func_nb(f_arr, nech, n)
# 100 loops, best of 5: 6.82 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 3.05 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 344 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 3.95 ms per loop

nech = n = 4000
f_arr = np.arange(n)
%timeit func_OP(f_arr, nech, n)
# 1 loop, best of 5: 24.3 s per loop
%timeit func(f_arr, nech, n)
# 1 loop, best of 5: 9.27 s per loop
%timeit func2(f_arr, nech, n)
# 1 loop, best of 5: 5.71 s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 10 loops, best of 5: 29 ms per loop
%timeit func_nb(f_arr, nech, n)
# 10 loops, best of 5: 27.3 ms per loop
%timeit func2_nb(f_arr, nech, n)
# 100 loops, best of 5: 12.2 ms per loop
%timeit func_pdisty(f_arr, nech, n)
# 1 loop, best of 5: 706 ms per loop
%timeit func2_cy(f_arr, nech, n)
# 100 loops, best of 5: 15.9 ms per loop

...and with the input sizes you provided:

nech = 24 * 30 * 24
n = 70299
f_arr = np.random.random(n)
%timeit -n1 -r1 func_OP(f_arr, nech, n)
# 1 loop, best of 1: 1h 4min 50s per loop
%timeit -n1 -r1 func(f_arr, nech, n)
# 1 loop, best of 1: 21min 14s per loop
%timeit -n1 -r1 func2(f_arr, nech, n)  # only one run / loop
# 1 loop, best of 1: 13min 9s per loop
%timeit func_OP_nb(f_arr, nech, n)
# 1 loop, best of 5: 4.74 s per loop
%timeit func_nb(f_arr, nech, n)
# 1 loop, best of 5: 4 s per loop
%timeit func2_nb(f_arr, nech, n)
# 1 loop, best of 5: 1.62 s per loop
# %timeit func_pdisty(f_arr, nech, n)
# -- MEMORY ERROR --
%timeit func2_cy(f_arr, nech, n)
# 1 loop, best of 5: 2.2 s per loop