Slicing along NumPy ndarray axis with varying indices


Given an ndarray ar of shape (n, m), I want to "extract" subsequences of length k along axis 1, with k < m. With a known starting index start for the subsequences, this can be solved with new_ar = ar[:, start:end] (or just start:start+k).

However, what if I have a start_list and an end_list, each of length n (or just the start_list, since the length of the subsequence is known anyway), containing the starting (and ending) indices of the subsequences I want to extract? Intuitively I tried ar[:, start_list:end_list], but this throws TypeError: slice indices must be integers or None or have an __index__ method.

What would be a solution to this problem without loops, leveraging NumPy's methods? In my case the for loop took 30 minutes, but this ought to have a NumPy-style millisecond solution, since it's just indexing.

[edit]: Since the problem is probably better understood with code (thank you for the hints), I'll state more compactly what I want and show how I solved it with a loop.

I have an ndarray of shape (40450, 200000), representing 40450 signals of length 200000 each. The signals are shifted and I want to align them. So I want to extract subsequences of length say 190000 from each of the 40450 sequences. For this, I have a list start_list of length 40450, containing the starting indices for the subsequences (each of the 40450 subsequences I want to extract has a different starting point in the original sequence of length 200000).

I can solve this with a for loop (ar contains the original sequences, start_list the starting indices):

k = 190000
ar_new = np.zeros((40450, k))
for i in range(ar_new.shape[0]):
    ar_new[i] = ar[i, start_list[i]:start_list[i]+k]

If e.g. start_list[0] is 0, this means that I need ar[0, 0:190000]; if start_list[10000] is 1337, I need ar[10000, 1337:1337+190000], etc.

But this takes >30 mins for my case and I am sure it can somehow be solved with NumPy built-in methods/some slicing magic.

CodePudding user response:

After some trials

In [14]: a = np.array(range(200000), dtype=float)
    ...: b = np.array(range(200000), dtype=float)
    ...: s, k = 100, 190000

In [15]: %timeit for _ in range(1000): a[:k] = a[s:s+k]
26.4 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit for _ in range(1000): b[:k] = a[s:s+k]
44.8 ms ± 902 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I am thinking ① if you can do without the non-aligned data, overwriting in place seems FASTER; ② at any rate, if the process fits in RAM, I'd expect the result in the range of 1 to 10 seconds, not 30 minutes; ③ if your problem is swapping, overwriting avoids allocating approximately 4*4E4*2E5 ⇒ 32E9 bytes of memory.
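Applied to the 2-D case from the question, the in-place idea above can be sketched as follows (a minimal sketch with made-up small dimensions; the per-row loop remains, but no (n, k) output array is allocated, each row is shifted within the original buffer):

```python
import numpy as np

# Toy dimensions standing in for (40450, 200000) and k = 190000
n, m, k = 4, 10, 6
ar = np.arange(n * m, dtype=float).reshape(n, m)
start_list = [0, 2, 1, 3]

# Shift each row's window of length k to the front of that row, in place.
# Destination a[i, :k] starts at or before source a[i, s:s+k], so the
# overlapping copy is safe.
for i, s in enumerate(start_list):
    ar[i, :k] = ar[i, s:s + k]

ar_new = ar[:, :k]  # a view on the aligned data, no extra copy
```

The trade-off: the original (non-aligned) data is destroyed, in exchange for avoiding a second ~32 GB allocation.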

CodePudding user response:

import numpy as np
from numpy.lib.stride_tricks import as_strided

# test data
n, m = 5, 10
arr = np.arange(n*m).reshape(n, m)
k = 5
start_list = [0, 1, 2, 1, 0]

# main code
n, m = arr.shape
isize = arr.dtype.itemsize
x = 1 + m - k    # a supporting intermediate dimension

assert k < m
assert len(start_list) == n
assert all(0 <= i < x for i in start_list)

arr_modified = as_strided(arr, shape=(n,x,k), strides=(m*isize, isize, isize))
arr_new = arr_modified[[*range(n)], start_list]
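For comparison, the same result can be obtained without as_strided, using broadcasted fancy indexing: pair each row index with its own window of k column indices (a sketch using the same toy data as the snippet above):

```python
import numpy as np

# Same toy data as above
n, m = 5, 10
arr = np.arange(n * m).reshape(n, m)
k = 5
start_list = np.array([0, 1, 2, 1, 0])

# rows: (n, 1) row indices; cols: (n, k) per-row column windows.
# Broadcasting them together selects arr[i, start_list[i]:start_list[i]+k]
# for every row i in one vectorized operation.
rows = np.arange(n)[:, None]
cols = start_list[:, None] + np.arange(k)
arr_new = arr[rows, cols]          # shape (n, k)
```

Unlike the as_strided route, this allocates only the (n, k) output and never materializes the intermediate (n, x, k) view, at the cost of building the (n, k) index array.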