Given an ndarray ar of shape (n, m), I want to "extract" subsequences of length k (with k < m) along axis 1. If the starting index start is known and identical for every row, this can be solved with new_ar = ar[:, start:end] (or just ar[:, start:start+k]).
However, what if I have a start_list and an end_list of length n (or just the start_list, since the length of the subsequences is known anyway), containing the starting indices (and ending indices) of the subsequences I want to extract? Intuitively I tried ar[:, start_list:end_list], but this throws TypeError: slice indices must be integers or None or have an __index__ method.
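A minimal reproduction of the error, on toy sizes rather than the real data:

```python
import numpy as np

# Toy version of the problem: n rows of length m, one start index per row.
n, m, k = 3, 10, 4
ar = np.arange(n * m).reshape(n, m)
start_list = [0, 2, 5]

# Slicing with lists is rejected -- slice bounds must be single integers:
try:
    ar[:, start_list:[s + k for s in start_list]]
except TypeError as e:
    err = e  # "slice indices must be integers or None or have an __index__ method"
```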
What would be a solution to this problem without loops, leveraging NumPy's methods? In my case the for loop took 30 minutes, but since this is just indexing there should be a NumPy-style solution in the millisecond range.
[edit]: Since the problem is probably easier to understand with code (thank you for the hints), I'll state more compactly what I want and show how I solved it with a loop.
I have an ndarray of shape (40450, 200000), representing 40450 signals of length 200000 each. The signals are shifted and I want to align them, so I want to extract a subsequence of length, say, 190000 from each of the 40450 sequences. For this I have a list start_list of length 40450, containing the starting indices of the subsequences (each of the 40450 subsequences I want to extract has a different starting point within its original sequence of length 200000).
I can solve this with a for loop (ar contains the original sequences, start_list the starting indices):
k = 190000
ar_new = np.zeros((40450, k))
for i in range(ar_new.shape[0]):
    ar_new[i] = ar[i, start_list[i]:start_list[i] + k]
If e.g. start_list[0] is 0, this means that I need ar[0, 0:190000]; if start_list[10000] is 1337, I need ar[10000, 1337:1337 + 190000], etc.
But this takes >30 minutes in my case, and I am sure it can somehow be solved with NumPy built-in methods / some slicing magic.
CodePudding user response:
After some trials:
In [14]: a = np.array(range(200000), dtype=float)
    ...: b = np.array(range(200000), dtype=float)
    ...: s, k = 100, 190000
In [15]: %timeit for _ in range(1000): a[:k] = a[s:s+k]
26.4 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: %timeit for _ in range(1000): b[:k] = a[s:s+k]
44.8 ms ± 902 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am thinking: ① if you can do without keeping the non-aligned data, overwriting in place seems FASTER; ② at any rate, if the whole process fits in RAM, I'd expect results in the range of 1÷10 seconds, not 30 minutes; ③ if your problem is swapping, overwriting avoids allocating approximately 4 × 4E4 × 2E5 ⇒ 32E9 bytes of memory.
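Applied to the per-row problem, the overwrite idea would look roughly like this sketch (toy sizes; note that NumPy handles the overlapping same-array slice assignment correctly):

```python
import numpy as np

# Toy data: n rows of length m, one start index per row.
n, m, k = 4, 10, 7
ar = np.arange(n * m, dtype=float).reshape(n, m)
start_list = [0, 1, 3, 2]

# Shift each row's window of interest to the front of the row, in place,
# instead of allocating a second (n, k) array.
for i in range(n):
    ar[i, :k] = ar[i, start_list[i]:start_list[i] + k]
aligned = ar[:, :k]  # the first k columns now hold the aligned signals
```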
CodePudding user response:
import numpy as np
from numpy.lib.stride_tricks import as_strided
# test data
n, m = 5, 10
arr = np.arange(n*m).reshape(n, m)
k = 5
start_list = [0, 1, 2, 1, 0]
# main code
n, m = arr.shape
isize = arr.dtype.itemsize
x = 1 + m - k  # a supporting intermediate dimension: number of window positions per row
assert k < m
assert len(start_list) == n
assert all(0 <= i < x for i in start_list)
arr_modified = as_strided(arr, shape=(n, x, k), strides=(m*isize, isize, isize))
arr_new = arr_modified[[*range(n)], start_list]
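On NumPy >= 1.20, the same windowing can be written with sliding_window_view, which builds the (n, m-k+1, k) view without hand-computed strides; a sketch on the same toy data:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy data: n rows of length m, extract one length-k window per row.
n, m, k = 5, 10, 5
arr = np.arange(n * m).reshape(n, m)
start_list = [0, 1, 2, 1, 0]

windows = sliding_window_view(arr, k, axis=1)  # view of shape (n, m - k + 1, k)
arr_new = windows[np.arange(n), start_list]    # fancy-index one window per row
```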