How to find a sequence of numbers in a numpy array column

I have a numpy array (shape: 10x2) such as the one below:

              array
index      label feature
  0          121    a
  1          131    b 
  2          113    c
  3          131    d
  4          223    e
  5          242    f
  6          212    g 
  7          131    h
  8          113    i
  9          131    j

I want to find the indices that match a certain sequence and get the items in the feature column that correspond to it.

E.g., given the sequence [131,113,131], I would like to get indices 1 and 7 (the starting indices), or the lists of indices that correspond to the sequence ([1,2,3] and [7,8,9]), and then finally get the features that correspond to the sequence: [b,c,d] and [h,i,j].

My current solution is below. It gives me the starting indices of the sequences, but it does not generalize to longer sequences and is a bit difficult to follow:

import numpy as np

v = np.array([[121,1],
         [131,1],
         [113,1],
         [131,1],
         [223,1],
         [242,1],
         [212,1],
         [131,1],
         [113,1],
         [131,1]])

sequence = [131,113,131]

c = [ind for ind, x in enumerate(v[:,0]) if (ind + 1 < len(v[:,0]) and ind + 2 < len(v[:,0])) if (x == sequence[0] and v[:,0][ind + 1] == sequence[1] and v[:,0][ind + 2] == sequence[2])]

I would prefer a solution that uses only numpy as I am restricted to an old system that has some out-of-date custom packages needed for other parts of my script but would welcome seeing it in pandas or any other package. I see this as a type of template matching problem but cannot seem to find an elegant solution. Thank you in advance!

CodePudding user response：

I got your result using np.where, with reduce applied over np.roll-shifted comparisons to combine the conditions. This finds the starting index of each matching sub-series. To get the features, resize the start indices into a column, add an arange of the length of the sequence you are looking for, and index the feature column with the result:

import numpy as np
from operator import and_
from functools import reduce

a = np.array([ ... ])
find = [131, 113, 131]

indices = np.where(reduce(and_ , ((np.roll(a['label'], -r) == i) for r, i in enumerate(find))))[0]
result = a['feature'][np.arange(len(find)) + np.resize(indices, (indices.size, 1))]

[['b' 'c' 'd']
 ['h' 'i' 'j']]

I assume you're using Python 2, which I don't have on my computer. If so, remove the import of reduce, and the generator expression may need to be pulled out into its own loop. The code as written does work on Python 3.x.
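Since the array definition above is elided, here is a minimal runnable sketch of the same approach. It assumes a structured array with 'label' and 'feature' fields, which is only a guess at the layout based on how the code indexes `a`. One detail worth noting: np.roll wraps around the end of the array, so the sketch also masks out start positions that lie too close to the end to hold a full match.

```python
import numpy as np
from functools import reduce
from operator import and_

# Assumed layout (not shown in the answer): a structured array with
# 'label' and 'feature' fields, filled with the data from the question.
a = np.array(
    [(121, 'a'), (131, 'b'), (113, 'c'), (131, 'd'), (223, 'e'),
     (242, 'f'), (212, 'g'), (131, 'h'), (113, 'i'), (131, 'j')],
    dtype=[('label', 'i8'), ('feature', 'U1')],
)
find = [131, 113, 131]

# AND together one shifted comparison per element of the sequence.
mask = reduce(and_, ((np.roll(a['label'], -r) == i) for r, i in enumerate(find)))

# np.roll wraps around, so reject starts that cannot fit a whole sequence.
mask[max(len(a) - len(find) + 1, 0):] = False

indices = np.where(mask)[0]
# One row of consecutive indices per match, used to pick the features.
rows = indices[:, None] + np.arange(len(find))
result = a['feature'][rows]

print(indices)
print(result)
```

With the question's data this should report the start indices 1 and 7 and the feature rows b, c, d and h, i, j.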

CodePudding user response：

A numpy-only option. The steps and their printed outputs explain the flow.

import numpy as np

v = np.array([[121,1],
         [131,1],
         [113,1],
         [131,1],
         [223,1],
         [242,1],
         [212,1],
         [131,1],
         [113,1],
         [131,1]])

# converted to np array
sequence = np.array([131,113,131])

print()
print("# Find starting points of seq in array")
print("v[:,0] = ", v[:,0])
print("v[:,0] == sequence[0] = ", v[:,0] == sequence[0])
start_pos = np.where(v[:,0] == sequence[0])[0]
print("result", start_pos)

print()
print("# Drop all indexes which can give index error")
print("initial", start_pos)
seq_len = sequence.shape[0]
max_possible_idx = v.shape[0]-sequence.shape[0]
start_pos = start_pos[start_pos <= max_possible_idx]
print("result", start_pos)

print()
print("# Generate index sequences to be matched")
idx_seq = np.arange(seq_len).reshape(seq_len,1)
m = np.tile(idx_seq, (1, start_pos.shape[0]))
idx_mat = m + start_pos
print("result \n", idx_mat) # read them column wise

print()
print("# Compare values from each index sequence with given sequence")
bools = np.apply_along_axis(lambda x: v[:,0][x] == sequence, 0, idx_mat)
print(bools)
print(bools.all(0))
print(start_pos[bools.all(0)])

Output:

# Find starting points of seq in array
v[:,0] =  [121 131 113 131 223 242 212 131 113 131]
v[:,0] == sequence[0] =  [False  True False  True False False False  True False  True]
result [1 3 7 9]

# Drop all indexes which can give index error
initial [1 3 7 9]
result [1 3 7]

# Generate index sequences to be matched
result 
 [[1 3 7]
 [2 4 8]
 [3 5 9]]

# Compare values from each index sequence with given sequence
[[ True  True  True]
 [ True False  True]
 [ True False  True]]
[ True False  True]
[1 7]
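The question also asks for the features that belong to each match. A short sketch of that last step follows, assuming the feature column is available as its own array (the name `features` is hypothetical, since the example array above stores a placeholder 1 in that column):

```python
import numpy as np

# Hypothetical feature column, taken from the table in the question.
features = np.array(list("abcdefghij"))

start_pos = np.array([1, 7])  # start indices found in the output above
seq_len = 3                   # length of the searched sequence

# Broadcasting builds one row of consecutive indices per match.
idx = start_pos[:, None] + np.arange(seq_len)
matched = features[idx]

print(idx)
print(matched)
```

Each row of `matched` then holds the features for one occurrence of the sequence.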

This can be further improved by using more higher-order functions, but the general idea is:

Find all positions of the first element of the sequence in v.
Generate a matrix of indexes, where each column holds the sequential indexes to be matched.
Match the slice of v selected by each index column against the sequence.
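The steps above can be condensed into a single broadcasting expression. The sketch below is one possible variant, not the answer's exact code: it compares every window of the column against the sequence instead of pre-filtering on the first element, and the function name `find_sequence` is made up for illustration. It uses only basic indexing and broadcasting, so it should not depend on a recent numpy release, at the cost of building an (n - m + 1) x m index matrix in memory.

```python
import numpy as np

def find_sequence(column, sequence):
    """Return the start indices where `sequence` occurs contiguously in `column`."""
    column = np.asarray(column)
    sequence = np.asarray(sequence)
    n, m = column.shape[0], sequence.shape[0]
    if m == 0 or m > n:
        return np.array([], dtype=int)
    # Row i holds the indices i, i+1, ..., i+m-1 of one candidate window.
    idx = np.arange(n - m + 1)[:, None] + np.arange(m)
    # A window matches when all of its elements equal the sequence.
    return np.where((column[idx] == sequence).all(axis=1))[0]

v = np.array([[121, 1], [131, 1], [113, 1], [131, 1], [223, 1],
              [242, 1], [212, 1], [131, 1], [113, 1], [131, 1]])

starts = find_sequence(v[:, 0], [131, 113, 131])
print(starts)
```

Because the sequence length is read from the input, the same function should work for longer sequences without changing the code.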