Suppose I have a list (or a NumPy array) of strings, e.g., X = ['abc', 'def', 'ghi', 'jkl'], a list of indices, e.g., J = [0, 3], and a random permutation p, e.g., [0, 3, 2, 5, 1, 4].
I want to find the most time efficient way to concatenate the strings in X corresponding to the indices J, i.e., concatenate X[j] for each j in J to get 'abcjkl', and apply the permutation p to this string, i.e., 'ajclbk'. Applying a permutation p to a string Y results in another string Z such that Z[i] = Y[p[i]].
Right now, I have X, J, and p as NumPy arrays and do
import numpy
X = numpy.array(['abc', 'def', 'ghi', 'jkl'])
J = numpy.array([0, 3])
p = numpy.array([0, 3, 2, 5, 1, 4])
Y = ''.join(X[J])
Z = numpy.array(list(Y))
res = ''.join(Z[p])
Is there a faster way to achieve this? I don't have to use NumPy arrays if a solution with lists exists. In my application, the list X can have 10 million strings (formed using the English alphabet), each of length 1000; the indices J can be of length 5 million; and the permutation p can be of length 5 billion.
CodePudding user response:
Yes, the strings are of the same length (on the order of 1000). You can assume that they are restricted to the English alphabet.

Under these conditions I would expect NumPy to offer considerable advantages. (I also assume here that X can be pre-processed and will be reused with varying J and p values.)

Represent X as a 2D array of bytes (really, 8-bit numeric values, not the Python bytes type):
X = np.array([list(b'abc'), list(b'def'), list(b'ghi'), list(b'jkl')], dtype=np.uint8)
Concatenate by slicing appropriately and then using ravel:
J = (0, 3)
Y = X[J,:].ravel()
Permute as before; then instead of using join to join up the bytes (there is an equivalent for the bytes type, but it's overkill), just pass the array directly to the bytes constructor:
p = (0, 3, 2, 5, 1, 4)
Z = bytes(Y[p,]) # b'ajclbk'
If necessary, we can .decode that to a string at the end, just as we would encode while creating X.
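Putting the steps above together, a minimal end-to-end sketch (toy sizes, with NumPy arrays for J and p as in the question) looks like this:

```python
import numpy as np

# 2D uint8 representation of X, built once and reused.
X = np.array([list(b'abc'), list(b'def'), list(b'ghi'), list(b'jkl')],
             dtype=np.uint8)
J = np.array([0, 3])
p = np.array([0, 3, 2, 5, 1, 4])

Y = X[J, :].ravel()      # concatenate the selected rows: b'abcjkl'
Z = bytes(Y[p])          # apply the permutation: Z[i] = Y[p[i]]
res = Z.decode('ascii')  # only if a str is actually needed

print(res)  # 'ajclbk'
```

Everything up to the final decode stays in flat uint8 arrays, so the heavy steps (row gather, ravel, permutation) are single vectorized NumPy operations rather than Python-level string joins.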