Concatenate and shuffle entries in a list of strings in Python-CodePudding

Suppose I have a list (or a NumPy array) of strings, e.g., X = ['abc', 'def', 'ghi', 'jkl'], a list of indices, e.g., J = [0, 3], and a random permutation p, e.g., [0, 3, 2, 5, 1, 4].

I want to find the most time efficient way to concatenate the strings in X corresponding to the indices J, i.e., concatenate X[j] for each j in J to get 'abcjkl', and apply the permutation p to this string, i.e., 'ajclbk'. Applying a permutation p to a string Y results in another string Z such that Z[i] = Y[p[i]].

Right now, I have X, J, and p as NumPy arrays and do

X = numpy.array(['abc', 'def', 'ghi', 'jkl'])
J = numpy.array([0, 3])
p = numpy.array([0, 3, 2, 5, 1, 4])
Y = ''.join(X[J])
Z = numpy.array(list(Y))
res = ''.join(Z[p])

Is there a faster way to achieve this? I don't have to use NumPy arrays if a solution with lists exist. In my application, list X can have 10 Million strings (formed using the English alphabet) each of length 1000, indices J can be of length 5 Million, and permutation p can be of length 5 Billion.

CodePudding user response：

Yes, the strings are of the same length (on the order of 1000). You can assume that they are restricted to the English Alphabet

Under these conditions I would expect Numpy to offer considerable advantages. (I also assume here that X can be pre-processed and will be reused with varying J and p values.)

Represent X as a 2D array of bytes (really, 8-bit numeric values, not the Python bytes type):

X = np.array([list(b'abc'), list(b'def'), list(b'ghi'), list(b'jkl')], dtype=np.uint8)

Concatenate by slicing appropriately and then using ravel:

J = (0, 3)
Y = X[J,:].ravel()

Permute as before; then instead of using join to join up the bytes (there is an equivalent for the bytes type, but it's overkill), just pass the array directly to the bytes constructor:

p = (0, 3, 2, 5, 1, 4)
Z = bytes(Y[p,]) # b'ajclbk'

If necessary, we can .decode that to a string at the end, just as we would encode while creating X.