I have a few functions for string manipulation, but they also involve libraries beyond Python's built-ins (for example, spacy).
Profiling my code shows that for loops consume the most time, and vectorization is commonly recommended as the fix.
I am asking this question as a primer to my exploration, so I will refrain from dumping the whole code here; instead I will use a simple example of string concatenation, and my question is how to vectorize it.
This post gave me a quick explanation of vectorization, and I tried to apply it to strings.
import numpy as np
from timeit import Timer

li = [str(i) for i in range(50000)]
nump_arr = np.char.array(li)

def python_for():
    return [num + 'x' for num in li]

def numpy_vec():
    return nump_arr + 'x'

print("python_for", min(Timer(python_for).repeat(10, 10)))
print("numpy_vec", min(Timer(numpy_vec).repeat(10, 10)))
Results:
python_for 0.048397099948488176
numpy_vec 0.4274819999700412
The Python for loop is roughly 8x faster than numpy here.
As can be seen, numpy char arrays are significantly slower than Python for loops for strings, while the reverse holds for numbers.
I haven't used a simple numpy.array as it throws an error - "ufunc 'add' did not contain a loop with signature matching types (dtype('<U5'), dtype('<U1')) -> None" (for the below code)
import numpy as np

li = [str(i) for i in range(50000)]
nump_arr = np.array(li)
nump_arr + 's'  # raises the ufunc 'add' error above
np.char.array was recommended in this post
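As a side note, np.char.add performs element-wise concatenation even on a plain np.array of strings, which sidesteps that error (a minimal sketch; the exception behavior is as observed on numpy 1.23):

```python
import numpy as np

arr = np.array([str(i) for i in range(5)])  # plain '<U1' array

# On numpy 1.23, `arr + 's'` raises the "ufunc 'add'" error quoted above
# (UFuncTypeError, a subclass of TypeError).
try:
    arr + 's'
except TypeError as e:
    print('plain + failed:', e)

# np.char.add applies string concatenation element-wise on the plain array:
print(np.char.add(arr, 's'))
```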
Question:
- How can I speed up my string manipulations?
- Are numpy arrays not recommended for string manipulations?
Using numpy (v1.23.1)
CodePudding user response:
Increasing the list/array element count by a factor of 10 and using a slightly different timing mechanism as follows:
import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
la = numpy.char.array(lc)

def func_1():
    return [e + 'x' for e in lc]

def func_2():
    return la + 'x'

for func in func_1, func_2:
    print(func.__name__, timeit(func, number=100))
...produces the following output:
func_1 4.441046968000137
func_2 26.463288379000005
...which seems to suggest that numpy (v1.23.2) may not be ideally suited to this kind of requirement.
In case it's relevant: macOS 12.5.1, 32 GB 2666 MHz DDR4, 3 GHz 10-Core Intel Xeon W
CodePudding user response:
Try using map and lambda (without numpy). Examples that might be relevant here:
list(map(str, range(50000)))
and
convert = lambda s: s + 'x'
list(map(convert, lc))
You can also combine all into one function:
convert = lambda s: str(s) + 'x'
list(map(convert, range(50000)))
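For completeness, the map-based approach and a plain list comprehension can be put side by side in one runnable sketch (function names are mine; absolute timings vary by machine):

```python
from timeit import timeit

lc = [str(i) for i in range(50_000)]

def with_map():
    # map defers the call to a lambda for each element
    return list(map(lambda s: s + 'x', lc))

def with_comprehension():
    # plain list comprehension, no extra function-call overhead per element
    return [s + 'x' for s in lc]

# Both produce identical results; timings show which wins on your machine.
assert with_map() == with_comprehension()
print('map:          ', timeit(with_map, number=100))
print('comprehension:', timeit(with_comprehension, number=100))
```

In CPython the comprehension often edges out map-with-lambda, because the lambda adds a function call per element; map only tends to win when paired with a built-in like str.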
CodePudding user response:
A small array of string dtype:
In [139]: A = np.array([f'{i}' for i in range(5)])
In [140]: A
Out[140]: array(['0', '1', '2', '3', '4'], dtype='<U1')
np.char has functions that apply string methods to the elements of an array; np.char.array does the same, but I believe the docs now suggest using the functions.
In [141]: np.char.add(A,'s')
Out[141]: array(['0s', '1s', '2s', '3s', '4s'], dtype='<U2')
Another approach is to make an object dtype array, and let the object dtype mechanism apply the python string operators:
In [142]: B = A.astype(object)
In [143]: B
Out[143]: array(['0', '1', '2', '3', '4'], dtype=object)
In [144]: B + 's'
Out[144]: array(['0s', '1s', '2s', '3s', '4s'], dtype=object)
For strings, + is concatenation; with object dtype, numpy essentially iterates, calling each element's own + method.
Or with a plain list of strings:
In [145]: alist = A.tolist()
In [146]: alist
Out[146]: ['0', '1', '2', '3', '4']
In [148]: [i + 's' for i in alist]
Out[148]: ['0s', '1s', '2s', '3s', '4s']
Some timings with a large array:
In [149]: A = np.array([f'{i}' for i in range(50000)])
In [150]: timeit A = np.array([f'{i}' for i in range(50000)])
25 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.char.add:
In [151]: timeit np.char.add(A,'s')
55.8 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [152]: B = A.astype(object)
In [153]: timeit B + 's'
5.21 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [154]: alist = A.tolist()
In [155]: timeit [i + 's' for i in alist]
6.92 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
object dtype is in the same ballpark as a list comprehension; here it's a bit faster, but not the orders of magnitude we see with numeric methods.
map is similar to a list comprehension:
In [156]: timeit list(map(lambda x: x + 's', alist))
10.5 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numpy doesn't implement its own string methods; it uses Python's. For pure array operations like reshape it's fast, but it doesn't offer much when creating new strings.
It's tempting to use np.vectorize. It has a speed disclaimer, though in recent versions it seems to do a bit better than list comprehensions; here it's more like the np.char timings:
In [157]: timeit np.vectorize(lambda x: x + 's', otypes=['U10'])(A)
37.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
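Pulling the variants above together, a consolidated comparison might look like this (a sketch; labels are mine, and absolute timings depend on hardware and numpy version):

```python
import numpy as np
from timeit import timeit

A = np.array([str(i) for i in range(50_000)])  # '<U5' string dtype
B = A.astype(object)                           # object dtype, Python str elements
alist = A.tolist()                             # plain list of str

candidates = {
    'np.char.add':        lambda: np.char.add(A, 's'),
    'object dtype +':     lambda: B + 's',
    'list comprehension': lambda: [i + 's' for i in alist],
    'np.vectorize':       lambda: np.vectorize(lambda x: x + 's', otypes=['U10'])(A),
}

for name, fn in candidates.items():
    print(f'{name}: {timeit(fn, number=10):.3f}s')
```

On the timings above, object dtype and the list comprehension lead, with np.vectorize and np.char.add several times slower; all four produce the same strings.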