I have a few functions for string manipulation, but they also involve libraries beyond Python's built-ins (for example, spacy).
Profiling my code shows that for loops consume the most time, and vectorization is commonly recommended as the fix.
I am asking this question as a primer to my exploration, so I will refrain from dumping the whole code here; instead I will use a simple example of string concatenation, and my question is how to vectorize it.
This post gave me a quick explanation of vectorization, and I tried to apply it to strings.
import numpy as np
from timeit import Timer

li = [str(i) for i in range(50000)]
nump_arr = np.char.array(li)

def python_for():
    return [num + 'x' for num in li]

def numpy_vec():
    return nump_arr + 'x'

print("python_for", min(Timer(python_for).repeat(10, 10)))
print("numpy_vec", min(Timer(numpy_vec).repeat(10, 10)))
Results:
python_for 0.048397099948488176
numpy_vec 0.4274819999700412
The Python for loop is roughly 8x faster than numpy here.
As can be seen, numpy char arrays are significantly slower than Python for loops for strings, while the reverse holds for numbers.
I haven't used a simple numpy.array as it throws an error - "ufunc 'add' did not contain a loop with signature matching types (dtype('<U5'), dtype('<U1')) -> None" (for the below code)
import numpy as np

li = [str(i) for i in range(50000)]
nump_arr = np.array(li)
nump_arr + 's'  # raises the ufunc 'add' error above
np.char.array was recommended in this post
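As a side note, np.char.add performs element-wise concatenation even on a plain np.array of strings, which sidesteps that error (a minimal sketch; the exception behavior is as observed on numpy 1.23):

```python
import numpy as np

arr = np.array([str(i) for i in range(5)])  # plain '<U1' array

# On numpy 1.23, `arr + 's'` raises the "ufunc 'add'" error quoted above
# (UFuncTypeError, a subclass of TypeError).
try:
    arr + 's'
except TypeError as e:
    print('plain + failed:', e)

# np.char.add applies string concatenation element-wise on the plain array:
print(np.char.add(arr, 's'))
```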
Question:
- How can I speed up my string manipulations?
- Are numpy arrays not recommended for string manipulations?
Using numpy (v1.23.1)
CodePudding user response:
Increasing the list/array element count by a factor of 10 and using a slightly different timing mechanism as follows:
import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
la = numpy.char.array(lc)

def func_1():
    return [e + 'x' for e in lc]

def func_2():
    return la + 'x'

for func in func_1, func_2:
    print(func.__name__, timeit(func, number=100))
...produces the following output:
func_1 4.441046968000137
func_2 26.463288379000005
...which seems to suggest that numpy (v1.23.2) may not be ideally suited to this kind of requirement.
In case it's relevant: macOS 12.5.1, 32 GB 2666 MHz DDR4, 3 GHz 10-Core Intel Xeon W
CodePudding user response:
Try using map and lambda (without numpy). Examples that might be relevant here:
list(map(str, range(50000)))
and
convert = lambda s: s + 'x'
list(map(convert, lc))
You can also combine all into one function:
convert = lambda s: str(s) + 'x'
list(map(convert, range(50000)))
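For completeness, the map-based approach and a plain list comprehension can be put side by side in one runnable sketch (function names are mine; absolute timings vary by machine):

```python
from timeit import timeit

lc = [str(i) for i in range(50_000)]

def with_map():
    # map defers the call to a lambda for each element
    return list(map(lambda s: s + 'x', lc))

def with_comprehension():
    # plain list comprehension, no extra function-call overhead per element
    return [s + 'x' for s in lc]

# Both produce identical results; timings show which wins on your machine.
assert with_map() == with_comprehension()
print('map:          ', timeit(with_map, number=100))
print('comprehension:', timeit(with_comprehension, number=100))
```

In CPython the comprehension often edges out map-with-lambda, because the lambda adds a function call per element; map only tends to win when paired with a built-in like str.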
CodePudding user response:
A small array of string dtype:
In [139]: A = np.array([f'{i}' for i in range(5)])
In [140]: A
Out[140]: array(['0', '1', '2', '3', '4'], dtype='<U1')
np.char has functions that apply string methods to the elements of an array; np.char.array does the same, but I believe the docs now suggest using the functions.
In [141]: np.char.add(A,'s')
Out[141]: array(['0s', '1s', '2s', '3s', '4s'], dtype='<U2')
Another approach is to make an object dtype array, and let the object dtype mechanism apply the python string operators:
In [142]: B = A.astype(object)
In [143]: B
Out[143]: array(['0', '1', '2', '3', '4'], dtype=object)
In [144]: B + 's'
Out[144]: array(['0s', '1s', '2s', '3s', '4s'], dtype=object)
For strings, + is concatenation; with object dtype, numpy essentially iterates, calling each element's own + method.
Or with a plain list of strings:
In [145]: alist = A.tolist()
In [146]: alist
Out[146]: ['0', '1', '2', '3', '4']
In [148]: [i + 's' for i in alist]
Out[148]: ['0s', '1s', '2s', '3s', '4s']
Some timings with a large array:
In [149]: A = np.array([f'{i}' for i in range(50000)])
In [150]: timeit A = np.array([f'{i}' for i in range(50000)])
25 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.char.add:
In [151]: timeit np.char.add(A,'s')
55.8 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [152]: B = A.astype(object)
In [153]: timeit B + 's'
5.21 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [154]: alist = A.tolist()
In [155]: timeit [i + 's' for i in alist]
6.92 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
object dtype is in the same ballpark as a list comprehension; here it's a bit faster, but not the orders of magnitude we see with numeric methods.
map is similar to a list comprehension:
In [156]: timeit list(map(lambda x: x + 's', alist))
10.5 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numpy doesn't implement its own string methods; it uses Python's. For pure array operations like reshape it's fast, but it doesn't offer much when creating new strings.
It's tempting to use np.vectorize. It has a speed disclaimer, though in recent versions it seems to do a bit better than list comprehensions; here it's more like the np.char timings:
In [157]: timeit np.vectorize(lambda x: x + 's', otypes=['U10'])(A)
37.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
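Pulling the variants above together, a consolidated comparison might look like this (a sketch; labels are mine, and absolute timings depend on hardware and numpy version):

```python
import numpy as np
from timeit import timeit

A = np.array([str(i) for i in range(50_000)])  # '<U5' string dtype
B = A.astype(object)                           # object dtype, Python str elements
alist = A.tolist()                             # plain list of str

candidates = {
    'np.char.add':        lambda: np.char.add(A, 's'),
    'object dtype +':     lambda: B + 's',
    'list comprehension': lambda: [i + 's' for i in alist],
    'np.vectorize':       lambda: np.vectorize(lambda x: x + 's', otypes=['U10'])(A),
}

for name, fn in candidates.items():
    print(f'{name}: {timeit(fn, number=10):.3f}s')
```

On the timings above, object dtype and the list comprehension lead, with np.vectorize and np.char.add several times slower; all four produce the same strings.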