The fastest way possible to split a long string-CodePudding

I have a very long string in Python:

x = "12;14;14;14;18;12;17;19" # I only show a small part of it : there are 10 millions of ;

The goal is to transform it into:

y = array([12, 14, 14, 14, 18, 12, 17, 19], dtype=int)

One way to do this is to use array(x.split(";")) or numpy.fromtostring.

But both are extremely slow.

Is there quicker way to do it in python?

Thank you very much and have a nice day.

CodePudding user response：

Strings are slow, especially unicode ones. CPython is slow, especially loops. Numpy is not really design to (efficiently) deal with strings. I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with are using Numba, Cython or even basic C extensions. The simplest solution is to use Numba, the fastest is to use Cython/C-extensions.

Unfortunately Numba is very slow for strings/bytes so far (open issue that is not planed to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must be first encoded to a byte-array first to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems the fastest solution to convert the buffer to a Numpy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.

Here is the final solution:

import numpy as np
import numba as nb

@nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True,)))
def compute(arr):
    sep = ord(';')
    base = ord('0')
    minus = ord('-')
    count = 1
    for c in arr:
        count  = c == sep
    res = np.empty(count, np.int32)
    val = 0
    positive = True
    cur = 0
    for c in arr:
        if c != sep and c != minus:
            val = (val * 10)   c - base
        elif c == minus:
            positive = False
        else:
            res[cur] = val if positive else -val
            cur  = 1
            val = 0
            positive = True
    if cur < count:
        res[cur] = val if positive else -val
    return res

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))

Note that the Numba solution performs no checks for sake of performance. It also assume the numbers are positive ones. Thus, you must ensure the input is valid. Alternatively, you can perform additional check in it (at the expense of a slower code).

Here are performance results on my machine with a i5-9600KF processor (with Numpy 1.22.4 on Windows):

np.fromstring(x, dtype=np.int32, sep=';'):       8927 ms
np.array(re.split(";", x), dtype=np.int32):      1419 ms
np.array(x.split(";"), dtype=np.int32):          1196 ms
Numba implementation:                              78 ms
Numba implementation (without negative numbers):   67 ms

This solution is 114 times faster than np.fromstring and 15 times faster than the fastest solution (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also, note that 10~12% of the time is spent in encode.

CodePudding user response：

You can try:

import re
x = "12;14;14;14;18;12;17;19"
re.split( ";", x )
#['12', '14', '14', '14', '18', '12', '17', '19']