Split string to dict with good performance

I am looking for the best way to split a long byte string that looks like b'a: 1\nb: 2\n ...' (about 50-70 keys) into a dict.

Each string is 8-10K bytes long, and I receive about 1K strings per second.

My best method so far looks like:

dict(x.split(b": ") for x in bytes(headers).split(b'\n'))

Would Cython give a better result?

CodePudding user response：

Maybe you are looking for something like this, which uses only itertools to save memory on long strings:

from itertools import pairwise  # available in Python 3.10+

def string_to_dict(str_value):
    # l is a list of the indices of each b'\n' inside the string,
    # bracketed by -1 and len(str_value) so the first and last lines are kept
    l = [-1]
    i = 0
    while i < len(str_value):
        # indexing bytes yields an int, so compare against ord('\n')
        if str_value[i] == ord('\n'):
            l.append(i)
        i += 1
    l.append(len(str_value))
    # pairwise(l) yields 2-tuples of indices that delimit each
    # substring in the format b'key: value'
    # str_value[x[0] + 1:x[1]].split(b': ') gives a (key, value) pair
    # from which the dict is built
    result_dict = dict(str_value[x[0] + 1:x[1]].split(b': ') for x in pairwise(l))
    return result_dict

Or, to save even more memory at the price of extra compute, the following replaces the list of indices with a generator:

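from itertools import chain, pairwise

def string_to_dict(str_value):
    # newline indices (indexing bytes yields an int), generated lazily
    w = (i for i in range(len(str_value)) if str_value[i] == ord('\n'))
    # -1 and len(str_value) bracket the first and last lines
    w = chain([-1], w, [len(str_value)])
    return dict(str_value[x[0] + 1:x[1]].split(b': ') for x in pairwise(w))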
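
The same index-scanning idea can probably be written with bytes.find, so the search for each b'\n' happens inside the bytes object rather than in a Python-level loop. This is only a sketch: the names iter_pairs and string_to_dict_find are made up for illustration, and whether it is actually faster on 8-10K inputs would have to be measured.

```python
def iter_pairs(data, sep=b": "):
    """Yield (key, value) pairs one line at a time, without a list of lines."""
    start = 0
    end = len(data)
    while start < end:
        # find the next newline; -1 means this is the last line
        nl = data.find(b"\n", start)
        if nl == -1:
            nl = end
        if nl > start:  # skip empty lines
            # partition splits on the first separator only
            key, _, value = data[start:nl].partition(sep)
            yield key, value
        start = nl + 1


def string_to_dict_find(data):
    # hypothetical helper name; builds the dict from the lazy pairs
    return dict(iter_pairs(bytes(data)))


print(string_to_dict_find(b"a: 1\nb: 2"))
```

Because partition only splits on the first b": ", a value that itself contains that separator stays intact.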

CodePudding user response：

As long as the input is well-formed, we could replace the key/value delimiter (b": ") with the line delimiter (b"\n"), split everything in one pass, and then slice the resulting list into keys and values.

The code looks something like:

def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))
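
To make the replace-then-split step concrete, here is a small sketch of the intermediate values. The inputs are made up for illustration. The last part shows what "well-formed" rules out: a value that itself contains b": ".

```python
def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))


# hypothetical sample input, not from the original post
sample = b"a: 1\nb: 2"

# after the replace, keys and values alternate in one flat list
items = sample.replace(b": ", b"\n").split(b"\n")
print(items)               # [b'a', b'1', b'b', b'2']
print(items[::2])          # keys:   [b'a', b'b']
print(items[1::2])         # values: [b'1', b'2']
print(fast_split(sample))  # {b'a': b'1', b'b': b'2'}

# a value containing b": " shifts the alternation, so the value is
# silently truncated instead of raising an error
broken = fast_split(b"a: 1\nb: x: y")
print(broken)              # {b'a': b'1', b'b': b'x'}
```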

On my machine, it's about 3x faster:

from timeit import timeit

size = 100000
test_str = b"\n".join([b"a: 1"] * size)


def slow_split(data):
    return dict(x.split(b": ") for x in bytes(data).split(b'\n'))

def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))

print(fast_split(test_str) == slow_split(test_str))

print(timeit("slow_split(test_str)", number=100, setup="from __main__ import slow_split, test_str"))
print(timeit("fast_split(test_str)", number=100, setup="from __main__ import fast_split, test_str"))

True
1.373571052972693
0.4970768200000748
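
The timing above uses one very large input (100000 lines), while the question describes many small ones (50-70 keys, 8-10K bytes each). A sketch of the same comparison at that size might look like the following. The payload shape (60 keys with 140-byte values) is an assumption made up to land in that range, and no timings are quoted here because they will vary by machine.

```python
from timeit import timeit


def slow_split(data):
    return dict(x.split(b": ") for x in bytes(data).split(b"\n"))


def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))


# assumed payload: 60 keys, values padded so the total is roughly 8-10K bytes
test_str = b"\n".join(b"key%d: %s" % (i, b"v" * 140) for i in range(60))

# both functions must agree before the timings mean anything
assert fast_split(test_str) == slow_split(test_str)

print(len(test_str), "bytes")
for fn in (slow_split, fast_split):
    # 1000 calls roughly models one second of traffic at 1K strings/second
    elapsed = timeit(lambda: fn(test_str), number=1000)
    print(fn.__name__, elapsed)
```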