Split string to dict with good performance

Time:12-18

I am looking for the best way to split a long byte string that looks like b'a: 1\nb: 2\n ...' (about 50-70 keys) into a dict.

Each string is 8-10K bytes long, and I receive about 1K strings per second.

The best method I have found so far is:

dict(x.split(b": ") for x in bytes(headers).split(b'\n'))
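For illustration, here is that expression wrapped in a function and run on a tiny sample. The name `parse_headers` and the sample bytes are made up for this sketch and are not from the real data.

```python
def parse_headers(headers):
    # hypothetical wrapper around the one-liner above
    return dict(x.split(b": ") for x in bytes(headers).split(b"\n"))

sample = b"a: 1\nb: 2\nc: 3"  # made-up sample input
parsed = parse_headers(sample)
print(parsed)  # {b'a': b'1', b'b': b'2', b'c': b'3'}
```

Both keys and values stay as bytes; nothing is decoded.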

Maybe Cython would give a better result?

CodePudding user response:

Maybe you are looking for something like this, which uses only itertools to save memory on long strings:

from itertools import pairwise  # requires Python 3.10+

def string_to_dict(str_value):
    # l holds the index of every b'\n' in the string, with sentinels at
    # -1 and len(str_value) so the first and last lines are included
    l = [-1]
    i = 0
    while i < len(str_value):
        # indexing bytes yields an int, so compare a one-byte slice
        if str_value[i:i + 1] == b'\n':
            l.append(i)
        i += 1
    l.append(len(str_value))
    # pairwise(l) yields (start, end) index pairs that bracket each
    # substring in the format 'key: value'
    # str_value[x[0] + 1:x[1]].split(b': ') gives the (key, value) pair
    # used to build the dict
    result_dict = dict(str_value[x[0] + 1:x[1]].split(b': ') for x in pairwise(l))
    return result_dict

Or, to save even more memory at the cost of extra computation, replace the list of indices with a generator:

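from itertools import chain, pairwise  # pairwise requires Python 3.10+

def string_to_dict(str_value):
    # lazily yield newline indices (indexing bytes gives an int; b'\n' is 10)
    w = (i for i in range(len(str_value)) if str_value[i] == 10)
    # sentinels at -1 and len(str_value) keep the first and last lines
    bounds = pairwise(chain([-1], w, [len(str_value)]))
    return dict(str_value[a + 1:b].split(b': ') for a, b in bounds)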
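A variant of the same index-based idea, sketched here as an assumption rather than as part of the answer above: let `bytes.find` locate each newline instead of testing every byte in a Python-level loop. The function name `string_to_dict_find` is made up, and unlike the versions above it skips empty lines and uses `partition` so a line is split only at its first `: `.

```python
def string_to_dict_find(data):
    # hypothetical variant: bytes.find scans for b"\n" in C
    result = {}
    start = 0
    while start < len(data):
        end = data.find(b"\n", start)
        if end == -1:
            end = len(data)  # last line has no trailing newline
        line = data[start:end]
        if line:  # skip empty lines, e.g. after a trailing newline
            key, _, value = line.partition(b": ")
            result[key] = value
        start = end + 1
    return result

print(string_to_dict_find(b"a: 1\nb: 2\n"))  # {b'a': b'1', b'b': b'2'}
```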

CodePudding user response:

As long as the input is well-formed (every line is exactly one "key: value" pair and no value contains ": "), we can replace the ": " delimiter with the line delimiter (\n), split on both at once, and then slice the result into keys and values.

The code looks something like:

def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))
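
To make the intermediate steps concrete, here is a small walk-through on made-up sample bytes, followed by an input that is not well-formed for this trick because a value itself contains ": ". The variable names are chosen for this sketch only.

```python
data = b"a: 1\nb: 2"  # made-up sample
items = data.replace(b": ", b"\n").split(b"\n")
print(items)  # [b'a', b'1', b'b', b'2']

# even positions are keys, odd positions are values
keys, values = items[::2], items[1::2]
good = dict(zip(keys, values))
print(good)  # {b'a': b'1', b'b': b'2'}

# not well-formed for this trick: the value contains b": "
bad = b"a: x: y\nb: 2"
bad_items = bad.replace(b": ", b"\n").split(b"\n")
print(bad_items)  # [b'a', b'x', b'y', b'b', b'2']
broken = dict(zip(bad_items[::2], bad_items[1::2]))
print(broken)  # {b'a': b'x', b'y': b'b'}
```

In the second case the extra delimiter shifts every later item by one position, so keys and values get mixed up silently instead of raising an error.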

On my machine, fast_split is about 3x faster:

from timeit import timeit

size = 100000
test_str = b"\n".join([b"a: 1"] * size)


def slow_split(data):
    return dict(x.split(b": ") for x in bytes(data).split(b'\n'))

def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))

print(fast_split(test_str) == slow_split(test_str))

print(timeit("slow_split(test_str)", number=100, setup="from __main__ import slow_split, test_str"))
print(timeit("fast_split(test_str)", number=100, setup="from __main__ import fast_split, test_str"))

Output:

True
1.373571052972693
0.4970768200000748
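
Note that the benchmark input repeats the same line b"a: 1", so the resulting dict has a single key. A sketch closer to the sizes described in the question (about 60 distinct keys, roughly 9.5K bytes per string) could look like the following. The key names and value padding are invented for this sketch, and no timings are claimed since they depend on the machine.

```python
from timeit import timeit

def slow_split(data):
    return dict(x.split(b": ") for x in bytes(data).split(b"\n"))

def fast_split(data):
    items = bytes(data).replace(b": ", b"\n").split(b"\n")
    return dict(zip(items[::2], items[1::2]))

# made-up input: 60 distinct keys, each value padded to 150 bytes
lines = [b"key%d: " % i + b"v" * 150 for i in range(60)]
realistic = b"\n".join(lines)
print(len(realistic))  # total size in bytes

# both functions must agree before the timings mean anything
assert fast_split(realistic) == slow_split(realistic)

for func in (slow_split, fast_split):
    seconds = timeit(lambda: func(realistic), number=1000)
    print(func.__name__, seconds)
```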