Replace and += is abysmally slow


I've written the following code, which deciphers some byte arrays into "readable" text for a translation project.

from pathlib import Path
from string import printable  # assumption: `printable` is string.printable

with open(Path(cur_file), mode="rb") as file:
    contents = file.read()

text = ""
for i in range(0, len(contents), 2): # Since it's encoded in UTF16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i + 1]
    if byte == 0x00 and byte_2 == 0x00:
        text += "[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        # Normal byte: keep printable characters, bracket everything else
        if chr(byte) in printable:
            text += chr(byte)
        else:
            text += "[" + "0x{:02x}".format(byte) + "]"
    else:
        # Special byte: the second byte of the pair is non-zero
        text += "[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"
# Some dirty replaces - Probably slow but what do I know - It works
text = text.replace("[0x0e]n[0x01]","[USERNAME_1]") # Your name
text = text.replace("[0x0e]n[0x03]","[USERNAME_3]") # Your name
text = text.replace("[0x0e]n[0x08]","[TOWNNAME_8]") # Town name
text = text.replace("[0x0e]n[0x09]","[TOWNNAME_9]") # Town name
text = text.replace("[0x0e]n[0x0a]","[CHARNAME_A]") # Character name

text = text.replace("[0x0a]","[ENTER]") # Generic enter

lang_dict[emsbt_key_name] = text

While this code does work and produce output like:

Cancel[0x00 0x00]

And more complex ones, I've stumbled upon a performance problem when I loop it through 60000 files.

I've read a couple of questions regarding += with large strings, and people say that join is preferred for large strings. However, even with strings of just under 1000 characters, a single file takes about 5 seconds to store, which is a lot.
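For reference, the join pattern those questions describe would look roughly like the sketch below. It is only an illustration of the idea: `decode_pairs` is a hypothetical name, and `printable` is assumed to be `string.printable`. The pieces are collected in a list and joined once at the end, instead of growing one string with +=.

```python
from string import printable  # assumption: the same `printable` as in the loop above


def decode_pairs(contents: bytes) -> str:
    """Decode 2-byte pairs into bracketed text, collecting the pieces in a list."""
    parts = []
    # Step over complete pairs only, so a trailing odd byte is ignored.
    for i in range(0, len(contents) - 1, 2):
        byte, byte_2 = contents[i], contents[i + 1]
        if byte_2 != 0x00:
            # Special pair: second byte is non-zero
            parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
        elif byte == 0x00:
            parts.append("[0x00 0x00]")
        elif chr(byte) in printable:
            parts.append(chr(byte))
        else:
            parts.append("[0x{:02x}]".format(byte))
    # One join at the end instead of many += operations
    return "".join(parts)


print(decode_pairs("Cancel".encode("utf-16-le") + b"\x00\x00"))
```

For strings of around 1000 characters the difference from += is likely to be small, so this alone would not explain a delay of several seconds per file.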

I almost feel like it starts fast and gets progressively slower and slower.

What would be a way to optimize this code? I feel it's also abysmal.

Any help or clue is greatly appreciated.

EDIT: Added cProfile output:

         261207623 function calls (261180607 primitive calls) in 95.364 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
        1    0.000    0.000   95.365   95.365 start.py:1(<module>)
        1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
    11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
 62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
 63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
   160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
   160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
    68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
   160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
    91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
    68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
    68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
    68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
       37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
   937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
    11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
   137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)

EDIT: Upon further inspection with line_profiler, it turns out that the culprit isn't even in the code above. It's well outside that code, where I search over the indexes to see if there is a +1 file (looking ahead of the index). This apparently consumes a whole lot of CPU time.
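This matches the profile: 11179 calls to list.index triggered about 62.5 million Path.__eq__ comparisons, because each call rescans the list from the start. The actual loop isn't shown here, so the following is only a sketch with hypothetical names (`files`, `pairs`, `position`) of how the look-ahead could avoid searching at all, by letting enumerate supply the position.

```python
from pathlib import Path

# Hypothetical stand-in for the list of files being processed.
files = [Path("msg_{:03d}.emsbt".format(n)) for n in range(5)]

# Slow pattern (for comparison): files.index(cur_file) scans the list and
# compares Path objects one by one on every iteration.
# Faster: enumerate already knows the position, so no search is needed.
pairs = []
for idx, cur_file in enumerate(files):
    next_file = files[idx + 1] if idx + 1 < len(files) else None
    pairs.append((cur_file.name, next_file.name if next_file else None))

# If a lookup by path is still needed elsewhere, build the mapping once;
# each dictionary lookup then avoids rescanning the list.
position = {path: idx for idx, path in enumerate(files)}

print(pairs[0], position[Path("msg_003.emsbt")])
```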

CodePudding user response:

In case it gives you some leads: in your place, I'd run two separate timing checks over, say, 100 files:

  • How much time it takes to execute only the for loop.
  • How much it takes to do only the six replaces.

If either one takes most of the total time, I'd try to find a solution just for that part. For raw replacements, there is specific software designed for massive replacements. I hope it helps in some way.
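A rough way to take those two measurements is sketched below, using time.perf_counter from the standard library. The names `time_call`, `build_text`, and `apply_replaces` are hypothetical stand-ins, not the asker's real functions, and the sample data is made up.

```python
import time


def time_call(func, *args, repeat=100):
    """Return the total seconds spent calling func(*args) `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    return time.perf_counter() - start


def build_text(data):
    # Stand-in for the decoding for loop
    text = ""
    for b in data:
        text += "[0x{:02x}]".format(b)
    return text


def apply_replaces(text):
    # Stand-in for the chain of replaces
    return text.replace("[0x0a]", "[ENTER]")


sample = bytes(range(32)) * 30  # made-up input of 960 bytes
loop_seconds = time_call(build_text, sample)
replace_seconds = time_call(apply_replaces, build_text(sample))
print("loop: {:.4f}s  replaces: {:.4f}s".format(loop_seconds, replace_seconds))
```

If neither number is large, the time is probably being spent somewhere outside these two steps.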

CodePudding user response:

You might use .format to replace += and + in the following way. Let's say you have code like this:

text = ""
for i in range(10):
    text += "[" + "{}".format(i) + "]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]

which is equivalent to

text = ""
for i in range(10):
    text = "{}[{}]".format(text, i)
print(text)  # [0][1][2][3][4][5][6][7][8][9]

Note that other string formatting methods could be used in the same way; I chose .format because you are already using it.
