I've made the following code, which deciphers some byte arrays into "readable" text for a translation project.
from pathlib import Path
from string import printable

# cur_file, emsbt_key_name and lang_dict come from the surrounding loop.
with open(Path(cur_file), mode="rb") as file:
    contents = file.read()

text = ""
for i in range(0, len(contents), 2):  # Since it's encoded in UTF-16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i + 1]
    if byte == 0x00 and byte_2 == 0x00:
        text += "[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        # print("Normal byte")
        if chr(byte) in printable:
            text += chr(byte)
        elif byte == 0x00:
            pass
        else:
            text += "[" + "0x{:02x}".format(byte) + "]"
    else:
        # print("Special byte")
        text += "[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"

# Some dirty replaces - probably slow, but what do I know - it works
text = text.replace("[0x0e]n[0x01]", "[USERNAME_1]")  # Your name
text = text.replace("[0x0e]n[0x03]", "[USERNAME_3]")  # Your name
text = text.replace("[0x0e]n[0x08]", "[TOWNNAME_8]")  # Town name
text = text.replace("[0x0e]n[0x09]", "[TOWNNAME_9]")  # Town name
text = text.replace("[0x0e]n[0x0a]", "[CHARNAME_A]")  # Character name
text = text.replace("[0x0a]", "[ENTER]")  # Generic enter

lang_dict[emsbt_key_name] = text
While this code does work and produces output like:
Cancel[0x00 0x00]
and more complex ones, I've stumbled upon a performance problem when looping it over 60,000 files.
I've read a couple of questions regarding += with large strings, and people say that join is preferred for large strings. However, even with strings of just under 1000 characters, a single file takes about 5 seconds to process, which is a lot.
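If I understand those answers correctly, a join-based version of my loop would look roughly like this (just a sketch; I haven't measured whether it actually helps):

# Sketch of the join-based variant those answers describe:
# collect fragments in a list and build the final string once.
parts = []
for i in range(0, len(contents), 2):
    byte = contents[i]
    byte_2 = contents[i + 1]
    if byte == 0x00 and byte_2 == 0x00:
        parts.append("[0x00 0x00]")
    elif byte != 0x00 and byte_2 == 0x00:
        if chr(byte) in printable:
            parts.append(chr(byte))
        else:
            parts.append("[0x{:02x}]".format(byte))
    else:
        parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
text = "".join(parts)  # single concatenation instead of repeated +=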
It almost feels like it starts fast and gets progressively slower and slower.
What would be a way to optimize this code? I feel it's abysmal as well.
Any help or clue is greatly appreciated.
EDIT: Added cProfile output:
261207623 function calls (261180607 primitive calls) in 95.364 seconds

Ordered by: cumulative time

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
          1    0.000    0.000   95.365   95.365 start.py:1(<module>)
          1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
      11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
   62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
  125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
   63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
     160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
     160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
      68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
     160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
      91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
      68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
      68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
      68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
         37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
     937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
      11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
     137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)
EDIT: Upon further inspection with line_profiler, it turns out that the culprit isn't even in the code above. It's well outside that code, where I search over the index to see whether there is a file ahead of the current one (looking ahead of the index). This apparently consumes a whole lot of CPU time.
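Judging by the profile, the hot spot has roughly this shape (a simplified, hypothetical reconstruction, not my actual code; files stands for the list of Path objects that gets searched on every iteration):

# Hypothetical sketch of the lookahead suggested by the profile:
# list.index() compares the current Path against every entry,
# which triggers millions of pathlib __eq__ / _cparts calls.
pos = files.index(cur_file)      # O(n) scan over Path objects
if pos + 1 < len(files):
    next_file = files[pos + 1]   # look ahead of the current index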
CodePudding user response:
Just in case it gives you some pathways to explore: if I were in your case, I'd run two separate timed checks over, say, 100 files:
- How much time it takes to execute only the for loop.
- How much time it takes to do only the six replaces.
If either one takes most of the total time, I'd try to find a solution just for that part; a rough timing sketch is shown below. For raw replacements there is specific software designed for massive replacements. I hope it helps in some way.
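For example, a minimal timing sketch with time.perf_counter (the placeholder comments mark where the existing code would go) might look like this:

import time

start = time.perf_counter()
# ... run only the decoding for-loop here ...
loop_time = time.perf_counter() - start

start = time.perf_counter()
# ... run only the six text.replace(...) calls here ...
replace_time = time.perf_counter() - start

print("loop: {:.3f}s  replaces: {:.3f}s".format(loop_time, replace_time))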
CodePudding user response:
You might use .format to replace += and + in the following way. Let's say you have code like this:
text = ""
for i in range(10):
text = "[" "{}".format(i) "]"
print(text) # [0][1][2][3][4][5][6][7][8][9]
which is equivalent to
text = ""
for i in range(10):
text = "{}[{}]".format(text, i)
print(text) # [0][1][2][3][4][5][6][7][8][9]
Observe that other string formatting methods could be used in the same way; I elected to use .format since you are already using it.
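For instance, on Python 3.6+ the same line could also be written with an f-string:

text = f"{text}[{i}]"  # equivalent to the .format call above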