I can use a for
loop to loop over two byte sequences and return the index at the first difference of course:
bytes1 = b'12345'
bytes2 = b'1F345'
for index, pair in enumerate(zip(bytes1, bytes2)):
if pair[0] != pair[1]:
print(index)
break
But I don't think that's a smart and fast way to do it. I would hope a native method exists that I can call to get this done. Is there something that can help me here? I can also use numpy if it helps.
I also want to clarify that this will run many times, with medium sequences. Approximately 300MB is expected, chunked by 100kB. I might be able to change that for larger if it helps significantly.
CodePudding user response:
a solution with numpy is to convert them to an array of uint8 then xor them and use argmax to get the first non-zero.
import numpy as np
bytes1 = b'12345'
bytes2 = b'1F345'
bytes3 = np.frombuffer(bytes1,dtype=np.uint8)
bytes4 = np.frombuffer(bytes2,dtype=np.uint8)
max_loc = np.flatnonzero(bytes3 ^ bytes4)[0]
print(max_loc)
1
problem is that this still iterates to the end of the string on all functions, it's done in C so it is not too slow, slicing long array into multiple smaller ones can reduce the overhead for very long arrays.
Edit: modified argmax
to the correct flatnonzero
as pointed by @jasonharper, which throws an indexError
if both inputs are equal.
CodePudding user response:
If using numba is ok:
import numba
@numba.jit()
def method2(bytes1, bytes2):
idx = 0
while idx < len(bytes1):
if bytes1[idx] != bytes2[idx]:
return idx
idx = 1
return idx
Note that first run of this function will be significantly slower (due to compilation performed by numba). Takes like 2 seconds.
Then, for each next run of the function:
- easy case you posted, index = 1 -> numba is 2x faster,
- for index = 100 -> numba is 33x faster