Is there a fast way to find the first offset at which two large byte sequences differ?-CodePudding

I can use a for loop to loop over two byte sequences and return the index at the first difference of course:

bytes1 = b'12345'
bytes2 = b'1F345'
for index, pair in enumerate(zip(bytes1, bytes2)):
    if pair[0] != pair[1]:
        print(index)
        break

But I don't think that's a smart and fast way to do it. I would hope a native method exists that I can call to get this done. Is there something that can help me here? I can also use numpy if it helps.

I also want to clarify that this will run many times, with medium sequences. Approximately 300MB is expected, chunked by 100kB. I might be able to change that for larger if it helps significantly.

CodePudding user response：

a solution with numpy is to convert them to an array of uint8 then xor them and use argmax to get the first non-zero.

import numpy as np
bytes1 = b'12345'
bytes2 = b'1F345'
bytes3 = np.frombuffer(bytes1,dtype=np.uint8)
bytes4 = np.frombuffer(bytes2,dtype=np.uint8)
max_loc = np.flatnonzero(bytes3 ^ bytes4)[0]
print(max_loc)

problem is that this still iterates to the end of the string on all functions, it's done in C so it is not too slow, slicing long array into multiple smaller ones can reduce the overhead for very long arrays.

Edit: modified argmax to the correct flatnonzero as pointed by @jasonharper, which throws an indexError if both inputs are equal.

CodePudding user response：

If using numba is ok:

import numba
@numba.jit()
def method2(bytes1, bytes2):
    idx = 0
    while idx < len(bytes1):
        if bytes1[idx] != bytes2[idx]:
            return idx
        idx  = 1
    return idx

Note that first run of this function will be significantly slower (due to compilation performed by numba). Takes like 2 seconds.

Then, for each next run of the function:

easy case you posted, index = 1 -> numba is 2x faster,
for index = 100 -> numba is 33x faster