Python3 re.search() fails with binary string

I've been reading XOR questions for hours now, but have yet to find one that addresses my problem. It's probably an issue of using the wrong terminology, but perhaps someone can help me out.

I have a 32-bit Windows executable that I read into a binary string:

with open('file.exe', 'rb') as f:
    blob = f.read()

No issues there. However, there is a section of the file that is XOR encoded with a single-byte key, which I would like to decode. I don't know the value of the XOR, but I do have a known section of the decoded data that I can look for, so I have an XOR brute force:

for x in range(0, 255):
   known_string = bytearray('\x00\x01\x00\x01\x00'.encode())
   for i in range(len(known_string)):
       known_string[i] ^= x
   if re.search(bytes(known_string), blob):
       return x

This works in many cases; however, if the value of x is above 31, I get the ASCII representation of the XOR-ed bytes, rather than \x.., which causes the re.search() to fail.

In this case, the XOR value should be 0x50. If I XOR the known string with that value, I expect to get

bytearray(b'\x50\x51\x50\x51\x50').

Instead, I get

bytearray(b'23232').

I need the first representation so that I can search for it in blob, but nothing I have tried yields the expected result.

UPDATE FOR MORE CONTEXT

Per the answer below, the representation of the bytes is not my issue. However, I have a second sequence of known bytes that are separated by 3 unknown bytes, which is why I'm using a regex search. If I implement the solution below, I have something that looks like this:

for x in range(256):
    sequence_1 = bytes(byte ^ x for byte in b'\x00\x01\x00\x01\x00')
    sequence_2 = bytes(byte ^ x for byte in b'\x00\x02\x00\x01\x00')

    full_sequence = sequence_1 + b'...' + sequence_2
    if re.search(full_sequence, blob):
        return x

In this case, with an XOR value of 0x50, full_sequence is

b'PQPQP...PRPQP'

The actual string in the blob is

b'...PQPQPRPPPRPQP...'

It is my understanding that re.search() should match the sequence to this string, but it does not return any matches.

CodePudding user response：

You wrote: "This works in many cases; however, if the value of x is above 31, I get the ASCII representation of the XOR-ed bytes, rather than \x.., which causes the re.search() to fail."

Why would the way the byte string is represented affect re.search()?

It's merely a representation. Under the hood the values are the same. All of these 0-to-255 byte strings are the same:

byte_string = bytes(i for i in range(256))
byte_string_default_representation = b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
byte_string_hex = b"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
byte_string_oct = b"\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377"
print(byte_string == byte_string_default_representation == byte_string_hex == byte_string_oct)

Output:

True
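
A shorter check of the same point, as a minimal sketch using the values from your question. bytes.hex() is only used here to view the raw values regardless of how the repr prints them (the separator argument assumes Python 3.8 or later):

```python
# XOR the known bytes from the question with the key 0x50.
xored = bytes(byte ^ 0x50 for byte in b'\x00\x01\x00\x01\x00')

# repr() shows printable ASCII bytes as characters, but the values are unchanged.
print(xored)                             # b'PQPQP'
print(xored == b'\x50\x51\x50\x51\x50')  # True
print(xored.hex(' '))                    # 50 51 50 51 50
```

So b'PQPQP' and b'\x50\x51\x50\x51\x50' are the same five bytes, and either spelling can be searched for in blob.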

Here's how I would write it.

# Your range(0, 255) loop stops at 254, so it misses 255. Use range(256).
# No need for .encode(). Just make it a bytes literal by prefixing it with b.
# A generator expression might be simpler than using a bytearray.
# No need for regex, just use the in operator.
for x in range(256):
    known_string = bytes(byte^x for byte in b'\x00\x01\x00\x01\x00')
    if known_string in blob:
        return x
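
Since return only works inside a function, here is one way the loop could be wrapped so it runs as-is. This is a sketch under assumptions: find_xor_key and sample_blob are hypothetical names, and sample_blob is synthetic stand-in data, not taken from your executable.

```python
def find_xor_key(blob, known=b'\x00\x01\x00\x01\x00'):
    """Return the first single-byte key whose XOR of `known` occurs in `blob`, else None."""
    for x in range(256):
        if bytes(byte ^ x for byte in known) in blob:
            return x
    return None


# Synthetic stand-in for the file contents (assumption, not real data).
sample_blob = b'\xff\xff' + b'PQPQP' + b'\xff\xff'
print(hex(find_xor_key(sample_blob)))  # 0x50
```

Returning None when no key fits lets the caller tell "no match" apart from a key of 0.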

I could not reproduce your result of getting no matches.

Using your blob and sequences, I ran into an exception: error: nothing to repeat. When x was 0x28, full_sequence was b'()()(...(*()(', which is not a meaningful pattern because the XOR-ed bytes happen to be regex metacharacters. So I wrapped the call to re.search() in a try-except block, and then I noticed it found 2 matches. One was spurious: when x was 0x7c, full_sequence was b'|}|}|...|~|}|', where each | is the alternation operator, so the empty alternatives match anywhere.

Instead of regex, try using bytes.find() twice. It returns -1 if it does not find a match. The optional 2nd and 3rd arguments are the start and end indexes of the slice to search.

import re

blob = b'PQPQPRPPPRPQP'

# Attempt 1: regex built from the raw (unescaped) XOR-ed bytes.
for x in range(256):
    sequence_1 = bytes(byte ^ x for byte in b'\x00\x01\x00\x01\x00')
    sequence_2 = bytes(byte ^ x for byte in b'\x00\x02\x00\x01\x00')

    full_sequence = sequence_1 + b'...' + sequence_2
    try:
        if re.search(full_sequence, blob):
            print(f"re.search match found! {x=:x}, {full_sequence=}")
    except re.error:
        # Some keys produce an invalid pattern, e.g. "nothing to repeat".
        continue

# Attempt 2: plain byte searches with bytes.find(), no regex involved.
for x in range(256):
    sequence_1 = bytes(byte ^ x for byte in b'\x00\x01\x00\x01\x00')
    index = blob.find(sequence_1)
    if index != -1:
        sequence_2 = bytes(byte ^ x for byte in b'\x00\x02\x00\x01\x00')
        # Skip past sequence_1 and the 3 unknown bytes.
        index += len(sequence_1) + 3
        if blob.find(sequence_2, index, index + len(sequence_2)) != -1:
            print(f"blob.find match found! {x=:x}, {sequence_1=}, {sequence_2=}")

Output:

re.search match found! x=50, full_sequence=b'PQPQP...PRPQP'
re.search match found! x=7c, full_sequence=b'|}|}|...|~|}|'
blob.find match found! x=50, sequence_1=b'PQPQP', sequence_2=b'PRPQP'
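
If you would rather keep the regex, the XOR-ed bytes need to be escaped so that values such as (, * and | are matched literally. A minimal sketch, assuming the 3 unknown bytes can hold any value (hence re.DOTALL, so that . also matches a newline byte). find_keys_with_regex is a hypothetical name, not something from your code:

```python
import re


def find_keys_with_regex(blob):
    """Return every key x for which the escaped, XOR-ed pattern occurs in blob."""
    keys = []
    for x in range(256):
        sequence_1 = bytes(byte ^ x for byte in b'\x00\x01\x00\x01\x00')
        sequence_2 = bytes(byte ^ x for byte in b'\x00\x02\x00\x01\x00')
        # re.escape() makes bytes like b'(' or b'|' match literally;
        # b'.{3}' with re.DOTALL matches exactly 3 bytes of any value.
        pattern = re.escape(sequence_1) + b'.{3}' + re.escape(sequence_2)
        if re.search(pattern, blob, re.DOTALL):
            keys.append(x)
    return keys


blob = b'PQPQPRPPPRPQP'
print([hex(x) for x in find_keys_with_regex(blob)])  # ['0x50']
```

With escaping in place, no key raises an exception and the spurious 0x7c match disappears, so the try-except block is no longer needed. Unlike the bytes.find() version, this also checks every occurrence of the first sequence rather than only the first one.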