Home > front end >  Find all occurrences of bytestrings in a python code snippet
Find all occurrences of bytestrings in a python code snippet

Time:11-12

I'm trying to parse python snippets, some of which contains bytestrings. for example:

"""
from gzip import decompress as __;_=exec;_(__(b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81@\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00'))

x = b"\x1f\x8b\x08"

y = "hello world"
"""

Is there a regex pattern I can use to correctly find those strings?

I have tried implementing a regex query myself, like so:

bytestrings= re.findall(r'b"(. ?)"', text)   re.findall(r"b\'(. ?)'", text)

I was expecting to receive an array

[b'\x1f\x8b\x08\x00\xcbYmc\x02\xff\xbd7i\xb3\xdaJv\xdf\xdf\xaf /I\xf9\xbar\xc6%\x81@\x92k\x9c)\x16I,b\x95Xm\x87\x92Z-$\xd0\x86\x16\x10LM~{N\x03\xd7\xc6\xd7\x9e%\xa9\xa9PE/\xa7\xcf\xbeuk\xd3\xacm\xdd"\x94\x1b\'\xa5\xda\x04"H\x17\xae\xe3t\xf4\xcdn\x03\xa9/&T>\x13\xdbu\g=\x9f\x13~\x11\xf6\x9b\xd7\x15~\xb2\xe7\xbc\xe6\xc2K\xb8\x18\x03\xfd|[\x7f\xe8\xb8I;\xf0\xf1\x93\xec\x83\x8eo15\x8dC\xfc\xc6I\xf1\xfd\xf5r\x8f\xeb\x0f\xd7\xc53#\xa8<_\xb2Py\xbe\xe1\xde\xff\x0fk&\x93\xa8V\x18\x00\x00', b"\x1f\x8b\x08"]

instead it returns an empty array.

CodePudding user response:

This isn't a job for regular expressions, but for a Python parser.

import ast

code = """
...
"""

tree = ast.parse(code)

Now you can walk the tree looking for values of type ast.Constant whose value attributes have type bytes. Do this by defining a subclass of ast.NodeVisitor and overriding its visit_Constant method. This method will be called on each node of type ast.Constant in the tree, letting you examine the value. Here, we simply add appropriate values to a global list.

bytes_literals = []

class BytesLiteralCollector(ast.NodeVisitor):
    def visit_Constant(self, node):
        if isinstance(node.value, bytes):
            bytes_literals.append(node.value)

BytesLiteralCollector().visit(tree)

The documentation for NodeVisitor is not great. Aside from the two documented methods visit and generic_visit, I believe you can define visit_* where * can be any of the node types defined in the abstract grammar presented at the start of the documentation.

You can use print(ast.dump(ast.parse(code), indent=4)) to get a more-or-less readable representation of the tree that your visitor will walk.

  • Related