How to parse and tokenize instructions in a text file using regex with Python

Time:08-24

I'm trying to build a language interpreter using regular expressions with Python. The language is a basic context-free language meant to formalize the possible commands sent to a drill robot, and the syntax consists of three main groups of instructions to move the robot:

1. ROTATE LEFT, ROTATE RIGHT, GO N UNITS (N is any natural number)
2. DRILL
3. REPEAT N TIMES { ... }

The idea is to take a file with instructions (possibly several per line) and, for each instruction, execute specific methods on the robot. If any syntax error appears, I have to report the line on which it occurred. One example of such a file is:

GO 4 UNITS ROTATE LEFT ROTATE LEFT
GO 5 UNITS DRILL GO 2 UNITS
REPEAT 4 TIMES { GO 5 UNITS ROTATE LEFT } GO 1 UNITS
REPEAT 2 TIMES {
   GO 5 UNITS
   ROTATE RIGHT DRILL
   REPEAT 2 TIMES { GO 1 UNITS DRILL } ROTATE LEFT
}

My plan is to go line by line, matching the possible instructions using regex and executing the corresponding method to modify the state of the robot. This turned out to be relatively simple, but the problem arises when I try to deal with the REPEAT N TIMES { ... } instruction, since it can span multiple lines and can be nested. I could match REPEAT N TIMES { and then look for a }, but because the instruction may be multiline and recursive this could get really messy really fast.

I'd appreciate any tips and suggestions on the simplest way to face this problem using regex.
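
For the line-by-line matching plan described above, one possible starting point is to let the regex do only the tokenizing and tag every token with its line number, leaving the nested REPEAT blocks to ordinary code. A minimal sketch, where TOKEN_RE and tokenize are hypothetical names and the three token classes are an assumption about the language:

```python
import re

# Hypothetical token classes: numbers, upper-case keywords, and braces.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<WORD>[A-Z]+)|(?P<BRACE>[{}]))")

def tokenize(text):
    """Return a list of (kind, value, line_number) tuples.

    Raises SyntaxError naming the line if some text matches no token class.
    """
    tokens = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        pos = 0
        while pos < len(line):
            m = TOKEN_RE.match(line, pos)
            if m is None:
                rest = line[pos:].strip()
                raise SyntaxError(f"line {lineno}: cannot tokenize {rest!r}")
            # lastgroup is the name of the alternative that matched.
            tokens.append((m.lastgroup, m.group(m.lastgroup), lineno))
            pos = m.end()
    return tokens
```

Because each token remembers its line, whatever later consumes the list can still say where a problem occurred, even when a REPEAT block spans several lines.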

CodePudding user response:

This solves your problem with basic text parsing. This will not handle malformed programs. Don't Do That, as we say...

body = """\
GO 4 UNITS ROTATE LEFT ROTATE LEFT
GO 5 UNITS DRILL GO 2 UNITS
REPEAT 4 TIMES { GO 5 UNITS ROTATE LEFT } GO 1 UNITS
REPEAT 2 TIMES {
   GO 5 UNITS
   ROTATE RIGHT DRILL
   REPEAT 2 TIMES { GO 1 UNITS DRILL } ROTATE LEFT
}""".split()

def handle_go( u ):
    print( "going %d units" % int(u))

def handle_rotate( direc ):
    print( "rotating %s" % direc.lower() )

def handle_drill():
    print( "drilling" )

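def find_matching_brace(body):
    # Return the index of the '}' that closes the block whose '{' was
    # already consumed, skipping over any nested { ... } pairs.
    nest = 0
    for i, n in enumerate(body):
        if n == '{':
            nest += 1
        if n == '}':
            if not nest:
                return i
            nest -= 1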

def process(body):
    while body:
        verb = body.pop(0)
        if verb == "GO":
            handle_go( body.pop(0) )
            assert body.pop(0) == 'UNITS'
        elif verb == "ROTATE":
            handle_rotate(body.pop(0))
        elif verb == "DRILL":
            handle_drill()
        elif verb == "REPEAT":
            count = body.pop(0)
            assert body.pop(0)=="TIMES"
            assert body.pop(0)=="{"
            closing = find_matching_brace(body)
            newbody = body[0:closing]
            print("repeat", count, closing, newbody)
            for _ in range(int(count)):
                process(newbody[:])
            body = body[closing+1:]
            
process(body)

Output:

going 4 units
rotating left
rotating left
going 5 units
drilling
going 2 units
repeat 4 5 ['GO', '5', 'UNITS', 'ROTATE', 'LEFT']
going 5 units
rotating left
going 5 units
rotating left
going 5 units
rotating left
going 5 units
rotating left
going 1 units
repeat 2 17 ['GO', '5', 'UNITS', 'ROTATE', 'RIGHT', 'DRILL', 'REPEAT', '2', 'TIMES', '{', 'GO', '1', 'UNITS', 'DRILL', '}', 'ROTATE', 'LEFT']
going 5 units
rotating right
drilling
repeat 2 4 ['GO', '1', 'UNITS', 'DRILL']
going 1 units
drilling
going 1 units
drilling
rotating left
going 5 units
rotating right
drilling
repeat 2 4 ['GO', '1', 'UNITS', 'DRILL']
going 1 units
drilling
going 1 units
drilling
rotating left
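
The code above assumes a well-formed program. Since the question also asks for the line of any syntax error, here is a hedged sketch of one way to extend the same idea: every token carries its line number, and a recursive function parses a block up to its closing brace before anything is executed. The names lex, expect, parse_block and execute are hypothetical, and the actions are collected in a list instead of being sent to a robot.

```python
import re

def lex(text):
    """Split text into (token, line_number) pairs; braces need no surrounding spaces."""
    return [(tok, n)
            for n, line in enumerate(text.splitlines(), start=1)
            for tok in re.findall(r"[{}]|[^\s{}]+", line)]

def expect(tokens, pos, wanted, prev_line):
    """Return the token at pos if it is in `wanted` (or is a number when wanted is None)."""
    tok, line = tokens[pos] if pos < len(tokens) else ("end of input", prev_line)
    ok = tok.isdecimal() if wanted is None else tok in wanted
    if not ok:
        what = "a number" if wanted is None else " or ".join(wanted)
        raise SyntaxError(f"line {line}: expected {what}, got {tok!r}")
    return tok

def parse_block(tokens, pos, open_line=None):
    """Parse until the matching '}' (nested block) or end of input (top level).

    Returns (instructions, next_position).
    """
    program = []
    while pos < len(tokens):
        tok, line = tokens[pos]
        if tok == "}":
            if open_line is None:
                raise SyntaxError(f"line {line}: unexpected '}}'")
            return program, pos + 1
        if tok == "GO":
            units = int(expect(tokens, pos + 1, None, line))
            expect(tokens, pos + 2, ("UNITS",), line)
            program.append(("go", units))
            pos += 3
        elif tok == "ROTATE":
            direction = expect(tokens, pos + 1, ("LEFT", "RIGHT"), line)
            program.append(("rotate", direction.lower()))
            pos += 2
        elif tok == "DRILL":
            program.append(("drill",))
            pos += 1
        elif tok == "REPEAT":
            count = int(expect(tokens, pos + 1, None, line))
            expect(tokens, pos + 2, ("TIMES",), line)
            expect(tokens, pos + 3, ("{",), line)
            body, pos = parse_block(tokens, pos + 4, open_line=line)
            program.append(("repeat", count, body))
        else:
            raise SyntaxError(f"line {line}: unknown instruction {tok!r}")
    if open_line is not None:
        raise SyntaxError(f"line {open_line}: '{{' is never closed")
    return program, pos

def execute(program, log):
    """Flatten the parsed program into the sequence of actions it performs."""
    for instr in program:
        if instr[0] == "repeat":
            for _ in range(instr[1]):
                execute(instr[2], log)
        else:
            log.append(instr)
    return log
```

Calling parse_block(lex(text), 0) either returns the parsed program or raises a SyntaxError whose message starts with the offending line. Parsing the whole file first means a malformed program is rejected before the robot has moved at all.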