I am trying to match multiple elements against a single string with little to no luck.
The regex should return every element of the token array, as many times as it occurs in the string and in the order it occurs; this would be a basic lexing algorithm for a very basic C compiler.
Is there a way I could transform my array into a working pattern where the elements are essentially unordered? I have not found any other pattern that works in my case, as the elements of my array can appear anywhere in the string.
file = """
int main() {
return 2;
}"""
tokens = ['{', '}', '\(', '\)', ';', "int", "return", '[a-zA-Z]\w*', '[0-9] ']
def lex(file):
    results = []
    for i in tokens:
        r = re.match(r".?" + i + ".", file)
        if r is not None:
            results.append(r.group())
    return results
The output should be something like this:
r = ["int", "main", "(", ")", "{", "return", "2", ";", "}"]
CodePudding user response:
Based on the solution from "What is the Python way of doing a \G anchored parsing loop?" you can use
import re

file = """
int main() {
return 2;
}"""

tokens = ['{', '}', r'\(', r'\)', ';', "int", "return", r'[a-zA-Z]\w*', r'[0-9]+']
# Alternation of all token patterns, each allowed to be preceded by whitespace.
p = re.compile(fr"\s*({'|'.join(tokens)})")

def tokenize(w, pattern):
    index = 0
    m = pattern.match(w, index)
    o = []
    # Keep matching from the position right after the previous match.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = pattern.match(w, index)
    return o

print(tokenize(file, p))
# => ['int', 'main', '(', ')', '{', 'return', '2', ';', '}']
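For reference, the joined pattern that p compiles to can be inspected through its pattern attribute (the exact string below assumes the tokens list defined above):
print(p.pattern)
# => \s*({|}|\(|\)|;|int|return|[a-zA-Z]\w*|[0-9]+)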
Basically, this matches any of the patterns in the tokens list consecutively, each one preceded by zero or more whitespace characters, starting from the start of the string.
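As a quick check of that consecutive-matching behavior, here is a small hypothetical second input run through the same tokenize function and compiled pattern p from above:
snippet = """
int add() {
return 41;
}"""
print(tokenize(snippet, p))
# => ['int', 'add', '(', ')', '{', 'return', '41', ';', '}']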
This also means you must define a complete set of the token patterns that can appear in the input; otherwise, the loop will stop silently at the first piece of text that none of the patterns match.
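If a complete token set cannot be guaranteed, one option is a stricter variant that raises on unmatched text instead of stopping silently. The sketch below is not from the original answer; tokenize_strict is a hypothetical name, and it re-uses the file and p defined above:
def tokenize_strict(w, pattern):
    index = 0
    out = []
    while index < len(w):
        m = pattern.match(w, index)
        if not m:
            # No token pattern matched here: ignore a whitespace-only tail, otherwise report it.
            if w[index:].strip():
                raise SyntaxError(f"Unexpected text at position {index}: {w[index:index + 10]!r}")
            break
        # Every alternative consumes at least one character, so index always advances.
        out.append(m.group(1))
        index = m.end()
    return out

print(tokenize_strict(file, p))
# => ['int', 'main', '(', ')', '{', 'return', '2', ';', '}']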