May I ask a question about a problem I've been running into lately? I would really appreciate it if you could help me solve it :)
I have a simple string that I want to split on the '@' keyword, but only where '@' is outside of a string that's inside the string. The reason is that I'm learning how to parse/split strings based on certain keywords, because I'm trying to implement my own 'simple programming language'...
Here's an example that I've made using regex (spaces after the '@' keyword don't really matter):
# Ignore the 'println(' thing, it's basically a builtin print statement that I made, so
# you can only focus on the string itself :)
# (?!\B"[^"]*)@(?![^"]*"\B)
# While looking up how to do this with regex, I found this pattern, which basically
# splits the string into elements at the '@' keyword, but doesn't split if '@' is found
# inside a string. Here's what I mean:
# '"[email protected]"' --- found '@' inside a string, so don't parse it
# '"[email protected]" @ x' --- found '@' outside a string, so after being parsed would be like this:
# ['"[email protected]", x']
print_args = re.split(r'(?!\B"[^"]*)@(?![^"]*"\B)', codes[x].split('println(')[-1].removesuffix(')\n' or ')'))
vars: list[str] = []
result_for_testing: list[str] = []
for arg in range(0, len(print_args)):
    # I don't know if this works, because it splits the string at each space. If
    # there are spaces inside a string, they would be treated as spaces to split
    # on, but they shouldn't be, because those spaces are inside a string that is
    # inside a string, not outside of it.
    # Example 1: '"Hello, World!" @ x @ y' => ['"Hello, World!"', x, y]
    # Example 2: '"Hello, World! " @ x @ y' => ['"Hello, World! "', x, y]
    # At this point, the parsing shouldn't have to worry about unnecessary spaces inside a string, as in example 2...
    compare: list[str] = print_args[arg].split()
    # This checks whether '"' is absent from the piece that has been parsed (in this
    # case, a word that doesn't contain '"'). Otherwise, it appends all of the
    # comparison elements joined back together.
    # Here's the string: '"Value of a is: " @ a @ "String"' [for example 1]
    # Example 1: ['"Value of a is: "', 'a', '"String"'] (This one is correct)
    # Here's the string: '" Value of a is: " @ a @ " String"'
    # Example 2: ['" Value of a is: " @ a @ " String"'] (This one is incorrect)
    vars.append(compare[0]) if '"' not in compare[0] else vars.append(" ".join(compare[0:]))
for v in range(0, len(vars)):
    # This just does its job: appending the same elements from 'vars'
    # to 'result_for_testing'
    result_for_testing.append(vars[v])
print(result_for_testing)
After these operations, the output I get for basic input without unnecessary spaces looks like this:
string_to_be_parsed: str = '"Value of a is: " @ a @ "String"'
Output > ['"Value of a is: "', 'a', '"String"'] # As I expected...
But it breaks on input like this (with unnecessary spaces):
string_to_be_parsed: str = '" Value of a is: " @ a @ " String "'
Output > ['" Value of a is: " @ a @ " String "']
# Incorrect result. I'm hoping for a result like this:
Expected Output > [" Value of a is: ", a, " String "]
# Spaces inside a string should just be ignored by the split, but I don't know how to do that
Alright, guys, those are the problems I've encountered. To sum up:
- How do I split the string on the '@' keyword without splitting where '@' is found inside a string in the string?
Example: '"@ in a string inside a string" @ is_out_from_a_string'
The result should be: ['"@ in a string inside a string"', is_out_from_a_string]
- While parsing, how do I ignore all the spaces inside a string in the string?
Example: '" unnecessary spaces here too" @ x @ y @ z @ " this one too"'
The result should be: ['" unnecessary spaces here too"', x, y, z, '" this one too"']
Once again, I would really appreciate your help in finding solutions to these problems, and if there's something I did wrong or misunderstood, please tell me where and how I should fix it :)
Thank you :)
CodePudding user response:
When talking about programming languages, a string.split() and nested loops aren't going to be enough. Programming languages usually split this into two steps: the tokenizer or lexer, and the parser. The tokenizer takes the input string (code in your-lang) and returns a list of tokens that represent keywords, identifiers, etc. In your code, this is each element in the result.
Either way, you're probably going to want to restructure your code a bit. For a tokenizer, here's some python-ish pseudocode:
yourcode = input
tokens = []
cursor = 0
while cursor < len(yourcode):
    remaining = yourcode[cursor:]  # skip previously scanned tokens
    match each regex from the list of regexes against the start of remaining
    if match == token:
        add token of matched type to tokens
        cursor += len(matched string)
    elif match == whitespace:
        cursor += len(matched whitespace)
    else:
        throw error: invalid token
This uses a cursor to advance through your input string and extract tokens, as a direct answer to your question. For the list of regexes, simply use a list of pairs, where each pair includes a regex and a string describing the token type.
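To make the pseudocode above concrete, here is a minimal runnable sketch in Python. The token type names (STRING, AT, IDENT, SKIP) and their regexes are assumptions chosen for illustration, not something taken from your language; in particular, the string pattern does not handle escaped quotes.

```python
import re

# (regex, token type) pairs. The names and patterns are hypothetical;
# adjust them to whatever your own language needs.
TOKEN_SPECS = [
    (re.compile(r'"[^"]*"'), "STRING"),      # double-quoted literal, no escapes
    (re.compile(r"@"), "AT"),                # the '@' separator
    (re.compile(r"[A-Za-z_]\w*"), "IDENT"),  # variable names
    (re.compile(r"\s+"), "SKIP"),            # whitespace between tokens
]


def tokenize(source):
    """Return a list of (type, text) pairs, dropping whitespace."""
    tokens = []
    cursor = 0
    while cursor < len(source):
        for pattern, kind in TOKEN_SPECS:
            match = pattern.match(source, cursor)  # must match exactly at cursor
            if match:
                if kind != "SKIP":
                    tokens.append((kind, match.group()))
                cursor = match.end()  # advance past the matched text
                break
        else:
            # None of the patterns matched at this position.
            raise SyntaxError(f"invalid token at position {cursor}")
    return tokens


print(tokenize('" Value of a is: " @ a @ " String "'))
```

Because the whole quoted literal is consumed as one STRING token, an '@' or a space inside the quotes never reaches the AT or SKIP patterns, which is the behaviour the question asks for.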
However, for a first programming language project, building a manual tokenizer and parser is probably not the way to go, as it can get extremely complex very quickly, though it is a great learning experience once you're comfortable with the basics. I would consider using a parser generator. I have used one called SLY with Python, as well as PLY (SLY's predecessor), with good results. Parser generators take a grammar, a description of your language in a specific format, and output a program that can parse your language, so that you can worry about the functionality of the language itself more than how you parse the text/code input.
It may also be worth doing some more research before beginning your implementation. Specifically, I would recommend reading about Abstract Syntax Trees and parsing algorithms, in particular recursive descent, which is what you would be writing if you built a parser manually, and LALR(1) (Look-Ahead LR: a left-to-right, bottom-up method with one token of lookahead), which is what SLY generates.
ASTs are the output of a parser (what the parser generator does for you) and are used to interpret or compile your language. They are fundamental to the construction of programming languages, so I would start there. This video explains syntax trees, and there are many python-specific videos on parsing as well. This series also covers using SLY to create a simple language in python.
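As a rough illustration of how a hand-written recursive descent parser builds an AST and how that tree is then interpreted, here is a small sketch. It assumes a one-rule grammar, expr := operand ('@' operand)*, and assumes that '@' means string concatenation; the node classes (Str, Var, Concat), the token type names, and the functions parse and evaluate are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Str:
    value: str  # literal contents without the surrounding quotes


@dataclass
class Var:
    name: str


@dataclass
class Concat:
    left: object
    right: object


def parse(tokens):
    """Parse (type, text) tokens using: expr := operand ('@' operand)*"""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("expected an operand, got end of input")
        kind, text = tokens[pos]
        pos += 1
        if kind == "STRING":
            return Str(text[1:-1])  # strip the quotes
        if kind == "IDENT":
            return Var(text)
        raise SyntaxError(f"expected an operand, got {text!r}")

    def expr():
        nonlocal pos
        node = operand()
        while pos < len(tokens) and tokens[pos][0] == "AT":
            pos += 1  # consume the '@'
            node = Concat(node, operand())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return tree


def evaluate(node, env):
    """Walk the tree; env maps variable names to values."""
    if isinstance(node, Str):
        return node.value
    if isinstance(node, Var):
        return str(env[node.name])
    return evaluate(node.left, env) + evaluate(node.right, env)


tokens = [("STRING", '"Value of a is: "'), ("AT", "@"), ("IDENT", "a")]
tree = parse(tokens)
print(tree)
print(evaluate(tree, {"a": 5}))
```

Each grammar rule becomes one function, which is the defining trait of recursive descent. A parser generator such as SLY would produce the parsing logic from the grammar instead, leaving you to write only the tree-building actions.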
EDIT: In regards to the specific parsing of the @ sign before a string, I would recommend using one token type for the @ sign and another for your string literal. In your parser, you can check if the next token is a string literal when the parser encounters an @ symbol. This will decrease complexity by splitting up your regexes, and also allow you to reuse the tokens if you implement functionality that also uses @ or string literals in the future.
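A minimal sketch of that check, assuming the tokenizer emits (type, text) pairs with the hypothetical type names STRING, IDENT and AT. Since the examples in the question also put variable names after '@', this version accepts either a string literal or an identifier as an operand; the function name split_print_args is made up for illustration.

```python
OPERAND_TYPES = {"STRING", "IDENT"}  # assumed token type names


def split_print_args(tokens):
    """Collect the operands of 'a @ b @ c', validating what follows each '@'."""
    args = []
    expect_operand = True  # True at the start and right after every '@'
    for kind, text in tokens:
        if expect_operand:
            if kind not in OPERAND_TYPES:
                raise SyntaxError(f"expected a string or name, got {text!r}")
            args.append(text)
            expect_operand = False
        else:
            if kind != "AT":
                raise SyntaxError(f"expected '@', got {text!r}")
            expect_operand = True
    if expect_operand:
        raise SyntaxError("expression ends where an operand was expected")
    return args


tokens = [
    ("STRING", '"@ in a string inside a string"'),
    ("AT", "@"),
    ("IDENT", "is_out_from_a_string"),
]
print(split_print_args(tokens))
```

With separate token types, an input that forgets the '@' between two operands, or ends with a dangling '@', is reported as an error instead of being silently merged into one argument.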