Parsing a string by splitting on certain keywords (outside of string literals), but it doesn't get split

May I ask a question about a problem I've been running into these days? I would appreciate it so much if you could help me solve it :)

So, I have this simple string that I want to parse using the '@' keyword (splitting only when '@' is outside of a string that's inside the string). The reason is that I'm learning how to parse and split strings on certain keywords, because I'm trying to implement my own 'simple programming language'...

Here's an example that I've made using regex (spaces after the '@' keyword don't really matter):

# Ignore the 'println(' thing, it's basically a builtin print statement that I made, so
# you can only focus on the string itself :)

# (?!\B"[^"]*)@(?![^"]*"\B)
# While looking up how to do this with regex, I found this pattern, which basically
# splits the string into elements by the '@' keyword, but doesn't split if '@' is found
# inside a string. Here's what I mean:

# '"[email protected]"'     --- found '@' inside a string, so don't split it
# '"[email protected]" @ x' --- found '@' outside a string, so after being split it would look like this:
# ['"[email protected]"', x]
# Note: ')\n' or ')' always evaluates to ')\n', so strip the newline first, then the ')'
print_args = re.split(r'(?!\B"[^"]*)@(?![^"]*"\B)', codes[x].split('println(')[-1].rstrip('\n').removesuffix(')'))
vars: list[str] = []
result_for_testing: list[str] = []
            
for arg in range(0, len(print_args)):
    # I don't know if this works, because split() splits the string at every space.
    # Spaces inside a string literal would then be treated as split points too, but
    # they should not be, because they are inside a string that is inside the
    # string, not outside of it.

    # Example 1: '"Hello, World!" @   x @     y' => ['"Hello, World!"', x, y]
    # Example 2: '"Hello,      World!      " @    x @   y' => ['"Hello,      World!      "', x, y]
    # At this point, the parsing shouldn't have to worry about extra spaces inside a string, as in example 2...
    compare: list[str] = print_args[arg].split()

    # This checks whether '"' is absent from the piece that has been split off (in this
    # case, a word that doesn't contain '"'). Otherwise, it appends the whole piece,
    # re-joined with single spaces
    
    # Here's the string: '"Value of a is: " @ a @ "String"' [for example 1]
    # Example 1: ['"Value of a is: "', 'a', '"String"'] (This one is correct)

    # Here's the string: '"   Value of a is: " @ a @ "   String"'
    # Example 2: ['" Value of a is: " @ a @ " String"'] (This one is incorrect)
    vars.append(compare[0]) if '"' not in compare[0] else vars.append(" ".join(compare[0:]))
    
for v in range(0, len(vars)):
    # This just does its job: appending the elements of 'vars'
    # to 'result_for_testing'
    result_for_testing.append(vars[v])

print(result_for_testing)

After these operations, the output I get for basic input without unnecessary spaces is like this:

string_to_be_parsed: str = '"Value of a is: " @ a @ "String"'
Output > ['"Value of a is: "', 'a', '"String"'] # As I expected...

But somehow it breaks on something like this (with unnecessary spaces):

string_to_be_parsed: str = '"   Value    of  a  is:     "    @     a   @  "   String  "'
Output > ['" Value of a is: " @ a @ " String "']
# Incorrect result; I was hoping the result would be like this:

Expected Output > ["   Value    of  a  is:     ", a, "   String  "]
# Spaces inside a string should simply be ignored by the split, but I don't know how to do that

Alright, guys, those are the problems I've encountered. To sum up:

How do I parse the string and split it by the '@' keyword, without splitting when '@' is found inside a string in the string?

Example: '"@ in a string inside a string" @ is_out_from_a_string'
The result should be: ['"@ in a string inside a string"', is_out_from_a_string]

While parsing, how do I ignore all the spaces inside a string in the string?

Example: '"    unnecessary      spaces  here      too" @ x @ y @ z @ "   this   one     too"'
The result should be: ['"    unnecessary      spaces  here      too"', x, y, z, '"   this   one     too"']

Once again, I would really appreciate your help in finding solutions to these problems. If there's something I did wrong or misunderstood, please tell me where and how I should fix it :)

Thank you :)

CodePudding user response：

When talking about programming languages, a string.split() and nested loops aren't going to be enough. Programming languages usually split this into two steps: the tokenizer or lexer, and the parser. The tokenizer takes the input string (code in your-lang) and returns a list of tokens that represent keywords, identifiers, etc. In your code, this is each element in the result.

Either way, you're probably going to want to restructure your code a bit. For a tokenizer, here's some Python-ish pseudocode:

yourcode = input
tokens = []
cursor = 0
while cursor < len(yourcode):
    rest = yourcode[cursor:] # skip the part that has already been scanned
    match the start of rest against each regex in the list of regexes
    if match is a token:
        add a token of the matched type to tokens
        cursor += len(matched string)
    elif match is whitespace:
        cursor += len(matched whitespace)
    else:
        throw error: invalid token

This uses a cursor to advance through your input string and extract tokens, as a direct answer to your question. For the list of regexes, simply use a list of pairs, where each pair includes a regex and a string describing the token type.
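As a concrete version of the cursor loop and the list of pairs, here is a minimal runnable sketch. The token names (STRING, AT, IDENT) and the regexes are assumptions made for illustration, not part of your language; string literals here do not support escape sequences.

```python
import re

# Hypothetical token spec: (compiled regex, token type) pairs, tried in order.
TOKEN_SPEC = [
    (re.compile(r'"[^"]*"'), "STRING"),      # string literal, no escape handling
    (re.compile(r"@"), "AT"),                # the @ separator
    (re.compile(r"[A-Za-z_]\w*"), "IDENT"),  # variable names
    (re.compile(r"\s+"), None),              # whitespace: consumed, not emitted
]


def tokenize(source):
    tokens = []
    cursor = 0
    while cursor < len(source):
        for pattern, kind in TOKEN_SPEC:
            match = pattern.match(source, cursor)  # only matches at the cursor
            if match:
                if kind is not None:
                    tokens.append((kind, match.group()))
                cursor = match.end()  # advance past the matched text
                break
        else:
            # No regex matched at this position.
            raise SyntaxError(f"invalid token at position {cursor}")
    return tokens


print(tokenize('"   Value    of  a  is:     "    @     a   @  "   String  "'))
```

Because the whole string literal is matched as one token, the spaces and any '@' inside the quotes never reach the splitting logic.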

However, for a first programming language project, building a tokenizer and parser by hand is probably not the way to go, as it can get extremely complex very quickly, though it is a great learning experience once you're comfortable with the basics. I would consider using a parser generator. I have used one called SLY with Python, as well as PLY (SLY's predecessor), with good results. Parser generators take a grammar (a description of your language in a specific format) and output a program that can parse your language, so that you can worry about the functionality of the language itself more than about how you parse the text/code input.

It may also be worth doing some more research before beginning your implementation. In particular, I would recommend reading about Abstract Syntax Trees and parsing algorithms: recursive descent, which is what you would be writing if you built a parser manually, and LALR(1) (Look-Ahead LR, a left-to-right, bottom-up method with one token of lookahead), which is what SLY generates.
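To give a feel for recursive descent, here is a small sketch with one method per grammar rule. The grammar, the token kinds, and the parentheses are assumptions added for illustration (your language may not have grouping at all). The input is a list of (kind, value) tokens that a tokenizer would produce.

```python
class Parser:
    """Recursive descent over a list of (kind, value) tokens.

    Hypothetical grammar, one method per rule:
        concat  := operand ('@' operand)*
        operand := STRING | IDENT | '(' concat ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Kind of the current token, or "EOF" when the input is exhausted.
        if self.pos < len(self.tokens):
            return self.tokens[self.pos][0]
        return "EOF"

    def take(self, kind):
        # Consume the current token if it has the expected kind.
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def concat(self):
        node = self.operand()
        while self.peek() == "AT":
            self.take("AT")
            node = ("concat", node, self.operand())  # left-associative
        return node

    def operand(self):
        if self.peek() == "STRING":
            return ("string", self.take("STRING"))
        if self.peek() == "IDENT":
            return ("name", self.take("IDENT"))
        self.take("LPAREN")
        node = self.concat()  # the recursion: a group contains a full concat
        self.take("RPAREN")
        return node


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.concat()
    if parser.peek() != "EOF":
        raise SyntaxError(f"unexpected {parser.peek()} after expression")
    return tree


tokens = [("STRING", '"Value of a is: "'), ("AT", "@"), ("IDENT", "a")]
print(parse(tokens))
# ('concat', ('string', '"Value of a is: "'), ('name', 'a'))
```

The nested tuples are a stand-in for a syntax tree; a real project would more likely use small node classes.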

ASTs are the output of a parser (which is what the parser generator builds for you) and are used to interpret or compile your language. They are fundamental to the construction of programming languages, so I would start there. This video explains syntax trees, and there are many Python-specific videos on parsing as well. This series also covers using SLY to create a simple language in Python.
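To make "used to interpret" concrete, here is a tiny sketch of an AST and a tree-walking evaluator. The node classes (Str, Name, Concat) are hypothetical, and it assumes that '@' means "convert to text and join", which is only a guess at what your println does.

```python
from dataclasses import dataclass


@dataclass
class Str:
    value: str      # the text of a string literal, without the quotes


@dataclass
class Name:
    ident: str      # a variable name to look up


@dataclass
class Concat:
    left: object    # any node
    right: object   # any node


def evaluate(node, env):
    """Walk the tree and compute the text it stands for."""
    if isinstance(node, Str):
        return node.value
    if isinstance(node, Name):
        return str(env[node.ident])  # env maps variable names to values
    if isinstance(node, Concat):
        return evaluate(node.left, env) + evaluate(node.right, env)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Tree for: "Value of a is: " @ a @ "!"
tree = Concat(Concat(Str("Value of a is: "), Name("a")), Str("!"))
print(evaluate(tree, {"a": 42}))  # Value of a is: 42!
```

The evaluator never looks at the source text again; everything it needs is in the tree, which is why the tree is the natural hand-off point between parsing and execution.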

EDIT: Regarding the specific parsing of the @ sign before a string, I would recommend using one token type for the @ sign and another for your string literal. In your parser, you can check whether the next token is a string literal when the parser encounters an @ symbol. This will decrease complexity by splitting up your regexes, and will also allow you to reuse the tokens if you later implement functionality that also uses @ or string literals.
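Applied to the println arguments from the question, that idea might look like the sketch below. The function name split_print_args and the token names are made up for illustration, the combined regex with named groups is just one way to write the token list, and only string literals and plain names are accepted as operands.

```python
import re

# Hypothetical token types, combined into one regex with named groups.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"[^"]*")        # string literal (no escape sequences handled)
  | (?P<AT>@)                  # the separator
  | (?P<IDENT>[A-Za-z_]\w*)    # variable name
  | (?P<SKIP>\s+)              # whitespace between tokens
  | (?P<MISMATCH>.)            # anything else is an error
''', re.VERBOSE)


def split_print_args(text):
    """Return the operands of an @-separated argument list."""
    tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)
              if m.lastgroup != "SKIP"]
    args = []
    expect_operand = True  # True at the start and right after each @
    for kind, value in tokens:
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {value!r}")
        if expect_operand:
            if kind not in ("STRING", "IDENT"):
                raise SyntaxError(f"expected a string or a name, got {value!r}")
            args.append(value)
        elif kind != "AT":
            raise SyntaxError(f"expected '@', got {value!r}")
        expect_operand = not expect_operand
    if expect_operand:
        raise SyntaxError("argument list is empty or ends with '@'")
    return args


print(split_print_args('"   Value    of  a  is:     "    @     a   @  "   String  "'))
# ['"   Value    of  a  is:     "', 'a', '"   String  "']
```

Spaces inside the quotes survive untouched because the string literal is a single token, and an '@' inside a literal is never seen as a separator. A missing '@' between two operands is reported as an error instead of being silently merged.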