Parsing a string containing code into a list / tree in python

Time:02-11

As the title suggests, I'm trying to parse a piece of code into a tree or a list. Thanks in advance for any contribution and time spent on this. So far my code does what I expect, but I am not sure it is the optimal or most generic way to do this.

Problem

1. I want a more generic solution, since I will need further analysis of this syntax in the future.
2. Right now I am unable to separate operators like '=' or '>=', as you can see in the output I share below. In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of token (parameter, comparison operator like = or >=, ...), but this is not a real need right now.

Research

My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed I was doing something wrong (I don't have that code to share anymore). So I started looking at how other people were doing it and found some approaches that didn't really fulfil the requirements of being simple and generic. I would share the links, but I didn't keep track of them.

The Syntax of the code

The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters. Strings are defined as 'my string', variables as !variable, and numbers as in any other language. Here is a sample of code:
db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)
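
The syntax described above can be split into tokens with a single regular expression before any tree is built. The following is only a minimal sketch using the standard re module: the token kind names (string, operator, variable, number, name, punct) are illustrative assumptions, and it assumes that strings contain no escaped quotes and that a variable ends at a comma, parenthesis, quote or operator character.

```python
import re

# Token kinds are hypothetical names chosen for this sketch.
TOKEN_RE = re.compile(r"""
    (?P<string>'[^']*')                     # quoted string such as 'my string'
  | (?P<operator>@<>|@=|<>|>=|<=|=|>|<)     # longest operators listed first
  | (?P<variable>![^(),=<>@']+)             # !variable, inner spaces allowed
  | (?P<number>\d+(?:\.\d+)?)               # integer or decimal number
  | (?P<name>[A-Za-z_]\w*)                  # function name such as db, if, ATTRS
  | (?P<punct>[(),])                        # parentheses and commas
  | (?P<space>\s+)                          # whitespace, dropped below
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, text) tuples, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = match.lastgroup
        if kind != "space":
            # strip() removes the trailing space a variable may pick up
            tokens.append((kind, match.group().strip()))
        pos = match.end()
    return tokens


for token in tokenize("ATTRS('Dim 1', !Element Structure, 'ID') >= '3'"):
    print(token)
```

Because each token carries its kind, an operator such as >= arrives as its own ("operator", ">=") tuple instead of being glued to the string that follows it.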

My Output

Here my output is partially correct, since I'm still unable to separate the "= '3'" part (I have to separate it because in this case it's a comparison operator and not part of a string).

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]

Desired Output

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]

My code so far

The parseRecursive method is the entry point.

import re

class FileParser:

    #order is important to avoid mis-splits, so use a list (a set has no defined order)
    COMPARATOR_SIGN = [
        '@='
        ,'@<>'
        ,'<>'
        ,'>='
        ,'<='
        ,'='
        ,'>'
        ,'<'
    ]

    def __init__(self):
        pass

    def __charExistsInOccurences(self,current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of string to evaluate
        """
        #if text is a string convert it to list of strings
        text = text if isinstance(text, list) else [text]
        
        exists = False

        for t in text:
            #check if needle is inside text value
            for needle in needles:
                    #dont check the same key
                    if needle != current_needle:
                        regex_search_needle = split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
                        #list of 1's and 0's . 1 if another character is found in the string.
                        found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                        if sum(found) > 0:
                            exists = True
                            break

        return exists
        

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        
        #if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0

        occurences = []

        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            #regex to ignore the possible spaces between characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            
            #parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure + 1:])


            #check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)

            #if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure + 1]])
                occurences.extend(after_string)
                operator = needle
        
        #filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip())>0,occurences))
        result_check = [1 if x==haystack else 0 for x in occurences]
        #if the haystack was originally a simple string like '1', the occurences list is filled with that same string once per needle because of the before-string and after-string parts
        if len(result_check) == sum(result_check):
            occurences= [haystack]
            operator = ''

        return operator, occurences
 




    def parseRecursive(self,text):
        """
        parse a block of text
        text : string 
        """

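        #assert(cond, msg) tests a non-empty tuple and never fails; empty text is
        #also legitimate here because the recursion below passes '' for trailing text
        assert isinstance(text, str), "text must be a string"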

        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            #there is another function nested
            text_prev_function = text[0:function_open]
            
            #find last space coma or equal to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0 , -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break

            func_name = ''

            if last_space > -1:
                #there is something else behind the function name
                func_name = text_prev_function[last_space + 1:]
                #no parenthesis before, so the characters preceding the function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space + 1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                #debug here
                #accumulated_params.extend(text_prev_func_params)

                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
                    
                #accumulated_params.extend(text_prev_operator)
            else:
                #function name is the start of the string
                func_name = text_prev_function[0:].strip()
            
            #find the closing parenthesis
            function_close = text.rfind(')')
            #parse the next function and extend the current list of parameters
            next_func = text[function_open + 1:function_close]
            func_params = {func_name : self.parseRecursive(next_func)}
            accumulated_params.append(func_params)

            #
            # parameters after the function 
            #
            new_text = text[function_close + 1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            #there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        
        #accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
        return accumulated_params

text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))

Answer:

You can use pyparsing to deal with such a case.
* pyparsing can be installed by pip install pyparsing

Code:

import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desirable format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)

Output:

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
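
If installing a third-party package is not an option, the same nested structure can also be built with the standard library alone. The following is only a rough sketch, not the approach used above: the names TOKEN, parse and _parse_items are made up for this example, and it assumes that strings contain no escaped quotes and that any token directly followed by "(" is a function name.

```python
import re

# Alternatives, in order: quoted string, multi-character operators,
# single-character operators, punctuation, then any bare word (function name,
# number or !variable, inner spaces allowed). Note that findall() silently
# skips characters that match none of them, e.g. an unterminated quote.
TOKEN = re.compile(
    r"'[^']*'|@<>|@=|<>|>=|<=|[=<>]|[(),]|[^(),=<>@'\s][^(),=<>@']*"
)


def parse(text):
    """Parse text into nested lists, with {name: [args]} for each call."""
    tokens = [t.strip() for t in TOKEN.findall(text)]
    items, pos = _parse_items(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unmatched ')'")
    return items


def _parse_items(tokens, pos):
    """Collect items until a closing parenthesis or the end of the input."""
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        if tok == ",":
            pos += 1  # commas only separate parameters
        elif pos + 1 < len(tokens) and tokens[pos + 1] == "(":
            args, pos = _parse_items(tokens, pos + 2)
            if pos >= len(tokens):
                raise ValueError(f"missing ')' after {tok}(")
            items.append({tok: args})
            pos += 1  # skip the ')'
        else:
            items.append(tok)
            pos += 1
    return items, pos


sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(parse(sample))
```

Unlike the pyparsing pattern, this sketch also splits an operator that is written without surrounding spaces (for example !a>=3), because the operator characters are excluded from bare words.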

Note:

  • If the parentheses inside () are unbalanced (for example a(b(c) or a(b)c)), you get an unexpected result or an IndexError, so be careful in such cases.
  • So far only a single sample is available for designing the parsing pattern, so if you run into a parsing error, please add more examples to your question.
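
One way to guard against the unbalanced-parenthesis case mentioned in the note is to validate the input before parsing it. The helper below is a hypothetical sketch (check_parentheses is not part of pyparsing) and assumes single-quoted strings without escaped quotes, so parentheses inside a string are ignored.

```python
def check_parentheses(text):
    """Return None if balanced, else the index of an unmatched parenthesis."""
    open_positions = []
    in_string = False
    for index, char in enumerate(text):
        if char == "'":
            in_string = not in_string  # toggle on every quote
        elif in_string:
            continue  # parentheses inside a string do not count
        elif char == "(":
            open_positions.append(index)
        elif char == ")":
            if not open_positions:
                return index  # closing parenthesis without an opener
            open_positions.pop()
    # any opener left over was never closed; report the outermost one
    return open_positions[0] if open_positions else None


print(check_parentheses("a(b(c)"))
print(check_parentheses("a(b)c)"))
print(check_parentheses("db('(', 6)"))
```

Calling it first and raising a clear error when the result is not None avoids passing malformed input to the parser.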