As the title suggests, I'm trying to parse a piece of code into a tree or a list. First off, thank you for any contribution and time spent on this. So far my code does what I expect, but I am not sure that it is the optimal or most generic way to do it.
1. I want a more generic solution, since in the future I am going to need further analysis of this syntax.
2. Right now I am unable to separate operators like '=' or '>=', as you can see in the output I share below. In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of each element (parameter, comparison operator like = or >=, ...). But this is not a real need right now.
Research
My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed I was doing something wrong there (I don't have that code to share anymore). So I started looking at how other people were doing it and found some approaches that didn't necessarily fulfill the requirements of simplicity and generality. I would share the links to the sites, but I didn't keep track of them.
The Syntax of the code
The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters. Strings are defined as 'my string', variables as !variable, and numbers as in any other language. Here is a sample of code:
db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)
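The syntax described above could be split into typed tokens with a single regular expression. The following is only a minimal sketch: the set of comparison operators, the token kind names and the `tokenize` helper are assumptions made for illustration, and strings are assumed to contain no escaped quotes.

```python
import re

# Hypothetical token kinds; the operator list is an assumption.
TOKEN_RE = re.compile(r"""
    (?P<string>'[^']*')            # 'my string'
  | (?P<variable>![^(),=<>']+)     # !variable, may contain spaces
  | (?P<number>\d+(?:\.\d+)?)      # 6 or 3.14
  | (?P<operator><>|>=|<=|=|<|>)   # longer operators are listed first
  | (?P<name>[A-Za-z_]\w*)         # function names such as db, if, ATTRS
  | (?P<punct>[(),])               # parentheses and commas
  | (?P<space>\s+)                 # whitespace, skipped below
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) tuples for the given source text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "space":
            continue
        # A variable may swallow trailing spaces, so strip the value.
        tokens.append((kind, match.group().strip()))
    return tokens
```

Because every token carries its kind, an operator such as = can be told apart from a parameter without inspecting the surrounding string.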
My Output
Here my output is partially correct, since I'm still unable to separate the "= '3'" part (of course I have to separate it, because in this case it's a comparison operator and not part of a string).
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]
Desired Output
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]
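For comparison, the desired structure can be produced by a small recursive-descent parser that uses only the standard library. This is a sketch under stated assumptions, not the original code: the token regex, the operator set and the `parse` / `_parse_items` helpers are hypothetical, strings are assumed to have no escaped quotes, and any word directly followed by "(" is treated as a function name.

```python
import re

# Assumed token grammar: quoted strings, !variables, comparison operators,
# parentheses, commas, and bare words (function names, numbers).
TOKEN_RE = re.compile(r"'[^']*'|![^(),=<>']+|<>|>=|<=|[=<>]|[(),]|[^\s(),=<>']+")


def parse(text):
    """Parse text into nested lists/dicts shaped like the desired output."""
    tokens = [t.strip() for t in TOKEN_RE.findall(text)]
    items, pos = _parse_items(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unbalanced ')'")
    return items


def _parse_items(tokens, pos):
    """Collect items until a closing parenthesis or the end of the tokens."""
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        token = tokens[pos]
        if token == ",":
            pos += 1                      # commas only separate items
        elif pos + 1 < len(tokens) and tokens[pos + 1] == "(":
            # token is a function name: parse its arguments recursively
            args, pos = _parse_items(tokens, pos + 2)
            if pos >= len(tokens):
                raise ValueError(f"missing ')' for {token!r}")
            pos += 1                      # skip the ')'
            items.append({token: args})
        else:
            items.append(token)           # string, variable, number or operator
            pos += 1
    return items, pos
```

Since the operators are separate tokens, the "=" ends up as its own list element instead of staying glued to "'3'".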
My code so far
The parseRecursive method is the entry point.
import re

class FileParser:
    #order is important to avoid miss splits
    def __init__(self):
        pass
    def __charExistsInOccurences(self, current_needle, needles, text):
        '''
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of string to evaluate
        '''
        #if text is a string convert it to list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            #check if needle is inside text value
            for needle in needles:
                #dont check the same key
                if needle != current_needle:
                    regex_search_needle = split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
                    #list of 1's and 0's . 1 if another character is found in the string.
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
        return exists
    def findOperator(self, needles, haystack):
        '''
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        '''
        string_open = haystack.find("'")
        #if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            #regex to ignore the possible spaces between characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            #parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure + 1:])
            #check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            #if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend([haystack[string_open:string_closure + 1]])
                operator = needle
        #filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip())>0,occurences))
        result_check = [1 if x==haystack else 0 for x in occurences]
        #if the haystack was originally a simple string like '1' the occurences list is going to be filled with the same character over and over due to the before string and after string part
        if len(result_check) == sum(result_check):
            occurences = [haystack]
            operator = ''
        return operator, occurences
def parseRecursive(self,text):
parse a block of text
text : string
assert len(text) > 0, "text is empty"
function_open = text.find('(')
accumulated_params = []
if function_open > -1:
#there is another function nested
text_prev_function = text[0:function_open]
#find last space coma or equal to retrieve the function name
last_space = -1
for j in range(len(text_prev_function)-1, 0 , -1):
if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
last_space = j
func_name = ''
if last_space > -1:
#there is something else behind the function name
func_name = text_prev_function[last_space + 1:]
#no parenthesis before so previous characters from function name are parameters
text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space + 1].split(',')))
text_prev_func_params = [x.strip() for x in text_prev_func_params]
#debug here
for itext_prev in text_prev_func_params:
operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
if operator == '':
#function name is the start of the string
func_name = text_prev_function[0:].strip()
#find the closure of parentesis
function_close = text.rfind(')')
#parse the next function and extend the current list of parameters
next_func = text[function_open + 1:function_close]
func_params = {func_name : self.parseRecursive(next_func)}
# parameters after the function
new_text = text[function_close + 1:]
#there is no other function nested
split_text = text.split(',')
current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
current_func_params = [x.strip() for x in current_func_params]
#accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
return accumulated_params
text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
CodePudding user response:
You can use pyparsing to deal with such a case.
* pyparsing can be installed with pip install pyparsing
import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desirable format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            # A nested list holds the arguments of the name pushed just before it
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
- If there is an unbalanced parenthesis inside (for example a(b(c), etc.), an unexpected result is obtained or an IndexError is raised. So be careful in such cases.
- At the moment, only a single sample is available to make a pattern to parse the string. So if you encounter a parsing error, provide more examples in your question.
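One way to guard against the unbalanced-parenthesis case is to validate the input before handing it to the parser. The helper below is a hypothetical sketch, not part of pyparsing; it assumes strings are delimited by single quotes and contain no escaped quotes.

```python
def parentheses_balanced(text):
    """Return True if every '(' has a matching ')' outside quoted strings."""
    depth = 0
    in_string = False
    for ch in text:
        if ch == "'":
            in_string = not in_string     # toggle on each single quote
        elif in_string:
            continue                      # ignore parentheses inside strings
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:                 # a ')' appeared before its '('
                return False
    return depth == 0 and not in_string
```

Calling this first and rejecting the input when it returns False avoids passing a malformed string to the parsing step.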