I have a series of conditional expressions stored as strings, such as "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))",
which are nested to an unpredictable depth. I have no control over this data, but I need to be able to work with it.
I'd like Python to understand these structures so that I can manipulate them.
I've tried creating nested Python lists by splitting at the parentheses (using pyparsing or unutbu's/falsetru's parser):
>>> str1="('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
>>> var1=parse_nested(str1, r'\s*\(', r'\)\s*')
>>> var1
[["'a' AND", ["'b' OR 'c'"]], 'OR', ["'d' AND", ["'e' OR", ["'g' AND 'h'"]]]]
...but then, to be honest, I don't know how to get Python to interpret the AND/OR relationships between the objects. I feel like I'm going in completely the wrong direction with this.
My ideal output would be a data structure that preserves the relationship types between the entities in a nested way, so that (for example) it would be easy to create JSON or YAML:
OR:
  - AND:
      - a
      - OR:
          - b
          - c
  - AND:
      - d
      - OR:
          - e
          - AND:
              - g
              - h
CodePudding user response:
As far as I know, Python has no built-in for this. I would loop over the string, check for the separators (the parentheses and the AND/OR keywords), split the terms at those points, and strip the surrounding spaces.
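A minimal sketch of that manual approach, using only the standard library. The helper names (`lex`, `parse`, `parse_group`) are made up for illustration, and the sketch assumes that terms are single-quoted, that the keywords are upper-case AND/OR, and that each parenthesised group uses only one kind of operator (as in the question's sample data):

```python
import re

# One token per match: a parenthesis, an AND/OR keyword, or a single-quoted term.
TOKEN = re.compile(r"\s*(?:(\()|(\))|(AND|OR)\b|'((?:[^'\\]|\\.)*)')")


def lex(text):
    """Yield (kind, value) pairs, rejecting anything unrecognised."""
    pos, text = 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f'unexpected input at position {pos}')
        lpar, rpar, op, term = match.groups()
        if lpar:
            yield ('(', lpar)
        elif rpar:
            yield (')', rpar)
        elif op:
            yield ('op', op)
        else:
            yield ('term', term)  # escapes inside the term are left as written
        pos = match.end()


def parse_group(tokens, pos):
    """Parse `operand (op operand)*` and stop at ')' or the end of input."""
    operands, ops = [], []
    while True:
        if pos >= len(tokens):
            raise ValueError('expected a term or "("')
        kind, value = tokens[pos]
        if kind == 'term':
            operands.append(value)
            pos += 1
        elif kind == '(':
            inner, pos = parse_group(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos][0] != ')':
                raise ValueError('missing ")"')
            operands.append(inner)
            pos += 1
        else:
            raise ValueError(f'unexpected {value!r}')
        if pos < len(tokens) and tokens[pos][0] == 'op':
            ops.append(tokens[pos][1])
            pos += 1
        else:
            break
    if not ops:
        return operands[0], pos  # a lone term or a redundant group
    if len(set(ops)) > 1:
        raise ValueError('mixing AND and OR in one group needs parentheses')
    return {ops[0]: operands}, pos


def parse(text):
    tokens = list(lex(text))
    tree, pos = parse_group(tokens, 0)
    if pos != len(tokens):
        raise ValueError('unbalanced ")"')
    return tree


print(parse("('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"))
```

The result is a nested dict of the form {operator: [operands]}, which can be passed to json.dumps or a YAML dumper. Rejecting mixed operators within one group avoids having to guess a precedence rule; if the real data relies on precedence, that check would need to be replaced.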
CodePudding user response:
Assuming the quoted terms are syntactically valid Python strings, it should be possible to use the tokenize module to parse the input. The only additional requirement is that all embedded escapes are doubled. Also note that the string tokens returned by the tokenizer are always quoted, so they can be converted back to values with eval (or, more conservatively, ast.literal_eval).
The demo script below provides a basic implementation. The output is a list of lists, with AND/OR converted to singleton objects; if you need a different structure, it should be simple enough to adapt the code accordingly:
import ast
import tokenize, token as tokens
from io import StringIO

class ParseError(Exception):
    def __init__(self, message, token):
        super().__init__(f'{message}: {token}')

# Singleton marker objects for the two operators.
AND = type('AND', (object,), {'__repr__': lambda x: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda x: 'OR'})()

NAMES = {'AND': AND, 'OR': OR, 'TRUE': True, 'FALSE': False}
# Token types that carry no meaning for the expression.
OTHER = {
    tokens.ENDMARKER, tokens.INDENT, tokens.DEDENT,
    tokens.NEWLINE, tokens.NL, tokens.COMMENT,
    }

def parse_expression(string):
    string = StringIO(string)
    result = current = []
    stack = [current]
    for token in tokenize.generate_tokens(string.readline):
        if (kind := token.type) == tokens.OP:
            if token.exact_type == tokens.LPAR:
                # Open a new nested list and descend into it.
                current.append(current := [])
                stack.append(current)
            elif token.exact_type == tokens.RPAR:
                if len(stack) < 2:
                    raise ParseError('unbalanced parenthesis', token)
                # Discard the finished group and return to its parent.
                stack.pop()
                current = stack[-1]
            else:
                raise ParseError('invalid operator', token)
        elif kind == tokens.NAME:
            # Test membership rather than truthiness, so FALSE is accepted.
            if (key := token.string.upper()) in NAMES:
                current.append(NAMES[key])
            else:
                raise ParseError('invalid name', token)
        elif kind == tokens.STRING or kind == tokens.NUMBER:
            # literal_eval only accepts literals, so it cannot run code.
            current.append(ast.literal_eval(token.string))
        elif kind not in OTHER:
            raise ParseError('invalid token', token)
    return result

str1 = "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
str2 = "('(AND)' AND (42 OR True)) or ('\\'d\\'' AND ('e' OR ('g' AND 'h')))"
print(parse_expression(str1))
print(parse_expression(str2))

Output:
[['a', AND, ['b', OR, 'c']], OR, ['d', AND, ['e', OR, ['g', AND, 'h']]]]
[['(AND)', AND, [42, OR, True]], OR, ["'d'", AND, ['e', OR, ['g', AND, 'h']]]]
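The list-of-lists result keeps the operators inline between their operands. To get closer to the nested OR/AND layout asked for in the question, the lists can be folded into dictionaries. The following is only a sketch: `to_tree` is a hypothetical helper, the AND/OR objects are redefined here as stand-ins so the snippet runs on its own, and it assumes that OR binds more loosely than AND whenever both appear in the same group.

```python
import json

# Stand-ins for the AND/OR singletons used in the script above.
AND = type('AND', (object,), {'__repr__': lambda self: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda self: 'OR'})()


def to_tree(node):
    """Fold an infix list such as ['a', AND, ['b', OR, 'c']] into
    nested {'AND': [...]} / {'OR': [...]} dictionaries."""
    if not isinstance(node, list):
        return node  # a plain term: string, number or bool
    # Assumption: OR has the lowest precedence, so split on it first.
    for op in (OR, AND):
        if any(item is op for item in node):
            groups, group = [], []
            for item in node:
                if item is op:
                    groups.append(group)
                    group = []
                else:
                    group.append(item)
            groups.append(group)
            if any(not g for g in groups):
                raise ValueError(f'missing operand around {op!r}')
            return {repr(op): [to_tree(g) for g in groups]}
    if len(node) != 1:
        raise ValueError(f'missing operator in {node!r}')
    return to_tree(node[0])  # unwrap a group holding a single item


nested = [['a', AND, ['b', OR, 'c']], OR, ['d', AND, ['e', OR, ['g', AND, 'h']]]]
print(json.dumps(to_tree(nested), indent=2))
```

Because the operators become plain string keys, the result can be serialised directly with json.dumps, or with a YAML library if one is installed.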
CodePudding user response:
Lark has what you need:
https://github.com/lark-parser/lark
import lark
SHAPE_GRAMMAR = (
r'''
?start: expr
expr : variable | bin_expr | expr (WS BIN_OP WS expr)* | "(" WS* expr WS* ")"
// A binary expression
bin_expr : variable WS BIN_OP WS variable
// The valid binary operators
BIN_OP : "AND" | "OR" | "XOR" | "NOR" | "NAND"
variable : ESCAPED_SQ_STRING
// Can also use lark directives like `%import common.WS`
_STRING_INNER: /.*?/
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/
ESCAPED_SQ_STRING : "'" _STRING_ESC_INNER "'"
WS: /[ \t\f\r\n]/
''')
parser = lark.Lark(SHAPE_GRAMMAR, start='start', parser='earley')
print(parser.parse("'A'").pretty())
print(parser.parse("'A' OR 'c'").pretty())
print(parser.parse("'A' OR 'B' OR 'c'").pretty())
print(parser.parse("('A' OR 'B' )").pretty())
print(parser.parse("('A' OR 'B' OR 'C' )").pretty())
print(parser.parse("('A' OR ('B' OR 'C') )").pretty())
print(parser.parse("('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))").pretty())
Final output is:
expr
  expr
    expr
      expr
        variable 'a'
      AND
      expr
        expr
          bin_expr
            variable 'b'
            OR
            variable 'c'
  OR
  expr
    expr
      expr
        variable 'd'
      AND
      expr
        expr
          expr
            variable 'e'
          OR
          expr
            expr
              bin_expr
                variable 'g'
                AND
                variable 'h'
See the Lark tutorials (for example, the material on its Transformer class) for how to work with the resulting tree structure.