Use Python to interpret conditional expression strings

I have a series of conditional expressions as strings like: "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))" which are nested to an unpredictable depth. I have no control over this data; I need to be able to work with it.

I'd like Python to be able to understand these structures so that I can manipulate them.

I've tried creating nested Python lists by breaking at the parentheses (using pyparsing or unutbu's/falsetru's parser):

>>> str1="('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
>>> var1=parse_nested(str1, r'\s*\(', r'\)\s*')
>>> var1
[["'a' AND", ["'b' OR 'c'"]], 'OR', ["'d' AND", ["'e' OR", ["'g' AND 'h'"]]]]

...But then, to be honest, I don't know how to get Python to interpret the AND/OR relationships between the objects. I feel like I'm going in completely the wrong direction with this.

My ideal output would be a data structure that maintains the relationship types between the entities in a nested way, so that (for example) it would be easy to create JSON or YAML:

OR:
- AND:
  - a
  - OR:
    - b
    - c
- AND:
  - d
  - OR:
    - e
    - AND:
      - g
      - h

CodePudding user response：

As far as I know, there is no built-in for this. I would loop over the string and check for the separators (the parentheses and the AND/OR keywords), then split on them and strip the surrounding spaces.
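A minimal sketch of that loop-based idea, under some assumptions that are mine and not from the question: terms are single-quoted with no escaped quotes inside, the keywords are upper-case AND/OR, and every group uses only one kind of operator (as in the sample string). The names TOKEN, tokenize_expr, parse_group and parse are made up for illustration.

```python
import json
import re

# Hypothetical token pattern: a parenthesis, an AND/OR keyword, or a
# single-quoted term (escaped quotes inside terms are not handled here).
TOKEN = re.compile(r"\s*(\(|\)|AND\b|OR\b|'[^']*')")

def tokenize_expr(text):
    """Split the expression into a flat list of token strings."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_group(tokens, pos=0):
    """Parse one group of operands; return (tree, position after the group)."""
    operands, operator = [], None
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "(":
            # Recurse into the nested group; pos comes back past its ")".
            node, pos = parse_group(tokens, pos + 1)
            operands.append(node)
        elif tok == ")":
            pos += 1
            break
        elif tok in ("AND", "OR"):
            if operator not in (None, tok):
                raise ValueError("mixed operators need explicit parentheses")
            operator = tok
            pos += 1
        else:
            operands.append(tok[1:-1])  # drop the surrounding quotes
            pos += 1
    if not operands:
        raise ValueError("empty expression")
    tree = operands[0] if operator is None else {operator: operands}
    return tree, pos

def parse(text):
    """Return a nested {operator: [operands]} structure for the expression."""
    tree, _ = parse_group(tokenize_expr(text))
    return tree

expr = "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
print(json.dumps(parse(expr), indent=2))
```

The result is a plain dict/list structure with the same shape as the YAML in the question, so json.dumps can serialise it directly.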

CodePudding user response：

Assuming the quoted terms are syntactically valid Python strings, it should be possible to use the tokenize module to parse the input. The only additional requirement is that all embedded escapes are doubled. Also note that the string tokens returned by the tokeniser are always quoted, so they can be evaluated as literals; ast.literal_eval is a safer choice than eval here, because before Python 3.12 an f-string also arrives as a single STRING token.
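To make the paragraph above concrete, here is a small sketch of what the tokeniser reports for a short expression. The helper name show_tokens is made up for illustration; the layout tokens (NEWLINE, NL, ENDMARKER) are filtered out so that only the meaningful ones remain.

```python
import ast
import io
import tokenize
from token import tok_name

def show_tokens(text):
    """Return (token type name, token text) pairs, skipping layout tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(text).readline
    return [
        (tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]

pairs = show_tokens("'a' AND ('b' OR 'c')")
for pair in pairs:
    print(pair)

# A STRING token still carries its quotes; literal_eval turns it into a value.
value = ast.literal_eval(pairs[0][1])
print(repr(value))
```

Quoted terms come back as STRING tokens, the AND/OR keywords as NAME tokens, and the parentheses as OP tokens, which is all a parser needs to tell them apart.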

The demo script below provides a basic implementation. The output is a list of lists, with AND/OR converted to singleton types, but if you need a different structure it should be simple enough to adapt the code accordingly:

import ast
import tokenize, token as tokens
from io import StringIO

class ParseError(Exception):
    def __init__(self, message, tokens):
        super().__init__(f'{message}: {tokens}')

# Singleton marker objects for the two operators.
AND = type('AND', (object,), {'__repr__': lambda x: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda x: 'OR'})()

NAMES = {'AND': AND, 'OR': OR, 'TRUE': True, 'FALSE': False}
OPS = {tokens.LPAR, tokens.RPAR}
# Token types that carry no meaning here and are skipped.
OTHER = {
    tokens.ENDMARKER, tokens.INDENT, tokens.DEDENT,
    tokens.NEWLINE, tokens.NL, tokens.COMMENT,
    }

def parse_expression(string):
    string = StringIO(string)
    result = current = []
    stack = [current]
    for token in tokenize.generate_tokens(string.readline):
        if (kind := token.type) == tokens.OP:
            if token.exact_type == tokens.LPAR:
                # Open a new sub-list and descend into it.
                current.append(current := [])
                stack.append(current)
            elif token.exact_type == tokens.RPAR:
                # Close the sub-list and return to its parent.
                if len(stack) < 2:
                    raise ParseError('unbalanced parenthesis', token)
                stack.pop()
                current = stack[-1]
            else:
                raise ParseError('invalid operator', token)
        elif kind == tokens.NAME:
            # Membership test, so that the falsy value for FALSE is accepted.
            if (name := token.string.upper()) in NAMES:
                current.append(NAMES[name])
            else:
                raise ParseError('invalid name', token)
        elif kind == tokens.STRING or kind == tokens.NUMBER:
            # literal_eval only accepts plain literals (no f-strings).
            current.append(ast.literal_eval(token.string))
        elif kind not in OTHER:
            raise ParseError('invalid token', token)
    return result


str1 = "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
str2 = "('(AND)' AND (42 OR True)) or ('\\'d\\'' AND ('e' OR ('g' AND 'h')))"

print(parse_expression(str1))
print(parse_expression(str2))

Output:

[['a', AND, ['b', OR, 'c']], OR, ['d', AND, ['e', OR, ['g', AND, 'h']]]]
[['(AND)', AND, [42, OR, True]], OR, ["'d'", AND, ['e', OR, ['g', AND, 'h']]]]
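If the nested mapping from the question is the goal, a list of lists in this shape (operands and operators alternating) can be folded into it afterwards. The sketch below is one possible adaptation, not part of the script above: split_on and to_tree are made-up names, the AND/OR markers are redefined only so the block runs on its own, and it assumes AND binds more tightly than OR wherever a single list mixes both.

```python
import json

# Stand-ins for the AND/OR singletons of the demo script, repeated here
# only so that this sketch is self-contained.
AND = type('AND', (object,), {'__repr__': lambda x: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda x: 'OR'})()

def split_on(items, operator):
    """Split a flat list into chunks separated by the given operator."""
    chunks = [[]]
    for item in items:
        if item is operator:
            chunks.append([])
        else:
            chunks[-1].append(item)
    return chunks

def to_tree(node):
    """Convert [operand, OP, operand, ...] lists into {'OP': [operands]}."""
    if not isinstance(node, list):
        return node  # a plain operand
    # Assumption: AND binds more tightly than OR, so split on OR first.
    for operator, label in ((OR, 'OR'), (AND, 'AND')):
        chunks = split_on(node, operator)
        if len(chunks) > 1:
            return {label: [to_tree(chunk) for chunk in chunks]}
    if len(node) != 1:
        raise ValueError(f'expected a single operand, got {node!r}')
    return to_tree(node[0])  # unwrap redundant parentheses

nested = ['a', AND, ['b', OR, 'c']]
print(json.dumps(to_tree(nested), indent=2))
```

The result contains only dicts, lists and plain values, so it serialises with json.dumps as shown; if PyYAML is installed, yaml.safe_dump should accept the same structure.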

CodePudding user response：

Lark has what you need:

https://github.com/lark-parser/lark

import lark
SHAPE_GRAMMAR = (
    r'''
    ?start: expr

    expr                 : variable | bin_expr | expr (WS BIN_OP WS expr)* | "(" WS* expr WS* ")"

    // A binary expression
    bin_expr  : variable WS BIN_OP WS variable

    // The valid binary operators
    BIN_OP :  "AND" | "OR" | "XOR" | "NOR" | "NAND"

    variable : ESCAPED_SQ_STRING

    // Can  also use lark directives like `%import common.WS`
    _STRING_INNER: /.*?/
    _STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/
    ESCAPED_SQ_STRING : "'" _STRING_ESC_INNER "'"
    WS: /[ \t\f\r\n]/ 
    ''')

parser = lark.Lark(SHAPE_GRAMMAR,  start='start', parser='earley')
print(parser.parse("'A'").pretty())
print(parser.parse("'A' OR 'c'").pretty())
print(parser.parse("'A' OR 'B' OR 'c'").pretty())
print(parser.parse("('A' OR 'B' )").pretty())
print(parser.parse("('A' OR 'B' OR 'C' )").pretty())
print(parser.parse("('A' OR ('B' OR 'C') )").pretty())

print(parser.parse("('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))").pretty())

The output of the final print call is:

expr
  expr
    expr
      expr
        variable    'a'
       
      AND
       
      expr
        expr
          bin_expr
            variable    'b'
             
            OR
             
            variable    'c'
   
  OR
   
  expr
    expr
      expr
        variable    'd'
       
      AND
       
      expr
        expr
          expr
            variable    'e'
           
          OR
           
          expr
            expr
              bin_expr
                variable    'g'
                 
                AND
                 
                variable    'h'

See the Lark documentation (for example, the pages on Transformers and Visitors) for how to work with the resulting tree structure.