regex to get all characters between certain split-words

Time:08-04

My string contains the keywords AND, OR and NOT. Each of them is always upper case and is preceded and followed by a space.

This is my test-string:

X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F

I would like to get:

  • all blocks connected with AND and separated by either OR, NOT or the beginning/end of the string. For my example I am looking for Z Z AND ZY AND ZZ as well as B AND C. This is what I came up with. It returns Z AND ZY AND ZZ instead of Z Z AND ZY AND ZZ because of the \w, but I cannot come up with a better idea:
import re

input_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
and_pairs = re.findall(r"\w+ AND .+?(?= OR | NOT )", input_string)
  • also I would need all terms preceded by a NOT, as well as all terms followed by an OR, in separate lists.

I don't want to seem lazy, but regex is driving me crazy (unintended rhyme).

CodePudding user response:

I think this should do the trick,

result:

>>> t_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
>>> [item.strip() for sublist in [x.split('NOT') for x in t_string.split('OR')] for item in sublist if 'AND' in item]
['Z Z AND ZY AND ZZ', 'B AND C']
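
One caveat with the one-liner above: it splits on the bare substrings 'OR' and 'NOT', so a term such as ORANGE or NOTE would be cut apart. A minimal sketch that stays with str.split but uses the space-padded keywords the question guarantees (the function name and_blocks is only illustrative):

```python
def and_blocks(text):
    """Return the AND-connected blocks, splitting only on ' OR ' and ' NOT '."""
    blocks = []
    for or_part in text.split(" OR "):
        for part in or_part.split(" NOT "):
            # Keep only parts that contain a space-delimited AND keyword.
            if " AND " in part:
                blocks.append(part.strip())
    return blocks


t_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
print(and_blocks(t_string))  # ['Z Z AND ZY AND ZZ', 'B AND C']
```

Because the keywords are matched together with their surrounding spaces, a term like ORANGE stays in one piece.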

CodePudding user response:

Try with re.split:

import re

input_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
# The capture group keeps the separators in the result; the "AND" check below skips them.
split_pairs = re.split("( OR | NOT )", input_string)
and_pairs = []
for and_block in split_pairs:
    if "AND" in and_block:
        and_pairs.append(and_block)
print(and_pairs)
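
If a single re.findall call is preferred, as in the question's own attempt, one possible sketch uses a negative lookahead so that a block can never run across an OR or NOT keyword. The names SEGMENT, AND_BLOCK and find_and_blocks are made up for this example, and the pattern assumes the keywords are always padded with single spaces, as the question states:

```python
import re

# One character that does not start a " OR " or " NOT " separator.
SEGMENT = r"(?:(?! OR | NOT ).)"
# A block begins at the string start or right after a separator,
# contains at least one " AND ", and stops before the next separator.
AND_BLOCK = re.compile(rf"(?:^| OR | NOT )({SEGMENT}+? AND {SEGMENT}+)")


def find_and_blocks(text):
    """Return every AND-connected block found in text."""
    return AND_BLOCK.findall(text)


input_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
print(find_and_blocks(input_string))  # ['Z Z AND ZY AND ZZ', 'B AND C']
```

The split-based loop is easier to read; the pattern is mainly useful when the blocks have to be extracted in one expression.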

CodePudding user response:

Here's how to find the AND pairs:

import re

input_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
matchRegex = r"(.*?)(?:(?: OR | NOT )(\w+))+?"

regexdata = re.findall(matchRegex, input_string)
regexdata = list(sum(regexdata,())) # flatten matches
print(regexdata)

matches = [""]
for idx, data in enumerate(regexdata): # combine separated matches
    if idx % 2 == 0:
        matches[-1] += data
    else:
        matches.append(data)
print(matches)

matches = list(filter(lambda match: "AND" in match, matches)) # 'and' pairs only
print(matches)

Output:

['X', 'Y', '', 'Z', ' Z AND ZY AND ZZ', 'A', '', 'B', ' AND C', 'E', '', 'F']
['X', 'Y', 'Z Z AND ZY AND ZZ', 'A', 'B AND C', 'E', 'F']
['Z Z AND ZY AND ZZ', 'B AND C']

What this does: it first matches with the regex, then combines the separated regex groups (the items at indexes 1 and 2 are combined, then 3 and 4, and so on). Once that is complete, it filters the list and outputs only the AND-connected parts. If you don't need that last part, you can remove it.
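
The second part of the question (terms preceded by a NOT and terms followed by an OR, in separate lists) can be approached with the same splitting idea. A rough sketch, assuming a "term" means a whole block between two keywords; classify_terms is a hypothetical helper name:

```python
import re


def classify_terms(text):
    """Return (terms preceded by NOT, terms followed by OR)."""
    # With a capture group, re.split keeps the keywords:
    # even indexes hold terms, odd indexes hold the keyword between them.
    tokens = re.split(r" (OR|NOT) ", text)
    terms, keywords = tokens[0::2], tokens[1::2]
    # keywords[i] sits between terms[i] and terms[i + 1].
    after_not = [terms[i + 1] for i, kw in enumerate(keywords) if kw == "NOT"]
    before_or = [terms[i] for i, kw in enumerate(keywords) if kw == "OR"]
    return after_not, before_or


input_string = "X OR Y OR Z Z AND ZY AND ZZ OR A OR B AND C NOT E NOT F"
after_not, before_or = classify_terms(input_string)
print(after_not)  # ['E', 'F']
print(before_or)  # ['X', 'Y', 'Z Z AND ZY AND ZZ', 'A']
```

If only single words should count as terms, the two lists can be filtered further, for example by dropping entries that contain " AND ".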
