Regex matching either or condition with '|' not working

Time:07-26

I have some sample text as below:

MTG-2022039036 MTG
MTG-LR 3136 / 130 MTG 
MTG-LR 201260 / 9046 ASSIGN
MTG-2021063349 MTG

My desired Results:

2022039036
3136 / 130
201260 / 9046
2021063349

My regex patterns work fine individually, for example:

match1 = re.search(r'(\d+ \/ ?\d+)', ref)
num1 = match1.group(1) if match1 else None
# correctly returns 3136 / 130

match2 = re.search(r'(?:-?)(\d+)', ref)
num2 = match2.group(1) if match2 else None
# correctly returns 2021063349

But I want to combine them into one pattern, as below, that matches either one or the other, since only one case will occur in each string:

match = re.search(r'(?:-?)(\d+)|(\d+ \/ ?\d+)', ref)
num = match.group(1) if match else None
# This only returns 3136

I feel like I'm doing a very simple thing, but I have no idea why this doesn't work. I have used '|' to match either/or conditions in pandas str.extract() and had no problems there. Please advise.

CodePudding user response:

With your shown samples, please try the following regex.

^MTG-[^0-9]*(\d+(?:\s+/\s+\d+)?)


In Python 3, try the following, which uses the findall function of the re module with the re.M flag to enable multiline matching:

import re
var="""MTG-2022039036 MTG
MTG-LR 3136 / 130 MTG
MTG-LR 201260 / 9046 ASSIGN
MTG-2021063349 MTG"""

re.findall(r'^MTG-[^0-9]*(\d+(?:\s+/\s+\d+)?)', var, re.M)
['2022039036', '3136 / 130', '201260 / 9046', '2021063349']
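Since the question handles one string at a time in a variable called ref, the same pattern can also be applied per string with re.search. The following is only a sketch under that assumption; the helper name extract_number is hypothetical and not part of the answer above.

```python
import re

# Pattern from the answer above; re.M is not needed for single-line strings.
PATTERN = re.compile(r'^MTG-[^0-9]*(\d+(?:\s+/\s+\d+)?)')

def extract_number(ref):
    """Return the captured number part, or None if the string does not match."""
    match = PATTERN.search(ref)
    return match.group(1) if match else None

refs = [
    "MTG-2022039036 MTG",
    "MTG-LR 3136 / 130 MTG ",
    "MTG-LR 201260 / 9046 ASSIGN",
    "MTG-2021063349 MTG",
]
results = [extract_number(ref) for ref in refs]
print(results)
# ['2022039036', '3136 / 130', '201260 / 9046', '2021063349']
```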

CodePudding user response:

There does not seem to be an optional space after the /, but you might use a single pattern:

\b\d+(?: / ?\d+)?\b

Explanation

  • \b A word boundary to prevent a partial word match
  • \d+ Match 1+ digits
  • (?: / ?\d+)? Optionally match / then an optional space and 1+ digits
  • \b A word boundary


import re

pattern = r"\b\d+(?: / ?\d+)?\b"

s = ("MTG-2022039036 MTG\n"
            "MTG-LR 3136 / 130 MTG \n"
            "MTG-LR 201260 / 9046 ASSIGN\n"
            "MTG-2021063349 MTG")

print(re.findall(pattern, s))

Output

['2022039036', '3136 / 130', '201260 / 9046', '2021063349']
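As for why the combined pattern in the question returns only 3136: Python's re module tries the alternatives of '|' from left to right at each position and accepts the first one that succeeds. The plain (\d+) alternative therefore matches 3136 before the slash alternative is ever tried, and group(1) belongs to that first alternative only. A minimal sketch of one possible fix, assuming the + quantifiers shown here and a hypothetical helper named extract_number, puts the longer alternative first inside a single capture group:

```python
import re

ref = "MTG-LR 3136 / 130 MTG "

# Original order: the plain-digits alternative is tried first at each
# position, so it wins on "3136" and the second group never participates.
original = re.search(r'(?:-?)(\d+)|(\d+ \/ ?\d+)', ref)
print(original.group(1), original.group(2))  # 3136 None

# Longer alternative first, both inside one group, so group(1) always
# holds the wanted value regardless of which alternative matched.
COMBINED = re.compile(r'(\d+ / ?\d+|\d+)')

def extract_number(text):
    """Return the first number or 'number / number' found, else None."""
    match = COMBINED.search(text)
    return match.group(1) if match else None

print(extract_number(ref))                   # 3136 / 130
print(extract_number("MTG-2021063349 MTG"))  # 2021063349
```

Keeping a single capture group avoids having to check group(1) and group(2) separately after the match.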

Alternatively, use a capture group that matches the leading MTG- with an optional LR; re.findall then returns the group 1 value:

\bMTG-(?:LR )?(\d+(?: / \d+)?)\b

Explanation

  • \bMTG- Match literally with a leading word boundary
  • (?:LR )? Optionally match LR
  • ( Capture group 1
    • \d+(?: / \d+)? Match 1+ digits, optionally followed by space, /, space and 1+ digits
  • ) Close group 1
  • \b A word boundary
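The capture-group pattern can be run against the same sample text as before. This is a sketch that assumes the + quantifiers shown here; the variable names are illustrative only.

```python
import re

pattern = r"\bMTG-(?:LR )?(\d+(?: / \d+)?)\b"

s = ("MTG-2022039036 MTG\n"
     "MTG-LR 3136 / 130 MTG \n"
     "MTG-LR 201260 / 9046 ASSIGN\n"
     "MTG-2021063349 MTG")

# With exactly one capture group, re.findall returns the group 1 values
# rather than the full matches.
values = re.findall(pattern, s)
print(values)
# ['2022039036', '3136 / 130', '201260 / 9046', '2021063349']

# For a single string, re.search plus group(1) gives the same value.
single = re.search(pattern, "MTG-LR 3136 / 130 MTG ")
print(single.group(1) if single else None)  # 3136 / 130
```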
